Data

The data module of gerrytools is designed to handle the loading and processing of data. In particular, it provides methods for loading and managing data from the census. Below is a basic tutorial on what we view to be a common workflow for someone using this module.

Note

Sometimes, when calling functions that work with the us module (and data does do this), you may see the following error:

ValueError: Unexpected response (URL: ...): Sorry, the system is currently
undergoing maintenance or is busy. Please try again later.

This is due to a Census API issue and cannot be fixed from the Python side of things. However, re-running the code generally fixes the issue.
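Since re-running usually clears the error, one option is to automate the retry. The sketch below is only an illustration: retry_on_busy and flaky_fetch are hypothetical names that are not part of gerrytools, and the stand-in function simulates a busy API rather than calling the Census Bureau. Note that it retries on any ValueError, which may be broader than you want.

```python
import time


def retry_on_busy(fetch, attempts=3, delay=2.0):
    """Call fetch() and retry if it raises ValueError (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ValueError:
            # Re-raise the error once all attempts are used up.
            if attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)


# Stand-in for a Census call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("Unexpected response: the system is currently busy.")
    return "data"


result = retry_on_busy(flaky_fetch, attempts=3, delay=0)
print(result, calls["count"])
```

In a notebook, the same helper could wrap any of the loading calls in this tutorial by passing a zero-argument lambda that makes the call.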

For this tutorial, we will assume that the reader is working in a Jupyter notebook. All of the packages needed to run this tutorial can be found in the tutorial_requirements.txt file.

Census

There are two different methods that data provides for loading census data: census20() and census10(). As the names suggest, the former is linked to the US census data collected in 2020, and the latter is linked to the US census data collected in 2010. There are significant differences between the two methods, so please be sure to refer to the documentation.

For the purposes of this tutorial, we will be using the 2020 census data. The first thing to do is to load all of the necessary packages:

from gerrytools.data import *
import geopandas as gpd
import pandas as pd
import us

Next, we would like to load the census data for the state of Massachusetts. When we load this data, we should be aware that there are 5 different tables available on the Census Bureau’s API for retrieving the 2020 Decennial Census PL 94-171 data at the stated level of geography. These tables are:

P1: Race
P2: Hispanic or Latino, and Not Hispanic or Latino by Race
P3: Race for the Population 18 Years and Over (Race by VAP)
P4: Hispanic or Latino, and Not Hispanic or Latino by Race for the Population 18 Years and Over
P5: Group Quarters Population by Group Quarters Type

df = census20(
    us.states.MA,
    table="P3",
    columns={},
    geometry="tract",
)

df[["GEOID20", "VAP20", "WHITEVAP20", "BLACKVAP20", "ASIANVAP20", "OTHVAP20"]].head()

In Jupyter, this will display the following table:

	GEOID20	VAP20	WHITEVAP20	BLACKVAP20	ASIANVAP20	OTHVAP20
0	25001012601	2657	1868	153	122	172
1	25001012602	4564	2444	517	147	547
2	25001012700	4059	3445	119	49	144
3	25001012800	3464	2971	86	42	84
4	25001012900	3568	3011	101	47	103

Of course, anyone who is familiar with the way that the census data is organized will realize that the column names here are not the same as the ones that the Census Bureau uses. This is because the census20 method has its own mapping from the Census Bureau’s column names to ones that are a bit easier to understand. If you would like to see the mapping, you can use the variables() method; so, for the “P3” table, we would call:

variables("P3")

Which outputs the following:

{'P3_001N': 'VAP20',
'P3_003N': 'WHITEVAP20',
'P3_004N': 'BLACKVAP20',
'P3_005N': 'AMINVAP20',
'P3_006N': 'ASIANVAP20',
'P3_007N': 'NHPIVAP20',
'P3_008N': 'OTHVAP20',
'P3_011N': 'WHITEBLACKVAP20',
'P3_012N': 'WHITEAMINVAP20',
'P3_013N': 'WHITEASIANVAP20',
'P3_014N': 'WHITENHPIVAP20',
'P3_015N': 'WHITEOTHVAP20',
'P3_016N': 'BLACKAMINVAP20',
'P3_017N': 'BLACKASIANVAP20',
'P3_018N': 'BLACKNHPIVAP20',
'P3_019N': 'BLACKOTHVAP20',
'P3_020N': 'AMINASIANVAP20',
'P3_021N': 'AMINNHPIVAP20',
'P3_022N': 'AMINOTHVAP20',
'P3_023N': 'ASIANNHPIVAP20',
'P3_024N': 'ASIANOTHVAP20',
'P3_025N': 'NHPIOTHVAP20',
'P3_027N': 'WHITEBLACKAMINVAP20',
'P3_028N': 'WHITEBLACKASIANVAP20',
'P3_029N': 'WHITEBLACKNHPIVAP20',
'P3_030N': 'WHITEBLACKOTHVAP20',
'P3_031N': 'WHITEAMINASIANVAP20',
'P3_032N': 'WHITEAMINNHPIVAP20',
'P3_033N': 'WHITEAMINOTHVAP20',
'P3_034N': 'WHITEASIANNHPIVAP20',
'P3_035N': 'WHITEASIANOTHVAP20',
'P3_036N': 'WHITENHPIOTHVAP20',
'P3_037N': 'BLACKAMINASIANVAP20',
'P3_038N': 'BLACKAMINNHPIVAP20',
'P3_039N': 'BLACKAMINOTHVAP20',
'P3_040N': 'BLACKASIANNHPIVAP20',
'P3_041N': 'BLACKASIANOTHVAP20',
'P3_042N': 'BLACKNHPIOTHVAP20',
'P3_043N': 'AMINASIANNHPIVAP20',
'P3_044N': 'AMINASIANOTHVAP20',
'P3_045N': 'AMINNHPIOTHVAP20',
'P3_046N': 'ASIANNHPIOTHVAP20',
'P3_048N': 'WHITEBLACKAMINASIANVAP20',
'P3_049N': 'WHITEBLACKAMINNHPIVAP20',
'P3_050N': 'WHITEBLACKAMINOTHVAP20',
'P3_051N': 'WHITEBLACKASIANNHPIVAP20',
'P3_052N': 'WHITEBLACKASIANOTHVAP20',
'P3_053N': 'WHITEBLACKNHPIOTHVAP20',
'P3_054N': 'WHITEAMINASIANNHPIVAP20',
'P3_055N': 'WHITEAMINASIANOTHVAP20',
'P3_056N': 'WHITEAMINNHPIOTHVAP20',
'P3_057N': 'WHITEASIANNHPIOTHVAP20',
'P3_058N': 'BLACKAMINASIANNHPIVAP20',
'P3_059N': 'BLACKAMINASIANOTHVAP20',
'P3_060N': 'BLACKAMINNHPIOTHVAP20',
'P3_061N': 'BLACKASIANNHPIOTHVAP20',
'P3_062N': 'AMINASIANNHPIOTHVAP20',
'P3_064N': 'WHITEBLACKAMINASIANNHPIVAP20',
'P3_065N': 'WHITEBLACKAMINASIANOTHVAP20',
'P3_066N': 'WHITEBLACKAMINNHPIOTHVAP20',
'P3_067N': 'WHITEBLACKASIANNHPIOTHVAP20',
'P3_068N': 'WHITEAMINASIANNHPIOTHVAP20',
'P3_069N': 'BLACKAMINASIANNHPIOTHVAP20',
'P3_071N': 'WHITEBLACKAMINASIANNHPIOTHVAP20'}
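Going the other way, from a readable column name back to the Census Bureau's variable code, only requires inverting this dictionary. The sketch below uses a short excerpt of the mapping printed above so that it runs on its own; p3_excerpt and name_to_code are illustrative names, and in practice you would start from the full dictionary returned by variables("P3").

```python
# A small excerpt of the P3 mapping shown above.
p3_excerpt = {
    "P3_001N": "VAP20",
    "P3_003N": "WHITEVAP20",
    "P3_004N": "BLACKVAP20",
    "P3_006N": "ASIANVAP20",
}

# Invert the mapping: readable name -> Census Bureau variable code.
name_to_code = {name: code for code, name in p3_excerpt.items()}

print(name_to_code["BLACKVAP20"])
```

This inversion is safe here because every readable name in the mapping is unique.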

For more information on the variables that are available in each of these tables, please refer to the census website.

ACS5

This method loads the 5-year American Community Survey data that the Census Bureau uses for its 5-year population estimates of the United States.

Warning

The ACS5 data uses geometries from the 2010 census, and not the 2020 census.

acs5_df = acs5(
    us.states.MA,
    geometry="block group", # data granularity, either "tract" (default) or "block group"
    year=2019,
)
acs5_df[["BLOCKGROUP10", "TOTPOP19", "WHITE19", "BLACK19", "ASIAN19", "OTH19"]].head()

This will print the following table:

i	BLOCKGROUP10	TOTPOP19	WHITE19	BLACK19	ASIAN19	OTH19
0	250173173012	571	340	15	137	0
1	250173531012	1270	660	311	93	0
2	250173222002	2605	2315	61	96	21
3	250251101035	1655	1077	242	82	0
4	250251101032	659	158	225	0	0

Estimating CVAP

AL Block Group Shapefile

Sometimes, we might want to estimate the citizen voting age population (CVAP) for a particular demographic group. This is especially true when we are working with new geometries for a state, as tends to happen after the decennial census, and we would like to make projections on them based on our previous knowledge of the state's demographics. In our case, we will be using the estimate_cvap10() method to estimate the CVAP for particular geometries in the year 2020 using information from the previous ACS.

The estimate_cvap10() method wraps the cvap() and acs5() functions to help users pull forward CVAP estimates from 2019 (on 2010 geometries) to estimates for 2020 (on 2020 geometries). To use this, one must supply a base geodataframe with the 2020 geometries on which they want CVAP estimates. Additionally, users must specify the demographic groups whose CVAP statistics are to be estimated. For each group, users specify a triple \((X, Y, Z)\) where \(X\) is the old CVAP column for that group, \(Y\) is the old VAP column for that group, and \(Z\) is the new VAP column for that group, which must be an existing column on base. Then, the estimated new CVAP for that group is constructed by computing \(X / Y \cdot Z\) for each new geometry.

Let’s start with grabbing the geometries for Alabama and looking at the acs5() and cvap() data:

base = gpd.read_file("al_bg")
acs5_cvap19 = acs5(us.states.AL, year=2019)
cvap_cvap19 = cvap(us.states.AL, year=2019)

Tips for picking \(X\), \(Y\), and \(Z\)

Your \(X\) should be any CVAP column returned by either acs5() or cvap(), so anything generated by:

print([col for col in pd.concat([acs5_cvap19, cvap_cvap19]) if "CVAP" in col])

Which, in our case, would be:

['WHITECVAP19', 'BLACKCVAP19', 'AMINCVAP19', 'ASIANCVAP19', 'NHPICVAP19', 'OTHCVAP19',
'2MORECVAP19', 'NHWHITECVAP19', 'HCVAP19', 'CVAP19', 'POCVAP19', 'CVAP19e', 'NHCVAP19',
'NHCVAP19e', 'NHAMINCVAP19', 'NHAMINCVAP19e', 'NHASIANCVAP19', 'NHASIANCVAP19e',
'NHBLACKCVAP19', 'NHBLACKCVAP19e', 'NHNHPICVAP19', 'NHNHPICVAP19e', 'NHWHITECVAP19e',
'NHWHITEAMINCVAP19', 'NHWHITEAMINCVAP19e', 'NHWHITEASIANCVAP19', 'NHWHITEASIANCVAP19e',
'NHWHITEBLACKCVAP19', 'NHWHITEBLACKCVAP19e', 'NHBLACKAMINCVAP19', 'NHBLACKAMINCVAP19e',
'NHOTHCVAP19', 'NHOTHCVAP19e', 'HCVAP19e', 'POCCVAP19']

Note that the acs5() method returns columns like BLACKCVAP19 or HCVAP19 (Black-alone CVAP and Hispanic CVAP, respectively), while the cvap() method returns columns like NHBLACKCVAP19 (Non-Hispanic Black-alone CVAP). There are also columns like NHWHITEBLACKCVAP19, which refer to all Non-Hispanic citizens of voting age who self-identified as Black and White. However, since our choice of \(Y\) is restricted to single-race or ethnicity columns, we recommend only estimating CVAP for single-race or ethnicity columns, like BLACKCVAP19, HCVAP19, or NHBLACKCVAP19.

Lastly, one should choose \(Z\) to match one’s choice for \(Y\) (say, BVAP20 to match BLACKVAP19). However, in some cases it is reasonable to choose a \(Z\) that is a close but imperfect match. For example, setting \((X, Y, Z) =\) (BLACKCVAP19, BLACKVAP19, APBVAP20) (where \(Z =\) APBVAP20 refers to all people of voting age who selected Black alone or in combination with other Census-defined races) would allow one to estimate the 2020 CVAP population of people who selected Black alone or in combination with other races.

One final note: there are some instances in which, due to small Census reporting discrepancies, the acs5() and the cvap() methods disagree on CVAP19 estimates (this might happen for total CVAP19 or HCVAP19, for example). In these cases we default to the acs5() numbers.

Now we may construct the estimated CVAP for 2020:

estimates = estimate_cvap10(
    base,
    us.states.AL,

    # Group order goes (Old CVAP, Old VAP, new VAP)
    groups=[
        ("WHITECVAP19", "WHITEVAP19", "WVAP20"),
        ("BLACKCVAP19", "BLACKVAP19", "BVAP20"),
    ],
    ceiling=1,
    zfill=0.1,
    geometry10="tract"
)

The ceiling parameter sets the maximum allowed CVAP / VAP ratio. Set to 1, this means that if a geometry ever has more CVAP19 than VAP19, we “cap” the CVAP20 estimate at 100% of the VAP20. The zfill parameter tells us what to do when there is 0 CVAP19 in a geometry. Set to 0.1, this will estimate that 10% of the VAP20 is CVAP.
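To make the roles of ceiling and zfill concrete, here is a simplified sketch of the \(X / Y \cdot Z\) calculation for a single geometry. It is not the gerrytools implementation: estimate_new_cvap is a hypothetical helper, it ignores the step that moves data from 2010 to 2020 geometries, and the handling of a geometry with zero old VAP is an assumption.

```python
def estimate_new_cvap(old_cvap, old_vap, new_vap, ceiling=1.0, zfill=0.1):
    """Simplified per-geometry estimate: (X / Y) * Z with a cap and a fallback."""
    if old_cvap == 0:
        # No old CVAP in this geometry: fall back to the zfill ratio.
        ratio = zfill
    elif old_vap == 0:
        # Assumption: CVAP with no VAP counts as exceeding the cap.
        ratio = ceiling
    else:
        # Ordinary case: the old CVAP / VAP ratio, capped at the ceiling.
        ratio = min(old_cvap / old_vap, ceiling)
    return ratio * new_vap


# Made-up numbers for three geometries: (old CVAP, old VAP, new VAP).
typical = estimate_new_cvap(75, 100, 120)
capped = estimate_new_cvap(110, 100, 120)
filled = estimate_new_cvap(0, 100, 120)
print(typical, capped, filled)
```

The three calls exercise the ordinary ratio, the capped ratio, and the zfill fallback, in that order.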

Now we can print our results:

print(f"AL BLACKCVAP20: {estimates.BLACKCVAP20_EST.sum()}")
print(f"AL BLACKCVAP19: {estimates.BLACKCVAP19.sum()}")

Which returns to us:

AL BLACKCVAP20: 970120.3645540088
AL BLACKCVAP19: 970239

We can see that our estimate for the Black-alone citizen voting age population in Alabama in 2020 is about 970,120, down slightly from 970,239 in 2019.

We can also estimate Black CVAP in Alabama using APBVAP20, which counts Alabamians of voting age who identified as Black alone or in combination with other races. This bumps up the estimate to around 1,007,363, as we can see below:

estimates = estimate_cvap10(
    base,
    us.states.AL,

    # Changing the new VAP column from BVAP20 -> APBVAP20
    groups=[
        ("BLACKCVAP19", "BLACKVAP19", "APBVAP20"),
    ],
    ceiling=1,
    zfill=0.1,
    geometry10="tract"
)

print(f"AL APBCVAP20 estimate: {estimates.BLACKCVAP20_EST.sum()}")

Which returns:

AL APBCVAP20 estimate: 1007362.5586538106