QA Sample: AK_precincts¶
Below are the steps involved in performing automated data quality checks on the AK_precincts shapefile from mggg-states.
This notebook does the following:
- Collects the following Alaska election data from
mggg-states, MEDSL, and Wikipedia:- 2016 United States presidential election
- 2016 United States Senate elections
- 2016 United States House of Representatives elections
- 2018 United States Senate elections
- 2018 United States House of Representatives elections
- Wrangles the datasets so that they can be compared against each other.
- Checks if column names in
AK_precinctsdiverge from the MGGG naming convention (as outlined innaming_convention.json). - Compares the vote counts in
AK_precinctswith those in the MEDSL and Wikipedia datasets. - Prints the aggregated votes in
AK_precinctsfor ease of spot checking against Secretary of State websites.
Note: the automated checks are not completely exhaustive and further manual checks are required. Please see Next Steps at the bottom of this notebook for continuing the data QA process.
Automation Check Timestamp: 02:00 pm ET, 14 August 2020
Step 0. Setup¶
[1]:
# Install useful Python packages
!conda install fiona shapely pyproj rtree && pip3 install wheel
!pip3 install pandas
!pip3 install geopandas
!pip3 install wikipedia
!pip3 install git+https://github.com/mggg/gdutils.git
[2]:
# Import useful Python modules
import numpy as np
import pandas as pd
import geopandas as gpd
import json # for parsing a json file
import wikipedia # unofficial Wikipedia package (wrapper of MediaWiki API)
import os # for ensuring file traversal works regardless of operating system
import gdutils.datamine as dm # data-mining module from gdutils
import gdutils.dataqa as dq # data QA module from gdutils
import gdutils.extract as et # table extraction module from gdutils
from typing import Any, List, Tuple, Dict, Hashable, Union, NoReturn
Step 1. Collect data¶
Step 1.1. Collect AK_precincts data from the mggg-states’ AK-shapefiles GitHub repository.
[3]:
# Clone 'AK-shapefiles' repository into 'output/mggg/'
dm.clone_gh_repos(account='mggg-states',
account_type='orgs',
repos=['AK-shapefiles'],
outpath=os.path.join('output', 'mggg'))
[4]:
# Extract a GeoDataFrame from 'AK-shapefiles/AK_precincts.zip'
mggg_gdf = et.read_file(os.path.join('output', 'mggg', 'AK-shapefiles',
'AK_precincts.zip')).extract()
mggg_gdf.head() # renders first 5 rows of the extracted gf
[4]:
| ID | AREA | DISTRICT | NAME | POPULATION | USH14D | USH14R | USH14L | PRES16D | PRES16R | ... | VAP | WVAP | BVAP | AMINVAP | ASIANVAP | NHPIVAP | OTHERVAP | 2MOREVAP | 2MORE | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 266.0 | 1.553231 | 01-446 | 01-446 AURORA | 2995.0 | 336 | 457 | 91 | 295 | 434 | ... | 2315 | 1740 | 92 | 237 | 78 | 2 | 48 | 118 | 229 | POLYGON ((294705.801 1667364.692, 294704.326 1... |
| 1 | 329.0 | 0.578508 | 01-455 | 01-455 FAIRBANKS NO. 1 | 659.0 | 72 | 106 | 16 | 65 | 113 | ... | 545 | 416 | 16 | 62 | 12 | 0 | 10 | 29 | 36 | POLYGON ((297483.985 1669129.153, 297485.509 1... |
| 2 | 267.0 | 0.469371 | 01-465 | 01-465 FAIRBANKS NO. 2 | 1542.0 | 108 | 166 | 44 | 120 | 157 | ... | 1312 | 853 | 85 | 252 | 37 | 1 | 20 | 64 | 100 | POLYGON ((297800.944 1668172.899, 297823.138 1... |
| 3 | 268.0 | 0.401854 | 01-470 | 01-470 FAIRBANKS NO. 3 | 1872.0 | 216 | 234 | 54 | 205 | 218 | ... | 1531 | 1047 | 97 | 232 | 36 | 3 | 30 | 86 | 138 | POLYGON ((296902.053 1668075.791, 296915.198 1... |
| 4 | 269.0 | 0.561294 | 01-475 | 01-475 FAIRBANKS NO. 4 | 1143.0 | 123 | 118 | 40 | 86 | 149 | ... | 883 | 622 | 28 | 153 | 27 | 0 | 14 | 39 | 106 | POLYGON ((296178.482 1666807.889, 296101.344 1... |
5 rows × 42 columns
Step 1.2. Gather MEDSL data for comparison purposes.
[5]:
# Print available MEDSL data to select applicable datasets
print('{:27} : {}'.format('Repo Name', 'Repo URL'))
print('------------------------------------------------------------------')
for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
print("{:27} : {}".format(repo, url))
Repo Name : Repo URL
------------------------------------------------------------------
elections : https://github.com/MEDSL/elections.git
official-precinct-returns : https://github.com/MEDSL/official-precinct-returns.git
primaries : https://github.com/MEDSL/primaries.git
data-management : https://github.com/MEDSL/data-management.git
election-scrapers : https://github.com/MEDSL/election-scrapers.git
medslcleaner : https://github.com/MEDSL/medslcleaner.git
precinct-shapefiles : https://github.com/MEDSL/precinct-shapefiles.git
documentation : https://github.com/MEDSL/documentation.git
elections-performance-index : https://github.com/MEDSL/elections-performance-index.git
constituency-returns : https://github.com/MEDSL/constituency-returns.git
state-returns : https://github.com/MEDSL/state-returns.git
county-returns : https://github.com/MEDSL/county-returns.git
2018-elections-unoffical : https://github.com/MEDSL/2018-elections-unoffical.git
2018-elections-official : https://github.com/MEDSL/2018-elections-official.git
Medium : https://github.com/MEDSL/Medium.git
ggmedsl : https://github.com/MEDSL/ggmedsl.git
resources : https://github.com/MEDSL/resources.git
election-data : https://github.com/MEDSL/election-data.git
2019-elections-unofficial : https://github.com/MEDSL/2019-elections-unofficial.git
primary-precinct-returns : https://github.com/MEDSL/primary-precinct-returns.git
2020-primaries-unofficial : https://github.com/MEDSL/2020-primaries-unofficial.git
2020-Primaries-Official : https://github.com/MEDSL/2020-Primaries-Official.git
healthy_elections : https://github.com/MEDSL/healthy_elections.git
[6]:
# Clone applicable MEDSL datasets
medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
'2018-elections-official'] # constituency-level 2018 election results
# this will take some time to complete
dm.clone_gh_repos(account='MEDSL',
account_type='orgs',
repos=medsl_repos,
outpath=os.path.join('output', 'medsl'))
[7]:
# Find Alaska-specific MEDSL data
dm.list_files_of_type('.zip', os.path.join('output', 'medsl'))
[7]:
['output/medsl/2018-elections-official/precinct_2018.zip',
'output/medsl/official-precinct-returns/2016-precinct-local/2016-precinct-local.zip',
'output/medsl/official-precinct-returns/source/2016-tn-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ny-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ut-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-wv-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ia-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-vt-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ma-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ct-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-wi-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-mt-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ms-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-va-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-sd-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-oh-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-sc-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nm-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nj-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-hi-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ga-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-az-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ca-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-il-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-dc-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ok-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-pa-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nv-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ks-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-mo-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-tx-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nh-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-de-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-wa-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-mi-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-mn-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-or-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-in-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-wy-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-fl-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-id-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-la-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ar-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-md-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ne-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ri-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-me-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-co-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ky-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-al-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nc-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-nd-precinct.zip',
'output/medsl/official-precinct-returns/source/2016-ak-precinct.zip',
'output/medsl/official-precinct-returns/2016-precinct-house/2016-precinct-house.zip',
'output/medsl/official-precinct-returns/2016-precinct-state/2016-precinct-state.zip',
'output/medsl/official-precinct-returns/2016-precinct-president/2016-precinct-president.zip',
'output/medsl/official-precinct-returns/2016-precinct-senate/2016-precinct-senate.zip']
[8]:
# Extract DataFrames from:
# 'official-precinct-returns/2016-precinct-president/2016-precinct-president.zip',
# 'output/medsl/official-precinct-returns/2016-precinct-senate/2016-precinct-senate.zip',
# 'offical-precinct-returns/2016-precinct-house/2016-precinct-house.zip', and
# 2018-elections-offical/precinct_2018.zip'
medsl_16_path = os.path.join('output', 'medsl', 'official-precinct-returns')
medsl_18_path = os.path.join('output', 'medsl', '2018-elections-official')
[9]:
# 2016-precinct-president.zip
medsl_pres16_gdf = et.read_file(os.path.join(medsl_16_path, '2016-precinct-president',
'2016-precinct-president.zip')).extract()
# convert GeoDataFrame to a DataFrame (since don't need 'geometry')
medsl_pres16_df = pd.DataFrame(medsl_pres16_gdf.drop(columns=['geometry']))
medsl_pres16_df.head()
[9]:
| year | stage | special | state | state_postal | state_fips | state_icpsr | county_name | county_fips | county_ansi | ... | candidate_middle | candidate_full | candidate_suffix | candidate_nickname | candidate_fec | candidate_fec_name | candidate_google | candidate_govtrack | candidate_icpsr | candidate_maplight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | P00003392 | CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE | NaN | NaN | NaN | NaN |
| 1 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | P60012234 | JOHNSON, JOHN FITZGERALD MR. | NaN | NaN | NaN | NaN |
| 2 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | P20003984 | STEIN, JILL | NaN | NaN | NaN | NaN |
| 3 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | P80001571 | TRUMP, DONALD J. / MICHAEL R. PENCE | NaN | NaN | NaN | NaN |
| 4 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
[10]:
# 2016-precinct-senate.zip
medsl_sen16_gdf = et.read_file(os.path.join(medsl_16_path, '2016-precinct-senate',
'2016-precinct-senate.zip')).extract()
medsl_sen16_df = pd.DataFrame(medsl_sen16_gdf.drop(columns=['geometry']))
medsl_sen16_df.head()
[10]:
| year | stage | special | state | state_postal | state_fips | state_icpsr | county_name | county_fips | county_ansi | ... | candidate_middle | candidate_full | candidate_suffix | candidate_nickname | candidate_fec | candidate_fec_name | candidate_google | candidate_govtrack | candidate_icpsr | candidate_maplight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | S6AL00302 | CRUMPTON, RONALD (RON) STEVEN | NaN | NaN | NaN | NaN |
| 4 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | C. | Richard C. Shelby | NaN | NaN | S6AL00013 | NaN | kg:/m/020yj1 | 300089.0 | 14659.0 | 608.0 |
5 rows × 37 columns
[11]:
# 2016-precinct-house.zip
medsl_ush16_gdf = et.read_file(os.path.join(medsl_16_path, '2016-precinct-house',
'2016-precinct-house.zip')).extract()
medsl_ush16_df = pd.DataFrame(medsl_ush16_gdf.drop(columns=['geometry']))
medsl_ush16_df.head()
[11]:
| year | stage | special | state | state_postal | state_fips | state_icpsr | county_name | county_fips | county_ansi | ... | candidate_middle | candidate_full | candidate_suffix | candidate_nickname | candidate_fec | candidate_fec_name | candidate_google | candidate_govtrack | candidate_icpsr | candidate_maplight | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | NaN | NaN | NaN | H6AL02167 | MATHIS, NATHAN | NaN | NaN | NaN | NaN |
| 4 | 2016 | gen | False | Alabama | AL | 1 | 41 | Autauga County | 1001.0 | 161526.0 | ... | NaN | Martha Roby | NaN | NaN | H0AL02087 | NaN | kg:/m/0drx5mb | 412394.0 | 21192.0 | 1408.0 |
5 rows × 37 columns
[12]:
# precinct_2018.zip
medsl_18_gdf = et.read_file(os.path.join(medsl_18_path, 'precinct_2018.zip')).extract()
medsl_18_df = pd.DataFrame(medsl_18_gdf.drop(columns=['geometry']))
medsl_18_df.head()
[12]:
| precinct | office | party | mode | votes | jurisdiction | county | candidate | district | dataverse | year | stage | state | special | writein | state_po | state_fips | state_cen | state_ic | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 JONES COMMUNITY CTR | Straight Party | democratic | election day | 98 | Autauga | Autauga | Alabama Democratic Party | NaN | all | 2018 | gen | Alabama | False | False | AL | 1 | 63 | 41 |
| 1 | 10 JONES COMMUNITY CTR | Straight Party | republican | election day | 110 | Autauga | Autauga | Alabama Republican Party | NaN | all | 2018 | gen | Alabama | False | False | AL | 1 | 63 | 41 |
| 2 | 10 JONES COMMUNITY CTR | US House | democratic | election day | 118 | Autauga | Autauga | Tabitha Isner | 2 | house | 2018 | gen | Alabama | False | False | AL | 1 | 63 | 41 |
| 3 | 10 JONES COMMUNITY CTR | US House | republican | election day | 153 | Autauga | Autauga | Martha Roby | 2 | house | 2018 | gen | Alabama | False | False | AL | 1 | 63 | 41 |
| 4 | 10 JONES COMMUNITY CTR | US House | NaN | election day | 0 | Autauga | Autauga | NaN | 2 | house | 2018 | gen | Alabama | False | True | AL | 1 | 63 | 41 |
Step 1.3. Gather Wikipedia data for comparison purposes.
Note: The Wikipedia dataset was compiled from tables scraped from Wikipedia pages. You can review the scraping and wrangling notebook here <https://github.com/mggg/mggg-states-qa/blob/main/src/wikipedia-election-data-mining.ipynb>_.
[13]:
wiki_gdf = et.read_file('wiki_states.csv').extract()
wiki_df = pd.DataFrame(wiki_gdf.drop(columns=['geometry']))
wiki_df.head()
[13]:
| STATE | PRES16D | PRES16G | PRES16L | PRES16R | SEN16D | SEN16G | SEN16L | SEN16R | USH16D | ... | USH17L | USH17R | SEN18D | SEN18G | SEN18L | SEN18R | USH18D | USH18G | USH18L | USH18R | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ALABAMA | 729547.0 | 9391.0 | 44467.0 | 1318255.0 | 748709.0 | NaN | NaN | 1335104.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | ALASKA | 116454.0 | 5735.0 | 18725.0 | 163387.0 | 36200.0 | NaN | 90825.0 | 138149.0 | 111019.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 149779.0 |
| 2 | ARIZONA | 1161167.0 | 34345.0 | 106327.0 | 1252401.0 | 1031245.0 | 138634.0 | NaN | 1359267.0 | NaN | ... | NaN | NaN | 1191100.0 | 57442.0 | NaN | 1135200.0 | NaN | NaN | NaN | NaN |
| 3 | ARKANSAS | 380494.0 | 9473.0 | 29829.0 | 684872.0 | 400602.0 | NaN | 43866.0 | 661984.0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | CALIFORNIA | 8753788.0 | 278657.0 | 478500.0 | 4483810.0 | 7542753.0 | NaN | NaN | NaN | NaN | ... | NaN | NaN | 6019422.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
Step 2. Wrangle data¶
[14]:
state = 'Alaska'
2.1. Wrangle MEDSL data.
[15]:
def pivot_medsl_df(df: pd.DataFrame) -> pd.DataFrame:
"""
Given a MEDSL DataFrame, return a pivoted DataFrame where the columns
are elections and parties and the values are the votes for every precinct.
"""
medsl_pvt = df.pivot_table(index='precinct',
columns=['office', 'party'],
values='votes')
medsl_pvt.columns = [' '.join(col).strip() for col in medsl_pvt.columns.values]
return medsl_pvt
2.1.1. Extract and pivot Alaska-specific data from each MEDSL DataFrame.
[16]:
medsl_pres16_gdf = et.ExtractTable(medsl_pres16_df, column='state', value=state).extract()
medsl_pres16_df = pd.DataFrame(medsl_pres16_gdf.drop(columns=['geometry']))
medsl_pres16_df = pivot_medsl_df(medsl_pres16_df)
medsl_pres16_df.head()
[16]:
| US President constitution | US President democratic | US President green | US President libertarian | US President new alliance | US President republican | |
|---|---|---|---|---|---|---|
| precinct | ||||||
| 01-446 Aurora | 5.0 | 295.0 | 23.0 | 75.0 | 7.0 | 434.0 |
| 01-455 Fairbanks No. 1 | 2.0 | 65.0 | 2.0 | 8.0 | 1.0 | 113.0 |
| 01-465 Fairbanks No. 2 | 8.0 | 120.0 | 11.0 | 21.0 | 1.0 | 157.0 |
| 01-470 Fairbanks No. 3 | 6.0 | 205.0 | 12.0 | 35.0 | 2.0 | 218.0 |
| 01-475 Fairbanks No. 4 | 4.0 | 86.0 | 6.0 | 16.0 | 2.0 | 149.0 |
[17]:
medsl_sen16_gdf = et.ExtractTable(medsl_sen16_df, column='state', value=state).extract()
medsl_sen16_df = pd.DataFrame(medsl_sen16_gdf.drop(columns=['geometry']))
medsl_sen16_df = pivot_medsl_df(medsl_sen16_df)
medsl_sen16_df.head()
[17]:
| US Senate democratic | US Senate libertarian | US Senate new alliance | US Senate republican | |
|---|---|---|---|---|
| precinct | ||||
| 01-446 Aurora | 70.0 | 261.0 | 39.333333 | 409.0 |
| 01-455 Fairbanks No. 1 | 20.0 | 74.0 | 6.333333 | 85.0 |
| 01-465 Fairbanks No. 2 | 45.0 | 108.0 | 12.000000 | 129.0 |
| 01-470 Fairbanks No. 3 | 56.0 | 136.0 | 26.000000 | 225.0 |
| 01-475 Fairbanks No. 4 | 28.0 | 70.0 | 13.333333 | 137.0 |
[18]:
medsl_ush16_gdf = et.ExtractTable(medsl_ush16_df, column='state', value=state).extract()
medsl_ush16_df = pd.DataFrame(medsl_ush16_gdf.drop(columns=['geometry']))
medsl_ush16_df = pivot_medsl_df(medsl_ush16_df)
medsl_ush16_df.head()
[18]:
| US House democratic | US House libertarian | US House new alliance | US House republican | |
|---|---|---|---|---|
| precinct | ||||
| 01-446 Aurora | 334.0 | 109.0 | 20.0 | 389.0 |
| 01-455 Fairbanks No. 1 | 62.0 | 23.0 | 4.0 | 106.0 |
| 01-465 Fairbanks No. 2 | 120.0 | 51.0 | 7.0 | 142.0 |
| 01-470 Fairbanks No. 3 | 191.0 | 48.0 | 13.0 | 236.0 |
| 01-475 Fairbanks No. 4 | 104.0 | 27.0 | 10.0 | 133.0 |
[19]:
medsl_18_gdf = et.ExtractTable(medsl_18_df, column='state', value=state).extract()
medsl_18_df = pd.DataFrame(medsl_18_gdf.drop(columns=['geometry']))
medsl_18_df = pivot_medsl_df(medsl_18_df)
medsl_18_df.head()
[19]:
| Governor democratic | Governor independent | Governor libertarian | Governor republican | State House democratic | State House independent | State House libertarian | State House republican | State Senate democratic | State Senate independent | State Senate republican | US House democratic | US House republican | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| precinct | |||||||||||||
| 01-446 Aurora | 399.0 | 11.0 | 29.0 | 427.0 | 388.0 | NaN | NaN | 464.0 | 462.0 | NaN | 392.0 | 429.0 | 419.0 |
| 01-455 Fairbanks No. 1 | 65.0 | 3.0 | 3.0 | 84.0 | 60.0 | NaN | NaN | 93.0 | 75.0 | NaN | 79.0 | 66.0 | 88.0 |
| 01-465 Fairbanks No. 2 | 124.0 | 4.0 | 13.0 | 139.0 | 139.0 | NaN | NaN | 137.0 | 151.0 | NaN | 127.0 | 142.0 | 136.0 |
| 01-470 Fairbanks No. 3 | 253.0 | 9.0 | 16.0 | 203.0 | 248.0 | NaN | NaN | 237.0 | 277.0 | NaN | 205.0 | 265.0 | 223.0 |
| 01-475 Fairbanks No. 4 | 121.0 | 5.0 | 9.0 | 110.0 | 134.0 | NaN | NaN | 109.0 | 136.0 | NaN | 106.0 | 136.0 | 104.0 |
2.1.2. Combine MEDSL 2016 election dataframes into one.
[20]:
medsl_16_df = pd.concat([medsl_pres16_df, medsl_sen16_df, medsl_ush16_df], axis=1)
medsl_16_df.head()
[20]:
| US President constitution | US President democratic | US President green | US President libertarian | US President new alliance | US President republican | US Senate democratic | US Senate libertarian | US Senate new alliance | US Senate republican | US House democratic | US House libertarian | US House new alliance | US House republican | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| precinct | ||||||||||||||
| 01-446 Aurora | 5.0 | 295.0 | 23.0 | 75.0 | 7.0 | 434.0 | 70.0 | 261.0 | 39.333333 | 409.0 | 334.0 | 109.0 | 20.0 | 389.0 |
| 01-455 Fairbanks No. 1 | 2.0 | 65.0 | 2.0 | 8.0 | 1.0 | 113.0 | 20.0 | 74.0 | 6.333333 | 85.0 | 62.0 | 23.0 | 4.0 | 106.0 |
| 01-465 Fairbanks No. 2 | 8.0 | 120.0 | 11.0 | 21.0 | 1.0 | 157.0 | 45.0 | 108.0 | 12.000000 | 129.0 | 120.0 | 51.0 | 7.0 | 142.0 |
| 01-470 Fairbanks No. 3 | 6.0 | 205.0 | 12.0 | 35.0 | 2.0 | 218.0 | 56.0 | 136.0 | 26.000000 | 225.0 | 191.0 | 48.0 | 13.0 | 236.0 |
| 01-475 Fairbanks No. 4 | 4.0 | 86.0 | 6.0 | 16.0 | 2.0 | 149.0 | 28.0 | 70.0 | 13.333333 | 137.0 | 104.0 | 27.0 | 10.0 | 133.0 |
2.2. Wrangle Wikipedia Data.
[21]:
uppercase_state = state.upper()
wiki_df = wiki_df.set_index('STATE')
wiki_df = wiki_df.loc[[uppercase_state]]
wiki_df = wiki_df.dropna(axis=1)
wiki_df.head()
[21]:
| PRES16D | PRES16G | PRES16L | PRES16R | SEN16D | SEN16L | SEN16R | USH16D | USH16L | USH16R | USH18R | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STATE | |||||||||||
| ALASKA | 116454.0 | 5735.0 | 18725.0 | 163387.0 | 36200.0 | 90825.0 | 138149.0 | 111019.0 | 31770.0 | 155088.0 | 149779.0 |
Step 3. Check naming convention compliance¶
Step 3.1 Generate naming conventions.
[22]:
with open('naming_convention.json') as json_file:
standards_raw = json.load(json_file)
offices = dm.get_keys_by_category(standards_raw, 'offices')
parties = dm.get_keys_by_category(standards_raw, 'parties')
counts = dm.get_keys_by_category(standards_raw, 'counts')
others = dm.get_keys_by_category(standards_raw, ['geographies',
'demographics',
'districts',
'other'])
elections = [office + format(year, '02') + party
for office in offices
for year in range(0, 21)
for party in parties
if not (office == 'PRES' and year % 4 != 0)]
counts = [count + format(year, '02')
for count in counts
for year in range(0, 20)]
standards = elections + counts + others
Step 3.2. Check AK_precincts compliance with naming conventions.
[23]:
naming_check = dq.compare_column_names(mggg_gdf, standards)
[24]:
# Print and store results of naming convention check
(matches, diffs) = naming_check
matched_columns = matches
diffs = list(diffs)
diffs.sort()
print('Discrepancies from naming convention:', diffs)
Discrepancies from naming convention: ['2MORE', 'AMIN', 'AREA', 'ASIAN', 'BLACK', 'DISTRICT', 'ID', 'NAME', 'NHPI', 'OTHER', 'POPULATION', 'PRES16C', 'WHITE']
Step 4. Compare AK_precincts against external sources¶
Step 4.1. Compare AK_precincts data against MEDSL data.
Step 4.1.1. Generate naming convention translations between MGGG and MEDSL.
- Note: Has to be done manually because some MEDSL data don’t use the same naming convention (e.g. ‘US President democratic’ vs ‘US Senate democrat’).
[25]:
pres16_cols = [
('PRES16D', 'US President democratic'),
('PRES16G', 'US President green'),
('PRES16L', 'US President libertarian'),
('PRES16R', 'US President republican')
]
sen16_cols = [
('SEN16D', 'US Senate democrat'),
('SEN16G', 'US Senate green'),
('SEN16L', 'US Senate libertarian'),
('SEN16R', 'US Senate republican')
]
ush16_cols = [
('USH16D', 'US House democratic'),
('USH16G', 'US House green'),
('USH16L', 'US House libertarian'),
('USH16R', 'US House republican')
]
mggg_medsl_18_cols = [
('SEN18D', 'US Senate democratic'),
('SEN18G', 'US Senate green'),
('SEN18L', 'US Senate libertarian'),
('SEN18R', 'US Senate republican'),
('USH18D', 'US House democrat'),
('USH18G', 'US House green'),
('USH18L', 'US House libertarian'),
('USH18R', 'US House republican')
]
[26]:
# Generate comparable columns
mggg_medsl_16_cols = pres16_cols + sen16_cols + ush16_cols
mggg_medsl_16_cols = [tup for tup in mggg_medsl_16_cols
if tup[0] in list(mggg_gdf.columns) and
tup[1] in list(medsl_16_df.columns)]
mggg_medsl_18_cols = [tup for tup in mggg_medsl_18_cols
if tup[0] in list(mggg_gdf.columns) and
tup[1] in list(medsl_16_df.columns)]
mggg_16_cols = [tup[0] for tup in mggg_medsl_16_cols]
medsl_16_cols = [tup[1] for tup in mggg_medsl_16_cols]
mggg_18_cols = [tup[0] for tup in mggg_medsl_18_cols]
medsl_18_cols = [tup[1] for tup in mggg_medsl_18_cols]
Step 4.1.2. Compare AK_precincts’ and MEDSL’s 2016 election data.
[27]:
mggg_medsl_16_results = dq.compare_column_sums(mggg_gdf, medsl_16_df,
mggg_16_cols, medsl_16_cols)
print('============================================================')
print('Comparison of AK_precincts\' and MEDSL\'s 2016 Election Data')
print('============================================================\n')
print('{:37} : {}'.format('AK_precincts [vs] MEDSL', 'difference in sums'))
print('------------------------------------------------------------')
for col_v_col, diff in mggg_medsl_16_results:
print('{:37} : {}'.format(col_v_col, diff))
============================================================
Comparison of AK_precincts' and MEDSL's 2016 Election Data
============================================================
AK_precincts [vs] MEDSL : difference in sums
------------------------------------------------------------
PRES16D [vs] US President democratic : -47220.0
PRES16G [vs] US President green : -1947.5
PRES16L [vs] US President libertarian : -6717.5
PRES16R [vs] US President republican : -59910.0
SEN16L [vs] US Senate libertarian : -30050.0
SEN16R [vs] US Senate republican : -50259.5
USH16D [vs] US House democratic : -43628.5
USH16L [vs] US House libertarian : -11665.5
USH16R [vs] US House republican : -54359.5
Step 4.1.3. Compare AK_precincts’ and MEDSL’s 2018 election data.
[28]:
mggg_medsl_18_results = dq.compare_column_sums(mggg_gdf, medsl_18_df,
mggg_18_cols, medsl_18_cols)
print('============================================================')
print('Comparison of AK_precincts\' and MEDSL\'s 2018 Election Data')
print('============================================================\n')
print('{:37} : {}'.format('AK_precincts [vs] MEDSL', 'Difference in sums'))
print('------------------------------------------------------------')
for col_v_col, diff in mggg_medsl_18_results:
print('{:37} : {}'.format(col_v_col, diff))
============================================================
Comparison of AK_precincts' and MEDSL's 2018 Election Data
============================================================
AK_precincts [vs] MEDSL : Difference in sums
------------------------------------------------------------
USH18R [vs] US House republican : -46390.5
Step 4.2. Compare AK_precincts data against Wikipedia data.
[29]:
mggg_wiki_cols = list(set(mggg_gdf.columns).intersection(set(wiki_df.columns)))
mggg_wiki_cols.sort()
mggg_wiki_results = dq.compare_column_sums(mggg_gdf, wiki_df,
mggg_wiki_cols, mggg_wiki_cols)
print('===========================================================')
print('Comparison of AK_precincts\' and Wikipedia\'s Election Data')
print('===========================================================\n')
print('{:27} : {}'.format('AK_precincts [vs] Wikipedia', 'Difference in sums'))
print('---------------------------------------------------')
for col_v_col, diff in mggg_wiki_results:
print('{:27} : {}'.format(col_v_col, diff))
===========================================================
Comparison of AK_precincts' and Wikipedia's Election Data
===========================================================
AK_precincts [vs] Wikipedia : Difference in sums
---------------------------------------------------
PRES16D [vs] PRES16D : -47357.0
PRES16G [vs] PRES16G : -1953.0
PRES16L [vs] PRES16L : -6721.0
PRES16R [vs] PRES16R : -59930.0
SEN16D [vs] SEN16D : -16598.0
SEN16L [vs] SEN16L : -30057.0
SEN16R [vs] SEN16R : -50295.0
USH16D [vs] USH16D : -43745.0
USH16L [vs] USH16L : -11670.0
USH16R [vs] USH16R : -54386.0
USH18R [vs] USH18R : -46394.0
Step 5. Check topological soundness of AK_precincts¶
Step 5.1. Check for empty or missing geometries
[30]:
warnings = 0
if dq.has_missing_geometries(mggg_gdf):
warnings += 1
print('AK_precincts has missing geometries.')
if dq.has_empty_geometries(mggg_gdf):
warnings += 1
print('AK_precints has empty geometries.')
if warnings == 0:
print('No missing or empty geometries.')
No missing or empty geometries.
Step 6. Sum datasets for manual checks¶
[31]:
# AK_precincts Sums
mggg_columns = [col for col in mggg_gdf.columns
if col != 'geometry' and col in elections]
mggg_sums = dq.sum_column_values(mggg_gdf, mggg_columns)
print('==========================')
print('AK_precincts\' column sums')
print('==========================\n')
print('{:10} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('--------------------')
for col, summation in mggg_sums:
print('{:10} : {}\t{}'.format(col, summation, type(summation)))
==========================
AK_precincts' column sums
==========================
Column : Sums Datatype
--------------------
USH14D : 77004 <class 'numpy.int64'>
USH14R : 102464 <class 'numpy.int64'>
USH14L : 14715 <class 'numpy.int64'>
PRES16D : 69097 <class 'numpy.int64'>
PRES16R : 103457 <class 'numpy.int64'>
PRES16L : 12004 <class 'numpy.int64'>
PRES16G : 3782 <class 'numpy.int64'>
SEN16D : 19602 <class 'numpy.int64'>
SEN16R : 87854 <class 'numpy.int64'>
SEN16L : 60768 <class 'numpy.int64'>
USH16D : 67274 <class 'numpy.int64'>
USH16R : 100702 <class 'numpy.int64'>
USH16L : 20100 <class 'numpy.int64'>
GOV18D : 80954 <class 'numpy.int64'>
GOV18R : 100372 <class 'numpy.int64'>
GOV18L : 3970 <class 'numpy.int64'>
USH18D : 83855 <class 'numpy.int64'>
USH18R : 103385 <class 'numpy.int64'>
[32]:
# MEDSL 2016 Sums
medsl_16_sums = dq.sum_column_values(medsl_16_df, list(medsl_16_df.columns))
print('===========================================')
print('MEDSL 2016\' column sums')
print('===========================================\n')
print('{:25} : {}\t\t{}'.format('Column', 'Sums', 'Datatype'))
print('-------------------------------------------')
for col, summation in medsl_16_sums:
print('{:25} : {}\t\t{}'.format(col, summation, type(summation)))
===========================================
MEDSL 2016' column sums
===========================================
Column : Sums Datatype
-------------------------------------------
US President constitution : 3866.0 <class 'numpy.float64'>
US President democratic : 116317.0 <class 'numpy.float64'>
US President green : 5729.5 <class 'numpy.float64'>
US President libertarian : 18721.5 <class 'numpy.float64'>
US President new alliance : 1239.5 <class 'numpy.float64'>
US President republican : 163367.0 <class 'numpy.float64'>
US Senate democratic : 36103.5 <class 'numpy.float64'>
US Senate libertarian : 90818.0 <class 'numpy.float64'>
US Senate new alliance : 15183.833333333332 <class 'numpy.float64'>
US Senate republican : 138113.5 <class 'numpy.float64'>
US House democratic : 110902.5 <class 'numpy.float64'>
US House libertarian : 31765.5 <class 'numpy.float64'>
US House new alliance : 9091.5 <class 'numpy.float64'>
US House republican : 155061.5 <class 'numpy.float64'>
[33]:
# MEDSL 2018 Sums
medsl_18_sums = dq.sum_column_values(medsl_18_df, list(medsl_18_df.columns))
print('=======================================')
print('MEDSL 2018\' column sums')
print('=======================================\n')
print('{:25} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('-------------------------------------')
for col, summation in medsl_18_sums:
print('{:25} : {}\t{}'.format(col, summation, type(summation)))
=======================================
MEDSL 2018' column sums
=======================================
Column : Sums Datatype
-------------------------------------
Governor democratic : 125739.0 <class 'numpy.float64'>
Governor independent : 5757.0 <class 'numpy.float64'>
Governor libertarian : 5402.0 <class 'numpy.float64'>
Governor republican : 145631.0 <class 'numpy.float64'>
State House democratic : 107045.0 <class 'numpy.float64'>
State House independent : 15229.5 <class 'numpy.float64'>
State House libertarian : 2274.0 <class 'numpy.float64'>
State House republican : 138238.0 <class 'numpy.float64'>
State Senate democratic : 51891.0 <class 'numpy.float64'>
State Senate independent : 6803.0 <class 'numpy.float64'>
State Senate republican : 65376.0 <class 'numpy.float64'>
US House democratic : 131143.5 <class 'numpy.float64'>
US House republican : 149775.5 <class 'numpy.float64'>
[34]:
# Wikipedia Sums
wiki_sums = dq.sum_column_values(wiki_df, list(wiki_df.columns))
print('==========================')
print('Wikipedia\' column sums')
print('==========================\n')
print('{:10} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('----------------------')
for col, summation in wiki_sums:
print('{:10} : {}\t{}'.format(col, summation, type(summation)))
==========================
Wikipedia' column sums
==========================
Column : Sums Datatype
----------------------
PRES16D : 116454.0 <class 'numpy.float64'>
PRES16G : 5735.0 <class 'numpy.float64'>
PRES16L : 18725.0 <class 'numpy.float64'>
PRES16R : 163387.0 <class 'numpy.float64'>
SEN16D : 36200.0 <class 'numpy.float64'>
SEN16L : 90825.0 <class 'numpy.float64'>
SEN16R : 138149.0 <class 'numpy.float64'>
USH16D : 111019.0 <class 'numpy.float64'>
USH16L : 31770.0 <class 'numpy.float64'>
USH16R : 155088.0 <class 'numpy.float64'>
USH18R : 149779.0 <class 'numpy.float64'>
Step 7. Cleanup¶
[35]:
# Remove cloned repos
dm.remove_repos('output/')
[36]:
# Uninstall installed python packages
!echo y | pip3 uninstall numpy
!echo y | pip3 uninstall pandas
!echo y | pip3 uninstall geopandas
!echo y | pip3 uninstall wikipedia
!echo y | pip3 uninstall gdutils
[37]:
# Reset Jupyter Notebook IPython Kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
Next Steps¶
After running the automated scripts, we recommend doing the following:
Data Standardization
- Manually evaluate column naming discrepancies to determine if changes are needed.
- Manually evaluate column datatypes to determine if changes are needed.
Data Comparison
- Manually investigate large differences found through comparing
AK_precinctsdata with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate?). - For overcounts, how are the votes counted? e.g. A
USH##Dcount may include votes for all Democratic candidates where external sources may be only counting one main Democratic candidate. - For more accurate comparisons, compare
AK_precinctsdata with those in each States’ Secretary of State website.
Topological Soundness
- Manually examine shapefiles for gaps and overlaps.
- Note: although gaps and overlaps are not necessarily indicators of inaccurate data (because some counties have precinct islands), they do mean that the data cannot be for chain runs.
Data Documentation
- Do the READMEs provide data sources?
- Do the READMEs describe what aggregation/disaggregation processes were used?
- Do the READMEs discuss discrepancies/caveats in the data?
- Do the READMEs provide scripts used and/or discuss the data wrangling/processing process?