QA Sample: AK_precincts

Below are the steps involved in performing automated data quality checks on the AK_precincts shapefile from mggg-states.

This notebook does the following:

  1. Collects the following Alaska election data from mggg-states, MEDSL, and Wikipedia:
    • 2016 United States presidential election
    • 2016 United States Senate elections
    • 2016 United States House of Representatives elections
    • 2018 United States Senate elections
    • 2018 United States House of Representatives elections
  2. Wrangles the datasets so that they can be compared against each other.
  3. Checks if column names in AK_precincts diverge from the MGGG naming convention (as outlined in naming_convention.json).
  4. Compares the vote counts in AK_precincts with those in the MEDSL and Wikipedia datasets.
  5. Prints the aggregated votes in AK_precincts for ease of spot checking against Secretary of State websites.

Note: the automated checks are not completely exhaustive and further manual checks are required. Please see Next Steps at the bottom of this notebook for continuing the data QA process.

Automation Check Timestamp: 02:00 pm ET, 14 August 2020

Step 0. Setup

[1]:
# Install useful Python packages

!conda install fiona shapely pyproj rtree && pip3 install wheel

!pip3 install pandas
!pip3 install geopandas
!pip3 install wikipedia

!pip3 install git+https://github.com/mggg/gdutils.git
[2]:
# Import useful Python modules

import numpy as np
import pandas as pd
import geopandas as gpd

import json # for parsing a json file
import wikipedia # unofficial Wikipedia package (wrapper of MediaWiki API)
import os # for ensuring file traversal works regardless of operating system

import gdutils.datamine as dm # data-mining module from gdutils
import gdutils.dataqa as dq   # data QA module from gdutils
import gdutils.extract as et  # table extraction module from gdutils

from typing import Any, List, Tuple, Dict, Hashable, Union, NoReturn

Step 1. Collect data

Step 1.1. Collect AK_precincts data from the mggg-statesAK-shapefiles GitHub repository.

[3]:
# Clone 'AK-shapefiles' repository into 'output/mggg/'

dm.clone_gh_repos(account='mggg-states',
                  account_type='orgs',
                  repos=['AK-shapefiles'],
                  outpath=os.path.join('output', 'mggg'))
[4]:
# Extract a GeoDataFrame from 'AK-shapefiles/AK_precincts.zip'

mggg_gdf = et.read_file(os.path.join('output', 'mggg', 'AK-shapefiles',
                                     'AK_precincts.zip')).extract()

mggg_gdf.head() # renders first 5 rows of the extracted gf
[4]:
ID AREA DISTRICT NAME POPULATION USH14D USH14R USH14L PRES16D PRES16R ... VAP WVAP BVAP AMINVAP ASIANVAP NHPIVAP OTHERVAP 2MOREVAP 2MORE geometry
0 266.0 1.553231 01-446 01-446 AURORA 2995.0 336 457 91 295 434 ... 2315 1740 92 237 78 2 48 118 229 POLYGON ((294705.801 1667364.692, 294704.326 1...
1 329.0 0.578508 01-455 01-455 FAIRBANKS NO. 1 659.0 72 106 16 65 113 ... 545 416 16 62 12 0 10 29 36 POLYGON ((297483.985 1669129.153, 297485.509 1...
2 267.0 0.469371 01-465 01-465 FAIRBANKS NO. 2 1542.0 108 166 44 120 157 ... 1312 853 85 252 37 1 20 64 100 POLYGON ((297800.944 1668172.899, 297823.138 1...
3 268.0 0.401854 01-470 01-470 FAIRBANKS NO. 3 1872.0 216 234 54 205 218 ... 1531 1047 97 232 36 3 30 86 138 POLYGON ((296902.053 1668075.791, 296915.198 1...
4 269.0 0.561294 01-475 01-475 FAIRBANKS NO. 4 1143.0 123 118 40 86 149 ... 883 622 28 153 27 0 14 39 106 POLYGON ((296178.482 1666807.889, 296101.344 1...

5 rows × 42 columns

Step 1.2. Gather MEDSL data for comparison purposes.

[5]:
# Print available MEDSL data to select applicable datasets

print('{:27} : {}'.format('Repo Name', 'Repo URL'))
print('------------------------------------------------------------------')

for (repo, url) in dm.list_gh_repos(account='MEDSL', account_type='orgs'):
    print("{:27} : {}".format(repo, url))
Repo Name                   : Repo URL
------------------------------------------------------------------
elections                   : https://github.com/MEDSL/elections.git
official-precinct-returns   : https://github.com/MEDSL/official-precinct-returns.git
primaries                   : https://github.com/MEDSL/primaries.git
data-management             : https://github.com/MEDSL/data-management.git
election-scrapers           : https://github.com/MEDSL/election-scrapers.git
medslcleaner                : https://github.com/MEDSL/medslcleaner.git
precinct-shapefiles         : https://github.com/MEDSL/precinct-shapefiles.git
documentation               : https://github.com/MEDSL/documentation.git
elections-performance-index : https://github.com/MEDSL/elections-performance-index.git
constituency-returns        : https://github.com/MEDSL/constituency-returns.git
state-returns               : https://github.com/MEDSL/state-returns.git
county-returns              : https://github.com/MEDSL/county-returns.git
2018-elections-unoffical    : https://github.com/MEDSL/2018-elections-unoffical.git
2018-elections-official     : https://github.com/MEDSL/2018-elections-official.git
Medium                      : https://github.com/MEDSL/Medium.git
ggmedsl                     : https://github.com/MEDSL/ggmedsl.git
resources                   : https://github.com/MEDSL/resources.git
election-data               : https://github.com/MEDSL/election-data.git
2019-elections-unofficial   : https://github.com/MEDSL/2019-elections-unofficial.git
primary-precinct-returns    : https://github.com/MEDSL/primary-precinct-returns.git
2020-primaries-unofficial   : https://github.com/MEDSL/2020-primaries-unofficial.git
2020-Primaries-Official     : https://github.com/MEDSL/2020-Primaries-Official.git
healthy_elections           : https://github.com/MEDSL/healthy_elections.git
[6]:
# Clone applicable MEDSL datasets

medsl_repos = ['official-precinct-returns', # precinct-level 2016 election results
               '2018-elections-official']   # constituency-level 2018 election results

# this will take some time to complete
dm.clone_gh_repos(account='MEDSL',
                  account_type='orgs',
                  repos=medsl_repos,
                  outpath=os.path.join('output', 'medsl'))
[7]:
# Find Alaska-specific MEDSL data

dm.list_files_of_type('.zip', os.path.join('output', 'medsl'))
[7]:
['output/medsl/2018-elections-official/precinct_2018.zip',
 'output/medsl/official-precinct-returns/2016-precinct-local/2016-precinct-local.zip',
 'output/medsl/official-precinct-returns/source/2016-tn-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ny-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ut-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wv-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ia-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-vt-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ma-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ct-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wi-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-mt-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ms-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-va-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-sd-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-oh-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-sc-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nm-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nj-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-hi-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ga-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-az-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ca-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-il-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-dc-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ok-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-pa-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nv-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ks-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-mo-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-tx-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nh-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-de-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wa-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-mi-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-mn-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-or-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-in-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-wy-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-fl-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-id-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-la-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ar-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-md-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ne-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ri-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-me-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-co-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ky-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-al-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nc-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-nd-precinct.zip',
 'output/medsl/official-precinct-returns/source/2016-ak-precinct.zip',
 'output/medsl/official-precinct-returns/2016-precinct-house/2016-precinct-house.zip',
 'output/medsl/official-precinct-returns/2016-precinct-state/2016-precinct-state.zip',
 'output/medsl/official-precinct-returns/2016-precinct-president/2016-precinct-president.zip',
 'output/medsl/official-precinct-returns/2016-precinct-senate/2016-precinct-senate.zip']
[8]:
# Extract DataFrames from:
# 'official-precinct-returns/2016-precinct-president/2016-precinct-president.zip',
# 'output/medsl/official-precinct-returns/2016-precinct-senate/2016-precinct-senate.zip',
# 'offical-precinct-returns/2016-precinct-house/2016-precinct-house.zip', and
# 2018-elections-offical/precinct_2018.zip'

medsl_16_path = os.path.join('output', 'medsl', 'official-precinct-returns')
medsl_18_path = os.path.join('output', 'medsl', '2018-elections-official')
[9]:
# 2016-precinct-president.zip

medsl_pres16_gdf = et.read_file(os.path.join(medsl_16_path, '2016-precinct-president',
                                             '2016-precinct-president.zip')).extract()

# convert GeoDataFrame to a DataFrame (since don't need 'geometry')
medsl_pres16_df = pd.DataFrame(medsl_pres16_gdf.drop(columns=['geometry']))

medsl_pres16_df.head()
[9]:
year stage special state state_postal state_fips state_icpsr county_name county_fips county_ansi ... candidate_middle candidate_full candidate_suffix candidate_nickname candidate_fec candidate_fec_name candidate_google candidate_govtrack candidate_icpsr candidate_maplight
0 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN P00003392 CLINTON, HILLARY RODHAM / TIMOTHY MICHAEL KAINE NaN NaN NaN NaN
1 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN P60012234 JOHNSON, JOHN FITZGERALD MR. NaN NaN NaN NaN
2 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN P20003984 STEIN, JILL NaN NaN NaN NaN
3 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN P80001571 TRUMP, DONALD J. / MICHAEL R. PENCE NaN NaN NaN NaN
4 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 37 columns

[10]:
# 2016-precinct-senate.zip

medsl_sen16_gdf  = et.read_file(os.path.join(medsl_16_path, '2016-precinct-senate',
                                             '2016-precinct-senate.zip')).extract()
medsl_sen16_df = pd.DataFrame(medsl_sen16_gdf.drop(columns=['geometry']))

medsl_sen16_df.head()
[10]:
year stage special state state_postal state_fips state_icpsr county_name county_fips county_ansi ... candidate_middle candidate_full candidate_suffix candidate_nickname candidate_fec candidate_fec_name candidate_google candidate_govtrack candidate_icpsr candidate_maplight
0 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN S6AL00302 CRUMPTON, RONALD (RON) STEVEN NaN NaN NaN NaN
4 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... C. Richard C. Shelby NaN NaN S6AL00013 NaN kg:/m/020yj1 300089.0 14659.0 608.0

5 rows × 37 columns

[11]:
# 2016-precinct-house.zip

medsl_ush16_gdf  = et.read_file(os.path.join(medsl_16_path, '2016-precinct-house',
                                             '2016-precinct-house.zip')).extract()
medsl_ush16_df = pd.DataFrame(medsl_ush16_gdf.drop(columns=['geometry']))

medsl_ush16_df.head()
[11]:
year stage special state state_postal state_fips state_icpsr county_name county_fips county_ansi ... candidate_middle candidate_full candidate_suffix candidate_nickname candidate_fec candidate_fec_name candidate_google candidate_govtrack candidate_icpsr candidate_maplight
0 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN NaN NaN NaN H6AL02167 MATHIS, NATHAN NaN NaN NaN NaN
4 2016 gen False Alabama AL 1 41 Autauga County 1001.0 161526.0 ... NaN Martha Roby NaN NaN H0AL02087 NaN kg:/m/0drx5mb 412394.0 21192.0 1408.0

5 rows × 37 columns

[12]:
# precinct_2018.zip

medsl_18_gdf = et.read_file(os.path.join(medsl_18_path, 'precinct_2018.zip')).extract()
medsl_18_df = pd.DataFrame(medsl_18_gdf.drop(columns=['geometry']))

medsl_18_df.head()
[12]:
precinct office party mode votes jurisdiction county candidate district dataverse year stage state special writein state_po state_fips state_cen state_ic
0 10 JONES COMMUNITY CTR Straight Party democratic election day 98 Autauga Autauga Alabama Democratic Party NaN all 2018 gen Alabama False False AL 1 63 41
1 10 JONES COMMUNITY CTR Straight Party republican election day 110 Autauga Autauga Alabama Republican Party NaN all 2018 gen Alabama False False AL 1 63 41
2 10 JONES COMMUNITY CTR US House democratic election day 118 Autauga Autauga Tabitha Isner 2 house 2018 gen Alabama False False AL 1 63 41
3 10 JONES COMMUNITY CTR US House republican election day 153 Autauga Autauga Martha Roby 2 house 2018 gen Alabama False False AL 1 63 41
4 10 JONES COMMUNITY CTR US House NaN election day 0 Autauga Autauga NaN 2 house 2018 gen Alabama False True AL 1 63 41

Step 1.3. Gather Wikipedia data for comparison purposes.

Note: The Wikipedia dataset was compiled from tables scraped from Wikipedia pages. You can review the scraping and wrangling notebook here <https://github.com/mggg/mggg-states-qa/blob/main/src/wikipedia-election-data-mining.ipynb>_.

[13]:
wiki_gdf = et.read_file('wiki_states.csv').extract()
wiki_df = pd.DataFrame(wiki_gdf.drop(columns=['geometry']))

wiki_df.head()
[13]:
STATE PRES16D PRES16G PRES16L PRES16R SEN16D SEN16G SEN16L SEN16R USH16D ... USH17L USH17R SEN18D SEN18G SEN18L SEN18R USH18D USH18G USH18L USH18R
0 ALABAMA 729547.0 9391.0 44467.0 1318255.0 748709.0 NaN NaN 1335104.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 ALASKA 116454.0 5735.0 18725.0 163387.0 36200.0 NaN 90825.0 138149.0 111019.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 149779.0
2 ARIZONA 1161167.0 34345.0 106327.0 1252401.0 1031245.0 138634.0 NaN 1359267.0 NaN ... NaN NaN 1191100.0 57442.0 NaN 1135200.0 NaN NaN NaN NaN
3 ARKANSAS 380494.0 9473.0 29829.0 684872.0 400602.0 NaN 43866.0 661984.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 CALIFORNIA 8753788.0 278657.0 478500.0 4483810.0 7542753.0 NaN NaN NaN NaN ... NaN NaN 6019422.0 NaN NaN NaN NaN NaN NaN NaN

5 rows × 29 columns

Step 2. Wrangle data

[14]:
state = 'Alaska'

2.1. Wrangle MEDSL data.

[15]:
def pivot_medsl_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Given a MEDSL DataFrame, return a pivoted DataFrame where the columns
    are elections and parties and the values are the votes for every precinct.

    """
    medsl_pvt = df.pivot_table(index='precinct',
                               columns=['office', 'party'],
                               values='votes')
    medsl_pvt.columns = [' '.join(col).strip() for col in medsl_pvt.columns.values]
    return medsl_pvt

2.1.1. Extract and pivot Alaska-specific data from each MEDSL DataFrame.

[16]:
medsl_pres16_gdf = et.ExtractTable(medsl_pres16_df, column='state', value=state).extract()
medsl_pres16_df = pd.DataFrame(medsl_pres16_gdf.drop(columns=['geometry']))
medsl_pres16_df = pivot_medsl_df(medsl_pres16_df)

medsl_pres16_df.head()
[16]:
US President constitution US President democratic US President green US President libertarian US President new alliance US President republican
precinct
01-446 Aurora 5.0 295.0 23.0 75.0 7.0 434.0
01-455 Fairbanks No. 1 2.0 65.0 2.0 8.0 1.0 113.0
01-465 Fairbanks No. 2 8.0 120.0 11.0 21.0 1.0 157.0
01-470 Fairbanks No. 3 6.0 205.0 12.0 35.0 2.0 218.0
01-475 Fairbanks No. 4 4.0 86.0 6.0 16.0 2.0 149.0
[17]:
medsl_sen16_gdf = et.ExtractTable(medsl_sen16_df, column='state', value=state).extract()
medsl_sen16_df = pd.DataFrame(medsl_sen16_gdf.drop(columns=['geometry']))
medsl_sen16_df = pivot_medsl_df(medsl_sen16_df)

medsl_sen16_df.head()
[17]:
US Senate democratic US Senate libertarian US Senate new alliance US Senate republican
precinct
01-446 Aurora 70.0 261.0 39.333333 409.0
01-455 Fairbanks No. 1 20.0 74.0 6.333333 85.0
01-465 Fairbanks No. 2 45.0 108.0 12.000000 129.0
01-470 Fairbanks No. 3 56.0 136.0 26.000000 225.0
01-475 Fairbanks No. 4 28.0 70.0 13.333333 137.0
[18]:
medsl_ush16_gdf = et.ExtractTable(medsl_ush16_df, column='state', value=state).extract()
medsl_ush16_df = pd.DataFrame(medsl_ush16_gdf.drop(columns=['geometry']))
medsl_ush16_df = pivot_medsl_df(medsl_ush16_df)

medsl_ush16_df.head()
[18]:
US House democratic US House libertarian US House new alliance US House republican
precinct
01-446 Aurora 334.0 109.0 20.0 389.0
01-455 Fairbanks No. 1 62.0 23.0 4.0 106.0
01-465 Fairbanks No. 2 120.0 51.0 7.0 142.0
01-470 Fairbanks No. 3 191.0 48.0 13.0 236.0
01-475 Fairbanks No. 4 104.0 27.0 10.0 133.0
[19]:
medsl_18_gdf = et.ExtractTable(medsl_18_df, column='state', value=state).extract()
medsl_18_df = pd.DataFrame(medsl_18_gdf.drop(columns=['geometry']))
medsl_18_df = pivot_medsl_df(medsl_18_df)

medsl_18_df.head()
[19]:
Governor democratic Governor independent Governor libertarian Governor republican State House democratic State House independent State House libertarian State House republican State Senate democratic State Senate independent State Senate republican US House democratic US House republican
precinct
01-446 Aurora 399.0 11.0 29.0 427.0 388.0 NaN NaN 464.0 462.0 NaN 392.0 429.0 419.0
01-455 Fairbanks No. 1 65.0 3.0 3.0 84.0 60.0 NaN NaN 93.0 75.0 NaN 79.0 66.0 88.0
01-465 Fairbanks No. 2 124.0 4.0 13.0 139.0 139.0 NaN NaN 137.0 151.0 NaN 127.0 142.0 136.0
01-470 Fairbanks No. 3 253.0 9.0 16.0 203.0 248.0 NaN NaN 237.0 277.0 NaN 205.0 265.0 223.0
01-475 Fairbanks No. 4 121.0 5.0 9.0 110.0 134.0 NaN NaN 109.0 136.0 NaN 106.0 136.0 104.0

2.1.2. Combine MEDSL 2016 election dataframes into one.

[20]:
medsl_16_df = pd.concat([medsl_pres16_df, medsl_sen16_df, medsl_ush16_df], axis=1)

medsl_16_df.head()
[20]:
US President constitution US President democratic US President green US President libertarian US President new alliance US President republican US Senate democratic US Senate libertarian US Senate new alliance US Senate republican US House democratic US House libertarian US House new alliance US House republican
precinct
01-446 Aurora 5.0 295.0 23.0 75.0 7.0 434.0 70.0 261.0 39.333333 409.0 334.0 109.0 20.0 389.0
01-455 Fairbanks No. 1 2.0 65.0 2.0 8.0 1.0 113.0 20.0 74.0 6.333333 85.0 62.0 23.0 4.0 106.0
01-465 Fairbanks No. 2 8.0 120.0 11.0 21.0 1.0 157.0 45.0 108.0 12.000000 129.0 120.0 51.0 7.0 142.0
01-470 Fairbanks No. 3 6.0 205.0 12.0 35.0 2.0 218.0 56.0 136.0 26.000000 225.0 191.0 48.0 13.0 236.0
01-475 Fairbanks No. 4 4.0 86.0 6.0 16.0 2.0 149.0 28.0 70.0 13.333333 137.0 104.0 27.0 10.0 133.0

2.2. Wrangle Wikipedia Data.

[21]:
uppercase_state = state.upper()

wiki_df = wiki_df.set_index('STATE')
wiki_df = wiki_df.loc[[uppercase_state]]
wiki_df = wiki_df.dropna(axis=1)

wiki_df.head()
[21]:
PRES16D PRES16G PRES16L PRES16R SEN16D SEN16L SEN16R USH16D USH16L USH16R USH18R
STATE
ALASKA 116454.0 5735.0 18725.0 163387.0 36200.0 90825.0 138149.0 111019.0 31770.0 155088.0 149779.0

Step 3. Check naming convention compliance

Step 3.1 Generate naming conventions.

[22]:
with open('naming_convention.json') as json_file:
    standards_raw = json.load(json_file)

offices = dm.get_keys_by_category(standards_raw, 'offices')
parties = dm.get_keys_by_category(standards_raw, 'parties')
counts  = dm.get_keys_by_category(standards_raw, 'counts')
others  = dm.get_keys_by_category(standards_raw, ['geographies',
                                                  'demographics',
                                                  'districts',
                                                  'other'])

elections = [office + format(year, '02') + party
             for office in offices
             for year in range(0, 21)
             for party in parties
             if not (office == 'PRES' and year % 4 != 0)]

counts    = [count + format(year, '02')
             for count in counts
             for year in range(0, 20)]

standards = elections + counts + others

Step 3.2. Check AK_precincts compliance with naming conventions.

[23]:
naming_check = dq.compare_column_names(mggg_gdf, standards)
[24]:
# Print and store results of naming convention check

(matches, diffs) = naming_check
matched_columns = matches

diffs = list(diffs)
diffs.sort()

print('Discrepancies from naming convention:', diffs)
Discrepancies from naming convention: ['2MORE', 'AMIN', 'AREA', 'ASIAN', 'BLACK', 'DISTRICT', 'ID', 'NAME', 'NHPI', 'OTHER', 'POPULATION', 'PRES16C', 'WHITE']

Step 4. Compare AK_precincts against external sources

Step 4.1. Compare AK_precincts data against MEDSL data.

Step 4.1.1. Generate naming convention translations between MGGG and MEDSL.

  • Note: Has to be done manually because some MEDSL data don’t use the same naming convention (e.g. ‘US President democratic’ vs ‘US Senate democrat’).
[25]:
pres16_cols = [
    ('PRES16D', 'US President democratic'),
    ('PRES16G', 'US President green'),
    ('PRES16L', 'US President libertarian'),
    ('PRES16R', 'US President republican')
]

sen16_cols = [
    ('SEN16D', 'US Senate democrat'),
    ('SEN16G', 'US Senate green'),
    ('SEN16L', 'US Senate libertarian'),
    ('SEN16R', 'US Senate republican')
]

ush16_cols = [
    ('USH16D', 'US House democratic'),
    ('USH16G', 'US House green'),
    ('USH16L', 'US House libertarian'),
    ('USH16R', 'US House republican')
]

mggg_medsl_18_cols = [
    ('SEN18D', 'US Senate democratic'),
    ('SEN18G', 'US Senate green'),
    ('SEN18L', 'US Senate libertarian'),
    ('SEN18R', 'US Senate republican'),
    ('USH18D', 'US House democrat'),
    ('USH18G', 'US House green'),
    ('USH18L', 'US House libertarian'),
    ('USH18R', 'US House republican')
]
[26]:
# Generate comparable columns

mggg_medsl_16_cols = pres16_cols + sen16_cols + ush16_cols
mggg_medsl_16_cols = [tup for tup in mggg_medsl_16_cols
                          if tup[0] in list(mggg_gdf.columns) and
                             tup[1] in list(medsl_16_df.columns)]

mggg_medsl_18_cols = [tup for tup in mggg_medsl_18_cols
                          if tup[0] in list(mggg_gdf.columns) and
                             tup[1] in list(medsl_16_df.columns)]

mggg_16_cols  = [tup[0] for tup in mggg_medsl_16_cols]
medsl_16_cols = [tup[1] for tup in mggg_medsl_16_cols]

mggg_18_cols  = [tup[0] for tup in mggg_medsl_18_cols]
medsl_18_cols = [tup[1] for tup in mggg_medsl_18_cols]

Step 4.1.2. Compare AK_precincts’ and MEDSL’s 2016 election data.

[27]:
mggg_medsl_16_results = dq.compare_column_sums(mggg_gdf, medsl_16_df,
                                               mggg_16_cols, medsl_16_cols)

print('============================================================')
print('Comparison of AK_precincts\' and MEDSL\'s 2016 Election Data')
print('============================================================\n')
print('{:37} : {}'.format('AK_precincts [vs] MEDSL', 'difference in sums'))
print('------------------------------------------------------------')

for col_v_col, diff in mggg_medsl_16_results:
    print('{:37} : {}'.format(col_v_col, diff))
============================================================
Comparison of AK_precincts' and MEDSL's 2016 Election Data
============================================================

AK_precincts [vs] MEDSL               : difference in sums
------------------------------------------------------------
PRES16D [vs] US President democratic  : -47220.0
PRES16G [vs] US President green       : -1947.5
PRES16L [vs] US President libertarian : -6717.5
PRES16R [vs] US President republican  : -59910.0
SEN16L [vs] US Senate libertarian     : -30050.0
SEN16R [vs] US Senate republican      : -50259.5
USH16D [vs] US House democratic       : -43628.5
USH16L [vs] US House libertarian      : -11665.5
USH16R [vs] US House republican       : -54359.5

Step 4.1.3. Compare AK_precincts’ and MEDSL’s 2018 election data.

[28]:
mggg_medsl_18_results = dq.compare_column_sums(mggg_gdf, medsl_18_df,
                                               mggg_18_cols, medsl_18_cols)

print('============================================================')
print('Comparison of AK_precincts\' and MEDSL\'s 2018 Election Data')
print('============================================================\n')
print('{:37} : {}'.format('AK_precincts [vs] MEDSL', 'Difference in sums'))
print('------------------------------------------------------------')

for col_v_col, diff in mggg_medsl_18_results:
    print('{:37} : {}'.format(col_v_col, diff))
============================================================
Comparison of AK_precincts' and MEDSL's 2018 Election Data
============================================================

AK_precincts [vs] MEDSL               : Difference in sums
------------------------------------------------------------
USH18R [vs] US House republican       : -46390.5

Step 4.2. Compare AK_precincts data against Wikipedia data.

[29]:
mggg_wiki_cols = list(set(mggg_gdf.columns).intersection(set(wiki_df.columns)))
mggg_wiki_cols.sort()

mggg_wiki_results = dq.compare_column_sums(mggg_gdf, wiki_df,
                                           mggg_wiki_cols, mggg_wiki_cols)

print('===========================================================')
print('Comparison of AK_precincts\' and Wikipedia\'s Election Data')
print('===========================================================\n')
print('{:27} : {}'.format('AK_precincts [vs] Wikipedia', 'Difference in sums'))
print('---------------------------------------------------')

for col_v_col, diff in mggg_wiki_results:
    print('{:27} : {}'.format(col_v_col, diff))
===========================================================
Comparison of AK_precincts' and Wikipedia's Election Data
===========================================================

AK_precincts [vs] Wikipedia : Difference in sums
---------------------------------------------------
PRES16D [vs] PRES16D        : -47357.0
PRES16G [vs] PRES16G        : -1953.0
PRES16L [vs] PRES16L        : -6721.0
PRES16R [vs] PRES16R        : -59930.0
SEN16D [vs] SEN16D          : -16598.0
SEN16L [vs] SEN16L          : -30057.0
SEN16R [vs] SEN16R          : -50295.0
USH16D [vs] USH16D          : -43745.0
USH16L [vs] USH16L          : -11670.0
USH16R [vs] USH16R          : -54386.0
USH18R [vs] USH18R          : -46394.0

Step 5. Check topological soundness of AK_precincts

Step 5.1. Check for empty or missing geometries

[30]:
warnings = 0

if dq.has_missing_geometries(mggg_gdf):
    warnings += 1
    print('AK_precincts has missing geometries.')

if dq.has_empty_geometries(mggg_gdf):
    warnings += 1
    print('AK_precints has empty geometries.')

if warnings == 0:
    print('No missing or empty geometries.')
No missing or empty geometries.

Step 6. Sum datasets for manual checks

[31]:
# AK_precincts Sums

mggg_columns = [col for col in mggg_gdf.columns
                    if col != 'geometry' and col in elections]

mggg_sums = dq.sum_column_values(mggg_gdf, mggg_columns)

print('==========================')
print('AK_precincts\' column sums')
print('==========================\n')
print('{:10} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('--------------------')

for col, summation in mggg_sums:
    print('{:10} : {}\t{}'.format(col, summation, type(summation)))
==========================
AK_precincts' column sums
==========================

Column     : Sums       Datatype
--------------------
USH14D     : 77004      <class 'numpy.int64'>
USH14R     : 102464     <class 'numpy.int64'>
USH14L     : 14715      <class 'numpy.int64'>
PRES16D    : 69097      <class 'numpy.int64'>
PRES16R    : 103457     <class 'numpy.int64'>
PRES16L    : 12004      <class 'numpy.int64'>
PRES16G    : 3782       <class 'numpy.int64'>
SEN16D     : 19602      <class 'numpy.int64'>
SEN16R     : 87854      <class 'numpy.int64'>
SEN16L     : 60768      <class 'numpy.int64'>
USH16D     : 67274      <class 'numpy.int64'>
USH16R     : 100702     <class 'numpy.int64'>
USH16L     : 20100      <class 'numpy.int64'>
GOV18D     : 80954      <class 'numpy.int64'>
GOV18R     : 100372     <class 'numpy.int64'>
GOV18L     : 3970       <class 'numpy.int64'>
USH18D     : 83855      <class 'numpy.int64'>
USH18R     : 103385     <class 'numpy.int64'>
[32]:
# MEDSL 2016 Sums

medsl_16_sums = dq.sum_column_values(medsl_16_df, list(medsl_16_df.columns))

print('===========================================')
print('MEDSL 2016\' column sums')
print('===========================================\n')
print('{:25} : {}\t\t{}'.format('Column', 'Sums', 'Datatype'))
print('-------------------------------------------')

for col, summation in medsl_16_sums:
    print('{:25} : {}\t\t{}'.format(col, summation, type(summation)))
===========================================
MEDSL 2016' column sums
===========================================

Column                    : Sums                Datatype
-------------------------------------------
US President constitution : 3866.0              <class 'numpy.float64'>
US President democratic   : 116317.0            <class 'numpy.float64'>
US President green        : 5729.5              <class 'numpy.float64'>
US President libertarian  : 18721.5             <class 'numpy.float64'>
US President new alliance : 1239.5              <class 'numpy.float64'>
US President republican   : 163367.0            <class 'numpy.float64'>
US Senate democratic      : 36103.5             <class 'numpy.float64'>
US Senate libertarian     : 90818.0             <class 'numpy.float64'>
US Senate new alliance    : 15183.833333333332          <class 'numpy.float64'>
US Senate republican      : 138113.5            <class 'numpy.float64'>
US House democratic       : 110902.5            <class 'numpy.float64'>
US House libertarian      : 31765.5             <class 'numpy.float64'>
US House new alliance     : 9091.5              <class 'numpy.float64'>
US House republican       : 155061.5            <class 'numpy.float64'>
[33]:
# MEDSL 2018 Sums

medsl_18_sums = dq.sum_column_values(medsl_18_df, list(medsl_18_df.columns))

print('=======================================')
print('MEDSL 2018\' column sums')
print('=======================================\n')
print('{:25} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('-------------------------------------')

for col, summation in medsl_18_sums:
    print('{:25} : {}\t{}'.format(col, summation, type(summation)))
=======================================
MEDSL 2018' column sums
=======================================

Column                    : Sums        Datatype
-------------------------------------
Governor democratic       : 125739.0    <class 'numpy.float64'>
Governor independent      : 5757.0      <class 'numpy.float64'>
Governor libertarian      : 5402.0      <class 'numpy.float64'>
Governor republican       : 145631.0    <class 'numpy.float64'>
State House democratic    : 107045.0    <class 'numpy.float64'>
State House independent   : 15229.5     <class 'numpy.float64'>
State House libertarian   : 2274.0      <class 'numpy.float64'>
State House republican    : 138238.0    <class 'numpy.float64'>
State Senate democratic   : 51891.0     <class 'numpy.float64'>
State Senate independent  : 6803.0      <class 'numpy.float64'>
State Senate republican   : 65376.0     <class 'numpy.float64'>
US House democratic       : 131143.5    <class 'numpy.float64'>
US House republican       : 149775.5    <class 'numpy.float64'>
[34]:
# Wikipedia Sums

wiki_sums = dq.sum_column_values(wiki_df, list(wiki_df.columns))

print('==========================')
print('Wikipedia\' column sums')
print('==========================\n')
print('{:10} : {}\t{}'.format('Column', 'Sums', 'Datatype'))
print('----------------------')

for col, summation in wiki_sums:
    print('{:10} : {}\t{}'.format(col, summation, type(summation)))
==========================
Wikipedia' column sums
==========================

Column     : Sums       Datatype
----------------------
PRES16D    : 116454.0   <class 'numpy.float64'>
PRES16G    : 5735.0     <class 'numpy.float64'>
PRES16L    : 18725.0    <class 'numpy.float64'>
PRES16R    : 163387.0   <class 'numpy.float64'>
SEN16D     : 36200.0    <class 'numpy.float64'>
SEN16L     : 90825.0    <class 'numpy.float64'>
SEN16R     : 138149.0   <class 'numpy.float64'>
USH16D     : 111019.0   <class 'numpy.float64'>
USH16L     : 31770.0    <class 'numpy.float64'>
USH16R     : 155088.0   <class 'numpy.float64'>
USH18R     : 149779.0   <class 'numpy.float64'>

Step 7. Cleanup

[35]:
# Remove cloned repos

dm.remove_repos('output/')
[36]:
# Uninstall installed python packages

!echo y | pip3 uninstall numpy
!echo y | pip3 uninstall pandas
!echo y | pip3 uninstall geopandas
!echo y | pip3 uninstall wikipedia

!echo y | pip3 uninstall gdutils
[37]:
# Reset Jupyter Notebook IPython Kernel

from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Next Steps

After running the automated scripts, we recommend doing the following:

Data Standardization

  • Manually evaluate column naming discrepancies to determine if changes are needed.
  • Manually evaluate column datatypes to determine if changes are needed.

Data Comparison

  • Manually investigate large differences found through comparing AK_precincts data with external sources (e.g. Are absentee ballots counted? Are the precinct counts accurate?).
  • For overcounts, how are the votes counted? e.g. A USH##D count may include votes for all Democratic candidates where external sources may be only counting one main Democratic candidate.
  • For more accurate comparisons, compare AK_precincts data with those in each States’ Secretary of State website.

Topological Soundness

  • Manually examine shapefiles for gaps and overlaps.
  • Note: although gaps and overlaps are not necessarily indicators of inaccurate data (because some counties have precinct islands), they do mean that the data cannot be for chain runs.

Data Documentation

  • Do the READMEs provide data sources?
  • Do the READMEs describe what aggregation/disaggregation processes were used?
  • Do the READMEs discuss discrepancies/caveats in the data?
  • Do the READMEs provide scripts used and/or discuss the data wrangling/processing process?