gdutils.extract

extract is a module in package gdutils that provides a class ExtractTable. ExtractTable is used in converting and extracting tabular data.


Examples Setup

The following commands are used for setting up the examples below.

Note: The example input files were pulled and converted from the GeoJSON link provided in the geopandas IO docs.

[1]:
# Install ``gdutils`` package
!conda install fiona shapely pyproj rtree && pip3 install wheel
!pip3 install git+https://github.com/mggg/gdutils.git > /dev/null
[2]:
import gdutils.extract as et # imports the ``extract`` module

import geopandas as gpd
import pandas as pd

Example 1. Extract a table

Note: returns a geopandas GeoDataFrame

Example 1.1. Extract a table from a file

  • Example 1.1.1. Extract from a shapefile
[3]:
# Ex. 1.1.1

shp_path = 'example-inputs/example-shp/example.shp' # path to file containing table to extract
shp_et = et.ExtractTable(shp_path) # alternative: et.read_file(filepath)
shp_gdf = shp_et.extract() # extracts table as a geopandas GeoDataframe

shp_gdf.head() # renders first 5 rows of table
[3]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...
  • Example 1.1.2. Extract from a CSV
[4]:
# Ex. 1.1.2

csv_path = 'example-inputs/example.csv'
csv_et = et.read_file(csv_path) # using alternative
csv_gdf = csv_et.extract()

csv_gdf.head()
[4]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...
  • Example 1.1.3. Extract from an Excel file
[5]:
# Ex. 1.1.3

excel_path = 'example-inputs/example.csv'
excel_gdf = et.read_file(excel_path).extract() # shorthand equivalent

excel_gdf.head()
[5]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...
  • Example 1.1.4. Extract from a ZIP file
[6]:
# Ex. 1.1.4

zip_path = 'example-inputs/example.zip'
zip_gdf = et.read_file(zip_path).extract()

zip_gdf.head()
[6]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.2. Extract a table from a URL

[7]:
# Ex. 1.2

url = 'http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson'
    # URL copied from https://geopandas.org/io.html
url_gdf = et.ExtractTable(url).extract()

url_gdf.head()
[7]:
scalerank featureclass geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.3. Extract a table from a pandas DataFrame

[8]:
# Ex. 1.3

pandas_df = pd.read_csv(csv_path)
pandas_gdf = et.ExtractTable(pandas_df).extract()

pandas_gdf.head()
[8]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.4. Extract a table from a geopandas GeoDataFrame

[9]:
# Ex. 1.4

geopandas_gdf = et.ExtractTable(csv_gdf).extract()

geopandas_gdf.head()
[9]:
scalerank featurecla geometry
0 1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2 1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3 1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4 1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 2. Extract a table with a selected index

Example 2.1. Extract a table with a known column label as the index

[10]:
# Ex. 2.1

known_column = 'featurecla'
known_column_gdf = et.ExtractTable(shp_path, column=known_column).extract()
    # alternative: et.read_file(shp_path, column=known_column)

known_column_gdf.head()
[10]:
scalerank geometry
featurecla
Country 1 POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
Country 1 POLYGON ((-159.20818 -79.49706, -161.12760 -79...
Country 1 POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
Country 1 POLYGON ((-121.21151 -73.50099, -119.91885 -73...
Country 1 POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 2.2. Extract a table without a known column label as the index

[11]:
# Ex. 2.2

unknown_column_et = et.ExtractTable(shp_path)
columns_list = unknown_column_et.list_columns() # returns a list of columns from which to choose
print(columns_list)
['scalerank' 'featurecla' 'geometry']
[12]:
unknown_column_et.column = 'scalerank' # selects the 'scalerank' column as the index
unknown_column_gdf = unknown_column_et.extract()

unknown_column_gdf.head()
[12]:
featurecla geometry
scalerank
1 Country POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1 Country POLYGON ((-159.20818 -79.49706, -161.12760 -79...
1 Country POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
1 Country POLYGON ((-121.21151 -73.50099, -119.91885 -73...
1 Country POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 3. Extract a subtable

Note: The counties.zip file contains a shapefile of California county boundaries as sourced from the 2010 US decennial census. The shapefile was pulled from the NHGIS database.

[13]:
counties_file = 'example-inputs/counties.zip'

Example 3.1. Extract a subtable without a known column value

[14]:
# Ex. 3.1

unknown_value_et = et.read_file(counties_file)
print(unknown_value_et.list_columns())
['STATEFP10' 'COUNTYFP10' 'COUNTYNS10' 'GEOID10' 'NAME10' 'NAMELSAD10'
 'LSAD10' 'CLASSFP10' 'MTFCC10' 'CSAFP10' 'CBSAFP10' 'METDIVFP10'
 'FUNCSTAT10' 'ALAND10' 'AWATER10' 'INTPTLAT10' 'INTPTLON10' 'GISJOIN'
 'Shape_area' 'Shape_len' 'geometry']
[15]:
unknown_value_et.column = 'NAME10'
print(unknown_value_et.list_values()) # alternatively, use `list_values(unique=True)` to get unique values
['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
 'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
 'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
 'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
 'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
 'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
 'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
 'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
 'Kings' 'Humboldt' 'Solano' 'Merced']
[16]:
unknown_value_et.value = 'Alameda' # can also take in a list e.g. = ['Alameda', 'Alpine']
unknown_value_gdf = unknown_value_et.extract()

unknown_value_gdf.head()
[16]:
STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 CBSAFP10 METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
NAME10
Alameda 06 001 01675839 06001 Alameda County 06 H1 G4020 488 41860 36084 A 1914046110 213184643 +37.6480811 -121.9133039 G0600010 1.928979e+09 376006.655638 MULTIPOLYGON (((-174219.106 -59155.702, -17418...

Example 3.2. Extract a subtable with a known column value

[17]:
# Ex. 3.2

known_value_gdf = et.read_file(counties_file, column='NAME10', value='Alameda').extract()

known_value_gdf.head()
[17]:
STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 CBSAFP10 METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
NAME10
Alameda 06 001 01675839 06001 Alameda County 06 H1 G4020 488 41860 36084 A 1914046110 213184643 +37.6480811 -121.9133039 G0600010 1.928979e+09 376006.655638 MULTIPOLYGON (((-174219.106 -59155.702, -17418...

Example 3.3. Extract a subtable with known column values

[18]:
# Ex. 3.3

known_values = ['Alameda', 'San Francisco', 'Napa']
known_values_gdf = et.read_file(counties_file, column='NAME10', value=known_values).extract()

known_values_gdf.head()
[18]:
STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 CBSAFP10 METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
NAME10
Napa 06 055 00277292 06055 Napa County 06 H1 G4020 488 34900 None A 1938247338 104169342 +38.5073511 -122.3259947 G0600550 2.037287e+09 288618.399143 MULTIPOLYGON (((-201156.864 17876.015, -201155...
Alameda 06 001 01675839 06001 Alameda County 06 H1 G4020 488 41860 36084 A 1914046110 213184643 +37.6480811 -121.9133039 G0600010 1.928979e+09 376006.655638 MULTIPOLYGON (((-174219.106 -59155.702, -17418...
San Francisco 06 075 00277302 06075 San Francisco County 06 H6 G4020 488 41860 41884 C 121399963 479190317 +37.7272391 -123.0322294 G0600750 1.231584e+08 154624.606681 MULTIPOLYGON (((-209365.731 -25846.453, -20937...

Example 4. Extract to a file

[19]:
!mkdir outputs # creates a folder called 'output'

output_path = 'outputs/output.shp'
    # the output filetype depends on the provided extension
    # e.g. 'output/output.csv' writes to a CSV file
    # e.g. 'output/output.xlsx' writes to an Excel file

Ex. 4.1. Extract to file from a geopandas GeoDataFrame

[20]:
# Ex. 4.1.

et.ExtractTable(known_values_gdf).extract_to_file(output_path)
[21]:
# Let's look at the extracted file:
et.ExtractTable(output_path).extract().head()
[21]:
NAME10 STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 ... METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
0 Napa 06 055 00277292 06055 Napa County 06 H1 G4020 488 ... None A 1938247338 104169342 +38.5073511 -122.3259947 G0600550 2.037287e+09 288618.399143 MULTIPOLYGON (((-201156.864 17876.015, -201155...
1 Alameda 06 001 01675839 06001 Alameda County 06 H1 G4020 488 ... 36084 A 1914046110 213184643 +37.6480811 -121.9133039 G0600010 1.928979e+09 376006.655638 MULTIPOLYGON (((-174219.106 -59155.702, -17418...
2 San Francisco 06 075 00277302 06075 San Francisco County 06 H6 G4020 488 ... 41884 C 121399963 479190317 +37.7272391 -123.0322294 G0600750 1.231584e+08 154624.606681 MULTIPOLYGON (((-209365.731 -25846.453, -20937...

3 rows × 21 columns

Ex. 4.2. Extract to file from input file without known values

[22]:
# Ex. 4.2.

unknown_to_file_et = et.read_file(counties_file)
print(unknown_to_file_et.list_values(column='NAME10', unique=True))
['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
 'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
 'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
 'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
 'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
 'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
 'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
 'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
 'Kings' 'Humboldt' 'Solano' 'Merced']
[23]:
unknown_to_file_et.column = 'NAME10'
unknown_to_file_et.value = ['Merced', 'Solano', 'Humboldt', 'Kings', 'Santa Cruz']
unknown_to_file_et.outfile = 'outputs/output.csv' # sets output path

unknown_to_file_et.extract_to_file()
[24]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()
[24]:
NAME10 STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 ... METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
0 Santa Cruz 6 87 277308 6087 Santa Cruz County 6 H1 G4020 488.0 ... NaN A 1152986019 419568865 37.012488 -122.007205 G0600870 1.155957e+09 239149.547419 POLYGON ((-140716.159 -123018.344, -141122.104...
1 Kings 6 31 277280 6031 Kings County 6 H1 G4020 NaN ... NaN A 3598582308 5468555 36.072478 -119.815530 G0600310 3.604052e+09 286325.134354 POLYGON ((41703.011 -247393.941, 41055.159 -24...
2 Humboldt 6 23 1681908 6023 Humboldt County 6 H1 G4020 NaN ... NaN A 9241044673 1254256396 40.706673 -123.925818 G0600230 9.285519e+09 635828.325095 MULTIPOLYGON (((-374245.354 277821.256, -37428...
3 Solano 6 95 277312 6095 Solano County 6 H1 G4020 488.0 ... NaN A 2128361199 218665937 38.267226 -121.939594 G0600950 2.178626e+09 408608.211983 MULTIPOLYGON (((-168280.104 5555.363, -168277....
4 Merced 6 47 277288 6047 Merced County 6 H1 G4020 NaN ... NaN A 5011554741 112760487 37.194806 -120.722802 G0600470 5.124315e+09 343377.690641 POLYGON ((-4614.978 -92627.507, -4623.314 -926...

5 rows × 21 columns

Ex. 4.3. Extract to file from input file with known columns and values

[25]:
# Ex. 4.3.

to_file_values = ['Sacramento', 'Yolo', 'San Diego']

et.ExtractTable(counties_file, 'outputs/output.csv',
                column='NAME10', value=to_file_values).extract_to_file()
[26]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()
[26]:
NAME10 STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 ... METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 GISJOIN Shape_area Shape_len geometry
0 Yolo 6 113 277321 6113 Yolo County 6 H1 G4020 472.0 ... NaN A 2628032408 22969178 38.679268 -121.903178 G0601130 2.650946e+09 325597.089926 POLYGON ((-202671.894 103421.056, -202487.312 ...
1 Sacramento 6 67 277298 6067 Sacramento County 6 H1 G4020 472.0 ... NaN A 2498415670 76077248 38.450011 -121.340441 G0600670 2.537265e+09 377415.585089 MULTIPOLYGON (((-156925.313 2228.787, -156863....
2 San Diego 6 73 277301 6073 San Diego County 6 H1 G4020 NaN ... NaN A 10895120648 826347909 33.023604 -116.776117 G0600730 1.097434e+10 645134.817837 MULTIPOLYGON (((261506.133 -578090.189, 261556...

3 rows × 21 columns


Examples Cleanup

The following commands are used to reset and clean up the examples above.

[27]:
# Remove outputs
!rm -r outputs
[28]:
# Uninstall Package
!echo y | pip uninstall gdutils
[29]:
# Reset Jupyter Notebook IPython Kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")