gdutils.extract¶
extract is a module in package gdutils that provides a class ExtractTable. ExtractTable is used in converting and extracting tabular data.
Examples Setup
The following commands are used for setting up the examples below.
Note: The example input files were pulled and converted from the GeoJSON link provided in the geopandas IO docs.
[1]:
# Install ``gdutils`` package
!conda install fiona shapely pyproj rtree && pip3 install wheel
!pip3 install git+https://github.com/mggg/gdutils.git > /dev/null
[2]:
import gdutils.extract as et # imports the ``extract`` module
import geopandas as gpd
import pandas as pd
Example 1. Extract a table¶
Note: returns a geopandas GeoDataFrame
Example 1.1. Extract a table from a file
- Example 1.1.1. Extract from a shapefile
[3]:
# Ex. 1.1.1
shp_path = 'example-inputs/example-shp/example.shp' # path to file containing table to extract
shp_et = et.ExtractTable(shp_path) # alternative: et.read_file(filepath)
shp_gdf = shp_et.extract() # extracts table as a geopandas GeoDataframe
shp_gdf.head() # renders first 5 rows of table
[3]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
- Example 1.1.2. Extract from a CSV
[4]:
# Ex. 1.1.2
csv_path = 'example-inputs/example.csv'
csv_et = et.read_file(csv_path) # using alternative
csv_gdf = csv_et.extract()
csv_gdf.head()
[4]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
- Example 1.1.3. Extract from an Excel file
[5]:
# Ex. 1.1.3
excel_path = 'example-inputs/example.csv'
excel_gdf = et.read_file(excel_path).extract() # shorthand equivalent
excel_gdf.head()
[5]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
- Example 1.1.4. Extract from a ZIP file
[6]:
# Ex. 1.1.4
zip_path = 'example-inputs/example.zip'
zip_gdf = et.read_file(zip_path).extract()
zip_gdf.head()
[6]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 1.2. Extract a table from a URL
[7]:
# Ex. 1.2
url = 'http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson'
# URL copied from https://geopandas.org/io.html
url_gdf = et.ExtractTable(url).extract()
url_gdf.head()
[7]:
| scalerank | featureclass | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 1.3. Extract a table from a pandas DataFrame
[8]:
# Ex. 1.3
pandas_df = pd.read_csv(csv_path)
pandas_gdf = et.ExtractTable(pandas_df).extract()
pandas_gdf.head()
[8]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 1.4. Extract a table from a geopandas GeoDataFrame
[9]:
# Ex. 1.4
geopandas_gdf = et.ExtractTable(csv_gdf).extract()
geopandas_gdf.head()
[9]:
| scalerank | featurecla | geometry | |
|---|---|---|---|
| 0 | 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 2 | 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 3 | 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 4 | 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 2. Extract a table with a selected index¶
Example 2.1. Extract a table with a known column label as the index
[10]:
# Ex. 2.1
known_column = 'featurecla'
known_column_gdf = et.ExtractTable(shp_path, column=known_column).extract()
# alternative: et.read_file(shp_path, column=known_column)
known_column_gdf.head()
[10]:
| scalerank | geometry | |
|---|---|---|
| featurecla | ||
| Country | 1 | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| Country | 1 | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| Country | 1 | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| Country | 1 | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| Country | 1 | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 2.2. Extract a table without a known column label as the index
[11]:
# Ex. 2.2
unknown_column_et = et.ExtractTable(shp_path)
columns_list = unknown_column_et.list_columns() # returns a list of columns from which to choose
print(columns_list)
['scalerank' 'featurecla' 'geometry']
[12]:
unknown_column_et.column = 'scalerank' # selects the 'scalerank' column as the index
unknown_column_gdf = unknown_column_et.extract()
unknown_column_gdf.head()
[12]:
| featurecla | geometry | |
|---|---|---|
| scalerank | ||
| 1 | Country | POLYGON ((-59.57209 -80.04018, -59.86585 -80.5... |
| 1 | Country | POLYGON ((-159.20818 -79.49706, -161.12760 -79... |
| 1 | Country | POLYGON ((-45.15476 -78.04707, -43.92083 -78.4... |
| 1 | Country | POLYGON ((-121.21151 -73.50099, -119.91885 -73... |
| 1 | Country | POLYGON ((-125.55957 -73.48135, -124.03188 -73... |
Example 3. Extract a subtable¶
Note: The counties.zip file contains a shapefile of California county boundaries as sourced from the 2010 US decennial census. The shapefile was pulled from the NHGIS database.
[13]:
counties_file = 'example-inputs/counties.zip'
Example 3.1. Extract a subtable without a known column value
[14]:
# Ex. 3.1
unknown_value_et = et.read_file(counties_file)
print(unknown_value_et.list_columns())
['STATEFP10' 'COUNTYFP10' 'COUNTYNS10' 'GEOID10' 'NAME10' 'NAMELSAD10'
'LSAD10' 'CLASSFP10' 'MTFCC10' 'CSAFP10' 'CBSAFP10' 'METDIVFP10'
'FUNCSTAT10' 'ALAND10' 'AWATER10' 'INTPTLAT10' 'INTPTLON10' 'GISJOIN'
'Shape_area' 'Shape_len' 'geometry']
[15]:
unknown_value_et.column = 'NAME10'
print(unknown_value_et.list_values()) # alternatively, use `list_values(unique=True)` to get unique values
['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
'Kings' 'Humboldt' 'Solano' 'Merced']
[16]:
unknown_value_et.value = 'Alameda' # can also take in a list e.g. = ['Alameda', 'Alpine']
unknown_value_gdf = unknown_value_et.extract()
unknown_value_gdf.head()
[16]:
| STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | CBSAFP10 | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NAME10 | ||||||||||||||||||||
| Alameda | 06 | 001 | 01675839 | 06001 | Alameda County | 06 | H1 | G4020 | 488 | 41860 | 36084 | A | 1914046110 | 213184643 | +37.6480811 | -121.9133039 | G0600010 | 1.928979e+09 | 376006.655638 | MULTIPOLYGON (((-174219.106 -59155.702, -17418... |
Example 3.2. Extract a subtable with a known column value
[17]:
# Ex. 3.2
known_value_gdf = et.read_file(counties_file, column='NAME10', value='Alameda').extract()
known_value_gdf.head()
[17]:
| STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | CBSAFP10 | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NAME10 | ||||||||||||||||||||
| Alameda | 06 | 001 | 01675839 | 06001 | Alameda County | 06 | H1 | G4020 | 488 | 41860 | 36084 | A | 1914046110 | 213184643 | +37.6480811 | -121.9133039 | G0600010 | 1.928979e+09 | 376006.655638 | MULTIPOLYGON (((-174219.106 -59155.702, -17418... |
Example 3.3. Extract a subtable with known column values
[18]:
# Ex. 3.3
known_values = ['Alameda', 'San Francisco', 'Napa']
known_values_gdf = et.read_file(counties_file, column='NAME10', value=known_values).extract()
known_values_gdf.head()
[18]:
| STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | CBSAFP10 | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NAME10 | ||||||||||||||||||||
| Napa | 06 | 055 | 00277292 | 06055 | Napa County | 06 | H1 | G4020 | 488 | 34900 | None | A | 1938247338 | 104169342 | +38.5073511 | -122.3259947 | G0600550 | 2.037287e+09 | 288618.399143 | MULTIPOLYGON (((-201156.864 17876.015, -201155... |
| Alameda | 06 | 001 | 01675839 | 06001 | Alameda County | 06 | H1 | G4020 | 488 | 41860 | 36084 | A | 1914046110 | 213184643 | +37.6480811 | -121.9133039 | G0600010 | 1.928979e+09 | 376006.655638 | MULTIPOLYGON (((-174219.106 -59155.702, -17418... |
| San Francisco | 06 | 075 | 00277302 | 06075 | San Francisco County | 06 | H6 | G4020 | 488 | 41860 | 41884 | C | 121399963 | 479190317 | +37.7272391 | -123.0322294 | G0600750 | 1.231584e+08 | 154624.606681 | MULTIPOLYGON (((-209365.731 -25846.453, -20937... |
Example 4. Extract to a file¶
[19]:
!mkdir outputs # creates a folder called 'output'
output_path = 'outputs/output.shp'
# the output filetype depends on the provided extension
# e.g. 'output/output.csv' writes to a CSV file
# e.g. 'output/output.xlsx' writes to an Excel file
Ex. 4.1. Extract to file from a geopandas GeoDataFrame
[20]:
# Ex. 4.1.
et.ExtractTable(known_values_gdf).extract_to_file(output_path)
[21]:
# Let's look at the extracted file:
et.ExtractTable(output_path).extract().head()
[21]:
| NAME10 | STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | ... | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Napa | 06 | 055 | 00277292 | 06055 | Napa County | 06 | H1 | G4020 | 488 | ... | None | A | 1938247338 | 104169342 | +38.5073511 | -122.3259947 | G0600550 | 2.037287e+09 | 288618.399143 | MULTIPOLYGON (((-201156.864 17876.015, -201155... |
| 1 | Alameda | 06 | 001 | 01675839 | 06001 | Alameda County | 06 | H1 | G4020 | 488 | ... | 36084 | A | 1914046110 | 213184643 | +37.6480811 | -121.9133039 | G0600010 | 1.928979e+09 | 376006.655638 | MULTIPOLYGON (((-174219.106 -59155.702, -17418... |
| 2 | San Francisco | 06 | 075 | 00277302 | 06075 | San Francisco County | 06 | H6 | G4020 | 488 | ... | 41884 | C | 121399963 | 479190317 | +37.7272391 | -123.0322294 | G0600750 | 1.231584e+08 | 154624.606681 | MULTIPOLYGON (((-209365.731 -25846.453, -20937... |
3 rows × 21 columns
Ex. 4.2. Extract to file from input file without known values
[22]:
# Ex. 4.2.
unknown_to_file_et = et.read_file(counties_file)
print(unknown_to_file_et.list_values(column='NAME10', unique=True))
['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
'Kings' 'Humboldt' 'Solano' 'Merced']
[23]:
unknown_to_file_et.column = 'NAME10'
unknown_to_file_et.value = ['Merced', 'Solano', 'Humboldt', 'Kings', 'Santa Cruz']
unknown_to_file_et.outfile = 'outputs/output.csv' # sets output path
unknown_to_file_et.extract_to_file()
[24]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()
[24]:
| NAME10 | STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | ... | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Santa Cruz | 6 | 87 | 277308 | 6087 | Santa Cruz County | 6 | H1 | G4020 | 488.0 | ... | NaN | A | 1152986019 | 419568865 | 37.012488 | -122.007205 | G0600870 | 1.155957e+09 | 239149.547419 | POLYGON ((-140716.159 -123018.344, -141122.104... |
| 1 | Kings | 6 | 31 | 277280 | 6031 | Kings County | 6 | H1 | G4020 | NaN | ... | NaN | A | 3598582308 | 5468555 | 36.072478 | -119.815530 | G0600310 | 3.604052e+09 | 286325.134354 | POLYGON ((41703.011 -247393.941, 41055.159 -24... |
| 2 | Humboldt | 6 | 23 | 1681908 | 6023 | Humboldt County | 6 | H1 | G4020 | NaN | ... | NaN | A | 9241044673 | 1254256396 | 40.706673 | -123.925818 | G0600230 | 9.285519e+09 | 635828.325095 | MULTIPOLYGON (((-374245.354 277821.256, -37428... |
| 3 | Solano | 6 | 95 | 277312 | 6095 | Solano County | 6 | H1 | G4020 | 488.0 | ... | NaN | A | 2128361199 | 218665937 | 38.267226 | -121.939594 | G0600950 | 2.178626e+09 | 408608.211983 | MULTIPOLYGON (((-168280.104 5555.363, -168277.... |
| 4 | Merced | 6 | 47 | 277288 | 6047 | Merced County | 6 | H1 | G4020 | NaN | ... | NaN | A | 5011554741 | 112760487 | 37.194806 | -120.722802 | G0600470 | 5.124315e+09 | 343377.690641 | POLYGON ((-4614.978 -92627.507, -4623.314 -926... |
5 rows × 21 columns
Ex. 4.3. Extract to file from input file with known columns and values
[25]:
# Ex. 4.3.
to_file_values = ['Sacramento', 'Yolo', 'San Diego']
et.ExtractTable(counties_file, 'outputs/output.csv',
column='NAME10', value=to_file_values).extract_to_file()
[26]:
# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()
[26]:
| NAME10 | STATEFP10 | COUNTYFP10 | COUNTYNS10 | GEOID10 | NAMELSAD10 | LSAD10 | CLASSFP10 | MTFCC10 | CSAFP10 | ... | METDIVFP10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | GISJOIN | Shape_area | Shape_len | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yolo | 6 | 113 | 277321 | 6113 | Yolo County | 6 | H1 | G4020 | 472.0 | ... | NaN | A | 2628032408 | 22969178 | 38.679268 | -121.903178 | G0601130 | 2.650946e+09 | 325597.089926 | POLYGON ((-202671.894 103421.056, -202487.312 ... |
| 1 | Sacramento | 6 | 67 | 277298 | 6067 | Sacramento County | 6 | H1 | G4020 | 472.0 | ... | NaN | A | 2498415670 | 76077248 | 38.450011 | -121.340441 | G0600670 | 2.537265e+09 | 377415.585089 | MULTIPOLYGON (((-156925.313 2228.787, -156863.... |
| 2 | San Diego | 6 | 73 | 277301 | 6073 | San Diego County | 6 | H1 | G4020 | NaN | ... | NaN | A | 10895120648 | 826347909 | 33.023604 | -116.776117 | G0600730 | 1.097434e+10 | 645134.817837 | MULTIPOLYGON (((261506.133 -578090.189, 261556... |
3 rows × 21 columns
Examples Cleanup
The following commands are used to reset and clean up the examples above.
[27]:
# Remove outputs
!rm -r outputs
[28]:
# Uninstall Package
!echo y | pip uninstall gdutils
[29]:
# Reset Jupyter Notebook IPython Kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")