gdutils.extract¶

extract is a module in package gdutils that provides a class ExtractTable. ExtractTable is used in converting and extracting tabular data.

Examples Setup

The following commands are used for setting up the examples below.

Note: The example input files were pulled and converted from the GeoJSON link provided in the geopandas IO docs.

[1]:

# Install ``gdutils`` package
!conda install fiona shapely pyproj rtree && pip3 install wheel
!pip3 install git+https://github.com/mggg/gdutils.git > /dev/null

[2]:

import gdutils.extract as et # imports the ``extract`` module

import geopandas as gpd
import pandas as pd

Example 1. Extract a table¶

Note: returns a geopandas GeoDataFrame

Example 1.1. Extract a table from a file

Example 1.1.1. Extract from a shapefile

[3]:

# Ex. 1.1.1

shp_path = 'example-inputs/example-shp/example.shp' # path to file containing table to extract
shp_et = et.ExtractTable(shp_path) # alternative: et.read_file(filepath)
shp_gdf = shp_et.extract() # extracts table as a geopandas GeoDataframe

shp_gdf.head() # renders first 5 rows of table

[3]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.1.2. Extract from a CSV

[4]:

# Ex. 1.1.2

csv_path = 'example-inputs/example.csv'
csv_et = et.read_file(csv_path) # using alternative
csv_gdf = csv_et.extract()

csv_gdf.head()

[4]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.1.3. Extract from an Excel file

[5]:

# Ex. 1.1.3

excel_path = 'example-inputs/example.csv'
excel_gdf = et.read_file(excel_path).extract() # shorthand equivalent

excel_gdf.head()

[5]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.1.4. Extract from a ZIP file

[6]:

# Ex. 1.1.4

zip_path = 'example-inputs/example.zip'
zip_gdf = et.read_file(zip_path).extract()

zip_gdf.head()

[6]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.2. Extract a table from a URL

[7]:

# Ex. 1.2

url = 'http://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_land.geojson'
    # URL copied from https://geopandas.org/io.html
url_gdf = et.ExtractTable(url).extract()

url_gdf.head()

[7]:

	scalerank	featureclass	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.3. Extract a table from a pandas DataFrame

[8]:

# Ex. 1.3

pandas_df = pd.read_csv(csv_path)
pandas_gdf = et.ExtractTable(pandas_df).extract()

pandas_gdf.head()

[8]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 1.4. Extract a table from a geopandas GeoDataFrame

[9]:

# Ex. 1.4

geopandas_gdf = et.ExtractTable(csv_gdf).extract()

geopandas_gdf.head()

[9]:

	scalerank	featurecla	geometry
0	1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
2	1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
3	1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
4	1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 2. Extract a table with a selected index¶

Example 2.1. Extract a table with a known column label as the index

[10]:

# Ex. 2.1

known_column = 'featurecla'
known_column_gdf = et.ExtractTable(shp_path, column=known_column).extract()
    # alternative: et.read_file(shp_path, column=known_column)

known_column_gdf.head()

[10]:

	scalerank	geometry
featurecla
Country	1	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
Country	1	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
Country	1	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
Country	1	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
Country	1	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 2.2. Extract a table without a known column label as the index

[11]:

# Ex. 2.2

unknown_column_et = et.ExtractTable(shp_path)
columns_list = unknown_column_et.list_columns() # returns a list of columns from which to choose
print(columns_list)

['scalerank' 'featurecla' 'geometry']

[12]:

unknown_column_et.column = 'scalerank' # selects the 'scalerank' column as the index
unknown_column_gdf = unknown_column_et.extract()

unknown_column_gdf.head()

[12]:

	featurecla	geometry
scalerank
1	Country	POLYGON ((-59.57209 -80.04018, -59.86585 -80.5...
1	Country	POLYGON ((-159.20818 -79.49706, -161.12760 -79...
1	Country	POLYGON ((-45.15476 -78.04707, -43.92083 -78.4...
1	Country	POLYGON ((-121.21151 -73.50099, -119.91885 -73...
1	Country	POLYGON ((-125.55957 -73.48135, -124.03188 -73...

Example 3. Extract a subtable¶

Note: The counties.zip file contains a shapefile of California county boundaries as sourced from the 2010 US decennial census. The shapefile was pulled from the NHGIS database.

[13]:

counties_file = 'example-inputs/counties.zip'

Example 3.1. Extract a subtable without a known column value

[14]:

# Ex. 3.1

unknown_value_et = et.read_file(counties_file)
print(unknown_value_et.list_columns())

['STATEFP10' 'COUNTYFP10' 'COUNTYNS10' 'GEOID10' 'NAME10' 'NAMELSAD10'
 'LSAD10' 'CLASSFP10' 'MTFCC10' 'CSAFP10' 'CBSAFP10' 'METDIVFP10'
 'FUNCSTAT10' 'ALAND10' 'AWATER10' 'INTPTLAT10' 'INTPTLON10' 'GISJOIN'
 'Shape_area' 'Shape_len' 'geometry']

[15]:

unknown_value_et.column = 'NAME10'
print(unknown_value_et.list_values()) # alternatively, use `list_values(unique=True)` to get unique values

['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
 'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
 'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
 'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
 'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
 'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
 'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
 'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
 'Kings' 'Humboldt' 'Solano' 'Merced']

[16]:

unknown_value_et.value = 'Alameda' # can also take in a list e.g. = ['Alameda', 'Alpine']
unknown_value_gdf = unknown_value_et.extract()

unknown_value_gdf.head()

[16]:

	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	CBSAFP10	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
NAME10
Alameda	06	001	01675839	06001	Alameda County	06	H1	G4020	488	41860	36084	A	1914046110	213184643	+37.6480811	-121.9133039	G0600010	1.928979e+09	376006.655638	MULTIPOLYGON (((-174219.106 -59155.702, -17418...

Example 3.2. Extract a subtable with a known column value

[17]:

# Ex. 3.2

known_value_gdf = et.read_file(counties_file, column='NAME10', value='Alameda').extract()

known_value_gdf.head()

[17]:

	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	CBSAFP10	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
NAME10
Alameda	06	001	01675839	06001	Alameda County	06	H1	G4020	488	41860	36084	A	1914046110	213184643	+37.6480811	-121.9133039	G0600010	1.928979e+09	376006.655638	MULTIPOLYGON (((-174219.106 -59155.702, -17418...

Example 3.3. Extract a subtable with known column values

[18]:

# Ex. 3.3

known_values = ['Alameda', 'San Francisco', 'Napa']
known_values_gdf = et.read_file(counties_file, column='NAME10', value=known_values).extract()

known_values_gdf.head()

[18]:

	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	CBSAFP10	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
NAME10
Napa	06	055	00277292	06055	Napa County	06	H1	G4020	488	34900	None	A	1938247338	104169342	+38.5073511	-122.3259947	G0600550	2.037287e+09	288618.399143	MULTIPOLYGON (((-201156.864 17876.015, -201155...
Alameda	06	001	01675839	06001	Alameda County	06	H1	G4020	488	41860	36084	A	1914046110	213184643	+37.6480811	-121.9133039	G0600010	1.928979e+09	376006.655638	MULTIPOLYGON (((-174219.106 -59155.702, -17418...
San Francisco	06	075	00277302	06075	San Francisco County	06	H6	G4020	488	41860	41884	C	121399963	479190317	+37.7272391	-123.0322294	G0600750	1.231584e+08	154624.606681	MULTIPOLYGON (((-209365.731 -25846.453, -20937...

Example 4. Extract to a file¶

[19]:

!mkdir outputs # creates a folder called 'output'

output_path = 'outputs/output.shp'
    # the output filetype depends on the provided extension
    # e.g. 'output/output.csv' writes to a CSV file
    # e.g. 'output/output.xlsx' writes to an Excel file

Ex. 4.1. Extract to file from a geopandas GeoDataFrame

[20]:

# Ex. 4.1.

et.ExtractTable(known_values_gdf).extract_to_file(output_path)

[21]:

# Let's look at the extracted file:
et.ExtractTable(output_path).extract().head()

[21]:

	NAME10	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	...	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
0	Napa	06	055	00277292	06055	Napa County	06	H1	G4020	488	...	None	A	1938247338	104169342	+38.5073511	-122.3259947	G0600550	2.037287e+09	288618.399143	MULTIPOLYGON (((-201156.864 17876.015, -201155...
1	Alameda	06	001	01675839	06001	Alameda County	06	H1	G4020	488	...	36084	A	1914046110	213184643	+37.6480811	-121.9133039	G0600010	1.928979e+09	376006.655638	MULTIPOLYGON (((-174219.106 -59155.702, -17418...
2	San Francisco	06	075	00277302	06075	San Francisco County	06	H6	G4020	488	...	41884	C	121399963	479190317	+37.7272391	-123.0322294	G0600750	1.231584e+08	154624.606681	MULTIPOLYGON (((-209365.731 -25846.453, -20937...

3 rows × 21 columns

Ex. 4.2. Extract to file from input file without known values

[22]:

# Ex. 4.2.

unknown_to_file_et = et.read_file(counties_file)
print(unknown_to_file_et.list_values(column='NAME10', unique=True))

['Orange' 'Tehama' 'Colusa' 'Santa Barbara' 'Mono' 'Monterey' 'Placer'
 'Amador' 'Calaveras' 'Imperial' 'Siskiyou' 'Sonoma' 'Santa Clara' 'Kern'
 'Yolo' 'Mendocino' 'Sacramento' 'Madera' 'Yuba' 'Tulare' 'San Diego'
 'Plumas' 'San Benito' 'Shasta' 'Stanislaus' 'Mariposa' 'Fresno' 'Alpine'
 'Marin' 'Glenn' 'Lassen' 'Del Norte' 'Napa' 'San Luis Obispo' 'San Mateo'
 'Nevada' 'San Joaquin' 'San Bernardino' 'Sutter' 'Riverside' 'Trinity'
 'Contra Costa' 'Ventura' 'Tuolumne' 'Butte' 'Sierra' 'Lake' 'Modoc'
 'El Dorado' 'Los Angeles' 'Alameda' 'Inyo' 'San Francisco' 'Santa Cruz'
 'Kings' 'Humboldt' 'Solano' 'Merced']

[23]:

unknown_to_file_et.column = 'NAME10'
unknown_to_file_et.value = ['Merced', 'Solano', 'Humboldt', 'Kings', 'Santa Cruz']
unknown_to_file_et.outfile = 'outputs/output.csv' # sets output path

unknown_to_file_et.extract_to_file()

[24]:

# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()

[24]:

	NAME10	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	...	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
0	Santa Cruz	6	87	277308	6087	Santa Cruz County	6	H1	G4020	488.0	...	NaN	A	1152986019	419568865	37.012488	-122.007205	G0600870	1.155957e+09	239149.547419	POLYGON ((-140716.159 -123018.344, -141122.104...
1	Kings	6	31	277280	6031	Kings County	6	H1	G4020	NaN	...	NaN	A	3598582308	5468555	36.072478	-119.815530	G0600310	3.604052e+09	286325.134354	POLYGON ((41703.011 -247393.941, 41055.159 -24...
2	Humboldt	6	23	1681908	6023	Humboldt County	6	H1	G4020	NaN	...	NaN	A	9241044673	1254256396	40.706673	-123.925818	G0600230	9.285519e+09	635828.325095	MULTIPOLYGON (((-374245.354 277821.256, -37428...
3	Solano	6	95	277312	6095	Solano County	6	H1	G4020	488.0	...	NaN	A	2128361199	218665937	38.267226	-121.939594	G0600950	2.178626e+09	408608.211983	MULTIPOLYGON (((-168280.104 5555.363, -168277....
4	Merced	6	47	277288	6047	Merced County	6	H1	G4020	NaN	...	NaN	A	5011554741	112760487	37.194806	-120.722802	G0600470	5.124315e+09	343377.690641	POLYGON ((-4614.978 -92627.507, -4623.314 -926...

5 rows × 21 columns

Ex. 4.3. Extract to file from input file with known columns and values

[25]:

# Ex. 4.3.

to_file_values = ['Sacramento', 'Yolo', 'San Diego']

et.ExtractTable(counties_file, 'outputs/output.csv',
                column='NAME10', value=to_file_values).extract_to_file()

[26]:

# Let's look at the extracted file:
et.read_file('outputs/output.csv').extract().head()

[26]:

	NAME10	STATEFP10	COUNTYFP10	COUNTYNS10	GEOID10	NAMELSAD10	LSAD10	CLASSFP10	MTFCC10	CSAFP10	...	METDIVFP10	FUNCSTAT10	ALAND10	AWATER10	INTPTLAT10	INTPTLON10	GISJOIN	Shape_area	Shape_len	geometry
0	Yolo	6	113	277321	6113	Yolo County	6	H1	G4020	472.0	...	NaN	A	2628032408	22969178	38.679268	-121.903178	G0601130	2.650946e+09	325597.089926	POLYGON ((-202671.894 103421.056, -202487.312 ...
1	Sacramento	6	67	277298	6067	Sacramento County	6	H1	G4020	472.0	...	NaN	A	2498415670	76077248	38.450011	-121.340441	G0600670	2.537265e+09	377415.585089	MULTIPOLYGON (((-156925.313 2228.787, -156863....
2	San Diego	6	73	277301	6073	San Diego County	6	H1	G4020	NaN	...	NaN	A	10895120648	826347909	33.023604	-116.776117	G0600730	1.097434e+10	645134.817837	MULTIPOLYGON (((261506.133 -578090.189, 261556...

3 rows × 21 columns

Examples Cleanup

The following commands are used to reset and clean up the examples above.

[27]:

# Remove outputs
!rm -r outputs

[28]:

# Uninstall Package
!echo y | pip uninstall gdutils

[29]:

# Reset Jupyter Notebook IPython Kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")