gdutils.datamine
datamine is a module in the gdutils package that provides functions for finding, listing, and mining data.
Examples Setup
The following commands set up the examples below.
Note: The example input files were pulled and converted from the GeoJSON link provided in the geopandas IO docs.
[1]:
# Install ``gdutils`` package
!conda install -y fiona shapely pyproj rtree && pip3 install wheel
!pip3 install git+https://github.com/mggg/gdutils.git > /dev/null
[2]:
import gdutils.datamine as dm # imports the ``datamine`` module
import geopandas as gpd
import pandas as pd
Example 1. Get a list of public GitHub repos
Example 1.1. Get a list of public repos from a GitHub user account
[3]:
# Ex. 1.1
user_account = 'octocat'
user_repos = dm.list_gh_repos(user_account, 'users') # gets repos
user_repos # renders raw list of repos
[3]:
[('boysenberry-repo-1', 'https://github.com/octocat/boysenberry-repo-1.git'),
('git-consortium', 'https://github.com/octocat/git-consortium.git'),
('hello-worId', 'https://github.com/octocat/hello-worId.git'),
('Hello-World', 'https://github.com/octocat/Hello-World.git'),
('linguist', 'https://github.com/octocat/linguist.git'),
('octocat.github.io', 'https://github.com/octocat/octocat.github.io.git'),
('Spoon-Knife', 'https://github.com/octocat/Spoon-Knife.git'),
('test-repo1', 'https://github.com/octocat/test-repo1.git')]
[4]:
# prints the list of repos as a formatted table, unpacking each (name, url) tuple
print('{:20} : {}'.format('repo name', 'repo url'))
print('-------------------------------')
for (repo_name, repo_url) in user_repos:
    print('{:20} : {}'.format(repo_name, repo_url))
repo name            : repo url
-------------------------------
boysenberry-repo-1   : https://github.com/octocat/boysenberry-repo-1.git
git-consortium       : https://github.com/octocat/git-consortium.git
hello-worId          : https://github.com/octocat/hello-worId.git
Hello-World          : https://github.com/octocat/Hello-World.git
linguist             : https://github.com/octocat/linguist.git
octocat.github.io    : https://github.com/octocat/octocat.github.io.git
Spoon-Knife          : https://github.com/octocat/Spoon-Knife.git
test-repo1           : https://github.com/octocat/test-repo1.git
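Because the result is a list of `(name, url)` tuples, as the raw output above shows, it converts directly into a dictionary for lookups by repo name. The following is a minimal sketch that uses a hard-coded sample of that output so it runs without network access; `sample_repos`, `repo_urls`, and `matches` are illustrative names and not part of gdutils.

```python
# Sample of the (name, url) tuples shown in the output above
sample_repos = [
    ('git-consortium', 'https://github.com/octocat/git-consortium.git'),
    ('Hello-World', 'https://github.com/octocat/Hello-World.git'),
    ('Spoon-Knife', 'https://github.com/octocat/Spoon-Knife.git'),
]

# Each 2-tuple becomes one key-value pair: name -> clone URL
repo_urls = dict(sample_repos)

# Look up a clone URL by repository name
print(repo_urls['Spoon-Knife'])

# Select repo names with a case-insensitive substring match
matches = [name for name in repo_urls if 'hello' in name.lower()]
print(matches)
```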
Example 1.2. Get a list of public repos from a GitHub organization account
[5]:
# Ex. 1.2.
org_account = 'mggg-states'
org_repos = dm.list_gh_repos(org_account, 'orgs')
# prints the list of repos as a formatted table, unpacking each (name, url) tuple
print('{:20} : {}'.format('repo name', 'repo url'))
print('-------------------------------')
for repo_name, repo_url in org_repos:
    print('{:20} : {}'.format(repo_name, repo_url))
repo name            : repo url
-------------------------------
PA-shapefiles        : https://github.com/mggg-states/PA-shapefiles.git
MA-shapefiles        : https://github.com/mggg-states/MA-shapefiles.git
WI-shapefiles        : https://github.com/mggg-states/WI-shapefiles.git
AK-shapefiles        : https://github.com/mggg-states/AK-shapefiles.git
OH-shapefiles        : https://github.com/mggg-states/OH-shapefiles.git
TX-shapefiles        : https://github.com/mggg-states/TX-shapefiles.git
GA-shapefiles        : https://github.com/mggg-states/GA-shapefiles.git
IL-shapefiles        : https://github.com/mggg-states/IL-shapefiles.git
NC-shapefiles        : https://github.com/mggg-states/NC-shapefiles.git
UT-shapefiles        : https://github.com/mggg-states/UT-shapefiles.git
VA-shapefiles        : https://github.com/mggg-states/VA-shapefiles.git
VT-shapefiles        : https://github.com/mggg-states/VT-shapefiles.git
MI-shapefiles        : https://github.com/mggg-states/MI-shapefiles.git
IA-shapefiles        : https://github.com/mggg-states/IA-shapefiles.git
RI-shapefiles        : https://github.com/mggg-states/RI-shapefiles.git
MN-shapefiles        : https://github.com/mggg-states/MN-shapefiles.git
NM-shapefiles        : https://github.com/mggg-states/NM-shapefiles.git
MD-shapefiles        : https://github.com/mggg-states/MD-shapefiles.git
OR-shapefiles        : https://github.com/mggg-states/OR-shapefiles.git
CO-shapefiles        : https://github.com/mggg-states/CO-shapefiles.git
OK-shapefiles        : https://github.com/mggg-states/OK-shapefiles.git
HI-shapefiles        : https://github.com/mggg-states/HI-shapefiles.git
CT-shapefiles        : https://github.com/mggg-states/CT-shapefiles.git
AZ-shapefiles        : https://github.com/mggg-states/AZ-shapefiles.git
DE-shapefiles        : https://github.com/mggg-states/DE-shapefiles.git
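The repo names in this organization follow a `<STATE>-shapefiles` pattern, so a subset can be picked out by state abbreviation. The sketch below works on a hard-coded sample of the output above; `sample_org_repos`, `wanted_states`, and `selected` are illustrative names, and the naming pattern is inferred from the listing rather than guaranteed by gdutils.

```python
# Sample of the (name, url) tuples shown in the output above
sample_org_repos = [
    ('PA-shapefiles', 'https://github.com/mggg-states/PA-shapefiles.git'),
    ('AK-shapefiles', 'https://github.com/mggg-states/AK-shapefiles.git'),
    ('CT-shapefiles', 'https://github.com/mggg-states/CT-shapefiles.git'),
    ('AZ-shapefiles', 'https://github.com/mggg-states/AZ-shapefiles.git'),
]

# State abbreviations of interest (hypothetical selection)
wanted_states = {'AK', 'AZ'}

# Keep names whose prefix before the first hyphen is a wanted state
selected = [
    name for name, _url in sample_org_repos
    if name.split('-')[0] in wanted_states
]
print(selected)
```

A list of names in this shape matches the third argument passed to `clone_gh_repos` in Example 2.2.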
Example 2. Clone public GitHub repos
Example 2.1. Clone all repositories of a known account
[6]:
# Ex. 2.1
dm.clone_gh_repos(user_account, 'users')
Example 2.2. Clone specific repositories of a known account
[7]:
# Ex. 2.2.
dm.clone_gh_repos(org_account, 'orgs', ['AK-shapefiles', 'AZ-shapefiles'])
Example 2.3. Clone specific repos into a given directory
[8]:
# Ex. 2.3.
dm.clone_gh_repos(org_account, 'orgs', ['CT-shapefiles'], 'outputs/')
Example 2.4. Clone all repos into a given directory
[9]:
# Ex. 2.4.
dm.clone_gh_repos(user_account, 'users', outpath='outputs/')
Example 4. Get a list of local files of specific types
Example 4.1. Recursively list files of a given type starting from the current working directory
[10]:
# Ex. 4.1.
files_from_cwd = dm.list_files_of_type('.zip')
files_from_cwd
[10]:
['./AK-shapefiles/AK_precincts.zip',
'./AZ-shapefiles/az_precincts.zip',
'./example-inputs/example.zip',
'./example-inputs/counties.zip',
'./outputs/CT-shapefiles/CT_precincts.zip']
Example 4.2. Recursively list files of a given type starting from a given directory
[11]:
# Ex. 4.2.
files_from_dir = dm.list_files_of_type('.zip', 'outputs/')
files_from_dir
[11]:
['outputs/CT-shapefiles/CT_precincts.zip']
Example 4.3. Recursively list files of given types starting from a given directory
[12]:
# Ex. 4.3.
zips_and_mds = dm.list_files_of_type(['.zip', '.md'], 'outputs/')
zips_and_mds
[12]:
['outputs/linguist/README.md',
'outputs/linguist/CONTRIBUTING.md',
'outputs/linguist/test/fixtures/Data/Modelines/example_smalltalk.md',
'outputs/linguist/samples/GCC Machine Description/pdp10.md',
'outputs/linguist/samples/Markdown/tender.md',
'outputs/linguist/vendor/grammars/Sublime-Inform/README.md',
'outputs/linguist/vendor/grammars/less.tmbundle/README.md',
'outputs/test-repo1/2016-02-24-first-post.md',
'outputs/test-repo1/2016-02-26-sample-post-jekyll.md',
'outputs/test-repo1/2015-04-12-test-post-last-year.md',
'outputs/CT-shapefiles/LICENSE.md',
'outputs/CT-shapefiles/CT_precincts.zip',
'outputs/CT-shapefiles/README.md',
'outputs/git-consortium/product-backlog.md',
'outputs/git-consortium/README.md',
'outputs/Spoon-Knife/README.md',
'outputs/boysenberry-repo-1/README.md',
'outputs/boysenberry-repo-1/READTHIS.md',
'outputs/hello-worId/README.md']
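When several file types are requested at once, the result is a single flat list, so it can help to group the paths by extension afterwards. This is a minimal sketch using only the standard library and a hard-coded sample of the output above; `sample_paths` and `by_ext` are illustrative names.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Sample of the mixed .zip/.md paths shown in the output above
sample_paths = [
    'outputs/linguist/README.md',
    'outputs/CT-shapefiles/LICENSE.md',
    'outputs/CT-shapefiles/CT_precincts.zip',
    'outputs/CT-shapefiles/README.md',
]

# Map each file extension to the list of paths that carry it
by_ext = defaultdict(list)
for path in sample_paths:
    by_ext[PurePosixPath(path).suffix].append(path)

for ext, paths in by_ext.items():
    print(ext, len(paths))
```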
Example 4.4. Recursively list files of a given type from the current working directory, including hidden files
[13]:
# Ex. 4.4.
files_incl_hidden = dm.list_files_of_type('.zip', exclude_hidden=False)
files_incl_hidden
[13]:
['./.example-hidden-file.zip',
'./AK-shapefiles/AK_precincts.zip',
'./AZ-shapefiles/az_precincts.zip',
'./example-inputs/example.zip',
'./example-inputs/counties.zip',
'./outputs/CT-shapefiles/CT_precincts.zip']
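Compared with Example 4.1, the only new entry is the file whose name starts with a dot. The sketch below shows one way to apply that rule by hand. It assumes that "hidden" means any path component beginning with `.`, which matches the output above but may not be exactly how gdutils decides; `is_hidden` is a hypothetical helper, not a gdutils function.

```python
from pathlib import PurePosixPath

def is_hidden(path):
    """Return True if any path component starts with a dot.

    The special components '.' and '..' are not treated as hidden.
    """
    return any(
        part.startswith('.') and part not in ('.', '..')
        for part in PurePosixPath(path).parts
    )

# Sample of the paths shown in the output above
sample_paths = [
    './.example-hidden-file.zip',
    './AK-shapefiles/AK_precincts.zip',
    './example-inputs/example.zip',
]

# Filter out hidden entries, mirroring the default behavior in Example 4.1
visible = [p for p in sample_paths if not is_hidden(p)]
print(visible)
```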
Example 5. Get a list of keys from a nested (categorized) dictionary
[14]:
# Example nested dictionary
example_dict = {
    'category1' : [  # category
        {'key1_1' : 'value1'},  # key-value pair
        {'key1_2' : 2}
    ],
    'category2' : [
        {'key2_1' : True},
        ['key2_2', 'key2_3', 'key2_4']  # list of keys
    ],
    'category3' : [
        ['key3']
    ]
}
Example 5.1. Get a list of keys from a single category
[15]:
keys = dm.get_keys_by_category(example_dict, 'category2')
keys
[15]:
['key2_1', 'key2_2', 'key2_3', 'key2_4']
Example 5.2. Get a list of keys from a list of categories
[16]:
keys = dm.get_keys_by_category(example_dict, ['category1', 'category3'])
keys
[16]:
['key1_1', 'key1_2', 'key3']
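The outputs above suggest the flattening rule: within each requested category, a dictionary contributes its keys and a list contributes its items. The following sketch reproduces that behavior in plain Python. It is an illustrative reimplementation inferred from the example outputs, not the actual gdutils source, and `keys_by_category` is a hypothetical name.

```python
def keys_by_category(nested, categories):
    """Collect keys from one category name or a list of category names."""
    # Accept a single category name as well as a list of names
    if isinstance(categories, str):
        categories = [categories]
    keys = []
    for category in categories:
        for item in nested[category]:
            if isinstance(item, dict):
                keys.extend(item.keys())  # dict entries contribute their keys
            else:
                keys.extend(item)         # list entries contribute their items
    return keys

example_dict = {
    'category1': [{'key1_1': 'value1'}, {'key1_2': 2}],
    'category2': [{'key2_1': True}, ['key2_2', 'key2_3', 'key2_4']],
    'category3': [['key3']],
}

print(keys_by_category(example_dict, 'category2'))
print(keys_by_category(example_dict, ['category1', 'category3']))
```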
Example 6. Remove repos from local filesystem
Example 6.1. Remove a specific repository
[17]:
# Ex. 6.1.
path_to_repo_to_remove = 'outputs/Hello-World'
dm.remove_repos(path_to_repo_to_remove)
Example 6.2. Recursively remove all repos in a directory
[18]:
# Ex. 6.2.
dm.remove_repos('outputs/')
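Since recursive removal is destructive, it can be useful to preview which directories look like Git repos before calling it. The sketch below assumes that a repo is any directory containing a `.git` entry; whether `remove_repos` identifies repos the same way is an assumption, and `find_git_repos` is a hypothetical helper, not part of gdutils. It runs against a throwaway temporary directory so nothing real is touched.

```python
import os
import tempfile

def find_git_repos(root):
    """Return sorted paths of directories under root that contain a .git entry."""
    repos = []
    for dirpath, dirnames, filenames in os.walk(root):
        if '.git' in dirnames or '.git' in filenames:
            repos.append(dirpath)
            dirnames[:] = []  # do not descend further once a repo is found
    return sorted(repos)

# Demonstration on a temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'outputs', 'Hello-World', '.git'))
    os.makedirs(os.path.join(tmp, 'outputs', 'not-a-repo'))
    found = find_git_repos(tmp)
    # Report paths relative to the temporary root for readability
    relative = [os.path.relpath(path, tmp) for path in found]
    print(relative)
```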
Examples Cleanup
The following commands are used to reset and clean up the examples above.
[19]:
# Remove all cloned repos
dm.remove_repos('.')
[20]:
# Remove outputs
!rm -r outputs
[21]:
# Uninstall Package
!echo y | pip uninstall gdutils
[22]:
# Reset Jupyter Notebook IPython Kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")