gdutils.datamine

datamine is a module in package gdutils that provides functions for finding, listing, and mining data.

Module Functions

datamine.list_gh_repos

gdutils.datamine.list_gh_repos(account: str, account_type: str) → List[Tuple[str, str]]

Returns a list of (repository name, URL) tuples for the public GitHub repositories of the given account and account type.

Parameters:
  • account (str) – GitHub account whose public repos are to be listed.
  • account_type (str) – Type of GitHub account whose public repos are to be listed. Valid options: 'users', 'orgs'.
Returns:

A list of tuples of public GitHub repositories and their URLs. E.g.

[('boysenberry-repo-1',
  'https://github.com/octocat/boysenberry-repo-1.git'),
 ('git-consortium',
  'https://github.com/octocat/git-consortium.git'),
 ...
 ('test-repo1', 'https://github.com/octocat/test-repo1.git')]

Return type:

List[Tuple[str, str]]

Raises:
  • ValueError – Raised if the given account_type is neither 'users' nor 'orgs'.
  • RuntimeError – Raised if unable to query GitHub for repo information.

Examples

>>> repos = datamine.list_gh_repos('octocat', 'users')
# gets a list of all repos and their GitHub URLs for account 'octocat'
>>> for repo, url in repos:
...     print('{} : {}'.format(repo, url))
boysenberry-repo-1 : https://github.com/octocat/boysenberry-repo-1.git
git-consortium : https://github.com/octocat/git-consortium.git
hello-worId : https://github.com/octocat/hello-worId.git
Hello-World : https://github.com/octocat/Hello-World.git
linguist : https://github.com/octocat/linguist.git
octocat.github.io : https://github.com/octocat/octocat.github.io.git
Spoon-Knife : https://github.com/octocat/Spoon-Knife.git
test-repo1 : https://github.com/octocat/test-repo1.git
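
To make the account_type check and the shape of the return value concrete, here is a minimal sketch of how such a listing could be assembled. It is not the gdutils implementation: the helper names build_repos_url and parse_repos are hypothetical, and it assumes the GitHub REST API endpoint https://api.github.com/{users|orgs}/{account}/repos, whose JSON entries carry 'name' and 'clone_url' fields. The payload below is a hand-written sample rather than a live response, so no network access is needed.

```python
from typing import Any, Dict, List, Tuple

VALID_ACCOUNT_TYPES = ('users', 'orgs')


def build_repos_url(account: str, account_type: str) -> str:
    """Build the REST API URL that lists an account's public repos."""
    if account_type not in VALID_ACCOUNT_TYPES:
        raise ValueError("account_type must be 'users' or 'orgs'")
    return 'https://api.github.com/{}/{}/repos'.format(account_type, account)


def parse_repos(payload: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Extract (name, clone URL) pairs from a decoded API response."""
    return [(repo['name'], repo['clone_url']) for repo in payload]


# Hand-written sample shaped like the API response (assumption, not live data).
sample_payload = [
    {'name': 'Hello-World',
     'clone_url': 'https://github.com/octocat/Hello-World.git'},
    {'name': 'Spoon-Knife',
     'clone_url': 'https://github.com/octocat/Spoon-Knife.git'},
]

print(build_repos_url('octocat', 'users'))
for name, url in parse_repos(sample_payload):
    print('{} : {}'.format(name, url))
```

Keeping URL construction and response parsing separate means the validation that raises ValueError can be exercised without any network access.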

datamine.clone_gh_repos

gdutils.datamine.clone_gh_repos(account: str, account_type: str, repos: Optional[List[str]] = None, outpath: Union[str, pathlib.Path, None] = None, shallow: bool = True) → NoReturn

Clones public GitHub repositories into the given directory. If directory path is not provided, clones repos into the current working directory.

Parameters:
  • account (str) – GitHub account whose public repos are to be cloned.
  • account_type (str) – Type of GitHub account whose public repos are to be cloned. Valid options: 'users', 'orgs'.
  • repos (List[str], optional, default = None) – List of specific repositories to clone.
  • outpath (str | pathlib.Path, optional, default = None) – Path to which repos are to be cloned. If not specified, clones repos into current working directory.
  • shallow (bool, optional, default = True) – Determines whether the clone is shallow. If not specified, defaults to a shallow git clone.
Raises:

ValueError – Raised if provided an account type other than 'users' or 'orgs'.

Examples

>>> datamine.clone_gh_repos('mggg-states', 'orgs')
# clones all repositories of 'mggg-states' into the current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs', ['AZ-shapefiles'])
# clones repo 'AZ-shapefiles' from 'mggg-states' into current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs',
...                         ['AZ-shapefiles', 'HI-shapefiles'])
# clones repos 'AZ-shapefiles' & 'HI-shapefiles' into current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs', ['HI-shapefiles'], 'shps/')
# clones repo 'HI-shapefiles' into directory 'shps/'
>>> datamine.clone_gh_repos('octocat', 'users', outpath='cloned-repos/')
# clones all repos of 'octocat' into directory 'cloned-repos/'
>>> datamine.clone_gh_repos('octocat', 'users', outpath='cloned-repos/',
...                         shallow=False)
# deep clones all repos of 'octocat' into directory 'cloned-repos/'
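
The difference between a shallow and a deep clone, and the fallback to the current working directory, can be sketched as follows. This is an illustration only and not how gdutils necessarily works: the helpers build_clone_command and resolve_outpath are hypothetical, and the sketch assumes that a shallow clone corresponds to git's --depth 1 option. The command is only built here, not executed.

```python
from pathlib import Path
from typing import List, Optional, Union


def build_clone_command(url: str, shallow: bool = True) -> List[str]:
    """Build the argument list for a git clone of the given URL."""
    cmd = ['git', 'clone']
    if shallow:
        # Assumption: "shallow" means fetching only the latest commit.
        cmd += ['--depth', '1']
    cmd.append(url)
    return cmd


def resolve_outpath(outpath: Optional[Union[str, Path]] = None) -> Path:
    """Fall back to the current working directory when no path is given."""
    return Path.cwd() if outpath is None else Path(outpath)


url = 'https://github.com/octocat/Hello-World.git'
print(build_clone_command(url))
print(build_clone_command(url, shallow=False))
# To run the clone for real, one could pass the command to
# subprocess.run(cmd, cwd=resolve_outpath(outpath), check=True).
```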

datamine.remove_repos

gdutils.datamine.remove_repos(dirpath: Union[str, pathlib.Path]) → NoReturn

Given a name/path of a directory, recursively removes all git repositories starting from the given directory. This action cannot be undone.

Warning: this function will remove the given directory if the given directory itself is a git repo.

Parameters:
  • dirpath (str | pathlib.Path) – Name/path of directory from which recursive removal of repos begins.
Raises:

FileNotFoundError – Raised if unable to find the given directory.

Examples

>>> datamine.remove_repos('repos_to_remove/')
# removes all repos in directory 'repos_to_remove/'
>>> datamine.remove_repos('repos_to_remove/repo1')
# removes repo 'repo1' in directory 'repos_to_remove/'
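
The recursive removal, including the case where the starting directory is itself a repo, can be pictured with the sketch below. It is not the gdutils implementation: find_git_repos and remove_found_repos are hypothetical helpers, and the sketch assumes a git repository is any directory that directly contains a '.git' subdirectory. The demonstration runs inside a temporary directory so nothing outside it is deleted.

```python
import os
import shutil
import tempfile
from typing import List


def find_git_repos(dirpath: str) -> List[str]:
    """Return directories at or below dirpath that contain a '.git' folder."""
    if not os.path.isdir(dirpath):
        raise FileNotFoundError('No such directory: {}'.format(dirpath))
    repos = []
    for root, dirs, _files in os.walk(dirpath):
        if '.git' in dirs:
            repos.append(root)
            dirs[:] = []  # a repo is removed whole, so do not descend into it
    return repos


def remove_found_repos(dirpath: str) -> List[str]:
    """Delete every repo found under dirpath. This cannot be undone."""
    repos = find_git_repos(dirpath)
    for repo in repos:
        shutil.rmtree(repo)
    return repos


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'repo1', '.git'))  # stand-in for a repo
    os.makedirs(os.path.join(tmp, 'notes'))          # an ordinary directory
    removed = remove_found_repos(tmp)
    remaining = sorted(os.listdir(tmp))

print(len(removed), remaining)
```

Because the walk starts at the given directory itself, a starting directory that contains '.git' is the first match and is removed entirely, which mirrors the warning above.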

datamine.list_files_of_type

gdutils.datamine.list_files_of_type(filetype: Union[str, List[str]], dirpath: Union[str, pathlib.Path, None] = '.', exclude_hidden: Optional[bool] = True) → List[str]

Given a file extension (or a list of extensions) and an optional directory path, recursively searches the directory and returns a list of paths to files with a matching extension. If the directory path is not specified, the search starts from the current working directory.

Parameters:
  • filetype (str | List[str]) – File extension of files to list (e.g. '.zip'). Can be a list of extensions (e.g. ['.zip', '.shp', '.csv']).
  • dirpath (str | pathlib.Path, optional, default = '.') – Path to directory from which file listing begins. Defaults to current working directory if not specified.
  • exclude_hidden (bool, optional, default = True) – If False, function includes hidden files in the search.
Returns:

List of file paths of files containing the given extension.

Return type:

List[str]

Raises:

FileNotFoundError – Raised if unable to find given directory.

Examples

>>> list_of_zips = datamine.list_files_of_type('.zip')
# recursively gets a list of '.zip' files from the current directory
>>> print(list_of_zips)
['./zipfile1.zip', './zipfile2.zip', './shapefiles/shape1.zip',
'./shapefiles/shape2.zip']
>>> list_of_shps = datamine.list_files_of_type('.shp', 'shapefiles/')
# recursively gets a list of '.shp' files from the 'shapefiles/' directory
>>> print(list_of_shps)
['./shapefiles/shape1/shape1.shp', './shapefiles/shape2/shape2.shp']
>>> list_of_csvs = datamine.list_files_of_type('.csv',
...                                            exclude_hidden = False)
# recursively gets a list of '.csv' files, including hidden files
>>> print(list_of_csvs)
['./csv1.csv', './.csv_hidden.csv']
>>> list_of_mix = datamine.list_files_of_type(['.shp', '.zip'])
# recursively gets a list of '.shp' and '.zip' files
>>> print(list_of_mix)
['./shapefiles/shape1/shape1.shp', './shapefiles/shape2/shape2.shp',
 './zipfile1.zip', './zipfile2.zip', './shapefiles/shape1.zip',
 './shapefiles/shape2.zip']
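
The interaction between filetype, dirpath, and exclude_hidden can be sketched with the standard library alone. The function list_by_extension below is a hypothetical stand-in, not the gdutils implementation, and it assumes that "hidden" means a file or directory whose name starts with a dot. The ordering of results in the real function is not specified here, so the sketch sorts its output for display.

```python
import os
import tempfile
from typing import List, Union


def list_by_extension(filetype: Union[str, List[str]],
                      dirpath: str = '.',
                      exclude_hidden: bool = True) -> List[str]:
    """Recursively collect paths of files whose names end with an extension."""
    if not os.path.isdir(dirpath):
        raise FileNotFoundError('No such directory: {}'.format(dirpath))
    # str.endswith accepts a tuple, so normalize both input forms to one.
    extensions = (filetype,) if isinstance(filetype, str) else tuple(filetype)
    matches = []
    for root, dirs, files in os.walk(dirpath):
        if exclude_hidden:
            # Assumption: "hidden" means the name starts with a dot.
            dirs[:] = [d for d in dirs if not d.startswith('.')]
            files = [f for f in files if not f.startswith('.')]
        for name in files:
            if name.endswith(extensions):
                matches.append(os.path.join(root, name))
    return matches


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel_path in ('a.zip', '.hidden.zip', os.path.join('sub', 'b.shp')):
        open(os.path.join(tmp, rel_path), 'w').close()

    def relative(paths):
        # Show paths relative to the temporary directory, in sorted order.
        return sorted(os.path.relpath(p, tmp) for p in paths)

    visible = relative(list_by_extension('.zip', tmp))
    with_hidden = relative(list_by_extension('.zip', tmp, exclude_hidden=False))
    mixed = relative(list_by_extension(['.zip', '.shp'], tmp))

print(visible)
print(with_hidden)
print(mixed)
```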

datamine.get_keys_by_category

gdutils.datamine.get_keys_by_category(dictionary: Dict[Hashable, List[Iterable[T_co]]], category: Union[Hashable, List[Hashable]]) → List[Hashable]

Given a dictionary with categories, returns a list of keys in the given category.

Examples of accepted forms of dictionary input:

{category1 : [{key1 : value1}, {key2 : value2}],
 category2 : [{key3 : value3}]}
{category1 : [[key1, key2, key3]]}
{category1 : [[key1]],
 category2 : [[key2], {key3: value3}]}
Parameters:
  • dictionary (Dict[Hashable, List[Iterable]]) – Dictionary containing categories in which keys are stored.
  • category (Hashable | List[Hashable]) – Category, or list of categories, whose keys are to be returned.
Returns:

List of every key stored under the given category (or categories) of the given dictionary, whether the keys are held in lists or in key-value pairs.

Return type:

List[Hashable]

Examples

>>> sample_dict = {'category1' : [{'key1': 1}],
...                'category2' : [{'key2' : 2}, {'key3' : 3}]}
>>> keys = datamine.get_keys_by_category(sample_dict, 'category2')
# gets a list of keys under 'category2' from the dictionary 'sample_dict'
>>> print(keys)
['key2', 'key3']
>>> sample_dict =  {'category1' : [['key1', 'key4']],
...                 'category2' : [['key2'], {'key3': 'value3'}]}
>>> keys = datamine.get_keys_by_category(sample_dict, 'category2')
# note: keys can be stored in both list and dictionary form
>>> print(keys)
['key2', 'key3']
>>> keys = datamine.get_keys_by_category(sample_dict,
...                                      ['category1', 'category2'])
# gets a list of keys under categories 'category1' and 'category2'
>>> print(keys)
['key1', 'key4', 'key2', 'key3']
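
One way to see why lists and dictionaries can be mixed inside a category is the sketch below. It is not the gdutils implementation: keys_in_categories is a hypothetical function, and the sample data is invented for illustration. It relies on the fact that iterating over a dict yields its keys while iterating over a list yields its elements, so both forms can be flattened the same way.

```python
from typing import Dict, Hashable, Iterable, List, Union


def keys_in_categories(dictionary: Dict[Hashable, List[Iterable]],
                       category: Union[Hashable, List[Hashable]]
                       ) -> List[Hashable]:
    """Flatten the keys stored under one category or a list of categories."""
    categories = category if isinstance(category, list) else [category]
    keys = []
    for cat in categories:
        for group in dictionary[cat]:
            # A dict contributes its keys; a list contributes its elements.
            keys.extend(group)
    return keys


# Invented sample data mixing both storage forms.
produce = {'fruits': [['apple', 'pear']],
           'vegetables': [['kale'], {'leek': 3}]}

print(keys_in_categories(produce, 'vegetables'))
print(keys_in_categories(produce, ['fruits', 'vegetables']))
```

In this sketch a category that is missing from the dictionary raises KeyError. How the real function handles that case is not stated in this reference.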