gdutils.datamine
datamine is a module in package gdutils that provides functions for
finding, listing, and mining data.
Module Functions
datamine.list_gh_repos
gdutils.datamine.list_gh_repos(account: str, account_type: str) → List[Tuple[str, str]]
Returns a list of (repository name, URL) tuples for the public GitHub repositories of the given account and account type.
Parameters: - account (str) – GitHub account whose public repos are to be listed.
- account_type (str) – Type of GitHub account whose public repos are to be listed. Valid options: 'users', 'orgs'.
Returns: A list of tuples of public GitHub repositories and their URLs. E.g.
[('boysenberry-repo-1', 'https://github.com/octocat/boysenberry-repo-1.git'), ('git-consortium', 'https://github.com/octocat/git-consortium.git'), ... ('test-repo1', 'https://github.com/octocat/test-repo1.git')]
Return type: List[Tuple[str, str]]
Raises: - ValueError – Raised if the given account_type is neither 'users' nor 'orgs'.
- RuntimeError – Raised if unable to query GitHub for repo information.
Examples
>>> repos = datamine.list_gh_repos('octocat', 'users') # gets a list of all repos and their GitHub URLs for account 'octocat'
>>> for repo, url in repos:
...     print('{} : {}'.format(repo, url))
boysenberry-repo-1 : https://github.com/octocat/boysenberry-repo-1.git
git-consortium : https://github.com/octocat/git-consortium.git
hello-worId : https://github.com/octocat/hello-worId.git
Hello-World : https://github.com/octocat/Hello-World.git
linguist : https://github.com/octocat/linguist.git
octocat.github.io : https://github.com/octocat/octocat.github.io.git
Spoon-Knife : https://github.com/octocat/Spoon-Knife.git
test-repo1 : https://github.com/octocat/test-repo1.git
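As a rough illustration of how such a listing could be produced, here is a minimal sketch. It is not the gdutils implementation. It assumes the GitHub REST API's repository-listing endpoints (`/users/{account}/repos` and `/orgs/{account}/repos`) and the `name` and `clone_url` fields of their responses. The helper names `build_repos_url` and `parse_repos` are hypothetical, and `sample_payload` is illustrative data standing in for a decoded API response, so no network request is made.

```python
from typing import Any, Dict, List, Tuple

VALID_ACCOUNT_TYPES = ('users', 'orgs')


def build_repos_url(account: str, account_type: str) -> str:
    """Hypothetical helper: build the REST API URL that lists an account's repos."""
    if account_type not in VALID_ACCOUNT_TYPES:
        raise ValueError("account_type must be 'users' or 'orgs'")
    return 'https://api.github.com/{}/{}/repos'.format(account_type, account)


def parse_repos(payload: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Hypothetical helper: reduce a decoded JSON response to (name, URL) tuples."""
    return [(repo['name'], repo['clone_url']) for repo in payload]


# Illustrative stand-in for a decoded API response, trimmed to the two fields used.
sample_payload = [
    {'name': 'Hello-World',
     'clone_url': 'https://github.com/octocat/Hello-World.git'},
    {'name': 'Spoon-Knife',
     'clone_url': 'https://github.com/octocat/Spoon-Knife.git'},
]

print(build_repos_url('octocat', 'users'))
for name, url in parse_repos(sample_payload):
    print('{} : {}'.format(name, url))
```

Validating `account_type` before building the URL mirrors the documented ValueError. A real query would also have to handle paginated responses and request failures, which this sketch leaves out.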
datamine.clone_gh_repos
gdutils.datamine.clone_gh_repos(account: str, account_type: str, repos: Optional[List[str]] = None, outpath: Union[str, pathlib.Path, None] = None, shallow: bool = True) → NoReturn
Clones public GitHub repositories into the given directory. If no directory path is provided, clones repos into the current working directory.
Parameters: - account (str) – GitHub account whose public repos are to be cloned.
- account_type (str) – Type of GitHub account whose public repos are to be cloned. Valid options: 'users', 'orgs'.
- repos (List[str], optional, default = None) – List of specific repositories to clone.
- outpath (str | pathlib.Path, optional, default = None) – Path to which repos are to be cloned. If not specified, clones repos into current working directory.
- shallow (bool, optional, default = True) – Determines whether the clone will be shallow or not. If not specified, defaults to a shallow git clone.
Raises: ValueError – Raised if provided an account type other than 'users' or 'orgs'.
Examples
>>> datamine.clone_gh_repos('mggg-states', 'orgs') # clones all repositories of 'mggg-states' into the current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs', ['AZ-shapefiles']) # clones repo 'AZ-shapefiles' from 'mggg-states' into current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs',
...                         ['AZ-shapefiles', 'HI-shapefiles']) # clones repos 'AZ-shapefiles' & 'HI-shapefiles' into current directory
>>> datamine.clone_gh_repos('mggg-states', 'orgs', ['HI-shapefiles'], 'shps/') # clones repo 'HI-shapefiles' into directory 'shps/'
>>> datamine.clone_gh_repos('octocat', 'users', outpath='cloned-repos/') # clones all repos of 'octocat' into directory 'cloned-repos/'
>>> datamine.clone_gh_repos('octocat', 'users', outpath='cloned-repos/',
...                         shallow=False) # deep clones all repos of 'octocat' into directory 'cloned-repos/'
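To make the effect of the `shallow` and `outpath` options concrete, the sketch below assembles the `git clone` command line that a single clone might correspond to. How gdutils actually invokes git is not documented here, so this is an assumption: the helper name `build_clone_command` is hypothetical, and mapping "shallow" to `--depth 1` is one common choice rather than a confirmed detail.

```python
from pathlib import Path
from typing import List, Optional, Union


def build_clone_command(url: str,
                        outpath: Optional[Union[str, Path]] = None,
                        shallow: bool = True) -> List[str]:
    """Hypothetical helper: assemble the git command line for one repository."""
    command = ['git', 'clone']
    if shallow:
        # Assumption: a shallow clone fetches only the most recent commit.
        command += ['--depth', '1']
    command.append(url)
    if outpath is not None:
        # Clone into <outpath>/<repo name> instead of the current directory.
        repo_name = url.rstrip('/').split('/')[-1]
        if repo_name.endswith('.git'):
            repo_name = repo_name[:-len('.git')]
        command.append(str(Path(outpath) / repo_name))
    return command


# URL follows the pattern returned by list_gh_repos; nothing is cloned here.
print(build_clone_command(
    'https://github.com/mggg-states/HI-shapefiles.git', 'shps/'))
```

The returned list could be handed to `subprocess.run(command, check=True)`. Keeping command construction separate from execution makes the option handling easy to inspect without touching the network.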
datamine.remove_repos
gdutils.datamine.remove_repos(dirpath: Union[str, pathlib.Path]) → NoReturn
Given a name/path of a directory, recursively removes all git repositories starting from the given directory. This action cannot be undone.
Warning: this function will remove the given directory if the given directory itself is a git repo.
Parameters: dirpath (str | pathlib.Path) – Name/path of directory from which recursive removal of repos begins.
Raises: FileNotFoundError – Raised if unable to find the given directory.
Examples
>>> datamine.remove_repos('repos_to_remove/') # removes all repos in directory 'repos_to_remove/'
>>> datamine.remove_repos('repos_to_remove/repo1') # removes repo 'repo1' in directory 'repos_to_remove/'
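The recursive removal described above can be sketched as follows. This is not the gdutils implementation. It assumes that a git repository is any directory containing a `.git` subdirectory, and the names `find_git_repos` and `remove_git_repos` are hypothetical. Like the documented function, it deletes data irreversibly, so try it only on a scratch directory.

```python
import os
import shutil
from pathlib import Path
from typing import List, Union


def find_git_repos(dirpath: Union[str, Path]) -> List[Path]:
    """Hypothetical helper: list repo roots at or below dirpath."""
    root = Path(dirpath)
    if not root.is_dir():
        raise FileNotFoundError('No such directory: {}'.format(root))
    repos = []
    for current, dirnames, _ in os.walk(root):
        # Assumption: a directory is a git repo if it holds a '.git' directory.
        if '.git' in dirnames:
            repos.append(Path(current))
            dirnames[:] = []  # the whole repo goes away, so do not descend
    return repos


def remove_git_repos(dirpath: Union[str, Path]) -> List[Path]:
    """Hypothetical helper: delete every repo found and return their paths."""
    repos = find_git_repos(dirpath)
    for repo in repos:
        shutil.rmtree(repo)
    return repos
```

Collecting the repository roots before deleting anything keeps the directory walk from visiting paths that no longer exist. If `dirpath` itself contains `.git`, it is the first root found and is removed as a whole, matching the warning above.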
datamine.list_files_of_type
gdutils.datamine.list_files_of_type(filetype: Union[str, List[str]], dirpath: Union[str, pathlib.Path, None] = '.', exclude_hidden: Optional[bool] = True) → List[str]
Given a file extension and an optional directory path, returns a list of paths of files with that extension. If the directory path is not specified, the function lists files starting from the current working directory.
Parameters: - filetype (str | List[str]) – File extension of files to list (e.g. '.zip'). Can be a list of extensions (e.g. ['.zip', '.shp', '.csv']).
- dirpath (str | pathlib.Path, optional, default = '.') – Path to directory from which file listing begins. Defaults to current working directory if not specified.
- exclude_hidden (bool, optional, default = True) – If False, function includes hidden files in the search.
Returns: List of file paths of files containing the given extension.
Return type: List[str]
Raises: FileNotFoundError – Raised if unable to find the given directory.
Examples
>>> list_of_zips = datamine.list_files_of_type('.zip') # recursively gets a list of '.zip' files from the current directory
>>> print(list_of_zips)
['./zipfile1.zip', './zipfile2.zip', './shapefiles/shape1.zip', './shapefiles/shape2.zip']
>>> list_of_shps = datamine.list_files_of_type('.shp', 'shapefiles/') # recursively gets a list of '.shp' files from the 'shapefiles/' directory
>>> print(list_of_shps)
['./shapefiles/shape1/shape1.shp', './shapefiles/shape2/shape2.shp']
>>> list_of_csvs = datamine.list_files_of_type('.csv',
...                                            exclude_hidden=False) # recursively gets a list of '.csv' files, including hidden files
>>> print(list_of_csvs)
['./csv1.csv', './.csv_hidden.csv']
>>> list_of_mix = datamine.list_files_of_type(['.shp', '.zip']) # recursively gets a list of '.shp' and '.zip' files
>>> print(list_of_mix)
['./shapefiles/shape1/shape1.shp', './shapefiles/shape2/shape2.shp', './zipfile1.zip', './zipfile2.zip', './shapefiles/shape1.zip', './shapefiles/shape2.zip']
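The listing behavior shown in these examples can be approximated with the standard library. The sketch below is not the gdutils implementation, and `list_files_with_extensions` is a hypothetical name. It assumes that "hidden" means a file or directory name starting with a dot and that matching is done on the end of the file name. Results are sorted here for determinism, so the ordering may differ from what gdutils returns.

```python
import os
from pathlib import Path
from typing import List, Union


def list_files_with_extensions(filetype: Union[str, List[str]],
                               dirpath: Union[str, Path] = '.',
                               exclude_hidden: bool = True) -> List[str]:
    """Hypothetical helper: recursively list files ending in the given extension(s)."""
    root = Path(dirpath)
    if not root.is_dir():
        raise FileNotFoundError('No such directory: {}'.format(root))
    # str.endswith accepts a tuple, so normalize one extension or many to a tuple.
    extensions = (filetype,) if isinstance(filetype, str) else tuple(filetype)
    matches = []
    for current, dirnames, filenames in os.walk(root):
        if exclude_hidden:
            # Assumption: "hidden" means the name starts with a dot.
            dirnames[:] = [d for d in dirnames if not d.startswith('.')]
            filenames = [f for f in filenames if not f.startswith('.')]
        for name in filenames:
            if name.endswith(extensions):
                matches.append(os.path.join(current, name))
    return sorted(matches)
```

Pruning `dirnames` in place stops `os.walk` from descending into hidden directories at all, which is cheaper than filtering their contents afterwards.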
datamine.get_keys_by_category
gdutils.datamine.get_keys_by_category(dictionary: Dict[Hashable, List[Iterable[T_co]]], category: Union[Hashable, List[Hashable]]) → List[Hashable]
Given a dictionary with categories, returns a list of keys in the given category.
Examples of accepted forms of dictionary input:
{category1 : [{key1 : value1}, {key2 : value2}], category2 : [{key3 : value3},]}
{category1 : [[key1, key2, key3]]}
{category1 : [[key1]], category2 : [[key2], {key3: value3}]}
Parameters: - dictionary (Dict[Hashable, List[Iterable]]) – Dictionary containing categories in which keys are stored.
- category (Hashable | List[Hashable]) – Category containing keys.
Returns: List of keys of every key-value pair in the given category of the given dictionary.
Return type: List[Hashable]
Examples
>>> sample_dict = {'category1' : [{'key1': 1}],
...                'category2' : [{'key2' : 2}, {'key3' : 3}]}
>>> keys = datamine.get_keys_by_category(sample_dict, 'category2') # gets a list of keys under 'category2' from the dictionary 'sample_dict'
>>> print(keys)
['key2', 'key3']
>>> sample_dict = {'category1' : [['key1', 'key4']],
...                'category2' : [['key2'], {'key3': 'value3'}]}
>>> keys = datamine.get_keys_by_category(sample_dict, 'category2') # note: keys can be stored in both list and dictionary form
>>> print(keys)
['key2', 'key3']
>>> keys = datamine.get_keys_by_category(sample_dict,
...                                      ['category1', 'category2']) # gets a list of keys under categories 'category1' and 'category2'
>>> print(keys)
['key1', 'key2', 'key3']
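The lookup described above can be sketched in a few lines. This is not the gdutils implementation, and the name `keys_by_category` is hypothetical. It relies on the fact that iterating a dict yields its keys while iterating a list yields its elements, so both accepted input forms are handled the same way. Treating a list passed as `category` as several category names is an assumption based on the signature.

```python
from typing import Dict, Hashable, Iterable, List, Union


def keys_by_category(dictionary: Dict[Hashable, List[Iterable]],
                     category: Union[Hashable, List[Hashable]]) -> List[Hashable]:
    """Hypothetical helper: collect the keys stored under one or more categories."""
    # Assumption: a list means "several categories"; anything else is one category.
    categories = category if isinstance(category, list) else [category]
    keys = []
    for name in categories:
        for group in dictionary[name]:
            # A dict contributes its keys; a list contributes its elements.
            keys.extend(group)
    return keys


sample = {'category1': [{'key1': 1}],
          'category2': [{'key2': 2}, {'key3': 3}]}
print(keys_by_category(sample, 'category2'))
print(keys_by_category(sample, ['category1', 'category2']))
```

A missing category raises KeyError in this sketch. Whether gdutils does the same is not stated in the documentation above.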