gdutils.dataqa

dataqa is a module in package gdutils that provides functions used in data quality assurance.

Module Functions

dataqa.compare_column_names

gdutils.dataqa.compare_column_names(table: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], standards: Union[List[str], Set[str]]) → Tuple[Set[str], Set[str]]

Given either a pandas DataFrame or a geopandas GeoDataFrame and a list of standardized column names, returns a tuple containing the intersection between standardized column names and columns in the table and the set of columns names in the table that are not in the standards.

Parameters:
  • table (pd.DataFrame | gpd.GeoDataFrame) – Tabular data whose column names are to be compared against the standards.
  • standards (List[str] | Set[str]) – List/set of standardized column names to be compared against the given tabular data.
Returns:

The first set in the tuple contains the intersection of column names between the table and the standards list. The second set in the tuple contains the column name in the difference between the table and the standards list. E.g. ( {'match1', 'match2'}, {'diff1'} ).

Return type:

Tuple[Set[str], Set[str]]

Examples

>>> standards = ['COL1', 'COL2', 'COL3']
>>> df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
...                   columns=['COL1', 'col2', 'COL3'])
>>> print(df)
   COL1  col2  COL3
0     1     2     3
1     4     5     6
>>> (matches, discrepancies) = dataqa.compare_column_names(df, standards)
# gets a tuple that splits DataFrame column names into two categories:
# 1. names in the 'standards' list (left)
# 2. names not in the 'standards' list (right)
>>> print(matches)
{'COL1', 'COL3'}
>>> print(discrepancies)
{'col2'}

dataqa.sum_column_values

gdutils.dataqa.sum_column_values(table: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], columns: Union[List[str], Set[str]]) → List[Tuple[str, int]]

Given a pandas DataFrame or a geopandas GeoDataFrame, and given a list of column names, returns a list of tuples of column names and the sum of their values. It is an unchecked runtime error if a column containing non-numerical values is passed into the function.

Parameters:
  • table (pd.DataFrame, gpd.GeoDataFrame) – Tabular data containing columns whose values are to be summed.
  • columns (List[str] | Set[str]) – A list/set of column names whose values are to be summed.
Returns:

A list of tuples of column names associated with the sum of their values. E.g. [ ('column 1', 100), ('column 2', 53) ].

Return type:

List[Tuple[str, int]]

Raises:

KeyError – Raised if given column name does not exist in table.

Examples

>>> cols = ['COL1', 'COL3']
>>> df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
...                   columns=['COL1', 'COL2', 'COL3'])
>>> print(df)
   COL1  COL2  COL3
0     1     2     3
1     4     5     6
>>> totals = dataqa.sum_column_values(df, cols)
# gets a list of tuples containing two items:
# 1. column name (left)
# 2. sum of column's values (right)
>>> for total in totals:
...     (column, sum) = total
...     print("{}: {}".format(column, sum))
COL1: 5
COL3: 9

dataqa.compare_column_values

gdutils.dataqa.compare_column_values(table1: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], table2: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], columns1: List[str], columns2: List[str], rows1: Optional[List[Hashable]] = None, rows2: Optional[List[Hashable]] = None) → Dict[str, List[Tuple[Hashable, Any]]]

Given two tables and their corresponding columns and rows to compare, returns a dictionary containing the compared columns and a corresponding list of tuples containing row names and the differences of values.

Note: The comparison is a one-to-one and onto function. I.e. Each element in one given list must correspond to another element in the other list.

Parameters:
  • table1 (pd.DataFrame | gpd.GeoDataFrame) – Tabular data containing column values to compare.
  • table2 (pd.DataFrame | gpd.GeoDataFrame) – Tabular data containing column values to compare.
  • columns1 (List[str]) – Columns in table1 to compare.
  • columns2 (List[str]) – Columns in table2 to compare.
  • rows1 (List[Hashable], optional, default = None) – Rows in table1 to compare. AKA value(s) of table’s index. If None, function compares all rows.
  • rows2 (List[Hashable], optional, default = None) – Rows in table2 to compare. AKA value(s) of table’s index. If None, function compares all rows.
Returns:

A dictionary with string keys corresponding to names of compared columns and with List values of tuples corresponding to names of compared rows and the differences of their values. E.g.

{'c1 [vs] c2': [('row1 [vs] row1', 2), ('row2 [vs] row2', 0)],
 'cA [vs] cB': [('rowA1 [vs] rowB1', 5)]}

Return type:

Dict[str, List[Tuple[Hashable, Any]]]

Raises:
  • KeyError – Raised if unable to find column or row in tables.
  • TypeError – Raised if unable to calculate the difference between two values.
  • RuntimeError – Raised if given lists cannot be compared.

See also

dataqa.compare_column_sums()

Examples

>>> df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
...                    columns=['COL1', 'COL2', 'COL3'])
>>> df2 = pd.DataFrame(data=[[4, 5], [1, 2]], columns=['col2', 'col1'])
>>> results = dataqa.compare_column_values(df1, df2, ['COL3'], ['col2'])
# gets a dictionary (collection of key-value pairs), where
# key : name of first column and name of second column
# value: list of tuples with two values:
#       1. name of row in first table and name of row in second table
#       2. difference between values in the columns and rows
>>> print(results)
{'COL3 [vs] col2': [('0 [vs] 0', -1), ('1 [vs] 1', 5)]}
>>> results = dataqa.compare_column_values(df1, df2, ['COL1', 'COL2'],
...                                        ['col1', 'col2'])
# compares columns 'COL1' with 'col1' and 'COL2' with 'col2'
>>> print(results['COL2 [vs] col2'][0])
('0 [vs] 0', 2)
>>> for column in results:
...     print('{} ----'.format(column))
...     for row, difference in results[column]:
...         print('{} : {}'.format(row, difference))
COL1 [vs] col1 ---
0 [vs] 0 : -4
1 [vs] 1 : 2
COL2-col2 ---
0 [vs] 0 : -2
1 [vs] 1 : 4
>>> results = dataqa.compare_column_values(df1, df2, ['COL1'],
...                                        ['col1'], [0], [1])
# compares value of column 'COL1' row 0 in table1 with
# value of column 'col1' row 1 in table2
>>> print(results['COL1 [vs] col1'][0])
('0 [vs] 1', -1)
>>> results = dataqa.compare_column_values(df1, df2, ['COL1'], ['col1'],
...                                        [0, 1], [1, 0])
# compares rows 0 and 1 (table1) with rows 1 and 0 (table2) in
# respective columns 'COL1' and 'col1'
>>> print(results['COL1 [vs] col1'])
[('0 [vs] 1', -1), ('1 [vs] 0', -1)]

dataqa.compare_column_sums

gdutils.dataqa.compare_column_sums(table1: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], table2: Union[pandas.core.frame.DataFrame, geopandas.geodataframe.GeoDataFrame], columns1: List[str], columns2: List[str]) → List[Tuple[Hashable, Any]]

Given two tables and two lists of column names corresponding to the tables, returns a list of tuples containing the compared column names and the difference between their corresponding sums. It is an unchecked runtime error if a column containing non-numerical values is passed into the function.

Note: The comparison is a one-to-one and onto function. I.e. each element in one list of column names must correspond to another element in the other list.

Parameters:
  • table1 (pd.DataFrame | gpd.GeoDataFrame) – Tabular data containing column values to compare.
  • table2 (pd.DataFrame | gpd.GeoDataFrame) – Tabular data containing column values to compare.
  • columns1 (List[str]) – Column(s) in table1 to compare.
  • columns2 (List[str]) – Column(s) in table2 to compare.
Returns:

A list containing tuples that contain a label describing the compared columns’ names and contain the difference between the sum of the values of the given columns. E.g. [ ('column1A-column1B', 4), ('column2A-column2B', 53) ].

Return type:

List[Tuple[Hashable, Any]]

Raises:
  • KeyError – Raised if a given column name does not exist in a given table.
  • RuntimeError – Raised if given columns cannot be compared.

See also

dataqa.compare_column_values()

Examples

>>> df1 = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]],
...                    columns=['COL1', 'COL2', 'COL3'])
>>> df2 = pd.DataFrame(data=[[4, 5], [1, 2]],
...                    columns=['col2', 'col1'])
>>> diffs = dataqa.compare_column_sums(df1, df2, ['COL1'], ['col1'])
# gets a list of tuples containing two values:
# 1. name of column in first table and name of column in second (left)
# 2. difference between the sum of values of both columns (right)
>>> print(diffs)
[('COL1 [vs] col1', -2)]
>>> diffs = dataqa.compare_column_sums(df1, df2, ['COL1', 'COL3'],
...                                    ['col1', 'col2'])
# compares column 'COL1' with column 'col1' and compares column
# 'COL3' with column 'col2'
>>> for column, difference in diffs:
...     print('{} : {}'.format(column, difference))
COL1 [vs] col1 : -2
COL3 [vs] col2 : 4

dataqa.has_missing_geometries

gdutils.dataqa.has_missing_geometries(gdf: geopandas.geodataframe.GeoDataFrame, threshold: Optional[float] = 0.0) → bool

Returns True if the given GeoDataFrame has missing geometries.

Parameters:
  • gdf (gpd.GeoDataFrame) – GeoDataFrame whose geometries are to be checked.
  • threshold (float, optional, default = 0.0) – Percentage of rows that are allowed to have missing geometries. e.g. threshold = 0.5 means that the function returns True if the number of missing geometries is greater than half of the number of rows.
Returns:

True if the given GeoDataFrame has missing geometries.

Return type:

bool

Raises:

KeyError – Raised if ‘geometry’ column is missing.

See also

dataqa.has_empty_geometries()

Examples

>>> from shapely.geometry import Point as Pt
>>> gdf = gpd.GeoDataFrame({'col'       : ['v1', 'v2', 'v3'],
...                         'geometry'  : [None, Pt(1, 2), Pt(2, 1)]})
>>> print(dataqa.has_missing_geometries(gdf))
# Check if gdf has missing geometries
True
>>> print(dataqa.has_missing_geometries(gdf, threshold=0.75))
# Check if more than 75% of the rows contain missing geometries
False

dataqa.has_empty_geometries

gdutils.dataqa.has_empty_geometries(gdf: geopandas.geodataframe.GeoDataFrame, threshold: Optional[float] = 0.0) → bool

Returns True if the given GeoDataFrame has empty geometries.

Parameters:
  • gdf (gpd.GeoDataFrame) – GeoDataFrame whose geometries are to be checked.
  • threshold (float, optional, default = 0.0) – Percentage of rows that are allowed to have empty geometries. e.g. threshold = 0.5 means that the function returns True if the number of empty geometries is greater than half of the number of rows.
Returns:

True if the given GeoDataFrame has empty geometries.

Return type:

bool

Raises:

KeyError – Raised if ‘geometry’ column is missing.

See also

dataqa.has_missing_geometries()

Examples

>>> from shapely.geometry import Polygon as Pg
>>> from shapelygeometry import Point as Pt
>>> gdf = gpd.GeoDataFrame({'col'       : ['v1', 'v2', 'v3'],
...                         'geometry'  : [Pt(1, 2), Pg([]), Pt(2, 1)]})
>>> print(dataqa.has_empty_geometries(gdf))
# Check if gdf has empty geometries
True
>>> print(dataqa.has_empty_geometries(gdf, threshold=0.75))
# Check if more than 75% of the rows contain empty geometries
False