Data Quality

class DataQuality(df, test_params=None)

A class to assess and manage data quality across multiple dimensions for a given dataset.

The class supports operations such as initialising the data quality dimension instances, preparing and aggregating results, and writing out results to CSV files. It allows for both detailed error analysis and aggregate summaries.

df

The dataset on which data quality checks are performed.

Type:

pandas.DataFrame

test_params

A DataFrame specifying parameters for data quality tests. If not provided, default parameters are used.

Type:

pandas.DataFrame, optional

data_info

Metadata about the dataset's fields, together with the timestamp of the data quality assessment.

Type:

pandas.DataFrame

completeness, validity, uniqueness, timeliness, consistency, accuracy

Instances of data quality dimension classes for performing specific checks.

Type:

object

raw_results(reduce_counts=False):

Compiles detailed error information across all data quality dimensions.

Parameters:

reduce_counts (bool, optional) – If True, reduces error counts to boolean values (error present or not). Defaults to False.

Returns:

A DataFrame containing detailed error information for each record in the dataset.

Return type:

pandas.DataFrame
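
The effect of reduce_counts can be illustrated with plain pandas. This is a minimal sketch, not the class's implementation: the error_counts table, its column names and the reduce_to_flags helper are hypothetical, and the 1/0 encoding is an assumption (the method may return True/False instead).

```python
import pandas as pd

# Hypothetical per-record error counts; the column names are illustrative only.
error_counts = pd.DataFrame(
    {
        "name_validity": [0, 2, 1],
        "dob_timeliness": [3, 0, 0],
    }
)


def reduce_to_flags(counts: pd.DataFrame) -> pd.DataFrame:
    """Collapse counts to 1 (at least one error) or 0 (no error)."""
    return (counts > 0).astype(int)


flags = reduce_to_flags(error_counts)
print(flags)
```

With reduction applied, a record with two validity errors and a record with one are indistinguishable, which is useful when only the presence of an error matters.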

aggregate_rows(reduce_counts=False):

Aggregates error counts by row for a high-level summary.

Parameters:

reduce_counts (bool, optional) – If True, reduces error counts to boolean values before aggregation. Defaults to False.

Returns:

A DataFrame with aggregated error counts by row.

Return type:

pandas.DataFrame
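
A row-level aggregation of this kind might look like the sketch below. It assumes a wide table of error counts with one row per record; the errors_per_row helper, the error_count column name and the sample data are hypothetical and are not taken from the class itself.

```python
import pandas as pd

# Hypothetical error counts, one row per record and one column per check.
error_counts = pd.DataFrame(
    {"name_validity": [0, 2, 1], "dob_timeliness": [3, 0, 0]},
    index=["rec_a", "rec_b", "rec_c"],
)


def errors_per_row(counts: pd.DataFrame, reduce_counts: bool = False) -> pd.DataFrame:
    """Sum errors across all checks for each record."""
    if reduce_counts:
        # Count each failing check once, regardless of how many errors it raised.
        counts = (counts > 0).astype(int)
    return counts.sum(axis=1).rename("error_count").to_frame()


full = errors_per_row(error_counts)
reduced = errors_per_row(error_counts, reduce_counts=True)
print(full)
print(reduced)
```

Reducing before aggregation turns the row total into the number of failing checks rather than the number of individual errors.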

aggregate_results(reduce_counts=False):

Creates a field and metric level aggregate summary of errors.

Parameters:

reduce_counts (bool, optional) – Indicates whether to reduce error counts to binary indicators (True for any errors, False for no errors). Defaults to False, preserving actual count values.

Returns:

A DataFrame with aggregated error counts for each field and metric, sorted by field names as they appear in the original dataset.

Return type:

pandas.DataFrame
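
The field-and-metric summary, ordered by the fields' positions in the original dataset, can be sketched as follows. The long-format errors_long table, its column names and the summarise helper are assumptions made for illustration; the actual output layout of aggregate_results may differ.

```python
import pandas as pd

# Field order as it appears in the (hypothetical) source dataset.
original_columns = ["name", "dob", "postcode"]

# Hypothetical long-format error records: one row per field/metric observation.
errors_long = pd.DataFrame(
    {
        "field": ["postcode", "name", "name", "dob", "postcode"],
        "metric": ["validity", "validity", "completeness", "timeliness", "validity"],
        "errors": [1, 2, 1, 3, 4],
    }
)


def summarise(errors: pd.DataFrame, field_order: list) -> pd.DataFrame:
    """Total errors per field and metric, keeping the dataset's field order."""
    summary = errors.groupby(["field", "metric"], as_index=False)["errors"].sum()
    # An ordered categorical sorts by dataset position rather than alphabetically.
    summary["field"] = pd.Categorical(
        summary["field"], categories=field_order, ordered=True
    )
    return summary.sort_values(["field", "metric"]).reset_index(drop=True)


summary = summarise(errors_long, original_columns)
print(summary)
```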

results_prep(reduce_counts):

Prepares error results from different data quality dimensions for further processing.

Parameters:

reduce_counts (bool) – If True, converts error counts to binary values (1 for error present, 0 for no error). Useful for simplifying error aggregation.

Returns:

A DataFrame containing merged error data from all data quality checks, with counts reduced to binary values when reduce_counts is True.

Return type:

pandas.DataFrame

Notes

Error data from each dimension is corrected for missing values based on completeness checks before merging. This ensures that errors are accurately reflected even when data is missing.
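
The documentation does not spell out how this correction works. One plausible reading, sketched below under that assumption, is that errors raised by other dimensions are cleared wherever the underlying value is missing, so that a missing value is reported once (under completeness) rather than several times. All names and data here are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with one record missing both values.
df = pd.DataFrame({"age": [34, None, 210], "postcode": ["AB1 2CD", None, "??"]})

# Hypothetical validity error counts computed before any correction:
# the missing values in record 1 were also flagged as invalid.
validity_errors = pd.DataFrame({"age": [0, 1, 1], "postcode": [0, 1, 1]})

# Completeness errors: 1 where a value is missing, 0 otherwise.
completeness_errors = df.isna().astype(int)

# Assumed correction: where a value is missing, attribute the problem
# to completeness only and clear the validity error.
validity_corrected = validity_errors.mask(df.isna(), 0)

# Merge the per-dimension tables side by side, keyed by dimension name.
merged = pd.concat(
    {"completeness": completeness_errors, "validity": validity_corrected}, axis=1
)
print(merged)
```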

write_out(out, output_table):

Writes the given DataFrame to a CSV file.

Parameters:
  • out (pandas.DataFrame) – The DataFrame to be written to a CSV file.

  • output_table (str) – The name of the output file (excluding the file extension).
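
Because output_table excludes the file extension, the method presumably appends ".csv" itself. A minimal sketch of that behaviour with pandas follows; the write_csv helper and its directory parameter are hypothetical (write_out takes only out and output_table), and whether the index is written is an assumption.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_csv(out: pd.DataFrame, output_table: str, directory: Path) -> Path:
    """Write `out` to `<directory>/<output_table>.csv` and return the path."""
    path = directory / f"{output_table}.csv"
    out.to_csv(path, index=False)  # assumption: the index is not written
    return path


summary = pd.DataFrame({"field": ["name", "dob"], "errors": [3, 1]})

# A temporary directory keeps the example free of side effects.
with tempfile.TemporaryDirectory() as tmp:
    written = write_csv(summary, "dq_summary", Path(tmp))
    round_trip = pd.read_csv(written)

print(written.name)
print(round_trip)
```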

get_test_params():

Returns a copy of the test parameters being used.

Returns:

A DataFrame containing the test parameters for data quality dimensions.

Return type:

pandas.DataFrame

get_data():

Returns a copy of the original dataset.

Returns:

The dataset that data quality checks are being performed on.

Return type:

pandas.DataFrame
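
Returning a copy means callers can modify the result without affecting the data held by the instance. The sketch below shows the pattern with a hypothetical stand-in class; it is not the DataQuality implementation.

```python
import pandas as pd


class Holder:
    """Hypothetical stand-in showing copy-on-read; not the DataQuality class."""

    def __init__(self, df: pd.DataFrame) -> None:
        self._df = df.copy()

    def get_data(self) -> pd.DataFrame:
        # Hand back a copy so edits by the caller never reach self._df.
        return self._df.copy()


holder = Holder(pd.DataFrame({"age": [34, 51]}))
snapshot = holder.get_data()
snapshot.loc[0, "age"] = -1  # edits the copy only

print(holder.get_data())
print(snapshot)
```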

set_test_params(test_params):

Sets new test parameters for data quality checks and re-initialises dimension instances.

Parameters:

test_params (pandas.DataFrame) – A DataFrame specifying the new parameters for data quality tests.

get_param_template():

Generates a template DataFrame for specifying test parameters for each data quality dimension.

Returns:

A DataFrame serving as a template for specifying data quality test parameters.

Return type:

pandas.DataFrame

save_user_lookup(user_lookup, file_name):

Saves a user-defined lookup table to a specified file.

Parameters:
  • user_lookup (pandas.DataFrame) – The user-defined lookup table to save.

  • file_name (str) – The name of the file (excluding the file extension) to save the lookup table as.

run_all_metrics():

Executes all configured data quality checks across the dataset.