Backend Python Tutorial#

Welcome to the backend tutorial on how to use DQMaRC programmatically. We begin by importing the necessary libraries, including pandas and the DataQuality class. In this case, we will also import some functions from the UtilitiesDQMaRC module, including MetricCalculator, BarPlotGenerator, and DonutChartGenerator. These are supplementary functions for quick visualisation of the data quality markup report. The overall_quality_fx() function calculates the overall quality.

Source Data#

Import Libraries#

import pandas as pd
from pkg_resources import resource_filename

# Import DQMaRC
from DQMaRC import DataQuality

# Import UtilitiesDQMaRC for data quality visualisation
from DQMaRC.UtilitiesDQMaRC import (
    MetricCalculator,
    BarPlotGenerator,
    DonutChartGenerator,
    overall_quality_fx, col_good, col_bad)
  1. DQMaRC: the main package for assessing multiple data quality dimensions of a given dataset.

  2. UtilitiesDQMaRC: a suite of modules to support the initial visualisation of the data quality markup report produced by DQMaRC. These utilities are also used to generate the shiny frontend interface.

  3. MetricCalculator: calculates sum totals and averages of data quality errors for each dimension.

  4. BarPlotGenerator: accepts the summary results produced by MetricCalculator to generate a plotly barplot for quick visualisation of data quality averages.

  5. DonutChartGenerator: accepts the same summary results to generate donut charts of the average data quality per dimension.

  6. overall_quality_fx: a function to generate an overall average of all data quality metrics and fields.

  7. col_good / col_bad: colour encodings for good and poor data quality percentages.

Data Import#

Logo of the synthetic data generator website Mockaroo.

We will use a synthetic dataset generated by Mockaroo and read it in as a pandas DataFrame. This data is included in the package but is purely synthetic.

# Read in example data as pandas dataframe
#df = pd.read_csv('../DQMaRC/data/toydf_subset.csv')
df = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset.csv'))
df
Patient_ID Gender Date_of_Diagnosis Date_of_Birth Age Weight_kg Height_cm BMI Postcode ICD_10_Code Tumour_Stage Metastatic_Indicator Tumour_M_Stage Regimen_Start_Date Datetime_Event1 Datetime_Logging1 Datetime_Event2 Datetime_Logging2
0 1 Male 23/04/2033 07/06/2090 72.0 NaN 220.50 0.0 BS5 0NQ Y743 Incomplete Stage Present M0 04/10/2025 13/03/2024 00:00 13/03/2024 00:07 16/03/2024 00:00 16/03/2024 00:53
1 1 Male NaN 07/06/2090 181.0 882.01 184.53 68.0 BT78 3PN Y743 Incomplete Stage2 Present M0 04/10/2025 25/03/2024 00:00 25/03/2024 00:02 19/03/2024 00:00 19/03/2024 01:11
2 2 . 06/02/2020 NaN 140.0 414.60 170.05 67.0 UB7 0JP Unkown Incomplete Stage2 Present M0 24/08/2021 21/03/2024 00:00 21/03/2024 00:01 04/03/2024 00:00 04/03/2024 00:18
3 3 Male 30/07/2021 24/08/2003 151.0 703.49 30.99 39.0 MK9 3NZ Incomplete Stage2 Present M0 NaN 12/03/2024 00:00 12/03/2024 00:08 30/03/2024 00:00 30/03/2024 01:28
4 4 Male 17/09/2010 13/07/1999 NaN 804.32 7.48 55.0 B63 3QX Incomplete Stage2 Present M0 24/07/2015 20/03/2024 00:00 20/03/2024 00:07 29/03/2024 00:00 29/03/2024 00:47
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104 104 Female 26/11/2010 01/02/1979 122.0 34.63 159.07 98.0 YO10 3JQ O3691X1 Stage 2 Present M0 10/12/2018 1991/03/2024 00:00:00 23/03/2024 00:00 13/03/2024 00:00 01-01-1122 01:25:00
105 105 Female 30/07/2011 08/10/1987 163.0 598.80 NaN 7.0 ME15 6TB T23129D Stage3 Present M0 12/09/2015 1991/03/2024 00:00:00 23/03/2024 00:00 13/03/2024 00:00 01-01-1122 01:25:00
106 106 Female 15/11/2014 17/10/1925 81.0 53.69 226.65 54.0 L5 2QA M778 Stage 4 Present M0 26/03/2017 1991/03/2024 00:00:00 23/03/2024 00:00 13/03/2024 00:00 01-01-1122 01:25:00
107 107 Female 15/03/2018 25/10/1974 45.0 788.70 105.50 16.0 N15 4GE D160 Stage 3 Present M0 10/06/2015 1991/03/2024 00:00:00 23/03/2024 00:00 13/03/2024 00:00 01-01-1122 01:25:00
108 108 Male 13/10/2017 02/06/1948 145.0 147.52 107.68 18.0 KY5 0NE S66117D Stage3 Present M0 01/09/2019 1991/03/2024 00:00:00 23/03/2024 00:00 13/03/2024 00:00 01-01-1122 01:25:00

109 rows × 18 columns

Test Parameters Setup#

Initialise DQMaRC#

DQMaRC is initialised by applying the DataQuality class to your source dataset. DQMaRC then requires configuration of the test_params attribute. This is a dataframe that DQMaRC uses to map which test parameters (or data quality tests) should be applied to which fields or variables of the source dataset. Each test parameter is backed by a corresponding DQMaRC method. For example, the Completeness_NULL test parameter is calculated by the test_null() method.

If this is your first time using DQMaRC, you can automatically generate a test_params template using the get_param_template() method. This template is based on the source dataset and automatically activates the Completeness and Uniqueness parameters for all source data fields. In other words, users can run DQMaRC with the auto-generated template to see what the outputs look like and decide whether this is the right tool for the job.

We strongly encourage users to get familiar with the test parameters and to take the time to customise them. All test parameters are listed and explained in a table below. We recommend customising them in Excel for ease of use, but Python users also have the option to customise test parameters programmatically. We demonstrate how to do this here.

# Initialise a DQ object by passing your data to the tool
dq = DataQuality(df)

# Retrieve the default test parameters from the object. We will edit these in the next step
test_params = dq.get_test_params()

# View the test parameters template
test_params
Warning - Using default parameters. For best results edit parameters and reload using set_test_params() method
Field Date_Format Completeness_NULL Completeness_Empty Completeness_Encoded Completeness_Encoded_Mapping Uniqueness_Rows Consistency_Compare Consistency_Compare_Field Consistency_Compare_Mapping ... Validity_NHS_Number Validity_Postcode_UK Validity_Lookup_Table Validity_Lookup_Type Validity_Lookup_Codes Validity_Range Validity_Range_Numeric Validity_Pattern Validity_Pattern_Regex Accuracy_Gold_Standard
0 Patient_ID True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
1 Gender True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
2 Date_of_Diagnosis True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
3 Date_of_Birth True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
4 Age True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
5 Weight_kg True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
6 Height_cm True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
7 BMI True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
8 Postcode True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
9 ICD_10_Code True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
10 Tumour_Stage True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
11 Metastatic_Indicator True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
12 Tumour_M_Stage True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
13 Regimen_Start_Date True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
14 Datetime_Event1 True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
15 Datetime_Logging1 True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
16 Datetime_Event2 True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
17 Datetime_Logging2 True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False

18 rows × 30 columns

Test Parameters Configurations#

Now we will demonstrate how to edit the test parameters programmatically. You can also export the test parameters template as a csv file and edit them in MS Excel.
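The export-and-edit workflow is a plain pandas CSV round trip: write the template out, edit it in Excel, then read it back in. A minimal sketch (the filename and columns are illustrative):

```python
import pandas as pd

# Minimal stand-in for a test parameters dataframe (illustrative columns only)
test_params = pd.DataFrame({
    "Field": ["Patient_ID", "Gender"],
    "Completeness_NULL": [True, True],
})

# Export for editing in Excel...
test_params.to_csv("test_params_draft.csv", index=False)

# ...then read the edited file back in
edited_params = pd.read_csv("test_params_draft.csv")
```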

Datetime Format#

The first test parameter to edit is Date_Format. Although this is not, strictly speaking, a data quality metric, it helps Python correctly parse datetime fields and calculate the other data quality metrics that depend on them.

# Datetime format
test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Date_Format'] = "%d/%m/%Y"
test_params.loc[test_params['Field']=='Datetime_Event1', 'Date_Format'] = "%d/%m/%Y %H:%M"
test_params.loc[test_params['Field']=='Datetime_Logging1', 'Date_Format'] = "%d/%m/%Y %H:%M"

# Another way to do this, using the positional row indices from the template:
# test_params.at[2,'Date_Format']="%d/%m/%Y"        # Date_of_Diagnosis
# test_params.at[3,'Date_Format']="%d/%m/%Y"        # Date_of_Birth
# test_params.at[14,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Event1
# test_params.at[15,'Date_Format']="%d/%m/%Y %H:%M" # Datetime_Logging1

Completeness Parameters#

The first data quality dimension we will edit is Completeness. When you initialise the test parameters template, DQMaRC assumes that the user may wish to calculate the number of NULL or Empty records for all fields. As such, the default values for Completeness_NULL (see the test_null() method) and Completeness_Empty (see the test_empty() method) are TRUE. We encourage the user to think of or look for other values that may signify missing data.

For example, values such as “Unknown”, “Not Known”, or “na” may indicate missing data. To capture these, we set the test parameter Completeness_Encoded == TRUE (see the test_na_strings() method) for the field in question (in this case, “Gender”). We then include the relevant missing codes in Completeness_Encoded_Mapping, with each possible value or string encoding for missing data separated by a pipe, e.g. ‘Unknown|Not Known|na’ (i.e. Unknown or Not Known or na).

# Completeness
# Set to true for all fields
test_params['Completeness_NULL'] = True # Default value
test_params['Completeness_Empty'] = True # Default value

# Set Completeness Encoding only for the "Gender" column
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded'] = True
test_params.loc[test_params['Field']=='Gender', 'Completeness_Encoded_Mapping'] = 'Unknown|Not Known|na' # use pipes to separate multiple possible codes for missingness
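Conceptually, the encoded-missingness check amounts to testing each value's membership in the pipe-separated list of codes. The following is an illustrative sketch of that idea (not DQMaRC's internal code), using toy Gender values:

```python
import pandas as pd

# Toy values for the "Gender" field (illustrative)
gender = pd.Series(["Male", "Unknown", "Female", "na", "Not Known"])

# Split the pipe-separated mapping into individual missingness codes
missing_codes = "Unknown|Not Known|na".split("|")

# 1 where the value is an encoded-missing code, 0 otherwise
encoded_missing_flag = gender.isin(missing_codes).astype(int)
print(encoded_missing_flag.tolist())  # [0, 1, 0, 1, 1]
```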

Uniqueness Parameters#

Uniqueness evaluates whether records are unique across the fields that are set to TRUE. In most cases, we assume that records should be unique across at least all fields combined. Therefore, DQMaRC by default sets the parameter Uniqueness_Rows == TRUE for all fields. However, if records are expected to be unique across select fields only, then we encourage the user to set TRUE for those fields and FALSE for the rest.

# Uniqueness
# Set to true for all fields
test_params['Uniqueness_Rows'] = True # Default value

Consistency Parameters#

Consistency evaluates whether data in two fields of the same dataset agree as expected. It consists of two key metrics: Consistency_Compare (see the test_one_to_one() method), which compares values between two fields, and Consistency_Date_Relations (see the date_relationships() method), which checks the logical relationship between two datetime fields.

We demonstrate the first consistency metric by comparing Metastatic_Indicator with Tumour_M_Stage. In theory, both variables indicate whether cancer has spread to distant locations, but in the real world they may come from different sources. The values in the two fields are recorded differently but mean similar things. For example, if Metastatic_Indicator == Absent, then we expect Tumour_M_Stage == M0, meaning no metastasis is present. If Metastatic_Indicator == Present, then we expect Tumour_M_Stage to be M1, M1a, or other values, but here we abbreviate the list to keep things simple.

For datetime fields, we may expect one date to occur before or after another in the same dataset. In this example, we demonstrate this using Date_of_Birth, which should obviously occur before Date_of_Diagnosis. Importantly, we tell DQMaRC to raise an error if it detects that Date_of_Birth > Date_of_Diagnosis.

# Consistency
# Set up consistency checks between the "Metastatic_Indicator" and "Tumour_M_Stage" columns
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare'] = True
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Field'] = 'Tumour_M_Stage'
test_params.loc[test_params['Field']=='Metastatic_Indicator', 'Consistency_Compare_Mapping'] = '{"Absent": ["M0"], "Present": ["M1", "M1a"]}'

# Set up consistency checks between date fields "Date_of_Birth" and "Date_of_Diagnosis"
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations'] = True
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relations_Field'] = 'Date_of_Diagnosis'
test_params.loc[test_params['Field']=='Date_of_Birth', 'Consistency_Date_Relationship'] = '>' # i.e. raise an error if Date of Birth > Date of Diagnosis
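To illustrate what the Consistency_Compare check does conceptually, here is a minimal sketch (assumed logic, not DQMaRC's implementation) that flags rows whose pair of values is not among the permitted combinations in the mapping:

```python
import pandas as pd

# Toy rows for the two compared fields (illustrative values)
df_pair = pd.DataFrame({
    "Metastatic_Indicator": ["Absent", "Present", "Present", "Absent"],
    "Tumour_M_Stage":       ["M0",     "M1",      "M0",      "M1a"],
})

# Permitted Tumour_M_Stage values for each Metastatic_Indicator value
mapping = {"Absent": ["M0"], "Present": ["M1", "M1a"]}

def inconsistent(row):
    # Flag the row if the compared field's value is not in the permitted list
    allowed = mapping.get(row["Metastatic_Indicator"], [])
    return int(row["Tumour_M_Stage"] not in allowed)

df_pair["Consistency_Compare_flag"] = df_pair.apply(inconsistent, axis=1)
print(df_pair["Consistency_Compare_flag"].tolist())  # [0, 0, 1, 1]
```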

Timeliness Parameters#

Timeliness describes how fresh or up-to-date data are. One way to measure this is to define a threshold on the time difference (see the date_diff_calc() method) between fields that record when events occurred and when those events were captured in the system. In this case, the Datetime_Event1 field records the time a user says an event occurred, or when the data were observed in the real world, whereas Datetime_Logging1 is the electronically generated datetime log indicating when the user entered this data into the system. In a clinical or health setting, the freshness of data may be crucial.

For example, in critical care, patient observation data may be recorded several minutes or hours after the time of observation, instead of immediately. To flag these errors, we tell DQMaRC to raise an error if the time difference between Datetime_Event1 and Datetime_Logging1 is greater than 10 minutes.

# Timeliness
# Raise an error if a time difference threshold of 10 minutes is exceeded
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff'] = True
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Field'] = 'Datetime_Logging1'
test_params.loc[test_params['Field']=='Datetime_Event1', 'Timeliness_Date_Diff_Threshold'] = '10' # i.e. raise an error if timediff >10 minutes
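Conceptually, this check computes the delay between the event and its logging timestamp and compares it to the threshold. A minimal pandas sketch of that idea (illustrative values, not DQMaRC's internal code):

```python
import pandas as pd

# Toy event/logging timestamps (illustrative)
events = pd.DataFrame({
    "Datetime_Event1":   ["13/03/2024 00:00", "25/03/2024 00:00"],
    "Datetime_Logging1": ["13/03/2024 00:07", "25/03/2024 00:22"],
})

fmt = "%d/%m/%Y %H:%M"
event_time = pd.to_datetime(events["Datetime_Event1"], format=fmt)
logging_time = pd.to_datetime(events["Datetime_Logging1"], format=fmt)

# Delay between the real-world event and its electronic capture, in minutes
delay_minutes = (logging_time - event_time).dt.total_seconds() / 60

# Flag rows where the 10-minute threshold is exceeded
events["Timeliness_flag"] = (delay_minutes > 10).astype(int)
print(events["Timeliness_flag"].tolist())  # [0, 1]
```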

Validity Parameters#

Data validity describes whether data conform to the expected standards, patterns, or ranges that reflect the intended real-world objects. DQMaRC has 7 validity metrics, shown below:

  1. Future Dates: flags records containing a date that occurs in the future (see the Validity_Dates_Future() method).

    # (1) Future Dates
    test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Dates_Future'] = True
    test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Dates_Future'] = True
    
  2. Date Outliers: checks if date records fall outside expected date ranges (see the min_max_dates() method).

    # (2) Date Outliers
    test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range'] = True
    test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range'] = True
    
    test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Min'] = '2011-01-01'
    test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Min'] = '1900-01-01'
    
    test_params.loc[test_params['Field']=='Date_of_Diagnosis', 'Validity_Date_Range_Max'] = '2023-07-07'
    test_params.loc[test_params['Field']=='Date_of_Birth', 'Validity_Date_Range_Max'] = '2023-01-01'
    
  3. Numerical Outliers: checks if numerical data fall outside expected numerical ranges (see test_ranges() method).

    # (3) Numerical Outliers
    test_params.loc[test_params['Field']=='Age', 'Validity_Range'] = True
    test_params.loc[test_params['Field']=='Age', 'Validity_Range_Min'] = 0
    test_params.loc[test_params['Field']=='Age', 'Validity_Range_Max'] = 120
    
    test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range'] = True
    test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range_Min'] = 20
    test_params.loc[test_params['Field']=='Height_cm', 'Validity_Range_Max'] = 210
    
  4. NHS Number Validator: checks if NHS numbers are valid (see validate_nhs_number() method).

    # (4) NHS Number Validator
    # test_params.loc[test_params['Field']=='NHS_Number', 'Validity_NHS_Number'] = True
    
  5. UK Postcode Validator: checks if UK postcodes are valid (see test_postcode() method).

    # (5) UK Postcode Validator
    test_params.loc[test_params['Field']=='Postcode', 'Validity_Postcode_UK'] = True
    
  6. Data Standards: checks if values conform to expected permissible values as defined by a data standard (see test_against_lookup_tables() method).

    1. Either create and save your own data standard

      # (6a) Data Standards
      test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Table'] = True
      test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Type'] = 'Values'
      test_params.loc[test_params['Field']=='Tumour_M_Stage', 'Validity_Lookup_Codes'] = 'M0|M1|M1b|pM1'
      
    2. Or access a pre-defined list saved as a csv file

      #(6b) Use an external csv data standard list of valid codes
      lu_filename = '../DQMaRC/data/lookups/LU_toydf_gender.csv'
      
      # Here we will apply a pre-defined data standard for "gender"
      test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Table'] = True
      test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Type'] = 'File'
      test_params.loc[test_params['Field']=='Gender', 'Validity_Lookup_Codes'] = lu_filename
      
  7. Regular Expression Pattern: checks if data conform to an expected pattern as defined by a regular expression (see the test_pattern_validity() method).

    # (7) Regular Expression Pattern
    test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern'] = True
    test_params.loc[test_params['Field']=='Datetime_Event1', 'Validity_Pattern_Regex'] = r"(\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2})" # raw string avoids invalid escape warnings
    
    test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern'] = True
    test_params.loc[test_params['Field']=='Datetime_Event2', 'Validity_Pattern_Regex'] = "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}"
    

Accuracy Parameters#

Accuracy measures the consistency of input data against a known gold standard dataset. Here, we set the gold standard to the input data itself purely to demonstrate the functionality.

# Accuracy
# Flag fields for comparison against a manually validated gold standard version of the dataset
test_params['Accuracy_Gold_Standard'] = True

# Supply the gold standard data - we are using the same dataset here for ease
dq.accuracy.set_gold_standard(df)

Test Parameters Definitions#

Here is a list of all test parameters and their respective descriptions and permissible configurations.

# Read in example data as pandas dataframe
test_params_definitions = pd.read_csv('../DQMaRC/data/test_params_definitions.csv')
test_params_definitions
Test Parameter Python_Data_Type Example_Data Description
0 Field String Metastatic_Indicator Name of the fields of the source input data.
1 Date_Format String %d/%m/%Y %H:%M The datetime format pattern specified by ISO s...
2 Completeness_NULL Boolean TRUE or FALSE Count of "NULL" values.
3 Completeness_Empty Boolean TRUE or FALSE Count of empty records.
4 Completeness_Encoded Boolean TRUE or FALSE TRUE/FALSE indicator if field uses codes to in...
5 Completeness_Encoded_Mapping String Missing|N/A|Empty A list of values or codes that represent missi...
6 Uniqueness_Rows Boolean TRUE or FALSE TRUE/FALSE indicator if field is expected to b...
7 Consistency_Compare Boolean TRUE or FALSE TRUE/FALSE to indicate if field should be comp...
8 Consistency_Compare_Field String Tumour_M_Stage Enter the name of the field to compare against...
9 Consistency_Compare_Mapping Pandas dictionary {"Present": ["M1", "M1a"]} Map the permissible correct comparison values ...
10 Consistency_Date_Relations Boolean TRUE or FALSE TRUE/FALSE if a date field should be compared ...
11 Consistency_Date_Relationship String <, >, <=, >= Relationship of the comparison (<, >, <=, or >...
12 Consistency_Date_Relations_Field String Date_of_Birth Name of the compared datetime field if "Consis...
13 Timeliness_Date_Diff Boolean TRUE or FALSE Indicator to perform a calculation between two...
14 Timeliness_Date_Diff_Field String Datetime_Logging1 If "Timeliness_Date_Diff" = "TRUE", here you s...
15 Timeliness_Date_Diff_Threshold Integer 10 What is the permissible threshold (in minutes)...
16 Validity_Dates_Future Boolean TRUE or FALSE TRUE/FALSE indicator to raise an error if a fu...
17 Validity_Date_Range Boolean TRUE or FALSE TRUE/FALSE if a date range is applicable.
18 Validity_Date_Range_Min String 01/01/1900 If "Validity_Dates_Future" = "TRUE", what is t...
19 Validity_Date_Range_Max String 01/01/2025 If "Validity_Dates_Future" = "TRUE", what is t...
20 Validity_NHS_Number Boolean TRUE or FALSE TRUE/FALSE if the NHS validity algorithm check...
21 Validity_Postcode_UK Boolean TRUE or FALSE TRUE/FALSE to raise an error if invalid UK pos...
22 Validity_Lookup_Table Boolean TRUE or FALSE TRUE/FALSE if a data standard is applicable.
23 Validity_Lookup_Table_Filename String LU_toydf_ICD10_v5.csv If "Validity_Lookup_Table"="TRUE", what is the...
24 Validity_Range Boolean TRUE or FALSE TRUE/FALSE if a numerical range applicable, e....
25 Validity_Range_Numeric String 0|110 If "Validity_Range"="TRUE", supply the numeric...
26 Validity_Pattern Boolean TRUE or FALSE TRUE/FALSE if a valid pattern is applicable.
27 Validity_Pattern_Regex String [0-9]{2}/[0-9]{2}/[0-9]{2} If "Validity_Pattern"="TRUE", supply the permi...
28 Accuracy_Gold_Standard Boolean TRUE or FALSE TRUE/FALSE if a gold standard reference datase...

Upload Custom Test Parameters#

We have shown how you can programmatically edit the test parameters dataframe. However, users may opt to make these edits in Excel instead. In fact, we encourage users to spend the time to do this to maximise the relevance of the data quality output reports. Here we show how you can upload a pre-defined test parameters file.

test_params_upload = pd.read_csv(resource_filename('DQMaRC', 'data/toydf_subset_test_params_24.05.16.csv'))
test_params_upload
Field Date_Format Completeness_NULL Completeness_Empty Completeness_Encoded Completeness_Encoded_Mapping Uniqueness_Rows Consistency_Compare Consistency_Compare_Field Consistency_Compare_Mapping ... Validity_NHS_Number Validity_Postcode_UK Validity_Lookup_Table Validity_Lookup_Type Validity_Lookup_Codes Validity_Range Validity_Range_Numeric Validity_Pattern Validity_Pattern_Regex Accuracy_Gold_Standard
0 Patient_ID NaN True True False NaN True False NaN NaN ... False False False NaN NaN False NaN False NaN False
1 Gender NaN True True True .|unknown|null|not known|na|n/a|none False False NaN NaN ... False False True File LU_toydf_gender.csv False NaN False NaN False
2 Date_of_Diagnosis %d/%m/%Y True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True (\d{2})/(\d{2})/(\d{4}) False
3 Date_of_Birth %d/%m/%Y True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True (\d{2})/(\d{2})/(\d{4}) False
4 Age NaN True True False NaN False False NaN NaN ... NaN NaN False NaN NaN True 0|110 False NaN False
5 Weight_kg NaN True True False NaN False False NaN NaN ... NaN NaN False NaN NaN True 5|150 False NaN False
6 Height_cm NaN True True False NaN False False NaN NaN ... NaN NaN False NaN NaN True 10|200 False NaN False
7 BMI NaN True True False NaN False False NaN NaN ... NaN NaN False NaN NaN True 5|60 False NaN False
8 Postcode NaN True True True .|unknown|null|not known|na|n/a|none False False NaN NaN ... False True False NaN NaN False NaN False NaN False
9 ICD_10_Code NaN True True True .|unknown|Unknown|null|not known|na|n/a|N/A|none False False NaN NaN ... False False False NaN NaN False NaN False NaN False
10 Tumour_Stage NaN True True True None|N/A|Incomplete Stage|Incomplete Stage2 False False NaN NaN ... False False True File LU_toydf_tumour_stage.csv False NaN False NaN False
11 Metastatic_Indicator NaN True True False NaN False True Tumour_M_Stage {"Absent": ["M0"], "Present": ["M1", "M1a", "M... ... False False False NaN NaN False NaN False NaN False
12 Tumour_M_Stage NaN True True False NaN False False NaN NaN ... False False True Values M0|M1|M1b|pM1 False NaN False NaN False
13 Regimen_Start_Date %d/%m/%Y True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True [0-9]{2}/[0-9]{2}/[0-9]{2} False
14 Datetime_Event1 %d/%m/%Y %H:%M True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) False
15 Datetime_Logging1 %d/%m/%Y %H:%M True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) False
16 Datetime_Event2 %d/%m/%Y %H:%M True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True [0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2} False
17 Datetime_Logging2 %d/%m/%Y %H:%M True True False NaN False False NaN NaN ... False False False NaN NaN False NaN True (\d{2})/(\d{2})/(\d{4}) (\d{2}):(\d{2}) False

18 rows × 30 columns

Set Test Parameters#

Now we are ready to set the test parameters. We are using the pre-defined uploaded test parameters.

dq.set_test_params(test_params_upload)
Using uploaded test parameters

Run DQMaRC#

Once you have set your chosen test parameters, run DQMaRC using the run_all_metrics() method. You can optionally run each data quality dimension separately, as shown in the commented code below.

dq.run_all_metrics()

# To run dimensions separately, use the following methods:
# dq.completeness.run_metrics()
# dq.uniqueness.run_metrics()
# dq.consistency.run_metrics()
# dq.timeliness.run_metrics()
# dq.validity.run_metrics()
# dq.accuracy.run_metrics()

Get DQ Results#

Now you can export your results. The two main DQ reports generated by DQMaRC are the full cell-level markup and the field-wise aggregated results.

Full Results#

The most granular, and potentially most useful, output is the data quality markup returned by DQMaRC.DataQuality.raw_results(). It contains a cell-level binary markup of data quality errors, or flags. You can join the output dataset from DQMaRC.DataQuality.raw_results() to the original source dataset by the index.

raw = dq.raw_results()
raw

# The full results can be joined to the source data by the index.
# source_df_raw = df.join(raw)
Completeness_NULL_|_Patient_ID Completeness_NULL_|_Gender Completeness_NULL_|_Date_of_Diagnosis Completeness_NULL_|_Date_of_Birth Completeness_NULL_|_Age Completeness_NULL_|_Weight_kg Completeness_NULL_|_Height_cm Completeness_NULL_|_BMI Completeness_NULL_|_Postcode Completeness_NULL_|_ICD_10_Code ... validity_count_|_Datetime_Event2 validity_count_|_Datetime_Logging1 validity_count_|_Datetime_Logging2 validity_count_|_Gender validity_count_|_Height_cm validity_count_|_Postcode validity_count_|_Regimen_Start_Date validity_count_|_Tumour_M_Stage validity_count_|_Tumour_Stage validity_count_|_Weight_kg
0 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 1 0 0 0
1 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
2 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 1
4 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
105 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 1 1
106 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0
107 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 0 1
108 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 0 0 1 0

109 rows × 114 columns

Aggregated Results#

You can also access a higher-level aggregation of the raw results, containing the sum of data quality errors detected for each test parameter and each source data field. To access this, use the DQMaRC.DataQuality.aggregate_results() method.

agg = dq.aggregate_results()
agg
metric field Completeness_NULL Completeness_Empty Completeness_Encoded completeness_count row_uniqueness uniqueness_count Consistency_Compare Consistency_Date_Relations consistency_count Timeliness_Date_Diff timeliness_count Validity_Dates_Future Validity_Date_Range Validity_Postcode_UK Validity_Lookup_Table Validity_Range Validity_Pattern validity_count
12 Patient_ID 0.0 0.0 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 Gender 11.0 0.0 7.0 18.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.0 NaN NaN 0.0
3 Date_of_Diagnosis 2.0 0.0 NaN 2.0 NaN NaN NaN 6.0 6.0 NaN NaN 1.0 79.0 NaN NaN NaN 0.0 80.0
2 Date_of_Birth 1.0 0.0 NaN 1.0 NaN NaN NaN 6.0 6.0 NaN NaN 1.0 NaN NaN NaN NaN 0.0 1.0
0 Age 11.0 0.0 NaN 11.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 50.0 NaN 50.0
17 Weight_kg 11.0 0.0 NaN 11.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 84.0 NaN 84.0
9 Height_cm 10.0 0.0 NaN 10.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 34.0 NaN 34.0
1 BMI 9.0 0.0 NaN 9.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 36.0 NaN 36.0
13 Postcode 0.0 0.0 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN 2.0
10 ICD_10_Code 2.0 4.0 1.0 7.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Tumour_Stage 11.0 0.0 10.0 21.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 28.0 NaN NaN 28.0
11 Metastatic_Indicator 0.0 0.0 NaN 0.0 NaN NaN 49.0 NaN 49.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
15 Tumour_M_Stage 0.0 0.0 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 7.0 NaN NaN 7.0
14 Regimen_Start_Date 1.0 0.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN 0.0 2.0
4 Datetime_Event1 0.0 0.0 NaN 0.0 NaN NaN NaN 9.0 9.0 19.0 19.0 8.0 NaN NaN NaN NaN 9.0 17.0
6 Datetime_Logging1 0.0 0.0 NaN 0.0 NaN NaN NaN 9.0 9.0 NaN NaN 8.0 NaN NaN NaN NaN 0.0 8.0
5 Datetime_Event2 0.0 0.0 NaN 0.0 NaN NaN NaN 2.0 2.0 103.0 103.0 8.0 NaN NaN NaN NaN 0.0 8.0
7 Datetime_Logging2 0.0 0.0 NaN 0.0 NaN NaN NaN NaN NaN NaN NaN 8.0 NaN NaN NaN NaN 9.0 17.0
18 full_row_uniqueness NaN NaN NaN NaN 2.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Data Quality Visualisation#

Here we demonstrate examples of data visualisations to gain quick high-level overviews of the DQ results.

Overall data quality#

The overall quality is the average data quality for all metrics across all fields of the source dataset. The applied logic is:

  * if average > 90, return “Outstanding”
  * if average >= 80, return “Good”
  * if average >= 60, return “Requires Improvement”
  * if average < 60, return “Inadequate”
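The thresholding logic above can be sketched as a simple function. This is an illustrative stand-in only: the packaged overall_quality_fx also returns display colours alongside the label.

```python
def overall_quality_label(average):
    # Map an average percentage-good score to an overall quality label
    if average > 90:
        return "Outstanding"
    elif average >= 80:
        return "Good"          # note: an average of exactly 90 falls here
    elif average >= 60:
        return "Requires Improvement"
    else:
        return "Inadequate"

print(overall_quality_label(85.0))  # Good
```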

# Prepare DQ Dashboard
raw_subset = raw.filter(regex='completeness|validity|consistency|uniqueness_count|accuracy|timeliness')
calculator = MetricCalculator(raw_subset)

# Perform the calculation step to compute the aggregates
calculator.calculate_metrics()

summary_results = calculator.result
summary_results['Colour_Good'] = summary_results.apply(col_good, axis=1)
summary_results['Colour_Bad'] = summary_results.apply(col_bad, axis=1)
summary_results['Colour_NA'] = '#B2C3C6'

# Overall quality label
from IPython.display import HTML

# Function to display overall quality in a Jupyter Notebook
def display_overall_quality_label():
    if not summary_results.empty:
        data1 = summary_results[summary_results['Prop_NA'] == 0]
        avg_prop_good = data1['Prop_Good'].mean()
        overall_quality_level, background_color, text_colour = overall_quality_fx(avg_prop_good)

        overall_quality_text = f"Overall Quality: {overall_quality_level}"
        html = f"""
        <div style="
            background-color: {background_color};
            padding: 10px;
            border-radius: 5px;
            color: {text_colour};
            border: 2px solid {text_colour};
            text-align: center;
            width: 300px;">
            {overall_quality_text}
        </div>
        """
        return HTML(html)
    else:
        return HTML("<div style='text-align: center;'>No data available</div>")

# Use the function to display the result
display_overall_quality_label()
Overall Quality: Good

Donut Charts#

The donut charts represent the average DQ for all metrics and fields within a given DQ dimension.

DonutChartGenerator(summary_results).plot_donut_charts()

Bar Chart Completeness#

The bar charts represent the average DQ for each field of a given DQ dimension. Here we show each relevant field’s average completeness quality.

BarPlotGenerator(summary_results, "completeness").plot_bar()

Bar Chart Consistency#

Here we show all relevant fields’ average consistency.

BarPlotGenerator(summary_results, "consistency").plot_bar()

Bar Chart Timeliness#

Here we show all relevant fields’ average timeliness.

BarPlotGenerator(summary_results, "timeliness").plot_bar()

Bar Chart Uniqueness#

Uniqueness indicates whether records are unique across the chosen combination of fields.

BarPlotGenerator(summary_results, "uniqueness").plot_bar()

Bar Chart Validity#

Here we show all relevant fields’ average validity.

BarPlotGenerator(summary_results, "validity").plot_bar()

Example Errors by DQ Dimension#

First we must join the source data with the full data quality markup, stored in raw. We can do this using the pandas join method. The resulting df_DQ_full dataframe contains both the source data and the data quality markup.

# Join source data to the full DQ markup results
df_DQ_full = df.join(raw, how="left")
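The join above aligns the two dataframes on their shared row index. A toy illustration of the same pattern, using hypothetical data rather than the tutorial dataset:

```python
import pandas as pd

# Toy source data and a toy DQ markup sharing the same row index
src = pd.DataFrame({"Gender": ["Male", ".", "Female"]})
dq = pd.DataFrame({"completeness_count_|_Gender": [0, 1, 0]})

# Left join on the index keeps every source row and appends DQ columns
combined = src.join(dq, how="left")
print(combined.columns.tolist())  # ['Gender', 'completeness_count_|_Gender']
print(len(combined))              # 3
```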

Completeness Examples#

In this example, we showcase completeness errors present in the Gender variable of the source dataset. Values such as ., Na and unknown are flagged as errors, represented as 1 in the adjacent fields Completeness_NULL_|_Gender, Completeness_Empty_|_Gender and Completeness_Encoded_|_Gender. The completeness_count_|_Gender variable is the row-wise sum of these three DQ metrics for the variable in question (i.e. Gender in this case).

gender_completeness_conditions = (df_DQ_full['Completeness_NULL_|_Gender']>0) | \
    (df_DQ_full['Completeness_Empty_|_Gender']>0) | \
        (df_DQ_full['Completeness_Encoded_|_Gender']>0)

df_DQ_full[['Gender','Completeness_NULL_|_Gender',
            'Completeness_Empty_|_Gender', 'Completeness_Encoded_|_Gender',
            'completeness_count_|_Gender']].loc[(gender_completeness_conditions)]
Gender Completeness_NULL_|_Gender Completeness_Empty_|_Gender Completeness_Encoded_|_Gender completeness_count_|_Gender
2 . 0 0 1 1
7 . 0 0 1 1
10 NaN 1 0 0 1
11 NaN 1 0 0 1
15 . 0 0 1 1
26 . 0 0 1 1
27 NaN 1 0 0 1
31 unknown 0 0 1 1
32 unknown 0 0 1 1
33 unknown 0 0 1 1
37 NaN 1 0 0 1
46 NaN 1 0 0 1
55 NaN 1 0 0 1
57 NaN 1 0 0 1
69 NaN 1 0 0 1
82 NaN 1 0 0 1
91 NaN 1 0 0 1
101 NaN 1 0 0 1

Uniqueness Examples#

Uniqueness evaluates duplicate records across variables that we expect to be unique in combination. Here you can see that the row_uniqueness_|_full_row_uniqueness variable contains 1 when a duplicate record is present across the Patient_ID and Gender source variables.

# Check which rows have duplication where they should be unique
df_DQ_full[['Patient_ID', 'Gender']][df_DQ_full[['Patient_ID', 'Gender']].duplicated()]

# Show the corresponding uniqueness flags
uniqueness_conditions = (df_DQ_full['Patient_ID']==1) | (df_DQ_full['Patient_ID']==10)

df_DQ_full[['Patient_ID', 'Gender',
            'row_uniqueness_|_full_row_uniqueness']].loc[(uniqueness_conditions)]
Patient_ID Gender row_uniqueness_|_full_row_uniqueness
0 1 Male 1
1 1 Male 0
10 10 NaN 1
11 10 NaN 0

Consistency Examples#

We demonstrate consistency errors below by comparing the source variables Metastatic_Indicator and Tumour_M_Stage. In the full DQ report, the variable Consistency_Compare_|_Metastatic_Indicator contains 1 if an inconsistency is detected between Metastatic_Indicator and Tumour_M_Stage, based on the mapping provided in the test_params data.

consistency_conditions_metastatic_indicator = (df_DQ_full['Consistency_Compare_|_Metastatic_Indicator']>0)

df_DQ_full[['Tumour_M_Stage','Metastatic_Indicator',
            'Consistency_Compare_|_Metastatic_Indicator',
            'consistency_count_|_Metastatic_Indicator']].loc[(consistency_conditions_metastatic_indicator)]
Tumour_M_Stage Metastatic_Indicator Consistency_Compare_|_Metastatic_Indicator consistency_count_|_Metastatic_Indicator
0 M0 Present 1 1
1 M0 Present 1 1
2 M0 Present 1 1
3 M0 Present 1 1
4 M0 Present 1 1
5 M0 Present 1 1
6 M0 Present 1 1
7 M0 Present 1 1
8 M0 Present 1 1
9 M0 Present 1 1
10 M1 Absent 1 1
11 M1 Absent 1 1
12 M1 Absent 1 1
13 M1 Absent 1 1
14 M1a Absent 1 1
15 pM1a Absent 1 1
16 M1a Absent 1 1
17 M1b Absent 1 1
18 M1c Absent 1 1
19 pM1a Absent 1 1
80 M0 Present 1 1
81 M0 Present 1 1
82 M0 Present 1 1
83 M0 Present 1 1
84 M0 Present 1 1
85 M0 Present 1 1
86 M0 Present 1 1
87 M0 Present 1 1
88 M0 Present 1 1
89 M0 Present 1 1
90 M0 Present 1 1
91 M0 Present 1 1
92 M0 Present 1 1
93 M0 Present 1 1
94 M0 Present 1 1
95 M0 Present 1 1
96 M0 Present 1 1
97 M0 Present 1 1
98 M0 Present 1 1
99 M0 Present 1 1
100 M0 Present 1 1
101 M0 Present 1 1
102 M0 Present 1 1
103 M0 Present 1 1
104 M0 Present 1 1
105 M0 Present 1 1
106 M0 Present 1 1
107 M0 Present 1 1
108 M0 Present 1 1

We also demonstrate a consistency check between date of birth and date of diagnosis.

consistency_conditions_dates = (df_DQ_full['Consistency_Date_Relations_|_Date_of_Diagnosis']>0) | \
    (df_DQ_full['Consistency_Date_Relations_|_Date_of_Birth']>0)

df_DQ_full[['Date_of_Birth','Date_of_Diagnosis',
            'Consistency_Date_Relations_|_Date_of_Birth',
            'Consistency_Date_Relations_|_Date_of_Diagnosis']].loc[(consistency_conditions_dates)]
Date_of_Birth Date_of_Diagnosis Consistency_Date_Relations_|_Date_of_Birth Consistency_Date_Relations_|_Date_of_Diagnosis
0 07/06/2090 23/04/2033 1 1
19 25/10/2015 12/07/2011 1 1
36 10/07/2017 31/05/2015 1 1
66 04/07/2014 26/01/2010 1 1
67 13/01/2020 07/12/2010 1 1
73 28/01/2015 01/02/2014 1 1

Timeliness Examples#

Here we demonstrate example timeliness errors by comparing the datetime difference between Datetime_Event2 and Datetime_Logging2. In the customised test parameters, we set a threshold of 10 minutes for flagging an error. In other words, if the time difference between these variables is greater than 10 minutes, an error (i.e. 1) is recorded in the Timeliness_Date_Diff_|_Datetime_Event2 field of the DQ report.

timeliness_conditions = (df_DQ_full['Timeliness_Date_Diff_|_Datetime_Event2']>0)
df_DQ_full[['Datetime_Event2','Datetime_Logging2',
            'Timeliness_Date_Diff_|_Datetime_Event2',
            'timeliness_count_|_Datetime_Event2']].loc[(timeliness_conditions)]
Datetime_Event2 Datetime_Logging2 Timeliness_Date_Diff_|_Datetime_Event2 timeliness_count_|_Datetime_Event2
0 16/03/2024 00:00 16/03/2024 00:53 1 1
1 19/03/2024 00:00 19/03/2024 01:11 1 1
2 04/03/2024 00:00 04/03/2024 00:18 1 1
3 30/03/2024 00:00 30/03/2024 01:28 1 1
4 29/03/2024 00:00 29/03/2024 00:47 1 1
... ... ... ... ...
104 13/03/2024 00:00 01-01-1122 01:25:00 1 1
105 13/03/2024 00:00 01-01-1122 01:25:00 1 1
106 13/03/2024 00:00 01-01-1122 01:25:00 1 1
107 13/03/2024 00:00 01-01-1122 01:25:00 1 1
108 13/03/2024 00:00 01-01-1122 01:25:00 1 1

103 rows × 4 columns

Validity Examples#

Here we demonstrate example errors of data validity across all seven validity metrics: future dates, outlier dates, invalid NHS numbers, invalid UK postcodes, invalid codes, numerical outliers, and invalid patterns.

Future Dates#

The future dates check flags a datetime field whose value lies in the future relative to the current datetime. It cannot flag a date that was in the future when recorded but has since passed; such a value may instead be detected as an outlier date or an inconsistent date.

validity_future_dates_conditions = (df_DQ_full['Validity_Dates_Future_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Dates_Future_|_Date_of_Diagnosis']].loc[(validity_future_dates_conditions)]
Date_of_Diagnosis Validity_Dates_Future_|_Date_of_Diagnosis
0 23/04/2033 1
Outlier Dates#

Outlier dates are dates that occur outside the ranges as set in the test parameters.

validity_outlier_dates_conditions = (df_DQ_full['Validity_Date_Range_|_Date_of_Diagnosis']>0)
df_DQ_full[['Date_of_Diagnosis','Validity_Date_Range_|_Date_of_Diagnosis']].loc[(validity_outlier_dates_conditions)]
Date_of_Diagnosis Validity_Date_Range_|_Date_of_Diagnosis
0 23/04/2033 1
4 17/09/2010 1
6 09/01/2013 1
7 19/06/2011 1
8 13/10/2012 1
... ... ...
104 26/11/2010 1
105 30/07/2011 1
106 15/11/2014 1
107 15/03/2018 1
108 13/10/2017 1

79 rows × 2 columns

Invalid NHS Numbers#

Invalid NHS numbers are flagged when they fail the NHS number validation algorithm. The commented-out code below shows how you would check which NHS numbers in your dataset are invalid. We excluded the output from our example to minimise the risk of sharing potentially real NHS numbers, even though the data are synthetically generated.

#validity_NHS_number_conditions = (df_DQ_full['Validity_NHS_Number_|_NHS_number']>0)
#df_DQ_full[['NHS_number','Validity_NHS_Number_|_NHS_number']].loc[(validity_NHS_number_conditions)]
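For context, NHS number validity is determined by a Modulus 11 checksum over the ten digits. A minimal illustration of that algorithm, written independently here (DQMaRC's own implementation may differ in detail):

```python
def is_valid_nhs_number(nhs_number):
    """Modulus 11 check: weight the first nine digits by 10..2,
    derive the check digit, and compare it with the tenth digit."""
    digits = [int(c) for c in str(nhs_number) if c.isdigit()]
    if len(digits) != 10:
        return False
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # 10 can never be a valid check digit
        return False
    return check == digits[9]

print(is_valid_nhs_number("1234567890"))  # False
```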
Invalid UK Postcodes#

Invalid UK postcodes can be detected by the UK postcode validation algorithm. Other countries’ postcodes can be validated using pattern validation as shown below.

validity_UK_postcodes_conditions = (df_DQ_full['Validity_Postcode_UK_|_Postcode']>0)
df_DQ_full[['Postcode','Validity_Postcode_UK_|_Postcode']].loc[(validity_UK_postcodes_conditions)]
Postcode Validity_Postcode_UK_|_Postcode
6 AAA 1AA 1
8 BB1 1ABB 1
Invalid Tumour Stage Codes#

This is one example of a validation test applied to a categorical or coded source variable. In this case, we validate Tumour_Stage codes against a list of valid codes. The code list must be provided as a CSV file, with its filename specified in the test parameters dataset.

validity_codes_conditions = (df_DQ_full['Validity_Lookup_Table_|_Tumour_Stage']>0)
df_DQ_full[['Tumour_Stage','Validity_Lookup_Table_|_Tumour_Stage']].loc[(validity_codes_conditions)]
Tumour_Stage Validity_Lookup_Table_|_Tumour_Stage
23 Stage3 1
27 Stage3 1
30 Stage3 1
31 stage4 1
34 Stage3 1
38 stage 4 1
42 Stage3 1
45 Stage3 1
46 stage4 1
49 Stage3 1
53 stage 4 1
57 Stage3 1
60 Stage3 1
61 stage4 1
64 Stage3 1
71 Stage3 1
74 Stage3 1
75 stage4 1
78 Stage3 1
82 stage 4 1
86 Stage3 1
90 Stage3 1
93 Stage3 1
94 stage4 1
97 Stage3 1
101 stage 4 1
105 Stage3 1
108 Stage3 1
Numerical Outliers#

Numerical outliers are detected by applying a minimum and maximum numerical range in the test parameters.

validity_numerical_ranges_conditions = (df_DQ_full['Validity_Range_|_Height_cm']>0)
df_DQ_full[['Height_cm','Validity_Range_|_Height_cm']].loc[(validity_numerical_ranges_conditions)]
Height_cm Validity_Range_|_Height_cm
0 220.50 1
4 7.48 1
8 296.97 1
10 210.99 1
12 289.45 1
16 202.02 1
17 227.60 1
19 248.42 1
21 297.19 1
23 249.22 1
25 275.20 1
26 265.48 1
30 217.05 1
32 252.69 1
33 281.95 1
35 283.26 1
39 230.48 1
44 247.59 1
47 253.83 1
49 217.83 1
52 211.40 1
57 240.35 1
62 246.27 1
71 210.46 1
72 255.92 1
74 292.97 1
79 6.35 1
80 256.97 1
82 209.31 1
84 299.91 1
85 231.40 1
92 226.65 1
99 226.65 1
106 226.65 1
Invalid Patterns#

The pattern validation check uses regular expression patterns. Below, you can see a datetime variable with invalid date formats or patterns. The test parameters are set to accept only values following the DD/MM/YYYY HH:MM pattern.

validity_pattern_conditions = (df_DQ_full['Validity_Pattern_|_Datetime_Event1']>0)
df_DQ_full[['Datetime_Event1','Validity_Pattern_|_Datetime_Event1']].loc[(validity_pattern_conditions)]
Datetime_Event1 Validity_Pattern_|_Datetime_Event1
100 1991/03/2024 00:00:00 1
101 1991/03/2024 00:00:00 1
102 1991/03/2024 00:00:00 1
103 1991/03/2024 00:00:00 1
104 1991/03/2024 00:00:00 1
105 1991/03/2024 00:00:00 1
106 1991/03/2024 00:00:00 1
107 1991/03/2024 00:00:00 1
108 1991/03/2024 00:00:00 1
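As an illustration, a regular expression enforcing a DD/MM/YYYY HH:MM pattern might look like the following. This is an assumed pattern for demonstration, not necessarily the exact regex configured in the test parameters:

```python
import re

# Illustrative regex for DD/MM/YYYY HH:MM; it checks shape only,
# not calendar validity (e.g. 31/02/2024 would still match)
pattern = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}$")

print(bool(pattern.fullmatch("16/03/2024 00:00")))       # True
print(bool(pattern.fullmatch("1991/03/2024 00:00:00")))  # False
```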