Title: Data Quality Assessment Tools
Version: 0.1.0
Description: Provides functions for assessing data quality across several dimensions: completeness, plausibility, concordance, conformance, currency, timeliness, and correctness. The package builds on two established data quality assessment frameworks, Michael G. Kahn (2016) <doi:10.13063/2327-9214.1244> and Nicole G. Weiskopf (2017) <doi:10.5334/egems.218>. Users can evaluate the quality of their datasets, provided that corresponding metadata are available.
License: MIT + file LICENSE
Encoding: UTF-8
Language: en-US
RoxygenNote: 7.3.2
Imports: ggplot2, data.table
Suggests: testthat (>= 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2025-12-12 11:12:36 UTC; ELHAM
Author: Hamed Tabesh [aut], Elham Hosseinzadeh [aut, cre], Marziyeh Afkanpour [aut]
Maintainer: Elham Hosseinzadeh <elham.hh2012@gmail.com>
Repository: CRAN
Date/Publication: 2025-12-18 13:50:22 UTC

Check Missing Data Item-wise with Dependency Logic

Description

Analyzes missing data ('NA' values) for each variable (item-wise) by considering dependencies between variables. This function goes beyond simple NA counts by classifying missingness into different categories based on rules defined in metadata.

Usage

check_missing_itemwise(
  S_data,
  M_data,
  var_select = 1:nrow(M_data),
  Show_Plot = FALSE
)

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules.

var_select

A numeric or character vector specifying which variables to process. Can be indices or names from the 'VARIABLE' column of 'M_data'. Defaults to all variables.

Show_Plot

A logical value. If 'TRUE', a ggplot bar chart showing the missingness percentage for each variable is displayed.

Details

This function classifies each row for a given variable into one of four states:

The metadata ('M_data') must contain the following columns to define the rules:

Value

A 'data.table' summarizing the missing data analysis for each variable, with columns such as 'VARIABLE', 'Missing_Count', 'Jump_Count', 'Unexpected_Count', 'Total_Applicable' (the number of rows in which a value was expected for the variable according to the metadata rules), 'Percent_Complete', and 'Percent_Missing'.
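
The dependency idea behind this check can be illustrated with a small base R sketch. It is not the package's implementation: the state labels ("present", "missing", "skipped", "unexpected") and the rule that 'Job_Title' is expected only when 'Has_Job' is "Yes" are assumptions made for illustration, loosely mirroring the 'Dependency' and 'Dep_Value' columns used in the examples.

```r
# Illustrative only: not the DQA implementation.
has_job   <- c("Yes", "Yes", "No", "Yes", "No")
job_title <- c(NA, "Manager", NA, "Analyst", "Student")

# Assumption: Job_Title is expected only when Has_Job == "Yes"
applicable <- !is.na(has_job) & has_job == "Yes"

# Assumed state labels; the package may name or define them differently
status <- ifelse(
  applicable & is.na(job_title), "missing",
  ifelse(
    !applicable & is.na(job_title), "skipped",
    ifelse(!applicable & !is.na(job_title), "unexpected", "present")
  )
)
print(table(status))

# Missingness is measured only against rows where a value was expected
percent_missing <- 100 * sum(status == "missing") / sum(applicable)
print(percent_missing)
```

Counting missing values only among applicable rows keeps legitimately skipped questions from inflating the missingness rate.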

See Also

Other missing data checks: check_missing_record(), check_missing_segments()

Examples

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", "Female", "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, 42, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, "Manager", NA, "Analyst", NA, "Student",
                "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, 9, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, 2021, NA, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c("Provider A", NA, "Provider B", "Provider C",
                         "Provider D", NA, "Provider E", NA, NA, "Provider F"),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the item-wise check with plot
item_report <- check_missing_itemwise(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(item_report)

Check Missing Data by Record (Unit Check)

Description

Provides a high-level summary of data completeness across the entire dataset by classifying each row (or "record") as complete, incomplete, or missing.

Usage

check_missing_record(
  S_data,
  M_data,
  Show_Plot = FALSE,
  start_var = 1,
  skip_vars = NULL
)

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules.

Show_Plot

A logical value. If 'TRUE', a pie chart visualizing the proportions of complete, incomplete, and missing rows is displayed.

start_var

A numeric value indicating the starting variable index (from 'M_data') to include in the analysis. Defaults to 1.

skip_vars

A character or numeric vector of variables to exclude from the analysis. Can be variable names or column indices.

Details

This function evaluates all specified variables for each row and determines its overall status based on the same error logic as 'check_missing_segments'. A row is:

The metadata ('M_data') must contain the following columns to define the rules:

The function returns a single-row data frame summarizing the counts and percentages for the entire dataset.

Value

A single-row 'data.frame' with summary counts and percentages: 'Total_Rows', 'Complete_Count', 'Incomplete_Count', 'Missing_Count', 'Percent_Complete', 'Percent_Incomplete', and 'Percent_Missing'.
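
A rough base R sketch of record-level classification follows. It assumes that "complete" means no NA in the row, "missing" means every checked value is NA, and "incomplete" is anything in between; it also ignores the dependency rules that the package reads from 'M_data', so it should be read as a simplification rather than the function's actual logic.

```r
# Illustrative only: ignores the dependency rules that DQA reads from M_data.
records <- data.frame(
  Gender = c("Male", NA, "Female", "Male"),
  Age    = c(25, NA, NA, 40),
  stringsAsFactors = FALSE
)

# Number of NA values per row
n_na <- rowSums(is.na(records))

record_status <- ifelse(
  n_na == 0, "complete",
  ifelse(n_na == ncol(records), "missing", "incomplete")
)

# One-row summary using the column names listed in the Value section
record_summary <- data.frame(
  Total_Rows       = nrow(records),
  Complete_Count   = sum(record_status == "complete"),
  Incomplete_Count = sum(record_status == "incomplete"),
  Missing_Count    = sum(record_status == "missing")
)
record_summary$Percent_Complete <-
  100 * record_summary$Complete_Count / record_summary$Total_Rows
print(record_summary)
```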

See Also

Other missing data checks: check_missing_itemwise(), check_missing_segments()

Examples

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", NA, "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, NA, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", NA, "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, NA, NA, "Analyst", NA, "Student",
                "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, NA, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, NA, 2021, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", NA, "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c("Provider A", NA, "Provider B", "Provider C",
                         "Provider D", NA, "Provider E", NA, NA, "Provider F"),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the row-wise check with plot
row_report <- check_missing_record(
  S_data = Source_data, M_data = Meta_data, skip_vars = "ID", Show_Plot = TRUE
)
print(row_report)

Check Missing Data by Segments

Description

Analyzes data completeness at the segment level. A segment is a group of variables defined in the 'Segment_Names' column of the metadata.

Usage

check_missing_segments(S_data, M_data, Show_Plot = FALSE)

Arguments

S_data

A data frame containing the source data to be checked.

M_data

A metadata data frame containing the validation rules, including a 'Segment_Names' column.

Show_Plot

A logical value. If 'TRUE', a stacked bar chart visualizing the proportions for each segment is displayed.

Details

For each segment, this function evaluates every row of the source data ('S_data') and classifies it into one of three categories:

The metadata ('M_data') must contain the following columns to define the rules:

The function returns a summary table with counts and percentages for each category per segment.

Value

A 'data.frame' summarizing the analysis for each segment, with columns: 'SEGMENT', 'Total_Rows', 'Complete_Count', 'Incomplete_Count', 'Missing_Count', 'Percent_Complete', 'Percent_Incomplete', and 'Percent_Missing'.
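
The grouping of variables into segments can be sketched in base R as below. This is only an illustration under simplifying assumptions: a row is "complete" for a segment when none of the segment's variables is NA, "missing" when all are NA, and "incomplete" otherwise. Dependency rules are ignored here, so a row such as Has_Job = "No" with an empty Job_Title is labelled "incomplete" in the sketch even though the package may treat it differently.

```r
# Illustrative only: a simplified stand-in, not the DQA implementation.
meta <- data.frame(
  VARIABLE      = c("Gender", "Age", "Has_Job", "Job_Title"),
  Segment_Names = c("Demographic", "Demographic", "Employment", "Employment"),
  stringsAsFactors = FALSE
)
src <- data.frame(
  Gender    = c("Male", NA, "Female"),
  Age       = c(25, NA, 31),
  Has_Job   = c("Yes", NA, "No"),
  Job_Title = c(NA_character_, NA_character_, NA_character_),
  stringsAsFactors = FALSE
)

# Variable names grouped by segment
vars_by_segment <- split(meta$VARIABLE, meta$Segment_Names)

# Classify every row for one set of columns
classify_rows <- function(cols) {
  n_na <- rowSums(is.na(src[, cols, drop = FALSE]))
  ifelse(n_na == 0, "complete",
         ifelse(n_na == length(cols), "missing", "incomplete"))
}

segment_status <- lapply(vars_by_segment, classify_rows)
print(segment_status)
```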

See Also

Other missing data checks: check_missing_itemwise(), check_missing_record()

Examples

# 1. Define comprehensive sample data and metadata
Meta_data <- data.frame(
  stringsAsFactors = FALSE,
  VARIABLE = c(
    "ID", "Gender", "Age", "Has_Job", "Job_Title",
    "Job_Satisfaction", "Last_Promotion_Year", "Has_Insurance",
    "Insurance_Provider", "Annual_Checkup"
  ),
  VARIABLE_Code = 1:10,
  Var_order = 1:10,
  Segment_Names = c(
    "Demographic", "Demographic", "Demographic", "Employment", "Employment",
    "Employment", "Employment", "Health", "Health", "Health"
  ),
  Dependency = c(0, 0, 0, 0, 4, 5, 5, 0, 8, 8),
  Dep_Value = c(
    "0", "0", "0", "0", "Yes", "ANY", "ANY", "0", "Yes", "Yes"
  )
)

Source_data <- data.frame(
  ID = 1:10,
  Gender = c("Male", NA, "Male", "Female", "Male",
             "Female", "Male", "Female", "Male", "Female"),
  Age = c(25, NA, 31, 55, 29, 38, 45, 22, 60, 33),
  Has_Job = c("Yes", NA, "No", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"),
  Job_Title = c(NA, NA, NA, "Analyst", NA, "Student",
                "Director", "Engineer", NA, "Designer"),
  Job_Satisfaction = c(5, NA, NA, 8, 7, NA, 10, 9, NA, 6),
  Last_Promotion_Year = c(2020, NA, 2021, NA, NA, NA, 2024, 2022, NA, 2023),
  Has_Insurance = c("Yes", NA, "Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"),
  Insurance_Provider = c("Provider A", NA, "Provider B", "Provider C",
                         "Provider D", NA, "Provider E", NA, NA, "Provider F"),
  Annual_Checkup = c("Yes", NA, "No", "Yes", NA, "Yes", "Yes", "No", NA, "Yes")
)

# 2. Run the segment check with plot
segment_report <- check_missing_segments(
  S_data = Source_data, M_data = Meta_data, Show_Plot = TRUE
)
print(segment_report)

Perform Concordance Check on Data Based on Defined Rules

Description

This function evaluates a source data frame ('S_data') against a set of rules defined in a metadata data frame ('M_data'). It checks multiple concordance rules (logical and clinical conditions) on columns of a data frame, based on metadata specifications. It supports flexible rule definition, date handling, and customizable output.

Usage

concordance_check(
  S_data,
  M_data,
  Result = FALSE,
  show_column = NULL,
  date_parser_fun = smart_to_gregorian_vec,
  var_select = "all",
  verbose = FALSE
)

Arguments

S_data

data.frame. The source data in which rules will be evaluated. Each column may be referenced by the rules.

M_data

data.frame. Metadata describing variables and their concordance rules. Must include at least columns VARIABLE and Concordance_Rule. Optionally includes TYPE and Concordance_Error_Type.

Result

logical (default: FALSE). If TRUE, returns row-by-row evaluation results for each rule. If FALSE, returns a summary table for each rule.

show_column

character vector (default: NULL). Names of columns from S_data to include in the result when Result = TRUE. Ignored otherwise.

date_parser_fun

function (default: smart_to_gregorian_vec). Function used to convert date values or date literals to Date class, including conversion of Persian (Jalali) dates to Gregorian. Must accept character vectors and return Date objects.

var_select

character, numeric, or "all" (default: "all"). Subset of variables (rules) to check. Can be a character vector of variable names, numeric vector of row indices in M_data, or "all" to run all rules.

verbose

logical (default: FALSE). If TRUE, prints diagnostic messages during rule processing and evaluation.

Details

The metadata data.frame (M_data) must contain at least the following columns:

For each variable described in M_data, the function:

The function supports flexible rule definitions, including conditions involving multiple columns, clinical rules, date comparisons, and custom logic using R expressions.

If Result = FALSE, the function returns a summary table for each rule, including counts and percentages of rows that meet or do not meet the condition, as well as the error type from the metadata.

If Result = TRUE, the function returns a data.frame with one column per rule/variable, each containing logical values (TRUE, FALSE, or NA) for every row, plus any extra columns from S_data listed in show_column.
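
To make the idea of rules written as R expressions more concrete, here is a minimal base R sketch of evaluating one rule string against a data frame. It is not the package's internal code: the binding of `val` to the column of the variable being checked is an assumption inferred from the example rules, and the evaluation mechanism shown here is a generic one.

```r
# Illustrative only: a simplified stand-in for rule evaluation, not DQA internals.
src <- data.frame(
  Systolic_BP1 = c(110, NA, 145),
  Systolic_BP2 = c(125, 150, 140)
)

# Rule text as it might appear in a Concordance_Rule cell for Systolic_BP1
rule <- "ifelse(is.na(val) | is.na(Systolic_BP2), NA, abs(val - Systolic_BP2) >= 15)"

# Assumption: `val` stands for the column of the variable being checked
eval_data <- c(as.list(src), list(val = src$Systolic_BP1))
flags <- eval(parse(text = rule), envir = eval_data)

# One logical value (TRUE, FALSE, or NA) per row
print(flags)
```

Because the rule text is executed as code, metadata of this kind should come from a trusted source.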

Value

If Result = FALSE: a data.frame summary with columns:

Examples

# build the long rule in multiple short source lines to avoid >100 char Rd lines
rule_bp <- paste0(
  "(ifelse(is.na(val) | is.na(Systolic_BP2), NA, ",
  "(abs(val - Systolic_BP2) >= 15) & (val > 15 & Prescription_drug == '')))"
)

# Source data
S_data <- data.frame(
  National_code = c("123", "1456", "789","545","4454","554"),
  LastName = c("Aliyar","Johnson","Williams","Brown","Jones","Garcia"),
  VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
  Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA),
  Certificate_validity = c("2025-09-21", "2025-01-12", "2025-02-11","1403-06-28","2025-09-19",NA),
  Systolic_BP1 = c(110, NA, 145, 125,114,NA),
  Systolic_BP2 = c(125, 150, NA, 110,100,NA),
  Prescription_drug= c("Atorvastatin", "Metformin", "Amlodipine",
    "Omeprazole", "Aspirin","Metoprolol"),
  Blood_type = c("A-","B+","AB","A+","O-","O+"),
  stringsAsFactors = FALSE
)

# META DATA (use the short-built rule)
M_data <- data.frame(
  VARIABLE = c("National_code", "Certificate_validity", "VisitDate",
               "Test_date","LastName","Systolic_BP1","Systolic_BP2",
               "Prescription_drug","Blood_type"),
  Concordance_Rule = c(
    "", "", "VisitDate<=Test_date", "Test_date-VisitDate < 7", "",
    rule_bp, "", "", ""
  ),
  TYPE=c("numeric","date","date","date","character",
         "numeric","numeric","character","character"),
  Concordance_Error_Type = c("type1",NA,"type2","type3",NA,NA,NA,NA,"type4"),
  stringsAsFactors = FALSE
)

# test call
result <- concordance_check(S_data = S_data, M_data = M_data, Result = TRUE,
show_column = c("National_code"))
print(result)

Perform Conformance Check on Data Based on Defined Rules

Description

This function evaluates a source dataframe ('S_data') against a set of rules defined in a metadata dataframe ('M_data'). It uses a set of default rule functions but can also use a user-provided file.

Usage

conformance_check(
  S_data,
  M_data,
  rule_file = NULL,
  na_as_error = FALSE,
  var_select = "all"
)

Arguments

S_data

A dataframe containing the source data to be checked.

M_data

A metadata dataframe that specifies the rules. It must contain the columns 'VARIABLE', 'Conformance_Rule', and 'Value'.

rule_file

The path to a custom R file where rule functions are defined. If 'NULL' (default), the standard rule definitions file included with the 'DQA' package is used. Instructions for using this file are documented under the name 'conformance_rules'.

na_as_error

A logical value. If 'TRUE', 'NA' values in the source data are treated as errors (non-conformant). If 'FALSE' (default), they are ignored.

var_select

Character or integer vector of variables to check. Accepts variable names, column numbers, or a mix. Default is "all" (check all variables in M_data).

Details

The metadata ('M_data') for conformance_check must include:

Value

A dataframe containing the results of the conformance check for each rule.
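
The effect of 'na_as_error' can be pictured with a short base R sketch. The helper below is hypothetical and is not one of the rule functions shipped with the package; it only shows how counting NA results as failures changes an error count for a length rule.

```r
# Illustrative only: hypothetical helper, not a rule function shipped with DQA.
national_id <- c("1234567890", "123", NA, "0987654321")

# Length rule: an NA input yields NA rather than TRUE or FALSE
length_ok <- nchar(national_id) == 10

count_errors <- function(ok, na_as_error = FALSE) {
  # Optionally treat undetermined (NA) results as failures
  if (na_as_error) ok[is.na(ok)] <- FALSE
  sum(!ok, na.rm = TRUE)
}

errors_ignoring_na <- count_errors(length_ok)
errors_counting_na <- count_errors(length_ok, na_as_error = TRUE)
print(c(ignoring_na = errors_ignoring_na, counting_na = errors_counting_na))
```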

Examples

# 1. Create sample source data (S_data)
S_data <- data.frame(
  id = 1:10,
  national_id = c("1234567890", "0987654321", "123", NA, "1112223334",
                  "1234567890", "5556667778", "9998887770", "12345", "4445556667"),
  gender = c(1, 2, 1, 3, 2, 1, NA, 2, 1, 2), # 1=Male, 2=Female, 3=Error
  age = c(25, 40, 150, 33, -5, 65, 45, 29, 70, 55),
  part_a = c(10, 15, 20, 25, 30, 35, 40, 45, 50, 55),
  part_b = c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50),
  total_parts = c(15, 25, 35, 45, 55, 65, 75, 85, 94, 105), # one error in row 9
  stringsAsFactors = FALSE
)

# 2. Create sample metadata (M_data)
M_data <- data.frame(
  VARIABLE = c(
    "national_id",
    "national_id",
    "gender",
    "total_parts"
  ),
  Conformance_Rule = c(
    "length_check",
    "unique_check",
    "category_check",
    "arithmetic_check"
  ),
  Value = c(
    "10",                  # national_id length must be 10
    "",                    # unique
    "1 | 2",               # Allowed values for gender
    "part_a + part_b"      # Computational rule for total_parts
  ),
  stringsAsFactors = FALSE
)

# 3. Run the conformance check using the package's default rules
# Ensure the 'DQA' package is loaded before running
conformance_results <- conformance_check(S_data = S_data, M_data = M_data)
print(conformance_results)


Default Conformance Check Rules (Reference)

Description

The 'DQA' package provides a suite of built-in rule functions for use with 'conformance_check()'. You can use these rule names directly in the 'Conformance_Rule' column of your metadata dataframe.

Details

Most rules accept the column vector as 'x', and some accept extra parameters (e.g., 'val_num', 'val_lit') parsed from the 'Value' column of your metadata.

General Purpose Rules

unique_check(x)

Checks for unique (non-duplicate) values.

character_check(x)

Checks for the presence of alphabetic characters.

email_check(x)

Checks if values are valid email addresses.

numeric_check(x)

Checks if values can be interpreted as numbers.

integer_check(x)

Checks if values are integers (whole numbers).

length_check(x, val_num)

Checks if string length matches the given value (from the 'Value' column).

category_check(x, val_num, val_ops, val_lit)

Checks membership in allowed categories (numbers or literals).

set_check(x, val_lit)

Checks if value is in a user-defined set (from 'Value').

not_null_check(x)

Checks if value is not missing ('NA').

Numeric/Date Rules

date_check(x, val_lit)

Checks if value is a valid date (optionally in specified format).

regex_check(x, val_lit)

Checks values against a user-supplied regular expression.

Computational

bmi_check(x, val_lit = NULL, val = NULL)

Checks a BMI column for consistency with the corresponding weight and height columns. List the names of the weight and height columns in the 'Value' column of the metadata so that the BMI can be calculated.

The function reads the column names for weight and height dynamically from the val_lit argument, which is typically provided by the metadata (M_data) in the Value column as a comma- or pipe-separated string (for example, Weight,Height or Weight|Height; use the actual column names in your dataset).

The function automatically detects the unit for height (cm or m) based on the data, and allows for custom rounding digits and tolerance via the optional val argument (e.g., val = c(1, 0.01) for 1 decimal digit and 0.01 tolerance).

Returns a logical vector indicating which rows pass the BMI consistency check.
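
A BMI consistency check of this kind might look roughly like the base R sketch below. The function name, the unit heuristic (a median height above 3 is taken to mean centimetres), and the default rounding and tolerance values are all assumptions for illustration; the package's own 'bmi_check()' may detect units and compare values differently.

```r
# Illustrative only: a hypothetical helper, not DQA's bmi_check().
bmi_consistent <- function(bmi, weight, height, digits = 1, tol = 0.01) {
  # Assumed heuristic: a median height above 3 means the unit is centimetres
  height_m <- if (median(height, na.rm = TRUE) > 3) height / 100 else height
  expected <- round(weight / height_m^2, digits)
  abs(bmi - expected) <= tol
}

patients <- data.frame(
  Weight = c(70, 80, 60),
  Height = c(175, 180, 160),
  BMI    = c(22.9, 30.0, 23.4)
)

bmi_ok <- bmi_consistent(patients$BMI, patients$Weight, patients$Height)
print(bmi_ok)
```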

Notes

See the actual implementation of these functions in the package by obtaining a copy with 'conformance_scaffold_rules()'.


Validate Data Against Correctness Rules

Description

This function validates a data frame against a set of correctness rules specified in another data frame. It allows for complex validation operations, comparison with reference data, and detailed reporting.

Usage

correctness_check(
  S_data,
  M_data,
  Result = FALSE,
  show_column = NULL,
  date_parser_fun = smart_to_gregorian_vec,
  golden_data = NULL,
  key_column = NULL,
  external_data = NULL,
  var_select = "all",
  batch_size = 1000,
  verbose = FALSE
)

Arguments

S_data

A data frame containing the data to be validated.

M_data

A data frame containing the validation rules. Must have at least the following columns:

  • VARIABLE: The name of the variable to validate (must match column names in S_data)

  • Correctness_Rule: The validation rule as an R expression (string)

  • TYPE: The data type of the variable ("date", "numeric", or other)

  • Correctness_Error_Type: (Optional) Classification of the error type

Result

Logical. If TRUE, returns the detailed results for each row in S_data. If FALSE (default), returns a summary of validation results.

show_column

Character vector. When Result=TRUE, specifies additional columns from S_data to include in the output.

date_parser_fun

Function to convert date strings to Date objects. Default is smart_to_gregorian_vec, which should handle various date formats including Jalali dates.

golden_data

Optional data frame or list containing reference data for validation. Accessible within rules via the GOLDEN variable.

key_column

Character string specifying the column name that links rows in S_data to corresponding rows in golden_data. Required when comparing individual rows with golden_data.

external_data

Optional list or data frame containing additional data for validation rules.

var_select

Character vector or numeric indices specifying which variables from M_data to validate. By default, it validates all variables.

batch_size

integer. Number of rows to process in each batch (for efficiency).

verbose

logical. If TRUE, prints progress messages.

Details

The function evaluates each rule specified in M_data against the corresponding data in S_data. Rules are R expressions written as strings, evaluated in an environment where:

Type conversion is applied to variables in S_data based on the TYPE column in M_data:

Special handling for date comparisons is provided, including automatic wrapping of GOLDEN references when comparing dates.
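
The role of 'key_column' can be sketched in base R as a key-based lookup into the reference data. This is a simplified stand-in and not how the package resolves GOLDEN references internally; the object names are hypothetical.

```r
# Illustrative only: a simplified stand-in, not how DQA resolves GOLDEN references.
src <- data.frame(
  National_code = c("123", "1456", "789"),
  LastName      = c("Aliyar", "Johnson", "Williams"),
  stringsAsFactors = FALSE
)
golden <- data.frame(
  National_code = c("123", "456", "789"),
  LastName      = c("Bahman", "Johnson", "Williams"),
  stringsAsFactors = FALSE
)

# Align golden rows to source rows through the key; unmatched keys give NA
idx <- match(src$National_code, golden$National_code)
golden_lastname <- golden$LastName[idx]

key_found    <- !is.na(idx)
name_matches <- src$LastName == golden_lastname
print(data.frame(src, key_found, name_matches))
```

Without a key, only set-style rules such as membership tests with %in% can be compared against the reference values, since there is no way to pair an individual source row with its reference row.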

Value

If Result=FALSE (default): A data frame with one row per validated variable containing:

If Result=TRUE: A data frame with one row per row in S_data, containing:

Examples

Authorized_drug <- data.frame(
  Drug_ID = 1:10,
  Drug_Name = c("Atorvastatin", "Metformin", "Amlodipine", "Omeprazole", "Aspirin",
                "Levothyroxine", "Sertraline", "Pantoprazole", "Losartan", "ASA"),
  stringsAsFactors = FALSE
)

golde <- data.frame(
  National_code = c("123", "456", "789","545","4454","554","665"),
  LastName = c("Bahman","Johnson","Williams","Brown","Jones","Garcia","Miller"),
  Certificate_Expiry = c("1404-07-01", "2030-01-12", "2025-01-11",
  "1404-06-28","2025-09-19",NA,NA),
  Blood_type = c("A-","B+","AB","A+","O-","O+","AB-"),
  stringsAsFactors = FALSE
)

S_data <- data.frame(
  National_code = c("123", "1456", "789","545","4454","554"),
  LastName = c("Aliyar","Johnson","Williams","Brown","Jones","Garcia"),
  VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
  Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA),
  Certificate_validity = c("2025-09-23", "2025-01-12", "2025-02-11","1403-06-28","2025-09-19",NA),
  Systolic_Reading1 = c(110, NA, 145, 125,114,NA),
  Systolic_Reading2 = c(125, 150, NA, 110,100,NA),
  Prescription_drug= c("Atorvastatin", "Metformin", "Amlodipine",
   "Omeprazole", "Aspirin","Metoprolol"),
  Blood_type = c("A-","B+","AB","A+","O-","O+"),
  Height = c(178,195,165,NA,155,1.80),
  stringsAsFactors = FALSE
)

M_data <- data.frame(
  VARIABLE = c("National_code", "Certificate_validity", "VisitDate","Test_date",
               "LastName","Systolic_Reading1","Systolic_Reading2",
               "Prescription_drug","Blood_type","Height"),
  Correctness_Rule = c(
    "National_code %in% GOLDEN$National_code",
    "val <= GOLDEN$Certificate_Expiry",
    "((val >= '1404-06-01' & val <= '1404-06-31') | val == as.Date('2021-01-02'))",
    "val != VisitDate",
    "val %in% GOLDEN$LastName",
    "",
    "",
    "val %in% Authorized_drug$Drug_Name",
    "val %in% GOLDEN$Blood_type",
    ""),
  TYPE=c("numeric","date","date","date","character","numeric",
  "numeric","character","character","numeric"),
  Correctness_Error_Type=c("Error",NA,"Warning","Error",NA,NA,NA,NA,"Error","Warning"),
  stringsAsFactors = FALSE
)

result <- correctness_check(
  S_data = S_data,
  M_data = M_data,
  golden_data = golde,
  key_column = c("National_code"),
  Result =FALSE,
  external_data = Authorized_drug
)

print(result)
# Row-level results without a key column
result <- correctness_check(
  S_data = S_data,
  M_data = M_data,
  golden_data = golde,
  # key_column = c("National_code"),
  # If no key is selected, the golden data can still be used as a list,
  # but the logical rules will evaluate to NA.
  Result = TRUE,
  external_data = Authorized_drug
)
print(result)


Perform Currency Check for Data Frame Columns

Description

This function evaluates a source data frame ('S_data') against a set of rules defined in a metadata data frame ('M_data'). It checks currency rules (temporal conditions) on columns of a data frame, based on metadata specifications. It supports flexible rule definition, date literal handling, and customizable output.

Usage

currency_check(
  S_data,
  M_data,
  Result = FALSE,
  show_column = NULL,
  date_parser_fun = smart_to_gregorian_vec,
  var_select = "all",
  verbose = FALSE
)

Arguments

S_data

data.frame. The source data in which rules will be evaluated. Each column may be referenced by the rules.

M_data

data.frame. Metadata describing variables and their currency rules. Must include columns VARIABLE, Currency_Rule and TYPE. Optionally includes Currency_Error_Type.

Result

logical (default: FALSE). If TRUE, returns row-by-row evaluation results for each rule. If FALSE, returns a summary table for each rule.

show_column

character vector (default: NULL). Names of columns from S_data to include in the result when Result = TRUE. Ignored otherwise.

date_parser_fun

function (default: smart_to_gregorian_vec). Function used to convert date values or date literals to Date class, including conversion of Persian (Jalali) dates to Gregorian. Must accept character vectors and return Date objects.

var_select

character, numeric, or "all" (default: "all"). Subset of variables (rules) to check. Can be a character vector of variable names, numeric vector of row indices in M_data, or "all" to run all rules.

verbose

logical (default: FALSE). If TRUE, prints diagnostic messages during rule processing and evaluation.

Details

The metadata data.frame (M_data) must contain the following columns:

For each variable described in M_data, the function:

If Result = FALSE, returns a summary table with counts and percentages of rows meeting/not meeting each condition. If Result = TRUE, returns a data.frame with boolean results for each rule, optionally including selected columns from the source data.

Value

If Result = FALSE: a data.frame summary with columns:

If Result = TRUE: a data.frame with one column per rule (variable), each containing logical values for every row, plus optional columns from the source data.
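
As a rough illustration of temporal rules and of the documented contract for 'date_parser_fun' (accept a character vector, return Date objects), the base R sketch below defines a hypothetical parser and applies two conditions in the style of the example rules. The parser handles only Gregorian ISO dates and does not convert Jalali dates, and whether the package requires anything beyond the documented contract is not verified here.

```r
# Illustrative only: a hypothetical parser for Gregorian ISO dates.
# It does not convert Jalali dates.
iso_date_parser <- function(x) {
  as.Date(as.character(x), format = "%Y-%m-%d")
}

visit <- iso_date_parser(c("2021-01-10", "2021-01-03", NA))
test  <- iso_date_parser(c("2021-01-09", "2021-01-14", NA))

# Two temporal conditions in the style of the example rules
in_order   <- visit <= test
within_10d <- as.numeric(difftime(test, visit, units = "days")) < 10

print(data.frame(visit, test, in_order, within_10d))
```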

Examples

# Source data
S_data <- data.frame(
  VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
  Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA)
)

# META DATA
M_data <- data.frame(
  VARIABLE = c("VisitDate","Test_date"),
  Currency_Rule = c(
    "",
    "VisitDate<=Test_date",
    " Test_date-VisitDate <10 ",
    ""),
  TYPE=c("date","date"),
  Currency_Error_Type = c("Error","warning"),
  stringsAsFactors = FALSE
)

result <- currency_check(
  S_data = S_data,
  M_data = M_data,
  Result = TRUE,
  show_column = NULL
)

print(result)

result <- currency_check(
  S_data = S_data,
  M_data = M_data,
  Result = FALSE,
  var_select = c("VisitDate","Test_date")
)

print(result)


Perform Plausibility Check for Data Frame Columns

Description

This function evaluates a source data frame ('S_data') against plausibility rules defined in a metadata data frame ('M_data'). It checks logical and clinical rules on columns of a data frame, including the logical range of variables, based on metadata specifications. It supports flexible rule definition and customizable output.

Usage

plausible_check(
  S_data,
  M_data,
  Result = FALSE,
  show_column = NULL,
  date_parser_fun = smart_to_gregorian_vec,
  var_select = "all",
  verbose = FALSE
)

Arguments

S_data

data.frame. The source data in which rules will be evaluated. Each column may be referenced by the rules.

M_data

data.frame. Metadata describing variables and their plausibility rules. Must include at least the columns VARIABLE, Plausible_Rule, TYPE, and Plausible_Error_Type.

Result

logical (default: FALSE). If TRUE, returns row-by-row evaluation results for each rule. If FALSE, returns a summary table for each rule.

show_column

character vector (default: NULL). Names of columns from S_data to include in the result when Result = TRUE. Ignored otherwise.

date_parser_fun

function (default: smart_to_gregorian_vec). Function used to convert date values or date literals to Date class, including conversion of Persian (Jalali) dates to Gregorian. Must accept character vectors and return Date objects.

var_select

character, numeric, or "all" (default: "all"). Subset of variables (rules) to check. Can be a character vector of variable names, numeric vector of row indices in M_data, or "all" to run all rules.

verbose

logical (default: FALSE). If TRUE, prints diagnostic messages during rule processing and evaluation.

Details

The metadata data.frame (M_data) must contain at least the following columns:

For each variable described in M_data, the function:

The function supports flexible rule definitions, including conditions involving multiple columns, and custom logic using R expressions.

If Result = FALSE, the function returns a summary table for each rule, including counts and percentages of rows that meet or do not meet the condition, as well as the error type from the metadata if present.

If Result = TRUE, the function returns a data.frame with one column per rule/variable, each containing logical values (TRUE, FALSE, or NA) for every row, plus any extra columns from S_data listed in show_column.
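
The relationship between row-level logical results and a per-rule summary can be sketched in base R as follows. The summary column names used here are hypothetical, chosen only for illustration, and the sketch does not take a position on whether TRUE marks a plausible or an implausible value in the package's output.

```r
# Illustrative only: hypothetical summary layout, not DQA's actual output columns.
diastolic <- c(110, NA, 145, 125, 30)

# Row-level evaluation of a range condition such as "val < 40 | val > 145"
met <- diastolic < 40 | diastolic > 145

summary_row <- data.frame(
  VARIABLE          = "DiastolicBP",
  Condition_Met     = sum(met, na.rm = TRUE),
  Condition_Not_Met = sum(!met, na.rm = TRUE),
  NA_Count          = sum(is.na(met)),
  Percent_Met       = round(100 * mean(met, na.rm = TRUE), 1)
)
print(summary_row)
```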

Value

If Result = FALSE: a data.frame summary with columns:

Examples

# Source data
S_data <- data.frame(
  National_code = c("123", "1456", "789","545","4454","554"),
  LastName = c("Aliyar","Johnson","Williams","Brown","Jones","Garcia"),
  VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
  Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA),
  Certificate_validity = c("2025-09-21", "2025-01-12", "2025-02-11","1403-06-28","2025-09-19",NA),
  DiastolicBP = c(110, NA, 145, 125,114,NA),
  SystolicBP = c(125, 150, NA, 110,100,NA),
  Prescription_drug= c("Atorvastatin", "Metformin", "Amlodipine",
  "Omeprazole", "Aspirin","Metoprolol"),
  Blood_type = c("A-","B+","AB","A+","O-","O+"),
  stringsAsFactors = FALSE
)

# META DATA
M_data <- data.frame(
  VARIABLE = c("National_code", "Certificate_validity", "VisitDate",
               "Test_date","LastName","DiastolicBP","SystolicBP",
               "Prescription_drug","Blood_type"),
  Plausible_Rule = c(
    "val<=123",
    "",
    "",
    "",
    "",
    "val < 40 | val > 145",
    "val < 50 | val > 230",
    "",
    ""),
  TYPE=c("numeric","date","date","date","character",
         "numeric","numeric","character","character"),
  Plausible_Error_Type = c("warning",NA,"Error","warning",NA,"warning","warning",NA,"Error"),
  stringsAsFactors = FALSE
)

result <- plausible_check(
  S_data = S_data,
  M_data = M_data,
  Result = TRUE,
  show_column = c("National_code")
)

print(result)

result <- plausible_check(
  S_data = S_data,
  M_data = M_data,
  Result = FALSE,
  var_select = c("DiastolicBP","SystolicBP")
)

print(result)


Perform Timeliness Check for Data Frame Columns

Description

This function evaluates a source data frame ('S_data') against a set of rules defined in a metadata data frame ('M_data'). It checks timeliness rules (temporal and availability conditions) on columns of a data frame, based on metadata specifications. It supports flexible rule definition, date literal handling, and customizable output.

Usage

timeliness_check(
  S_data,
  M_data,
  Result = FALSE,
  show_column = NULL,
  date_parser_fun = smart_to_gregorian_vec,
  var_select = "all",
  verbose = FALSE
)

Arguments

S_data

data.frame. The source data in which rules will be evaluated. Each column may be referenced by the rules.

M_data

data.frame. Metadata describing variables and their timeliness rules. Must include columns VARIABLE, Timeliness_Rule and TYPE. Optionally includes Timeliness_Error_Type.

Result

logical (default: FALSE). If TRUE, returns row-by-row evaluation results for each rule. If FALSE, returns a summary table for each rule.

show_column

character vector (default: NULL). Names of columns from S_data to include in the result when Result = TRUE. Ignored otherwise.

date_parser_fun

function (default: smart_to_gregorian_vec). Function used to convert date values or date literals to Date class, including conversion of Persian (Jalali) dates to Gregorian. Must accept character vectors and return Date objects.

var_select

character, numeric, or "all" (default: "all"). Subset of variables (rules) to check. Can be a character vector of variable names, numeric vector of row indices in M_data, or "all" to run all rules.

verbose

logical (default: FALSE). If TRUE, prints diagnostic messages during rule processing and evaluation.

Details

The metadata data.frame (M_data) must contain the following columns:

For each variable described in M_data, the function:

If Result = FALSE, returns a summary table with counts and percentages of rows meeting/not meeting each condition. If Result = TRUE, returns a data.frame with boolean results for each rule, optionally including selected columns from the source data.

Value

If Result = FALSE: a data.frame summary with columns:

If Result = TRUE: a data.frame with one column per rule (variable), each containing logical values for every row, plus optional columns from the source data.
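
The 'var_select' argument described under Arguments accepts variable names, row indices in M_data, or "all". One way such a selector could be resolved to row positions is sketched below in base R. The function is hypothetical and is not taken from the package source; it only illustrates the three accepted input forms.

```r
# Illustrative only: a hypothetical resolver, not taken from the DQA source.
resolve_var_select <- function(var_select, variables) {
  # "all" selects every metadata row
  if (identical(var_select, "all")) {
    return(seq_along(variables))
  }
  # Numeric input is treated as row indices in the metadata
  if (is.numeric(var_select)) {
    stopifnot(all(var_select >= 1), all(var_select <= length(variables)))
    return(as.integer(var_select))
  }
  # Character input is matched against the VARIABLE names
  idx <- match(var_select, variables)
  if (anyNA(idx)) {
    stop("Unknown variable(s): ", paste(var_select[is.na(idx)], collapse = ", "))
  }
  idx
}

variables <- c("VisitDate", "Test_date")
print(resolve_var_select("all", variables))
print(resolve_var_select(2, variables))
print(resolve_var_select("Test_date", variables))
```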

Examples

# Source data
S_data <- data.frame(
  VisitDate = c("2025-09-23", "2021-01-10", "2021-01-03","1404-06-28","1404-07-28",NA),
  Test_date = c("1404-07-01", "2021-01-09", "2021-01-14","1404-06-29","2025-09-19",NA)
)

# META DATA
M_data <- data.frame(
  VARIABLE = c("VisitDate","Test_date"),
  Timeliness_Rule = c(
    "",
    "VisitDate<=Test_date",
    " Test_date-VisitDate <10 ",
    ""),
  TYPE=c("date","date"),
  Timeliness_Error_Type = c("Error","warning"),
  stringsAsFactors = FALSE
)

result <- timeliness_check(
  S_data = S_data,
  M_data = M_data,
  Result = TRUE,
  show_column = NULL
)

print(result)

result <- timeliness_check(
  S_data = S_data,
  M_data = M_data,
  Result = FALSE,
  var_select = c("VisitDate","Test_date")
)

print(result)