data_quality_manager.py — Enterprise DQA Framework
BCHPR · 28+ instruments · 2023 – present
An 11,007-line data quality platform with a fluent QueryBuilder, persistent query lifecycle tracking, duplicate analysis, and double-data-entry verification across 28+ instruments, backed by SQLite persistence and Polars acceleration.
Highlights
- DataFrameComparator with 7 comparison modes, including date tolerance, numeric tolerance, string normalisation, and blank-matching.
- AutoQueryTracker: SQLite-backed open/closed/aged query lifecycle with stable hash IDs across runs.
- DuplicateRecordAnalyzer: detects duplicates within and across instruments, reporting concordance percentages and reconciliation plans.
- DoubleDataEntryTracker: quality scores (concordance × completion) with persistent discordance detection.
- DataValidationTracker: SQLite-backed row-count history that raises DataValidationError before a file is overwritten with fewer rows.
- Formal 7-dimension DQ taxonomy (Completeness · Validity · Accuracy · Consistency · Timeliness · Uniqueness · Integrity).
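The row-loss guard described in the DataValidationTracker bullet can be illustrated with a minimal sketch. Only the DataValidationError name comes from the text above; the check_row_loss function, the row_history table, and its columns are hypothetical and are not the framework's actual API.

```python
import sqlite3


class DataValidationError(Exception):
    """Raised when a write would shrink a tracked file (behaviour assumed)."""


def check_row_loss(conn: sqlite3.Connection, file_key: str, new_rows: int) -> None:
    """Record new_rows for file_key, refusing any count below the last one seen.

    Hypothetical helper: the table layout is an assumption for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS row_history ("
        "file_key TEXT NOT NULL, "
        "row_count INTEGER NOT NULL, "
        "recorded_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    # Most recent accepted row count for this file, if any.
    last = conn.execute(
        "SELECT row_count FROM row_history "
        "WHERE file_key = ? ORDER BY rowid DESC LIMIT 1",
        (file_key,),
    ).fetchone()
    if last is not None and new_rows < last[0]:
        # Fail before the caller overwrites the file; nothing is recorded.
        raise DataValidationError(
            f"{file_key}: refusing to overwrite {last[0]} rows with {new_rows}"
        )
    conn.execute(
        "INSERT INTO row_history (file_key, row_count) VALUES (?, ?)",
        (file_key, new_rows),
    )
    conn.commit()
```

A caller would run check_row_loss just before writing the file, so a shrinking export fails loudly instead of silently replacing a larger one. Rejected counts are not stored, so the comparison baseline stays at the last accepted write.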
Related projects
Architect
my_functions.py — Centralised Python Library
The 21,086-line shared Python library that every BCHPR data project depends on — APIManager, PathsManager, REDCap wrappers, study-ID generation, SharePoint I/O, and dozens of cross-project utilities.
Engineer
study_id_patterns.py — Study-ID Regex Registry
2,611-line centralised registry of 8 study-ID patterns and 14 site-code patterns across Cameroon, Nigeria, and Vietnam projects — with vectorised extraction, validation, classification, and cleaning.
Engineer
date_utils.py — Date Parser & Power BI Calendar Generator
3,746-line date engine handling 60+ formats, Excel serials, timezone conversion, and Power BI dimension tables with 99+ attributes (fiscal periods, holidays, relative categories, sort orders).
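As an illustration of the Excel-serial handling mentioned above, here is a minimal sketch that converts a serial from Excel's 1900 date system to a Python datetime. The function name excel_serial_to_datetime is hypothetical and is not taken from date_utils.py; the 1900 leap-year quirk it handles is a documented Excel behaviour.

```python
from datetime import datetime, timedelta


def excel_serial_to_datetime(serial: float) -> datetime:
    """Convert an Excel 1900-date-system serial to a naive datetime.

    Hypothetical helper for illustration. Excel treats 1900 as a leap year,
    so serial 60 maps to the non-existent 1900-02-29 and every later serial
    is offset by one day relative to the real calendar.
    """
    if serial < 1:
        raise ValueError("serials below 1 have no calendar date")
    if 60 <= serial < 61:
        raise ValueError("serial 60 is Excel's fictitious 1900-02-29")
    if serial < 60:
        # Before the phantom leap day: serial 1 is 1900-01-01.
        epoch = datetime(1899, 12, 31)
    else:
        # After it: shift the epoch back one day to absorb the extra day.
        epoch = datetime(1899, 12, 30)
    # The fractional part of the serial carries the time of day.
    return epoch + timedelta(days=serial)
```

For example, excel_serial_to_datetime(45000) yields 2023-03-15 00:00, and a serial of 45000.5 yields noon on the same day. Workbooks saved with the 1904 date system would need a different epoch, which this sketch does not cover.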