study_id_patterns.py — Study-ID Regex Registry
BCHPR · 2023 – present
2,611-line centralised registry of 8 study-ID patterns and 14 site-code patterns across Cameroon, Nigeria, and Vietnam projects — with vectorised extraction, validation, classification, and cleaning.
Highlights
- 8 ID patterns: GHIT Cameroon / Nigeria / Vietnam · Image Quality · Wave 11 screening & testing · RapidTB · Start4All.
- extract_ids, extract_multi_ids, extract_all_ids for wide / long / multi-column extraction modes.
- Validation output as boolean flags, type assignment, or structured DQA reports with five statuses (VALID / INVALID_FORMAT / WRONG_PROJECT / WRONG_COUNTRY / DUPLICATE).
- Pre-compiled regex with @property caching and single-pass combined patterns: each value is scanned once rather than once per pattern, O(n) instead of O(n × p) for n values and p patterns.
- Negative lookahead for Wave 11 screening-vs-testing mutual exclusivity (IDs ending in X).
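As a rough illustration of the last two highlights, the sketch below joins several patterns into one pre-compiled alternation of named groups, so a single scan both finds and classifies each ID. It also uses a negative lookahead to keep the screening and testing patterns from matching the same ID. The ID formats, the PATTERNS dictionary, and the classify_ids helper are all hypothetical, invented for this example, and the assumption that testing IDs are the ones ending in X is not confirmed here. The registry's real patterns and function signatures may differ.

```python
import re

# Hypothetical ID formats, invented for illustration only.
# They are NOT the registry's real patterns.
PATTERNS = {
    # Assumed: screening IDs are "W11-" plus four digits, and the negative
    # lookahead rejects any candidate that is immediately followed by X.
    "wave11_screening": r"W11-\d{4}(?!X)",
    # Assumed: testing IDs share the same stem but end in X.
    "wave11_testing": r"W11-\d{4}X",
    "rapidtb": r"RTB-\d{5}",
}

# One alternation of named groups, compiled once at import time.
# Each alternative starts at a word boundary; there is deliberately no
# trailing guard here, so the lookahead alone separates the Wave 11 types.
COMBINED = re.compile(
    "|".join(rf"(?P<{name}>\b{body})" for name, body in PATTERNS.items())
)


def classify_ids(text):
    """Return (pattern_name, matched_id) pairs found in one scan of text."""
    # m.lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group()) for m in COMBINED.finditer(text)]


found = classify_ids("Seen W11-0042, W11-0042X and RTB-12345 today")
for name, value in found:
    print(name, value)
```

Without the lookahead, the screening alternative is tried first and would claim the W11-0042 prefix of W11-0042X, mislabelling a testing ID. A production pattern would likely also need a stricter trailing boundary, which this sketch leaves out to keep the lookahead's role visible.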
Related projects
Architect
my_functions.py — Centralised Python Library
The 21,086-line shared Python library that every BCHPR data project depends on — APIManager, PathsManager, REDCap wrappers, study-ID generation, SharePoint I/O, and dozens of cross-project utilities.
Architect
data_quality_manager.py — Enterprise DQA Framework
11,007-line data quality platform with fluent QueryBuilder, persistent query lifecycle tracking, duplicate analysis, and double-data-entry verification across 28+ instruments, backed by SQLite persistence and Polars acceleration.
Engineer
date_utils.py — Date Parser & Power BI Calendar Generator
3,746-line date engine handling 60+ formats, Excel serials, timezone conversion, and Power BI dimension tables with 99+ attributes (fiscal periods, holidays, relative categories, sort orders).