data_utilities.py — DataFrame Toolbox
BCHPR · 2023 – present
5,764-line DataFrame utility library: smart deduplication with null prioritisation, column manipulation, HTML cleaning, Cameroon-specific phone-number standardisation, and Polars bulk operations.
Highlights
- Deduplication that prioritises non-null values and accepts custom sort columns (e.g. prefer the most recent date).
- Bulk column operations: ensure_columns, move_column_after, rename_in_bulk, drop_duplicates_prioritize_non_null.
- Cameroon phone cleaner with international-format standardisation and SMS-friendly output.
- Polars ⇄ pandas bridge that hands large operations to Polars automatically, for a 5-10× speedup.
- HTML-tag stripper for REDCap text exports with malformed markup.
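To make the highlights above more concrete, here is a minimal sketch of three of the helpers. It is not the library's actual code. Only the name drop_duplicates_prioritize_non_null comes from the list above; its signature, the names standardise_cm_phone and strip_html_tags, and all behaviour shown are assumptions for illustration. The phone helper assumes Cameroon's country code 237 and 9-digit national numbers starting with 6 (mobile) or 2 (fixed line).

```python
import html
import re

import pandas as pd


def drop_duplicates_prioritize_non_null(df, subset, sort_by=None, ascending=False):
    """Keep one row per `subset` key, preferring the most complete row.

    Assumed behaviour: rows are ranked by non-null count first, then by
    `sort_by` (descending by default, so the most recent date wins ties).
    """
    if isinstance(sort_by, str):
        sort_by = [sort_by]
    sort_by = list(sort_by or [])
    work = df.copy()
    # Helper column: number of non-null cells in each row.
    work["_non_null"] = work.notna().sum(axis=1)
    work = work.sort_values(
        ["_non_null", *sort_by],
        ascending=[False, *([ascending] * len(sort_by))],
    )
    work = work.drop_duplicates(subset=subset, keep="first")
    # Drop the helper column and restore the original row order.
    return work.drop(columns="_non_null").sort_index()


def standardise_cm_phone(raw):
    """Return '+237' plus 9 digits, or None if the value does not fit (hypothetical helper)."""
    digits = re.sub(r"\D", "", str(raw))
    if digits.startswith("00237"):
        digits = digits[5:]
    elif digits.startswith("237") and len(digits) == 12:
        digits = digits[3:]
    if len(digits) == 9 and digits[0] in "26":
        return "+237" + digits
    return None


# Matches well-formed tags and a trailing tag that was never closed.
_TAG = re.compile(r"</?[A-Za-z][^>]*(?:>|$)")


def strip_html_tags(text):
    """Remove tags, decode entities, collapse whitespace (hypothetical helper)."""
    no_tags = _TAG.sub(" ", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()
```

Tags are removed before entities are decoded, so an escaped `&lt;` in the text is not mistaken for markup. A regex is used here rather than an HTML parser only because the input is described as malformed; the real library may take a different approach.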
Related projects
Architect
my_functions.py — Centralised Python Library
The 21,086-line shared Python library that every BCHPR data project depends on — APIManager, PathsManager, REDCap wrappers, study-ID generation, SharePoint I/O, and dozens of cross-project utilities.
Architect
data_quality_manager.py — Enterprise DQA Framework
11,007-line data quality platform with fluent QueryBuilder, persistent query lifecycle tracking, duplicate analysis, and double-data-entry verification across 28+ instruments — with SQLite persistence and Polars acceleration.
Engineer
study_id_patterns.py — Study-ID Regex Registry
2,611-line centralised registry of 8 study-ID patterns and 14 site-code patterns across Cameroon, Nigeria, and Vietnam projects — with vectorised extraction, validation, classification, and cleaning.