# Extract, Load, Transform (ELT)
Emory's OMOP pipeline follows an ELT pattern — data is extracted from source systems (Epic Clarity, CDW), loaded into a staging area, and then transformed into the OMOP CDM using DBT (Data Build Tool).
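As a rough illustration of the transform step, a DBT model is a SQL `select` that reshapes staged source data into a CDM table. The model and column names below are hypothetical, not taken from the actual project:

```sql
-- models/omop/person.sql (hypothetical names, for illustration only)
-- Reshapes a staged Epic Clarity patient view into the OMOP CDM person table.
select
    pat.patient_key       as person_id,
    gender_map.concept_id as gender_concept_id,
    pat.birth_year        as year_of_birth
from {{ ref('stg_clarity__patients') }} as pat
left join {{ ref('gender_concept_map') }} as gender_map
    on pat.sex_code = gender_map.source_code
```

The `ref()` calls are how DBT records dependencies between models; it uses them to build the run order and the lineage graph in the generated documentation.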
## Architecture
| Component | Role |
|---|---|
| Source systems | Epic Clarity, CDW — the raw clinical data |
| DBT | Transforms raw data into OMOP CDM tables, generates documentation, and runs data quality tests |
| Apache Airflow | Orchestrates scheduled model runs and manages pipeline dependencies |
| Amazon Redshift | The data warehouse hosting the final OMOP CDM, where researchers run queries |
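The orchestration idea behind Airflow and DBT is dependency-ordered execution: a model runs only after everything it selects from has been built. A minimal sketch of that ordering in plain Python (no Airflow required; the model names are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical DBT-style dependency graph: each OMOP model
# depends on the staging models (or other CDM models) it reads from.
deps = {
    "stg_clarity__patients": set(),
    "stg_clarity__encounters": set(),
    "person": {"stg_clarity__patients"},
    "visit_occurrence": {"stg_clarity__encounters", "person"},
}

# Airflow (and `dbt run` itself) materializes models in an order where
# every model runs strictly after its upstream dependencies.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)
```

In the real pipeline, Airflow schedules the DBT runs and retries failures; the topological ordering itself comes from the `ref()` graph inside the DBT project.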
## Documentation
ELT documentation is generated continuously from the DBT project itself — every model, column description, and test result is documented automatically as part of each pipeline run.
Emory Enterprise OMOP DBT Documentation
## Versioning
The pipeline implements a DataOps versioning paradigm (see Data Quality Design) in which code, data, and subsamples are each versioned and tracked within the documentation and test result tracking system. This ensures reproducibility and transparency across the ELT process.
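The data quality tests mentioned above are typically declared alongside the models in DBT schema files. A hedged sketch of what such a declaration can look like (hypothetical file and column names, using DBT's built-in generic tests):

```yaml
# models/omop/schema.yml (hypothetical — illustrates DBT's built-in tests)
version: 2
models:
  - name: person
    description: "OMOP CDM person table"
    columns:
      - name: person_id
        tests:
          - not_null
          - unique
      - name: gender_concept_id
        tests:
          - not_null
```

Each declared test compiles to a SQL query at run time; its pass/fail result is what feeds the test result tracking described here.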
## Related Pages
- Data Quality Design — the DataOps framework behind our testing approach
- Data Quality Results — current test pass/fail status per table
- Vocabulary Mapping Coverage — mapping completeness across CVB vocabulary projects