Extract, Load, Transform (ELT)

Emory's OMOP pipeline follows an ELT pattern — data is extracted from source systems (Epic, CDW), loaded into a staging area, and then transformed into the OMOP CDM using DBT (Data Build Tool).

Architecture

| Component | Role |
| --- | --- |
| Source systems | Epic Clarity, CDW — the raw clinical data |
| DBT | Transforms raw data into OMOP CDM tables, generates documentation, and runs data quality tests |
| Apache Airflow | Orchestrates scheduled model runs and manages pipeline dependencies |
| Amazon Redshift | The final OMOP data warehouse where researchers query data |
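To illustrate the transform step, here is a minimal sketch of what a DBT model producing an OMOP table could look like. The model, source, and column names (`stg_clarity__patients`, `gender_concept_map`, `sex_code`) are hypothetical, not Emory's actual project names:

```sql
-- Hypothetical dbt model: models/omop/person.sql
-- Maps a staging view of Epic Clarity patients onto the OMOP person table.
-- All source/model names here are illustrative.

{{ config(materialized='table', schema='omop') }}

select
    pat.patient_key                    as person_id,
    coalesce(g.concept_id, 0)          as gender_concept_id,
    extract(year  from pat.birth_date) as year_of_birth,
    extract(month from pat.birth_date) as month_of_birth,
    extract(day   from pat.birth_date) as day_of_birth,
    pat.patient_key                    as person_source_value
from {{ ref('stg_clarity__patients') }} as pat
left join {{ ref('gender_concept_map') }} as g
    on pat.sex_code = g.source_code
```

Because the model is plain SQL templated with `ref()` calls, DBT can infer the dependency graph between models, which is what Airflow then schedules as a whole.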

Documentation

ETL documentation is generated continuously from the DBT project itself — every model, column description, and test result is auto-documented as part of each pipeline run.
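The column descriptions and tests that feed this generated documentation are declared alongside each model in the project's schema files. A minimal sketch, with illustrative model and column names:

```yaml
# Hypothetical schema file: models/omop/schema.yml
# Descriptions surface in the generated dbt docs site; tests run via `dbt test`.
version: 2

models:
  - name: person
    description: "One row per patient, mapped to the OMOP person table."
    columns:
      - name: person_id
        description: "Surrogate key for the patient."
        tests:
          - unique
          - not_null
      - name: gender_concept_id
        description: "OMOP standard concept for administrative sex."
        tests:
          - not_null
```

Running `dbt docs generate` after a pipeline run rebuilds the documentation site from these declarations and the latest run results.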

Emory Enterprise OMOP DBT Documentation

Versioning

The pipeline implements a DataOps versioning paradigm (see Data Quality Design) in which code, data, and subsamples are each versioned, with their versions recorded in the documentation and test-result tracking system. This ensures reproducibility and transparency across the ETL process.