DRAFT — Internal Review Only · Not for distribution in wide release · See releases and roadmap for details
80% of clinical data never reaches your database.
It lives in physician notes, discharge summaries, and radiology reports.
NLP can extract it — but can you trust what it found?
The promise

NLP turns free text into structured clinical data

A discharge summary reads: "Patient presents with worsening shortness of breath, elevated troponin at 0.42 ng/mL, and a history of poorly controlled type 2 diabetes."

An NLP pipeline reads that sentence and extracts three structured facts: a condition (dyspnea), a measurement (troponin 0.42 ng/mL), and a comorbidity (type 2 diabetes). These facts flow into research databases, power cohort definitions, and inform clinical decision support.
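As a concrete sketch of what "three structured facts" can look like, here is one plausible shape for a pipeline's output for that sentence. The field names (entity_type, text_span, concept, confidence) are illustrative, not any specific pipeline's actual schema:

```python
note_text = (
    "Patient presents with worsening shortness of breath, elevated "
    "troponin at 0.42 ng/mL, and a history of poorly controlled "
    "type 2 diabetes."
)

# One dict per extracted fact; each records the exact source text
# it came from, so it can be verified against the note later.
extracted_facts = [
    {"entity_type": "condition", "text_span": "shortness of breath",
     "concept": "dyspnea", "confidence": 0.97},
    {"entity_type": "measurement", "text_span": "troponin at 0.42 ng/mL",
     "concept": "troponin", "value": 0.42, "unit": "ng/mL",
     "confidence": 0.99},
    {"entity_type": "condition", "text_span": "type 2 diabetes",
     "concept": "type 2 diabetes mellitus", "confidence": 0.95},
]

for fact in extracted_facts:
    assert fact["text_span"] in note_text
```

Note that even this toy output carries more context (exact span, confidence) than many published studies report, which is the gap the next section describes.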

The field has advanced rapidly. Rule-based systems, statistical NER, transformer models, and now large language models can all perform clinical extraction with increasing accuracy. Institutions across the OHDSI network are deploying NLP pipelines against millions of clinical notes.

But there is a problem growing quietly beneath the surface.


The reporting crisis

Most NLP studies don't tell you how they work

In 2023, Sunyang Fu and colleagues at Mayo Clinic published a landmark scoping review examining how NLP-assisted observational studies report their methods. They reviewed 50 studies published between 2009 and 2021.

What they found was alarming.

58% failed to report model definitions
74% omitted normalization techniques
14% provided no NLP evaluation at all

Only 12% of studies reported all three essential definitions: the model used, the normalization vocabulary, and the context parameters (negation, temporality, experiencer). The rest left readers with no way to reproduce, validate, or even understand the NLP that generated the clinical data they were analyzing.

"The absence of detailed reporting guidelines may create ambiguity in the use of NLP-derived content, knowledge gaps in the current research reporting practices, and reproducibility challenges."

— Fu et al., Clinical and Translational Science, 2023

Read the full paper  ·  See Figure 6: Reporting gap breakdown

This isn't a minor documentation gap. When a researcher builds a cohort using NLP-extracted conditions, they are making clinical decisions based on outputs from a system whose configuration, training data, confidence thresholds, and versioning are invisible.


The LLM era

Large language models made the problem bigger, not smaller

The arrival of GPT-4, Claude, and domain-specific clinical LLMs has supercharged NLP capabilities. Extraction tasks that once required custom-trained models now work with a prompt. The barrier to deploying NLP against clinical notes has never been lower.

But the reporting problem has gotten worse. When the model is a black-box API, the provenance chain collapses entirely. What prompt was used? What version of the model? What temperature? Was the output post-processed? How was the concept mapped? These details rarely survive past the developer who wrote the script.
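The questions above can be answered only if someone wrote the answers down at call time. A minimal sketch of what such a record might contain, assuming nothing beyond this document (the field names, the fingerprint scheme, and the example values are all hypothetical):

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import hashlib
import json

@dataclass(frozen=True)
class LLMExtractionRecord:
    """Provenance for one LLM extraction call.

    The field set mirrors the questions in the text: prompt, model
    version, temperature, post-processing, concept mapping.
    """
    model_name: str
    model_version: str
    temperature: float
    prompt_template: str   # the full template, not a summary of it
    postprocessing: str    # e.g. "none" or a named, versioned step
    concept_mapping: str   # how outputs were mapped to a vocabulary
    executed_at_utc: str

    def fingerprint(self) -> str:
        """Stable hash so downstream rows can cite this exact call."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:16]

record = LLMExtractionRecord(
    model_name="gpt-4",
    model_version="(pin the exact snapshot here)",
    temperature=0.0,
    prompt_template="Extract conditions from: {note_text}",
    postprocessing="none",
    concept_mapping="exact string match to a standard vocabulary",
    executed_at_utc=datetime.now(timezone.utc).isoformat(),
)
```

Storing the fingerprint alongside every extracted row is one way to keep these details from dying with the developer's script.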

The community has noticed. In January 2025, TRIPOD-LLM was published in Nature Medicine — a 19-item, 50-subitem checklist specifically for reporting studies that use large language models in healthcare. A month later, FUTURE-AI appeared in The BMJ — a 30-recommendation lifecycle framework for trustworthy AI built by 117 experts from 50 countries, organized around six principles: Fairness, Universality, Traceability, Usability, Robustness, and Explainability.

These are important steps. But checklists and frameworks solve the publication and governance problem, not the production problem. A TRIPOD-LLM–compliant paper and a FUTURE-AI–aligned development process still don't help a downstream researcher who joins your NLP-extracted conditions to their cohort six months later and needs to know: which pipeline, which version, which confidence threshold, which execution date?


The trust question

If you can't trace it, you can't trust it

Imagine two rows in a condition_occurrence table. Both say the patient has type 2 diabetes. One came from an ICD-10 code entered by a physician during an encounter. The other was extracted from a radiology report by an NLP pipeline you've never heard of, running a model version that may no longer exist, with a confidence score that was rounded before storage.

These two rows look identical. They sit in the same table, share the same schema, and will both be included in your cohort query. But they have fundamentally different provenance, fundamentally different confidence levels, and fundamentally different implications for your research.

This is the core problem: NLP-derived data is structurally indistinguishable from discrete clinical data once it lands in the CDM. And the metadata needed to distinguish them — the pipeline, the model, the execution, the confidence — has no standard place to live.
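The indistinguishability is easy to demonstrate. Below, the column names follow the OMOP CDM (201826 is the standard concept id for type 2 diabetes mellitus), while the provenance dicts are hypothetical side metadata with no standard home in the CDM itself:

```python
# Two condition_occurrence rows, identical on every schema column
# a cohort query can see.
physician_coded = {"person_id": 42, "condition_concept_id": 201826,
                   "condition_start_date": "2024-03-01"}
nlp_extracted = {"person_id": 42, "condition_concept_id": 201826,
                 "condition_start_date": "2024-03-01"}

# Everything that distinguishes them lives outside the table.
provenance = [
    {"source": "ICD-10 code entered by a physician at the encounter"},
    {"source": "NLP pipeline (version unrecorded), radiology report",
     "confidence": 0.81},  # rounded before storage
]

# To any query over the schema columns, these are the same row:
assert physician_coded == nlp_extracted
```

Any cohort query that selects on person_id and concept id will pick up both rows, with no way to weigh them differently.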

The question isn't whether NLP works. It's whether the people who consume NLP outputs have the metadata they need to decide if those outputs are appropriate for their specific research question.

This is why NLP methods centralization matters. Not centralization of which model you use — use any model, any framework, any LLM. But centralization of how you record what happened. Open, structured, queryable metadata that follows the data from extraction through to the research table.


The standard

Every NLP-derived observation should be fully traceable

At Emory, we are building an NLP infrastructure for OMOP that treats metadata as a first-class citizen. Every extracted fact carries a provenance chain from the research table back to the original note:

measurement_DERIVED row
  → note_nlp_modifier (what attributes were applied)
    → note_span (exact text, offsets, confidence)
      → nlp_execution (when, which pipeline)
        → pipeline + components (what config)
          → nlp_system (which system, what version)
            → note (the original clinical text)
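The chain above is, in effect, a sequence of foreign-key hops. A toy in-memory version, condensed for brevity (note_nlp_modifier and note are omitted; the table names come from the diagram, but the *_id column names and row values are assumptions):

```python
# One row per table; character offsets and values are illustrative.
tables = {
    "measurement_DERIVED": {1: {"value": 0.42, "unit": "ng/mL",
                                "note_span_id": 10}},
    "note_span": {10: {"text": "troponin at 0.42 ng/mL",
                       "offsets": (62, 84), "confidence": 0.99,
                       "nlp_execution_id": 100}},
    "nlp_execution": {100: {"run_at": "2025-01-15T09:30:00Z",
                            "pipeline_id": 1000}},
    "pipeline": {1000: {"components": ["sentence_split", "ner"],
                        "nlp_system_id": 5}},
    "nlp_system": {5: {"name": "medspacy", "version": "1.x"}},
}

def trace(measurement_id: int) -> list[tuple[str, dict]]:
    """Follow foreign keys from a derived measurement back to the system."""
    chain = []
    row = tables["measurement_DERIVED"][measurement_id]
    chain.append(("measurement_DERIVED", row))
    row = tables["note_span"][row["note_span_id"]]
    chain.append(("note_span", row))
    row = tables["nlp_execution"][row["nlp_execution_id"]]
    chain.append(("nlp_execution", row))
    row = tables["pipeline"][row["pipeline_id"]]
    chain.append(("pipeline", row))
    row = tables["nlp_system"][row["nlp_system_id"]]
    chain.append(("nlp_system", row))
    return chain

for table_name, row in trace(1):
    print(table_name, row)
```

In the real schema each hop is a SQL join, but the point is the same: every derived value resolves, step by step, to an exact span in an exact note produced by an exact pipeline version.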

NLP-extracted data lands in tables with a _DERIVED suffix (condition_DERIVED, measurement_DERIVED, drug_DERIVED) that mirror the OMOP schema but are structurally separated from discrete clinical data. Researchers choose when and how to join them.
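What "researchers choose when and how to join them" can look like in practice, sketched with in-memory SQLite (the table and column names follow the OMOP pattern described above; the confidence column and the 0.8 floor are illustrative assumptions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE condition_occurrence (person_id INT, condition_concept_id INT);
    CREATE TABLE condition_DERIVED    (person_id INT, condition_concept_id INT,
                                       confidence REAL);
    INSERT INTO condition_occurrence VALUES (1, 201826);
    INSERT INTO condition_DERIVED    VALUES (2, 201826, 0.81), (3, 201826, 0.45);
""")

# Discrete-only cohort: the default, with no NLP rows mixed in.
discrete = con.execute(
    "SELECT person_id FROM condition_occurrence "
    "WHERE condition_concept_id = 201826"
).fetchall()
print(discrete)  # [(1,)]

# Opting in to NLP-derived rows, with an explicit confidence floor
# and a source label so the two kinds of evidence stay distinguishable.
combined = con.execute("""
    SELECT person_id, 'discrete' AS source FROM condition_occurrence
      WHERE condition_concept_id = 201826
    UNION ALL
    SELECT person_id, 'derived' FROM condition_DERIVED
      WHERE condition_concept_id = 201826 AND confidence >= 0.8
""").fetchall()
```

The separation makes the inclusion of NLP-derived evidence a visible, documented analytic decision rather than an invisible default.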

The infrastructure is model-agnostic. MedSpaCy, BioBERT, GPT-4, a custom LSTM — it doesn't matter. What matters is that the system, the pipeline, the components, the execution, and the confidence are all recorded in a standard, queryable schema.

This is what moves NLP from "it works on my laptop" to "I can defend this in a methods section, and a downstream researcher can audit it two years from now."

Read the full architecture

Our NOTE and NLP Infrastructure whitepaper describes the 4-layer, 13-table schema that makes this possible — from pipeline registration through _DERIVED tables.