A discharge summary reads: "Patient presents with worsening shortness of breath, elevated troponin at 0.42 ng/mL, and a history of poorly controlled type 2 diabetes."
An NLP pipeline reads that sentence and extracts three structured facts: a condition (dyspnea), a measurement (troponin 0.42 ng/mL), and a comorbidity (type 2 diabetes). These facts flow into research databases, power cohort definitions, and inform clinical decision support.
The field has advanced rapidly. Rule-based systems, statistical NER, transformer models, and now large language models can all perform clinical extraction with increasing accuracy. Institutions across the OHDSI network are deploying NLP pipelines against millions of clinical notes.
But there is a problem growing quietly beneath the surface.
In 2023, Sunyang Fu and colleagues at Mayo Clinic published a landmark scoping review examining how NLP-assisted observational studies report their methods. They reviewed 50 studies published between 2009 and 2021.
What they found was alarming.
Only 12% of studies reported all three essential definitions: the model used, the normalization vocabulary, and the context parameters (negation, temporality, experiencer). The rest left readers with no way to reproduce, validate, or even understand the NLP that generated the clinical data they were analyzing.
"The absence of detailed reporting guidelines may create ambiguity in the use of NLP-derived content, knowledge gaps in the current research reporting practices, and reproducibility challenges."
— Fu et al., Clinical and Translational Science, 2023 (Figure 6 of the paper breaks down the reporting gap)
This isn't a minor documentation gap. When a researcher builds a cohort using NLP-extracted conditions, they are making clinical decisions based on outputs from a system whose configuration, training data, confidence thresholds, and versioning are invisible.
The arrival of GPT-4, Claude, and domain-specific clinical LLMs has supercharged NLP capabilities. Extraction tasks that once required custom-trained models now work with a prompt. The barrier to deploying NLP against clinical notes has never been lower.
But the reporting problem has gotten worse. When the model is a black-box API, the provenance chain collapses entirely. What prompt was used? What version of the model? What temperature? Was the output post-processed? How was the concept mapped? These details rarely survive past the developer who wrote the script.
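One way to keep those details from evaporating is to capture them at extraction time, in a structured record stored alongside the output. A minimal sketch, with illustrative field names rather than any published schema:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class LLMExtractionRecord:
    """Provenance for one LLM-based extraction run (illustrative fields only)."""
    model_name: str        # e.g. "gpt-4"
    model_version: str     # the exact API/model snapshot identifier
    prompt_template: str   # the full prompt text, not a summary
    temperature: float
    post_processing: str   # any cleanup applied to the raw output
    concept_mapping: str   # how free text was mapped to a vocabulary
    executed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

record = LLMExtractionRecord(
    model_name="gpt-4",
    model_version="hypothetical-2024-05-snapshot",
    prompt_template="Extract conditions as JSON: {note_text}",
    temperature=0.0,
    post_processing="JSON parse; malformed items dropped",
    concept_mapping="string match against SNOMED CT preferred terms",
)
# asdict() yields a row ready to store next to the extracted facts
row = asdict(record)
```

The point is not this particular shape; it is that every question in the paragraph above has a named, queryable slot instead of living only in the developer's script.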
The community has noticed. In January 2025, TRIPOD-LLM was published in Nature Medicine — a 19-item, 50-subitem checklist specifically for reporting studies that use large language models in healthcare. A month later, FUTURE-AI appeared in The BMJ — a 30-recommendation lifecycle framework for trustworthy AI built by 117 experts from 50 countries, organized around six principles: Fairness, Universality, Traceability, Usability, Robustness, and Explainability.
These are important steps. But checklists and frameworks solve the publication and governance problem, not the production problem. A TRIPOD-LLM–compliant paper and a FUTURE-AI–aligned development process still don't help a downstream researcher who joins your NLP-extracted conditions to their cohort six months later and needs to know: which pipeline, which version, which confidence threshold, which execution date?
Imagine two rows in a condition_occurrence table. Both say the patient has type 2 diabetes. One came from an ICD-10 code entered by a physician during an encounter. The other was extracted from a radiology report by an NLP pipeline you've never heard of, running a model version that may no longer exist, with a confidence score that was rounded before storage.
These two rows look identical. They sit in the same table, share the same schema, and will both be included in your cohort query. But they have fundamentally different provenance, fundamentally different confidence levels, and fundamentally different implications for your research.
This is the core problem: NLP-derived data is structurally indistinguishable from discrete clinical data once it lands in the CDM. And the metadata needed to distinguish them — the pipeline, the model, the execution, the confidence — has no standard place to live.
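The indistinguishability is easy to demonstrate. In the sketch below (a toy schema, not the actual OMOP DDL), two identical condition rows can only be told apart through a separate, invented provenance side table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER
);
-- Illustrative side table; the CDM has no standard home for this today.
CREATE TABLE nlp_provenance (
    condition_occurrence_id INTEGER,
    pipeline TEXT, model_version TEXT, confidence REAL
);
""")
# Row 1: physician-entered code. Row 2: NLP-extracted from a radiology report.
conn.execute("INSERT INTO condition_occurrence VALUES (1, 42, 201826)")
conn.execute("INSERT INTO condition_occurrence VALUES (2, 42, 201826)")
conn.execute("INSERT INTO nlp_provenance VALUES (2, 'radiology-nlp', 'v0.9', 0.81)")

# A cohort query sees two identical facts...
rows = conn.execute(
    "SELECT condition_concept_id FROM condition_occurrence WHERE person_id = 42"
).fetchall()
# ...and only the provenance join reveals which one is NLP-derived.
nlp_rows = conn.execute("""
    SELECT c.condition_occurrence_id, p.pipeline, p.confidence
    FROM condition_occurrence c
    JOIN nlp_provenance p USING (condition_occurrence_id)
""").fetchall()
```

Without the side table, nothing in the cohort query distinguishes the two rows; with it, the second row carries a pipeline, a version, and a confidence.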
The question isn't whether NLP works. It's whether the people who consume NLP outputs have the metadata they need to decide if those outputs are appropriate for their specific research question.
This is why NLP methods centralization matters. Not centralization of which model you use — use any model, any framework, any LLM. But centralization of how you record what happened. Open, structured, queryable metadata that follows the data from extraction through to the research table.
At Emory, we are building an NLP infrastructure for OMOP that treats metadata as a first-class citizen. Every extracted fact carries a provenance chain from the research table back to the original note:
NLP-extracted data lands in _DERIVED suffix tables — condition_DERIVED, measurement_DERIVED, drug_DERIVED — that mirror the OMOP schema but are structurally separated from discrete clinical data. Researchers choose when and how to join them.
The infrastructure is model-agnostic. MedSpaCy, BioBERT, GPT-4, a custom LSTM — it doesn't matter. What matters is that the system, the pipeline, the components, the execution, and the confidence are all recorded in a standard, queryable schema.
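Model-agnosticism falls out naturally when the execution record, not the model, defines the schema. A toy sketch, with invented field names, showing a rule-based system and an LLM landing in the same queryable log:

```python
executions = []

def record_execution(system, pipeline, components, model, confidence_source):
    """Append one execution to a shared, model-agnostic log (illustrative)."""
    executions.append({
        "system": system, "pipeline": pipeline, "components": components,
        "model": model, "confidence_source": confidence_source,
    })

# Very different systems, one schema.
record_execution("medspacy", "notes-v1", ["sectionizer", "context"],
                 "rule-based", "context-rule certainty")
record_execution("openai-api", "notes-v2", ["prompt", "json-parser"],
                 "gpt-4", "post-hoc calibration score")

# Queryable: which pipeline ran which model?
by_model = {e["model"]: e["pipeline"] for e in executions}
```

In practice this would be database tables rather than a Python list, but the invariant is the same: every execution, whatever the model, answers the same questions.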
This is what moves NLP from "it works on my laptop" to "I can defend this in a methods section, and a downstream researcher can audit it two years from now."
Our NOTE and NLP Infrastructure whitepaper describes the 4-layer, 13-table schema that makes this possible — from pipeline registration through _DERIVED tables.