Why AI gets things wrong, why it matters in drug development, and what it takes to fix it
The conventional narrative around LLM hallucinations is that they are a flaw in the system. Something that can be fixed with more training. Or with a more expensive model. This framing is comforting and mostly false. The more accurate representation, supported by research and by the model providers themselves, is that hallucinations are not an accident but rather a consequence of what the models are trained to do.
During training, LLMs learn to predict the next word in a sentence. A model that confidently produces plausible but factually incorrect text will outscore, on nearly every performance metric, one that abstains or expresses uncertainty. The fact that LLMs generally produce correct, factual content is not because they are optimized for correctness. It is a side effect of the training distribution. The model is trained on human-written text, and most human-written text (particularly in the text corpus selected for training) happens to be factually accurate. So the model learns to produce outputs that pattern-match against a mostly correct corpus. But it has no intrinsic truth-seeking mechanism. Factuality, in other words, is correlational with the training signal, not causal. The model appears to value truth only because truth happens to be the dominant pattern in the data it was trained on. Change the distribution, or push the model into domains where its training data is sparse, and the correlation breaks down. The confident fluency remains. But the correctness doesn’t.
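The scoring incentive described above can be made concrete with a little arithmetic. The sketch below assumes a simple grading scheme (1 point for a correct answer, 0 for an abstention, and an optional penalty for a wrong answer). This scheme, the function name `expected_score`, and the `wrong_penalty` parameter are illustrative assumptions, not a description of any particular benchmark.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    Assumed grading: correct = +1, wrong = -wrong_penalty, abstain = 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


ABSTAIN_SCORE = 0.0  # abstaining earns nothing under the assumed scheme

# With no penalty for wrong answers, even a 10%-confident guess beats abstaining.
guess_unpenalized = expected_score(0.10)

# With a 1-point penalty per wrong answer, the same guess is worse than abstaining.
guess_penalized = expected_score(0.10, wrong_penalty=1.0)

print(guess_unpenalized > ABSTAIN_SCORE)
print(guess_penalized > ABSTAIN_SCORE)
```

Under the unpenalized scheme, guessing is never worse than abstaining, which is one way to see why confident output can be rewarded over expressed uncertainty.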
The first wave of foundation models was revolutionary simply because the models sounded convincingly human. In those early days (way back in 2022), the model didn’t even need to be factually correct in order to make headlines; it just needed to sound fluent. But as enterprises embed LLMs into mission-critical workflows, plausible is no longer sufficient. In life sciences, where a single inconsistency in a regulatory filing can delay a drug that saves lives, the margin for confident error is zero.
Risk in Life Sciences & Regulatory Contexts
In regulatory authoring, the primary concern is not that generative systems can be wrong in the abstract. It is that an error can be high-confidence, hard to detect in review, and difficult to trace back to a definitive source. That combination directly undermines the core expectations of a regulated dossier: internal consistency, traceability, and defensibility under questioning.
Even modest inconsistencies can trigger information requests, additional analyses, or rework that cascades across interdependent sections of an eCTD. More materially, an unsupported claim, a misrepresented source statement, or a fabricated reference can propagate into downstream narratives, tables, and cross-references. The operational impact is measurable in review cycles and timelines, but the higher-order risk is that teams lose the ability to distinguish what is evidenced from what is merely plausible. Even a single error can erode trust in the entire system.
For regulatory use cases, both hallucination avoidance and transparency into any errors that do make it into drafts are design requirements, not model quality aspirations (and certainly not quick fixes with more clever prompts). Systems must constrain generation to verifiable source content, preserve provenance at a granular level, and make gaps and uncertainties explicit so that unresolved questions are handled as decisions by accountable humans rather than by model completion.
Taxonomy of Errors: Factual, Faithful, and Right but Wrong
At a foundational level, an AI hallucination is the model returning an incorrect token with respect to the user’s instruction. But collapsing every error into the single word “hallucination” obscures meaningful distinctions. There are two primary categories that are worth understanding separately, because they have different causes and different mitigation strategies.
Factuality errors occur when a model relies on its parametric knowledge, i.e., the facts baked into its weights during training, and states something that is simply incorrect. A fabricated citation, an invented drug interaction, a hallucinated p-value: these are factuality failures. They tend to be most dangerous when the model has sparse or outdated training data on a topic and compensates by extrapolating.
Faithfulness errors occur when the model is provided with an instruction along with some grounding source material, such as a clinical study report or certificate of analysis, and it misrepresents or distorts that material in its response. This can happen when the source material contains ambiguous or conflicting information, in which case the model may pick one of several variants. It can also happen when the source is missing information that the instruction requires, in which case the model is more likely to fill in the blanks with plausible content than to abstain from answering.
But in practice, the taxonomy needs a third category. At Weave, we have learned that a substantial share of real-world quality failures involve outputs that are factual and faithful yet still wrong for the user’s purpose. The model produces a study summary with excessive methodological detail when the user wanted a high-level overview. An automated QC check flags clinically insignificant variations as critical errors. A generated narrative is factually correct but stylistically inappropriate for a regulatory submission. These are right-but-wrong errors, and any serious mitigation strategy must account for them. Weave strives to produce LLM-generated content that is factual, faithful, and useful.
Moving Beyond “Hallucinations” as a Catch-All: A Taxonomy of LLM Errors
| | Factuality Error | Faithfulness Error | Right-But-Wrong Error |
|---|---|---|---|
| Definition | States something that is simply incorrect (drawn from parametric “knowledge” rather than grounded evidence). | Misrepresents provided source material (grounding exists, but the output distorts it). | Output is technically correct but not appropriate/useful for the user’s purpose. |
| Causes | Sparse/outdated training data, extrapolation, and overconfident completion when the model should abstain. | Incomplete/ambiguous/fragmented context; conflicting source statements; insufficiently constrained prompting. | Misalignment on audience, intent, tone, and level of detail; missing strategic context that only the user can supply. |
| How We Mitigate | Constrain generation to verifiable sources; require provenance; encourage explicit uncertainty/abstention when evidence is missing. | Provide complete context; preserve granular source mapping; run sentence/table-cell-level QC against sources to catch distortions. | Give users direct control over prompts (tone/detail/strategic framing); make it easy for users to iterate on content. |
| Example | While drafting a Clinical Overview, the model cites an EMA guideline that doesn’t exist and uses it to justify a claim about immunogenicity risk. | A CSR states that 12 subjects discontinued due to AEs, but the generated narrative says 21 (or attributes the discontinuations to the wrong AE) even though the correct number is in the provided excerpt. | The model correctly summarizes a study result, but writes it in a promotional, benefit-forward tone (e.g., “demonstrates superior efficacy”) that is inappropriate for an eCTD section and would trigger reviewer scrutiny. |
Mitigation by Design: How Weave Approaches the Problem
Faithfulness failures are fundamentally a context problem. The model distorts source material because the source material is incomplete, ambiguous, or fragmented within the prompt. Our approach is simple and empirically grounded: we always err on the side of providing the model too much context rather than too little. Through extensive experimentation, we have observed that models perform well even with a moderate quantity of excess context. The noise is manageable for the model. Missing context, by contrast, is a critical failure mode. When the model lacks the information it needs, it does not reliably flag the gap. It fills the gap with plausible fabrication, and we avoid that scenario at all costs.
Right-but-wrong errors are not model failures in the traditional sense. They are alignment failures between what the model produces and what the user actually needs. A better model or retrieval architecture won’t solve this, because the “right” answer depends on context that only the user possesses: the strategic narrative of their submission, the preferences of their target reviewer, the tone appropriate for a specific submission type. Our mitigation is to give users direct control over every model prompt in natural language, and to make that process streamlined and straightforward. Users define the tone, the level of detail, and the strategic framing, which is the part that humans do best. The LLM handles the technical execution: faithful synthesis of source data, consistent formatting, and rapid first drafts at scale. This division of labor is deliberate. It keeps the human in command of judgment while leveraging the model for throughput.
Beyond mitigation strategies, we have found that transparency is the critical partner to failure avoidance: when an error does make its way into a draft (either in the form of LLM- or human-authored content) we surface it quickly and clearly. Every sentence and every table cell produced by our document generation engine links directly back to the specific source content used to generate it. Automated QC evaluates generated content against that source material at the sentence and table-cell level, flagging potential faithfulness errors before the draft is ever finalized. When source files are modified, our consistency enforcement layer identifies downstream content that may need updating. Nothing is hidden. Our customers can have confidence in the output not because we claim the model is infallible, but because every output is auditable. Trust built on transparency is fundamentally more durable than trust built on promises of perfection.
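To illustrate the idea of sentence-level provenance combined with automated QC, here is a deliberately narrow sketch. It assumes each generated sentence carries a hypothetical `source_id` pointing at the passage it was generated from, and it checks only one thing: whether every number in the sentence also appears in that passage. The data structure, identifiers, and the check itself are assumptions made for illustration; this is not Weave's implementation, and a real faithfulness check would need to cover far more than numeric mismatches.

```python
import re
from dataclasses import dataclass


@dataclass
class GeneratedSentence:
    text: str
    source_id: str  # hypothetical key of the source passage used to generate the sentence


NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(sentence: GeneratedSentence, sources: dict[str, str]) -> list[str]:
    """Return numbers in the sentence that do not appear in its linked source passage."""
    source_text = sources.get(sentence.source_id, "")
    source_numbers = set(NUMBER.findall(source_text))
    return [n for n in NUMBER.findall(sentence.text) if n not in source_numbers]


# Illustrative data only, modelled on the discontinuation example in the taxonomy table.
sources = {"csr-excerpt-1": "A total of 12 subjects discontinued due to adverse events."}
draft = [
    GeneratedSentence("In total, 21 subjects discontinued due to adverse events.", "csr-excerpt-1"),
    GeneratedSentence("Overall, 12 subjects discontinued due to adverse events.", "csr-excerpt-1"),
]

for sentence in draft:
    flagged = unsupported_numbers(sentence, sources)
    if flagged:
        print(f"FLAG ({sentence.source_id}): {flagged} not found in source")
```

Because every sentence keeps a pointer to its source, a flagged item can be shown to a reviewer next to the exact passage it should match, rather than as an unexplained warning.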
How We Measure and Track Performance
The previous generation of language models required fine-tuning to work in a specialized domain. Putting a language model into production was a large up-front effort that required specialized skills and a considerable investment of time. Foundation models made that barrier to entry vanish overnight, but they also introduced an enticing trap. Because a foundation model can produce reasonable output zero-shot, it is tempting to write a prompt that “seems good” and ship it.
Our approach treats prompt engineering and upstream data processing as experimental disciplines. We don’t rely on the model’s ability to perform well without iteration. Instead, we iteratively define prompts and data pipelines that objectively perform well, measured against curated ground truth datasets for classification tasks, and against LLM-based evaluations calibrated to expert human judgment for long-form generation. We validate that LLM evaluator judgments align with (or are a little harsher than) human assessments on the same content, and we rerun calibration analyses whenever significant system changes occur. We continuously ensure that we are optimizing quality metrics that translate to user value, rather than numbers that look good on paper.
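One way to picture the evaluator calibration step is as a comparison of paired scores on the same content. The sketch below assumes human reviewers and the LLM evaluator both score each item on a shared 1-to-5 scale. The scale, the function name `calibration_report`, the reported statistics, and the toy scores are all illustrative assumptions rather than the actual calibration procedure.

```python
from statistics import mean


def calibration_report(human: list[int], llm: list[int]) -> dict[str, float]:
    """Compare LLM-evaluator scores with human scores given to the same items."""
    if len(human) != len(llm) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [m - h for h, m in zip(human, llm)]  # evaluator score minus human score
    return {
        "exact_agreement": mean(1.0 if d == 0 else 0.0 for d in diffs),
        "mean_difference": mean(diffs),  # negative means the evaluator is harsher
        "lenient_rate": mean(1.0 if d > 0 else 0.0 for d in diffs),
    }


# Toy scores for five items (made up for illustration).
human_scores = [4, 5, 3, 4, 2]
llm_scores = [4, 4, 3, 3, 2]

report = calibration_report(human_scores, llm_scores)
print(report)
```

In this toy case the evaluator never scores higher than the human reviewers, which corresponds to the "aligned or a little harsher" target described above. A non-zero `lenient_rate` would be the signal to revisit the evaluator prompt before trusting its scores.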
Experimental Performance Data
| Metric | Result | What it Means |
|---|---|---|
| Faithfulness in generated drafts across the eCTD | 98% | Generated content stays consistent with the underlying source material. |
| Automated QC accuracy | 93% | QC surfaces issues by checking each sentence and table cell against the underlying source material, including lapses in AI-authored or human-authored content. |
| Adherence to user instruction | 94% | The model strictly follows the instructions given by the user. |
| Technical writing quality (redundancy, clarity, efficiency) | 90% | Fewer style rewrites for the user. |
But strong experimental results are only half the story. We also need to know whether those measured gains translate into better outcomes once the system is embedded in real regulatory workflows. Experimentation tells us how the system performs under controlled conditions, whereas user action tracking tells us how it performs in the field. We instrument actions throughout the platform: whether a reviewer accepts or rejects an AI-suggested revision, adds or removes a data tag, overrides an automated verification flag. Each action becomes a data point for continuous improvement. By analyzing action logs we can identify which platform features are underperforming with respect to our experimental findings. It is this feedback loop that closes the gap between experimental performance and production reality.
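The comparison between field behavior and experimental results can be sketched as a simple aggregation over an action log. Everything in the example below is hypothetical: the log schema (`feature`, `accepted`), the feature names, the offline scores, and the 5-point tolerance. It also assumes that acceptance rate and the offline score are on a comparable scale, which would need to be validated for each feature in practice.

```python
from collections import defaultdict


def acceptance_rates(actions: list[dict]) -> dict[str, float]:
    """Fraction of logged actions per feature in which the user accepted the AI output."""
    accepted: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for action in actions:
        total[action["feature"]] += 1
        accepted[action["feature"]] += int(action["accepted"])
    return {feature: accepted[feature] / total[feature] for feature in total}


def underperforming(rates: dict[str, float], offline: dict[str, float],
                    tolerance: float = 0.05) -> list[str]:
    """Features whose field acceptance rate trails the offline score by more than tolerance."""
    return sorted(f for f, r in rates.items() if f in offline and offline[f] - r > tolerance)


# Hypothetical action log and offline scores, for illustration only.
log = [
    {"feature": "suggested_revision", "accepted": True},
    {"feature": "suggested_revision", "accepted": True},
    {"feature": "suggested_revision", "accepted": False},
    {"feature": "suggested_revision", "accepted": True},
    {"feature": "qc_flag", "accepted": True},
    {"feature": "qc_flag", "accepted": True},
]
offline_scores = {"suggested_revision": 0.90, "qc_flag": 0.90}

rates = acceptance_rates(log)
print(underperforming(rates, offline_scores))
```

A feature that appears in this list is a candidate for investigation, not a verdict: a gap can come from the model, from the user interface, or from a mismatch between what the offline metric measures and what users actually value.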
Conclusion
The integration of generative AI into the life sciences holds immense promise, offering the potential to drastically reduce the time it takes to bring life-saving therapies to market. However, the persistent challenge of AI hallucinations reminds us that innovation cannot come at the expense of rigor. In heavily regulated environments, responsible AI requires more than just powerful foundation models; it demands purpose-built architectures that prioritize traceability, transparency, and human oversight. By grounding outputs in verifiable source data and continuously measuring quality through rigorous experimentation and real-world user actions, we can bridge the gap between AI’s probabilistic nature and the regulatory demand for deterministic accuracy. The result is an AI strategy that generates value and trust, not just words.