How I Caught My LLM Fabricating Its Own Evidence

The language model behind my Graph RAG pipeline did something worse than getting a fact wrong. It fabricated the evidence. Each relation it extracted carried a quote that was supposed to come straight from the source article, and many of those quotes had never been written. They read perfectly. They did not exist.

What does fabricated evidence mean in a knowledge graph?

I am building the seed knowledge graph for 2asy.ai, a causal-chain intelligence system over trade and tariff news. Every relation and event in the graph carries an evidence field: the exact sentence from the source document that justifies it. That evidence is the whole point. It is what lets me, or a reader, trace a claim back to where it came from instead of trusting the model on faith.

The problem is that I was asking a language model to produce that evidence by quoting the source. And a language model is a text generator, not a copier. When I checked the evidence against the original articles, a large share of the quotes were not verbatim. They were fluent, on-topic, and invented.

The ellipsis was the tell

The clearest pattern was the ellipsis. The model would take two sentences from completely different parts of an article, drop a "..." between them, and present the result as one continuous quotation. The seam looked like a normal editorial cut. It was not. It was two unrelated fragments fused to manufacture support for a relation the model had already decided to extract.

This is the dangerous kind of hallucination, because it is shaped exactly like real evidence. A wrong fact stands out. A fabricated quote that paraphrases something true reads as completely credible until you go back to the source and search for it character by character.
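The stitching pattern can be told apart from outright invention with a few lines of string handling. What follows is a minimal sketch, not the pipeline's actual code: the function name classify_quote and the three labels are hypothetical, and it assumes the quote and the source are plain Python strings.

```python
import re

# Matches a three-dot ellipsis or the single-character ellipsis,
# together with any whitespace around it.
ELLIPSIS = re.compile(r"\s*(?:\.\.\.|…)\s*")


def classify_quote(quote: str, source: str) -> str:
    """Label a quote as 'verbatim', 'stitched', or 'unmatched' (hypothetical labels)."""
    # A quote that appears character for character is real evidence,
    # even if the source itself happens to contain an ellipsis.
    if quote and quote in source:
        return "verbatim"
    # Otherwise split on the ellipsis and test each fragment on its own.
    fragments = [f for f in ELLIPSIS.split(quote) if f]
    if len(fragments) > 1 and all(f in source for f in fragments):
        # Every piece exists, but not as one continuous passage.
        return "stitched"
    return "unmatched"
```

Both non-verbatim labels would fail an exact substring test. The distinction only matters when sorting through what went wrong: a stitched quote is built from real sentences, while an unmatched one contains text the source never had.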

Why truncating the input made it worse

I had been doing something that looked harmless: truncating each article body to a few thousand characters before extraction, to stay inside a comfortable context window. That truncation was quietly licensing the fabrication. When the sentence that actually supported a relation sat past the cutoff, the model did not refuse. It filled the gap with a confident reconstruction of what the missing text probably said.

So I removed the truncation entirely and switched the collectors to full-body-or-skip: either the pipeline has the complete article text, or it does not process that document at all. A partial document is more dangerous than a missing one, because a partial document still produces output, and the output looks finished.
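A full-body-or-skip rule might look like the sketch below. It assumes the collector can tell when a fetch came back partial and records that in a flag. The FetchedArticle type, its truncated field, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchedArticle:
    url: str
    body: str
    truncated: bool  # hypothetical flag: set by the collector when the body is known to be partial


def body_for_extraction(article: FetchedArticle) -> Optional[str]:
    """Return the complete article body, or None to skip the document entirely."""
    if article.truncated or not article.body.strip():
        return None
    # The body is passed through whole; it is never sliced to fit a context window.
    return article.body
```

Returning None instead of a shortened string forces the caller to handle the skip explicitly, so a partial document cannot slip through and produce output that looks finished.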

The fix: check the quote, do not grade it

The fix is almost embarrassingly simple, and that is the point. At commit time, before any relation is written to the graph, I check that its evidence string appears as an exact substring of the source document. If the quote is not literally in the text, the relation is rejected. No fuzzy matching, no second model asked to judge whether the evidence is good enough.

The instinct in this situation is to reach for another language model to verify the first one. I think that instinct is usually wrong. If you can check an output with a deterministic string operation, do that instead of grading one generator with another generator. A substring test cannot be talked into a plausible answer. It is cheap, it is exact, and it cannot hallucinate.
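As a rough sketch, the commit-time guard comes down to Python's in operator. The Relation type and both function names here are hypothetical stand-ins, and the real graph schema is not shown in this post.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Relation:
    # Hypothetical shape of an extracted relation.
    subject: str
    predicate: str
    obj: str
    evidence: str


def evidence_is_verbatim(evidence: str, source_text: str) -> bool:
    # An empty string is a substring of every string, so reject it explicitly.
    return bool(evidence) and evidence in source_text


def filter_committable(
    relations: Iterable[Relation], source_text: str
) -> Tuple[List[Relation], List[Relation]]:
    """Split relations into (accepted, rejected) by exact substring match."""
    accepted: List[Relation] = []
    rejected: List[Relation] = []
    for relation in relations:
        if evidence_is_verbatim(relation.evidence, source_text):
            accepted.append(relation)
        else:
            rejected.append(relation)
    return accepted, rejected
```

The comparison is exact on purpose. In this sketch, a quote that differs from the source by a single space or a curly quote character is rejected along with the invented ones.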

Cleaning up what had already shipped

The guard stops new fabrication, but it does not undo the relations already sitting in the graph. So I ran the same substring check backward over everything that had already been committed. I reverted 122 documents whose evidence had been stitched together from separate sentences, then cleaned out hundreds more whose quotes simply did not match their source. In total, the cleanup passes covered more than 500 documents.
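A backward pass of this kind could be sketched as follows. It assumes the committed evidence can be read back as (document ID, evidence) pairs and that the source texts are available in a dictionary keyed by document ID. Both shapes and the function name are hypothetical. The sketch also assumes that a document whose source text is missing should be flagged, since its evidence can no longer be checked.

```python
from typing import Dict, Iterable, List, Tuple


def documents_to_revert(
    committed: Iterable[Tuple[str, str]],  # hypothetical rows: (doc_id, evidence)
    sources: Dict[str, str],               # hypothetical store: doc_id -> full source text
) -> List[str]:
    """Return the sorted IDs of documents with at least one unverifiable quote."""
    flagged = set()
    for doc_id, evidence in committed:
        source = sources.get(doc_id)
        # Flag the document if its source is missing, the evidence is empty,
        # or the evidence is not an exact substring of the source.
        if source is None or not evidence or evidence not in source:
            flagged.add(doc_id)
    return sorted(flagged)
```

The result is grouped by document instead of by relation, which matches a cleanup done in units of documents.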

That number is the real cost of having trusted generated evidence in the first place. Every one of those documents had passed through a pipeline that ran cleanly and produced output that looked correct. The volume of the cleanup is a measure of how convincing fabricated evidence is when nobody checks it against the source.

The lesson: store pointers, verify with code

If you are extracting structured claims from text with a language model, treat any quote it gives you as a hypothesis, not a fact, until a string operation confirms it. Store evidence as something you can verify (a span or an exact substring), not as free text the model is trusted to have copied faithfully.
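One way to store evidence as a pointer is to keep character offsets into the source and recover the quote by slicing. This is a sketch under assumptions: the EvidenceSpan type and the locate and resolve helpers are hypothetical names, and the offsets only stay valid while the stored source text is unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceSpan:
    doc_id: str
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset


def locate(doc_id: str, quote: str, source: str) -> Optional[EvidenceSpan]:
    """Turn a model-supplied quote into a span, or None if it is not verbatim."""
    if not quote:
        return None
    start = source.find(quote)  # str.find returns -1 when the quote is absent
    if start == -1:
        return None
    return EvidenceSpan(doc_id, start, start + len(quote))


def resolve(span: EvidenceSpan, source: str) -> str:
    """Recover the evidence text from the source; nothing generated is stored."""
    return source[span.start:span.end]
```

With this layout the graph never holds text the model wrote. The quote shown to a reader is always a slice of the source document.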

All of this runs on local hardware, an RTX 4090 and an AMD W6800, with open models doing the extraction. The substring guard adds no model calls and no cloud cost. It is a few lines of code standing between a graph you can trust and a graph that quietly lies to you in complete sentences.