The Real Work in Graph RAG Is Not Extraction

Extraction is the easy part of Graph RAG. I learned this building the seed knowledge graph for 2asy.ai. The pipeline ran, the numbers looked fine, and the graph was still broken. The real work was normalization, and it took far longer than extraction.

Why does a freshly extracted knowledge graph look fine but break?

I built a seed knowledge graph for 2asy.ai from roughly 450 documents, extracting more than 3,000 entities and over 1,000 events. The extraction pipeline ran cleanly. The counts looked healthy. Then I actually looked at the graph, and it was not navigable.

A knowledge graph is only useful when you can walk it: follow an entity to an event, an event to its cause, a cause to the entity behind it. Mine could not be walked, because the same real-world idea had been recorded under many different names. The numbers measured volume, not consistency.
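To make "walking" concrete, here is a minimal sketch, not the 2asy.ai code. It assumes edges are stored as (source, relation, target) triples; the entity names and the `walk` helper are made up for illustration. The walk follows one relation label at a time, so it stops as soon as the next hop was recorded under a different label.

```python
# Hypothetical triples: (source, relation, target). The names are illustrative only.
TRIPLES = [
    ("smelter shutdown", "caused_by", "power price spike"),
    ("power price spike", "is_caused_by", "gas shortage"),  # same idea, different label
]

def walk(triples, start, relation):
    """Follow edges labeled `relation` from `start` until no unvisited hop is left."""
    outgoing = {}
    for src, rel, dst in triples:
        outgoing.setdefault((src, rel), []).append(dst)

    path, seen, node = [start], {start}, start
    while True:
        candidates = [n for n in outgoing.get((node, relation), []) if n not in seen]
        if not candidates:
            return path
        node = candidates[0]
        seen.add(node)
        path.append(node)

# Stops after one hop, because the second edge uses "is_caused_by".
broken = walk(TRIPLES, "smelter shutdown", "caused_by")

# With the label made consistent, the same walk reaches the root cause.
fixed_triples = [(s, "caused_by", t) for s, _, t in TRIPLES]
fixed = walk(fixed_triples, "smelter shutdown", "caused_by")
```

Both versions of the graph have the same node and edge counts. Only the second one can be walked end to end.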

The relation layer: how 360 labels became 80 canonical types

The relation types were the worst offender. Over 360 distinct labels had accumulated. The language model named the same structural idea differently depending on the article it was reading at the time. caused_by, is_caused_by, and was_caused_by were three labels for one relationship. triggers, trigger, and triggered_by were three more.

A graph with 360 relation types is not a knowledge graph. It is an expensive document store. So I stopped extracting and went backward. I mapped every relation type down to 80 canonical forms and renamed the edges, hundreds of them. The graph shrank in variety and became navigable for the first time.
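A mapping like this can be sketched as a lookup table applied to every edge. The version below is an illustration under stated assumptions, not the actual mapping: the table covers only the six labels mentioned above, `clean_label` and `canonicalize` are hypothetical helpers, and the choice to flip the direction of inverse labels such as triggered_by is my assumption about how an inverse would need to be handled.

```python
import re

# Hypothetical slice of the label map: raw label -> (canonical type, flip direction?).
# The flip flag is an assumption: "A triggered_by B" is stored as "B triggers A".
CANONICAL = {
    "caused_by": ("caused_by", False),
    "is_caused_by": ("caused_by", False),
    "was_caused_by": ("caused_by", False),
    "triggers": ("triggers", False),
    "trigger": ("triggers", False),
    "triggered_by": ("triggers", True),
}

def clean_label(label):
    """Lowercase the label and collapse runs of non-alphanumerics to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", label.strip().lower()).strip("_")

def canonicalize(edge):
    """Rewrite one (source, relation, target) edge to its canonical form."""
    src, rel, dst = edge
    key = clean_label(rel)
    if key not in CANONICAL:
        # Surface unmapped labels instead of silently keeping them.
        raise KeyError(f"unmapped relation type: {rel!r}")
    canon, flip = CANONICAL[key]
    return (dst, canon, src) if flip else (src, canon, dst)

edges = [
    ("outage", "Is Caused By", "storm"),
    ("price drop", "triggered_by", "oversupply"),
]
renamed = [canonicalize(e) for e in edges]
```

Raising on an unmapped label is deliberate in this sketch: it forces every one of the raw labels to be assigned to a canonical type by hand rather than slipping through.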

The entity layer: why "aluminium" and "aluminum" break a causal chain

The entity layer had a different failure. "Aluminium", "aluminum", and "ALUMINUM" were three separate nodes with zero edges connecting them. A causal chain stops at the first spelling inconsistency, because nothing in the graph records that the three nodes are the same metal.
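One way to collapse such variants is to fold case and whitespace first, then resolve the remaining spelling differences through an alias table. This is a minimal sketch under assumptions: the alias table, the relation names, and both helper functions are hypothetical. Case folding alone handles "ALUMINUM", but "aluminium" needs an explicit alias.

```python
# Hypothetical alias table: folded spelling -> canonical node key.
ALIASES = {"aluminium": "aluminum"}

def canonical_entity(name):
    """Fold case and whitespace, then resolve known spelling variants."""
    key = " ".join(name.split()).casefold()
    return ALIASES.get(key, key)

def merge_entities(triples):
    """Rewrite both endpoints of every edge and drop edges that become duplicates."""
    merged = set()
    for src, rel, dst in triples:
        merged.add((canonical_entity(src), rel, canonical_entity(dst)))
    return sorted(merged)

raw = [
    ("Aluminium", "used_in", "cans"),
    ("ALUMINUM", "used_in", "cans"),
    ("aluminum", "priced_in", "USD"),
]
merged = merge_entities(raw)
sources = {src for src, _, _ in merged}
```

Note that this sketch lowercases every node key, so "USD" becomes "usd". A real graph would probably keep a separate display name next to the folded key.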

I merged the duplicates. I also fixed evidence substrings so each relation points back to the exact sentence that produced it, not a paraphrased approximation. Evidence that does not match its source is evidence you cannot trust later.
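The evidence rule can be checked mechanically: the stored evidence must appear verbatim in the source document. A minimal sketch, assuming the source text is available as a plain string; the `evidence_span` helper and the sample sentences are hypothetical.

```python
def evidence_span(source_text, evidence):
    """Return (start, end) offsets of the evidence in the source, or None if it
    does not appear verbatim."""
    if not evidence:
        return None  # empty evidence proves nothing
    start = source_text.find(evidence)
    if start == -1:
        return None
    return (start, start + len(evidence))

source = "Power prices rose sharply. The smelter shut down in March."
exact = evidence_span(source, "The smelter shut down in March.")
paraphrase = evidence_span(source, "The smelter was closed in March.")
```

Storing the offsets rather than a copy of the sentence makes it easy to re-verify an edge later: slice the source with the span and compare.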

How do you stop the cleanup from happening twice?

Cleaning once is not enough if the next extraction run reinvents the mess. So I added validation that enforces the 80-type vocabulary at extraction time. New runs must use an existing canonical relation type instead of inventing a new label. The cleanup becomes a one-time cost, not a recurring tax.
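A validation gate of this kind can be sketched in a few lines. The assumptions here are mine: extracted relations arrive as dicts with a "type" key, the vocabulary shown is a two-entry stand-in for the full set of 80, and `validate_extraction` and `UnknownRelationType` are hypothetical names.

```python
# Stand-in for the full canonical vocabulary; only two of the 80 types are shown.
CANONICAL_TYPES = frozenset({"caused_by", "triggers"})

class UnknownRelationType(ValueError):
    """Raised when an extraction run invents a label outside the vocabulary."""

def validate_extraction(relations, allowed=CANONICAL_TYPES):
    """Return the relations unchanged, or raise if any type is not canonical."""
    unknown = sorted({r["type"] for r in relations} - allowed)
    if unknown:
        raise UnknownRelationType(f"non-canonical relation types: {unknown}")
    return relations

good = [{"source": "outage", "type": "caused_by", "target": "storm"}]
bad = [{"source": "outage", "type": "was_caused_by", "target": "storm"}]

accepted = validate_extraction(good)
try:
    validate_extraction(bad)
    rejected = False
except UnknownRelationType:
    rejected = True
```

Rejecting the whole batch is the strict option. A softer variant would route unknown labels to a review queue, but then someone has to keep emptying that queue.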

Running the whole pipeline on local hardware

All of this ran on local hardware: an RTX 4090 and an AMD W6800, with Gemma and Qwen models handling the language work. There was no cloud bill for the extraction, the normalization, or the validation. For a one-person operation, owning the inference hardware is what makes a daily Graph RAG pipeline affordable to run at all.

The lesson: plan for cleanup before extraction

If you are building Graph RAG, plan for the cleanup phase before you start the extraction phase. The language model will produce plausible output. The graph will look dense. The counts will feel satisfying. None of that means the graph is usable.

A knowledge graph becomes useful only when the entity and relation layers are consistent enough to walk. In my case, extraction took days and normalization took weeks. The seed graph for 2asy.ai now runs causal-chain queries against 80 canonical relation types, with every edge linked back to its source sentence.