Entity Resolution as an API: How I Built It, Over-Fit It, and Fixed It

Entity resolution sounds simple until you try to build it. I have been turning it into an API called ER API, built on the open-source library Splink. This is an honest account of the design, including the part where I fooled myself by over-fitting my own evaluation set.

What is entity resolution and why is it hard?

Entity resolution answers one question: are these two records the same real-world thing? It sounds like a lookup. It is not. Names vary, abbreviations diverge, and the same company appears with a dozen spellings across a dozen documents. "Goldman Sachs", "Goldman Sachs & Co.", and "GS Group" all refer to one entity, and a naive string match misses that.
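To make the failure concrete, here is a minimal sketch using only the Python standard library. It is not code from ER API or Splink; the helper names `exact_match` and `similarity` are made up for illustration. It compares the three spellings above with an exact match and with a simple character-level similarity score.

```python
from difflib import SequenceMatcher

# The three spellings from the paragraph above.
names = ["Goldman Sachs", "Goldman Sachs & Co.", "GS Group"]


def exact_match(a: str, b: str) -> bool:
    """Naive lookup: case-insensitive string equality."""
    return a.strip().lower() == b.strip().lower()


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


canonical = names[0]
for other in names[1:]:
    print(
        f"{other!r}: exact={exact_match(canonical, other)}, "
        f"similarity={similarity(canonical, other):.2f}"
    )
```

Exact matching rejects both variants. The similarity score helps with "Goldman Sachs & Co." but gives "GS Group" a much lower score, so a single string metric is not enough on its own either.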

Doing this reliably at scale means probabilistic matching, not exact matching. That is real engineering, and most teams do not want to build it from scratch.
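For readers who have not seen probabilistic matching before, the sketch below shows the core arithmetic of the Fellegi-Sunter model, the framework Splink is based on. It is an illustration, not Splink's code: the field names, the m and u probabilities, and the prior are all invented values. Each field contributes a match weight of log2(m/u) when it agrees and log2((1-m)/(1-u)) when it does not, and the summed weight converts to a match probability.

```python
import math

# Hypothetical parameters, chosen only for illustration.
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
FIELDS = {
    "name": {"m": 0.90, "u": 0.01},
    "city": {"m": 0.85, "u": 0.10},
    "website": {"m": 0.80, "u": 0.001},
}
PRIOR = 0.001  # assumed probability that a random pair is a match


def match_weight(agreements: dict) -> float:
    """Sum the log2 prior odds and the log2 Bayes factor of each field."""
    weight = math.log2(PRIOR / (1 - PRIOR))
    for field, agrees in agreements.items():
        m, u = FIELDS[field]["m"], FIELDS[field]["u"]
        if agrees:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def match_probability(weight: float) -> float:
    """Convert a log2-odds match weight into a probability."""
    odds = 2.0 ** weight
    return odds / (1.0 + odds)


all_agree = {"name": True, "city": True, "website": True}
name_only = {"name": True, "city": False, "website": False}

for label, pattern in [("all agree", all_agree), ("name only", name_only)]:
    w = match_weight(pattern)
    print(f"{label}: weight={w:.2f}, probability={match_probability(w):.4f}")
```

The point of the design is that evidence accumulates across fields: agreement on a rare value such as a website counts for far more than agreement on a common one such as a city.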

Why I wrapped Splink instead of building from scratch

ER API is a FastAPI service that wraps Splink, an open-source probabilistic record-linkage library built by the UK Ministry of Justice data team. Splink does the hard statistical work well, but it takes real setup: you have to define comparisons and blocking rules and estimate model parameters before you get a single match. The idea behind ER API is simple: you send records, you get back match decisions, and you never stand up a probabilistic matching pipeline yourself.

Credit where it is due. Splink is the engine. ER API is the layer that makes that engine usable through a single endpoint, with an added disambiguation step for the hard cases.

The registry: how do matches get better over time?

The part I did not plan for was the registry. Every time ER API confirms a match, it learns a new alias. "Goldman Sachs", "Goldman Sachs & Co.", and "GS Group" collapse to one canonical entry. The registry grows, and future matches improve without retraining.

That compounding behavior was not in the original spec. I added it mid-build because the alternative, starting cold on every request, produced too many false negatives on well-known entities. A system that forgets everything between requests is a system that keeps making the same mistakes.
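The registry idea described above can be sketched in a few lines. This is an assumption-heavy illustration, not ER API's implementation: the class name `AliasRegistry`, its methods, and the in-memory dictionary are all hypothetical, and a real registry would need persistence and conflict handling.

```python
from typing import Dict, Optional


class AliasRegistry:
    """Hypothetical in-memory map from known aliases to one canonical name."""

    def __init__(self) -> None:
        self._alias_to_canonical: Dict[str, str] = {}

    @staticmethod
    def _normalize(name: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        return " ".join(name.lower().split())

    def lookup(self, name: str) -> Optional[str]:
        """Return the canonical name for a known alias, else None."""
        return self._alias_to_canonical.get(self._normalize(name))

    def confirm_match(self, alias: str, canonical: str) -> None:
        """Record a confirmed match so later requests can skip the matcher."""
        # If the canonical name is itself a known alias, reuse its root entry.
        root = self.lookup(canonical) or canonical
        self._alias_to_canonical[self._normalize(canonical)] = root
        self._alias_to_canonical[self._normalize(alias)] = root


registry = AliasRegistry()
registry.confirm_match("Goldman Sachs & Co.", "Goldman Sachs")
registry.confirm_match("GS Group", "Goldman Sachs & Co.")

print(registry.lookup("gs  group"))       # resolved from the registry
print(registry.lookup("Morgan Stanley"))  # unknown, falls through to matching
```

In this sketch a lookup happens before any probabilistic matching, so a well-known alias is resolved from memory and only unseen names go through the slower path.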

The trap: over-fitting my own evaluation set

Then I walked into the classic trap. I was tuning match thresholds against a ground-truth set I had built myself. The results looked good, better than I expected, so I shipped that configuration.

When I tested ER API against a different dataset, quality dropped noticeably. The threshold was right for my evaluation set. It was not right for the problem. I had over-fit to my own test data, which means I had measured my ability to memorize one distribution, not my ability to generalize to new ones.

How I fixed it: a held-out evaluation set

The fix was not subtle, but the mistake still caught me. If you tune and evaluate on the same data, you are not measuring quality. You are measuring overlap. I rebuilt the evaluation process around a held-out set that I never touch during tuning.

The reported quality got more modest. The actual behavior got more honest. That trade is always worth it, because the only number that matters is the one you see on data you did not tune against.
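A held-out workflow of the kind described above might look like the sketch below. It is a generic illustration on synthetic scores, not ER API's evaluation code; the function names, the 30% held-out fraction, the F1 metric, and the candidate thresholds are all assumptions. The threshold is chosen on the tuning split only, and the number to report is the one from the held-out split.

```python
import random


def f1_at(threshold, pairs):
    """F1 of 'score >= threshold means match' over (score, is_match) pairs."""
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def split(pairs, held_out_fraction=0.3, seed=0):
    """Shuffle once, then carve off a held-out set that tuning never sees."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * held_out_fraction)
    return shuffled[:cut], shuffled[cut:]


def tune_threshold(tuning_pairs, candidates):
    """Pick the candidate threshold with the best F1 on the tuning split."""
    return max(candidates, key=lambda t: f1_at(t, tuning_pairs))


def clipped(x):
    return min(1.0, max(0.0, x))


# Synthetic labeled pairs standing in for a real ground-truth set.
rng = random.Random(42)
pairs = [(clipped(rng.gauss(0.75, 0.15)), True) for _ in range(200)]
pairs += [(clipped(rng.gauss(0.35, 0.15)), False) for _ in range(200)]

tuning, held_out = split(pairs)
candidates = [i / 20 for i in range(1, 20)]
best = tune_threshold(tuning, candidates)

print(f"threshold chosen on tuning split: {best:.2f}")
print(f"F1 on tuning split:   {f1_at(best, tuning):.3f}")
print(f"F1 on held-out split: {f1_at(best, held_out):.3f}")
```

A random split like this only guards against fitting one set of records. It does not guard against a new dataset that looks different, so a held-out set drawn from a separate source is the stronger check.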

What is next

ER API is still in active development. Registry persistence, a batch endpoint, and documentation are the next steps before launch. If you work on record linkage, deduplication, or entity resolution, I am looking for a small group of early users to give it a real workout. The contact link is below.