← Back to writing

Cross-Lingual Entity Resolution in a Trade Knowledge Graph: Adding 39,534 Aliases to 6,883 Nodes

By TaeHo Kim (Hannune)  ·  Published June 7, 2026

Korean, Japanese, and Chinese surface forms collapsing onto single canonical entities: 한국은행 and 韩国银行 to Bank of Korea, 三星电子 and 삼성전자 주식회사 to Samsung Electronics, 米国商務省 and 美国商务部 to U.S. Department of Commerce

Korean, Japanese, and Chinese aliases resolving onto canonical entities in the 2asy.ai registry

In the previous post, I described how 2asy.ai moved from plain vector search to a cross-domain ontology Graph RAG that resolves entities across documents and traverses causal chains. That post ended with an honest note: the graph is sparse, and it will get denser. This post is about one specific dimension of that density problem, and how I addressed it this week.

The problem is not just that the graph needs more documents. The problem is that the same real-world entity arrives in multiple languages.

Why Cross-Lingual Mentions Break a Knowledge Graph

2asy.ai reads Asian trade and tariff news published in English, but the underlying entities, companies, regulators, ministries, and policymakers, are often referenced in their local-language forms inside those articles. A Korean article about trade policy mentions 한국은행. A Chinese article about semiconductors mentions 三星电子. A Japanese article on auto tariffs mentions トヨタ自動車. And a wire story covering the same events writes Bank of Korea, Samsung Electronics, and Toyota Motor Corporation.

If entity resolution only handles English strings, each foreign-language mention becomes a separate node. The graph now has 한국은행 and Bank of Korea as two distinct entities. Any causal path that crosses the language boundary breaks. A tariff chain that starts with a Korean regulator's decision and ends at a multinational company's earnings call will never close.

This is the specific failure mode: the graph looks connected per language and disconnected across languages. Cross-domain ontology gets you over the per-document barrier. Cross-lingual resolution gets you over the per-language barrier. They are separate problems.

The Fix: Filling the Registry with Alias Tables, Not Rewriting Code

The entity registry at the center of 2asy.ai already had the mechanism. When a new mention arrives during extraction, the registry resolves it against every known surface form for each canonical entity. If the mention matches any alias, it maps to that entity's canonical ID. The registry was already doing this for English variants (partial names, abbreviations, alternate spellings). It just had no non-English entries.

So the change was to fill the alias tables, not to redesign the resolution layer.

The registry holds 13,371 canonical entities. Of those, 6,883 had at least one non-English surface form worth resolving, so I added ko, ja, and zh alias lists for them. The total additions:

  • ko (Korean): 13,757 aliases
  • ja (Japanese): 12,615 aliases
  • zh (Chinese): 13,162 aliases
  • Total: 39,534 aliases across 6,883 entities

Almost all of them made it in. The pipeline flagged two small categories for exclusion before writing to the registry:

  1. Collision: 3 alias candidates already resolved to a different canonical entity. All three were the abbreviation CENTCOM, generated as a Korean, Japanese, and Chinese alias for United States Central Command, which was already mapped to an existing US Central Command node. Writing them would have made one string resolve to two IDs, so they were skipped automatically.
  2. Low confidence: 5 candidates were held in a staging area rather than written, because the source could not confirm a verified local-language form. For example, LG Electronics has confirmed Korean forms (LG전자, 엘지전자) but no verified official Japanese or Simplified Chinese legal name was found, so its non-Korean entries scored zero and were held instead of guessed.

The excluded set is tiny next to the 39,534 accepted, but it is the part that matters: the collision check and confidence gate exist specifically to keep a wrong alias from silently corrupting every mention it touches.

After writing the accepted aliases, my local ER registry resolves these on exact lookup:

  • 한국은행 (and the Japanese 韓国銀行, Chinese 韩国银行) resolves to the canonical Bank of Korea entity.
  • 三星电子 (and the Korean 삼성전자 주식회사) resolves to Samsung Electronics.
  • 미국 상무부 (and the Japanese 米国商務省, Chinese 美国商务部) resolves to United States Department of Commerce.

Before this change, each foreign-language form would have produced no match and been written to the graph as a new, orphaned node.

What This Means for the Causal Graph in 2asy.ai

The effect, once this alias layer is rolled into the live graph, is that Korean, Japanese, and Chinese entity mentions in trade news resolve to the same canonical nodes as their English counterparts. When a Korean regulator, a Chinese manufacturer, and an American import duty appear in the same causal chain, the chain can close across all three languages instead of fragmenting at each language boundary.

This matters most for the cross-document chains. The whole point of moving from per-article graphs to a shared ontology was to let causality span the full corpus. A cause in one article connects to an effect in another. That connection only works if both articles are writing about the same canonical entity. Without cross-lingual resolution, the connection breaks the moment the two articles use different language forms for the same entity.

Cross-lingual entity resolution is not a cosmetic feature. In a multilingual news corpus, it is a prerequisite for the graph to be coherent.

The Harder Lesson About Entity Resolution in Graph RAG

Running 2asy.ai has made the priority order clear to me.

Extraction is the part that looks impressive. You throw a document at a language model, it returns entities and relations, and the graph grows. It is satisfying to watch. But extraction quality has a ceiling set by what comes after it: resolution. If two mentions that should be the same node stay as two nodes, extraction accuracy does not matter. The chain is broken regardless.

Entity resolution in Graph RAG is harder than most write-ups acknowledge. The English-only case is already non-trivial: abbreviations, partial names, acquired companies that change their names, subsidiaries that share names with parent entities. Add cross-lingual surface forms and the problem surface expands significantly.

The registry-with-aliases approach I use in 2asy.ai is a deterministic layer on top of the probabilistic extraction. Extraction guesses what is in the text. The registry decides what that thing resolves to. Keeping those two responsibilities separate makes the system easier to debug and correct: if a mention is resolving wrong, I fix the registry, not the extraction model.

39,534 aliases is a lot of entries. It is also a manageable data problem. The hard part is the collision detection and confidence gating, because a bad alias silently corrupts every mention it touches. Only 8 candidates were excluded this pass, 3 collisions and 5 held, but those were the gate doing exactly its job: an abbreviation that already belonged to another entity, and entities with no verifiable local-language name. As the corpus grows, that excluded set will grow with it, and it is the set that needs the most care.

What Is Still Missing

Cross-lingual resolution is one layer. There are still open problems.

The 5 held candidates need review, and more low-confidence cases will surface as the corpus grows. Some held entries are genuine cases that should stay held, and some are probably correct but scored cautiously because the source did not have enough context. Working through that set as it grows will expand coverage further.

The 3 collision candidates need manual triage. A collision means two canonical entities share a surface form, which is either a real-world ambiguity (two companies with similar names, a ministry that was renamed) or a registry error (two canonical entries that should be merged). The CENTCOM case is the first example: it points to a likely duplicate canonical entry that should be reconciled.

And the registry has no coverage for Arabic, Russian, or Southeast Asian language forms yet. Those are smaller portions of the current corpus but not zero.

The alias-filling step this week was the first pass. It is enough to close the most common cross-lingual gaps. The rest follows as the pipeline keeps running.

This alias layer is validated on my local ER instance. Rolling it into the live 2asy.ai graph is the next step, once the held and collision cases are triaged. The public graph is at https://www.2asy.ai/

If cross-lingual entity resolution or Graph RAG pipelines are something you are working on, the entity resolution service is at https://api.hannune.ai/entity-resolution/v1. That public endpoint serves the core resolver; the multilingual alias layer described here is validated locally and not yet loaded into it.

The previous post in this series: From Vector Search to a Cross-Domain Ontology Graph

Building cross-lingual or cross-domain retrieval that has to stay coherent?

Get in touch →