← Back to writing

I Built a Local LLM Rig to Escape API Bills. Then I Paid OpenAI Again.

By TaeHo Kim (Hannune)  ·  Published June 13, 2026

A small workshop bench split between a humming GPU tower and a single laptop dialing out to a cloud API

I run a one-person AI shop. For the 2asy.ai filing pipeline that needs thousands of single-document extractions per cycle, the local rig lost the batch lane and OpenAI Batch won. Not the whole company. One lane.

The rule that decided it

No cross-document attention. Each filing gets its own prompt window. No string concatenation. The model never sees two documents at once. That rule comes from a Neo4j rollback I already paid for. When one document's facts leak into another's prompt, the model fabricates relations between unrelated entities and the graph carries it as truth. The rule is not negotiable.

So the question was never which model is smartest. It was which provider can actually run thousands of strictly isolated single-document calls, cheaply, with strict JSON schemas.

What I tested

Local Gemma 4 26B on llama.cpp, RTX 4090 plus W6800. Live chat fine. Demos at demo.hannune.ai fine. Batch lane blocked. vLLM has no expert-mapping path for the 4-bit MoE weights I need, and the vLLM container wants CUDA 12.9 while the host driver is on 12.8. I am not upgrading a production GPU server to test a batch lane. Even llama.cpp segfaulted once on a CUDA graph optimization. GGML_CUDA_DISABLE_GRAPHS=1 stays in my back pocket.

OpenRouter. No real batch API. Every request runs at live pricing. On a flash-class model at concurrency 32 I saw per-call latency between 2 and 17 seconds, occasional 121-second timeouts, and 429s. A router adds a hop and a variable upstream provider. The tail of the latency distribution is where the bill actually sits.

Gemini batch (the painful one). The SDK path I used ran my "batch" by stitching multiple requests into one context window and producing one combined response. That is inline concatenation, not isolation. The failure showed up downstream. Extracted entities and relations started carrying content from neighboring documents in the same batch line. By the time I caught it, Neo4j and the entity registry had absorbed fabricated entries. I rolled both back. The Google SDK issue queue tracks this as known and marks the fix as not planned.

OpenAI Batch API. JSONL in, JSONL out. Each line is fully independent. No shared context. 50 percent off live pricing. A recent 100-document gpt-5.4-nano binary gate came back in about 2.7 minutes with zero 429s. The extraction step on gpt-5.4-mini with strict JSON schemas lands close to one cent per document. The win was not raw price. It was that batch isolation, strict schemas, and reasonable tail latency arrived in the same product.

Side by side

OptionReal batchCross-doc isolationWhat broke or fit
Local llama.cppn/aYes, per callSingle-stream throughput too low. vLLM blocked.
OpenRouterNoYesLive pricing, long-tail latency, 429s.
Gemini batchMarketedBroken in testSDK inline-concatenated. Required Neo4j rollback.
OpenAI BatchYesYes, per JSONL line50 percent off, strict schemas, fast turnaround.

Per-pipeline, not per-company

The local rig is not leaving. It still wins on live serving for 2asy briefings, the ER API LLM gate, multimodal pipelines for filings with embedded charts, and quick ablations during graph normalization. The batch extraction lane is one lane out of many. "I left the cloud" is a sentence about a lane, not about a company.

My consulting page still says local infrastructure cuts cloud LLM bills. I still believe that, for the lanes where it actually does. What changed is that I stopped treating local first as one rule for the whole business and started treating it as one option per pipeline.

If you are picking infrastructure for a one-person shop

Stop asking local versus cloud. Ask, per pipeline. Where does the isolation boundary sit, per request or per document. Does the provider's batch endpoint actually run requests independently, or does it concatenate. What does your tail latency look like at your real concurrency. And when the model gets something wrong in a way that looks correct, what touches your graph before you notice.

For me, this month, the answer is a hybrid. Local for serving and multimodal. OpenAI Batch for the extraction lane that has to be isolated and cheap at scale. The mix will shift the next time a workload shifts. The pipeline asks the question. I answer.

Designing the infrastructure split for your own AI pipeline?

Get in touch →