
LLMs’ geometry problem: When vectors meet Voronoi

San Francisco, CA
arXiv ML

Published: Mar 25, 2026 at 12:00 UTC

  • Hidden states as Riemannian manifolds with Fisher metrics
  • Expressibility gap quantifies semantic distortion from tokenization
  • Theorems tie vocabulary size to unavoidable distortion limits

Large language models have a dirty little secret: they think in smooth, continuous vectors but spit out jagged, discrete tokens. That mismatch isn’t just messy—it’s a geometric crisis. A new arXiv paper from ML theorists finally puts numbers to the distortion, framing LLM hidden states as points on a latent semantic manifold—a Riemannian surface where tokens carve out Voronoi regions like territorial claims.
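The "tokens carve out Voronoi regions" picture can be made concrete with a small sketch. This is a hypothetical illustration, not code from the paper: a set of token embeddings acts as a codebook, and each continuous hidden state lands in the Voronoi cell of whichever token embedding is nearest.

```python
import numpy as np

# Hypothetical illustration: a finite vocabulary of token embeddings
# induces a Voronoi partition of the hidden-state space. Each continuous
# hidden state is "claimed" by its nearest token embedding.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(50, 8))    # 50 token embeddings in 8-d space
hidden = rng.normal(size=(200, 8))  # 200 continuous hidden states

# Pairwise distances, then nearest-token assignment: the index of the
# Voronoi cell each hidden state falls into.
dists = np.linalg.norm(hidden[:, None, :] - vocab[None, :, :], axis=-1)
cells = dists.argmin(axis=1)

assert cells.shape == (200,)
```

Every hidden state inside a given cell decodes to the same token, which is exactly where the geometric loss the paper studies comes from.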

The core insight? What the authors call the expressibility gap: a measurable semantic tax paid every time the model’s fluid internal representations get forced into the straitjacket of a finite vocabulary. It’s not just handwavy intuition—two theorems anchor the work. First, a rate-distortion lower bound proving that no finite vocabulary escapes distortion. Second, a linear scaling law (via the coarea formula) showing how the gap grows with manifold volume.
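A toy proxy for that gap, assuming nothing beyond the quantization framing above (this is an illustrative analogue, not the authors' exact definition): treat the vocabulary as a codebook and measure the mean distance from each hidden state to its nearest token embedding. Enlarging a random vocabulary shrinks the distortion but never zeroes it, which is the qualitative shape of the rate-distortion bound.

```python
import numpy as np

# Toy proxy for the "expressibility gap" (illustrative analogue, not the
# paper's definition): mean Euclidean distance from each continuous hidden
# state to its nearest token embedding, i.e. the quantization distortion
# of the vocabulary viewed as a codebook.
rng = np.random.default_rng(1)
hidden = rng.normal(size=(300, 16))  # continuous hidden states

def distortion(vocab_size: int) -> float:
    # Random token embeddings stand in for a learned vocabulary.
    vocab = rng.normal(size=(vocab_size, 16))
    d = np.linalg.norm(hidden[:, None, :] - vocab[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())

# A 20x larger vocabulary reduces, but does not eliminate, the distortion.
small_vocab, large_vocab = distortion(100), distortion(2000)
assert small_vocab > large_vocab > 0.0
```

The interesting part of the paper is precisely that this residual cannot be driven to zero by any finite codebook, however cleverly placed.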

This isn’t another ‘LLMs are just stochastic parrots’ take. It’s a rare case of theorists treating the tokenization bottleneck as a geometric problem, not just a training data one. The implications stretch beyond academia: if distortion is fundamental, then simply throwing more parameters at models may hit diminishing returns faster than we thought.

A rare math-first paper cuts through the hype—with actual proofs



The paper’s timing is delicious. Just as industry races to ship ever-larger models, here’s a reminder that vocabulary design—often treated as an afterthought—might be a first-order constraint. The expressibility gap suggests that even with infinite compute, discrete tokens introduce irreducible noise. For startups betting on smaller, specialized models, this could be ammunition: less distortion if your manifold is tailored to a narrow domain.

Developer reaction on r/MachineLearning has been cautiously optimistic, with several commenters noting the framework’s potential to explain why some prompts ‘feel’ semantically brittle. But the reality gap looms: these are theoretical bounds, not deployment-ready fixes. The paper doesn’t (yet) tell us how to reduce distortion—just how to measure it.

Watch for two signals. First, whether OpenAI or Anthropic cite this work in future tokenizer updates. Second, if the EleutherAI crowd starts baking these metrics into evaluation suites. For now, it’s a mathematical dare: Prove you can do better than this lower bound.

In other words, the next time an LLM startup claims their model ‘understands meaning,’ ask them about their manifold’s Ricci curvature. The answer will be telling—if you get one at all.

Tags: LLM · Language Translation · Mathematical Framework

