Language Models Are Failing Minority Languages. On Purpose.

This week researchers launched Soro, a family of LLMs specialized for Tajik, a language spoken by roughly 8 million people in Central Asia that major foundation models handle poorly or not at all. The same week, Tiwani Contemporary closed its London gallery, and a separate arXiv paper argued that the spread of autonomous AI agents is building up what it calls 'agentic literacy debt': structural gaps in users' ability to understand what AI systems are doing on their behalf. These three stories share a root system: the infrastructure of intelligence, both cultural and computational, defaults to the already-centered.

The Architecture of Linguistic Exclusion

The Soro paper is careful and specific. It does not just note that Tajik is underserved by existing LLMs. It builds a foundation model, benchmarks it, and makes it available. The implicit critique is precise: the major AI labs ignore Tajik not out of malice but because the training data economics favor high-resource languages and investor incentives favor markets where monetization is legible. A 2025 paper in Computational Linguistics by Joshi et al. documented that the 20 highest-resource languages in NLP research represent just 6.5% of the world's languages but over 88% of published model evaluations. Soro is doing the work the market will not.

Agentic Literacy Debt and the Tajik Problem Are the Same Problem

Rohith Nama's arXiv paper on agentic literacy debt argues that as AI agents act autonomously in healthcare, finance, and legal contexts, users who cannot understand or interrogate those systems accumulate a structural disadvantage that compounds over time. This is the Tajik problem at the level of interface rather than language. If the AI systems making decisions about your credit, your medical triage, or your legal options are trained predominantly on English-language data and optimized for English-language users, the literacy debt falls disproportionately on speakers of underserved languages, not because any single decision was discriminatory, but because the architecture was. A related paper on post-COVID ICT skill gaps found that students entering the generative AI era with weaker technical foundations face compounding disadvantages as AI tools become mandatory in professional contexts. TurboFund's accelerator database shows almost no dedicated infrastructure for AI founders building in low-resource language contexts, which is either a gap or an opportunity depending on your relationship to urgency.