AI Access Feb 2026

Foundational models learn a representation space whose structures are disproportionately shaped by English-language data.

A heartfelt congratulations to Centro Nacional de Inteligencia Artificial and all the regional organizations in Latin America that contributed to the launch of LATAMGPT.

It's been an AI whirlwind of a start to 2026: the adoption of the CLI agent OpenClaw, Cowork, the Opus/Codex advances, and of course the now-popular discussion topic of the "death of SaaS."

As foundational models get closer to our everyday lives, it's a good reminder that the nature of their underlying training data creates an access schism by default. Today, I'm referring to the "developing world" schism, though there's also a socio-economic gap even within the colloquial "Western countries," which deserves its own post.

The majority of internet knowledge is in English. Foundational models learn a representation space whose token distributions, semantic associations, and latent structures are disproportionately shaped by English-language data. In practice, this means the core abstractions the model reasons over are English-anchored. In my own work, building any GenAI solution in Spanish consistently yields lower eval scores than its English equivalent. Even anecdotally, when my family shows me their ChatGPT instances, the reasoning always feels two steps behind.
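The kind of gap I'm describing is easiest to see in a paired eval, where the same task is scored in both languages. The tasks and scores below are invented placeholders purely to show the shape of the comparison, not real benchmark numbers:

```python
from statistics import mean

# Hypothetical paired eval results (accuracy per task). The numbers
# are illustrative only; the pattern -- English edging out Spanish on
# every task -- is the point.
results = {
    "summarization":  {"en": 0.91, "es": 0.84},
    "classification": {"en": 0.88, "es": 0.82},
    "extraction":     {"en": 0.93, "es": 0.87},
}

def language_gap(results: dict) -> float:
    """Mean (English - Spanish) score across all tasks."""
    return mean(r["en"] - r["es"] for r in results.values())

gap = language_gap(results)  # a positive gap means English outperforms
```

Running the same prompts through both languages and tracking this single number over time is a cheap way to see whether a model update is closing or widening the gap for your use case.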

Countries with a history of STEM investment have been able to push back against the English-dominated FM space (see: DeepSeek), though I'd be remiss not to note that DeepSeek also benefited from the command-economy nature of China. Mistral, for instance, has a clear focus on European languages, but its model proliferation is an order of magnitude lower than DeepSeek's.

Putting the social aspect aside, this matters economically. There are ~670 million people in LATAM: roughly 450 million Spanish speakers and ~220 million Portuguese speakers. Model makers and hyperscalers are front-loading CapEx, betting on exponential adoption of foundational models. LATAM is a sleeping giant in this regard, but not without nuance:

Spanish speakers don't speak Portuguese, and Portuguese speakers don't speak Spanish. Even within each language cohort, vocabulary, idioms, and expression vary, especially as you move down socio-economic strata. There are seven ways of saying "let's go play a football match," and seven ways of saying "straw," across LATAM. As we move toward typeless interfaces like voice, the way an STT model should process Argentine Spanish is fundamentally different from how it should process Caribbean Colombian Spanish.
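The "straw" example can be made concrete with a toy normalization table. The regional attributions below are approximate and illustrative (several of these words overlap across countries or carry other meanings), and a real dialect-aware pipeline would be far more involved; this sketch only shows why surface-form variation is a problem worth engineering around:

```python
# Toy lexicon: regional Spanish words for "drinking straw" (non-exhaustive,
# attributions approximate). A naive keyword system that only knows one
# variant would miss every user who says another.
STRAW_VARIANTS = {
    "popote": "Mexico",
    "pitillo": "Colombia, Venezuela",
    "pajilla": "Central America",
    "sorbete": "Peru (also a dessert elsewhere)",
    "bombilla": "Chile (also the mate straw in Argentina)",
    "carrizo": "Panama",
    "absorbente": "Cuba",
}

def normalize(token: str) -> str:
    """Collapse any known regional variant to one canonical concept."""
    return "straw" if token.lower() in STRAW_VARIANTS else token

canonical = normalize("Popote")  # Mexican variant -> canonical concept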

LATAMGPT is a phenomenal move that begins to shift LATAM from being a consumer of poorly adapted foundational models to a creator and innovator of localized ones. I'm keen to see, and participate in, the applications, architectures, and velocity that emerge from this.