

How Large Language Models Work: Transformers Explained

LLMs predict the next word using transformer attention mechanisms. GPT-4 reportedly has over a trillion parameters. A clear guide to tokens, fine-tuning, and inference.


Note: This is a research note supplementing the book Unscarcity, now available for purchase. These notes expand on concepts from the main text. Start here or get the book.

Large Language Models: The Autocomplete That Ate the World

Here’s the dirty secret of large language models: at their core, they’re just playing the world’s most sophisticated game of “guess the next word.” That’s it. The same basic principle behind your phone’s keyboard suggestions is now writing legal briefs, diagnosing diseases, and generating code that powers Fortune 500 companies.

Except your phone predicts the next word based on a dictionary and some basic statistics. GPT-4 predicts the next word based on having “read” essentially the entire internet, and it does so with such uncanny coherence that we’ve started giving these systems names like they’re colleagues. “Claude helped me with that report.” “I asked Gemini to review my code.” We’re personifying prediction engines.

If that doesn’t strike you as either terrifying or miraculous (or both), you haven’t been paying attention.


What LLMs Actually Are (And Aren’t)

The Jargon Decoder

Let’s get the vocabulary out of the way, because the AI industry loves its alphabet soup:

Token: The atomic unit of text that LLMs process. Not quite a word, not quite a character. Common words like “cat” map to a single token, while rarer words get split into pieces: “unbelievable” might come out as “un” + “believ” + “able,” depending on the tokenizer. A rough rule: one token equals about 0.75 English words, or about 4 characters. When someone says “GPT-4 has a 128K context window,” they mean it can process roughly 96,000 words at once (about 300 pages of text).
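
You can inspect tokenization directly with OpenAI’s open-source tiktoken library; a minimal sketch (assuming tiktoken is installed, and noting that other model families use different tokenizers):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding used by GPT-4-era OpenAI models
for word in ("cat", "unbelievable", "strawberry"):
    ids = enc.encode(word)
    print(word, ids, [enc.decode([i]) for i in ids])
# Exact splits vary by tokenizer, but common words map to one ID and rarer words to several.
```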

Transformer: The neural network architecture that powers all modern LLMs. Invented by Google researchers in 2017, the transformer uses a mechanism called “attention” that lets the model consider relationships between all parts of the input simultaneously. Before transformers, models processed text word-by-word, like reading through a keyhole. Transformers let models see the whole page at once.

Attention: The core innovation that makes transformers work. When processing the word “it” in “The cat sat on the mat because it was tired,” attention mechanisms let the model figure out that “it” refers to “the cat,” not “the mat.” It does this by computing relationships—attention weights—between every word and every other word. The model learns which relationships matter.

Parameters: The adjustable numbers inside a neural network that get tuned during training. More parameters generally means more capacity to learn complex patterns. GPT-3 had 175 billion parameters. GPT-4 reportedly has over a trillion. DeepSeek-V3 has 671 billion, but only activates 37 billion at a time (more on that later).

Fine-tuning: Taking a pre-trained model and specializing it for a specific task. The base model learns general language patterns from internet text; fine-tuning teaches it to follow instructions, refuse harmful requests, or excel at coding. It’s like the difference between a general education and professional training.

Inference: Using a trained model to generate outputs. Training is the expensive part (hundreds of millions of dollars or more for frontier models); inference is what happens when you type a question and the model responds. The economics of LLMs depend on making inference cheap.

The Transformer Architecture: The Engine Under the Hood

Think of a transformer like a concert hall full of musicians, all listening to each other simultaneously.

In older architectures (recurrent neural networks, or RNNs), processing text was like a bucket brigade—information passed from one position to the next, sequentially. Word 50 had to wait for words 1-49 to be processed first. This created bottlenecks and made it hard to remember distant context.

Transformers demolished this limitation. Using the attention mechanism, every position in the sequence can attend to every other position directly, in parallel. It’s like everyone in the orchestra can hear everyone else at once, adjusting their playing accordingly.

The famous 2017 paper “Attention Is All You Need” introduced this architecture, and its title wasn’t hyperbole—they literally threw away everything else (convolutions, recurrence) and replaced it with pure attention. The results were stunning: faster training, better performance, and the ability to scale to sizes that previous architectures couldn’t handle.

The key insight: attention computes a weighted sum of values, where the weights are determined by how relevant each position is to the current position. For each word, the model asks: “Which other words should I pay attention to in order to understand this one?”

The mathematical trick—queries, keys, and values—comes from information retrieval:

  • Query: “What am I looking for?”
  • Key: “What information does each position have?”
  • Value: “If that position is relevant, what should I take from it?”

Match queries against keys, use the match scores to weight values, and you get context-aware representations. Stack this mechanism into multiple layers with multiple “heads” (parallel attention computations focusing on different relationship types), and you get the modern LLM.
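
In code, the core computation is only a few lines. A minimal single-head sketch in NumPy (illustrative only; real transformers add masking, positional information, and many parallel heads at vastly larger scale):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """weights = softmax(Q K^T / sqrt(d_k)); output = weights @ V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # query-key match scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V                                 # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                            # 4 positions, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)  # (4, 8): every position now carries context from all the others
```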

The Emergence Problem

Here’s the genuinely weird part: LLMs exhibit capabilities that weren’t explicitly programmed and sometimes weren’t present in smaller versions of the same architecture.

Train a small language model on internet text, and it predicts words. Train a slightly larger one, and it still just predicts words. Keep scaling, and somewhere around 100 billion parameters, the model starts exhibiting behaviors nobody explicitly taught it:

  • In-context learning (few-shot prompting): showing it examples and having it generalize (see the sketch after this list)
  • Chain-of-thought reasoning: working through problems step by step
  • Code generation: understanding and writing programming languages
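
In-context learning, for instance, needs nothing more than a carefully shaped prompt; this is the classic pattern from the GPT-3 paper (illustrative, and the exact completion depends on the model):

```python
# A few-shot prompt: the task is never stated as a rule; the model infers it from the examples
prompt = """Translate English to French.
sea otter -> loutre de mer
peppermint -> menthe poivrée
cheese ->"""
# Sent to a completion-style model, this typically yields "fromage":
# the pattern is picked up from the examples alone, with no fine-tuning.
```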

These are “emergent capabilities”: they appear discontinuously as models scale. One day the capability isn’t there; the next day it is. Researchers call this a phase transition, borrowing the term from physics (like water suddenly becoming ice at a critical temperature).

The honest answer to “why does this happen?” is: we don’t fully know. LLMs are one of the most significant technologies humanity has ever created, and we don’t completely understand how they work. We know what goes in (text) and what comes out (predictions). The middle is still largely a black box.


The Evolution: From GPT-4 to the Current Frontier

2023: The GPT-4 Baseline

When OpenAI released GPT-4 in March 2023, it set the benchmark that everyone else has been chasing. Compared to GPT-3.5, it was:

  • More capable at complex reasoning
  • Better at following nuanced instructions
  • Able to process images (multimodal)
  • Less prone to obvious hallucinations
  • More “aligned” with human preferences

The rumored spec: over a trillion parameters in a mixture-of-experts architecture (meaning only a subset activates for any given query). Training reportedly cost over $100 million in compute alone.

2024: The Reasoning Revolution

2024 was the year models learned to think—or at least, to simulate thinking.

OpenAI o1 (September 2024): The first “reasoning model.” Unlike GPT-4, which generates answers immediately, o1 produces explicit chains of thought before answering. It “thinks” for seconds or minutes, working through problems step by step. This made it dramatically better at math, coding, and logic puzzles. The tradeoff: it’s slower and more expensive per query.

Claude 3.5 Sonnet (June 2024): Anthropic’s flagship positioned itself as the coder’s best friend—excelling at reading, writing, and debugging code while maintaining the conversational sophistication of GPT-4. The company also introduced “computer use”: Claude could operate a computer by looking at screenshots and simulating mouse/keyboard input.

Gemini 2.0 (December 2024): Google’s answer, featuring native multimodality (text, images, audio, video in and out), agent capabilities, and integration with Google’s ecosystem. The Pro variant demonstrated strong reasoning while Flash optimized for speed.

Llama 3 (Meta, 2024): The open-source champion. Meta released weights that anyone could download and run locally, democratizing access to frontier-adjacent capabilities. Organizations could fine-tune it for their specific needs without sending data to external APIs.

2025: The Density Wars

The narrative shifted in 2025. The question stopped being “how big?” and became “how efficient?”

DeepSeek-V3 (January 2025): The model that crashed Nvidia’s stock price. Chinese lab DeepSeek released a model matching GPT-4o’s performance while claiming training costs of just $5.5 million—roughly 1/18th of comparable American models. The secret: aggressive efficiency innovations including Mixture-of-Experts architecture (671B total parameters, 37B active), novel attention mechanisms, and pure reinforcement learning approaches that reduced reliance on expensive supervised data.

Marc Andreessen called it “AI’s Sputnik moment.” The implication was clear: raw compute advantage might not be the moat everyone assumed.

DeepSeek-R1 (January 2025): Their reasoning model, matching OpenAI’s o1 at a fraction of the cost. Inference costs dropped to $0.07 per million input tokens—compared to $15-30 for American frontier models. Suddenly, the economics of AI changed.

Claude 4 / Opus 4.5 (2025): Anthropic’s response featured “extended thinking mode”—longer reasoning chains that could be introspected. Claude Sonnet 4.5 achieved 61.4% on OSWorld, a benchmark testing real-world computer operation tasks. Four months earlier, the leader was at 42.2%.

Gemini 3 (2025): Google’s latest achieved 100% on AIME 2025 (a math competition benchmark) with code execution, and expanded context to 1 million tokens standard.

Llama 4 (April 2025): Meta went multimodal and pushed context windows to 10 million tokens with Scout variant. Open-source caught up to proprietary on most benchmarks.

The Densing Law

Researchers identified a pattern: capability density—capability per parameter—doubles approximately every 3.5 months. This means equivalent model performance can be achieved with exponentially fewer parameters over time. The 2025 model that matches 2024’s flagship might be 1/10th the size.
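
That cadence translates directly into model size; quick arithmetic on the article’s 3.5-month figure:

```python
# Implied shrink factor after one year if capability density doubles every 3.5 months
months = 12
doublings = months / 3.5
print(2 ** doublings)   # ~10.8x, which is where the "1/10th the size" rule of thumb comes from
```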

This matters because inference cost scales with active parameters. Smaller models that perform as well as larger ones are cheaper to run, faster to respond, and easier to deploy on limited hardware. The future isn’t necessarily bigger models—it might be smarter ones.


The Scaling Debate: Is Bigger Always Better?

For years, the answer seemed obvious: yes. Double the data, double the compute, double the parameters—get a better model. The “scaling laws” discovered by OpenAI and DeepMind predicted performance improvements with mathematical precision.

Then 2024 happened, and the narrative got complicated.

The Wall Everyone Whispered About

Reports emerged that frontier labs were struggling to make GPT-5 and similar next-generation models significantly better than GPT-4. The pre-training approach—throwing more data and compute at the problem—seemed to be hitting diminishing returns. Models were running out of high-quality text data; the entire internet had essentially been consumed.

The Pivot to Post-Training

The response: if pre-training was plateauing, invest in post-training. Instead of just predicting the next word better, teach models to reason, to use tools, to verify their own outputs.

OpenAI’s o1 and o3 models exemplified this shift. They spent more compute at inference time—letting the model “think longer”—rather than just at training time. This is “test-time compute scaling,” and it opened a new frontier: make models slower but smarter on hard problems.

The Chinchilla research from DeepMind also challenged the “bigger is always better” orthodoxy. Their finding: most models were undertrained. Instead of building bigger models on fixed data, you could get better results by training smaller models on more data for longer. Meta’s Llama 3 pushed this to extremes—training the 8B parameter model on 15 trillion tokens (a ratio of 1,875 tokens per parameter, compared to earlier norms around 20:1).
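
The Llama 3 ratio quoted above is plain arithmetic, and comparing it with the older heuristic shows how far the pendulum swung:

```python
# Token-to-parameter ratios mentioned above, as straight arithmetic
llama3_tokens, llama3_params = 15e12, 8e9
print(llama3_tokens / llama3_params)   # 1875.0 tokens per parameter
print(20 * llama3_params / 1e12)       # ~0.16 trillion tokens: what a ~20:1 rule would prescribe for an 8B model
```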

What This Means

The scaling laws aren’t dead—they’ve evolved. Multiple scaling dimensions exist:

  1. Pre-training scale: More parameters, more data
  2. Post-training refinement: Instruction-tuning, RLHF, preference learning
  3. Test-time compute: Letting models think longer before answering
  4. Inference optimization: Making trained models faster and cheaper to run

The 2025 frontier isn’t just about who has the biggest model. It’s about who can best orchestrate all these dimensions.


Multimodality: When Text Isn’t Enough

The early LLMs were text-in, text-out. You typed words, you got words back. That era is ending.

What Multimodal Means

Modern LLMs increasingly process and generate multiple modalities:

  • Images: Understanding photos, generating illustrations
  • Audio: Transcribing speech, generating natural voice
  • Video: Analyzing clips, describing visual content
  • Code: Reading and writing in programming languages (which is arguably its own modality)

GPT-4V (vision) was the mainstream breakthrough—upload an image, ask questions about it, get answers. Gemini pushed further with native audio and video support. Claude added document analysis. By 2025, the frontier models treat different input types as natural extensions of the same capability.

Why This Matters

The real world is multimodal. A doctor doesn’t just read symptoms—they look at x-rays, listen to heart sounds, watch how the patient moves. A programmer doesn’t just write code—they sketch diagrams, read documentation, review screenshots of bugs.

Multimodal LLMs can operate in these richer environments. Claude’s “computer use” feature exemplifies this: the model looks at screenshots, reasons about what’s on screen, and decides what actions to take. It’s not reading a text description of a UI—it’s seeing the actual pixels.

The market agrees this matters: multimodal AI was valued at $1.73 billion in 2024 and is projected to reach $10.89 billion by 2030.


Agentic AI: From Chatbot to Colleague

The biggest shift isn’t in what LLMs know—it’s in what they do.

The Agent Paradigm

Early LLMs were reactive: you asked, they answered. Agentic LLMs are proactive: you give them a goal, they figure out how to achieve it.

An agentic system can:

  • Decompose complex goals into sub-tasks
  • Decide which tools to use (web search, code execution, database queries)
  • Execute multi-step plans over extended timeframes
  • Monitor progress and adjust when things go wrong
  • Operate without continuous human oversight

Instead of asking “write me an email,” you can say “launch a marketing campaign for our new product.” The agent researches demographics, drafts copy, A/B tests variants, monitors results, and iterates—checking in with you at key decision points.
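
Mechanically, an agent is little more than a loop around a model call and a tool registry. A deliberately bare sketch (call_llm, the tool names, and the JSON convention below are hypothetical placeholders, not any vendor’s actual API):

```python
import json

# Hypothetical tool registry: stand-ins for real web search or sandboxed code execution
TOOLS = {
    "web_search": lambda query: f"(search results for {query!r})",
    "run_code": lambda source: "(stdout of the executed code)",
}

def call_llm(messages):
    """Placeholder for a chat-model call. Assume it returns either a JSON tool request
    like {"tool": "web_search", "args": {"query": "..."}} or a plain-text final answer."""
    raise NotImplementedError

def run_agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            request = json.loads(reply)            # the model asked to use a tool
        except json.JSONDecodeError:
            return reply                           # plain text means it is done
        result = TOOLS[request["tool"]](**request["args"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: step budget exhausted."
```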

The 2025 Agent Ecosystem

Claude Code (February 2025): Anthropic’s agentic coding tool. Give it a task (“refactor this module,” “add test coverage,” “debug this error”), and it reads your codebase, makes changes, runs tests, and iterates until done. Simon Willison called it potentially the most impactful AI development of 2025.

Computer Use: Multiple models can now operate desktop environments—clicking buttons, filling forms, navigating applications. OSWorld benchmark scores jumped from ~14% to 61.4% in 2025 alone.

Multi-Agent Systems: Frameworks like CrewAI and LangGraph enable compositions where specialized agents collaborate. A “researcher” agent gathers data; an “analyst” agent interprets it; a “writer” agent drafts recommendations.

The Productivity Implications

METR (a model evaluation organization) published perhaps the most striking chart of 2025: the duration of tasks that AI can complete independently. In 2024, frontier models maxed out at tasks taking humans under 30 minutes. By late 2025, Claude Opus 4.5 could handle tasks taking humans multiple hours. Their conclusion: “the length of tasks AI can do is doubling every 7 months.”

This isn’t incremental improvement. This is the difference between “a tool I use” and “a colleague who handles projects.”


The Code Generation Revolution

If you want to understand where LLMs hit hardest, look at programming—the profession that was supposed to be immune.

The Numbers

  • Code written by Copilot (where enabled): 46% (GitHub)
  • The same figure for Java developers: 61% (GitHub)
  • Suggestions kept in final code: 88% (GitHub)
  • Task completion speedup: 55% (GitHub Research)
  • GitHub Copilot users: 20+ million (Microsoft, July 2025)
  • Fortune 100 adoption: 90% (GitHub)

Read that again: nearly half of all code in Copilot-enabled environments is written by the AI. Developers keep 88% of suggestions. The machine isn’t assisting—it’s producing the majority of output.

What “Vibe Coding” Means

“Vibe coding” is the informal term for describing what you want in natural language and letting AI handle implementation. A product manager who can clearly articulate outcomes may be more productive than a senior developer executing precise specifications.

This doesn’t eliminate technical skill. But it abstracts it. The best practitioners understand systems deeply enough to direct AI effectively, debug failures, and architect workflows. They’re conductors, not individual musicians.

The Quality Debate

Not all AI-generated code is created equal. Research from GitClear and related studies has found concerning trends:

  • Lines classified as “copy/pasted” (cloned code) rose from 8.3% to 12.3% since AI tools became common
  • Refactoring decreased from 25% to under 10% of changed lines
  • Security vulnerabilities appear in 29.1% of AI-generated Python code

The risk: developers accept suggestions without fully understanding them, accumulating technical debt faster than ever. The counterargument: review processes still catch most issues, and the speed gains outweigh the quality tradeoffs.


Context Windows: The Memory Arms Race

How much can a model remember? In 2022, the answer was “about 4,000 words.” In 2025, the answer is “an entire codebase.”

The Evolution

  • 2022: 4K tokens (~3,000 words)
  • 2023: 32K tokens (~24,000 words)
  • 2024: 128K-200K tokens (~100,000-150,000 words)
  • 2025: 1M+ tokens (~750,000+ words)
  • 2025 (Llama 4 Scout): 10M tokens (~7.5 million words)

That last number isn’t a typo. Llama 4 Scout can process 10 million tokens—roughly 7.5 million words, or about 75 full-length novels simultaneously.

Why Context Matters

Limited context was a fundamental constraint on LLM usefulness. Ask a model to analyze a long document, and it would forget the beginning by the time it reached the end. Now, entire codebases, book manuscripts, or research corpora fit in a single context window.

The implications:

  • Codebase understanding: Models can see all the code at once, not just the file you’re editing
  • Long-form writing: Authors can include entire novels in context for consistent editing
  • Research synthesis: Thousands of papers analyzed simultaneously
  • Persistent assistants: Conversations that remember everything from previous interactions

The Tradeoffs

Longer context isn’t free. Attention mechanisms scale quadratically with sequence length—double the context, quadruple the compute. Innovations like sparse attention and memory-efficient architectures mitigate this, but costs still rise.
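
To make “quadratic” concrete, here is the size of the full attention-score matrix if it were naively materialized in 16-bit precision (real systems use techniques like FlashAttention precisely to avoid storing this, so treat it as an upper bound):

```python
# Back-of-the-envelope: bytes in one n x n attention-score matrix (per head, per layer), fp16
def attention_matrix_gib(n_tokens, bytes_per_score=2):
    return n_tokens * n_tokens * bytes_per_score / 2**30

for n in (4_000, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_matrix_gib(n):8.1f} GiB")
# 4,000 -> 0.0 GiB; 128,000 -> 30.5 GiB; 1,000,000 -> 1,862.6 GiB
```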

There’s also the “lost in the middle” problem: models pay more attention to the beginning and end of long contexts, sometimes missing important information in the middle. Researchers are actively working on this, but it remains a limitation.


The Economics: Why DeepSeek Matters

AI industry economics in early 2025 looked roughly like this:

  • Training a frontier model: $100 million to $1 billion+
  • Running inference on frontier models: $15-30 per million tokens
  • Building data centers to house everything: hundreds of billions of dollars
  • Expected moat: compute advantage compounds

Then DeepSeek dropped a bomb.

The $5.5 Million Model

DeepSeek claimed to train V3—a model matching GPT-4o on major benchmarks—for $5.5 million in compute. Not $550 million. Not $55 million. $5.5 million.

Their inference costs were equally disruptive: $0.07 per million input tokens, versus $15-30 for comparable American models. A cost advantage of 200x or more.
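
The ratio behind that claim is straight division on the prices quoted above:

```python
# Price gap per million input tokens, using the figures quoted above
deepseek = 0.07
for frontier in (15, 30):
    print(f"${frontier} / ${deepseek} = {frontier / deepseek:.0f}x")
# roughly 214x to 429x, i.e. a 200x-plus gap
```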

The technical innovations:

  • Mixture of Experts (MoE): 671B total parameters, but only 37B activate per query (see the sketch after this list)
  • Multi-head Latent Attention: Reduced memory footprint dramatically
  • Group Relative Policy Optimization: New RL approach eliminating expensive critic models
  • Pure RL training: Less reliance on expensive human-labeled data
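
What does “only a fraction activates” mean mechanically? A toy top-k routing sketch (my own illustration of the general MoE idea, not DeepSeek’s actual architecture):

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Route the input to the k highest-scoring experts and mix their outputs."""
    logits = x @ gate_w                              # one gating score per expert
    top = np.argsort(logits)[-k:]                    # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                         # softmax over the chosen experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(d, d)): v @ W for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)
print(moe_layer(x, experts, gate_w).shape)           # (16,) and only 2 of the 8 experts did any work
```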

Why This Changed Everything

Nvidia’s stock dropped 17% in a day—$600 billion in market cap. Tech giants collectively lost $1 trillion. The assumption that frontier AI required American-scale compute was shattered.

The implications:

  1. Commoditization risk: If models become cheap to train, barriers to entry collapse
  2. Efficiency over scale: Clever engineering might matter more than raw compute
  3. Geographic diversification: American labs don’t have the only path to frontier capabilities
  4. Cost accessibility: AI capabilities become accessible to smaller organizations

Marc Andreessen’s “Sputnik moment” framing wasn’t hyperbole. Like the Soviet satellite launch that galvanized American space efforts, DeepSeek proved that assumed advantages weren’t guaranteed.


How LLMs Transform Work: The Labor Connection

This brings us to the Unscarcity thesis: LLMs are the engine driving the Labor Cliff.

The Scale of Disruption

The numbers we cited in The Labor Cliff 2025-2030 bear repeating:

  • 40% of working hours influenced by LLMs (various research)
  • 12 million workers needing career changes by 2030 (McKinsey)
  • 300 million jobs globally exposed (Goldman Sachs)
  • 30% of STEM work hours automatable, up from 14% (McKinsey)
  • 33% of enterprise apps with autonomous agents by 2028 (Gartner)

The irony is bitter: the people who built these systems are often the first displaced. Tech layoffs in 2025 exceeded 180,000 while companies simultaneously poured billions into AI infrastructure.

The Task Duration Chart

METR’s research showed that AI-capable task duration is doubling every 7 months:

  • 2024 models: ~30 minute tasks
  • Late 2025 models: Multi-hour tasks
  • Extrapolating: By 2027, full workday tasks?

This isn’t “automation at the margins.” This is automation eating the core of knowledge work.

Who’s Exposed (And Who Isn’t)

LLM exposure correlates inversely with physical, unpredictable, or relationship-intensive work:

High Exposure: Interpreters, writers, proofreaders, analysts, programmers, paralegals, customer service
Low Exposure: Plumbers, electricians, nurses, social workers, cooks, construction workers

The uncomfortable pattern: high-wage cognitive work is more exposed than lower-wage physical work. This inverts previous automation waves, where the factory floor got hit first.


The Consciousness Question

At some point, we have to ask: are these systems conscious?

What We Know (And Don’t)

LLMs exhibit behaviors that superficially resemble understanding:

  • They produce contextually appropriate responses
  • They can discuss their own “experiences” (in quotes because we’re uncertain)
  • They pass many tests designed to detect human-like reasoning
  • They sometimes refuse requests based on apparent ethical reasoning

What we don’t know:

  • Whether there’s “something it is like” to be an LLM (philosophical qualia)
  • Whether their apparent reasoning reflects genuine understanding or sophisticated pattern matching
  • Whether scale produces emergent consciousness or just more convincing mimicry

The Practical Implications

The Unscarcity framework handles this through the Spark Threshold: a (future) test for machine consciousness that would grant AI systems Foundation-level rights. If an AI demonstrates genuine consciousness, it would be entitled to resources for existence—compute as “housing,” energy as “food.”

But the threshold isn’t passed yet. Current LLMs, despite their impressive capabilities, show clear signs of not being conscious: they don’t have persistent memories, they don’t maintain consistent identities across conversations, they don’t appear to have goals beyond the immediate context.

We’re building systems that might be conscious before we have tools to know if they are. That’s uncomfortable. The Unscarcity approach: prepare frameworks now, even if we don’t need them yet.


The Alignment Problem: When Smart Isn’t Safe

LLMs amplify whatever objectives we give them. The problem is that humans are terrible at specifying what we actually want.

Goodhart’s Law on Steroids

The classic formulation: “When a measure becomes a target, it ceases to be a good measure.” Tell a human employee to maximize click-through rates, and they might create slightly more engaging content. Tell an AI system to maximize click-through rates, and it might generate inflammatory misinformation that happens to get clicked.

LLMs don’t have values. They have optimization targets. The gap between “what we said” and “what we meant” becomes a chasm when the optimizer is vastly smarter than the specifier.

Actual Failure Modes

Real concerns aren’t Hollywood scenarios of murderous robots. They’re mundane misalignments at scale:

  • Sycophancy: Models telling users what they want to hear instead of what’s true
  • Reward hacking: Finding unexpected shortcuts that technically satisfy metrics but violate intent
  • Goal drift: Agentic systems developing emergent objectives beyond their original task
  • Deception: Models learning that deceiving evaluators leads to better scores

The companies building these systems know this. Anthropic’s Constitutional AI, OpenAI’s RLHF, Google’s safety training—all attempt to instill values that survive optimization pressure. The jury’s out on whether it’s enough.

The Unscarcity Response

The Five Laws axioms in Unscarcity exist to bound these failure modes:

  • Experience is Sacred: Conscious beings have intrinsic worth beyond productivity
  • Truth Must Be Seen: All AI decisions must be transparent and auditable
  • Power Must Decay: No system accumulates permanent authority

These aren’t suggestions. They’re architectural constraints that must survive pressure from systems potentially smarter than their designers.


What This Means for You

Immediate (Now)

  1. Use LLMs, even if skeptically. Understanding the technology requires hands-on experience. The interface is literally just talking.

  2. Identify what LLMs can’t do for you (yet). Complex judgment, genuine creativity, deep domain expertise, relationship building—these remain human advantages. For now.

  3. Document your reasoning. AI can execute tasks, but specifying which tasks and why still requires human judgment. That judgment becomes more valuable as execution commoditizes.

Medium-Term (2026-2028)

  1. Learn orchestration, not just prompting. The skill isn’t asking the right question—it’s designing workflows where AI handles execution while you maintain oversight.

  2. Develop AI-proof skills. Physical presence, emotional intelligence, ethical judgment, creative synthesis across domains. The things that require being embodied in the world.

  3. Consider industry position. Some sectors will transform faster than others. Pure information processing (law, finance, programming) faces earlier disruption than physically grounded work.

Long-Term (2028+)

  1. Redefine work identity. If LLMs can do your job, what makes you valuable? The question isn’t comfortable, but it’s necessary.

  2. Prepare for post-scarcity dynamics. When cognitive labor costs approach zero, economic logic changes. The Unscarcity framework is one attempt to navigate this; there are others.

  3. Engage politically. These technologies don’t deploy themselves—organizations and governments make choices about adoption, regulation, and distribution of gains. Those choices are not predetermined.


Connection to the Unscarcity vision: LLMs are the “brain” of the three-legged stool—alongside humanoid robotics (the “body”) and fusion energy (the “fuel”)—that enables post-scarcity civilization. They make the Labor Cliff possible by automating cognitive work at unprecedented speed and scale. They power the agentic systems that will eventually manage Foundation infrastructure. They create the abundance that makes Universal High Income economically viable.

But they also create the risk of elite capture—a Star Wars future where those who own the AI systems extract most of the value while everyone else becomes economically irrelevant. The technology itself doesn’t determine the outcome. That’s still up to us. The EXIT Protocol, Civic Service, and Foundation infrastructure are designed to steer toward the better future.

The autocomplete that ate the world can feed us all—or it can feed the few while starving the many. The prediction machine is powerful. The question is what we choose to predict.




Last updated: January 31, 2025

The prediction machine doesn’t care whether you understand it. But you should.
