How LLMs Generate Responses: Tokens, Prediction, and Sampling Explained
Tokenization, autoregressive prediction, temperature, and Top-P: the internal mechanics of how language models turn a prompt into text.
Someone shows you a ChatGPT response and asks "but how does it know that?" The honest answer is that the model doesn't know anything — it's making a very well-informed bet about which token comes next. That distinction matters more than it seems, especially when the model fails with the same confidence it succeeds.
This post focuses on the internal mechanics: what happens between pressing Enter and text appearing on screen. If you want the broader picture of generative AI, "What is Generative AI" covers the general context; here we go straight to the engine.
Tokenization: the model doesn't read text, it reads chunks
Before any prediction, the model needs to convert text into numbers. This happens through tokenization — text is broken into units called tokens, which can be whole words, word fragments, or individual characters depending on the model's vocabulary.
The most widely used tokenizer today is BPE (Byte Pair Encoding), which builds its vocabulary by starting from individual bytes or characters and repeatedly merging the most frequent adjacent pair in the training corpus. In practice, "tokenization" might become two tokens (token + ization), while a very common word like "the" typically gets a token of its own, because frequent sequences are merged early.
This has real implications:
- Rare words in Portuguese cost more tokens than common English words
- Source code has different tokenization than natural text
- A "200K token" context window is not 200K words — it's less, depending on the language
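GPT-4's tokenizer (cl100k_base) has a vocabulary of ~100K tokens. Claude uses a similar BPE tokenizer with variations. Smaller open models have often used 32K–64K token vocabularies, though some newer ones, such as Llama 3 at about 128K, go larger.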
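The merge procedure behind BPE can be sketched in a few lines. This is a toy illustration that works on characters and whitespace-separated words; production tokenizers operate on bytes, use pre-tokenization rules, and are trained on far larger corpora. The function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical, not taken from any library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-separated corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

merges, words = train_bpe("the the the then that", num_merges=2)
print(merges)  # learned merge rules, most frequent first
print(words)   # corpus words expressed in the merged symbols
```

After two merges on this tiny corpus, the frequent word "the" has collapsed into a single symbol while the rarer "that" is still split into pieces, which is the same effect that makes common English words cheap and rare words expensive.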
Next-token prediction: the core loop
After tokenization, the model processes the input sequence and produces a probability distribution over all vocabulary tokens. For each output position, it answers: "given everything that came before, what's the probability of each token being next?"
This process is autoregressive: each generated token is appended to the context and fed back into the model for the next prediction. Because each iteration produces one token and English text averages a bit more than one token per word, a 200-word response involves roughly 250–300 iterations of this mechanism.
Input: ["What", "is", "the", "capital", "of", "Brazil"]
1st prediction output: { "?" : 0.001, "Bras": 0.72, "is": 0.04, ... }
Selected token: "Bras"
New input: ["What", "is", "the", "capital", "of", "Brazil", "Bras"]
Next prediction: { "ília": 0.89, "il": 0.08, ... }
What determines these probability weights is the Transformer architecture, trained to minimize prediction loss (cross-entropy) over billions of text examples. The model doesn't have a stored "correct answer" — it learned the statistical pattern of how coherent texts are constructed.
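The loop itself is simple enough to sketch. The code below replaces the Transformer with a hypothetical lookup table (`TOY_MODEL`) that only looks at the last token, which a real model does not do; the probabilities are the illustrative ones from the trace above and are not normalized, and the `<eos>` end-of-sequence marker is an assumption added so the loop has a stopping condition.

```python
# Hypothetical stand-in for a real model: maps the last token to a
# next-token distribution. A real LLM conditions on the whole context.
TOY_MODEL = {
    "Brazil": {"?": 0.001, "Bras": 0.72, "is": 0.04},
    "Bras": {"ília": 0.89, "il": 0.08},
    "ília": {"<eos>": 0.95, ".": 0.05},
}

def predict_next(context):
    """Return a next-token distribution for the given context (toy lookup)."""
    return TOY_MODEL.get(context[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: pick the top token, append, repeat."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = predict_next(context)
        token = max(probs, key=probs.get)  # greedy choice, no sampling
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes part of the input
    return context[len(prompt_tokens):]

print(generate(["What", "is", "the", "capital", "of", "Brazil"]))
```

The important line is `context.append(token)`: every prediction after the first depends on tokens the model itself produced.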
Attention: how context gets weighted
The self-attention mechanism is what allows the model to weight the importance of previous tokens when predicting the next one. For each token in the sequence, the model computes query, key, and value vectors, and uses them to determine how much each previous position contributes to the current prediction.
In practical terms: when the model processes "she" in a sentence, the attention mechanism decides whether "she" refers to the subject mentioned two sentences back or the object of the current clause. This is what differentiates a Transformer from a simple Markov model, which only looks at the last N tokens.
Attention has quadratic cost relative to context size — doubling the context quadruples the attention computation. That's why large windows (like Gemini 1.5's 1M tokens or Claude 3.5's 200K) are expensive at inference time.
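A minimal sketch of single-head scaled dot-product attention with a causal mask, in plain Python, shows both the weighting and where the quadratic cost comes from. It assumes the query, key, and value vectors are passed in directly; in a real Transformer they come from learned projections of the token embeddings, and there are many heads and layers. The helper `score_count` is a hypothetical name used only to count score computations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Each argument is a list of n vectors. Position i attends to 0..i only."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # One score per visible position: this inner loop is the O(n^2) part.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

def score_count(n):
    """Number of query-key scores computed for a causal sequence of length n."""
    return n * (n + 1) // 2

vectors = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = causal_attention(vectors, vectors, vectors)
print(weights)                                # attention weights per position
print(score_count(2000) / score_count(1000))  # close to 4: doubling n ~quadruples the work
```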
Temperature and sampling: how the model "chooses"
The raw probability distribution from the model is rarely used directly. Before token selection, two main parameters come into play:
Temperature scales the logits before softmax: each logit is divided by the temperature. temperature=0 is treated as a special case in which the model always picks the highest-probability token (greedy decoding), giving near-deterministic responses. Values between 0 and 1 sharpen the distribution toward the most likely tokens. With temperature=1.0 (the default for most providers, including Claude and OpenAI), the original distribution is preserved. Values above 1 flatten the distribution, increasing diversity and noise.
Top-P (nucleus sampling) truncates the distribution: the model considers only the smallest set of most probable tokens whose cumulative probability reaches P. With top_p=0.9, the top tokens that collectively represent 90% of the probability are eligible; the rest are discarded and the remaining probabilities are renormalized. This cuts off the long tail of very low-probability tokens, which matters most when a high temperature has flattened the distribution.
Raw logits (toy 4-token vocabulary): [2.1, 1.8, 0.3, -0.5]
After temperature=0.7 (logits ÷ 0.7): [3.0, 2.57, 0.43, -0.71]
After softmax: [0.57, 0.37, 0.04, 0.01]
After top_p=0.9: keeps first 2 tokens (cumulative sum 0.94 ≥ 0.9)
Token sampled from resulting subset, after renormalizing
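These parameters explain why the same prompt can produce different responses across runs. Even temperature=0 doesn't guarantee full determinism in practice: floating-point addition is not associative, so differences in batching or in the order of parallel reductions on GPUs (especially at reduced precision such as float16) can shift logits slightly and flip the choice between near-tied tokens.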
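The pipeline described in this section (temperature, softmax, Top-P, sample) can be sketched with the standard library alone. This is an illustrative implementation, not how any particular provider does it: the function names are hypothetical, the vocabulary is a toy list of four logits, and treating temperature=0 as greedy decoding is a convention assumed here.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_id: probability}."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Return the index of the sampled token."""
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    nucleus = top_p_filter(softmax(scaled), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.1, 1.8, 0.3, -0.5]
probs = softmax([x / 0.7 for x in logits])
print([round(p, 2) for p in probs])        # distribution after temperature 0.7
print(sorted(top_p_filter(probs, 0.9)))    # token ids that survive top_p=0.9
print(sample_token(logits, temperature=0.7, top_p=0.9))
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which is useful when debugging sampling behavior.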
The context window: working memory, not permanent memory
The model processes all tokens in the conversation with each prediction — the entire context (prompt + history + partial response) is the model's "working memory." There's no persistent memory across different conversations unless it's explicitly injected into the context.
Models released in 2024 dramatically expanded these windows: Claude 3.5 operates with 200K tokens, Gemini 1.5 Pro with up to 1M, and models like Llama 3.1 reach 128K. That's equivalent to hundreds of pages of text held in context at once.
What doesn't change: the window is finite. When a conversation outgrows it, a common strategy in chat applications is to drop the oldest tokens, so the model keeps responding but without access to the beginning of the conversation; some APIs reject over-long requests instead. The limit comes from the sequence length the model was trained for and from the memory and compute that attention requires, not from a lack of intelligence.
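Dropping the oldest tokens is usually done by the application, not by the model. The sketch below shows one possible policy under stated assumptions: the history is a list of turns (each a list of tokens, oldest first), a system prompt is pinned so it is never dropped, and whole turns are removed rather than cut mid-sentence. `trim_history` is a hypothetical helper, not part of any API.

```python
def trim_history(system_tokens, turns, max_tokens):
    """Keep the system prompt plus as many of the most recent turns as fit."""
    budget = max_tokens - len(system_tokens)
    kept, used = [], 0
    # Walk from the newest turn backwards and stop at the first one that
    # would exceed the budget; everything older is dropped.
    for turn in reversed(turns):
        if used + len(turn) > budget:
            break
        kept.append(turn)
        used += len(turn)
    kept.reverse()  # restore chronological order
    return list(system_tokens) + [tok for turn in kept for tok in turn]

history = [["a", "b"], ["c", "d", "e"], ["f"]]
print(trim_history(["S"], history, max_tokens=5))
```

With a budget of five tokens, the oldest turn no longer fits and silently disappears from what the model sees, which is exactly why long conversations can "forget" their beginning.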
Why does the model hallucinate with such confidence?
Hallucination — the technical term for when a model produces factually incorrect information with an assertive tone — is a direct consequence of the prediction mechanism. The model has no access to a verified fact database; it predicts plausible tokens given the context. If the statistical pattern says "the capital of Canada is To..." should be followed by "ronto", it'll write that — even though Ottawa is correct.
Linguistic plausibility and factual accuracy are different objectives. Training with RLHF (Reinforcement Learning from Human Feedback) steers outputs toward responses that human raters prefer, which reduces some errors but doesn't eliminate the problem, especially for domains underrepresented in the training corpus.
For cases where accuracy matters more than fluency, low temperature combined with retrieval-augmented generation (RAG) over trusted sources is the safer path.
Frequently asked questions
Does the model "think" before responding?
Technically, no: tokens are generated one at a time, with no separate planning step outside the token stream. But recent reasoning models (like DeepSeek-R1 or OpenAI's o3) are trained, largely with reinforcement learning, to generate intermediate "reasoning" tokens before the final answer. This improves results on multi-step tasks, not because the model thinks in the human sense, but because the intermediate reasoning becomes additional context that guides subsequent predictions.
Why does the model use more tokens in Portuguese than English?
The training corpus of almost all LLMs is dominated by English, estimated at 40–60% for the most popular models. The BPE tokenizer, trained on that corpus, learned longer and more complete subwords for English, so common English words often fit in a single token. Portuguese words with rich morphology (verb conjugations, derivational suffixes) are more often split into several pieces. The practical result: a sentence in Portuguese typically uses around 15–30% more tokens than its English equivalent, depending on the tokenizer, which increases cost and shrinks the useful context available.
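A rough back-of-the-envelope calculation shows what that overhead means for a fixed window. The sketch assumes about 1.3 tokens per English word (an assumption, picked from inside the 1.25–1.5 range implied by "200 words is roughly 250–300 tokens" earlier in this post) and applies the 15–30% overhead quoted above; real ratios vary with the tokenizer and the text.

```python
def estimate_words(context_tokens, tokens_per_word, overhead=0.0):
    """Approximate number of words that fit in a context window.

    `overhead` is the extra fraction of tokens a language needs compared
    with the baseline ratio (0.15 means 15% more tokens per word).
    """
    effective_ratio = tokens_per_word * (1 + overhead)
    return int(context_tokens / effective_ratio)

WINDOW = 200_000
EN_TOKENS_PER_WORD = 1.3  # assumed baseline, not a measured value

print("English:          ", estimate_words(WINDOW, EN_TOKENS_PER_WORD))
print("Portuguese (+15%):", estimate_words(WINDOW, EN_TOKENS_PER_WORD, 0.15))
print("Portuguese (+30%):", estimate_words(WINDOW, EN_TOKENS_PER_WORD, 0.30))
```

Under these assumptions the same 200K window holds tens of thousands fewer words of Portuguese than of English, and every API call is billed for the extra tokens.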
What's the difference between Top-K and Top-P?
Top-K truncates the distribution to the K most probable tokens, regardless of their actual probabilities. Top-P truncates based on cumulative probability mass. Top-P is generally preferred because it's adaptive: if the model is very confident (one option at 99% probability), Top-P may keep just 1 token; if uncertain, it might keep 50. A fixed Top-K of 40 treats both cases the same way: it lets in unlikely candidates when the model is confident and can cut off reasonable ones when it is not.
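The difference is easy to see on two toy distributions, one confident and one nearly flat. The probabilities and helper names below are made up for illustration.

```python
def ranked(probs):
    """Token ids sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k_ids(probs, k):
    """Keep the k most probable tokens, whatever their probabilities are."""
    return ranked(probs)[:k]

def top_p_ids(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, cumulative = [], 0.0
    for i in ranked(probs):
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

confident = [0.99, 0.004, 0.003, 0.002, 0.001]  # one dominant option
uncertain = [0.22, 0.21, 0.20, 0.19, 0.18]      # nearly flat

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name, "top_k=3 keeps", len(top_k_ids(dist, 3)),
          "| top_p=0.9 keeps", len(top_p_ids(dist, 0.9)))
```

Top-K keeps three candidates in both cases, while Top-P shrinks to a single token for the confident distribution and expands to all five for the flat one.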
Why is the default temperature 1.0 and not 0?
Temperature 0 (greedy decoding) maximizes local coherence but reduces diversity and tends to produce repetitive text in creative tasks. The 1.0 default samples from the distribution the model actually learned, without sharpening or flattening it, which preserves the variety present in the training data. For code or SQL, many teams use 0.0–0.2; for creative writing, 0.7–1.2 is common.
The token is the unit of everything
Understanding tokenization, autoregressive prediction, and sampling resolves most questions about why LLMs behave the way they do — why they fail at simple arithmetic, why they're slow with large contexts, why the same question can yield different answers.
The model doesn't reason in the human sense, doesn't index facts for retrieval, and doesn't verify what it produces. It's very good at one specific thing: given the current context, generate the most plausible next token — and repeat until you tell it to stop. When I need to manually verify hashes or encodings before trusting an LLM-generated value, I use the Hash Generator on Quick Tools.
That limitation isn't a design flaw — it's the nature of the problem it was optimized to solve. Knowing this changes how you use the tool.