
How LLMs Generate Responses: Tokens, Prediction, and Sampling Explained

Tokenization, autoregressive prediction, temperature, and Top-P: the internal mechanics of how language models turn a prompt into text.


Someone shows you a ChatGPT response and asks "but how does it know that?" The honest answer is that the model doesn't know anything — it's making a very well-informed bet about which token comes next. That distinction matters more than it seems, especially when the model fails with the same confidence it succeeds.

This post focuses on the internal mechanics: what happens between pressing Enter and text appearing on screen. If you want the broader picture of generative AI, the post "What is Generative AI" covers the general context. Here we go straight to the engine.

Tokenization: the model doesn't read text, it reads chunks

Before any prediction, the model needs to convert text into numbers. This happens through tokenization — text is broken into units called tokens, which can be whole words, word fragments, or individual characters depending on the model's vocabulary.

The most widely used tokenization scheme today is BPE (Byte Pair Encoding), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols (bytes, at the start) in the training corpus. In practice, "tokenization" might become two tokens (token + ization), while a common word like "the" typically maps to a single token, because frequent English words get merged into whole-word tokens early in that process.

This has real implications:

  • Rare words in Portuguese cost more tokens than common English words
  • Source code has different tokenization than natural text
  • A "200K token" context window is not 200K words — it's less, depending on the language

GPT-4's vocabulary has ~100K tokens. Claude uses a similar BPE tokenizer with variations. Smaller models typically have 32K–64K token vocabularies.
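The merge procedure behind BPE can be sketched in a few lines. This is a toy version under loud assumptions: it works on characters rather than bytes, skips the pre-tokenization real tokenizers apply, and the names `train_bpe`, `merge_pair`, and `encode` are illustrative, not any library's API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` inside one word."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Made-up corpus: "the" is frequent, "token" is rare.
corpus_words = ["the"] * 10 + ["then"] * 3 + ["token"] * 2
merges = train_bpe(corpus_words, num_merges=2)
print(merges)                   # [('t', 'h'), ('th', 'e')]
print(encode("the", merges))    # ['the']
print(encode("token", merges))  # ['t', 'o', 'k', 'e', 'n']
```

With only two merges, the frequent word collapses into one token while the rare one stays split into five, which is the same effect that makes rare or non-English words more expensive.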

Next-token prediction: the core loop

After tokenization, the model processes the input sequence and produces a probability distribution over all vocabulary tokens. For each output position, it answers: "given everything that came before, what's the probability of each token being next?"

This process is autoregressive — each generated token is appended to the context and fed back into the model for the next prediction. A 200-word response involves roughly 250–300 iterations of this mechanism.

Input: ["What", "is", "the", "capital", "of", "Brazil"]
1st prediction output: { "?" : 0.001, "Bras": 0.72, "is": 0.04, ... }
Selected token: "Bras"
New input: ["What", "is", "the", "capital", "of", "Brazil", "Bras"]
Next prediction: { "ília": 0.89, "il": 0.08, ... }

What determines these probability weights is the Transformer architecture, trained to minimize prediction loss (cross-entropy) over billions of text examples. The model doesn't have a stored "correct answer" — it learned the statistical pattern of how coherent texts are constructed.
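The loop itself is simple enough to sketch. In the example below, `TOY_MODEL` is a hypothetical lookup table standing in for a trained network: it conditions only on the last token, whereas a real LLM conditions on the whole context. The `<eos>` stop token and its probability are invented for illustration, and the omitted probability mass is simply left out.

```python
# Hypothetical stand-in for a trained model: last token -> next-token
# probabilities. Values for "Brazil" and "Bras" echo the trace above;
# the "<eos>" entry is an assumption made so the loop can stop.
TOY_MODEL = {
    "Brazil": {"Bras": 0.72, "is": 0.04, "?": 0.001},
    "Bras": {"ília": 0.89, "il": 0.08},
    "ília": {"<eos>": 0.95, ".": 0.05},
}

def next_token_distribution(context):
    """Return the (toy) probability distribution for the next token."""
    return TOY_MODEL.get(context[-1], {"<eos>": 1.0})

def generate(prompt_tokens, max_new_tokens=10):
    """Greedy autoregressive decoding: predict, append, repeat."""
    context = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        token = max(dist, key=dist.get)  # greedy: take the most probable token
        if token == "<eos>":
            break
        context.append(token)            # feed the choice back in as context
        generated.append(token)
    return generated

prompt = ["What", "is", "the", "capital", "of", "Brazil"]
print(generate(prompt))  # ['Bras', 'ília']
```

Each pass through the `for` loop corresponds to one full forward pass of the model in a real system.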

Attention: how context gets weighted

The self-attention mechanism is what allows the model to weight the importance of previous tokens when predicting the next one. For each token in the sequence, the model computes query, key, and value vectors, and uses them to determine how much each previous position contributes to the current prediction.

In practical terms: when the model processes "she" in a sentence, the attention mechanism decides whether "she" refers to the subject mentioned two sentences back or the object of the current clause. This is what differentiates a Transformer from a simple Markov model, which only looks at the last N tokens.
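A minimal single-head sketch of that weighting, in plain Python, might look like the following. It assumes the query, key, and value vectors are already given (a real Transformer derives them from learned projections of the token embeddings), and the 2-dimensional vectors are made up.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only 0..i."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the current query against every earlier key (and its own).
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors for a three-token sequence.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_self_attention(q, k, v)
for i, w in enumerate(weights):
    print(i, [round(x, 3) for x in w])
```

The printed rows grow by one entry per position: the first token can attend only to itself, the third to all three. Those weights are what let a later token "reach back" to any earlier position instead of a fixed window.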

Attention has quadratic cost relative to context size — doubling the context quadruples the attention computation. That's why large windows (like Gemini 1.5's 1M tokens or Claude 3.5's 200K) are expensive at inference time.
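The arithmetic behind that claim is easy to check. The helper below (a hypothetical name) counts query-key scores for one attention head over a full sequence; it ignores layers, heads, and the optimizations real inference stacks apply, so treat it as a rough scaling illustration only.

```python
def attention_score_count(context_len, causal=True):
    """Number of query-key scores one head computes over the whole sequence."""
    if causal:
        # Position i attends to positions 0..i, so the total is 1 + 2 + ... + n.
        return context_len * (context_len + 1) // 2
    return context_len * context_len

for n in (1_000, 2_000, 200_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_score_count(n, causal=False):,} scores")
```

Going from 1,000 to 2,000 tokens multiplies the count by four; the causal variant halves the constant but scales the same way.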

Temperature and sampling: how the model "chooses"

The raw probability distribution from the model is rarely used directly. Before token selection, two main parameters come into play:

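Temperature scales the logits before softmax: each logit is divided by the temperature. With temperature=0, which implementations treat as a special case because dividing by zero is undefined, the model always picks the highest-probability token (greedy decoding), giving near-deterministic responses. With temperature=1.0 (the default for most providers, including Claude and OpenAI), the original distribution is preserved. Values below 1 sharpen the distribution toward the top tokens; values above 1 flatten it, increasing diversity and noise.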

Top-P (nucleus sampling) truncates the distribution: the model considers only the smallest set of most-probable tokens whose cumulative probability reaches P. With top_p=0.9, the top tokens that together account for at least 90% of the probability mass are eligible; the rest are discarded and the survivors are renormalized. This prevents the model from selecting very low-probability tokens even with high temperature.

Raw logits (toy 4-token vocabulary): [2.1, 1.8, 0.3, -0.5]
After temperature=0.7 (logits / 0.7): [3.0, 2.57, 0.43, -0.71]
After softmax: [0.571, 0.372, 0.044, 0.014]
After top_p=0.9: keeps first 2 tokens (cumulative sum 0.943 ≥ 0.9)
Token sampled from resulting subset

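These parameters explain why the same prompt can produce different responses across runs. Even temperature=0 doesn't guarantee full determinism: on distributed or batched GPU inference, low-precision arithmetic such as float16 can accumulate sums in a different order from run to run, and when two tokens are nearly tied, a tiny rounding difference changes which one wins.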
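The whole selection pipeline (temperature, softmax, Top-P, sample) fits in a short sketch. It treats four logits as the entire vocabulary, and the function names are illustrative rather than any provider's API.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as greedy decoding")
    return [x / temperature for x in logits]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized survivors

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    probs = softmax(apply_temperature(logits, temperature))
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

logits = [2.1, 1.8, 0.3, -0.5]
probs = softmax(apply_temperature(logits, 0.7))
print([round(p, 3) for p in probs])       # [0.571, 0.372, 0.044, 0.014]
print(sorted(top_p_filter(probs, 0.9)))   # [0, 1]
print(sample(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Passing a seeded `random.Random` makes a run reproducible, which is handy for testing; production APIs usually expose this, if at all, as a separate seed parameter.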

The context window: working memory, not permanent memory

The model processes all tokens in the conversation with each prediction — the entire context (prompt + history + partial response) is the model's "working memory." There's no persistent memory across different conversations unless it's explicitly injected into the context.

Models released in 2024 dramatically expanded these windows: Claude 3.5 operates with 200K tokens, Gemini 1.5 Pro up to 1M, and models like Llama 3.1 reach 128K. That's equivalent to hundreds of pages of text processed at once.

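What doesn't change: the window is finite. When the context fills up, something has to give. Many chat applications drop the oldest tokens, so the model keeps responding but without access to the beginning of the conversation (some APIs reject the oversized request instead). That's a constraint of compute, memory, and the context length the model was trained for, not of intelligence.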
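A "drop the oldest tokens" policy can be sketched as a sliding window. The function name and the `keep_prefix` option (pinning the first few tokens, such as a system prompt) are assumptions for illustration; real applications use varied strategies, including summarizing old turns.

```python
def fit_to_window(tokens, max_tokens, keep_prefix=0):
    """Drop the oldest tokens so the sequence fits in max_tokens.

    keep_prefix pins the first N tokens so they survive truncation
    (hypothetical option, e.g. for a system prompt).
    """
    if len(tokens) <= max_tokens:
        return list(tokens)
    if keep_prefix > max_tokens:
        raise ValueError("keep_prefix exceeds the window size")
    tail_len = max_tokens - keep_prefix
    # Keep the pinned prefix plus the most recent tail_len tokens.
    return list(tokens[:keep_prefix]) + list(tokens[len(tokens) - tail_len:])

history = list(range(10))  # stand-in for token IDs, oldest first
print(fit_to_window(history, 6))                 # [4, 5, 6, 7, 8, 9]
print(fit_to_window(history, 6, keep_prefix=2))  # [0, 1, 6, 7, 8, 9]
```

Whatever falls outside the returned list is simply invisible to the model on the next prediction.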

Why does the model hallucinate with such confidence?

Hallucination — the technical term for when a model produces factually incorrect information with an assertive tone — is a direct consequence of the prediction mechanism. The model has no access to a verified fact database; it predicts plausible tokens given the context. If the statistical pattern says "the capital of Canada is To..." should be followed by "ronto", it'll write that — even though Ottawa is correct.

Linguistic plausibility and factual accuracy are different objectives. Training with RLHF (Reinforcement Learning from Human Feedback) steers outputs toward what human raters prefer, which tends to reduce obvious errors, but it doesn't verify facts and doesn't eliminate the problem, especially for domains underrepresented in the training corpus.

For cases where accuracy matters more than fluency, low temperature combined with retrieval-augmented generation (RAG) over trusted sources is the safer path.

Frequently asked questions

Does the model "think" before responding?

Technically, no — token generation is sequential with no global planning. But recent models trained with chain-of-thought (like DeepSeek-R1 or OpenAI's o3) are incentivized to generate intermediate "reasoning" tokens before the final answer. This improves results on multi-step tasks, not because the model thinks, but because generating intermediate reasoning creates additional context tokens that guide subsequent prediction.

Why does the model use more tokens in Portuguese than English?

The training corpus of almost all LLMs is dominated by English, estimated at 40–60% for the most popular models. The BPE tokenizer, trained on that corpus, spends most of its merges on English, so English words compress into fewer, longer tokens. Portuguese words with rich morphology (verb conjugations, derivational suffixes) rarely fit into a single token. The practical result: a sentence in Portuguese typically uses 15–30% more tokens than its English equivalent, which increases cost and shrinks the usable context.

What's the difference between Top-K and Top-P?

Top-K truncates the distribution to the K most probable tokens, regardless of their actual probabilities. Top-P truncates based on cumulative probability mass. Top-P is generally preferred because it's adaptive: if the model is very confident (one option at 99% probability), Top-P may select just 1 token; if uncertain, it might select 50. A fixed Top-K of 40 treats both cases the same way: it admits 39 near-impossible tokens in the first case and cuts off plausible ones in the second.
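The difference shows up clearly on two made-up distributions, one confident and one uncertain. The helpers below are illustrative sketches, not a library API.

```python
def top_k_indices(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_indices(probs, p):
    """Indices of the smallest top set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Made-up distributions over a five-token vocabulary.
confident = [0.99, 0.004, 0.003, 0.002, 0.001]
uncertain = [0.22, 0.21, 0.20, 0.19, 0.18]

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    k_size = len(top_k_indices(probs, 3))
    p_size = len(top_p_indices(probs, 0.9))
    print(f"{name}: top_k=3 keeps {k_size}, top_p=0.9 keeps {p_size}")
```

Top-K keeps three candidates in both cases, while Top-P keeps one token for the confident distribution and all five for the uncertain one.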

Why is the default temperature 1.0 and not 0?

Temperature 0 (greedy decoding) maximizes local coherence but reduces diversity and tends to produce repetitive text in creative tasks. The 1.0 default samples from the distribution exactly as the model learned it, without sharpening or flattening, which in practice gives a reasonable balance of fluency and variety. For code or SQL, many teams use 0.0–0.2; for creative writing, 0.7–1.2 is common.
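To see the trade-off in numbers, one can sweep the temperature over a fixed set of logits and watch the top probability and the entropy. This is a sketch on made-up logits; the entropy measure is added here for illustration only.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: higher means a more spread-out distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [2.1, 1.8, 0.3, -0.5]  # made-up logits for a four-token vocabulary
results = []
for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    results.append((t, max(probs), entropy_bits(probs)))
    print(f"T={t}: top prob={max(probs):.3f}, entropy={entropy_bits(probs):.3f} bits")
```

As the temperature rises, the top token's probability falls and the entropy climbs, which is the diversity-versus-reliability dial the ranges above are tuning.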

The token is the unit of everything

Understanding tokenization, autoregressive prediction, and sampling resolves most questions about why LLMs behave the way they do — why they fail at simple arithmetic, why they're slow with large contexts, why the same question can yield different answers.

The model doesn't reason in the human sense, doesn't index facts for retrieval, and doesn't verify what it produces. It's very good at one specific thing: given the current context, generate the most plausible next token — and repeat until you tell it to stop. When I need to manually verify hashes or encodings before trusting an LLM-generated value, I use the Hash Generator on Quick Tools.

That limitation isn't a design flaw — it's the nature of the problem it was optimized to solve. Knowing this changes how you use the tool.

Author
Rafael Duarte
Backend developer with experience in fintech and B2B SaaS who has worked on teams that scaled APIs from zero to millions of requests. Carries enough production scars to hold strong opinions about tools, patterns, and architecture decisions. Not an academic: he read the UUID RFC when he had to choose between v4 and v7 for a write-heavy table.