Tutorials

What Is Generative AI: LLMs, Image Models, and Why They Hallucinate

LLMs generate text token by token via statistical prediction — no plan, no "knowing." Learn how it works, where it performs, and why it hallucinates.

Rafael Duarte

EDITOR TÉCNICO

Published

Jun 18, 2026

Reading time

9 min

Jun 18, 2026 · 9 MIN

COVER · Tutorials

You ask an AI to "write a professional email" and it delivers something solid in two seconds. Then you ask it to confirm a historical date and it states, with the exact same confidence, something entirely wrong. That's not a bug — it's expected behavior. Understanding why explains a lot about how to use these systems productively without getting burned by them.

What generative AI actually is (and what it isn't)

Generative artificial intelligence is a branch of machine learning focused on creating new content — text, images, audio, code, video. The name distinguishes it from classical classification or prediction AI, which answered "is this a cat or a dog?" given an input. Generative AI answers with a synthesized new output.

The term gained traction around 2022 with the mainstream arrival of LLMs (Large Language Models) via ChatGPT, but the underlying technology came earlier: the 2017 paper "Attention Is All You Need" introduced the Transformer architecture that became the foundation of virtually every relevant LLM on the market.

In 2026, the field extends well beyond LLMs. Diffusion models generate images and video (Stable Diffusion, Midjourney, Sora). Audio models synthesize music and speech. But text is still where most people first encounter the technology — and where most of the practical impact is.

How LLMs generate text

The most honest description of an LLM: a next-token prediction machine, trained on absurd quantities of text.

Token isn't the same as word. It's a text fragment — could be a whole word, part of a word, or a single character. "intelligence" might become two or three tokens depending on the tokenizer. GPT-4 uses a vocabulary of ~100k tokens; leading 2026 models work with context windows of 1 to 2 million tokens.

The generation process, simplified:

You send a prompt (a sequence of tokens)
The model computes probabilities for the next token, considering the full context via self-attention
A token is sampled from those probabilities (with temperature controlling randomness)
That token is appended to the context and the loop repeats

There's no internal script. The model doesn't have a "plan" for what it will write. It's iterative prediction, token by token — each token generated shapes the next ones.

The attention mechanism

What made Transformers better than prior recurrent networks was processing the entire sequence in parallel rather than step by step. Self-attention lets every token "look at" every other token in the context simultaneously, weighing how relevant each is.

In practice: when you ask "what is the capital of France?", the model doesn't process letters sequentially — it weighs the relationship between all tokens in your prompt at once and builds a representation that informs the response.

Training: what the model learned

Modern LLMs go through two main phases:

Pre-training: the model is exposed to massive amounts of text (web, books, code, scientific papers) and learns to predict the next token. GPT-4 was trained on an estimated several trillion tokens. 2026-era models like Claude 4, GPT-5, and Gemini 2.5 Pro were trained on even larger volumes with more carefully curated data.

RLHF (Reinforcement Learning from Human Feedback): after pre-training, the model receives human feedback on which responses are better. This aligns behavior to be more helpful, less harmful, and better at following instructions. It's what turns a "text predictor" into an "assistant."

How AI generates images

Image models work differently from LLMs. The most common today use diffusion: the model learns to remove noise from images progressively. In the reverse process, it starts from random noise and iteratively denoises, guided by the text prompt, until it produces a coherent image.

The text is converted into an embedding (a numerical vector) by a language encoder (typically CLIP or variants), and that embedding guides the denoising process. More iterations mean higher quality — which is why some models take longer on complex scenes.

In 2026, the most advanced image generation models (Flux Pro, Midjourney v7, Firefly 4) produce images that are difficult to distinguish from real photographs at resolutions up to 4K.

Real-world use cases (the ones that actually work)

After two-plus years with these tools in a development workflow, the cases where generative AI consistently delivers value:

Boilerplate and repetitive code: REST endpoint scaffolding, type conversions between languages, data transformations with well-defined schemas. The model performs well when the problem has recognizable patterns from training.

Summarization and information extraction: given a long document, extract key points, summarize into bullets, identify entities. Works well when the information is in the context — no reliance on model memory.

First drafts: email drafts, function documentation, copy variations. Output rarely goes to production unchanged, but it kills the blank page problem.

Code analysis and debugging: explaining what a piece of code does, suggesting optimizations, flagging suspicious patterns. Useful as a second pair of eyes, not as a final judge.

RAG (Retrieval-Augmented Generation): combining LLMs with search over your own database. The model responds based on retrieved documents, not just what it memorized during training. Significantly reduces hallucination for specific domains.

Limitations — where it breaks down

Hallucination

The most documented problem and the least solved. The model generates statistically plausible text, and sometimes "plausible" and "correct" diverge sharply.

2026 data shows the scale of the issue: a benchmark covering 37 models reported hallucination rates between 15% and 52% for general factual tasks. For niche or recent topics, rates climb to 35–55% in models without search access. The best models (Claude Sonnet 4.x, GPT-5) reach ~3–8% on general tasks with adequate context — but "3% error" in a response with a hundred factual claims still means three potentially wrong statements.

The root problem: the model doesn't have "knowing" or "not knowing." It has probabilities. When the correct answer has low probability in the token space from training, the model doesn't say "I don't know" — it generates whatever has the highest conditional probability, which can be plausible but wrong.

Knowledge cutoff

LLMs are trained on data up to a certain date and then "frozen." A model with a mid-2025 cutoff has no knowledge of what happened after that — unless you provide it in context or the model has access to search tools.

Context as working memory

Unlike how humans build long-term memory, LLMs only "remember" what's in the current conversation's context window. When you close the session, the model retains nothing. Systems with genuine persistent memory are still active research in 2026.

Mathematical and logical reasoning

For operations that require rigorous symbolic reasoning — mathematical proofs, complex logical inferences, arithmetic on large numbers — LLMs still fail with surprising frequency. Models with Python access via Code Interpreter work around this by executing code instead of "calculating in text."

Prompt injection

In agentic systems (where the LLM executes actions), malicious content in the context can cause the model to deviate from its original instructions. This is an active security vulnerability with incomplete mitigation.

What's happening in 2026

The 2026 landscape is not the same as 2023. A few shifts that changed practical usage:

Reasoning models: GPT-o3, Claude Sonnet with extended thinking, Gemini with deep think — models that "think out loud" before responding, reducing errors on complex problems. They cost more in time and tokens, but outperform on multi-step tasks.

Agents: LLMs connected to tools (search, code execution, APIs, external memory). They moved past "chat" and now execute workflows. The model decides which tools to use, in what order, and iterates until it completes the task.

Multimodality: the best 2026 models process text, images, audio, and video in the same context. You can send a screenshot of an error and ask for a diagnosis, or a meeting recording and ask for a summary of action items.

Alternative architectures: Mamba and other state space model-based architectures challenge Transformer dominance for long sequences, with lower computational cost in some scenarios.

For generating placeholder text, filling templates, or testing prompts with varied content, I use the Lorem Ipsum Generator — handy when you need real text before the actual content is ready.

Frequently asked questions

Are generative AI and machine learning the same thing?

Machine learning is the broader field — any system that learns patterns from data. Generative AI is a subcategory of ML focused on creating new data that follows the distribution from training. Every LLM is ML, but not all ML is generative.

Does the model "understand" what I write?

Depends on what you mean by "understand." The model processes tokens, applies attention, and generates a contextually coherent response. There's no semantic representation in the human sense — no concept of "I" processing information. What looks like comprehension is the result of statistical correlations at very large scale. Whether that counts as understanding is an open philosophical question.

Can LLMs be used in production for critical decisions?

With RAG, guardrails, and a human in the loop: yes, for many cases. Without those safeguards in high-stakes decisions (medical, legal, financial): no. Hallucination rates in even the best models rule out autonomous use where errors have serious consequences.

What's the practical difference between the main models?

In 2026, the frontier models (Claude, GPT-5, Gemini 2.5 Pro) have similar capabilities on general benchmarks, with differences in: context window size, cost per token, latency, multimodal capabilities, and data privacy policy. The practical choice depends on use case, volume, and where your data can flow.

The model doesn't know what it doesn't know — that's the thing to keep in mind

Every time an LLM responds with confidence about something outside its training or data not in the current context, it's extrapolating. The output looks solid because it was optimized to look solid — that's what RLHF produces.

Use LLMs for what they do well: synthesis, generation, format transformation, information triage that you then verify. Don't delegate the verification task to the same model that generated the information. And when the output is going to production, treat it as a draft that needs review — regardless of how confident the model sounded.

Author

Rafael Duarte

Desenvolvedor backend com passagem por fintech e SaaS B2B — trabalhou em times que escalaram APIs de zero a milhões de requisições. Carrega cicatrizes de produção suficientes para ter opiniões fortes sobre ferramentas, padrões e decisões de arquitetura. Não é acadêmico: leu a RFC do UUID quando precisou escolher entre v4 e v7 para uma tabela de alta escrita.

View profile