Technical SEO checklist for new sites

robots.txt, sitemap, canonical, hreflang, and structured data — what to audit before launch and why great content won't rank without a solid technical foundation.

Rafael Duarte

TECHNICAL EDITOR

Published

Jun 28, 2026

Reading time

9 min

You read the SEO basics, understood how Google indexes pages, and already know that title tags and meta descriptions matter. Then, two months after launch, Search Console and a site crawl tell a different story: 40% of URLs with indexing issues, hreflang reciprocity errors, invalid structured data. The content is solid. The problem is technical, and technical problems have checklists.

If you're still at the beginning — understanding how crawling, indexing, and ranking work — the post SEO for beginners covers that ground. Here the focus is what comes after: the configuration layer Google reads before a single word of your content.

robots.txt: what to block (and what to never block)

robots.txt lives at https://yourdomain.com/robots.txt and is the first file a well-behaved crawler requests. The most common mistake on new sites: the file is copied from staging to production with Disallow: / still in it, and the entire site is blocked from crawling with no visible warning.

Basic structure:

User-agent: *
Disallow: /admin/
Disallow: /checkout/
Disallow: /search?
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

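The explicit Allow: / is redundant here, because anything that isn't disallowed is crawlable by default. Allow earns its place when a broad Disallow needs an exception, for example Disallow: /account/ combined with Allow: /account/help/. Google and other crawlers that follow RFC 9309 apply the most specific (longest) matching rule; some older parsers apply rules in file order, so list the exception first if you need to support them.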
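One way to catch a leftover staging Disallow: / before launch is to parse the file and ask whether the paths you care about are crawlable. The sketch below uses Python's standard urllib.robotparser. The STAGING and PRODUCTION strings and the MUST_BE_CRAWLABLE paths are illustrative assumptions, not files from a real site, and this parser applies rules in file order, which can differ from Google's longest-match behavior when rules overlap.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt: str, paths: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the paths that `agent` may not crawl according to `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


# Hypothetical file contents: what staging usually ships vs. what production needs.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\nDisallow: /checkout/\n"

# Hypothetical paths that must stay crawlable after launch.
MUST_BE_CRAWLABLE = ["/", "/blog/first-post", "/assets/site.css"]

staging_blocked = blocked_paths(STAGING, MUST_BE_CRAWLABLE)
production_blocked = blocked_paths(PRODUCTION, MUST_BE_CRAWLABLE)

print("staging blocks:", staging_blocked)
print("production blocks:", production_blocked)
```

Run as part of a deploy check, a non-empty list for the production file is a reason to stop the release.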

What to block:

Admin and login pages
Internal search results with parameters (/search?q=, ?sort=, ?page=)
Staging environments if they share the same domain (ideally use a password-protected subdomain)
Cart and checkout pages on e-commerce sites

What to never block:

CSS and JavaScript that Google needs to render the page
Images that appear in content
The canonical URL of any page you want indexed

By 2026, robots.txt has another dimension: AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot fetch your site for training data and to cite in answers, and Google-Extended is a robots.txt token (not a separate crawler) that controls whether Google may use your content for its AI models. If you don't want your content feeding language models but still want to appear in traditional search, you need specific directives:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /

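This separation works because a crawler follows the most specific User-agent group that names it and falls back to User-agent: * only when no group does. Compliance is voluntary, so these directives only stop bots that honor robots.txt.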
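To see how per-bot groups resolve, the directives above can be fed to Python's standard urllib.robotparser. This is a minimal sketch: the parser approximates how a compliant crawler reads the file, and the list of agents checked is an assumption chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Googlebot
Allow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each user agent to whether it may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


permissions = crawl_permissions(
    ROBOTS_TXT, ["GPTBot", "ClaudeBot", "Googlebot", "PerplexityBot"]
)
for agent, allowed in permissions.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

PerplexityBot comes back as allowed: it has no group of its own and the file has no User-agent: * group, so a bot you did not name is not restricted.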

XML sitemap: what Google actually uses

A sitemap is a list of URLs you want Google to know about. What most people miss: of the optional fields, <lastmod> is the only one Google uses, and only when it is consistently accurate. If you don't update lastmod when content changes, or bump it when nothing changed, Google has less reason to trust it and may be slower to revisit the page.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/products/widget-x</loc>
    <lastmod>2026-05-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

Rules most people ignore:

Only URLs with 200 status and no noindex. Putting URLs that return 301, 404, or have noindex in the sitemap sends a contradictory signal to Google.
Canonical URLs, not variants. If /product?color=blue has a canonical pointing to /product, put /product in the sitemap — not the parameterized variant.
<priority> and <changefreq> are optional, and Google's sitemap documentation states that it ignores both. <lastmod> is the field that matters, provided it reflects real changes.
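The rules above can be turned into a filter that runs before the sitemap is written. The sketch below assumes you already have crawl data for each page; the Page fields and the EXAMPLE_PAGES entries are hypothetical, and only <loc> and <lastmod> are emitted.

```python
from dataclasses import dataclass
from datetime import date
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    url: str
    status: int          # HTTP status returned by the URL
    noindex: bool        # True if the page carries a noindex directive
    canonical: str       # canonical URL declared by the page
    last_modified: date  # date of the last real content change


def belongs_in_sitemap(page: Page) -> bool:
    """Keep only indexable pages that are their own canonical."""
    return page.status == 200 and not page.noindex and page.canonical == page.url


def build_sitemap(pages: list[Page]) -> str:
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        if not belongs_in_sitemap(page):
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page.url
        ET.SubElement(url, "lastmod").text = page.last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


BASE = "https://yourdomain.com"
EXAMPLE_PAGES = [
    Page(f"{BASE}/products/widget-x", 200, False, f"{BASE}/products/widget-x", date(2026, 5, 15)),
    Page(f"{BASE}/old-page", 301, False, f"{BASE}/old-page", date(2026, 1, 10)),
    Page(f"{BASE}/thank-you", 200, True, f"{BASE}/thank-you", date(2026, 2, 1)),
    Page(f"{BASE}/product?color=blue", 200, False, f"{BASE}/product", date(2026, 3, 3)),
]

sitemap_xml = build_sitemap(EXAMPLE_PAGES)
print(sitemap_xml)
```

Only the first example page survives the filter: the others are a redirect, a noindex page, and a parameterized variant whose canonical points elsewhere.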

For large sites (over 50,000 URLs), use a sitemap index — a master file referencing multiple sitemaps by category or content type. Limit per sitemap: 50,000 URLs or 50 MB uncompressed.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-blog.xml</loc>
    <lastmod>2026-06-13</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-products.xml</loc>
    <lastmod>2026-06-13</lastmod>
  </sitemap>
</sitemapindex>
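Splitting a large URL list under the 50,000-URL limit and producing the matching index can be sketched as follows. The sitemap-N.xml naming is an assumption made for brevity; splitting by category or content type, as in the example above, works the same way.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_SITEMAP) -> list[list[str]]:
    """Split `urls` into consecutive chunks of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(base_url: str, chunk_count: int, lastmod: str) -> str:
    """Build an index that references sitemap-1.xml .. sitemap-N.xml (hypothetical names)."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for number in range(1, chunk_count + 1):
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/sitemap-{number}.xml"
        ET.SubElement(sitemap, "lastmod").text = lastmod
    body = ET.tostring(index, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Illustrative data: 120,000 generated URLs.
all_urls = [f"https://yourdomain.com/p/{n}" for n in range(120_000)]
chunks = chunk_urls(all_urls)
index_xml = build_sitemap_index("https://yourdomain.com", len(chunks), "2026-06-13")
print([len(chunk) for chunk in chunks])
```

The sketch only enforces the URL count; the 50 MB uncompressed limit still needs a separate size check on each generated file.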

Canonical: when to use it and when you're getting it wrong

The <link rel="canonical"> tag solves the duplicate content problem by telling Google which is the primary version of a URL. Sounds simple. The errors are subtle.

Problem 1: canonical pointing to a URL with a redirect. If the canonical points to /product and that URL 301-redirects to /products/widget, Google has to reconcile two conflicting hints (your canonical and the redirect target) and may consolidate signals to a URL you did not intend. Canonical should always point to the final URL with a 200 status.

Problem 2: canonical + noindex on the same page. If you put noindex on /product?color=blue but that page has a canonical pointing to /product, you're sending two conflicting signals. noindex says "don't index this page." Canonical says "consolidate signals to this other URL." Google may simply ignore the canonical.

Problem 3: missing self-referencing canonical. Every page should have a canonical tag pointing to itself, even when there are no duplicates. Without it, UTM parameters (?utm_source=newsletter) can accidentally create duplicates in the index.

<!-- On /en/tools/qr-code-generator -->
<link rel="canonical" href="https://yourdomain.com/en/tools/qr-code-generator" />
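A self-referencing canonical is easy to verify in a crawl script. The sketch below uses Python's standard html.parser to read the canonical from a page and compares it with the requested URL after stripping utm_ parameters. The status labels ("self", "other", "missing", "multiple") are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def strip_tracking(url: str) -> str:
    """Drop utm_* query parameters and any fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_status(page_url: str, html: str) -> str:
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(finder.canonicals) > 1:
        return "multiple"
    return "self" if finder.canonicals[0] == strip_tracking(page_url) else "other"


HTML = '<head><link rel="canonical" href="https://yourdomain.com/en/tools/qr-code-generator" /></head>'
status = canonical_status(
    "https://yourdomain.com/en/tools/qr-code-generator?utm_source=newsletter", HTML
)
print(status)
```

A page reached through a newsletter link still reports "self", which is the point of the tag: the tracking variant consolidates to the clean URL.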

Hreflang: the reciprocity error nobody notices

If your site serves multiple languages or regions, hreflang is mandatory — and it's the part of technical SEO with the highest rate of incorrect implementation.

The fundamental rule: hreflang is bidirectional. If the /en/ version references the /pt/ version, the /pt/ version must reference the /en/ version back. If a page points to an alternate that doesn't point back, Google may ignore the annotation for that pair, so every page in the cluster should reference all the others.

<!-- On /en/tools/qr-code-generator -->
<link rel="alternate" hreflang="en" href="https://yourdomain.com/en/tools/qr-code-generator" />
<link rel="alternate" hreflang="pt" href="https://yourdomain.com/pt/ferramentas/gerador-qr-code" />
<link rel="alternate" hreflang="es" href="https://yourdomain.com/es/herramientas/generador-qr" />
<link rel="alternate" hreflang="x-default" href="https://yourdomain.com/en/tools/qr-code-generator" />

x-default is the URL for users whose language has no specific version on the site — usually points to the English version or a language selection page.

Beyond <head>, you can implement hreflang via HTTP header (useful for PDFs and non-HTML files) or via sitemap. For large sites, the sitemap approach is easier to maintain — you're not dependent on every template having the correct hreflang.

What to verify:

Each URL in the cluster references all others (including itself)
Language codes use ISO 639-1, optionally followed by an ISO 3166-1 alpha-2 region: pt, pt-BR, pt-PT, en, en-US, not portuguese or brazil
URLs in hreflang return 200, not 301
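The first check in the list is mechanical once the annotations are collected. The sketch below assumes a dictionary that maps each page URL to the hreflang annotations declared on that page (gathered from the HTML, headers, or sitemap); the function name and error wording are invented for this example.

```python
def hreflang_errors(annotations: dict[str, dict[str, str]]) -> list[str]:
    """Report missing self-references and missing return links.

    `annotations` maps a page URL to {hreflang code: alternate URL}
    exactly as declared on that page.
    """
    errors = []
    for page, alternates in annotations.items():
        targets = set(alternates.values())
        if page not in targets:
            errors.append(f"{page}: missing self-reference")
        for target in sorted(targets - {page}):
            declared_on_target = annotations.get(target)
            if declared_on_target is None:
                errors.append(f"{page}: {target} has no annotations")
            elif page not in declared_on_target.values():
                errors.append(f"{page}: {target} does not link back")
    return errors


EN = "https://yourdomain.com/en/tools/qr-code-generator"
PT = "https://yourdomain.com/pt/ferramentas/gerador-qr-code"
ES = "https://yourdomain.com/es/herramientas/generador-qr"

FULL_CLUSTER = {"en": EN, "pt": PT, "es": ES, "x-default": EN}

# The Spanish template forgot the Portuguese alternate.
broken = {
    EN: FULL_CLUSTER,
    PT: FULL_CLUSTER,
    ES: {"en": EN, "es": ES, "x-default": EN},
}

for error in hreflang_errors(broken):
    print(error)
```

The check does not validate language codes or HTTP status; those need the code list and a live request respectively.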

Structured data: it's not just about rich results

Structured data with Schema.org isn't just an optional optimization for getting star ratings in the SERP. In 2026, with AI Overviews, ChatGPT Search, and Perplexity citing content, Schema.org markup is one of the most explicit ways to tell machines who you are, what you offer, and why you're a trustworthy source.

The minimum vocabulary for any site:

Organization / WebSite:

{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "name": "Quick Tools",
  "url": "https://quickeasy.tools",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "https://quickeasy.tools/en/tools?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}

For blog articles:

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Technical SEO checklist for new sites",
  "datePublished": "2026-06-13",
  "dateModified": "2026-06-13",
  "author": {
    "@type": "Person",
    "name": "Rafael Duarte"
  }
}

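For FAQs: the FAQPage type is still valid Schema.org markup, but don't count on the rich result. Since August 2023 Google has shown expanded FAQ results mainly for well-known, authoritative government and health sites, so for most sites the value of FAQPage is in making questions and answers machine-readable, not in extra SERP real estate.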

Validate structured data with Google's Rich Results Test before going live. Paste a URL or a code snippet and it reports the rich result types it detected along with any errors. It takes about thirty seconds.
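Before reaching for an external validator, a local pre-check can catch the most basic failures: JSON-LD that doesn't parse, or a block without @context and @type. The sketch below is an assumption about what a useful minimum looks like; it does not replace the Rich Results Test, and it does not look inside @graph containers.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def jsonld_problems(html: str) -> list[str]:
    extractor = JsonLdExtractor()
    extractor.feed(html)
    if not extractor.blocks:
        return ["no JSON-LD block found"]
    problems = []
    for number, raw in enumerate(extractor.blocks, start=1):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"block {number}: invalid JSON ({exc.msg})")
            continue
        for item in data if isinstance(data, list) else [data]:
            for key in ("@context", "@type"):
                if not isinstance(item, dict) or key not in item:
                    problems.append(f"block {number}: missing {key}")
    return problems


GOOD = '<script type="application/ld+json">{"@context": "https://schema.org", "@type": "Article"}</script>'
NO_TYPE = '<script type="application/ld+json">{"@context": "https://schema.org"}</script>'
TRAILING_COMMA = '<script type="application/ld+json">{"@type": "Article",}</script>'

for sample in (GOOD, NO_TYPE, TRAILING_COMMA):
    print(jsonld_problems(sample))
```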

Indexability: what to check before launch

With everything configured, here's the final checklist before launching a new site:

1. Confirm the staging noindex is gone from production. This is the most common and most invisible error. The site goes live, looks fine, and Google simply doesn't index anything because <meta name="robots" content="noindex"> is still in the <head>.

2. Check HTTP headers. Beyond the meta tag, an X-Robots-Tag: noindex in the HTTP header has the same effect. It might be configured at the server or CDN level without appearing in the HTML.

curl -I https://yourdomain.com/important-page
# Look for: X-Robots-Tag: noindex

3. Confirm the sitemap is submitted in Search Console. Search Console → Sitemaps → Add sitemap. Without this, Google can take weeks to discover your URLs through organic crawling.

4. Test JavaScript rendering. If your site is an SPA or uses partial SSR, content rendered by JavaScript may not be reaching the index. In Search Console, URL Inspection → "Test live URL" → "View tested page" shows the screenshot and rendered HTML that Google actually sees.

5. Check redirect chains. Long redirect chains (A → B → C → D) waste crawl budget and add latency to every request, and Googlebot gives up after a limited number of hops (Google documents following up to 10). The ideal is a direct redirect from A to D.
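Checks 1, 2, and 5 lend themselves to automation. The sketch below works on data you have already fetched (response headers, HTML, and a map of known redirects), so it runs offline; the function names, the source labels, and the default max_hops of 10 are assumptions for this example. It only looks for the literal noindex directive.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if tag == "meta" and name in ("robots", "googlebot"):
            self.directives.append((attributes.get("content") or "").lower())


def noindex_sources(headers: dict[str, str], html: str) -> list[str]:
    """List every place a noindex directive was found."""
    sources = []
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            sources.append("X-Robots-Tag header")
    finder = RobotsMetaFinder()
    finder.feed(html)
    if any("noindex" in directive for directive in finder.directives):
        sources.append("meta robots tag")
    return sources


def redirect_chain(start: str, redirects: dict[str, str], max_hops: int = 10) -> list[str]:
    """Follow `redirects` (source -> target) from `start`; return every URL visited."""
    chain = [start]
    while chain[-1] in redirects and len(chain) <= max_hops:
        target = redirects[chain[-1]]
        if target in chain:  # redirect loop: stop instead of spinning
            break
        chain.append(target)
    return chain


found = noindex_sources(
    {"X-Robots-Tag": "noindex, nofollow"},
    '<head><meta name="robots" content="noindex"></head>',
)
chain = redirect_chain("/a", {"/a": "/b", "/b": "/c", "/c": "/d"})
print(found)
print(" -> ".join(chain), f"({len(chain) - 1} hops)")
```

Any chain longer than two entries is a candidate for collapsing into a single redirect from the first URL to the last.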

To build and test robots.txt, I use the robots.txt Generator — especially handy for generating AI bot directives without getting the syntax wrong.

Frequently asked questions

What's the difference between robots.txt and the noindex tag?

robots.txt blocks crawling: Googlebot doesn't fetch the URL. The noindex tag allows crawling but blocks indexing: Google fetches the page, reads the content, and keeps it out of the index. The critical mistake: blocking via robots.txt a URL that receives external links. Google can still index the bare URL (because links point to it) without being able to read the content, so it may appear in results with no description. And because the page is never fetched, a noindex tag on it is never seen. To keep a page out of the index, allow crawling and use noindex.

Do I need hreflang if my site is in a single language?

If your site is in one language and doesn't target different regions (e.g., Brazil vs. Portugal), you don't need hreflang. But if you have /en-us/ and /en-gb/ with different content, or any combination of language + region, hreflang is necessary — without it Google may show the wrong version to users.

Does structured data improve rankings?

Directly, no: Google has said structured data is not a ranking signal in itself. The indirect impact is real: rich results tend to increase CTR, which means more traffic from the same position. Additionally, in 2026, AI Overviews and AI-powered search engines can use structured data to interpret a page, although none of them publish exactly how they choose what to cite. For content sites, Article and FAQPage are the usual starting points.

How do I know if my hreflang is correct?

Search Console no longer has a dedicated hreflang report: the International Targeting report that used to list these errors was retired in 2022. Audit hreflang at scale before launch with a crawler such as Screaming Frog, which flags missing return links, invalid language codes, and alternate URLs that don't return 200.

Technical SEO is the floor, not the ceiling

Great content on top of a broken technical foundation doesn't rank — the crawler never even gets to the content. The right order is always: indexability first, then on-page, then authority (links). Technical SEO isn't glamorous, has no vanity metrics, and most of the work is confirming nothing is broken — not inventing new things.

The minimum checklist: correct robots.txt, sitemap updated with real lastmod, self-referencing canonical on every page, reciprocal hreflang if multilingual, valid structured data, zero accidental noindex in production. Done right, Google can do its job. What happens after that is content SEO — and that's a different post.
