AI crawlers force designers to choose between visibility and intellectual property

Design studios are used to protecting source files, fonts, and client decks. The awkward new question is whether they should also protect their websites from the machines that now feed on them. For a studio, that can mean a simple `robots.txt` decision with messy commercial consequences, because the same page that helps a brand get cited by AI can also become raw material for someone else’s model.

The split is blunt. Allow the crawler, and you increase the chance that your work is understood by chat tools and AI search. Block it, and you keep more control over your intellectual property. There is no free lunch here, only a choice about which side of the trade you want to live on.

GPTBot and Google-Extended do different jobs

OpenAI’s GPTBot is a real crawler. It visits public pages and gathers text and other data to help train and refine OpenAI models, including the ones behind ChatGPT. It is not supposed to wander into paywalled pages, logins, personal data, or material that breaks OpenAI’s safety rules. If you want to stop it, the instruction is plain enough: `User-agent: GPTBot` followed by `Disallow: /`.

Google-Extended works differently. It is not a separate bot roaming the web. It is a control token in `robots.txt` that tells Google how it may use content that its main crawlers have already collected. If you block it with `User-agent: Google-Extended` and `Disallow: /`, you are telling Google not to use that material for training future generations of its Gemini models, including those behind Gemini Apps and the Vertex AI API.

The important bit for designers is that neither setting changes normal Google Search rankings. Blocking GPTBot or Google-Extended does not cost you search engine visibility, because traditional indexing still depends on the usual crawlers, not the AI training layer.
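
To see that split in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The `robots.txt` body and the `example.com` URL are illustrative assumptions, not a recommended policy. Google-Extended never fetches pages itself, so checking it here only shows how the token is matched.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of AI training, leave search alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical page on a studio site.
page = "https://example.com/work/identity-systems"

for agent in ("GPTBot", "Google-Extended", "Googlebot", "Bingbot"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
```

The two AI tokens come back blocked while Googlebot and Bingbot fall through to the wildcard group and stay allowed, which is the behaviour the paragraph above describes.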

What changes and what stays put

If your team cares about SEO in the old sense, the crawler switch is not the lever. Googlebot and Bingbot still decide how your pages appear in search. If your team cares about GEO (generative engine optimisation), the calculus changes. Let the bots in, and there is a better chance that a model will later mention your studio, your service line, or your point of view when someone asks a conversational query.

That is why this is not just a technical setting. It is a distribution decision. A brand that stays visible inside AI answers can end up in conversations that pure search optimisation never reached. The upside is reach. The downside is that your hard-won copy, case studies, and methodologies can be absorbed into systems that may answer users without sending them back to your site.

The commercial trade-off is real

For a studio, the value question is not abstract. Your site might contain pricing logic, production workflows, brand positioning, or unique process notes that took years to sharpen. If those are fair game for model training, another company’s AI can learn from them without paying you. That is a raw deal if you sell expertise.

The counterargument is just as practical. If your agency publishes strong thinking, clear service pages, or niche knowledge that clients search for, AI visibility can matter. Letting GPTBot and Google-Extended read the site gives models more chances to describe your brand accurately when users ask about motion design, identity systems, 3D production, or campaign support. If the bots are blocked, your material is less likely to appear in that answer layer.

> Best for: studios that want to be cited, summarised, or recommended by AI tools that answer in prose rather than search results.

> Watch out for: premium decks, client-specific methods, private pricing, and unreleased work that should not be training fodder.
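
The choice does not have to be all or nothing. A rough sketch of a middle path, assuming hypothetical `/pricing/` and `/client-decks/` directories, keeps GPTBot out of the sensitive sections and leaves the rest readable. Bear in mind that `robots.txt` is a request to well-behaved crawlers, not access control, so anything truly private belongs behind a login.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical paths; substitute the directories your own site uses.
PARTIAL_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /pricing/
Disallow: /client-decks/

User-agent: *
Allow: /
"""

partial = RobotFileParser()
partial.parse(PARTIAL_ROBOTS_TXT.splitlines())


def gptbot_may_read(path: str) -> bool:
    """Return True if the sample policy lets GPTBot fetch this path."""
    return partial.can_fetch("GPTBot", "https://example.com" + path)


for path in ("/services/motion-design", "/pricing/retainers", "/client-decks/q3"):
    print(path, gptbot_may_read(path))
```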

Cloudflare has made the default stricter

Cloudflare has pushed this discussion out of the theory bin and into infrastructure. Its position is simple enough to read between the lines: if the web is open by default, AI scrapers will keep taking. So Cloudflare flips that default for its users by blocking AI training bots at the edge unless the site owner changes the setting.

Its Managed `robots.txt` feature also helps by inserting the right disallow rules automatically, which matters because bot names keep changing. Google-Extended, Applebot-Extended, and the next rebranded token are not something most small studios want to track by hand every quarter. Cloudflare’s approach saves time and, more importantly, removes the assumption that silence means consent.
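
If you maintain the file yourself, one way to make the quarterly chore smaller is to keep the token list in one place and generate the rules from it. This is a sketch of that idea only, not how Cloudflare builds its managed file. The list holds the tokens named in this article, and the function name is made up for illustration.

```python
# Tokens named in this article. Real lists grow over time, so treat this
# as a sample to extend, not a complete inventory.
AI_TRAINING_TOKENS = ["GPTBot", "Google-Extended", "Applebot-Extended"]


def render_ai_block(tokens: list[str]) -> str:
    """Build one 'User-agent / Disallow' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


print(render_ai_block(AI_TRAINING_TOKENS))
```

Adding a newly announced token then means editing one list rather than hand-writing another group.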

It goes further than polite crawlers. Some bots ignore `robots.txt`, masquerade as human traffic, or hammer pages so hard that they become a server problem. Cloudflare uses traffic signals and behaviour analysis to fingerprint that kind of traffic and cut it off before it reaches origin. For a studio with limited hosting headroom, that is not theory. That is cost control.

When you might change Cloudflare’s default

If your business actually wants AI systems to read the site for discovery, partnerships, or visibility, Cloudflare’s block-first setup has to be switched off manually. The route is straight enough:

1. Open the Cloudflare Dashboard.
2. Go to `Security > Bots`.
3. Change `Block AI training bots` to `Off` or `Do not block`.
4. Set `Instruct AI bot traffic with robots.txt` to `Disabled`.

If you only want to let one crawler through, the cleaner move is a custom WAF rule. That gives you a narrow exception instead of opening the whole door.

The bots people complain about are not all the same

The phrase “bad bot” gets thrown around too loosely. Some non-human visitors are a nuisance, but they still serve a purpose.

SEMrushBot, AhrefsBot, DotBot from Moz, and Screaming Frog can be rough on servers because they crawl aggressively and map internal structures fast. Agencies still use them for audits, keyword tracking, and competitive work.

facebookexternalhit, LinkedInBot, Pinterestbot, and Twitterbot are another category. They can hit a site repeatedly when a link is shared, but they are also the reason a post shows the right title, preview text, and image in a feed.

PerplexityBot sits in a different bucket again. It behaves like a fast live extractor, but it can still send traffic because it cites the source link in real time instead of only training on the content for later use.

The true villains are the fake identity scrapers, the ones pretending to be Googlebot while stealing pricing, scraping portfolios, or probing for weak spots. Those should be blocked hard with firewall rules and challenge logic.
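
A common way to catch a fake Googlebot is a two-step DNS check: reverse-resolve the visiting IP, confirm the hostname belongs to Google, then resolve that hostname forward and make sure it points back to the same IP. The sketch below assumes the accepted hostname suffixes shown in `GOOGLE_SUFFIXES`, which should be checked against Google's current crawler documentation. The function names are illustrative, and the forward lookup used here covers IPv4 only.

```python
import socket
from typing import Callable, Sequence

# Assumption: hostname suffixes to accept. Verify against Google's
# crawler documentation before relying on this tuple.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_dns(ip: str) -> str:
    """Return the primary hostname the IP address resolves to."""
    return socket.gethostbyaddr(ip)[0]


def forward_dns(host: str) -> Sequence[str]:
    """Return the IPv4 addresses the hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_genuine_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_dns,
    forward: Callable[[str], Sequence[str]] = forward_dns,
) -> bool:
    """Reverse lookup, suffix check, then forward lookup back to the same IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

The resolvers are passed in as parameters so the logic can be exercised with stand-in lookups before it is pointed at live DNS. A request that claims to be Googlebot in its user agent but fails this check is a reasonable candidate for the firewall and challenge rules mentioned above.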

The decision should be deliberate

A studio does not need to treat every crawler the same way. Search bots, AI training bots, social fetchers, real-time answer engines, and spoofed scrapers all have different jobs and different risks. Mixing them together is how teams end up either overblocking useful traffic or giving away more than they meant to.
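
One way to keep those groups apart when reading server logs is a small lookup keyed on user-agent substrings. The sketch below uses only the bots named in this article. The category labels are invented for illustration, and a user-agent string can be spoofed, so a match is a claim to verify rather than proof. Google-Extended and Applebot-Extended are left out on purpose because they are `robots.txt` tokens and do not appear as request user agents.

```python
# Lower-case substrings to look for in a request's User-Agent header.
CRAWLER_CATEGORIES = {
    "search": ("googlebot", "bingbot"),
    "ai_training": ("gptbot",),
    "answer_engine": ("perplexitybot",),
    "social_preview": (
        "facebookexternalhit",
        "linkedinbot",
        "pinterestbot",
        "twitterbot",
    ),
    "seo_tool": ("semrushbot", "ahrefsbot", "dotbot", "screaming frog"),
}


def classify(user_agent: str) -> str:
    """Return the first category whose marker appears in the user agent."""
    ua = user_agent.lower()
    for category, markers in CRAWLER_CATEGORIES.items():
        if any(marker in ua for marker in markers):
            return category
    return "unknown"


print(classify("Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"))
```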

For South African designers, the sensible move is to decide what the site is for. If the site exists to sell expertise, generate leads, and stay visible in AI answers, then some crawl access makes sense. If it holds proprietary methods, private assets, or sensitive commercial material, blocking is the cleaner line.

The smartest `robots.txt` is not the most paranoid one. It is the one that lets legitimate search and AI systems do their job while keeping the opportunists, clones, and data thieves out.