Perplexity defensive over ignoring robots.txt and stealing data
Perplexity was discovered to be actively bypassing blocks from websites to scrape content in 2024, and a new report shows that it has continued with increasing sophistication as the company defends the practice.

Perplexity's logo surrounded by lights and flowers. Image source: Perplexity
Apple received significant blowback when it was discovered that Applebot had been crawling the web for years to gather data to train Apple Intelligence. Websites immediately blocked the bot, along with others, and the episode sparked some interesting discoveries about how AI companies operate.
A year on, and at least one company is still doing everything in its power to ignore robots.txt and scrape webpages anyway -- Perplexity. According to a report from Cloudflare, Perplexity is using several techniques to undermine the trust expected on the web and access data to train its large language models.
Cloudflare's testing was conducted by creating new websites that had never been crawled, then asking Perplexity AI about them. When Perplexity's declared crawler encountered a robots.txt file telling it not to crawl, a new bot appeared with a different user agent, a different IP address, and even a new ASN.
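For context, robots.txt is a plain text file served at a site's root that names crawlers by their user-agent token and lists the paths they may not fetch. A minimal example that blocks Perplexity's declared crawler (its published token is PerplexityBot) while allowing everyone else looks like this:

```
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
```

There is no enforcement mechanism here; the file simply states the site's wishes, and a crawler chooses whether to respect them.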
Perplexity was then able to provide information available only on the website. It was clear that Perplexity was operating this new bot, even though it was unlabeled and its IP addresses didn't appear in Perplexity's published IP ranges.
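Reputable crawler operators publish the IP ranges their bots use precisely so that sites can verify traffic claiming to come from them. A minimal sketch of that verification in Python, using reserved TEST-NET addresses as stand-ins rather than any company's real ranges, could look like this:

```python
import ipaddress

# Stand-in "published" crawler ranges (TEST-NET blocks, illustrative only;
# these are not Perplexity's or anyone's real crawler ranges)
OFFICIAL_RANGES = [ipaddress.ip_network(cidr) for cidr in (
    "192.0.2.0/24",     # TEST-NET-1
    "198.51.100.0/24",  # TEST-NET-2
)]

def is_official_crawler(ip: str) -> bool:
    """Return True if the address falls inside a published crawler range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in OFFICIAL_RANGES)

print(is_official_crawler("192.0.2.44"))   # True: inside a declared range
print(is_official_crawler("203.0.113.9"))  # False: undeclared address
```

Traffic from an undeclared address that still behaves like a company's crawler is exactly the signal Cloudflare says it observed.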
The methodology showed that Perplexity's answers were most accurate when the new bots got through. When those bots were also blocked on a fresh webpage, Perplexity's results were less specific or entirely hallucinated -- evidence that the undeclared bots were indeed feeding information back to Perplexity.
Old news, new details
Cloudflare's reporting helps reignite the attention around chatbots and how they get their data. That said, its findings, aside from the details about the new ASNs, are nearly identical to what Wired and Robb Knight covered in June 2024.
Perplexity hasn't changed its tune and, in fact, seems to be finding new ways to dodge robots.txt. The file is an exercise in trust: it asks any reputable crawler to stay away from a website's content rather than technically preventing access.
Apple, Google, OpenAI, and others honor robots.txt; Perplexity has not and does not. While robots.txt has no legal force, ignoring it makes the company look shady and untrustworthy next to its competitors.
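Honoring the file is trivial for a well-behaved crawler. Python's standard library even ships a parser; a sketch of the check a compliant bot performs before every fetch, using a hypothetical robots.txt and URL, looks like this:

```python
from urllib import robotparser

# A hypothetical robots.txt that blocks one crawler and allows the rest
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler asks before fetching; a rule-breaker simply doesn't.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))   # True
```

The point is how little effort compliance takes: one parse of the file and one boolean check per URL.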

Apple Intelligence honors robots.txt. Image source: Apple
At the very least, it damages Perplexity's reputation and may jeopardize any acquisition talks it may have had with Apple. Apple seems confident in its foundation models team and likely isn't looking for an acquisition to "save" Apple Intelligence anyway.
We asked the Perplexity AI chatbot about the situation, and it faithfully regurgitated Cloudflare's reporting, scraped from Cloudflare's own website. However, Perplexity published a surprising blog post on Monday, curiously defending the company's approach.
Perplexity fires back at Cloudflare
In an unsurprising turn of events, Perplexity has taken a defensive tack on its actions, claiming its web scraper and AI agents are two different entities. It blames Cloudflare for being unable to distinguish between the two and calls it a threat to the open web.
From Perplexity's post: "This controversy reveals that Cloudflare's systems are fundamentally inadequate for distinguishing between legitimate AI assistants and actual threats. If you can't tell a helpful digital assistant from a malicious scraper, then you probably shouldn't be making decisions about what constitutes legitimate web traffic."
These claims are ludicrous, of course. Humans navigate the free and open web, and websites not wanting their content stolen by an AI chatbot is a perfectly legitimate concern.
A recent report from 404 Media shows how AI data scrapers have ruined the internet thanks to Google no longer directing user traffic to the source. Ars Technica also published a similar report, suggesting human web traffic is way down.
The flaw in Perplexity's defense is its assumption that critics have simply mislabeled its agents as scrapers that absorb data for AI training. While Perplexity says the agents accessing websites aren't using the data for training, that defense misses the entire point of robots.txt.

Perplexity thinks semantics will save face while it destroys the open web. Image source: Perplexity
Websites that tell automated crawlers of any kind to stay away aren't doing it only because of ethical concerns about AI training; they're doing it to protect their livelihoods. If a user never has to visit a website to get its information, the human-run website will wither and die.
What Perplexity doesn't understand is that without the human-run web, its AI will be useless. If all the humans go out of business, there will be nothing left to scrape.
It doesn't matter that the data isn't stored or used for training; the AI agent isn't generating revenue for the site or respecting its business model. Perplexity is actively, aggressively, and proudly building bots that are systematically tearing down the open web in the name of justice and freedom.
The blog post attempts to undermine Cloudflare's authority, suggesting it was either malicious clickbait or incompetence that resulted in the report. In the end, the company's public response is an embarrassment and goes against everything it claims to want to preserve.
Apple's part in all this
When Apple revealed Apple Intelligence, it also shared that Applebot had played a part in scraping the web for freely available information to train its foundation models. Apple was clear that it abided by robots.txt, though that promise rang hollow, since websites had allowed Applebot believing it was only indexing data for Siri and Spotlight.

Apple has to stay away from AI controversy while it races ahead.
The reaction was immediate -- many websites updated their robots.txt files to block Apple's scraper and other AI bots. That, along with threatened legal action from Forbes, brought increased attention to AI data collection.
Apple has repeated consistently that it only uses ethically sourced data. While the Applebot situation was unfortunate, those horses are out of the barn, and Apple has shown considerable restraint in a world full of ethically questionable AI companies.
Apple's unique approach brings a combination of local models, private cloud models running on servers powered by renewable energy, and a promise to never train on user data or prompts. If Apple is to continue acting as a kind of ethical beacon in artificial intelligence, it's going to need to steer clear of Perplexity.
https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
https://www.businessinsider.com/ai-data-trap-catches-perplexity-impersonating-google-cloudflare-2025-8