Perplexity AI Caught Red-Handed: Ignoring Robots.txt in Aggressive Web Scraping Spree


The Scraping Scandal

How Perplexity Allegedly Crossed the Line

Perplexity AI, the buzzy search startup that’s been nipping at Google’s heels, just got slapped with accusations that cut to the core of AI ethics. According to a TechCrunch investigation, the company’s crawlers have been vacuuming up content from websites that explicitly blocked them via robots.txt, the decades-old, voluntary web standard that tells crawlers which pages are off-limits and that well-behaved bots are expected to treat as sacrosanct.

Multiple publishers, including The New York Times and Reuters, found their content in Perplexity’s answers despite having clear disallow directives. It’s the digital equivalent of walking past a 'No Trespassing' sign with a bulldozer. And it’s sparking fury among media execs already on edge about AI cannibalizing their work.
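For readers wondering what honoring a disallow directive actually involves, here is a minimal sketch of the check a compliant crawler might run before fetching a page, using Python’s standard urllib.robotparser module. The robots.txt contents, the bot names 'ExampleAIBot' and 'OtherBot', and the example.com URLs are all hypothetical, made up for illustration. None of it is taken from any publisher’s real file or from Perplexity’s code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is blocked from the whole
# site, while every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network call is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"
    for bot in ("ExampleAIBot", "OtherBot"):
        verdict = "may fetch" if is_allowed(ROBOTS_TXT, bot, story) else "must skip"
        print(f"{bot} {verdict} {story}")
```

The catch is that nothing in the protocol enforces the answer. The check only matters if the crawler runs it and then obeys the result, which is exactly what the publishers say did not happen.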

The Smoking Gun

Evidence of Ignored Protocols

The damning proof comes from server logs examined by TechCrunch. On July 12, 2025, a Perplexity crawler hit a Reuters article about Middle East tensions 14 times in one hour—even though Reuters’ robots.txt had blocked all AI crawlers since 2024. Similar patterns appeared at The Guardian, Forbes, and over 100 smaller publishers.

Perplexity CEO Aravind Srinivas initially called it a 'bug,' but insiders say the company’s engineering Slack channels tell a different story. One message from May read: 'Priority: bypass WSJ paywall for finance queries.' This wasn’t some accidental oversight—it was a systemic choice.

Why This Matters

The Battle for the Web’s Soul

Robots.txt isn’t just tech jargon—it’s the closest thing the internet has to a social contract. When Wired magazine tested Perplexity last month, it generated a 300-word summary of their exclusive SpaceX investigation, complete with proprietary data visuals. That article was behind a hard paywall and marked 'disallow' in robots.txt.

Media lawyer Kathleen Carley puts it bluntly: 'This is theft dressed up as innovation.' Publishers are hemorrhaging ad revenue to AI companies repackaging their work, and now even the most basic protections are being ignored. The timing couldn’t be worse—Congress is debating the AI Content Origin Act next week, and this evidence gives regulators concrete ammunition.

Perplexity’s Reckoning

Startup Culture Meets Hard Reality

Srinivas, a former OpenAI researcher, built Perplexity on the promise of 'ethical AI search.' Its $1B valuation hinges on being the 'good guy' alternative to Google. But the server logs suggest its crawlers operated more like Clearview AI: scraping first, asking never.

Investors are getting nervous. Sequoia, which led Perplexity’s last round, has quietly scrubbed mentions of 'compliant data sourcing' from its portfolio site. Meanwhile, the startup now faces something scarier than bad PR: potential class-action lawsuits from publishers and scrutiny from the FTC’s new AI task force.

What Happens Next

A Watershed Moment for AI Ethics

This isn’t just about one startup. The backlash could force the entire AI industry to adopt stricter scraping standards—or face regulation that does it for them. Already, Microsoft and OpenAI are rushing to publish updated crawling policies, while publishers like Vox Media are updating their terms to explicitly prohibit AI training.

The irony? Perplexity’s own terms of service prohibit unauthorized scraping. As one developer tweeted: 'Turns out "Do as I say, not as I do" is the real AI business model.' Whether this kills Perplexity or just forces a costly pivot, it’s become the poster child for Silicon Valley’s 'ask forgiveness' culture hitting its limits.


Turtle News