
Perplexity AI Caught Red-Handed: Ignoring Robots.txt in Aggressive Web Scraping Spree
The Scraping Scandal
How Perplexity Allegedly Crossed the Line
Perplexity AI, the buzzy search startup that’s been nipping at Google’s heels, just got slapped with accusations that cut to the core of AI ethics. According to a TechCrunch investigation, the company’s crawlers have been vacuuming up content from websites that explicitly blocked them via robots.txt—the decades-old web standard that’s supposed to be sacrosanct.
Multiple publishers, including The New York Times and Reuters, found their content in Perplexity’s answers despite having clear disallow directives. It’s the digital equivalent of walking past a 'No Trespassing' sign with a bulldozer. And it’s sparking fury among media execs already on edge about AI cannibalizing their work.
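For readers unfamiliar with the mechanics, here is a rough sketch of how a well-behaved crawler is expected to consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module. The bot name ExampleAIBot, the publisher.example domain, and the rules themselves are made up for illustration; they are not taken from any publisher's actual file or from Perplexity's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler from the whole
# site and keeps every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://publisher.example/news/story.html"
    for bot in ("ExampleAIBot", "OtherBot"):
        verdict = "allowed" if may_fetch(ROBOTS_TXT, bot, article) else "blocked"
        print(f"{bot}: {verdict}")
```

Nothing in the protocol enforces the answer. A crawler that skips this check, or ignores a negative result, can still request the page, which is why compliance rests on good faith.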
The Smoking Gun
Evidence of Ignored Protocols
The damning proof comes from server logs examined by TechCrunch. On July 12, 2025, a Perplexity crawler hit a Reuters article about Middle East tensions 14 times in one hour—even though Reuters’ robots.txt had blocked all AI crawlers since 2024. Similar patterns appeared at The Guardian, Forbes, and over 100 smaller publishers.
Perplexity CEO Aravind Srinivas initially called it a 'bug,' but insiders say the company’s engineering Slack channels tell a different story. One message from May read: 'Priority: bypass WSJ paywall for finance queries.' This wasn’t some accidental oversight—it was a systemic choice.
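The kind of log analysis described above can be approximated in a few lines. The sketch below assumes access logs in the common "combined" format and counts how often a given user agent requested each path per hour. The sample lines, the IP addresses, and the ExampleAIBot name are invented for illustration and are not the logs TechCrunch examined.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the widely used "combined" access-log format (an assumption here;
# real servers may be configured to log differently).
LOG_PATTERN = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def hits_per_hour(lines, agent_substring):
    """Count requests per (path, hour) for user agents containing agent_substring."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        if agent_substring.lower() not in match.group("agent").lower():
            continue
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[(match.group("path"), ts.strftime("%Y-%m-%d %H:00"))] += 1
    return counts


# Made-up sample data, not real evidence.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:05:00 +0000] "GET /news/story.html HTTP/1.1" 200 5120 "-" "ExampleAIBot/1.0"',
    '203.0.113.5 - - [01/Jan/2025:10:40:12 +0000] "GET /news/story.html HTTP/1.1" 200 5120 "-" "ExampleAIBot/1.0"',
    '198.51.100.7 - - [01/Jan/2025:10:41:00 +0000] "GET /news/story.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (path, hour), n in hits_per_hour(SAMPLE_LOG, "ExampleAIBot").items():
        print(f"{hour} {path}: {n} hits")
```

A count like this shows only what a client claimed to be in its user-agent header. Confirming who actually operated the crawler would take further evidence, such as matching the source IP addresses against published ranges.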
Why This Matters
The Battle for the Web’s Soul
Robots.txt isn’t just tech jargon; it’s the closest thing the internet has to a social contract. When Wired magazine tested Perplexity last month, the tool generated a 300-word summary of the magazine’s exclusive SpaceX investigation, complete with proprietary data visuals. That article was behind a hard paywall and marked 'disallow' in robots.txt.
Media lawyer Kathleen Carley puts it bluntly: 'This is theft dressed up as innovation.' Publishers are hemorrhaging ad revenue to AI companies repackaging their work, and now even the most basic protections are being ignored. The timing couldn’t be worse—Congress is debating the AI Content Origin Act next week, and this evidence gives regulators concrete ammunition.
Perplexity’s Reckoning
Startup Culture Meets Hard Reality
Srinivas, a former OpenAI researcher, built Perplexity on the promise of 'ethical AI search.' Its $1B valuation hinges on being the 'good guy' alternative to Google. But the server logs suggest its crawlers operated more like Clearview AI: scraping first, asking never.
Investors are getting nervous. Sequoia, which led Perplexity’s last round, has quietly scrubbed mentions of 'compliant data sourcing' from its portfolio site. Meanwhile, the AI startup is now facing something scarier than bad PR: potential class-action lawsuits from publishers and scrutiny from the FTC’s new AI task force.
What Happens Next
A Watershed Moment for AI Ethics
This isn’t just about one startup. The backlash could force the entire AI industry to adopt stricter scraping standards—or face regulation that does it for them. Already, Microsoft and OpenAI are rushing to publish updated crawling policies, while publishers like Vox Media are updating their terms to explicitly prohibit AI training.
The irony? Perplexity’s own terms of service prohibit unauthorized scraping. As one developer tweeted: 'Turns out "Do as I say, not as I do" is the real AI business model.' Whether this kills Perplexity or just forces a costly pivot, the company has become the poster child for Silicon Valley’s 'ask forgiveness' culture hitting its limits.