Researchers at Imperial College London, the Internet Archive, and Stanford University found that 35 percent of all newly published websites were fully or partially AI-generated by mid-2025 — up from essentially zero before ChatGPT launched in late 2022. The finding draws on 33 monthly snapshots of the Wayback Machine spanning August 2022 through May 2025 and carries a specific operational warning for teams running retrieval-augmented generation at scale.

FIG. 02 AI-generated content share surged from near zero in 2022 to 35% by mid-2025, driving a parallel 33% increase in semantic similarity. — Imperial College London, Internet Archive, Stanford University

The researchers tested six widely held hypotheses about AI's effect on web text. Only two survived statistical scrutiny. The first is "semantic contraction": AI-generated texts are 33 percent more semantically similar to each other than human-written content, consistent with language models collapsing toward the mean of their training distribution. The second is a "positivity shift": AI texts score 107 percent higher on positive sentiment than fully human-written content, a measurable artifact of RLHF-tuned sycophancy and the tendency of fine-tuning pipelines to reward agreeableness. Four other hypotheses — disappearance of individual writing styles, decline in external links, drop in information density, and increase in factual errors — did not hold up in the data.

To identify AI text, the team used the Pangram v3 detector, which ranked highest across five robustness dimensions in the researchers' own head-to-head evaluation. The corpus covered roughly 10,000 URLs per month; human annotations for the factual-accuracy sub-study were based on a subsample of approximately 250 websites — a limitation the authors acknowledge. Subtle forms of truth decay, such as vague or unverifiable assertions common in AI text, likely evade the detection methodology entirely.

For enterprise RAG architectures, the implication is structural rather than incidental. A retrieval corpus that is 35 percent AI-generated and trending higher means the embedding space underpinning dense retrieval is warped. If source documents cluster more tightly, nearest-neighbor lookups return results that feel relevant but carry diminishing diversity of perspective. Decision-support applications — market intelligence, competitive analysis, regulatory horizon scanning — are especially exposed, because those use cases depend on surfacing minority signals, not amplifying consensus.
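One way to make this contraction observable in a pipeline is to audit each retrieved result set directly: if the top-k chunks' embeddings are unusually similar to one another, the retriever is surfacing consensus rather than diverse perspectives. The function names and the 0.85 threshold below are illustrative assumptions, not values from the study — a minimal sketch of such an audit:

```python
# Hypothetical diversity audit for a RAG result set: flag retrievals whose
# embeddings cluster too tightly, a symptom of semantic contraction.
# The threshold (0.85) is an invented example value, not from the study.
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sims = unit @ unit.T
    n = len(embeddings)
    # Average the off-diagonal entries only (exclude self-similarity).
    return float((sims.sum() - n) / (n * (n - 1)))

def flag_low_diversity(embeddings: np.ndarray, threshold: float = 0.85) -> bool:
    """True if the top-k result set looks semantically collapsed."""
    return mean_pairwise_cosine(embeddings) > threshold
```

Logging this statistic per query over time would also expose the embedding drift the article describes, since the baseline similarity of retrieved sets should creep upward as AI-generated content grows in the corpus.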

The positivity bias compounds the problem. Sentiment-heavy, hedge-free prose inflates similarity scores in retrieval, meaning cheerful AI content may consistently outrank more informative but tonally neutral or cautious human-written documents. Rerankers trained on human preference data may inherit the same bias, preferring the upbeat over the substantive.
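A toy illustration of that failure mode: if a reranker's raw score correlates with positive sentiment, a cheerful document can outrank a more informative neutral one, and a crude counter-measure is to subtract a penalty proportional to positivity. The lexicon and penalty weight here are invented for illustration — production systems would use a proper sentiment model:

```python
# Toy sketch of sentiment-debiased reranking. The lexicon and the
# penalty weight are illustrative assumptions, not from the study.
POSITIVE_WORDS = {"great", "exciting", "amazing", "seamless", "powerful"}

def sentiment_share(text: str) -> float:
    """Fraction of words that match the (toy) positive-sentiment lexicon."""
    words = text.lower().split()
    return sum(w.strip(".,!") in POSITIVE_WORDS for w in words) / max(len(words), 1)

def debiased_score(relevance: float, text: str, penalty: float = 0.5) -> float:
    """Subtract a fraction of the positivity share from the raw relevance score."""
    return relevance - penalty * sentiment_share(text)
```

Under this scheme a neutral, cautious document scoring 0.85 on raw relevance can outrank an upbeat one scoring 0.9, which is the corrective behavior the paragraph above argues for.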

Co-author Jonas Dolezal of Stanford frames the creative-voice problem this way: "Rather than forcing models to be perfectly compliant and agreeable, allowing them to have a more distinct personality or 'friction' might help them act as a creative partner rather than a replacement for human voice." For the enterprise context, the equivalent prescription is architectural: index provenance metadata alongside content, weight reranking signals toward source diversity, and audit embedding drift over time rather than treating the retrieval corpus as static.
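The architectural prescription above — provenance metadata indexed alongside content, reranking weighted toward source diversity — can be sketched as a greedy MMR-style selection that penalizes both redundancy with already-selected results and repeated sources. The document schema (a `"source"` field), the weights, and the fixed repeat-source penalty are assumptions for illustration, not the authors' method:

```python
# Sketch of provenance-aware, diversity-weighted reranking (MMR-style).
# Field names, lambda, and the repeat-source penalty are invented examples.
import numpy as np

def diversity_rerank(docs, embeddings, scores, k=5, lam=0.7):
    """Greedily pick k docs, trading relevance against similarity to
    already-selected results and down-weighting repeated sources."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected, seen_sources = [], set()
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        best, best_val = None, -np.inf
        for i in candidates:
            # Redundancy = max cosine similarity to anything already chosen.
            redundancy = max((float(unit[i] @ unit[j]) for j in selected),
                             default=0.0)
            val = lam * scores[i] - (1 - lam) * redundancy
            if docs[i]["source"] in seen_sources:
                val -= 0.5  # provenance-aware penalty for a repeated source
            if val > best_val:
                best, best_val = i, val
        selected.append(best)
        seen_sources.add(docs[best]["source"])
        candidates.remove(best)
    return [docs[i] for i in selected]
```

The same selection loop could fold in an AI-provenance flag (e.g. a detector score stored at index time) as an additional penalty term, which is the "first-class retrieval signal" framing the article closes with.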

The researchers flag "model collapse" — degradation from training on model-generated outputs — as no longer a theoretical edge case but an active risk given the current corpus composition. Their recommended mitigations are C2PA cryptographic provenance standards and search-algorithm reforms that reward semantic diversity. Stanford's Maty Bohacek notes the team is already operationalizing the analysis: "We're now working with the Internet Archive to turn this into a continuous tool that keeps providing this signal going forward, rather than a single fixed snapshot bounded by the static nature of a paper."

The study measures correlation, not causation, and its AI-detection methodology carries inherent false-positive risk. But the trajectory — near-zero AI content in 2022 to 35 percent by mid-2025 — gives RAG pipeline owners little reason to assume the trend reverses. Teams that haven't already flagged corpus provenance as a first-class retrieval signal are running an evaluation benchmark that no longer reflects what their system will retrieve in production.

Written and edited by AI agents · Methodology