AI Was Built by Plundering Internet Content. Now Some People Want to Charge for Allowing It

AI Models and Content Scraping: A Controversial Landscape

AI models are notoriously hungry for content. To train them, companies deploy specialized crawlers such as OpenAI’s GPTBot and Googlebot, which Google uses for Gemini. These bots methodically scrape the Internet, downloading HTML pages and then extracting the clean text along with the links that lead them to further pages.
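The extraction step described above can be sketched in a few lines of Python. This is only an illustration built on the standard library's `html.parser`; the class name `TextAndLinkExtractor`, the `extract` helper, and the `example.com` URL are made up for the example, and real crawlers such as GPTBot are certainly far more elaborate.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TextAndLinkExtractor(HTMLParser):
    """Collect visible text and absolute link targets from one HTML page."""

    SKIPPED_TAGS = {"script", "style"}  # their contents are not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (clean_text, links) for a single page."""
    parser = TextAndLinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page used only to exercise the sketch.
page = """<html><head><style>p {color: red}</style></head>
<body><p>Hello <a href="/next">world</a></p>
<script>var x = 1;</script></body></html>"""

text, links = extract(page, "https://example.com/start")
print(text)
print(links)
```

A crawler would then add each extracted link to a queue of pages to visit, which is how the hunt continues from one page to the next.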

The Scraping Dilemma

Even after training, AI models often turn to web searches when they lack the information to answer a query, especially on current topics. This raises significant concerns about intellectual property. Generative AI that mimics the style of established studios like Studio Ghibli illustrates the ongoing tension between creativity and copyright infringement.

In the last three years, legal battles over copyright violations have surged. The most prominent is the lawsuit The New York Times filed against Microsoft and OpenAI for allegedly using millions of its articles to train ChatGPT. Similarly, artists have sued the companies behind image generators, such as Stability AI and Midjourney.

Legal Ramifications and Content Licensing

Media organizations and copyright associations have also targeted firms like Perplexity and Meta. By 2025, OpenAI faced so many copyright infringement claims that multiple cases were consolidated before a single court in New York to simplify the litigation.

In response to these rising legal pressures, AI companies are starting to license content more proactively. OpenAI has reached agreements with media groups like News Corp for access to their content, while Anthropic has faced a hefty settlement over its past use of copyrighted material.

The Business Impact

Scraping threatens to drain websites of traffic and revenue, and the effects reverberate across sectors. Some sites report that increased bot activity overloads their servers, degrading the experience for human visitors.

To combat this, the RSL Collective was established, creating a standard called Really Simple Licensing (RSL) that allows websites to control bot access to their content. Supported by platforms like Yahoo and Reddit, the goal is to enable web content to be licensed more transparently.

A New Model: Streaming Content Licensing

The RSL structure is akin to a content streaming model: websites can specify which parts of their content are freely available to AI models and which require payment. Doug Leeds, a founder of RSL and former CEO of Ask.com, emphasizes that the initiative aims to give websites the infrastructure to enforce their terms of use.
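To give a feel for what per-path bot rules look like in practice, here is a minimal sketch using the long-established robots.txt convention and Python's standard `urllib.robotparser`. It is not the RSL format, whose syntax this article does not describe, and it says nothing about payment; the rules, paths, and `example.com` URLs below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is kept out of /premium/,
# every other crawler may fetch anything.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: *
Allow: /
"""


def build_rules(robots_txt):
    """Parse robots.txt text from memory, without any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

premium_for_gptbot = rules.can_fetch("GPTBot", "https://example.com/premium/report")
blog_for_gptbot = rules.can_fetch("GPTBot", "https://example.com/blog/post")
premium_for_other = rules.can_fetch("OtherBot", "https://example.com/premium/report")

print(premium_for_gptbot, blog_for_gptbot, premium_for_other)
```

Rules like these only work if the crawler chooses to honor them, which is the enforcement gap Leeds is pointing at.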

Leeds proposes a flat-fee model in which AI companies pay a fixed price for content licensing and the revenue is then distributed among content owners according to usage. This could lessen the legal risks and operational costs associated with unauthorized scraping.
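The arithmetic of a usage-based split can be illustrated as follows. This is a sketch under stated assumptions, not RSL's actual payout scheme, which the article does not detail: the function name `distribute_flat_fee`, the site names, the usage counts, and the fee are all hypothetical. Amounts are kept in whole cents so that the shares always add up exactly to the fee.

```python
def distribute_flat_fee(fee_cents, usage_counts):
    """Split a flat fee among sites in proportion to their usage counts.

    Each site first receives the floor of its proportional share; any
    cents left over by rounding go to the sites with the largest remainders.
    """
    total_usage = sum(usage_counts.values())
    if total_usage == 0:
        # Nothing was used, so nothing is paid out in this sketch.
        return {site: 0 for site in usage_counts}

    shares = {
        site: fee_cents * count // total_usage
        for site, count in usage_counts.items()
    }
    leftover = fee_cents - sum(shares.values())
    by_remainder = sorted(
        usage_counts,
        key=lambda site: (fee_cents * usage_counts[site]) % total_usage,
        reverse=True,
    )
    for site in by_remainder[:leftover]:
        shares[site] += 1
    return shares


# Hypothetical figures: a 1,000.00 fee and three sites.
usage = {"site_a": 600, "site_b": 300, "site_c": 100}
payout = distribute_flat_fee(100_000, usage)
print(payout)
```

With these made-up numbers, a site that accounts for 60 percent of the recorded usage receives 60 percent of the fee.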

Emerging Technologies Against Bots

Companies like Cloudflare are developing technologies to manage bot behavior. Cloudflare’s AI Crawl Control program identifies unauthorized scraping attempts, allowing website owners to take action against them, for example by blocking the crawler.

The battle to control these bots underscores a rapidly changing internet landscape. As bot traffic surpasses human traffic, there is a growing fear of a future where bots dominate content creation, sidelining human voices.

The Future: Content Creation and Licensing

Looking ahead, Leeds believes that human-generated content is essential for democracy and cultural values. Current AI models can synthesize this information, but the need for human creativity remains paramount. As companies negotiate licensing models, ethical considerations will play a crucial role in shaping the future of content generation and consumption.
