Unbundling AI: A List of Public Training Data Deals

TL;DR: Media owners, social platforms, and content libraries are signing large multi-year licenses with AI providers. Most public deals involve OpenAI and cover both training and display with attribution. Reddit disclosed about $203 million in aggregate data licensing value. Press reports place the News Corp deal at more than $250 million over five years. Perplexity is pushing revenue-share grounding programs with live attribution. This guide catalogs the main deals, explains deal types, and lists the terms leaders should check when evaluating new offers.

What counts as a training data deal

Working definition: A multi-year license that grants an AI provider rights to use a content owner’s archives, feeds, or API for one or both of the following:

- Model training: improving models with historical or ongoing data
- Grounding and display: using the content in product answers with attribution and links

Three common scopes today:

- Display with attribution
- Training or data access
- Hybrid that covers both, often with product collaboration

Why it matters: Licensed, high-quality corpora improve reliability and reduce legal risk. They can also make a company’s data more visible in AI answers, either directly through guaranteed inclusion via licensed feeds or indirectly through stronger credibility signals. Terms such as refresh cadence, correction flows, and attribution UX shape AI search visibility and referral traffic, so companies working on AI Search Optimization should pay close attention to these deals.

Confirmed publisher deals with OpenAI

OpenAI holds the broadest public portfolio across news and magazines.

| Publisher | Announced | Scope | Public notes |
| --- | --- | --- | --- |
| News Corp | May 2024 | Training and display | Multi-year access to current and archived content with attribution. Reports suggest more than $250 million over five years; the figure is not confirmed. |
| Associated Press | Jul 2023 | Training and collaboration | Licensed portions of the AP text archive. Financials undisclosed. |
| Axel Springer | Dec 2023 | Training and display | ChatGPT can summarize and link with attribution. Training use referenced. |
| Financial Times | Apr 2024 | Display and product work | FT content appears in ChatGPT with attribution. Training scope less explicit. |
| Dotdash Meredith | May 2024 | Training and display | Product and ad collaboration. Industry reporting cites a fixed component of about $16 million per year. |

Pattern: Archives plus live feeds exchanged for compensation, attribution, and product collaboration.

More OpenAI publisher partners

- Vox Media and The Atlantic: content surfacing and product work
- Condé Nast: display and training, plus early SearchGPT tests
- Time: about a century of archives, with links and product collaboration
- Le Monde, Prisa Media, and Future plc: multi-year licenses with attribution

Beyond OpenAI: other provider plays

- Perplexity x Gannett, plus a broader Publisher Program that shares revenue on grounded answers and shows live attribution
- Meta AI x Reuters for news summarization and links
- The New York Times x Amazon for licensing to Amazon AI products

Social and developer platforms

Reddit x OpenAI and Google
- Licensed API and structured content for training and live use
- Press coverage puts the Google deal near $60 million per year
- Reddit disclosed about $203 million in aggregate data licensing value in filings
Stack Overflow x OpenAI and Google
- API and data to surface vetted answers with attribution inside assistants
- Financials undisclosed

Images, video, and music

Shutterstock x OpenAI
- Six-year license for images, video, music, and metadata for training
- Priority access to product integrations
Getty Images
- Active litigation against some image generators, alongside separate licensing options
- No single public OpenAI training license announcement
Major labels
- Negotiations toward AI licensing frameworks and possible micropayments
- Many terms remain private or in flight

Deal structures and money

Rights scope: display, training, hybrid, or grounding with revenue share.
Payment patterns: fixed fees, fixed plus variable usage, or pure revenue share.
Operational terms to watch: refresh cadence, correction and retention rules, attribution UX, opt outs, and API integration details.
Reality check: most contracts are confidential; public dollar figures are directional.
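
The three payment patterns above can be compared with a little arithmetic. The sketch below is illustrative only: the `PaymentTerms` fields, the per-citation rate, and every dollar amount are hypothetical assumptions, not terms from any real contract.

```python
from dataclasses import dataclass


@dataclass
class PaymentTerms:
    """Hypothetical payment terms for one contract year (amounts in USD)."""
    fixed_fee: float = 0.0      # flat annual license fee
    per_use_rate: float = 0.0   # variable fee per attributed citation
    revenue_share: float = 0.0  # fraction of attributable revenue paid out


def annual_payout(terms: PaymentTerms, citations: int, attributable_revenue: float) -> float:
    """Return the total payout for one year under the given terms."""
    variable = terms.per_use_rate * citations
    share = terms.revenue_share * attributable_revenue
    return terms.fixed_fee + variable + share


# Illustrative numbers only; none come from a real contract.
fixed_only = PaymentTerms(fixed_fee=5_000_000)
fixed_plus_variable = PaymentTerms(fixed_fee=2_000_000, per_use_rate=0.5)
pure_revenue_share = PaymentTerms(revenue_share=0.25)

scenarios = [
    ("fixed fee", fixed_only),
    ("fixed plus variable", fixed_plus_variable),
    ("pure revenue share", pure_revenue_share),
]

for name, terms in scenarios:
    payout = annual_payout(terms, citations=4_000_000, attributable_revenue=12_000_000)
    print(f"{name}: ${payout:,.0f}")
```

Rerunning the loop with lower and higher usage assumptions shows how much of the payout under each structure depends on traffic the content owner does not control.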

Gaps, standards, and signals to monitor

Vendors with fewer public media licenses: Anthropic and xAI.
Standardization attempts: efforts like Really Simple Licensing remain early.
Curated data brokers: Bright Data, Scale AI, and others are part of the supply chain but sit outside one-to-one publisher deals.
Legal pressure: lawsuits and settlements continue to shape negotiation leverage and appetite for explicit licenses.

FAQ

Training vs display vs grounding
Training improves the model with archives. Display shows attributed excerpts and links. Grounding uses licensed content in real time to inform answers with citation.

Do licenses stop unlicensed use?
They reduce risk for covered content. Enforcement and opt-out mechanisms vary by vendor. Many publishers pair deals with active legal strategies.

How big are these deals?
From single-digit millions per year to nine-figure multi-year packages. Only a few numbers are public. Treat press figures as directional.
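
To compare deals of different lengths, it helps to put the reported figures on a per-year basis. The sketch below uses only the press figures cited in this guide and assumes, as a simplification, that a multi-year total is spread evenly across its term. The News Corp amount is reported as "more than" $250 million, so its annual figure is a floor.

```python
# Figures reported in press coverage and cited in this guide, in millions of USD.
# They are directional estimates, not confirmed contract values.
REPORTED = [
    # (deal, reported amount, years the amount covers)
    ("News Corp x OpenAI", 250.0, 5),
    ("Dotdash Meredith x OpenAI", 16.0, 1),
    ("Reddit x Google", 60.0, 1),
]


def annualized(amount_musd: float, years: int) -> float:
    """Spread a reported amount evenly across its term (a simplifying assumption)."""
    return amount_musd / years


# Rank deals from largest to smallest annualized value.
ranked = sorted(
    ((deal, annualized(amount, years)) for deal, amount, years in REPORTED),
    key=lambda row: row[1],
    reverse=True,
)

for deal, per_year in ranked:
    print(f"{deal}: about ${per_year:.0f}M per year")
```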

Who is most active?
OpenAI holds the largest set of public publisher deals. Perplexity leads on grounding with revenue share. Meta, Amazon, and Google have selective agreements. Anthropic and xAI have fewer public media licenses.

