Unbundling AI: A List of Public Training Data Deals
TL;DR: Media owners, social platforms, and content libraries are signing large multi-year licenses with AI providers. Most public deals involve OpenAI and cover both training and display with attribution. Reddit disclosed about $203 million in aggregate data licensing value. Reports place the News Corp deal at more than $250 million over five years. Perplexity is pushing revenue-share grounding programs with live attribution. This guide catalogs the main deals, explains deal types, and gives leaders a checklist to evaluate new offers.
What counts as a training data deal
Working definition: A multi-year license that grants an AI provider rights to use a content owner’s archives, feeds, or API for one or both of the following:
- Model training to improve models with historical or ongoing data
- Grounding and display in product answers with attribution and links
Three common scopes today:
- Display with attribution
- Training or data access
- Hybrid that covers both, often with product collaboration
Why it matters: Licensed, high-quality corpora improve reliability and reduce legal risk. They can also make a company’s data more visible in AI answers, either directly through guaranteed inclusion via licensed feeds or indirectly through stronger credibility signals. Terms like refresh cadence, correction flows, and attribution UX shape AI search visibility and referral traffic, so companies working on AI search optimization should pay close attention to these deals.
Confirmed publisher deals with OpenAI
OpenAI holds the broadest public portfolio across news and magazines.
| Publisher | Announced | Scope | Public notes |
|---|---|---|---|
| News Corp | May 2024 | Training and display | Multi-year access to current and archived content with attribution. Reports suggest more than $250 million across five years. Figures not confirmed. |
| Associated Press | Jul 2023 | Training and collaboration | Licensed portions of the AP text archive. Financials undisclosed. |
| Axel Springer | Dec 2023 | Training and display | ChatGPT can summarize and link with attribution. Training use referenced. |
| Financial Times | Apr 2024 | Display and product work | FT content appears in ChatGPT with attribution. Training scope less explicit. |
| Dotdash Meredith | May 2024 | Training and display | Product and ad collaboration. Industry reporting cites a fixed component of about $16 million per year. |
Pattern: Archives plus live feeds exchanged for compensation, attribution, and product collaboration.
More OpenAI publisher partners
- Vox Media and The Atlantic for content surfacing and product work
- Condé Nast for display and training plus early SearchGPT tests
- Time for about a century of archives with links and product collaboration
- Le Monde, Prisa Media, Future plc for multi-year licenses with attribution
Beyond OpenAI: other provider plays
- Perplexity x Gannett and a broader Publisher Program with revenue share on grounded answers and live attribution.
- Meta AI x Reuters for news summarization and links.
- The New York Times x Amazon for licensing to Amazon AI products.
Social and developer platforms
- Reddit x OpenAI and Google
  - Licensed API and structured content for training and live use
  - Press coverage put the Google deal at about $60 million per year
  - Reddit disclosed about $203 million in aggregate data licensing value in filings
- Stack Overflow x OpenAI and Google
  - API and data to surface vetted answers with attribution inside assistants
  - Financials undisclosed
Images, video, and music
- Shutterstock x OpenAI
  - Six-year license for images, video, music, and metadata for training
  - Priority access to product integrations
- Getty Images
  - Active litigation track with some generators and separate licensing options
  - No single public OpenAI training license announcement
- Major labels
  - Negotiations toward AI licensing frameworks and possible micropayments
  - Many terms remain private or in flight
Deal structures and money
Rights scope: display, training, hybrid, or grounding with revenue share.
Payment patterns: fixed fees, fixed plus variable usage, or pure revenue share.
Operational terms to watch: refresh cadence, correction and retention rules, attribution UX, opt-outs, and API integration details.
Reality check: most contracts are confidential; public dollar figures are directional.
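To make the three payment patterns concrete, here is a minimal Python sketch of how a yearly payout could be computed under each one. The `DealTerms` fields, the per-thousand-uses rate, and every dollar amount are illustrative assumptions, not terms from any deal listed here.

```python
from dataclasses import dataclass


@dataclass
class DealTerms:
    """Hypothetical annual payment terms; every field is an illustrative assumption."""
    fixed_fee: float = 0.0         # guaranteed annual payment, USD
    rate_per_1k_uses: float = 0.0  # variable fee per 1,000 grounded answers or API calls, USD
    revenue_share: float = 0.0     # fraction of attributable revenue paid out (0 to 1)


def annual_payout(terms: DealTerms, uses: int, attributable_revenue: float) -> float:
    """Return the yearly payout implied by the terms at a given usage level."""
    if not 0.0 <= terms.revenue_share <= 1.0:
        raise ValueError("revenue_share must be between 0 and 1")
    variable = terms.rate_per_1k_uses * uses / 1000
    share = terms.revenue_share * attributable_revenue
    return terms.fixed_fee + variable + share


# Three made-up offers, one per payment pattern.
offers = {
    "fixed fee": DealTerms(fixed_fee=5_000_000),
    "fixed plus variable": DealTerms(fixed_fee=2_000_000, rate_per_1k_uses=2.0),
    "pure revenue share": DealTerms(revenue_share=0.25),
}

for name, terms in offers.items():
    payout = annual_payout(terms, uses=1_000_000_000, attributable_revenue=12_000_000)
    print(f"{name}: ${payout:,.0f} per year")
```

Running the same usage scenario through each offer shows how much of the payout depends on volume and revenue assumptions rather than on the guaranteed fee, which is the part worth stress-testing before signing.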
Gaps, standards, and signals to monitor
- Vendors with fewer public media licenses: Anthropic and xAI
- Standardization attempts: efforts like Really Simple Licensing remain early
- Curated data brokers: Bright Data, Scale AI, and others are part of the supply chain but sit outside one-to-one publisher deals
- Legal pressure: lawsuits and settlements continue to shape negotiation leverage and appetite for explicit licenses.
FAQ
Training vs display vs grounding
Training improves the model with archives. Display shows attributed excerpts and links. Grounding uses licensed content in real time to inform answers with citation.
Do licenses stop unlicensed use?
They reduce risk for covered content. Enforcement and opt-out terms vary by vendor. Many publishers pair deals with active legal strategies.
How big are these deals?
From single-digit millions per year to nine-figure multi-year packages. Only a few numbers are public. Treat press figures as directional.
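One way to compare the public figures is to put them on a per-year basis. The sketch below annualizes the reported numbers cited in this guide; the `annualized` helper is a hypothetical convenience, and the inputs are press-reported, unconfirmed figures, so the output is directional at best.

```python
def annualized(total_usd: float, years: float) -> float:
    """Spread a reported multi-year total evenly across its term."""
    if years <= 0:
        raise ValueError("years must be positive")
    return total_usd / years


# Press-reported figures cited in this guide; none are confirmed contract values.
reported_per_year = {
    "News Corp x OpenAI": annualized(250_000_000, 5),  # "more than 250 million across five years"
    "Reddit x Google": 60_000_000,                     # reported as a per-year figure
    "Dotdash Meredith x OpenAI": 16_000_000,           # reported fixed component per year
}

# Print the deals from largest to smallest annual value.
for deal, per_year in sorted(reported_per_year.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{deal}: about ${per_year / 1e6:.0f}M per year")
```

On these inputs the News Corp report works out to roughly $50 million per year, which puts it in the same range as the reported Reddit and Google figure rather than far above it.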
Who is most active?
OpenAI holds the largest set of public publisher deals. Perplexity leads on grounding with revenue share. Meta, Amazon, and Google have selective agreements. Anthropic and xAI have fewer public media licenses.