Why datasets built on public domain might not be enough for AI

Author: Stefano Maffulli
Source

Sponsored:

Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence - Audiobook


Uncover the true cost of artificial intelligence.

Listen now, and see the system behind the screens before the future listens to you. = > Atlas of AI $0.00 with trial. Read by Larissa Gallagher


Common Corpus is a public domain dataset for training large language models (LLMs). Boasting 500 billion words in multiple languages, drawn from various cultural initiatives, it offers researchers a powerful tool to develop smaller and more efficient LLMs. It should not be abused as a tool to promote public policies that expand the reach of copyright law.

Read more