Video by Hugging Face via YouTube

A deep dive into how Hugging Face created the FineWeb dataset: starting from Common Crawl snapshots, extracting high-quality text from raw web data, filtering noisy content, deduplicating at web scale, and building FineWeb-Edu with model-assisted educational quality filtering.
—
🔗 Links
– FineWeb dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb
– FineWeb-Edu dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
– FineWeb paper: https://arxiv.org/abs/2406.17557
– FineWeb blog post: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
– Common Crawl: https://commoncrawl.org/
– Trafilatura: https://trafilatura.readthedocs.io/
—
👋 Connect with me
– My website: https://alejandro-ao.com/
– X (Twitter): https://x.com/_alejandroao
– LinkedIn: https://www.linkedin.com/in/alejandro-ao/
—
🤓 Topics Covered
– FineWeb dataset creation pipeline
– Common Crawl filtering and deduplication
– FineWeb-Edu educational data filtering
—
⏱️ Timestamps
00:00 Introduction
00:52 Why FineWeb matters
02:58 Common Crawl as data source
06:10 Base filtering techniques
07:17 Deduplication within snapshots
13:05 C4-style quality filters
17:20 FineWeb-Edu extraction
21:41 Key lessons learned
24:21 Synthetic data on the web
28:39 Conclusion