How to Create an LLM Dataset | FineWeb Overview

Video by Hugging Face via YouTube

A deep dive into how Hugging Face created the FineWeb dataset: starting from Common Crawl snapshots, extracting high-quality text from raw web data, filtering noisy content, deduplicating at web scale, and building FineWeb-Edu with model-assisted educational quality filtering.
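To make the filtering and deduplication steps above concrete, here is a minimal, self-contained Python sketch. It is a toy illustration and not Hugging Face's actual pipeline code: every function name (`passes_base_filter`, `minhash_signature`, `clean_corpus`), the word-count and alphabetic-ratio heuristics, the 3-word shingles, the 64 hash functions, and the 0.7 similarity threshold are assumptions chosen only for the example. It also compares signatures pairwise, which would not scale to web-sized data.

```python
import hashlib
import re


def passes_base_filter(text, min_words=5, min_alpha_ratio=0.6):
    """Toy quality heuristic (assumed): enough words, mostly letters."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def shingles(text, n=3):
    """Lower-cased word n-grams used as the document's feature set."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=64):
    """Keep the minimum hash value per seeded hash function."""
    feats = shingles(text)
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in feats
        ))
    return tuple(signature)


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def clean_corpus(docs, threshold=0.7):
    """Drop low-quality documents, then drop near-duplicates of kept ones."""
    kept, signatures = [], []
    for doc in docs:
        if not passes_base_filter(doc):
            continue
        sig = minhash_signature(doc)
        if any(estimated_jaccard(sig, other) >= threshold for other in signatures):
            continue
        kept.append(doc)
        signatures.append(sig)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank",
    "The quick brown fox jumps over the lazy dog near the river bank",
    "buy now",
    "Common Crawl publishes large snapshots of pages collected from the public web",
    "12345 67890 !!! ### $$$ 999 000 111",
]
cleaned = clean_corpus(docs)
```

In this sketch the repeated sentence, the two-word snippet, and the symbols-only line are dropped, leaving two documents in `cleaned`. A pipeline at web scale would typically bucket signatures (for example with locality-sensitive hashing) instead of comparing every pair.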

—
🔗 Links
– FineWeb dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb
– FineWeb-Edu dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
– FineWeb paper: https://arxiv.org/abs/2406.17557
– FineWeb blog post: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
– Common Crawl: https://commoncrawl.org/
– Trafilatura: https://trafilatura.readthedocs.io/

—
👋 Connect with me
– My website: https://alejandro-ao.com/
– X (Twitter): https://x.com/_alejandroao
– LinkedIn: https://www.linkedin.com/in/alejandro-ao/

—
🤓 Topics Covered
– FineWeb dataset creation pipeline
– Common Crawl filtering and deduplication
– FineWeb-Edu educational data filtering

—
⏱️ Timestamps
00:00 Introduction
00:52 Why FineWeb matters
02:58 Common Crawl as data source
06:10 Base filtering techniques
07:17 Deduplication within snapshots
13:05 C4-style quality filters
17:20 FineWeb-Edu extraction
21:41 Key lessons learned
24:21 Synthetic data on the web
28:39 Conclusion
