Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We build the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
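To make the filter-and-recall idea concrete, here is a minimal sketch in Python. It is not the FineMath pipeline: `toy_math_score` is a hypothetical keyword counter standing in for the trained classifier, and the thresholds, the `(domain, text)` page format, and the per-domain recall rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, List, Set, Tuple

Page = Tuple[str, str]  # (domain, text); a hypothetical page format

# Hypothetical cue words; a real setup would use a trained classifier instead.
MATH_CUES = ("=", "therefore", "solve", "proof", "equation")


def toy_math_score(text: str) -> int:
    """Count how many cue strings appear in the text (score from 0 to 5)."""
    lowered = text.lower()
    return sum(1 for cue in MATH_CUES if cue in lowered)


def filter_and_recall(
    pages: List[Page],
    score_fn: Callable[[str], float],
    keep_threshold: float = 3,
    domain_share: float = 0.75,
) -> Tuple[List[Page], Set[str]]:
    """Keep high-scoring pages and flag domains worth extracting again.

    A domain is flagged when at least `domain_share` of its pages clear
    `keep_threshold`. Both values are illustrative assumptions.
    """
    kept: List[Page] = []
    total = defaultdict(int)
    good = defaultdict(int)
    for domain, text in pages:
        total[domain] += 1
        if score_fn(text) >= keep_threshold:
            kept.append((domain, text))
            good[domain] += 1
    recall_domains = {d for d in total if good[d] / total[d] >= domain_share}
    return kept, recall_domains


SAMPLE_PAGES: List[Page] = [
    ("mathsite.org", "Solve the equation x + 2 = 5, therefore x = 3."),
    ("mathsite.org", "Proof: solve 2x = 4, therefore x = 2."),
    ("news.com", "Local weather today is sunny."),
    ("news.com", "Solve the equation 3y = 9."),
]

kept_pages, domains_to_recall = filter_and_recall(SAMPLE_PAGES, toy_math_score)
```

In an iterative setup, the flagged domains would feed a second extraction pass, and the classifier would be re-applied to the newly recovered pages.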
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains compared to the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and are working on a blog post walking you through the entire dataset creation process. Stay tuned!
Increasingly, LLMs are becoming very useful for scaling annotation tasks such as labelling and filtering. Combined with structured generation, this can be a very scalable way to do pre-annotation without requiring a large team of human annotators.
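As a rough illustration of why structured output matters for pre-annotation, the sketch below parses a model's JSON reply into a typed record and rejects anything off-schema. Note that this validates output after the fact, which is a simpler stand-in for true structured generation (constraining decoding to a schema). The label set, the `Annotation` fields, and `fake_llm` are all hypothetical; a real setup would call an actual model.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical label set for a simple filtering task.
ALLOWED_LABELS = {"math", "not_math"}


@dataclass
class Annotation:
    label: str
    confidence: float


def parse_annotation(raw: str) -> Optional[Annotation]:
    """Return an Annotation if `raw` is valid JSON matching the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    confidence = data.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Annotation(label=label, confidence=float(confidence))


def fake_llm(text: str) -> str:
    """Hypothetical stand-in for a model call: labels text containing digits as math."""
    label = "math" if any(ch.isdigit() for ch in text) else "not_math"
    return json.dumps({"label": label, "confidence": 0.9})


def pre_annotate(
    texts: List[str], llm: Callable[[str], str]
) -> List[Tuple[str, Optional[Annotation]]]:
    """Annotate each text; entries with None can be routed to human review."""
    return [(text, parse_annotation(llm(text))) for text in texts]


results = pre_annotate(["2 + 2 = 4", "Nice weather today"], fake_llm)
```

Records that fail validation come back as `None`, which gives a natural queue for the smaller set of items that still need a human annotator.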
- 1M public posts from Bluesky's firehose API
- Includes text, metadata, and language predictions
- Perfect to experiment with using ML for Bluesky
Excited to see people build more open tools for a more open social media platform!
The Bluesky AT Protocol unlocks exciting possibilities:
- Building custom feeds using ML
- Creating dashboards for data exploration
- Developing custom models for Bluesky
To gather Bluesky resources on the Hub, I've created a community org: https://huggingface.co/bluesky-community
My first rather modest contribution is a dashboard that shows the number of posts every second. Drinking straight from the firehose API
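The core of a posts-per-second counter can be sketched in a few lines of Python. This is not the dashboard's code: it assumes each post arrives with a Unix timestamp in seconds (a hypothetical event format) and simply buckets timestamps by whole second.

```python
from collections import Counter
from typing import Dict, Iterable


def posts_per_second(timestamps: Iterable[float]) -> Dict[int, int]:
    """Map each whole second to the number of posts stamped within it.

    Assumes non-negative Unix timestamps in seconds (hypothetical format).
    """
    counts = Counter(int(ts) for ts in timestamps)
    # Return a plain dict ordered by time for easy plotting.
    return dict(sorted(counts.items()))


# Hypothetical sample: three posts in second 100, one in second 101.
sample = [100.10, 100.55, 100.90, 101.20]
rates = posts_per_second(sample)
```

A live version would consume events from the firehose as they arrive and update the counts incrementally instead of working on a fixed list.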
- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, distilabel, and Lighteval to generate an evaluation dataset and evaluate models.