Zephyr story sources, mentioned in the hf.co/thomwolf tweet x.com/Thom_Wolf/status/1720503998518640703:

- HuggingFaceH4/zephyr-7b-beta (Text Generation • Updated Oct 16, 2024 • 310k • 1.63k)
- mistralai/Mistral-7B-v0.1 (Text Generation • Updated Jul 24, 2024 • 3.66M • 3.51k)
- stingning/ultrachat (Viewer • Updated Feb 22, 2024 • 774k • 1.39k • 428)
- openbmb/UltraFeedback (Viewer • Updated Dec 29, 2023 • 64k • 1.35k • 344)
A little guide to building Large Language Models in 2024. Resources mentioned by @thomwolf in https://x.com/Thom_Wolf/status/1773340316835131757:

- Yi: Open Foundation Models by 01.AI (Paper • 2403.04652 • Published Mar 7, 2024 • 62)
- A Survey on Data Selection for Language Models (Paper • 2402.16827 • Published Feb 26, 2024 • 4)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research (Paper • 2402.00159 • Published Jan 31, 2024 • 61)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (Paper • 2306.01116 • Published Jun 1, 2023 • 32)