Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation Jun 20, 2024 • 12
Synthetic dataset generation techniques: generating custom sentence similarity data May 23, 2024 • 16
Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia? May 7, 2024 • 7
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20, 2024 • 71
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 28
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
Hovorte po slovensky? Help build better AI for Slovak! We only need 90 more annotations to include Slovak in the next Hugging Face FineWeb2-C dataset (data-is-better-together/fineweb-c) release! Your contribution will help create better language models for 5+ million Slovak speakers. Annotate here: data-is-better-together/fineweb-c. Read more about why we're doing it: https://huggingface.co/blog/davanstrien/fineweb2-community
Introducing FineWeb-C, a community-built dataset for improving language models in ALL languages. Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages. 318 annotators, 32K+ annotations, 12 languages - and growing! data-is-better-together/fineweb-c
synthetic-data-generation-demos A collection of demos for various approaches to synthetic data generation
Genstruct 7B • Runtime error • 8
Instruction Synthesizer • Running on Zero • 84
Magpie • Running on Zero • 71
Bonito • Running on Zero • 7
sentence-transformers-from-synthetic-data Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model
bigcode/self-oss-instruct-sc2-exec-filter-50k Viewer • Updated Nov 4, 2024 • 50.7k • 257 • 94
davanstrien/similarity-dataset-sc2-8b Viewer • Updated May 30, 2024 • 2.32k • 48 • 6
davanstrien/code-prompt-similarity-model Sentence Similarity • Updated May 29, 2024 • 27 • 6
davanstrien/abstract-wiki Viewer • Updated Jun 11, 2024 • 5k • 46 • 1
davanstrien/document-classifier-convnextv2-tiny-22k-224 Image Classification • Updated Oct 23, 2024 • 6
davanstrien/document-classifier-convnextv2-tiny-1k-224 Image Classification • Updated Oct 23, 2024 • 5
davanstrien/document-classifier-deit_base_patch16_224 Image Classification • Updated Oct 23, 2024 • 3
davanstrien/fineweb-edu-llama3-annotations-sample-5-ratings-100-raw Viewer • Updated Nov 18, 2024 • 100 • 67
davanstrien/fineweb-edu-llama3-annotations-pairs-data-sample-ranked-raw Viewer • Updated Nov 14, 2024 • 248 • 44