Ksenia Se (Kseniase)

Posts

10 Free Comprehensive Datasets for Supervised Fine-Tuning

The quality, size, and relevance of a dataset directly impact the effectiveness of fine-tuning and how well models perform in real-world applications. With so many datasets available for different tasks, it can be challenging to choose the one that best suits your purposes.

So today, we invite you to explore the top 10 free datasets for natural language processing and math (a minimal loading sketch follows each list below):

1. fka/awesome-chatgpt-prompts offers a wide variety of prompts for use with ChatGPT. Over 700 models have been trained on this dataset.

2. HuggingFaceFW/fineweb from Hugging Face includes 15T tokens of cleaned and deduplicated English web data. It’s suitable for LLM training, benchmarking, and model validation.

3. HuggingFaceFW/fineweb-2 is a multilingual follow-up to FineWeb, with high-quality pretraining data covering over 1,000 languages.

4. O1-OPEN/OpenO1-SFT contains Chinese and English data and can be used for Chain-of-Thought activation.

5. yahma/alpaca-cleaned is a curated version of the original Alpaca Dataset released by Stanford.

6. lmsys/lmsys-chat-1m contains 1 million real-world conversations with 25 state-of-the-art LLMs and supports diverse use cases, such as content moderation, safety benchmarks, and training instruction-following models.

7. allenai/dolma from Allen AI includes 3T tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
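
All of the text corpora above can be pulled straight from the Hub with the Hugging Face datasets library. Here is a minimal sketch, not a full pipeline: the sample-10BT config and the instruction/input/output column names are taken from the respective dataset cards, so verify them against the current cards before building on this.

```python
from datasets import load_dataset

# FineWeb is ~15T tokens, so stream a documented sample subset
# rather than downloading the full corpus.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)
print(next(iter(fineweb))["text"][:100])

# alpaca-cleaned rows carry "instruction", "input", and "output";
# SFT pipelines commonly merge them into prompt/completion pairs.
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

def to_pair(row):
    prompt = row["instruction"]
    if row["input"]:  # "input" is empty for many rows
        prompt += "\n\n" + row["input"]
    return {"prompt": prompt, "completion": row["output"]}

sft_pairs = alpaca.map(to_pair, remove_columns=alpaca.column_names)
print(sft_pairs[0]["prompt"][:100])
```

The prompt/completion layout is just one common convention (trainers such as TRL's SFTTrainer accept it directly); use whatever format your training loop expects.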

Math datasets:

1. HuggingFaceTB/finemath consists of educational math content and has two versions: 34B tokens and 54B tokens.

2. amphora/QwQ-LongCoT-130K provides long Chain-of-Thought data for training o1-like LLMs.

3. openai/gsm8k contains grade-school math word problems for training multi-step reasoning.
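
For the math sets, the pattern is the same. A small sketch for GSM8K, whose worked answers end in a "#### <number>" line that serves as the gold label (config and field names per the dataset card):

```python
from datasets import load_dataset

# GSM8K ships two configs ("main" and "socratic"); each row has a
# "question" and a step-by-step "answer" ending in "#### <number>".
gsm8k = load_dataset("openai/gsm8k", "main", split="train")

def final_answer(row):
    # The gold label is the text after the "####" marker.
    return row["answer"].split("####")[-1].strip()

print(gsm8k[0]["question"])
print("gold:", final_answer(gsm8k[0]))
```
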
15 Agentic Systems and Frameworks of 2024

This year, we started our “AI Agents and Agentic Workflows” series (https://www.turingpost.com/t/AI-Agents) to explore everything about AI agents step by step: all the vocabulary, how they work, and how to build them.
The strong interest in this series and the large number of studies on agents showed that this was one of the most popular and important themes of the year. In 2025, agents will most likely reach new heights, and we will keep covering that for you. For now, let’s review the agentic systems that emerged this year.

Here is a list of 15 agentic systems and frameworks of 2024:

1. GUI Agents: A Survey (2412.13501)

2. Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level (2411.03562)

3. The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (2408.06292)

4. MALT: Improving Reasoning with Multi-Agent LLM Training (2412.01928)

5. Agent S: An Open Agentic Framework that Uses Computers Like a Human (2410.08164)

6. Automated Design of Agentic Systems (2408.08435)

7. AgentInstruct: Toward Generative Teaching with Agentic Flows (2407.03502)

8. AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant (2410.18603)

9. WALL-E: World Alignment by Rule Learning Improves World Model-based LLM Agents (2410.07484)

10. Generative Agent Simulations of 1,000 People (2411.10109)

11. DynaSaur: Large Language Agents Beyond Predefined Actions (2411.01747)

12. PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking (2410.12375)

13. Generative World Explorer (2411.11844)

14. Bel Esprit: Multi-Agent Framework for Building AI Model Pipelines (2412.14684)

15. AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions (2410.20424)

Thanks for reading Turing Post!
Subscribe to receive new posts straight into your inbox -> https://www.turingpost.com/subscribe
