A team from Tencent AI wanted to evaluate agentic systems on data science (DS) tasks, but they noticed that existing agentic benchmarks were severely limited in several aspects: they were restricted to text and did not include tables or images, were specific to certain packages, only performed exact-match evaluation…
➡️ So they set out to build a much more exhaustive approach, to finally make the definitive DS agent benchmark.
𝗧𝗵𝗲 𝗗𝗦𝗕𝗲𝗻𝗰𝗵 𝗱𝗮𝘁𝗮𝘀𝗲𝘁
▪️ DSBench has 466 data analysis tasks and 74 data modelling tasks
▪️ The tasks are sourced from ModelOff and Kaggle, the platforms hosting the most popular data science competitions
▪️ Differences with previous DS benchmarks:
❶ This benchmark leverages various modalities on top of text: images, Excel files, tables
❷ Complex tables: sometimes several tables must be leveraged to answer one question
❸ The context is richer, with longer descriptions.
▪️ Evaluation metrics: the benchmark is scored with an LLM as a judge, using a specific prompt (a minimal sketch of the idea follows below).
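To make the LLM-as-a-judge idea concrete, here is a minimal sketch of how such scoring could be wired up. The prompt wording, the `call_llm` helper, and the scoring rule are illustrative assumptions, not DSBench's actual judge prompt or code.

```python
# Minimal sketch of LLM-as-a-judge scoring for a data analysis task.
# NOTE: `call_llm` is a hypothetical helper standing in for any chat-completion
# API; the prompt below is illustrative, not DSBench's actual judge prompt.

JUDGE_PROMPT = """You are grading a data analysis answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with a single word: CORRECT or INCORRECT."""


def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your judge model and return its text reply."""
    raise NotImplementedError


def judge(question: str, reference: str, candidate: str) -> bool:
    """Return True if the judge model deems the candidate answer correct."""
    reply = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    ))
    return reply.strip().upper().startswith("CORRECT")


def accuracy(samples: list[tuple[str, str, str]]) -> float:
    """Fraction of (question, reference, candidate) triples judged correct."""
    return sum(judge(*s) for s in samples) / len(samples)
```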
𝗜𝗻𝘀𝗶𝗴𝗵𝘁𝘀 𝗳𝗿𝗼𝗺 𝗲𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗻𝗴 𝗮𝗴𝗲𝗻𝘁𝘀
▪️ Their evaluation confirms that using LLMs in an agent setup, for instance by allowing them to run a single step of code execution, is more costly (especially with multi-turn frameworks like AutoGen) but also much more performant than the vanilla LLM (see the sketch of such a single-step setup after this list).
▪️ The sets of tasks solved by different models (like GPT-3.5 vs Llama-3-8B) have quite low overlap, which suggests that different models tend to try very different approaches.
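For intuition, here is a minimal sketch of what "a single step of code execution" can look like in an agent loop. The prompts, the `call_llm` placeholder, and the in-process `exec` are my own illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of a single-step code-execution agent.
# NOTE: `call_llm` is a hypothetical chat-completion helper; the prompts and the
# lack of sandboxing are illustrative assumptions, not the paper's exact setup.
import contextlib
import io


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call."""
    raise NotImplementedError


def run_python(code: str) -> str:
    """Execute generated code once and capture stdout (use a real sandbox in practice)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # no isolation here!
    except Exception as exc:
        return f"Error: {exc}"
    return buffer.getvalue()


def single_step_agent(task: str) -> str:
    # 1) Ask the model for code that solves the task.
    code = call_llm(f"Write Python code that solves this task and prints the answer:\n{task}")
    # 2) Run it once and collect the observation.
    observation = run_python(code)
    # 3) Ask the model for its final answer, given the execution output.
    return call_llm(f"Task: {task}\nCode output: {observation}\nGive the final answer.")
```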
This new benchmark is really welcome, can't wait to try transformers agents on it! 🤗