BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks Jun 18, 2024 • 43
Long-context LLMs Struggle with Long In-context Learning Paper • 2404.02060 • Published Apr 2, 2024 • 36
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Paper • 2412.19723 • Published 8 days ago • 67