arXiv:2410.06526

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Published on Oct 9, 2024

Abstract

In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), which minimizes the influence of domain-specific knowledge so that models' reasoning abilities can be evaluated more accurately in out-of-distribution scenarios. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), which spans five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench measures how effectively models apply newly introduced rule descriptions to solve novel rule-driven questions, revealing that even top-performing models such as Claude-3.5-Sonnet and GPT-4o reach only 58.96% and 58.00% accuracy, respectively. We conduct detailed analyses using Stepwise Prompting to identify bottlenecks in the Cipher task and find that two rounds of Self-Correction yield the best results. Complex Task Processing evaluates model performance across three integrated tasks. We also examine the impact of Tricks on the Puzzle task and visualize rule-focused attention to better understand model behavior. We hope KOR-Bench will serve as a valuable resource for improving models' reasoning capabilities and fostering further research in this field.
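
To make the evaluation setup concrete, the sketch below outlines how a rule-driven question with a fixed number of Self-Correction rounds might be scored. It is an illustration of the idea rather than the paper's implementation: `query_model`, `evaluate_rule_task`, and the toy Caesar-shift example are all assumptions introduced here, not part of KOR-Bench.

```python
# A minimal sketch (not the authors' released code) of the rule-driven
# evaluation loop described in the abstract: the model receives a newly
# defined rule, applies it to a question, and is then given a fixed
# number of Self-Correction rounds. `query_model` is a hypothetical
# stand-in for whatever chat-model API is being benchmarked.

def query_model(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def evaluate_rule_task(rule: str, question: str, gold_answer: str,
                       correction_rounds: int = 2) -> bool:
    """Ask the model to apply a novel rule, then self-correct."""
    prompt = (
        f"Rule: {rule}\n"
        f"Question: {question}\n"
        "Apply the rule above to answer the question."
    )
    response = query_model(prompt)
    # The abstract reports that two rounds of Self-Correction work best
    # on the Cipher task; each round asks the model to re-check its own
    # previous answer against the rule, without revealing the gold answer.
    for _ in range(correction_rounds):
        response = query_model(
            prompt
            + f"\nYour previous answer was: {response}\n"
            "Re-read the rule, verify your answer step by step, "
            "and output a corrected final answer."
        )
    return gold_answer in response

# Toy usage in the spirit of the Cipher category (illustrative only):
#   rule = "Encrypt by shifting each letter forward by 2 in the alphabet."
#   evaluate_rule_task(rule, "Encrypt the word 'kor'.", "mqt")
```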
