arxiv:2406.06467

How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

Published on Jun 10, 2024
Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Colin Sandon, Omid Saremi

Abstract

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address learnability. This paper puts forward the notion of the 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers: the globality degree measures the least number of tokens required, in addition to the token histogram, to correlate nontrivially with the target. As shown experimentally, and theoretically under additional assumptions, distributions with high globality cannot be learned efficiently; in particular, syllogisms cannot be composed on long chains. Furthermore, we show that (i) an agnostic scratchpad cannot help break the globality barrier; (ii) an educated scratchpad can help if it breaks the globality at each step, but not all such scratchpads generalize to out-of-distribution (OOD) samples; (iii) a notion of 'inductive scratchpad', which composes the prior information more efficiently, can both break the globality barrier and improve OOD generalization. In particular, some inductive scratchpads achieve length generalization of up to 6x on some arithmetic tasks, depending on the input formatting.
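To make the inductive-scratchpad idea concrete for an arithmetic task, here is a minimal Python sketch of a digit-by-digit addition scratchpad. This is an illustration under assumed formatting, not the paper's actual token format: the point is that every step applies the same local rule (add one digit pair plus the carry), so each step's target depends on only a few tokens of the previous state, keeping the per-step globality low.

```python
# Minimal sketch (illustrative format, not the paper's exact one):
# generating an inductive-style scratchpad for multi-digit addition.
# Each step reuses the same local rule, so the model only needs the
# previous carry and one digit pair to produce the next step.

def inductive_addition_scratchpad(a: int, b: int) -> str:
    """Return a scratchpad string computing a + b digit by digit."""
    da, db = str(a)[::-1], str(b)[::-1]  # least-significant digit first
    n = max(len(da), len(db))
    carry, result_digits, steps = 0, [], []
    for i in range(n):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        carry, digit = divmod(s, 10)  # new carry and output digit
        result_digits.append(str(digit))
        # Each step records only the local state needed for the next step.
        steps.append(f"step {i}: {x}+{y}+carry={s} -> digit {digit}, carry {carry}")
    if carry:
        result_digits.append(str(carry))
    answer = "".join(reversed(result_digits))
    return "\n".join(steps + [f"answer: {answer}"])

print(inductive_addition_scratchpad(478, 256))  # answer: 734
```

Because each step has the same form regardless of operand length, a scratchpad like this can in principle be unrolled to longer inputs than those seen in training, which is the intuition behind the length-generalization results mentioned in the abstract.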
