arxiv:2501.00712

Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding

Published on Jan 1
Submitted by lanczos on Jan 3
Authors:

Abstract

Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Embedding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
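To make the equivariance claims concrete, here is a minimal sketch (not the authors' implementation; the class name ContextualPosUpdate and all dimensions are illustrative) of a positional-embedding update conditioned only on sequence content: because the new positional vectors are content-weighted combinations of the old ones, applying a shared orthogonal transform to the positional embeddings transforms the output identically, and permuting the sequence permutes the output in the same way.

```python
# Minimal sketch (not the TAPE implementation) of a content-conditioned,
# orthogonal-equivariant positional-embedding update.
import torch
import torch.nn as nn


class ContextualPosUpdate(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) token features; p: (batch, seq, d_pos) positional embeddings.
        # Mixing weights come from content only, so they are unchanged when p is rotated.
        attn = torch.softmax(
            self.q_proj(x) @ self.k_proj(x).transpose(-1, -2) / x.shape[-1] ** 0.5, dim=-1
        )
        # New positions are content-weighted combinations of old positions:
        # replacing p with p @ Q (Q orthogonal) yields (attn @ p) @ Q.
        return attn @ p


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ContextualPosUpdate(d_model=16)
    x, p = torch.randn(2, 5, 16), torch.randn(2, 5, 8)
    Q, _ = torch.linalg.qr(torch.randn(8, 8))  # random orthogonal matrix
    assert torch.allclose(layer(x, p @ Q), layer(x, p) @ Q, atol=1e-5)  # orthogonal equivariance
```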

Community

Paper author and submitter

We propose a framework that generalizes positional encoding through contextualization and an equivariant design, and demonstrate its effectiveness on long-context retrieval and arithmetic reasoning tasks.

Figure (vis_dp.png): comparison of the dot-product patterns of RoPE and our positional embeddings (TAPE). Our method produces more evenly distributed long-range attention patterns, whereas RoPE tends to emphasize token locality.
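The RoPE locality pattern referenced in the caption can be reproduced in a few lines: for an all-ones query and key, the relative-position score under standard RoPE reduces to 2 · Σ_i cos(m·θ_i), which shrinks (with oscillation) as the distance m grows. The snippet below only illustrates the RoPE baseline; TAPE's patterns depend on the sequence content and require the released code.

```python
# Quick check of RoPE's locality bias (baseline only; unrelated to the TAPE code).
import numpy as np


def rope_score(d: int, m: int, base: float = 10000.0) -> float:
    # Dot product <R_0 q, R_m k> for q = k = all-ones under standard RoPE,
    # which simplifies to 2 * sum_i cos(m * theta_i) with theta_i = base**(-2i/d).
    theta = base ** (-np.arange(0, d, 2) / d)
    return float(2.0 * np.cos(m * theta).sum())


# The score shrinks (with oscillation) as the relative distance m grows,
# i.e. RoPE favors nearby tokens.
for m in (0, 1, 4, 16, 64, 256, 1024):
    print(m, round(rope_score(64, m), 2))
```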

Code is available at https://github.com/VITA-Group/TAPE.

Are there passkey-retrieval accuracy evaluations against other positional encoding methods when generalizing to unseen lengths?


We only tested length extrapolation on the arithmetic tasks and did not additionally evaluate passkey retrieval. We may consider adding it if there is enough interest.
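For anyone who wants to run that comparison themselves, a common recipe is to hide a random passkey inside filler text at several context lengths and measure exact-match retrieval. The sketch below only builds the prompts and scores the answers; generate_fn stands in for whichever model is being evaluated, and none of this is taken from the TAPE repository.

```python
# Sketch of a passkey-retrieval length-generalization probe (illustrative only;
# `generate_fn` is a placeholder for any model's generation call).
import random


def make_passkey_prompt(context_len_words: int, passkey: int) -> str:
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    needle = f"The pass key is {passkey}. Remember it."
    words = (filler * (context_len_words // len(filler.split()) + 1)).split()[:context_len_words]
    insert_at = random.randint(0, len(words))  # hide the needle at a random position
    text = " ".join(words[:insert_at] + needle.split() + words[insert_at:])
    return text + "\nWhat is the pass key? The pass key is"


def passkey_accuracy(generate_fn, lengths=(1_000, 4_000, 16_000), trials=20):
    scores = {}
    for n in lengths:
        hits = 0
        for _ in range(trials):
            key = random.randint(10_000, 99_999)
            hits += str(key) in generate_fn(make_passkey_prompt(n, key))
        scores[n] = hits / trials  # exact-match accuracy at this context length
    return scores
```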


