Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Abstract
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Embedding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving robustness and adaptability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques.
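As a rough illustration of the two equivariance properties named above, here is a minimal sketch (not the paper's actual architecture; `contextual_pe_update`, `Wq`, `Wk`, and all dimensions are made-up names, and the mixing is bidirectional rather than causal for simplicity). It updates per-token positional states with weights computed from content only, then numerically checks that the update commutes with orthogonal transformations of the positional states and with permutations of the sequence.

```python
import torch

torch.manual_seed(0)

def contextual_pe_update(x, pe, Wq, Wk):
    """Mix per-token positional states `pe` using attention weights computed from content `x`."""
    q = x @ Wq                                                   # (seq, d_k)
    k = x @ Wk                                                   # (seq, d_k)
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (seq, seq), content-only weights
    return attn @ pe                                             # (seq, d_pos) updated positional states

seq, d_model, d_k, d_pos = 8, 16, 16, 4
x  = torch.randn(seq, d_model)
pe = torch.randn(seq, d_pos)
Wq = torch.randn(d_model, d_k)
Wk = torch.randn(d_model, d_k)

# Orthogonal equivariance: rotating the positional states commutes with the update.
Q, _ = torch.linalg.qr(torch.randn(d_pos, d_pos))
print(torch.allclose(contextual_pe_update(x, pe, Wq, Wk) @ Q,
                     contextual_pe_update(x, pe @ Q, Wq, Wk), atol=1e-5))  # True

# Permutation equivariance: permuting the sequence permutes the output identically.
perm = torch.randperm(seq)
print(torch.allclose(contextual_pe_update(x, pe, Wq, Wk)[perm],
                     contextual_pe_update(x[perm], pe[perm], Wq, Wk), atol=1e-5))  # True
```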
Community
We propose a framework that generalizes positional encoding through contextualization and an equivariance-preserving design, which we demonstrate to be effective on long-context retrieval and arithmetic reasoning tasks.
The figure compares the dot-product patterns of RoPE with our positional embeddings (TAPE), demonstrating that our method achieves more evenly distributed long-range attention patterns, whereas RoPE tends to emphasize token locality.
Code is available at https://github.com/VITA-Group/TAPE.
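For readers who want to reproduce the RoPE side of that comparison in a few lines, here is a small standalone sketch (written for this page, not taken from the TAPE repository; the `rope` helper, dimensions, and offsets are illustrative). It checks that RoPE attention scores depend only on the relative offset, and prints the expected self-similarity 2·Σᵢ cos(m·θᵢ), whose shrinking values are one way to see the locality emphasis that TAPE's contextualized encodings aim to relax.

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of `x` by angles pos * theta_i (vanilla RoPE)."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # theta_i, shape (d/2,)
    cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
    x1, x2 = x[0::2], x[1::2]
    out = torch.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q, k = torch.randn(64), torch.randn(64)

# Property 1: the attention score depends only on the relative offset m - n.
s_near = rope(q, 10) @ rope(k, 3)       # offset 7
s_far  = rope(q, 110) @ rope(k, 103)    # offset 7, shifted by 100
print(torch.allclose(s_near, s_far, atol=1e-2))  # True (up to float32 rounding)

# Property 2: the expected score between a random vector and a rotated copy of itself,
# E[rope(v, m) @ rope(v, 0)] = 2 * sum_i cos(m * theta_i), shrinks (oscillating, not
# strictly monotonically) as the offset m grows -- RoPE's bias toward nearby tokens.
d = 64
freqs = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
for m in (0, 1, 8, 64, 512):
    print(m, round((2 * torch.cos(m * freqs).sum()).item(), 2))
```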
Are there passkey accuracy evaluations compared to other positional encoding methods when generalizing to unseen lengths?
We only tested length extrapolation on the arithmetic task and did not re-run passkey retrieval at unseen lengths. We may add that evaluation if there is enough interest.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training (2024)
- Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models (2024)
- Anchor Attention, Small Cache: Code Generation with Large Language Models (2024)
- Selective Attention: Enhancing Transformer through Principled Context Control (2024)
- LLMs are Also Effective Embedding Models: An In-depth Overview (2024)
- Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings (2024)
- Hymba: A Hybrid-head Architecture for Small Language Models (2024)