Cottention: Linear Transformers With Cosine Attention
Abstract
Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
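The rearrangement described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation or their CUDA kernel: the function names are hypothetical, it assumes a single head with no batching, and it omits any output scaling or stabilization the paper may apply. It only shows that once softmax is replaced by cosine similarity (row-normalized queries and keys), the product can be reassociated so the n-by-n score matrix is never formed, and that the causal case reduces to a recurrence over a fixed-size state.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def cosine_attention_quadratic(q, k, v):
    """Reference form: materializes the full (n, n) cosine-similarity matrix."""
    scores = l2_normalize(q) @ l2_normalize(k).T   # (n, n)
    return scores @ v                               # (n, d_v)

def cosine_attention_linear(q, k, v):
    """Reassociated form: K^T V is computed first, so the largest
    intermediate is (d, d_v) and does not grow with sequence length."""
    kv = l2_normalize(k).T @ v                      # (d, d_v)
    return l2_normalize(q) @ kv                     # (n, d_v)

def cosine_attention_causal_quadratic(q, k, v):
    """Causal reference: position t only sees positions <= t."""
    scores = np.tril(l2_normalize(q) @ l2_normalize(k).T)
    return scores @ v

def cosine_attention_causal_recurrent(q, k, v):
    """Causal form as a recurrence with a fixed-size (d, d_v) hidden state."""
    qn, kn = l2_normalize(q), l2_normalize(k)
    state = np.zeros((q.shape[-1], v.shape[-1]))
    out = np.empty((q.shape[0], v.shape[-1]))
    for t in range(q.shape[0]):
        state += np.outer(kn[t], v[t])   # accumulate k_t v_t^T
        out[t] = qn[t] @ state           # read out with the current query
    return out

# Illustrative sizes only; these are assumptions, not values from the paper.
rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 8
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))

full_matches = np.allclose(
    cosine_attention_quadratic(q, k, v), cosine_attention_linear(q, k, v)
)
causal_matches = np.allclose(
    cosine_attention_causal_quadratic(q, k, v),
    cosine_attention_causal_recurrent(q, k, v),
)
```

The reassociation is valid only because no row-wise nonlinearity such as softmax sits between the score matrix and the multiplication by V. In the recurrent form the state has shape (d, d_v) regardless of how many tokens have been processed, which is the constant inference memory the abstract refers to.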
Community
Hi @gmongaras , congrats on this work.
I opened https://github.com/gmongaras/Cottention_Transformer/issues/4 since I saw models are currently on Google Drive, and they could be linked to the paper instead :)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time (2024)
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling (2024)
- Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices (2024)
- ELASTIC: Efficient Linear Attention for Sequential Interest Compression (2024)
- EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models (2024)
I think this is vanilla linear attention with a normalization on the keys and queries. This was studied extensively a few years ago, e.g. https://arxiv.org/abs/2102.11174