Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Abstract
We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (from images, point clouds, videos, or event camera streams), our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D sequences in language to nD structures in vision, but with only partial account of vision-specific characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets spanning 4 modalities and find that either PaPE or PaPE-RI achieves the top performance on 7 of them. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving by up to 10.5% in absolute terms over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.
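To pin down the design principles listed in the abstract, here is one common way to formalize them for a relative position term $f$ that modulates attention between tokens at positions $p_i$ and $p_j$. This formalization is ours, inferred from the abstract, not quoted from the paper:

$$
\begin{aligned}
&\text{translation invariance:} && f(p_i + t,\, p_j + t) = f(p_i, p_j) \;\Rightarrow\; f = f(\Delta p),\ \Delta p = p_i - p_j\\
&\text{rotation invariance:} && f(R p_i,\, R p_j) = f(p_i, p_j) \text{ for all rotations } R\\
&\text{distance decay:} && \text{the influence of } f(\Delta p) \text{ shrinks as } \lVert \Delta p \rVert \text{ grows}\\
&\text{directionality:} && f(\Delta p) \neq f(-\Delta p) \text{ in general}\\
&\text{context awareness:} && f \text{ may also depend on the token contents } x_i, x_j
\end{aligned}
$$

Note that strict rotation invariance forces $f$ to depend only on $\lVert \Delta p \rVert$, which rules out directionality; that the abstract attaches rotation invariance to a separate variant (PaPE-RI) suggests the two are traded off rather than combined.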
Community
Parabolic Position Encoding (PaPE)
We propose a position encoding designed from the ground up for vision modalities. It works by treating relative positions as the dependent variable in a sum of parabolas. PaPE is the highest-scoring position encoding on 7 out of 8 datasets, and it extrapolates strongly beyond the training resolutions.
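To illustrate the sum-of-parabolas idea, below is a minimal PyTorch sketch. It is our reading of the one-line description above, not the official implementation (see the linked repository for that): each of K learned parabolas is evaluated at the relative offset between two tokens, and their sum is used as an additive attention bias. All names and the exact parameterization (curvature a, vertex c, offset d) are assumptions.

```python
import torch

def parabolic_bias(rel_pos, a, c, d):
    """Sum-of-parabolas positional bias (illustrative sketch, not official PaPE).

    rel_pos: (N, N, 2) relative offsets p_i - p_j between N vision tokens.
    a: (K,) curvatures, c: (K, 2) vertices, d: (K,) offsets of K parabolas.
    Returns an (N, N) additive bias for the attention logits.
    """
    # Squared distance of each offset to each parabola's vertex: (N, N, K)
    sq = ((rel_pos[..., None, :] - c) ** 2).sum(-1)
    # Each parabola contributes a_k * ||dp - c_k||^2 + d_k; sum over parabolas.
    return (a * sq + d).sum(-1)

# Example: 4x4 grid of image patches, K = 3 parabolas.
ys, xs = torch.meshgrid(torch.arange(4), torch.arange(4), indexing="ij")
pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()   # (16, 2)
rel = pos[:, None, :] - pos[None, :, :]                      # (16, 16, 2)
a = torch.full((3,), -0.1)   # negative curvature -> distance decay
c = torch.zeros(3, 2)        # vertices at zero offset; shift for directionality
d = torch.zeros(3)
bias = parabolic_bias(rel, a, c, d)                          # add to attention logits
```

Because the bias depends only on p_i - p_j, it is translation-invariant by construction; negative curvature yields distance decay, off-center vertices give directionality, and making a, c, d functions of the query token would add context awareness.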
Links
Paper: https://arxiv.org/abs/2602.01418
Website: https://chrisohrstrom.github.io/parabolic-position-encoding
Code: https://github.com/DTU-PAS/parabolic-position-encoding
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- RayRoPE: Projective Ray Positional Encoding for Multi-view Attention (2026)
- Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers (2025)
- SoLA-Vision: Fine-grained Layer-wise Linear Softmax Hybrid Attention (2026)
- Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving (2025)
- OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis (2025)
- LinMU: Multimodal Understanding Made Linear (2026)
- SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving (2025)