Positional Encoding
Without position information, attention treats "the cat sat" and "sat the cat" identically. Position encodings tell the model where each token is.
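As a concrete reference point, here is a minimal numpy sketch of the classic sinusoidal scheme; the function name and sizes are illustrative, not the demo's actual code:

```python
import numpy as np

def sinusoidal_encoding(max_positions: int, d_model: int) -> np.ndarray:
    """Build a (max_positions, d_model) matrix of sinusoidal position vectors."""
    positions = np.arange(max_positions)[:, None]      # (P, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model // 2)
    # Each dimension pair gets its own frequency: 1 / 10000^(2i / d_model).
    angles = positions / (10000 ** (dims / d_model))   # (P, d_model // 2)
    pe = np.zeros((max_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

pe = sinusoidal_encoding(64, 128)
print(pe.shape)  # (64, 128): one d_model-dim vector per position
```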
[Interactive demo: choose the encoding type (Sinusoidal or Rotary) and the number of positions (default 64), then drag Position A (pos = 5) and Position B (pos = 6). A Dot Product Similarity readout updates live; nearby positions give higher similarity. Panels: Positional Encoding Heatmap, and a Position Vectors Comparison of Position A vs Position B.]
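The similarity readout follows from an identity: for sinusoidal vectors, PE(m) . PE(n) = sum_i cos((m - n) w_i), which depends only on the offset m - n and so decays as the positions move apart. Reusing the hypothetical sinusoidal_encoding sketched above:

```python
pe = sinusoidal_encoding(128, 128)
# PE(m) @ PE(n) = sum_i cos((m - n) * w_i): the result depends only on the offset.
print(pe[5] @ pe[6])    # close pair (5, 6): high similarity
print(pe[5] @ pe[100])  # distant pair (5, 100): noticeably lower
```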
Presets
Click a preset to load an interesting configuration: Sinusoidal, Rotary, Close (5, 6), Distant (5, 100).
Explore
Compare the heatmap patterns for sinusoidal vs rotary: how do the stripe patterns differ?
Drag positions A and B to see how the dot product decays with distance.
Notice the different frequency bands: early dimensions oscillate rapidly, while later dimensions change slowly across positions (see the check after this list).
Increase max positions: the high-frequency stripes become denser.
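Those bands exist because the per-pair frequencies form a geometric progression, from about 1 down to 1/10000. A quick check using the same constants as the sketch above:

```python
import numpy as np

d_model = 128
freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
print(freqs[0])   # ~1.0: wavelength ~6 positions (fast stripes)
print(freqs[-1])  # ~1.2e-4: wavelength ~55,000 positions (slow bands)
```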
Questions
Why use different frequencies for different dimensions? What would happen if all dimensions used the same frequency?
Why can RoPE generalize to longer sequences than it was trained on, while absolute sinusoidal encodings struggle? (A sketch after these questions hints at the answer.)
The dot product of two position vectors encodes relative distance. Why is this useful for attention?
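A hint for the RoPE question: rotary encodings rotate each two-dimensional slice of a query or key by an angle proportional to its position, so the attention score q_m . k_n depends only on the offset m - n, never on absolute position. A minimal sketch, illustrative rather than the demo's implementation (real models apply this per attention head):

```python
import numpy as np

def rope(x: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive pairs of x by angles proportional to pos (RoPE sketch)."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    theta = pos * freqs                # one rotation angle per 2-D pair
    x1, x2 = x[0::2], x[1::2]          # split into pair components
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=128), rng.normal(size=128)
# Shifting both positions by the same amount leaves the score unchanged:
print(rope(q, 5) @ rope(k, 6))    # offset 1 at positions (5, 6)
print(rope(q, 15) @ rope(k, 16))  # same offset 1 at (15, 16): same value
```

Because only relative offsets enter the score, RoPE has no fixed "position vocabulary" to run off the end of, which is one reason it extrapolates to longer sequences more gracefully than absolute sinusoidal encodings.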
What Next
Now see how attention uses these position-aware vectors → Attention
See how tokens get their initial representations → Embeddings