Fourier Positional Encoding Improves Length Generalization in Language Models

Extending the context length of language models (LMs) is a central research area in artificial intelligence. A promising approach is to improve positional encoding, which provides the model with information about the order of words in a text. While existing work mainly addresses the limitations of rotary positional encoding (RoPE) within the attention mechanism, this article analyzes the effects of RoPE on almost all components of LMs and reveals its negative effects on length generalization.
The Limitations of RoPE
Using the theory of discrete signal processing, it can be shown that RoPE implicitly performs a non-uniform discrete Fourier transform, thereby enabling periodic attention. However, this periodicity is impaired by so-called "spectral damage." This damage arises from:
- Linear layers and activation functions outside the attention mechanism.
- Insufficiently trained frequency components, which arise because the finite training context truncates the time domain, so long-wavelength components never complete a full period during training.
This spectral damage disrupts the periodicity generated by RoPE and thereby impairs the model's length generalization: concretely, the model struggles on texts whose length exceeds the sequences seen during training.
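To make the periodicity concrete, below is a minimal NumPy sketch of the standard RoPE rotation (illustrative, not the paper's code): each pair of dimensions is rotated by an angle proportional to the token's position, so the query-key score becomes a sum of cosines of the relative offset, i.e. a periodic function of distance.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the standard RoPE rotation to a vector x (even dimension) at position `pos`."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The score between a query at position m and a key at position n depends only
# on the relative offset m - n, which is what makes attention periodic.
q, k = np.random.randn(64), np.random.randn(64)
score_a = rope_rotate(q, 10) @ rope_rotate(k, 3)
score_b = rope_rotate(q, 110) @ rope_rotate(k, 103)
print(np.isclose(score_a, score_b))  # True: same offset, same score
```

Linear layers and activations applied outside the attention mechanism mix these per-pair cosines, which is one source of the spectral damage described above.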
Fourier Positional Encoding (FoPE): A New Approach
To improve length generalization, this article introduces Fourier Positional Encoding (FoPE). FoPE enhances the frequency-domain properties of attention to improve both its periodic extension and length generalization. Unlike RoPE, which treats each dimension as a single-frequency function, FoPE models each dimension as a Fourier series consisting of a dominant frequency component and multiple harmonic components. This better matches the frequency distributions actually observed in LMs and improves the separation of information carried at different wavelengths within the attention mechanism. Additionally, FoPE clips insufficiently trained frequency components to zero. The zero frequency is chosen because it corresponds to the longest (effectively infinite) wavelength and therefore preserves information flow over long distances.
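A rough Python sketch of this construction is given below. It is only illustrative: the choice of harmonics as integer multiples of each dominant frequency, the random (rather than learned) mixing coefficients, and the `train_len`-based cutoff are placeholder assumptions, not the paper's exact parameterization.

```python
import numpy as np

def fope_features(pos, dim, train_len=512, base=10000.0, num_harmonics=4, seed=0):
    """Cos/sin features where each dimension pair is a small Fourier series
    (dominant frequency plus weighted harmonics) instead of RoPE's single frequency."""
    rng = np.random.default_rng(seed)
    freqs = base ** (-np.arange(0, dim, 2) / dim)       # dominant frequencies
    # Frequencies whose period exceeds the training context never complete a
    # full cycle during training; FoPE sets them to zero, i.e. replaces them
    # with the constant zero-frequency component, which can carry information
    # across arbitrarily long distances.
    floor_freq = 2.0 * np.pi / train_len
    freqs = np.where(freqs < floor_freq, 0.0, freqs)
    # Mix each dominant frequency with a few weighted harmonics.
    coeffs = rng.normal(scale=0.1, size=(num_harmonics, len(freqs)))
    cos = np.cos(pos * freqs)
    sin = np.sin(pos * freqs)
    for h in range(1, num_harmonics + 1):
        cos = cos + coeffs[h - 1] * np.cos(pos * h * freqs)
        sin = sin + coeffs[h - 1] * np.sin(pos * h * freqs)
    return cos, sin

# Positions far beyond the training length still produce well-defined features.
cos, sin = fope_features(pos=4096, dim=64)
```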
Experimental Results
Experiments with various model sizes and datasets show that FoPE achieves more stable perplexity and more consistent accuracy on tasks such as needle-in-a-haystack retrieval compared to RoPE and ALiBi. Perplexity measures how well the model predicts the next token (lower is better), while the needle-in-a-haystack task tests the model's ability to retrieve a specific piece of information from a large context. The results demonstrate that FoPE increases the model's robustness against spectral damage and improves length generalization.
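For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to the held-out tokens; a small illustration with made-up values:

```python
import math

# log p(token | context) for four held-out tokens (illustrative values)
token_log_probs = [-2.1, -0.4, -1.3, -0.9]
perplexity = math.exp(-sum(token_log_probs) / len(token_log_probs))
print(round(perplexity, 2))  # ~3.24
```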
Outlook
Fourier Positional Encoding (FoPE) offers a promising approach to improving the length generalization of language models. By explicitly shaping the frequency-domain properties of attention, FoPE enables more robust and stable performance on texts of varying lengths. Future research could focus on further optimizing FoPE and applying it to a wider range of LM architectures.
Bibliography:
- https://arxiv.org/abs/2412.17739
- https://arxiv.org/html/2412.17739v1
- https://www.chatpaper.com/chatpaper/de/paper/93627
- https://paperreading.club/page?id=274598
- https://www.researchgate.net/publication/368753877_Embedding_Fourier_for_Ultra-High-Definition_Low-Light_Image_Enhancement
- https://cseweb.ucsd.edu/~ravir/pratul_neurips.pdf
- https://openreview.net/pdf/72849308585cb9dc69f46d1e38425935eae1ad96.pdf
- https://www.geeksforgeeks.org/working-of-positional-embedding-in-self-attention/
- https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136870142.pdf
- https://aclanthology.org/2024.findings-acl.834.pdf