AI Model Processes Three Million Tokens With New Context Window Expansion Method

AI Innovation: Context Windows of Language Models Extended to Millions of Tokens
Processing long text sequences with large language models (LLMs) is challenging: memory demands grow with context length and inference slows down. A new method called InfiniteHiP promises a solution, enabling context windows of up to three million tokens on a single GPU.
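To see why such context lengths are hard on a single GPU, consider the key-value (KV) cache a transformer accumulates during inference. A back-of-the-envelope estimate in Python, using assumed Llama-7B-like dimensions (illustrative numbers, not figures from the paper):

```python
# Rough KV-cache size estimate for a long context window.
# Model dimensions are illustrative (Llama-7B-like), not from the paper.
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(context_len: int) -> int:
    # 2x for keys and values, summed over all layers and heads.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_len

for tokens in (128_000, 1_000_000, 3_000_000):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens:>9,} tokens -> ~{gib:,.0f} GiB of KV cache")
```

Even at these modest model dimensions, the cache for three million tokens runs to well over a terabyte, far beyond any single GPU's memory, which is why both pruning and offloading matter.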
Conventional LLMs reach their limits when processing texts longer than the sequences they were trained on. InfiniteHiP addresses this with a modular, hierarchical token pruning algorithm that dynamically discards irrelevant context tokens, reducing the computational load. In addition, InfiniteHiP generalizes to longer sequences by selectively applying different RoPE (Rotary Position Embedding) adjustment methods according to the internal attention patterns of the LLM.
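The paper's pruning algorithm is considerably more involved, but the core idea of hierarchical token pruning can be sketched: score coarse blocks of cached tokens against the current query, keep only the top-scoring blocks, and attend within those. The sketch below is a minimal illustration of that idea, not the authors' implementation; the function name and the block-scoring heuristic are assumptions made for the example.

```python
import torch

def hierarchical_prune(query, keys, block_size=64, keep_blocks=4):
    """Illustrative two-level token pruning (not the paper's algorithm).

    query: (head_dim,) current query vector
    keys:  (seq_len, head_dim) cached key vectors
    Returns indices of the tokens that survive pruning.
    """
    seq_len, _ = keys.shape
    num_blocks = seq_len // block_size
    blocks = keys[: num_blocks * block_size].view(num_blocks, block_size, -1)

    # Level 1: score each block by its best-matching key (a cheap proxy
    # for how much attention the block could receive).
    block_scores = torch.einsum("d,nbd->nb", query, blocks).amax(dim=1)
    top_blocks = block_scores.topk(min(keep_blocks, num_blocks)).indices

    # Level 2: keep every token inside the surviving blocks (a real
    # implementation would recurse with a finer block size).
    token_idx = (top_blocks[:, None] * block_size
                 + torch.arange(block_size)).flatten()
    return token_idx.sort().values

# Toy usage: 1,024 cached tokens, keep 4 blocks of 64 -> 256 tokens attended.
q = torch.randn(128)
k = torch.randn(1024, 128)
print(hierarchical_prune(q, k).shape)  # torch.Size([256])
```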
Another important aspect of InfiniteHiP is that it offloads the key-value cache to host memory during inference. This significantly relieves GPU memory and allows far longer text sequences to be processed. In tests on a single NVIDIA L40S with 48 GB, InfiniteHiP handled context windows of up to three million tokens, three times more than previously possible on that hardware, without permanent loss of context information.
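A minimal sketch of the offloading idea, assuming PyTorch and a CUDA device: the full cache stays in pinned host RAM, and only the blocks selected by the pruning stage are copied to the GPU for the current decode step. Names and sizes are illustrative, not the paper's system.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
seq_len, head_dim, block_size = 262_144, 128, 64

# Full KV cache lives in host RAM; pinning speeds up host-to-GPU copies.
keys_host = torch.randn(seq_len, head_dim)
values_host = torch.randn(seq_len, head_dim)
if device == "cuda":
    keys_host, values_host = keys_host.pin_memory(), values_host.pin_memory()

def fetch_blocks(host_tensor, block_ids):
    """Copy only the selected cache blocks to the GPU for this decode step."""
    blocks = host_tensor.view(-1, block_size, head_dim)[block_ids]
    return blocks.reshape(-1, head_dim).to(device, non_blocking=True)

# Suppose the pruning stage selected these blocks for the current query:
selected = torch.tensor([0, 17, 512, 4095])
k_gpu = fetch_blocks(keys_host, selected)
v_gpu = fetch_blocks(values_host, selected)
print(k_gpu.shape, k_gpu.device)  # torch.Size([256, 128]), on the GPU if present
```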
The developers of InfiniteHiP report a substantial performance gain: at a context window of one million tokens, attention decoding ran 18.95 times faster than with conventional methods, without any additional training. InfiniteHiP is implemented within the SGLang framework.
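The order of magnitude is plausible from first principles: a decode step attends a single query against the retained cache, so its cost scales roughly linearly with the number of kept tokens. The micro-benchmark below (with made-up sizes, not the paper's benchmark) illustrates that scaling:

```python
import time
import torch

def decode_attention(q, k, v):
    # One decode step: a single query attends over the cached keys/values.
    scores = (k @ q) / q.shape[0] ** 0.5
    return torch.softmax(scores, dim=0) @ v

head_dim, full_len = 128, 200_000
keep_len = full_len // 20  # pretend pruning keeps ~5% of the context

q = torch.randn(head_dim)
k, v = torch.randn(full_len, head_dim), torch.randn(full_len, head_dim)

for name, n in (("full", full_len), ("pruned", keep_len)):
    t0 = time.perf_counter()
    for _ in range(20):
        decode_attention(q, k[:n], v[:n])
    print(f"{name:>6}: {(time.perf_counter() - t0) / 20 * 1e3:.2f} ms/step")
```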
The ability to process longer context windows opens up new possibilities for the use of LLMs. Applications such as chatbots, text summarization, and translation can benefit from this development and deliver more accurate, context-aware results. InfiniteHiP also makes it easier to analyze large datasets and to generate longer, coherent texts.
Research on context window expansion is dynamic and promising. InfiniteHiP marks an important step toward overcoming the limitations of conventional LLMs and further increasing the performance of AI systems. Future development will show which further applications emerge from processing extremely long context windows.
Sources:
- https://huggingface.co/papers/2502.08910
- https://arxiv.org/abs/2502.08910
- https://www.linkedin.com/posts/ryane-burg_supercomputer-components-arrive-at-university-activity-7288292525180678145-e9VT
- https://www.reddit.com/r/LocalLLaMA/comments/1c1ys5j/extending_the_context_window_of_your_llms_to_1m/
- https://news.ycombinator.com/item?id=42173960
- https://arxiv.org/pdf/2411.01783
- https://www.linkedin.com/posts/saras-micro-devices_sarasmicrotech-ectc-ieee-activity-7199757924733263873-s3Kh
- https://venturebeat.com/ai/how-gradient-created-an-open-llm-with-a-million-token-context-window/
- https://news.ycombinator.com/item?id=39384513
- https://www.facebook.com/groups/DeepNetGroup/posts/2143233696069501/