EAGLE-3 Improves Large Language Model Inference Speed

Large language models (LLMs) have revolutionized the way we interact with computers. Their ability to generate human-like text, translate, and answer questions opens up unprecedented possibilities across many fields. This impressive performance comes at a cost, however: inference, i.e., the generation of text, is slow and resource-intensive because tokens are produced sequentially, one at a time. This limits the use of LLMs in real-time applications and on devices with limited resources.
To address this challenge, various acceleration techniques have been developed. One promising method is speculative sampling: a small draft model cheaply proposes the next several tokens, which the large target model then verifies in a single parallel forward pass. A well-known example of this technique is EAGLE, which performs autoregression at the feature level, reusing the top-layer features of the target model to achieve better results than conventional speculative sampling.
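
To make the draft-and-verify idea concrete, here is a minimal sketch of a generic speculative-decoding step in PyTorch. It assumes Hugging Face-style `draft_model` and `target_model` objects and a batch size of one, and it uses simple greedy matching for acceptance; production systems use a rejection-sampling rule instead so that the target model's output distribution is preserved exactly. This illustrates plain speculative sampling, not EAGLE's feature-level variant.

```python
import torch

def speculative_step(target_model, draft_model, input_ids, k=4):
    """One draft-and-verify step: propose k tokens with the small model,
    then check them all with a single forward pass of the large model."""
    # 1. Draft: the small model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_token = logits.argmax(dim=-1, keepdim=True)  # greedy for simplicity
        draft_ids = torch.cat([draft_ids, next_token], dim=-1)

    # 2. Verify: one parallel forward pass of the target model scores all drafts.
    target_preds = target_model(draft_ids).logits.argmax(dim=-1)

    # 3. Accept the longest prefix of drafted tokens the target model agrees with.
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        drafted = draft_ids[:, n_prompt + i]
        # target_preds[:, t] is the target's prediction for position t + 1.
        if not torch.equal(drafted, target_preds[:, n_prompt + i - 1]):
            break
        accepted = torch.cat([accepted, drafted.unsqueeze(-1)], dim=-1)
    return accepted
```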
A current trend in the LLM community is to scale up training data to improve model capability without increasing inference costs. However, studies have shown that scaling data yields only limited improvements for EAGLE. This limitation stems from EAGLE's feature-prediction constraint.
This is where EAGLE-3 comes in. This successor to EAGLE abandons feature prediction in favor of direct token prediction. Instead of relying on top-layer features alone, EAGLE-3 fuses features from multiple layers of the target model via a technique called "Training-Time Test".
These improvements yield a significant performance increase and allow the model to benefit fully from scaling the training data. Experiments with chat and reasoning models, evaluated on five different tasks, show that EAGLE-3 achieves speedups of up to 6.5x, roughly a 1.4x improvement over EAGLE-2.
How does EAGLE-3 work?
EAGLE-3 uses "Training-Time Test" to close the gap between training and inference. During training, the draft model performs several consecutive drafting steps on its own previous outputs, just as it would at test time, rather than being trained only on ground-truth context. This removes the feature-prediction constraint and lets prediction accuracy keep improving as training data scales.
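
The following sketch shows one way such a training objective might be set up; `draft_head`, `embed`, and the way self-predictions are folded back into the context are hypothetical placeholders, not EAGLE-3's exact architecture. The key point is that each drafting step is supervised while consuming the head's own previous predictions, mirroring inference.

```python
import torch

def training_time_test_loss(draft_head, embed, features, target_ids, steps=3):
    """Supervise several consecutive drafting steps during training, feeding
    the draft head's own predictions back in, as would happen at test time."""
    loss_fn = torch.nn.CrossEntropyLoss()
    seq = features.size(1)
    context = features                              # (batch, seq, hidden)
    total_loss = 0.0
    for step in range(steps):
        logits = draft_head(context)                # (batch, seq, vocab)
        # At drafting step s, position t is supervised on token t + s + 1.
        n = seq - step - 1
        total_loss = total_loss + loss_fn(
            logits[:, :n].reshape(-1, logits.size(-1)),
            target_ids[:, step + 1 : step + 1 + n].reshape(-1),
        )
        # Feed the embedded self-predictions back in, mimicking inference
        # (the additive combination here is illustrative only).
        pred_ids = logits.argmax(dim=-1).detach()
        context = context + embed(pred_ids)
    return total_loss / steps
```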
By combining direct token prediction with multi-layer feature fusion, EAGLE-3 can leverage the strengths of different layers of the network, since representations at different depths capture different levels of information. This further increases prediction accuracy, enabling more efficient inference and reducing the required computing power.
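
As a rough illustration of multi-layer feature fusion, the sketch below concatenates hidden states taken from a low, a middle, and a high layer of the target model and projects them back to the model width. The choice of layers and the linear projection are assumptions for illustration; the EAGLE-3 paper describes the exact design.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse hidden states from several depths into one draft-model input."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Project the concatenated low/mid/high features back to hidden_size.
        self.proj = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, low, mid, high):
        # Each input: (batch, seq_len, hidden_size) from a different layer.
        return self.proj(torch.cat([low, mid, high], dim=-1))

# Usage with a model that exposes all hidden states (layer choice illustrative):
# outs = target_model(input_ids, output_hidden_states=True)
# hs = outs.hidden_states                 # tuple of per-layer hidden states
# fused = FeatureFusion(hidden_size)(hs[2], hs[len(hs) // 2], hs[-2])
```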
Outlook
EAGLE-3 represents a significant advance in accelerating LLM inference. The results achieved are promising and open up new possibilities for the use of LLMs in real-time applications and on resource-constrained devices. Future research could focus on further optimizing EAGLE-3 and applying the technique to other LLM architectures.