Quantization Improves Efficiency of Whisper Speech Recognition Models

More Efficient Speech Recognition: Quantization Opens New Possibilities for Whisper Models

Automatic speech recognition (ASR) plays an increasingly important role in applications ranging from subtitling and language translation to live transcription. Models like OpenAI's Whisper have made considerable progress, but they also face challenges. Beyond the problem of hallucinated content, which can undermine the reliability of transcriptions, the larger model variants in particular are difficult to deploy on resource-constrained devices because of their high latency and memory requirements.

A recent study investigates the possibilities of quantization to address these challenges. Quantization refers to the reduction of precision with which the parameters of a neural network, such as Whisper, are represented. This enables smaller model sizes and faster calculations, but can also lead to a loss of accuracy. The study compares three Whisper models: the standard model, a variant for live streaming, and one for offline transcription. It analyzes the respective strengths and weaknesses of the models and examines the effects of quantization on latency and accuracy.
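To make the core idea concrete, here is a minimal sketch of symmetric 8-bit weight quantization: float32 parameters are mapped to int8 integers plus a scale factor, shrinking storage fourfold at the cost of a small rounding error. This is a generic, idealized illustration of the principle, not the study's actual scheme; the whisper.cpp formats evaluated in the paper use block-wise quantization with per-block scales.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: float32 -> (int8, scale).

    Idealized sketch; real whisper.cpp INT8 uses per-block scales.
    """
    scale = np.abs(weights).max() / 127.0          # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate float32 weights for use at inference time.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("size ratio:", q.nbytes / w.nbytes)          # 1 byte vs. 4 bytes per weight
print("max reconstruction error:", np.abs(w - w_hat).max())
```

The rounding error is bounded by half the scale, which is why accuracy typically degrades only slightly at 8 bits and more noticeably at lower bit widths.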

The freely available LibriSpeech dataset was used for the evaluation. The word error rate (WER) served as the measure of transcription accuracy, while latency was measured with whisper.cpp across three quantization methods (INT4, INT5, INT8). The results show that quantization can reduce latency by 19% and model size by 45% without significantly impacting transcription accuracy.
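WER is the standard ASR accuracy metric: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A self-contained implementation via dynamic-programming edit distance (not tied to the study's specific evaluation code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words ~ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one way hallucinated content shows up in evaluation.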

These findings are promising for the deployment of Whisper models on edge devices. By reducing latency and memory requirements, ASR applications can also be run on devices with limited computing power and storage capacity, opening up new possibilities for mobile applications, IoT devices, and other scenarios. The study underscores the potential of quantization to increase the efficiency of ASR models and expand their range of application.

The research results suggest that the choice of the optimal quantization method depends on the specific requirements of the application. While INT8 offers a good balance between accuracy and latency, INT4 and INT5 can be advantageous in scenarios with particularly scarce resources, if a certain loss of accuracy is acceptable. The study provides valuable insights into the possibilities and limitations of quantization for Whisper models and contributes to the development of more efficient and resource-saving ASR solutions.
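The trade-off behind this recommendation can be sketched numerically: fewer bits mean fewer representable levels, so the round-trip error grows as the bit width shrinks. The toy model below uses uniform symmetric quantization and is only an approximation of the paper's setup, since whisper.cpp's INT4/INT5/INT8 formats add per-block scales that reduce the real error.

```python
import numpy as np

def mean_quant_error(weights: np.ndarray, bits: int) -> float:
    """Mean absolute round-trip error of idealized symmetric uniform
    quantization at a given bit width (per-tensor scale, no blocks)."""
    levels = 2 ** (bits - 1) - 1                   # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(weights).max() / levels
    q = np.clip(np.round(weights / scale), -levels, levels)
    return float(np.abs(weights - q * scale).mean())

w = np.random.randn(100_000).astype(np.float32)
for bits in (4, 5, 8):
    print(f"INT{bits}: ~{32 / bits:.0f}x smaller than float32, "
          f"mean error {mean_quant_error(w, bits):.4f}")
```

The pattern matches the study's conclusion: INT8 keeps the error small at a solid compression ratio, while INT4 and INT5 trade additional accuracy for further size reductions.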

The availability of open-source implementations and datasets allows the research community to build on these results and further optimize quantization techniques for ASR models. The combination of powerful models like Whisper with efficient quantization methods promises to make automated speech recognition even more accessible and versatile in the future.

Bibliography:
Radford, A. et al. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. https://cdn.openai.com/papers/whisper.pdf
Andreyev, A. (2025). Quantization for OpenAI's Whisper Models: A Comparative Analysis. arXiv:2503.09905
Graphcore (n.d.). How to use OpenAI's Whisper for speech recognition. https://www.graphcore.ai/posts/how-to-use-openais-whisper-for-speech-recognition
Chiu, C. et al. (2024). FP8 Quantization for Whisper. arXiv:2411.13209v1
Dalmia, S. et al. (2024). Analysis of Whisper Automatic Speech Recognition Performance on Low Resource Language. https://www.researchgate.net/publication/378675573_ANALYSIS_OF_WHISPER_AUTOMATIC_SPEECH_RECOGNITION_PERFORMANCE_ON_LOW_RESOURCE_LANGUAGE
Koutsikaloudis, C. et al. (2024). WhisperJAX: An Efficient Implementation of the Whisper Automatic Speech Recognition Model in JAX. https://www.mdpi.com/2504-2289/9/3/59
Systran (n.d.). faster-whisper. https://github.com/SYSTRAN/faster-whisper
OpenVINO (n.d.). Optimizing Whisper and Distil-Whisper for Speech Recognition with OpenVINO and NNCF. https://blog.openvino.ai/blog-posts/optimizing-whisper-and-distil-whisper-for-speech-recognition-with-openvino-and-nncf
Behnke, S. et al. (2024). Multilingual Speech Recognition with Whisper for Low-Resource Languages. https://aclanthology.org/2024.icon-1.31.pdf
Restack (n.d.). Speech Recognition Answer Open Source Offline Cat AI. https://www.restack.io/p/speech-recognition-answer-open-source-offline-cat-ai