Intel and Weizmann Institute Unveil Breakthrough in Speculative Decoding at ICML 2025

Jul 17, 2025

At the International Conference on Machine Learning (ICML) in Vancouver, Canada, researchers from Intel Labs and the Weizmann Institute of Science presented a significant advancement in speculative decoding—a technique designed to enhance the efficiency of large language models (LLMs). Their new method enables any small “draft” model to accelerate inference of any LLM, regardless of vocabulary mismatches.

A MAJOR STEP TOWARD MORE EFFICIENT GENERATIVE AI

“We’ve tackled one of the core inefficiencies in generative AI,” said Oren Pereg, Senior Researcher in Natural Language Processing at Intel Labs. “This work turns speculative decoding into a universal optimization tool. And it's not just theoretical—these tools are already enabling developers to build faster, more intelligent applications.”

WHAT IS SPECULATIVE DECODING?
Speculative decoding pairs a lightweight, fast model with a larger, more accurate one. The small model quickly drafts a candidate sequence of tokens, which the larger model then verifies in a single forward pass, dramatically reducing the compute cost per token without sacrificing output quality.

HOW IT WORKS:
For example, given the prompt “The capital of France is...”, a traditional LLM generates the continuation token by token, first “Paris,” then “a,” then “famous,” and so on, with each token requiring a full forward pass through the model. With speculative decoding, the smaller model might propose the entire phrase “Paris, a famous city...,” which the larger model only needs to validate, not regenerate.
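
To make the mechanics concrete, the sketch below implements a simplified draft-and-verify loop in Python: a small model proposes a few tokens at a time, and a larger model checks them all in a single forward pass, accepting the longest agreeing prefix. It assumes greedy decoding, a shared tokenizer, and illustrative model names (gpt2 and gpt2-large); it is not the exact algorithm presented at ICML.

```python
# A minimal sketch of the draft-and-verify loop behind speculative decoding.
# Simplifications: greedy decoding, a shared tokenizer, and illustrative model
# names (gpt2 as draft, gpt2-large as target); this is not the exact
# Intel/Weizmann algorithm, which also handles mismatched vocabularies.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET, DRAFT, K = "gpt2-large", "gpt2", 5  # K = tokens proposed per round

tok = AutoTokenizer.from_pretrained(TARGET)
target = AutoModelForCausalLM.from_pretrained(TARGET).eval()
draft = AutoModelForCausalLM.from_pretrained(DRAFT).eval()

ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(4):  # a few speculation rounds
        # 1) The draft model cheaply proposes K candidate tokens, one at a time.
        candidate = ids
        for _ in range(K):
            next_id = draft(candidate).logits[:, -1:].argmax(-1)
            candidate = torch.cat([candidate, next_id], dim=-1)

        # 2) The target model scores all K candidates in a single forward pass.
        logits = target(candidate).logits
        verified = logits[:, ids.shape[1] - 1 : -1].argmax(-1)
        proposed = candidate[:, ids.shape[1]:]

        # 3) Accept the longest prefix on which draft and target agree,
        #    then append one extra token from the target itself.
        n_accept = int((verified == proposed)[0].long().cumprod(0).sum())
        bonus = logits[:, ids.shape[1] - 1 + n_accept].argmax(-1, keepdim=True)
        ids = torch.cat([ids, proposed[:, :n_accept], bonus], dim=-1)

print(tok.decode(ids[0]))
```

Because every accepted token matches what the target model itself would have produced, output quality is preserved; the speedup comes from validating several tokens per expensive forward pass instead of generating them one at a time.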

WHY IT’S GROUNDBREAKING

What sets this new method apart is its generalizability. Previous approaches often relied on shared token vocabularies or joint model training. Intel and Weizmann’s solution removes those barriers, enabling speculative decoding between heterogeneous models—even those developed by different organizations.
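
One way the vocabulary barrier can be bridged is illustrated in the hedged sketch below: plain text acts as the shared interface, so a draft model’s candidate tokens can be decoded back to text and re-encoded with the target model’s tokenizer before verification. The tokenizer names are illustrative, and the published algorithms go well beyond this simple round trip.

```python
# Sketch of bridging mismatched vocabularies via a decode/re-encode round trip.
# Assumptions: illustrative tokenizers (Pythia as draft, GPT-2 as target); the
# actual ICML algorithms are more sophisticated than this.
from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")  # draft vocabulary
target_tok = AutoTokenizer.from_pretrained("gpt2")                   # target vocabulary

# Candidate tokens as the draft model would emit them, in its own vocabulary.
draft_candidate_ids = draft_tok("Paris, a famous city", return_tensors="pt").input_ids

# Plain text is the common interface: decode with the draft tokenizer,
# then re-encode with the target tokenizer for verification.
candidate_text = draft_tok.decode(draft_candidate_ids[0], skip_special_tokens=True)
target_candidate_ids = target_tok(candidate_text, return_tensors="pt").input_ids

# target_candidate_ids can now be checked by the large model in one forward pass.
print(candidate_text, target_candidate_ids.tolist())
```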

The technique achieves up to a 2.8x speedup in inference, while maintaining the same level of output quality. It is also vendor-agnostic and designed to work across ecosystems. Thanks to integration with the Hugging Face Transformers library, this approach is already open-source and ready for use.
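
A minimal usage sketch of that integration is shown below, using the assisted-generation path in Hugging Face Transformers. The model names are illustrative, and the exact keyword arguments may vary between library versions.

```python
# Sketch of cross-vocabulary speculative decoding ("universal assisted
# generation") via Hugging Face Transformers. Model names are illustrative,
# and a recent Transformers version is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-7B-Instruct"  # large target model (illustrative)
draft_name = "EleutherAI/pythia-160m"     # small draft model, different vocabulary

target_tok = AutoTokenizer.from_pretrained(target_name)
draft_tok = AutoTokenizer.from_pretrained(draft_name)
target = AutoModelForCausalLM.from_pretrained(target_name)
draft = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = target_tok("The capital of France is", return_tensors="pt")

# Passing both tokenizers lets the library pair models with mismatched vocabularies.
out = target.generate(
    **inputs,
    assistant_model=draft,
    tokenizer=target_tok,
    assistant_tokenizer=draft_tok,
    max_new_tokens=64,
)
print(target_tok.decode(out[0], skip_special_tokens=True))
```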

UNLOCKING SCALABLE AI ACROSS DEVICES

In an increasingly fragmented AI landscape, this breakthrough paves the way for open, interoperable, and cost-effective deployment—from cloud infrastructures to edge devices. Developers, enterprises, and researchers can now combine models based on performance needs and hardware constraints, without having to retrain from scratch.

“This research removes a major barrier to making generative AI faster and more affordable,” said Nadav Timor, a PhD student in the group of Professor David Harel at the Weizmann Institute. “Our algorithm delivers state-of-the-art acceleration that was previously only accessible to organizations training their own draft models.”

