What is Speculative Decoding? (trending on paperswithco.de) [R]
Speculative decoding is an inference optimization technique that uses a fast draft model to quickly propose future tokens, then verifies them in parallel with a larger target model. This speeds up token generation by leveraging the efficiency of the small model and the accuracy of the larger model. Builders can apply speculative decoding to improve the performance of their models, especially in scenarios where token generation is a bottleneck.
- Speculative decoding uses a fast draft model to propose future tokens, then verifies with a larger target model.
- Speeds up token generation by verifying in parallel.
- Fast draft model is small and efficient.