Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models
Ternary Mamba is a method for compressing State Space Models (SSMs) such as Mamba-2 through grouped quantization-aware training. Weights are ternary, which costs about 1.58 bits each, and activations stay at 16 bits, hence W1.58A16. The approach reduces the memory footprint of Mamba-2 1.3B by 3.61x, from 2,687 MB to 744 MB, while achieving 48.1% zero-shot accuracy across 7 tasks. By starting from a pretrained checkpoint and applying knowledge distillation, the method cuts the token budget required for training by 1,000x. The smaller footprint could make SSMs easier to deploy on edge devices.
- Compresses Mamba-2 1.3B from 2,687 MB to 744 MB, a 3.61x reduction.
- Achieves 48.1% zero-shot accuracy across 7 tasks.
- Cuts the token budget by 1,000x by starting from a pretrained checkpoint and using knowledge distillation.
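
To make "grouped" ternary quantization concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the absmean scale (borrowed from BitNet b1.58-style quantizers), the group size, and the names `quantize_group`, `quantize_grouped`, and `dequantize_grouped` are all assumptions chosen for illustration. The summary does not say how Ternary Mamba computes its scales or sizes its groups.

```python
import math

# Each ternary weight carries log2(3) bits of information, about 1.58,
# which is where the "W1.58" in the title comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def quantize_group(weights, eps=1e-8):
    """Ternarize one group of weights; returns (codes in {-1, 0, +1}, scale)."""
    # ASSUMPTION: absmean scaling, not confirmed as the paper's choice.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round to the nearest integer, then clamp into the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_grouped(weights, group_size):
    """Split a flat weight list into groups and ternarize each one separately."""
    return [
        quantize_group(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize_grouped(groups):
    """Rebuild approximate weights: each code times its own group's scale."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy input: four large-magnitude weights followed by four small ones.
weights = [0.9, -0.05, -1.1, 0.4, 0.02, 0.001, -0.03, 0.025]

per_group = quantize_grouped(weights, group_size=4)   # one scale per 4 weights
one_scale = quantize_grouped(weights, group_size=8)   # one scale for everything
restored = dequantize_grouped(per_group)
```

On this toy input, per-group scales let the small-magnitude second group keep non-zero codes, while a single shared scale rounds all four of those weights to zero. That is the usual motivation for grouping: a few large weights no longer dictate the quantization grid for the whole tensor. In quantization-aware training the rounding step would also need a gradient workaround such as a straight-through estimator, which this sketch leaves out.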