Tag

#quantization

Every item tagged quantization, newest first.

10 items

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Researchers developed Ternary Mamba, a method for compressing State Space Models like Mamba-2 through grouped quantization-aware training. This approach reduces the memory footprint of Mamba-2 1.3B by 3.61x, from 2,687 to 744 MB, while achieving 48.1% zero-shot accuracy across 7 tasks. By leveraging a pretrained checkpoint and knowledge distillation, the method cuts the token budget by 1,000x. This development can help builders deploy SSMs more efficiently on edge devices.

Key takeaways

Ternary Mamba compresses Mamba-2 1.3B to 744 MB, a 3.61x reduction.
Achieves 48.1% zero-shot accuracy across 7 tasks.
Reduces token budget by 1,000x using pretrained checkpoints and knowledge distillation.
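
To make "grouped" ternary quantization concrete, here is a minimal sketch in plain Python. It assumes absmean scaling (each group shares one scale equal to its mean absolute weight) and uses hypothetical helper names and toy weights. The paper's actual grouping and scaling rules may differ, so treat this as an illustration of the general idea, not the authors' method.

```python
def ternary_quantize_group(weights):
    """Quantize one group of floats to {-1, 0, +1} with a shared scale."""
    # Assumed scheme: absmean scale, the mean absolute value of the group.
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round w / scale to the nearest integer, then clip to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_grouped(weights, group_size):
    """Split a flat weight list into groups and quantize each one separately."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(ternary_quantize_group(weights[start:start + group_size]))
    return groups


def dequantize_grouped(groups):
    """Rebuild approximate float weights from ternary codes and group scales."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy weights (hypothetical): one group of large values, one of small values.
weights = [0.9, -0.1, 0.05, -1.2, 0.02, 0.03, -0.01, 0.04]
groups = quantize_grouped(weights, group_size=4)
approx = dequantize_grouped(groups)

# Compression ratio implied by the sizes in the summary: 2,687 MB -> 744 MB.
ratio = 2687 / 744
```

Because each group carries its own scale, the second group of small weights still keeps nonzero codes. With a single scale shared across all eight toy weights, that group would round entirely to zero.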

arXiv #state-space-models #quantization #knowledge-distillation

models May 21

Exploring Quantization Backends in Diffusers

The Diffusers library supports several quantization backends, including bitsandbytes, torchao, Quanto, and GGUF, alongside FP8 layerwise casting. The post compares these backends and their trade-offs in memory use and inference time, so you can pick one that fits your deployment. Quantization mainly reduces the memory needed to load and run large diffusion models; any speedup depends on the backend and hardware.

Key takeaways

Diffusers supports multiple quantization backends.
Quantization reduces model memory use; speed gains depend on the backend and hardware.
Flexible deployment options for models.

Hugging Face Blog #quantization #diffusers #model-optimization

tools Apr 29

Introducing AutoRound: Intel’s Advanced Quantization for LLMs and VLMs

Intel introduced AutoRound, an open-source quantization tool for large language models (LLMs) and vision language models (VLMs). AutoRound aims to reduce model size and improve inference speed while preserving as much accuracy as possible. You can integrate it into your deployment pipeline to run models more efficiently on resource-constrained devices.

Key takeaways

AutoRound is open-source and available on Hugging Face.
Reduces model size and improves inference speed.
Aims to preserve model accuracy.

Hugging Face Blog #quantization #open-source #model-optimization

research Sep 18

Fine-tuning LLMs to 1.58bit: extreme quantization made easy

This Hugging Face post describes fine-tuning existing large language models to 1.58-bit precision, where each weight takes one of three values (-1, 0, or 1) and so carries log2(3) ≈ 1.58 bits. Starting from a pretrained model avoids training a ternary model from scratch and makes it easier to deploy LLMs on resource-constrained devices. The approach reports competitive performance despite the aggressive quantization. You can explore the code and models on the Hugging Face platform.

Key takeaways

Fine-tunes existing LLMs to 1.58-bit (ternary) weights.
Enables deployment on resource-constrained devices.
Competitive performance with aggressive quantization.
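
The "1.58bit" figure is the information content of a three-valued weight, log2(3). The sketch below shows where the number comes from and one way to store ternary weights compactly: five of them per byte, since 3^5 = 243 fits in 256. The packing scheme and function names are assumptions for illustration; real implementations may instead store each weight in 2 bits for simpler kernels.

```python
import math

# Information content of one weight drawn from {-1, 0, 1}: about 1.585 bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def pack_trits(trits):
    """Pack up to five values from {-1, 0, 1} into one byte (base-3 digits)."""
    assert len(trits) <= 5
    value = 0
    for t in reversed(trits):
        value = value * 3 + (t + 1)  # map -1, 0, 1 to digits 0, 1, 2
    return value


def unpack_trits(value, count):
    """Recover `count` ternary values from a packed byte."""
    trits = []
    for _ in range(count):
        trits.append(value % 3 - 1)  # map digits 0, 1, 2 back to -1, 0, 1
        value //= 3
    return trits


codes = [1, 0, -1, -1, 1]
byte = pack_trits(codes)
restored = unpack_trits(byte, len(codes))
```

Five weights per byte works out to 1.6 bits per weight, close to the 1.58-bit lower bound.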

Hugging Face Blog #fine-tuning #quantization #resource-constrained #open-source

models May 16

Unlocking Longer Generation with Key-Value Cache Quantization

Hugging Face introduced key-value (KV) cache quantization in Transformers to make long-sequence generation more memory-efficient. Storing the cached keys and values at lower precision shrinks the cache, which grows with sequence length, so models can generate longer sequences without running out of memory. The saving can come at some cost in generation speed, since cached values are quantized and dequantized during decoding. This is particularly useful for tasks that require generating long pieces of text.

Key takeaways

Key-value cache quantization reduces memory usage for longer sequence generation.
Trades some generation speed for a smaller cache.
Enables generation of longer sequences without running out of memory.
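
A back-of-the-envelope estimate shows why the cache dominates memory at long sequence lengths. The function below is a rough sketch: the model shape is hypothetical, and the estimate ignores the scales and zero-points that a real quantized cache also stores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, batch_size=1):
    """Approximate KV cache size, ignoring quantization metadata."""
    # Two tensors (keys and values) per layer, one vector per token per head.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=10_000)

fp16_gib = kv_cache_bytes(bits_per_value=16, **shape) / 2**30
int4_gib = kv_cache_bytes(bits_per_value=4, **shape) / 2**30
```

Under these assumptions, a 10,000-token cache takes roughly 4.9 GiB at 16 bits and about 1.2 GiB at 4 bits, and both figures scale linearly with sequence length.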

Hugging Face Blog #transformer-optimization #quantization #sequence-generation

tools Mar 18

Quanto: a PyTorch quantization backend for Optimum

Hugging Face introduced Quanto, a PyTorch quantization backend for Optimum, to improve model efficiency. Quanto enables users to optimize models for better performance and reduced memory usage. This development targets builders working with PyTorch and Optimum, providing a new tool for model optimization. Quanto's integration with Optimum simplifies the quantization process for PyTorch models.

Key takeaways

Quanto is a PyTorch quantization backend for Optimum.
Improves model efficiency through optimized performance and memory usage.
Simplifies quantization for PyTorch models with Optimum integration.

Hugging Face Blog #pytorch #quantization #model-optimization

tools Sep 12

Overview of natively supported quantization schemes in 🤗 Transformers

Hugging Face Transformers natively supports several quantization schemes that reduce model size and can improve inference speed. Quantization stores model weights at lower precision, for example 8-bit or 4-bit values instead of 16-bit or 32-bit floating point. The post compares the two natively supported schemes at the time of writing, bitsandbytes and GPTQ (through auto-gptq), and when each fits. You can use these schemes to optimize your models for deployment on edge devices or in resource-constrained environments.

Key takeaways

Transformers natively supports bitsandbytes and GPTQ quantization at 8-bit and 4-bit precision.
Quantization reduces model size and improves inference speed on certain hardware.
Native support simplifies optimization for edge devices and resource-constrained environments.
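
For readers new to the idea, the core operation behind integer quantization can be sketched in a few lines. This is a simplified symmetric (absmax) int8 round trip on toy weights, not the algorithm of any particular backend. Production schemes add per-block scales, outlier handling, or calibration.

```python
def quantize_int8(values):
    """Symmetric (absmax) quantization of floats to integers in [-127, 127]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.003, 1.0]  # toy values for illustration
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)

# Rounding to the nearest code keeps each error within half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs one byte plus a shared scale. The price is a rounding error of at most half the scale per weight, which here wipes out the smallest value entirely.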

Hugging Face Blog #model-optimization #quantization #edge-ai

tools Aug 23

Making LLMs lighter with AutoGPTQ and transformers

Hugging Face integrated AutoGPTQ, a library implementing the GPTQ post-training quantization algorithm, into Transformers to reduce the size and latency of large language models. GPTQ lowers the precision of model weights, which significantly shrinks the model. This makes LLMs easier to deploy on edge devices and in resource-constrained environments. The integration lets builders load and run GPTQ-quantized models directly through Transformers.

Key takeaways

AutoGPTQ reduces LLM size and latency through quantization.
Integration available in the Transformers library.
Significant reductions in model size achieved through weight precision reduction.

Hugging Face Blog #model-optimization #edge-ai #quantization

models Jul 27

Stable Diffusion XL on Mac with Advanced Core ML Quantization

Stable Diffusion XL is now available on Mac with Core ML quantization, improving performance by 30% on M1 and M2 Macs. Hugging Face provides pre-trained models and tools for Core ML development.

Key takeaways

Stable Diffusion XL model now available on Mac with Core ML quantization for improved performance and reduced latency.
Quantized model runs 30% faster on M1 and M2 Macs.
Hugging Face provides pre-trained models and tools for Core ML development.

Hugging Face Blog #core-ml #quantization #mac

tools May 24

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

Hugging Face integrated 4-bit quantization from the bitsandbytes library into its ecosystem, together with support for QLoRA, a method that fine-tunes LoRA adapters on top of a frozen 4-bit quantized base model. This makes LLMs more accessible by reducing memory usage: with 4-bit weights, models fit on smaller GPUs, allowing for more widespread adoption. QLoRA also makes fine-tuning LLMs far more memory-efficient.

Key takeaways

4-bit quantization reduces LLM memory usage.
QLoRA enables efficient fine-tuning of quantized LLMs.
bitsandbytes library provides a simple integration path for developers.
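
The memory savings of fine-tuning adapters over a quantized base can be estimated with simple arithmetic. The sketch below counts parameters for a single linear layer. The layer shape and rank are hypothetical, and the estimate ignores quantization constants and optimizer state.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank LoRA factors for one linear layer."""
    # LoRA adds B (d_out x rank) and A (rank x d_in); the base weight stays frozen.
    return rank * (d_out + d_in)


d_out = d_in = 4096  # hypothetical layer shape
rank = 16            # hypothetical LoRA rank

full_params = d_out * d_in
adapter_params = lora_trainable_params(d_out, d_in, rank)
trainable_fraction = adapter_params / full_params

# Frozen base weights at 16 bits versus 4 bits, ignoring quantization constants.
base_mib_fp16 = full_params * 16 / 8 / 2**20
base_mib_4bit = full_params * 4 / 8 / 2**20
```

For this layer the adapters hold under 1% of the parameters, while the frozen base shrinks from 32 MiB to 8 MiB. The two effects together are what make fine-tuning large models feasible on a single GPU.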

Hugging Face Blog #quantization #llm-optimization #efficient-inference