Tag

#inference

Every item tagged inference, newest first.

15 items

GPT 5.5 on Cerebras

GPT-5.5 is now available on Cerebras via OpenRouter. You can access it by navigating to the Cerebras provider on the platform. This integration expands deployment options for builders using GPT-5.5.

Key takeaways

GPT-5.5 available on Cerebras via OpenRouter.
Cerebras added as a deployment option for GPT-5.5.
OpenRouter supports multiple providers for GPT-5.5.

rr/OpenAI#model-availability #deployment-options #inference

otherApr 29

DeepInfra on Hugging Face Inference Providers 🔥

DeepInfra has joined Hugging Face as an inference provider, expanding the platform's offerings for model deployment and serving. This integration allows users to deploy and manage models with DeepInfra's optimized infrastructure. Builders can now access DeepInfra's scalable and secure infrastructure for their model serving needs. The partnership aims to provide users with more choices and flexibility in model deployment.

Key takeaways

DeepInfra joins Hugging Face as an inference provider.
Integration expands model deployment and serving options.
Partnership offers users more choices in model infrastructure.

HHugging Face Blog#hugging-face #inference #model-serving

otherJun 16

Groq on Hugging Face Inference Providers 🔥

Groq has joined Hugging Face as an inference provider, offering optimized performance for large language models. This partnership enables seamless integration of Groq's hardware acceleration with Hugging Face's model hub. You can now deploy and run models on Groq's infrastructure through the Hugging Face platform. Builders can leverage this collaboration to optimize model performance and reduce latency.

Key takeaways

Groq joins Hugging Face as an inference provider.
Enables deployment of models on Groq's infrastructure via Hugging Face.
Optimized performance for large language models.

HHugging Face Blog#inference #hugging-face #hardware-acceleration

otherJun 12

Featherless AI on Hugging Face Inference Providers 🔥

Featherless AI has joined Hugging Face as an inference provider, expanding access to its optimized models via Hugging Face's API. This integration allows developers to deploy Featherless models directly through Hugging Face's platform. Builders can now access Featherless' optimized models without maintaining their own infrastructure. The partnership aims to simplify model deployment and reduce operational overhead.

Key takeaways

Featherless AI is now an inference provider on Hugging Face.
Developers can deploy Featherless models via Hugging Face's API.
Partnership reduces infrastructure management for builders.

HHugging Face Blog#inference #model-deployment #hugging-face

toolsMay 13

Blazingly fast whisper transcriptions with Inference Endpoints

Hugging Face Inference Endpoints now support Whisper for fast, scalable transcription. This offering enables developers to deploy Whisper models easily and make transcription API calls. You can use Whisper for tasks like podcast summarization and video captioning. The endpoints provide a managed service for Whisper models.

Key takeaways

Whisper models available on Hugging Face Inference Endpoints.
Enables easy deployment and transcription API calls.
Supports applications like podcast summarization and video captioning.

HHugging Face Blog#inference #transcription #managed-services

toolsMar 21

The New and Fresh analytics in Inference Endpoints

Hugging Face has introduced new analytics features for Inference Endpoints, enabling users to monitor and optimize their model deployments. The updates provide detailed metrics on request latency, throughput, and error rates. Builders can now better understand performance bottlenecks and make data-driven decisions. This enhancement aims to improve the efficiency and reliability of model serving.

Key takeaways

New analytics features for monitoring request latency and throughput.
Detailed metrics help identify performance bottlenecks.
Improves efficiency and reliability of model serving.

HHugging Face Blog#inference #model-serving #analytics

toolsFeb 24

Remote VAEs for decoding with Inference Endpoints 🤗

Hugging Face has introduced remote VAEs for decoding with Inference Endpoints, enabling more flexible and scalable model deployment. This feature allows users to deploy and manage models remotely, streamlining the deployment process. Builders can now focus on developing applications rather than managing infrastructure. The remote VAEs are designed to work seamlessly with Hugging Face's existing tools and services.

Key takeaways

Remote VAEs enable flexible and scalable model deployment.
Streamlines deployment process for builders.
Works with existing Hugging Face tools and services.

HHugging Face Blog#inference #remote-computing #model-deployment

otherJan 28

Welcome to Inference Providers on the Hub 🔥

Hugging Face has launched Inference Providers on the Hub, a new feature that allows users to deploy and manage models from multiple providers in one place. This centralized hub enables builders to easily discover, deploy, and manage inference endpoints for various AI models. You can now integrate models from different providers and manage them through a single interface. The feature aims to simplify the deployment and management of AI models.

Key takeaways

Centralized hub for deploying models from multiple providers.
Simplifies discovery, deployment, and management of inference endpoints.
Enables integration of models from different providers in one place.

HHugging Face Blog#inference #model-deployment #hugging-face

toolsJan 16

Introducing multi-backends (TRT-LLM, vLLM) support for Text Generation Inference

Hugging Face has introduced support for multiple backends in Text Generation Inference, including TRT-LLM and vLLM. This allows users to deploy models on different hardware and software configurations. The update aims to increase flexibility and performance for builders working with large language models. You can now choose the best backend for your specific use case.

Key takeaways

Supports TRT-LLM and vLLM backends
Increases deployment flexibility across hardware
Improves performance for large language models

HHugging Face Blog#text-generation #multi-backend #inference

toolsApr 3

Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon

Hugging Face's Optimum Intel accelerates SetFit inference on Xeon processors, reducing latency and increasing throughput. This optimization enables faster and more efficient deployment of SetFit models. Builders using SetFit can benefit from improved performance and reduced costs. The acceleration is achieved through optimized kernel selection and parallelization.

Key takeaways

Up to 4x faster SetFit inference on Xeon with Optimum Intel
Optimized kernel selection and parallelization for improved performance
Reduced latency and increased throughput for SetFit models

HHugging Face Blog#optimization #inference #acceleration

otherSep 22

Inference for PROs

Hugging Face launched Inference for PROs, a paid API service for high-priority access to optimized model inference. The service targets professional developers and enterprises requiring low-latency, high-throughput model serving. Pricing starts at $0.0015 per minute of inference time. You can use it to deploy and serve models at scale.

Key takeaways

Inference for PROs offers prioritized API access for $0.0015/minute.
Targets professional developers and enterprises with demanding inference needs.
Optimized for low-latency, high-throughput model serving.

HHugging Face Blog#api #inference #enterprise

toolsFeb 6

Accelerating PyTorch Transformers with Intel Sapphire Rapids - part 2

Hugging Face and Intel collaborated on optimizing PyTorch Transformers for Intel Sapphire Rapids, resulting in improved inference performance. The partnership aims to enhance the efficiency of AI workloads on Intel hardware. This optimization benefits builders using PyTorch for inference tasks, allowing for faster and more efficient processing. The collaboration focuses on leveraging Intel Sapphire Rapids' capabilities to accelerate PyTorch Transformers.

Key takeaways

Improved inference performance for PyTorch Transformers on Intel Sapphire Rapids
Optimization enhances efficiency of AI workloads on Intel hardware
Faster processing for builders using PyTorch for inference tasks

HHugging Face Blog#pytorch #hardware-optimization #inference

modelsOct 12

Optimization story: Bloom inference

Hugging Face optimized the inference of their Bloom model, achieving significant performance gains. The optimization efforts focused on improving the model's efficiency and reducing latency. As a result, builders can now deploy Bloom with faster inference times and lower computational costs. This optimization is particularly important for applications where low latency is crucial.

Key takeaways

Bloom inference optimized for improved performance
Faster inference times and lower computational costs
Optimization focused on efficiency and latency reduction

HHugging Face Blog#model-optimization #inference #performance

toolsSep 16

Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate

Hugging Face released optimized PyTorch scripts for BLOOM inference using DeepSpeed and Accelerate, achieving significant speedups. The scripts enable faster inference times for BLOOM models, making them more suitable for production environments. This optimization is particularly useful for builders who need to deploy large language models. The scripts are available on the Hugging Face blog.

Key takeaways

Up to 30% faster BLOOM inference with DeepSpeed and Accelerate.
Optimized PyTorch scripts available for production use.
Compatible with existing BLOOM models and pipelines.

HHugging Face Blog#optimization #inference #pytorch

toolsJan 11

Deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker

Hugging Face provides a guide to deploy GPT-J 6B for inference using Hugging Face Transformers and Amazon SageMaker. This allows builders to run the model on SageMaker with minimal setup. The guide includes a step-by-step tutorial on how to deploy the model. You can use this to integrate GPT-J 6B into your applications.

Key takeaways

Deploy GPT-J 6B on Amazon SageMaker using Hugging Face Transformers.
Step-by-step guide available for minimal setup.
Supports integration into various applications.

HHugging Face Blog#transformers #sagemaker #inference