Tag

#llms

Every item tagged llms, newest first.

13 items

Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation

Researchers propose Diffusion-Proof, a new approach to formal theorem proving with large language models that addresses limitations of auto-regressive generation. The method aims to improve long-range coherence and reduce error compounding. This development could benefit builders working on LLM applications requiring rigorous mathematical reasoning. The approach is detailed in a recent arXiv paper.

Key takeaways

Diffusion-Proof approach proposed for formal theorem proving.
Targets limitations in auto-regressive generation methods.
Aims to improve long-range coherence and reduce error compounding.

arXiv #formal-verification #theorem-proving #llms

other 14h

Quoting Charity Majors

The economics of code production were turned upside down in 2025, with code generation becoming effectively free and instant. This shift has made lines of code disposable and regenerable, rather than treasured and carefully curated.

Key takeaways

Code generation became effectively free and instant in 2025.
Lines of code are now disposable and regenerable.
Economics of code production turned upside down.

Simon Willison #ai-assisted-programming #generative-ai #ai #llms

research 1d

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Researchers introduce RubricsTree, a framework for evaluating personal health agents powered by large language models. The framework addresses the challenge of scaling evaluation while maintaining clinical accuracy and consistency. RubricsTree aims to support the large-scale clinical deployment of these agents by providing a more efficient and reliable evaluation method. You can use this framework to assess the performance of personal health agents.

Key takeaways

RubricsTree framework supports scalable evaluation of personal health agents.
Addresses the bottleneck of physician annotation being costly and LLM evaluators being subjective.
Aims to enable large-scale clinical deployment of LLM-empowered health agents.

arXiv #healthcare #evaluation #llms

research 1d

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

Researchers evaluated open-source LLMs for multi-label MITRE ATT&CK technique classification on cyber threat intelligence (CTI) reports, where each report can map to several techniques at once. They found that LLMs can automate this complex task with high accuracy, reducing reliance on human effort. The study compared several open-source LLMs and identified top performers for this specific application. You can apply these findings to improve CTI report processing.

Key takeaways

LLMs achieve high accuracy in multi-label ATT&CK technique classification.
Open-source LLMs are viable for automating CTI report analysis.
Top-performing LLMs identified for this specific task.
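
In a multi-label setting, a single "accuracy" figure can be ambiguous, so results are often summarized as micro-averaged precision, recall, and F1 over the predicted label sets. The sketch below shows that calculation. The summary does not say which metric the study reports, and the function name and the technique IDs in the example are illustrative assumptions, not data from the paper.

```python
from typing import Iterable, Set, Tuple


def micro_prf(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over per-report label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # techniques predicted and actually present
        fp += len(p - g)   # techniques predicted but not present
        fn += len(g - p)   # techniques present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: two reports, each labeled with ATT&CK technique IDs.
gold_labels = [{"T1566", "T1059"}, {"T1003"}]
pred_labels = [{"T1566"}, {"T1003", "T1027"}]
precision, recall, f1 = micro_prf(gold_labels, pred_labels)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging pools true and false positives across all reports, so frequently occurring techniques weigh more heavily than rare ones. A macro average over techniques would be the alternative when rare techniques matter as much as common ones.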

arXiv #open-source #llms #cyber-threat-intelligence

tools 1d

infiniflow/ragflow

InfiniFlow released RAGFlow, an open-source retrieval-augmented generation (RAG) engine that combines RAG and agent capabilities for LLMs. It provides a context layer to improve LLM performance. Builders can use RAGFlow to enhance their LLM applications with advanced retrieval and generation features. RAGFlow is available on GitHub for free.

Key takeaways

RAGFlow is open-source on GitHub.
Combines RAG and agent capabilities.
Improves LLM performance with a context layer.

models Jun 9

llm 0.32a3

The llm 0.32a3 release was written almost entirely by the new Claude Fable 5 model. This marks a significant milestone in leveraging AI for content generation. The release demonstrates progress in AI-assisted writing, showcasing capabilities of models like Claude Fable 5. You can explore the project's details on Simon Willison's website.

Key takeaways

Claude Fable 5 generated most of the llm 0.32a3 release.
New release showcases AI-assisted writing capabilities.
Project details available on Simon Willison's website.

Simon Willison #generative-ai #llms #claude-mythos

models Apr 29

Granite 4.1 LLMs: How They’re Built

IBM released Granite 4.1, a series of open-weights LLMs. The models are trained on a mix of synthetic and human-generated data. IBM used a combination of automated and human evaluation to select the best model. You can access Granite 4.1 through Hugging Face.

Key takeaways

Trained on synthetic and human-generated data.
Uses automated and human evaluation.
Available on Hugging Face.

Hugging Face Blog #open-source #open-weights #llms

other Apr 3

The NLP Course is becoming the LLM Course

The Hugging Face NLP Course is being renamed to the LLM Course, reflecting the field's shift towards large language models. The course will cover foundational concepts and practical applications of LLMs. You can expect updated content focusing on LLM use cases, deployment, and optimization. This change aims to provide learners with relevant skills for the current AI landscape.

Key takeaways

The NLP Course is being renamed to the LLM Course.
The course will cover LLM use cases, deployment, and optimization.
The change reflects the field's shift towards large language models.

Hugging Face Blog #education #llms #course

models Jan 9

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

The Open LLM Leaderboard now displays CO₂ emissions for models, providing transparency on environmental impact. This allows builders to weigh performance against sustainability when selecting models. The reported figures estimate the emissions generated while running each model through the leaderboard's evaluations, so they reflect inference cost rather than training. By considering both performance and emissions, builders can make more informed decisions about which models to use.

Key takeaways

Open LLM Leaderboard shows CO₂ emissions for models.
Reports estimated emissions from model inference during evaluation.
Helps builders evaluate model sustainability.

Hugging Face Blog #open-source #sustainability #llms

research Nov 19

Judge Arena: Benchmarking LLMs as Evaluators

The Judge Arena benchmark evaluates LLMs as evaluators, comparing their ability to assess AI-generated text. The benchmark provides a framework for testing LLMs' evaluation capabilities, which is essential for developing reliable AI systems. You can use this benchmark to assess and compare the performance of different LLMs as evaluators. The benchmark's results can help you identify the strengths and weaknesses of various LLMs.

Key takeaways

Judge Arena benchmark evaluates LLMs as evaluators.
Provides a framework for testing LLMs' evaluation capabilities.
Helps identify strengths and weaknesses of LLMs as evaluators.

Hugging Face Blog #benchmarks #evaluators #llms

research Jan 18

Preference Tuning LLMs with Direct Preference Optimization Methods

Direct Preference Optimization (DPO) is a method for tuning large language models to align with human preferences. DPO trains the model directly on pairs of preferred and rejected responses using a simple classification-style loss, rather than fitting a separate reward model and running a reinforcement learning loop. This approach has been shown to improve model performance and alignment with human values. You can implement DPO using Hugging Face's TRL library.

Key takeaways

DPO optimizes the model directly on human preference pairs, without a separate reinforcement learning step.
Improves model performance and alignment with human values.
Can be implemented using Hugging Face's TRL library.
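
To make the idea concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written in plain Python rather than with any training library. The function names, argument names, and the `beta=0.1` default are assumptions chosen for illustration. Each input is the summed log-probability of a full response under either the model being trained (the policy) or a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) response pair."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss shrinks as the policy raises the chosen response relative
    # to the rejected one, compared with the reference model.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# A policy identical to the reference gives a margin of zero, so the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that has shifted probability toward the chosen response scores lower.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(baseline, improved)
```

The `beta` parameter controls how strongly the policy is held close to the reference model: larger values penalize drift more. In practice the loss is averaged over a batch and minimized with gradient descent, which a training library handles for you.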

Hugging Face Blog #preference-tuning #llms #optimization

models Jan 10

Make LLM Fine-tuning 2x faster with Unsloth and 🤗 TRL

Unsloth and Hugging Face's TRL library now enable 2x faster LLM fine-tuning. This integration allows builders to train models more efficiently. Faster fine-tuning reduces costs and speeds up development. You can leverage this improvement in your own projects.

Key takeaways

Unsloth-TRL integration cuts fine-tuning time in half.
Faster training reduces costs and speeds development.
Improved efficiency benefits builders working on LLM projects.

Hugging Face Blog #fine-tuning #llms #hugging-face

research Nov 7

Comparing the Performance of LLMs: A Deep Dive into Roberta, Llama 2, and Mistral for Disaster Tweets Analysis with Lora

Researchers compared Llama 2, RoBERTa, and Mistral on disaster tweet analysis, fine-tuning each model with LoRA (low-rank adaptation). The study evaluated model performance on sequence classification tasks. You can find the code and models on the Hugging Face blog. The results provide insights for builders working on similar tasks.

Key takeaways

Llama 2, RoBERTa, and Mistral were compared on disaster tweet analysis.
The study used LoRA for sequence classification tasks.
Code and models are available on Hugging Face.

Hugging Face Blog #sequence-classification #disaster-analysis #llms