Daniel Bevenius

3 days ago

Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp maintainers and engineers from NVIDIA collaborated to improve the multi-GPU performance in ggml. This resulted in significant performance gains on RTX systems and laid the groundwork for hardware-agnostic tensor parallelism in ggml. For more information on this and other advancements in the low-level inference engine of llama.cpp, check the technical blog by @NVIDIARTXSpark below

ggerganov's tweet photo. Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp

Over the last few months llama.cpp maintainers and engineers from NVIDIA collaborated to improve the multi-GPU performance in ggml. This resulted in significant performance gains on RTX systems and laid the groundwork for hardware-agnostic tensor parallelism in ggml.

For more information on this and other advancements in the low-level inference engine of llama.cpp, check the technical blog by @NVIDIARTXSpark below

442

36K

dbevenius retweeted

9 days ago

llama.cpp now has an official website: https://t.co/vztdUpdBWL Our goal is to make local AI accessible to everyone, and improving the user experience is a big part of that. On the new landing page you’ll find a single-line cross-platform installer. The installation provides a single unified `llama` entrypoint which you can use to run/serve models and interface with 3rd-party agentic applications. While oriented towards simplified user experience, the new `llama` application also provides all the advanced functionality of the existing llama.cpp tooling with which experienced users are already familiar. Also note that all GGUF models that you might have already downloaded with llama.cpp in the past will be automatically available to use without downloading again (they are stored in the common HF cache on your machine). We have many improvements in the pipeline both at the UX and at the engine level and we plan to iteratively ship new things over the coming months. One of the main focuses will be seamless integration with local-friendly 3rd-party agents (such as Pi). In the meantime, we’ll continue to listen for feedback from the community and adjust accordingly, so keep letting us know what you think and need.

485

162K

dbevenius retweeted

16 days ago

Highlighting the new WebGPU backend in llama.cpp/ggml The work to bring full-fledged WebGPU support in llama.cpp started about an year and a half ago. It has been lead by @reeselevine and team at USCS. For more information, checkout the interactive blog and paper in the quoted post. Here are 2 excerpts from the paper, summarizing the implemented software architecture.

ggerganov's tweet photo. Highlighting the new WebGPU backend in llama.cpp/ggml

The work to bring full-fledged WebGPU support in llama.cpp started about an year and a half ago. It has been lead by @reeselevine and team at USCS.

For more information, checkout the interactive blog and paper in the quoted post. Here are 2 excerpts from the paper, summarizing the implemented software architecture.

612

109

316

90K

dbevenius retweeted

CEO @ Alludium AI. CTO & Venture Partner @ Sure Valley Ventures

20 days ago

llama.cpp adds MTP for the Qwen3.6 family This is a significant milestone for the local AI ecosystem. The performance jump with these changes is massive and elevates local inference on commodity hardware further. Special thanks to Aman Gupta for leading this development! https://t.co/vjaMwEpIaR

182

532

273K

Who to follow

John Frizelle

@johnfriz

Matteo Collina

@matteocollina

@platformatic Co-Founder & CTO, @nodejs TSC Chair, Lead maintainer @fastifyjs, Board @OpenJSF, Conference Speaker, Ph.D. Past: @nearform. Views are my own.

Red Hat Brasil

@RedHatBR

Somos fornecedor líder de soluções empresariais de código aberto com tecnologias como Linux, Nuvem Híbrida, Inteligência Artificial, Edge e Automação.

dbevenius retweeted

2 months ago

llama.cpp at 100k stars now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄 Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss and yet it feels less and less important to try to make a point. Opinions about viability of local LLMs are strongly polarized, details are overlooked, the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves. One thing is clear though - local LLMs are used more and more. I expect this trend to continue and likely 2026 will end up being one of the most important years for the local AI movement. I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for doing long-context tasks. There wasn't an obvious path towards meaningful agentic applications. The memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released and by now, useful local agentic workflows are a reality. Comparing local vs hosted capabilities at a given moment of time is pointless. To try put things into perspective: - We don't need frontier intelligence to automate searches and sending emails - We don't need trillion parameter models to be able to summarize articles or technical documents - We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage I believe that there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that level, access to more intelligence becomes unnecessary at best and counterproductive at worst. I also believe that that level of useful artificial intelligence is completely within reach locally and it has always been just a matter of implementing the right software stack to bring it to the end user. With llama.cpp, I am confident that we continue to be on the right track of building that software stack! The llama.cpp project is going stronger than ever. With more than 1500 contributors, the project keeps growing steadily. From technical point of view, I think that llama.cpp + ggml is the only solution that actually makes sense. That is, the software stack must run efficiently on every possible device, hardware and operating system. The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors. This is the only right way to build something that will truly make a difference in the long run. I won't try to convince you about what is currently and will be possible with local AI. We will just continue to build as usual. I am confident that after the smoke clears and we look objectively at what we have built together, the benefits will be obvious to everyone. Big shoutout to all llama.cpp maintainers. I feel extremely lucky to be able to work together with so many talented contributors. Every day I learn something new and I feel there is so much more cool stuff that we are going to build. Also, I am really thankful that the project continues to have reliable partners to support it! Cheers!

ggerganov's tweet photo. llama.cpp at 100k stars

now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄

Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss and yet it feels less and less important to try to make a point. Opinions about viability of local LLMs are strongly polarized, details are overlooked, the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves.

One thing is clear though - local LLMs are used more and more. I expect this trend to continue and likely 2026 will end up being one of the most important years for the local AI movement.

I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for doing long-context tasks. There wasn't an obvious path towards meaningful agentic applications. The memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released and by now, useful local agentic workflows are a reality.

Comparing local vs hosted capabilities at a given moment of time is pointless. To try put things into perspective:

- We don't need frontier intelligence to automate searches and sending emails
- We don't need trillion parameter models to be able to summarize articles or technical documents
- We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage

I believe that there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that level, access to more intelligence becomes unnecessary at best and counterproductive at worst. I also believe that that level of useful artificial intelligence is completely within reach locally and it has always been just a matter of implementing the right software stack to bring it to the end user.

With llama.cpp, I am confident that we continue to be on the right track of building that software stack!

The llama.cpp project is going stronger than ever. With more than 1500 contributors, the project keeps growing steadily.

From technical point of view, I think that llama.cpp + ggml is the only solution that actually makes sense. That is, the software stack must run efficiently on every possible device, hardware and operating system. The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors. This is the only right way to build something that will truly make a difference in the long run.

I won't try to convince you about what is currently and will be possible with local AI. We will just continue to build as usual. I am confident that after the smoke clears and we look objectively at what we have built together, the benefits will be obvious to everyone.

Big shoutout to all llama.cpp maintainers. I feel extremely lucky to be able to work together with so many talented contributors. Every day I learn something new and I feel there is so much more cool stuff that we are going to build. Also, I am really thankful that the project continues to have reliable partners to support it!

Cheers!

279

397

455K

dbevenius retweeted

3 months ago

GGUF model on Hugging Face: https://t.co/0XP7qr5s6P Find initial llama.cpp performance metrics at: https://t.co/lRWuzlmsGV

dbevenius retweeted

3 months ago

In collaboration with NVIDIA we announce support for the new NVIDIA Nemotron 3 Super model in llama.cpp NVIDIA Nemotron 3 Super is a 120B open MoE model activating just 12B parameters to deliver maximum compute efficiency and accuracy for complex multi-agent applications.

240

16K

dbevenius retweeted

Xuan-Son Nguyen

@ngxson

4 months ago

Qwen3-Coder-Next and Minimax-M2.1 are available on HF inference endpoints with the price of $2.5/hr and $5/hr respectively. With the context fitting supported, you can now utilize the largest context length possible for a given hardware. No more manual tuning -c option!

ngxson's tweet photo. Qwen3-Coder-Next and Minimax-M2.1 are available on HF inference endpoints with the price of $2.5/hr and $5/hr respectively.

With the context fitting supported, you can now utilize the largest context length possible for a given hardware. No more manual tuning -c option! https://t.co/oTVdhDVrf8

10K

dbevenius retweeted

4 months ago

Introducing LlamaBarn — a tiny macOS menu bar app for running local LLMs Open source, built on llama.cpp

782

424

54K

dbevenius retweeted

Xuan-Son Nguyen

@ngxson

4 months ago

Hugging Face Inference Endpoint now supports deploying GLM-4.7-Flash via llama.cpp, for as cheap as $0.8/hr Using Q4_K_M and 24k tokens context length - should be enough for most use case!

ngxson's tweet photo. Hugging Face Inference Endpoint now supports deploying GLM-4.7-Flash via llama.cpp, for as cheap as $0.8/hr

Using Q4_K_M and 24k tokens context length - should be enough for most use case! https://t.co/ADrAGKQls9

10K

dbevenius retweeted

5 months ago

Recent contributions by NVIDIA engineers and llama.cpp collaborators resulting in significant performance gains for local AI

ggerganov's tweet photo. Recent contributions by NVIDIA engineers and llama.cpp collaborators resulting in significant performance gains for local AI https://t.co/NFkopVZaFz

586

114

36K

dbevenius retweeted

ggml @ggml_org

8 months ago

whisper.cpp v1.8.1 https://t.co/M53Ax1OefG

dbevenius retweeted

8 months ago

HuggingFace just shipped in-browser GGUF editing It allows you to edit GGUF metadata in the comfort of your browser, without having to even download the full model. This feature is enabled via the Xet technology that makes partial file updates possible.

370

100

37K

dbevenius retweeted

over 3 years ago

We just released Drogue Cloud 0.11.0! Are you curious about CoAP with DTLS, TLS-PSK, or #digitaltwin? Then maybe take a few minutes and read our release blog post: https://t.co/aesxJ51qIa

dbevenius retweeted

over 3 years ago

Using @micropython? Take a look at our fresh examples that work with the @Raspberry_Pi Pico W: https://t.co/fDGXy2avup

dbevenius retweeted

Jens Reimann / @[email protected] @ctron

over 3 years ago

Rust: The future is here! @rustembedded @EclipseCon @DrogueIoT

dbevenius retweeted

over 3 years ago

We come in peace! @EclipseCon https://t.co/w2hwFAKhBW

dbevenius retweeted