@jun_song we should stop building in silos. A community of devs working
together on one shared project would save everyone time, energy and
effort compared to everyone duplicating work on their own.
Would love to hear your thoughts. I can also bring funding to support
the initiative.
I've been working on something for a few months now, and I'd rather
talk to you directly before it becomes public anywhere.
I've been building a heavily optimized Rust fork of Ollama. The goal
has been to push inference performance significantly beyond the current
baseline, particularly on Apple Silicon but also on commodity hardware.
I've personally invested over 15,000€ in LLM API tokens alone (for
prompt engineering, architectural planning, code review, benchmark
analysis, and iterative refinement). That's just the token spend, not
counting the time.
The technical focus has been on:
- MLX backend integration for native M-series acceleration, leveraging
the GPU and unified memory architecture
- Metal Performance Shaders for custom compute kernels
- candle (Hugging Face's Rust-native ML framework) as an alternative
inference path, with ort + CoreML execution provider fallback
- Custom KV cache reuse strategies across concurrent requests to cut
prefill latency on multi-turn conversations
- SSD-backed context offloading for 128k+ context windows without OOM
- Continuous batching inspired by vLLM for multi-user scenarios
- Tokio-based async runtime with scheduler tuning for low first-token
latency
- Zero-copy memory mapping for GGUF model loading
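To make the KV cache reuse bullet concrete, here is a minimal sketch of the lookup step. It is illustrative only and not code from the fork: `PrefixCache`, `TokenId`, and `longest_reusable_prefix` are hypothetical names, and a real cache would store attention key/value tensors rather than bare token IDs. The idea is that a new prompt sharing a leading run of tokens with a cached sequence only needs prefill for the tokens after that run.

```rust
/// Hypothetical token identifier type; real tokenizers may use other widths.
type TokenId = u32;

/// Illustrative cache that remembers which token sequences already have
/// KV state. Only the token IDs are kept here; the tensors are omitted.
struct PrefixCache {
    entries: Vec<Vec<TokenId>>,
}

impl PrefixCache {
    fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Record a token sequence whose KV state is assumed to be cached.
    fn insert(&mut self, tokens: &[TokenId]) {
        self.entries.push(tokens.to_vec());
    }

    /// Number of leading prompt tokens covered by the best-matching entry.
    /// Tokens past this index would still need a prefill pass.
    fn longest_reusable_prefix(&self, prompt: &[TokenId]) -> usize {
        self.entries
            .iter()
            .map(|cached| common_prefix_len(cached, prompt))
            .max()
            .unwrap_or(0)
    }
}

/// Length of the shared leading run of two token sequences.
fn common_prefix_len(a: &[TokenId], b: &[TokenId]) -> usize {
    a.iter().zip(b.iter()).take_while(|(x, y)| x == y).count()
}
```

A linear scan over entries keeps the sketch short; a production design would more likely use a trie or radix tree so that lookup cost does not grow with the number of cached conversations.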
The fork is designed to be modifiable, tunable, and extensible. Not
locked into a single runtime philosophy. Everything from the scheduler
to the memory layer is open for iteration.
I'll share the full codebase with you.
Now here's the bigger picture, because this fork is only one piece of
what I'm building.
I'm also working on a second project called AURA, which is a
decentralized self-improving LLM network. The underlying idea is
"Bitcoin for AI". Instead of centralized training on hyperscaler GPU
clusters, AURA uses federated learning via Flower to continuously
improve a base model (currently Gemma 4 31B) across a distributed
network of nodes. Contributions are tracked and rewarded through a
custom Substrate-based chain with a native token. The goal is an LLM
that gets progressively smarter over time without depending on any
single company's training budget.
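For a sense of what the aggregation step in such a network could look like, here is a minimal sketch of federated averaging (FedAvg), where each node's weights count in proportion to the number of examples it trained on. This is an assumption for illustration: the message does not say which aggregation rule AURA uses, the sketch does not call Flower, and `ClientUpdate` and `federated_average` are hypothetical names.

```rust
/// One node's contribution: its updated weights and how many local
/// examples produced them. Hypothetical type for illustration.
struct ClientUpdate {
    weights: Vec<f32>,
    num_examples: u64,
}

/// Example-weighted average of client weights (FedAvg-style).
/// Returns None when there are no updates, no examples, or the
/// weight vectors do not all have the same length.
fn federated_average(updates: &[ClientUpdate]) -> Option<Vec<f32>> {
    let dim = updates.first()?.weights.len();
    let total: u64 = updates.iter().map(|u| u.num_examples).sum();
    if total == 0 || updates.iter().any(|u| u.weights.len() != dim) {
        return None;
    }

    // Accumulate in f64 to limit rounding error across many clients.
    let mut avg = vec![0.0f64; dim];
    for update in updates {
        let share = update.num_examples as f64 / total as f64;
        for (acc, w) in avg.iter_mut().zip(&update.weights) {
            *acc += share * f64::from(*w);
        }
    }
    Some(avg.into_iter().map(|x| x as f32).collect())
}
```

A real deployment would also need to validate contributions before averaging them, since any node can submit an update.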
And I also maintain a fork of a Rust-based personal AI agent framework
that handles multi-channel communication (WhatsApp, Telegram, Discord,
iMessage), long-term memory, MCP tool use, and autonomous task
execution.
Here's where it all comes together, and this is what I'd love to build
with someone like you.
The plan is to combine these three pieces into a unified local
AI stack:
1. The optimized Ollama fork serves inference. Fast, efficient,
minimal resource footprint on consumer hardware.
2. The agent framework runs the autonomous logic. It handles the user
interactions, tool use, memory management, and long-running tasks.
3. AURA runs as the learning layer in the background. And here's the
key architectural move: the agent framework runs as AURA's data
collection and synthesis agent during off-hours. While the user
sleeps, the agent queries connected knowledge sources, crawls
relevant domains, analyzes new content, extracts insights, and
submits training contributions to the AURA federated learning
network.
The result: a local LLM that is lightweight, powerful, and keeps
improving on its own every night without human intervention. Every
node running this stack contributes to the
collective intelligence. Every node also benefits from the continuously
improved weights pulled down from the network. The system gets
smarter while you sleep, on your own hardware, under your own control.
This is a long-term bet on local-first, user-owned AI infrastructure
that doesn't depend on OpenAI, Anthropic, or any centralized provider.
Rust is the foundation throughout because performance, memory safety,
and cross-platform deployment matter.
What we can do:
1. Share the full code with you right away, no strings. You review
what I've built, tell me honestly what you think, what to improve,
what's missing.
2. If you see the potential, we co-found a Discord community around
this project. You and I would both be administrators. We build
the core group together, vet the early contributors, and shape
the technical direction jointly.
3. We aim the community at a concrete goal: beating Opus 4.7 on
response latency with a local model on M-series hardware within 12
months. That's a rallying cry strong enough to attract serious
contributors.
On community building and reach, I want to be transparent. My X
following is small because I only started using the platform seriously
six months ago when AI conversations moved there. But I have other
distribution channels:
- 250,000 followers on Instagram (built over years in a different
business, but an engaged audience)
- Significant press coverage across multiple outlets and years
- An existing company (Soflution ltd) with real revenue and the
ability to fund community initiatives
So when we're ready to go public with a Discord launch, I can amplify
it beyond the typical dev Twitter bubble, to an audience that doesn't
normally engage with open source but might fund or champion the right
project.
I'm reaching out to you specifically because of your work and because
I want one real collaborator before this becomes anything public.
I appreciate you reading this far.