SWE-Bench is mostly Python.
Our codebase is Rails + Phlex + Stimulus.
So we built our own SWE-Bench using real PRs.
Results ๐
GPT-5.3 Codex: ~0.70 quality, < $1
Opus 4.6: ~0.61 quality, ~ $5
Codex shipped better code at ~1/7th the cost.
Opus 4.6 barely improved over 4.5.
If you're looking to use Hermes Agent, the best model under 20B right now is Ornith 9B.
It's fast, follows tool calls reliably, and works surprisingly well for real-world agent workflows. If you're running locally and want a strong balance of speed and capability, it's an easy recommendation.
https://t.co/eiEvm7LX6P
I think the biggest problem with local LLMs on GPUs with <16GB VRAM is the context limit.
What if we built a system that continuously indexes the entire repo, builds dependency/call graphs, understands the architecture, docs, workflows, etc., so the model retrieves only the relevant information instead of loading everything into the context?
Since 35B local models still arenโt as capable as GPT-4.8 or Sonnet on harder tasks, we could also add a confidence-based research loop. If the model detects itโs stuck or its confidence is low, it automatically researches the missing information, replans, and retries instead of hallucinating.
Feels like this could make local vibe coding actually viable on consumer GPUs.
Thoughts?
@emilstridell@MiaAI_lab@UnslothAI@NVIDIAAI Iโm currently running qwen 35b a3b with offloading layers getting 50-60 around tk/s and context 120k, gemma 4 12b sucks hard bro
@IntCyberDigest Irony how democratic country like USA is increasingly promoting closed-source AI models, while China is driving the open-source AI ecosystem forward by releasing powerful models at remarkably affordable prices.
@0xSero Irony how democratic country like USA is increasingly promoting closed-source AI models, while China is driving the open-source AI ecosystem forward by releasing powerful models at remarkably affordable prices.
@0xSero Irony how democratic country like USA is increasingly promoting closed-source AI models, while China is driving the open-source AI ecosystem forward by releasing powerful models at remarkably affordable prices.
If you're planning to code with a local LLM and have 16GB of VRAM or less, Ornith-1.0-35B is the only model I'd confidently recommend.
I've tried a lot of local coding models, and this one genuinely stands out. It follows complex instructions, understands large codebases, writes clean, maintainable code, and stays remarkably consistent throughout long coding sessions. It honestly feels like a different class of local coding model.
I'm running it on my local machine with an RTX 5070 Ti (16GB VRAM) and 32GB RAM, and it's absolutely rock solid. I'm even using a 90K context window with llama.cpp, and it's handling large repositories and long coding sessions far better than I expected.
My current llama.cpp configuration:
llama-server.exe ^
-hf "%LLMODEL%" ^
-ngl 999 ^
-fa on ^
--n-cpu-moe 20 ^
-np 1 ^
-c 90000 ^
--no-mmap ^
--cache-type-k q8_0 ^
--cache-type-v turbo3 ^
--temp 0.6 ^
--top-p 0.95 ^
--top-k 20 ^
--min-p 0.05 ^
--presence-penalty 0.0 ^
--chat-template-file .\qwen_fix.jinja ^
--reasoning-budget 2048 ^
--jinja
Massive respect to the Ornith team. This model is genuinely something special.
https://t.co/G5rGeUJ2zv