Introducing ๐๐๐ฆ๐ฆ๐ ๐ ๐๐๐ ๐๐ฎ๐ซ๐๐จ โก๏ธ
It runs on a ๐ด๐ช๐ฏ๐จ๐ญ๐ฆ RTX 5090, at 51 tok/s (single) and 1244 tok/s (batched). And prefills up to 15359 tok/s.
It's ๐๐% ๐ฌ๐ฆ๐๐ฅ๐ฅ๐๐ซ in GPU memory and ~๐.๐๐ฑ ๐๐๐ฌ๐ญ๐๐ซ than the base model, and retains nearly ๐ข๐๐๐ง๐ญ๐ข๐๐๐ฅ ๐ช๐ฎ๐๐ฅ๐ข๐ญ๐ฒ on benchmarks (1-3% loss).
Turbo is a derivative of the NVFP4 quant that NVIDIA released a few days ago. It fully leverages NVIDIA Blackwell FP4 tensor cores for ~๐ร ๐ก๐ข๐ ๐ก๐๐ซ ๐๐จ๐ง๐๐ฎ๐ซ๐ซ๐๐ง๐ญ ๐ญ๐ก๐ซ๐จ๐ฎ๐ ๐ก๐ฉ๐ฎ๐ญ ๐ญ๐ก๐๐ง ๐จ๐ญ๐ก๐๐ซ ๐ช๐ฎ๐๐ง๐ญ๐ฌ.
I'm using it for hard classification tasks โ on internal benchmarks it showed ๐๐จ๐ง๐ง๐๐ญ-๐.๐-๐ฅ๐๐ฏ๐๐ฅ ๐ข๐ง๐ญ๐๐ฅ๐ฅ๐ข๐ ๐๐ง๐๐ (scored well above Haiku 4.5), at a 600๐ต๐ฉ of the cost. A single RTX 5090 scales up to 18 req/s at 1000in/20out ๐ฅต.
Model card and benchmark in comments ๐
I'd love to hear your use cases.
I see what you mean about solving skill issues at the framework-level, and in fact Langchain / Mastra are partially trying that by offering high-level APIs to even the simplest things, like managing the context window (which in most cases is a few hundreds lines of code). I'm talking about the opposite, instead of the framework support high-level features, it'd just offer low-level primitives you could compose to achieve any high-level feature. Like React for web development. React doesn't tell you how you should architect your app, nor how you should implement this or that features (and same for frameworks on top of React), instead it offers a set of primitives, that can be explained in 10mins, and allow you to compose from the simplest to the most complex applications.
Skill issue is partially the responsibility of the framework too. If skilled devs assisted with sota coding models are struggling to produce even the simplest agents, maybe the framework creates more complexity than it solves? Ultimately an agent is just a loop + tools + conversation history. Itโs easy to call out skill issues, but in my opinion, if a framework was properly designed, even a junior dev with a coding assistant could scale a complex agentic app. Mastra and Langchain solve the ยซย Can your framework do X?ย ยป question by pilling up features for years, instead of shrinking down to a set of primitives that are flexible and minimal enough to do anything.
@Rebecca49484009 I feel you. It seems like the existing frameworks are failing pretty fast when you cross the boundaries of what their authors had in mind. And so we build layers of evals and observability tools to monior even the simplest agents instead of fixing the root cause
Introducing Ghod-1, a model outperforming Mythos.
Ghod-1 excels at everything.
We released it to a handful of partners for safety and financial reasons.
Priced at $2K/M tokens in and $6K/M tokens out.