We've partnered with Appen to evaluate the benchmarks we published last week.
Results are in and we've actually improved across the board.
Link below to the full report.
Asked Alexander Whedon, CEO of @subquadratic, if SubQ will replace or augment existing vanilla autoregressive LLMs like GPT-5.5/Opus 4.7.
SubQ specializes in long-context tasks:
> 12M-token context length
> 52x faster than FlashAttention
> 20x cheaper than Opus
An important caveat is that SubQ doesn't provide a significant lift outside of long-context tasks.
Clearly, models like GPT-5.5/Opus 4.7 can use SubQ as a *tool* within an agent harness. The tool is invoked for the long-context use cases and passes its responses back to the AR LLM.
This alone would be a gamechanger for you if you build with AI.
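A minimal sketch of that tool pattern, under stated assumptions: `route`, `count_tokens`, and `LONG_CONTEXT_THRESHOLD` are hypothetical names, the 200,000-token cutoff is an illustrative default rather than a published figure, and the two callables stand in for whatever clients a real harness would use for the AR LLM and the long-context model.

```python
from typing import Callable

# Hypothetical cutoff: contexts longer than this go to the long-context tool.
LONG_CONTEXT_THRESHOLD = 200_000


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real harness would use the model's tokenizer."""
    return len(text.split())


def route(
    question: str,
    context: str,
    ar_llm: Callable[[str], str],
    long_context_tool: Callable[[str, str], str],
    threshold: int = LONG_CONTEXT_THRESHOLD,
) -> str:
    """Answer a question, delegating oversized contexts to the tool."""
    if count_tokens(context) > threshold:
        # The tool reads the full context and returns a short result,
        # so the AR LLM only ever sees the question plus that result.
        tool_result = long_context_tool(question, context)
        return ar_llm(f"Question: {question}\nTool result: {tool_result}")
    # Short contexts go straight to the AR LLM.
    return ar_llm(f"Question: {question}\nContext: {context}")
```

The design choice here is that the AR LLM stays in charge of the final answer and the long-context model is consulted only when the input is too large to handle directly.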
Yes, we are using weights from open-source models as a starting point, as a function of our funding and maturity as a company. This is something we intend to change, and we have run many from-scratch experiments at smaller scale already, including with further architectural variations. We take the weights, port them into our architecture, and do continued pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL) for the behaviors we want.
To date, sub-quadratic architectures have required a significant quality tradeoff on long context. Our algorithm changes that. We are using that to do faster training, faster inference, and longer-context training and inference.
We just shared a technical blog post (https://t.co/tPLzi0eNJR) with more details, and we will share further details in a model card next week. If you think anything is missing, let us know and we will make sure to include it!
We were a little slow on this, but we just got a technical blog post up with more details. Please take a look!
https://t.co/tPLzi0eNJR
We have a model card coming next week, and we are happy to take requests for any specific details there.
I am happy to answer any questions here!
Attention Is All You Need (2017): most cited ML paper of the decade.
For 8 years, every frontier model has been built on quadratic attention. Process every possible word-to-word relationship. Compute explodes with context length. Accuracy degrades past 200k tokens.
Sub-quadratic attention was always the endgame. The labs just had too much invested in transformers to admit it.
SubQ is the first production-ready sub-quadratic LLM. 12M token context. Linear scaling, not quadratic. Outperforms Opus 4.6 on long context at less than 10% of the cost. 52x faster than FlashAttention.
Linear vs quadratic. That's the whole game.
Huge context windows are the biggest lie in AI.
Honestly, I haven't seen any benefits to scaling past 1M tokens. The more data you show the model, the dumber it gets, so larger windows are pointless.
Attention is quadratic:
If you double the context, you are quadrupling the compute.
Past a certain point, models will get slow and expensive and start making stuff up. And they have a lot of trouble remembering the details in the middle of the context.
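To make the quadratic claim concrete, here is a small sketch that counts the token-to-token scores full self-attention computes. The function name `attention_pairs` is made up for illustration, and the count covers a single head in a single layer, ignoring constant factors.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token-to-token scores computed by full self-attention (one head, one layer)."""
    return n_tokens * n_tokens


for n in (100_000, 200_000, 400_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):,} scores")

# Doubling the context quadruples the number of scores.
ratio = attention_pairs(200_000) / attention_pairs(100_000)
print(ratio)
```

Each doubling of the context multiplies the score count by four, which is the growth the workarounds below try to avoid.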
There are a million workarounds for this:
• Chunking
• Summarization layers
• Retrieval patches
• Sliding windows
But, honestly, they are just meh.
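For reference, a minimal sketch of the chunking and sliding-window idea from the list above. `sliding_windows` is a hypothetical helper, and the window and stride values in the usage note are arbitrary. Each chunk is processed separately, so nothing in one chunk can directly attend to tokens that fall only in another, which is the limitation being complained about.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int, stride: int) -> List[List[int]]:
    """Split tokens into overlapping chunks of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        # Stop once the current chunk reaches the end of the input.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

With ten tokens, a window of 4, and a stride of 2, this yields four chunks that each overlap the previous one by two tokens.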
Here is something new, and potentially a solution that will help with this:
Subquadratic built an LLM that uses a subquadratic architecture. This means that the cost of increasing the context doesn't explode as it does with standard transformers.
Their LLM:
• 12M tokens of usable context
• No chunk-and-stitch workarounds needed
• The full context goes into the model, not summaries of it
If this works as advertised, it will completely change what Context Engineering means.
Introducing SubQ - a major breakthrough in LLM intelligence.
It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA),
and the first frontier model with a 12-million-token context window. It is:
- 52x faster than FlashAttention at 1M tokens
- Less than 5% of the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention).
Only a small fraction actually matter.
@subquadratic finds and focuses only on the ones that do.
That's nearly 1,000x less compute and a new way for LLMs to scale.
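As a rough illustration of attending to only the relationships that matter, here is a generic top-k sparse attention sketch for a single query. This is not SubQ's algorithm, which these posts do not describe: `topk_attention` is a hypothetical name, and this naive version still scores every key before picking the top k, so it shows the sparsity idea without the sub-quadratic selection step a real system would need.

```python
import math
from typing import List, Sequence, Tuple


def topk_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Attend from one query to only its k highest-scoring keys."""
    if not 1 <= k <= len(keys):
        raise ValueError("k must be between 1 and the number of keys")
    scale = math.sqrt(len(query))
    # Scaled dot-product score against every key (the naive part).
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Keep only the indices of the k largest scores.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only, shifted by the max for stability.
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(top)
```

With k equal to the number of keys this reduces to ordinary softmax attention for that query; smaller k drops the low-scoring keys from the weighted sum entirely.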
@mhhya888 Yes, I was trying to copy CSS from
https://t.co/OgbIF1l3on
but there are a lot of IDs
if you could process the refund, I would be super grateful!