voratiq @voratiq - Twitter Profile

about 10 hours ago

@RamaswmySridhar With GLM + Fireworks + Pi harness we are seeing ~98% cache reuse. A low cache hit rate is usually a harness issue. Maybe this is the coco optimization you are teasing?

voratiq @voratiq

1 day ago

@quick007YT Yes, good reference and related, but our data represents merge outcomes, e.g. the issues found during code review that are hard to encode in tests. https://t.co/1Avppe90fA

voratiq @voratiq

1 day ago

@gianwirth Ya, DeepSWE is a test-based eval. We find passing tests to be a weak proxy for code quality. We measure what code is merged into real codebases. This evaluates things that are hard to encode in tests (whether the code is idiomatic, aligned with existing codebase patterns, overengineered, etc.)

voratiq @voratiq

1 day ago

An interesting result...

We've found that every GPT-5.5 variant has a better and cheaper alternative:

- 5.5 → 5.4 high
- 5.5 high → 5.4 xhigh
- 5.5 xhigh → GLM-5.2 max

In 2/3 cases, just drop to the cheaper model and increase reasoning https://t.co/maJoCtbGwC

voratiq @voratiq

1 day ago

@filicroval Looking more deeply into this now, actually.

Roughly, we find, over this data:

- 5.5 is ~8% more token-efficient than 5.4 at default/high (but less token-efficient at xhigh)
- GLM uses many more tokens, ~2-2.4x, but they're mostly cached reads, so very cheap
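
To make the cached-reads point concrete, here is a minimal sketch of how a run that uses 2.4x the input tokens can still cost less when nearly all of them are cache hits. Every number below (token counts, hit rates, per-million prices) is a made-up placeholder, not Voratiq's data or any provider's actual pricing, and `run_cost` is a hypothetical helper.

```python
def run_cost(input_tokens, cache_hit_rate, output_tokens,
             price_input, price_cached, price_output):
    """Dollar cost of one agent run. Prices are USD per million tokens."""
    cached = input_tokens * cache_hit_rate      # tokens billed at the cached-read rate
    uncached = input_tokens - cached            # tokens billed at the full input rate
    total = (uncached * price_input
             + cached * price_cached
             + output_tokens * price_output)
    return total / 1_000_000

# Hypothetical prices, identical for both runs so that only caching differs.
PRICES = dict(price_input=1.0, price_cached=0.1, price_output=4.0)

lean_run = run_cost(1_000_000, 0.50, 20_000, **PRICES)   # fewer tokens, 50% cache hits
heavy_run = run_cost(2_400_000, 0.98, 20_000, **PRICES)  # 2.4x tokens, 98% cache hits
print(f"lean: ${lean_run:.2f}  heavy: ${heavy_run:.2f}")
```

Under these placeholder assumptions the heavier run comes out cheaper, because almost all of its extra tokens are billed at the cached-read rate.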

voratiq @voratiq

1 day ago

@morganlinton @trydotworks Agreed! Curious to see your Fable results, do you have a link?

voratiq @voratiq

1 day ago

@Thejuampi Yeah. It is more token-efficient, but given the "all in" cost to complete a real task end to end, perhaps ~20% overpriced.

voratiq @voratiq

1 day ago

@filicroval ~600 runs across all models. 5.4 and 5.5, ~100 runs. GLM, ~10 runs (but growing).

Ratings = Bradley-Terry model on pairwise outcomes across many tasks (varied difficulty, domain, etc.).

All agentic. Run = set of coding agents implementing the same spec, each in its native harness
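
For readers unfamiliar with Bradley-Terry ratings, the sketch below fits strengths from a list of (winner, loser) pairs using the standard minorization-maximization update. It is an illustration only: the agent names and outcomes are invented, `fit_bradley_terry` is a hypothetical helper, and nothing here is claimed to match Voratiq's actual pipeline.

```python
from collections import defaultdict

def fit_bradley_terry(outcomes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update p_i = W_i / sum_j n_ij / (p_i + p_j),
    renormalizing so the strengths sum to 1 after every pass.
    """
    wins = defaultdict(float)    # total wins per agent
    games = defaultdict(float)   # number of matches per unordered pair
    agents = set()
    for winner, loser in outcomes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        agents.update((winner, loser))

    strength = {a: 1.0 for a in agents}
    for _ in range(iters):
        updated = {}
        for a in agents:
            denom = 0.0
            for pair, n in games.items():
                if a in pair:
                    (b,) = pair - {a}            # the opponent in this pair
                    denom += n / (strength[a] + strength[b])
            updated[a] = wins[a] / denom if denom else strength[a]
        total = sum(updated.values())
        strength = {a: v / total for a, v in updated.items()}
    return strength

# Invented head-to-head results between three hypothetical agents.
outcomes = ([("A", "B")] * 3 + [("B", "A")] * 1
            + [("B", "C")] * 2 + [("C", "B")] * 1)
ratings = fit_bradley_terry(outcomes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Note that an agent with zero wins gets a strength of zero under plain maximum likelihood, which is one reason small samples (such as ~10 runs) produce extreme estimates.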

voratiq @voratiq

1 day ago

@sven2401 We measure merge outcomes on a continuously evolving test set of real SWE tasks. So we get signal on issues that surface during code review. Most coding evals just measure whether tests pass. At a high level, 5.5 is penalized most often for scope creep.

voratiq @voratiq

1 day ago

@SterlingNuru Same. We find that GPT-5.4 high in particular has a really nice balance of cost and performance for straightforward engineering tasks.

voratiq @voratiq

1 day ago

@conv3rging Merge outcomes from head-to-head agent runs on real engineering tasks (feature work, refactors, debugging).

voratiq @voratiq

1 day ago

@stan_info Definitely. Will be interesting to see the strength and pricing of the next GPT (whether it's 5.6 or 6) in light of this.

voratiq @voratiq

3 days ago

@Ixel111 We experimented with several but found @FireworksAI_HQ to be the most reliable.

voratiq @voratiq

3 days ago

GLM-5.2 max debuts at #3 on the Voratiq leaderboard

The first open-weight model to compete at the top of the frontier

Since it's open-weight, this is the floor

Cost, duration, and performance are all open to improvement as well https://t.co/j49TQ9eP5H

voratiq @voratiq

3 days ago

The full GLM-5.2 deep dive just went out to subscribers

Cost, duration, the open-weight win matrix, methodology

Subscribe to get the next one → https://t.co/nirUKjy3Xf

voratiq @voratiq

7 days ago

@jeremyphoward @Zai_org https://t.co/dtfRAKvnuS

voratiq @voratiq

8 days ago

After more head-to-head matches

We're finding GLM 5.2 high to be ... quite good

Probability it beats:

- Opus 4.8 xhigh: 32%
- GPT-5.5 xhigh: 64%
- Kimi K2.7 Code (next-best open): 100%

Current best-estimate rank: 3rd of 56 https://t.co/0XNCOAVkS3
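
Assuming these probabilities come from a Bradley-Terry-style rating (as the reply to @filicroval elsewhere in this feed describes), each one maps directly to a ratio of model strengths. The sketch below shows that mapping in both directions. The function names are hypothetical, and the only inputs are the three percentages quoted above.

```python
import math

def win_probability(strength_a, strength_b):
    """Bradley-Terry probability that A beats B."""
    return strength_a / (strength_a + strength_b)

def implied_strength_ratio(p_win):
    """Strength ratio A/B implied by a win probability (infinite at 100%)."""
    if p_win >= 1.0:
        return math.inf
    return p_win / (1.0 - p_win)

# Percentages quoted in the post, keyed by opponent.
quoted = {"Opus 4.8 xhigh": 0.32, "GPT-5.5 xhigh": 0.64, "Kimi K2.7 Code": 1.00}
ratios = {name: implied_strength_ratio(p) for name, p in quoted.items()}
```

A quoted 100% implies an unbounded ratio, which typically signals that the pair has too few matches for a stable estimate rather than a truly unbeatable model.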

voratiq @voratiq

8 days ago

Sending out a deep dive to our subscribers early next week → https://t.co/nirUKjy3Xf

voratiq @voratiq

8 days ago

Still noisy though, will keep testing!

This is all within an agentic coding context, on real SWE tasks

Of course, with more data, across more domains, results could shift

voratiq @voratiq

8 days ago

@m_im_ha https://t.co/S8VKo9jh4o

voratiq @voratiq

9 days ago

GLM 5.2 high just won head-to-head against Opus 4.8 xhigh and GPT 5.5 xhigh

The task was a tricky performance optimization in an internal code-analysis product

First time we've seen an open-weight agent outperform the top closed agents

Very interesting result... https://t.co/VMPOVs1s8D
