Alejandro Companioni

@acompa_

Research at ¬◇. Prev: @spotify @stitchfix @betaworks.

Miami

Joined October 2020

274 Following

104 Followers

398 Posts

Alejandro Companioni

@acompa_

about 18 hours ago

@scaling01 So much room to optimize spend without loss of performance on general tasks. At ¬◇ we’ve written about this before, and will announce some work soon to solve it. https://t.co/pUoy4uWTml

266

Alejandro Companioni

@acompa_

about 20 hours ago

Impressive work by the https://t.co/ov08dSJ4Vy team. Performance is competitive with Opus 4.8 on a number of significant benchmarks, and competitive for 2nd-best on many others. Curious about the practical effectiveness of effort level - tuning thinking tokens can be tricky.

Z.ai @Zai_org

1 day ago

Introducing GLM-5.2: Frontier Intelligence, Open Weights

- Significant improvements in coding and agentic tasks
- Strong long-horizon capabilities with a 1M context window
- Two levels of reasoning effort: GLM-5.2 (max) pushes the limits, while GLM-5.2 (high) strikes a strong balance between performance and token efficiency
- MIT-licensed open weights
- Same API pricing as GLM-5.1

Tech Blog: https://t.co/LAsxUdN0JZ
Weights: https://t.co/g0A1C4UWx4
API: https://t.co/Kc3E22cbN7
Coding Plan: https://t.co/Nk8Y98HNhU
Chat: https://t.co/WCqWT0qCQb

400

Alejandro Companioni

@acompa_

2 days ago

@jasondeanlee Are you using some proxy or tunnel to https://t.co/bDcBg74M2b to use GPT-5.5 Pro via Codex? Using Codex, by default you're likely using GPT-5.5 non-pro, which (IME) really cannot prompt GPT-5.5 Pro effectively.

Alejandro Companioni

@acompa_

7 days ago

Also curious whether Anthropic ran the (public safety classifier + model) combo in internal benchmarks. It wouldn't be the first time they don't test on public builds. From the April postmortem on quality:
> We are going to do several things differently to avoid these issues: We’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features)...

Alejandro Companioni

@acompa_

7 days ago

We prioritize evaluations of new models as part of our routing efforts at Not Diamond. Claude Fable 5 is the first time a model has actively refused to run our benchmark tasks. E.g., 11 tasks in TBench 2.0 were refused by Fable 5 on the basis of bioweapon or cybersecurity risks.

Alejandro Companioni

@acompa_

7 days ago

Clearly, internal Claude benchmarks don't face the same safety guardrails, but it's interesting to consider whether these tasks should be omitted, scored as failures, etc.

acompa_ retweeted

Susan Zhang

@suchenzang

8 days ago

anthropic doesn't owe anyone "frontier capabilities". none of the labs do. they are all simply selling a product, or a story, that people pay for. that aside, the more telling bit is how far anthropic is willing to go to secure a narrative around "capability slowdown", post a massive raise, before an ipo, and with enterprise contracts rising for those rich enough to pay to similarly keep up the image of "powered-by/secured-by agentic AI". with the amount of capex spent so far, this was never meant to be some democratizing technology "for the people". this is all simply just business.

190

145K

acompa_ retweeted

Tomas Hernando Kofman

@tomas_hk

9 days ago

My thoughts on the future of model routing and AI:
- We have not even scratched the surface of runaway inference costs
- Solving this requires intelligent model routing, especially as the inference landscape continues fragmenting. This is a *hard* problem.
- Naive solutions (turn-based routing, session routing) fail; routing successfully involves managing multiple cost surface areas in concert.
- Getting routing right means a more diverse market of providers, more power for consumers, reduced ecological impact, and improved effectiveness.
More in the full essay:

acompa_ retweeted

Jen Zhu

@jenzhuscott

12 days ago

Massive output uptick due to agentic AI. Complete flat adoption.

470

980

Alejandro Companioni

@acompa_

11 days ago

@ShcChy @vikhyatk @GergelyOrosz Because it would be insanely unprofitable to ask market-salaried engineers to label data at the scale these labs need. So instead they use a tiered system: outsource the broader labeling and calibrate quality in-house. This is a core focus for entire eval teams at labs.

Alejandro Companioni

@acompa_

11 days ago

Those are not contradictory statements though. With analytics, half the battle is building out the fact and dimension layer in the warehouse, and those become the LLM’s “data classes” for analytics. If you pin those down then it is, in fact, possible to automate most queries away, because the queries atop that layer become rote.

Alejandro Companioni

@acompa_

15 days ago

Problem: the Marlins brand has been tarnished by abysmal management since the late 90s. As a Miami native I quit after the third (!!) selloff in 2005. I only visit the stadium for monster truck rallies w/ my 6-yo. Team would need a _very_ long commitment to win back fans here. We all remember.

acompa_ retweeted

Karri Saarinen

@karrisaarinen

17 days ago

The fallacy of this is that more creates more. More hours, more hiring, more something. And it is true in a sense. If you put in more work, more work will happen. But I think for most startups, the leverage is really in how differently you approach the problem, how well you cultivate your team, and the strategy.
Any large company can outspend you on hours. They have thousands or tens of thousands more people, spending more hours. If hours worked were the metric, every large company and government organization would always win and do the best work. More hours, better output.
This thinking is often representative of younger founders, where the startup becomes their identity and life. They have a hard time doing anything else, and cannot understand that your work is not the person that is you. But activities outside of work can grow you as a person too and make you do better work.
I’ve never worked this way. As a designer, I always saw the need to take a step back, to take a break. At times, I might work 12 hours or 16 hours, or whatever amount was needed, but it wasn’t the norm. You just can't grind design, you need inspiration. But taking that step away from the work would give me more perspective and inspiration, and I could approach the problem differently or I could just see the solution.
Grinding is never good for any creative problem, and startups or creating new products are often mostly about creative problem solving. Grinding works ok for email jobs, or where you're just executing on a very clear playbook.
With Linear, we’ve never worked this way. We work reasonable hours, 5 days a week. All of us founders have families. Many of our employees have families. I personally stop every evening, spend time with the family, cook dinner for the family, eat dinner together, and focus on things outside of work. Sometimes I work in the late evenings or weekends, but to me the pride is that I don’t need to. The company should be successful without it.
My goal is to build a company that is sustainable in the long term, and doesn’t require heroics or personal sacrifices every single day. There are times when our team is heroic. Launches, incidents, some other work that just needs to be done. They will work late into the night because they know it is the right thing. But we don’t require that every day or every week, and the more this happens, the more I think it is a failure of our company and leadership. The team and the leaders should always keep a reserve to use when something is needed.
Our thinking was also that quality, which we value, doesn’t emerge from working more or stressing people more. It emerges when you create the conditions for it to emerge. Often it is the appreciation, space, time, and how the person feels. A person who is rested will do better work.
I wouldn’t attribute much of our success to working a lot. The success came from having clear thinking, ideas, and focus to do the right things. I sometimes wish we could move the culture more toward a Zen master. Real mastery is not exerting the most effort. It is achieving the outcome with the least necessary effort.

157

444

acompa_ retweeted

Simon Last

@simonlast

26 days ago

1/ Some things I've learned recently running coding agents on large-scale projects. Most of this contradicts advice from 6 months ago!

209

571K

Alejandro Companioni

@acompa_

27 days ago

@_welf Looks amazing. DC-1, yeah? And I'm sure Supermirror?

acompa_ retweeted

Tomas Hernando Kofman

@tomas_hk

about 1 month ago

On the @theallinpod, @Benioff describes why routing is the next layer of enterprise AI infra—and how it will save billions of dollars. We've been building exactly this at @notdiamond for two years. Largest vendor of intelligent routing in the world. @Benioff, we should chat!

258

acompa_ retweeted

Sean Cai

@SeanZCai

about 1 month ago

I strongly feel that the frontier lab QC bar for RL data has to become more load-bearing during procurement. The contract non-renewals I've been hearing about across labs (for poor quality) often come down to the fact that most vendors run zero categories of active testing, ship without verifier FP/FN audits, can't produce pass@k distributions across three models, have no contamination story, etc. My sense is that as RL/SFT data markets become more formalized into 2027, a lot of these contracts, along with banal synth data (garbage in, QA aggressively, garbage out), will be cut. The small set of vendors who have built the QC infra internally (mostly research-dense teams) are pricing 2-3x what their commodity peers can charge for nominally similar tasks, at least for RL data. Wrote a bit about this on my blog.

228

195

17K

acompa_ retweeted

Tomas Hernando Kofman

@tomas_hk

about 1 month ago

Our model router now supports @_inception_ai's Mercury 2, the fastest code gen model in existence. Use it with Not Diamond or @OpenRouter's /auto mode. For max speeds, use the latency tradeoff in nd or the plugins param in OpenRouter to route bw Mercury and a stronger model.

acompa_ retweeted

Catherine Yeo

@catherinehyeo

about 1 month ago

Love seeing Naomi Osaka honor the CLRS Algorithms textbook at this year's Met Gala

109

18K

990

615K

acompa_ retweeted

Nicolas Bustamante

@nicbstme

about 1 month ago

Excellent article by @cursor_ai that explains why it's so hard to change a model mid conversation. For instance, OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. It's all about customizing the harness for different models.

400

321

43K
