@MatjazLeonardis @PJOPJOPJOPJO @eriskiiii Literally this is the reason Anthropic is a PBC, so we don’t have this type of bs when building AGI. The public is Anthropic’s stakeholder; it’s in their corporate structure
@ClaudeDevs Evals are the only real defense for stuff like this. Every AI project I've worked on suffers from differing experiences between the devs building it and real users. The only thing that bridges the gap is high quality evals & observability. @braintrust is the answer
Sandboxing evals is an incredible way to (a) get more reproducibility and (b) test a lot more ideas at scale.
This is now natively supported in Braintrust with support for AWS Lambda, @modal, and more options soon.
@lennysan People who think we don’t need jr devs need to understand that smart jrs are going to be a lot better at this than experienced devs. They don’t have the same baggage we do
@cloneofsimo @idarbek The stated purpose of ARC-AGI is to probe the boundaries of artificial intelligence. If a human can complete a task and an LLM can’t, it’s useful information. The most useful benchmarks are those that are hard for AI today but can be beaten by better models.
@FakePsyho I think if LLMs were just spending 10 extra moves to complete, this would be valid, but they can’t learn from that mistake like a human can, and that’s the point. It isn’t really about being fair; it’s about probing failure points so they can be fixed
ARC-AGI-3 is brilliant. It's almost artistic how well they crafted these games to show how current AI systems fail. Pretty clear articulation that LLMs fail at in-context learning and memory, at least at the level humans are capable of.
Something weird is going on with agents being better at using CLIs than MCP. CLIs have always been a great UX, but it's so hard to remember and type out exactly what you want. With agents, something special happens: the interface works for humans without memorizing it
Braintrust has raised an $80M Series B.
We're building the infrastructure that helps teams measure, evaluate, and improve their AI products.
Don't take our word for it. Hear how @NotionHQ, @Vercel, @Navan, and @billcom use Braintrust to ship quality AI.
@emollick As a dev onboarding to a new code base this resonates. Being the “human verifier” when an LLM is writing a majority of the code requires a lot of discipline to actually dig deep enough into the changes to understand what the code is doing. Seems like a flawed collaboration model
Pretty mind blowing 10 months with @trello as an awesome part of the @Atlassian family. Do something big in 2018 is right. Let’s start with these ads in Times Square! 👊🏻🗽