Nate @nselvidge - Twitter Profile

18 days ago

@MatjazLeonardis @PJOPJOPJOPJO @eriskiiii Literally this is the reason Anthropic is a PBC, so we don’t have this type of bs when building AGI. The public is Anthropic’s stakeholder; it’s in their corporate structure

0

18

Nate @nselvidge

2 months ago

@ClaudeDevs Evals are the only real defense for stuff like this. Every AI project I've worked on suffers from differing experiences between the devs building it and real users. The only thing that bridges the gap are high quality evals & observability. @braintrust is the answer

0

218

nselvidge retweeted

Ankur Goyal

@ankrgyl

3 months ago

Sandboxing evals is an incredible way to (a) get more reproducibility and (b) test a lot more ideas at scale. This is now natively supported in Braintrust with support for AWS Lambda, @modal, and more options soon.

5

74

5

35

7K

Nate @nselvidge

3 months ago

@lennysan People who think we don’t need jr devs need to understand that smart jrs are going to be a lot better at this than experienced devs. They don’t have the same baggage we do

0

1

0

780

Nate @nselvidge

3 months ago

@kalomaze I mean playing Pokémon has been a way we evaluate agents for a minute so not unreasonable

0

5

Nate @nselvidge

3 months ago

@fchollet I think there are a lot of fans of the new benchmark. Twitter trolls are just the loudest.

0

12

Nate @nselvidge

3 months ago

@cloneofsimo @idarbek The stated purpose of ARC-AGI is to probe the boundaries of artificial intelligence. If a human can complete a task and an LLM can’t, it’s useful information. The most useful benchmarks are those that are hard for AI today but can be beaten by better models.

0

1

0

72

Nate @nselvidge

3 months ago

@FakePsyho I think if LLMs were just spending 10 extra moves to complete, this would be valid, but they can’t learn from that mistake like a human can, and that’s the point. It isn’t really about being fair; it’s about probing failure points so they can be fixed

0

51

Nate @nselvidge

3 months ago

ARC-AGI-3 is brilliant. Almost artistic how well they were able to craft these games that so well show how current AI systems fail. Pretty clear articulation that LLMs fail at in-context learning and memory, at least at the level humans are capable of.

0

23

Nate @nselvidge

3 months ago

@Arrrrash But how will I loot a gold shield with my free kit?

1

0

173

Nate @nselvidge

4 months ago

@trq212 Thank you!

0

1

0

262

Nate @nselvidge

4 months ago

@z0oks The benefit is that it’s fuckin sweet

1

0

113

Nate @nselvidge

4 months ago

LLMs are smart enough to increase dev productivity today but we’re bottlenecked on dev cpu. It’s just too slow to share my MacBook Pro with 5 agents.

0

15

Nate @nselvidge

4 months ago

Something weird is going on with agents being better at using CLIs than MCP. CLIs have always been a great ux but it's so hard to remember and type out exactly what you want, but with agents something special happens where the interface works for humans without memorizing it

0

15

nselvidge retweeted

Braintrust

@braintrust

5 months ago

Braintrust has raised an $80M Series B. We're building the infrastructure that helps teams measure, evaluate, and improve their AI products. Don't take our word for it. Hear how @NotionHQ, @Vercel, @Navan, and @billcom use Braintrust to ship quality AI.

27

291

21

127

130K

Nate @nselvidge

5 months ago

@emollick As a dev onboarding to a new codebase, this resonates. Being the “human verifier” when an LLM is writing a majority of the code requires a lot of discipline to actually dig deep enough into the changes to understand what the code is doing. Seems like a flawed collaboration model

0

20

nselvidge retweeted

Mike Cannon-Brookes 👨🏼‍💻🧢🇦🇺

@mcannonbrookes

over 8 years ago

Pretty mind blowing 10 months with @trello as an awesome part of the @Atlassian family. Do something big in 2018 is right. Let’s start with these ads in Times Square! 👊🏻🗽