Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks.
On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
@theo I love an agent that feels like my assistant controlling one or many agents: I talk to it and it controls them.
and I should be able to intervene as soon as I'm on the keyboard.
I'll talk in my native language or English.
@euacchq @diegopia ...
many other times it's a process issue: things get stuck between silos that communicate without a standard/digital procedure. PAs should be able to pull information on your behalf in near-real-time.
if streamlined, the deadline simply becomes a byproduct of process
@euacchq @diegopia admin silence risks setting a dangerous default of approving by just letting time pass.
economic consequences are obv a very big motivator. even personal ones (e.g. you're not getting your bonus if you're a manager and your dept misses deadlines > 10% of the time)
...
@euacchq Unfortunately, a deadline with no consequences is not a deadline. You’d be surprised at the number of “legally binding” deadlines/procedures that are just ignored. That’s one of the key obstacles we faced when we, led by @diegopia, tried to digitalize Italy’s govt
@fortelabs I’m not sure AI is the real bottleneck solver here, but it’s a great transition indeed :)
What do you use as AI poly coach? All my experiments so far have fallen rather flat