Introducing a new side project called Model Regression. It tests Claude, GPT, and Grok daily against various benchmarks to determine how well each is performing and to identify model degradation over time.
@edskoudis had the idea of testing a model before conducting offensive testing, to ensure the model was performing as expected, and @BlasikRandy pushed me down the road of actually going and doing it.
The main intent here is that frontier models will experience outages, issues, bugs, and intentional or unintentional nerfing without notice. You typically can't trust these models to be stable from day to day, so I'll be using this every day as part of my routine to see how well a model is performing that day.
Runs every morning in my DGX Spark environment and automatically updates with how well each model is performing.
Enjoy!
https://t.co/1Pep6NyGoh
Also open-sourced the project. You can run it on your own server and look at the benchmarks and how they are calculated:
https://t.co/GFPigpRtUF
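For anyone wondering what "identify model degradation over time" could look like in practice, here is a minimal sketch. It is not taken from the Model Regression project. The function name `is_regression`, the z-score approach, and the `min_days` and `z_threshold` defaults are all assumptions made for illustration. It flags a day whose benchmark score falls well below the recent average.

```python
from statistics import mean, stdev


def is_regression(history, today, min_days=5, z_threshold=2.0):
    """Return True if today's score is unusually low versus recent history.

    history: past daily benchmark scores for one model (hypothetical data).
    today: today's score on the same benchmark.
    min_days and z_threshold are illustrative defaults, not project values.
    """
    if len(history) < min_days:
        return False  # not enough data to judge a drop
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any drop at all counts as a regression.
        return today < baseline
    # How many standard deviations today sits below the baseline.
    return (baseline - today) / spread > z_threshold


if __name__ == "__main__":
    past_scores = [0.80, 0.82, 0.81, 0.79, 0.80, 0.81]
    print(is_regression(past_scores, 0.80))  # an ordinary day
    print(is_regression(past_scores, 0.60))  # a sharp drop
```

A simple threshold like this will be noisy on benchmarks with high run-to-run variance, so a real setup would likely want repeated runs per day before calling something a nerf.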
@bankertobuilder I'd like to buy one just so I can be on the HOA. I'd love nothing more than walking around in my whitest of white New Balances, citing anyone with a door color even the slightest hue off of the approved color list. 🤤
@H3KTlC Have the bike - just got a different treadmill because I didn't think the Peloton one was worth the price. We just use an iPad w/ the Peloton app, and it still syncs to the treadmill to maintain stats
It's time for Microsoft to have another XPSP2 moment. No more AI, no more features. Just fixes.
When I was working on Windows XP, Blaster hit. It was a big enough deal that we set aside all feature work.
For the next several months, all we did was improve security. We didn't add "security features"; we fixed bugs. Lots of bugs. Until there weren't security bugs to fix anymore.
Then we fixed the ones we didn't know about yet.
Put more simply, we stopped trying to "add value" to the product through features that PMs thought users would like, and instead we focused on the things that had been important for a long time, but overlooked.
Like performance and configurability today.
Rather than trying to improve and add value to the system through new AI features -now-, I argue it's time for Microsoft to stabilize, improve, and make the system more performant. And more usable for power users.
Just for one release. Just till it doesn't suck.
Just spoke with a customer service rep at @ArcBestCorp to get some tracking details, and holy cow, that had to be the friendliest, quickest, best experience I've ever had with a shipper.
@HackingDave Not sure how I feel about this being called “vintage” and “throwback” lol. Pretty sure I still have a jersey with that design 😭