I strongly believe there are entire companies right now under heavy AI psychosis and it's impossible to have rational conversations about it with them. I can't name any specific people because they include personal friends I deeply respect, but I worry about how this plays out.
I lived through the great MTBF vs MTTR (mean-time-between-failure vs. mean-time-to-recovery) reckoning of infrastructure during the transition to cloud and cloud automation. All those arguments are rearing their ugly heads again but now it's... the whole software development industry (maybe the whole world, really).
It's frightening, because the psychosis folks operate under an almost absolute "MTTR is all you need" mentality: "it's fine to ship bugs because the agents will fix them so quickly and at a scale humans can't do!" We learned in infrastructure that MTTR is great but you can't yeet resilient systems entirely.
The main issue is I don't even know how to bring this up to people I know personally, because bringing this topic up leads to immediate dismissals like "no no, it has full test coverage" or "bug reports are going down" or something, which just don't paint the whole picture.
We already learned this lesson once in infrastructure: you can automate yourself into a very resilient catastrophe machine. Systems can appear healthy by local metrics while globally becoming incomprehensible. Bug reports can go down while latent risk explodes. Test coverage can rise while semantic understanding falls. Changes happen so fast that nobody notices the underlying architecture decaying.
I worry.
Is there a law for "the bigger a system is, the more likely any one part of it will be missed, ignored, not used, or underutilized"? If not, let's declare one and name it.
When I started working in Python, I got lazy with “single assignment”, and I need to nudge myself about it.
You should strive to never reassign or update a variable outside of true iterative calculations in loops. Having all the intermediate calculations still available is helpful in the debugger, and it avoids problems where you move a block of code and it silently uses a version of the variable that wasn’t what it originally had.
In C/C++, making almost every variable const at initialization is good practice. I wish it was the default, and mutable was a keyword.
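A minimal Python sketch of the single-assignment habit described above. The function names and the order-total calculation are made up for illustration; the point is only the contrast between reusing one name and giving each intermediate value its own name. Note that `typing.Final` is only checked by static type checkers such as mypy, not enforced at runtime, so it is a weaker tool than `const` in C/C++.

```python
from typing import Final

# Hypothetical constant for illustration; Final is a hint for type checkers only.
TAX_RATE: Final[float] = 0.1


def order_total_reassigned(prices, discount):
    # Reassignment style: "total" means three different things over its lifetime.
    # Moving any of these lines silently changes which value later code sees.
    total = sum(prices)
    total = total - discount
    total = total * (1 + TAX_RATE)
    return round(total, 2)


def order_total_single_assignment(prices, discount):
    # Single-assignment style: each name is bound exactly once, so every
    # intermediate value is still inspectable in the debugger at the return.
    subtotal = sum(prices)
    discounted = subtotal - discount
    with_tax = discounted * (1 + TAX_RATE)
    return round(with_tax, 2)
```

Both functions compute the same result; the second just keeps `subtotal`, `discounted`, and `with_tax` around under stable names, so a block of code moved elsewhere cannot accidentally pick up a different stage of the calculation.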
This is correct. We had this same reaction in the 1980s & 1990s when compilers generated assembly for us. We hated it. Looking at the generated code made us puke.
It got better.
GPT 4.5 + interactive comparison :)
Today marks the release of GPT4.5 by OpenAI. I've been looking forward to this for ~2 years, ever since GPT4 was released, because this release offers a qualitative measurement of the slope of improvement you get out of scaling pretraining compute (i.e. simply training a bigger model). Each 0.5 in the version is roughly 10X pretraining compute. Now, recall that GPT1 barely generates coherent text. GPT2 was a confused toy. GPT2.5 was "skipped" straight into GPT3, which was even more interesting. GPT3.5 crossed the threshold where it was enough to actually ship as a product and sparked OpenAI's "ChatGPT moment". And GPT4 in turn also felt better, but I'll say that it definitely felt subtle. I remember being a part of a hackathon trying to find concrete prompts where GPT4 outperformed 3.5. They definitely existed, but clear and concrete "slam dunk" examples were difficult to find. It's that ... everything was just a little bit better but in a diffuse way. The word choice was a bit more creative. Understanding of nuance in the prompt was improved. Analogies made a bit more sense. The model was a little bit funnier. World knowledge and understanding was improved at the edges of rare domains. Hallucinations were a bit less frequent. The vibes were just a bit better. It felt like the water that rises all boats, where everything gets slightly improved by 20%. So it is with that expectation that I went into testing GPT4.5, which I had access to for a few days, and which saw 10X more pretraining compute than GPT4. And I feel like, once again, I'm in the same hackathon 2 years ago. Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to. Still, it is incredibly interesting and exciting as another qualitative measurement of a certain slope of capability that comes "for free" from just pretraining a bigger model.
Keep in mind that GPT4.5 was only trained with pretraining, supervised finetuning, and RLHF, so this is not yet a reasoning model. Therefore, this model release does not push forward model capability in cases where reasoning is critical (math, code, etc.). In these cases, training with RL and gaining thinking is incredibly important and works better, even if it is on top of an older base model (e.g. GPT4ish capability or so). The state of the art here remains the full o1. Presumably, OpenAI will now be looking to further train with Reinforcement Learning on top of the GPT4.5 model to allow it to think, and push model capability in these domains.
HOWEVER. We do actually expect to see an improvement in tasks that are not reasoning heavy, and I would say those are tasks that are more EQ (as opposed to IQ) related and bottlenecked by e.g. world knowledge, creativity, analogy making, general understanding, humor, etc. So these are the tasks that I was most interested in during my vibe checks.
So below, I thought it would be fun to highlight 5 funny/amusing prompts that test these capabilities, and to organize them into an interactive "LM Arena Lite" right here on X, using a combination of images and polls in a thread. Sadly X does not allow you to include both an image and a poll in a single post, so I have to alternate posts that give the image (showing the prompt and two responses, one from 4 and one from 4.5) and the poll, where people can vote on which one is better. After 8 hours, I'll reveal the identities of which model is which. Let's see what happens :)
So, our GitHub Actions fetch dependencies (like nix-installer!) at run time. We put a lot of work into not failing because we messed it up. One of those things is fallback infra for fetching.
@altryne We tried it, and it hurt more than it helped for most use cases. Pixels is the most flexible interface. Using the DOM can be helpful, but it’s more of an optimization.
Paraphrasing the best advice @paulg gave me:
"Ask yourself if this startup is your life's work. Knowing you're in it for the long haul lets you settle into a calmer, more focused rhythm despite the daily ups and downs, as you trust you'll show up and make it succeed over time."
Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
It scores 75.7% on the semi-private eval in low-compute mode (for $20 per task in compute) and 87.5% in high-compute mode (thousands of $ per task). It's very expensive, but it's not just brute -- these capabilities are new territory and they demand serious scientific attention.
New paper! Byte-Level models are finally competitive with tokenizer-based models with better inference efficiency and robustness! Dynamic patching is the answer! Read all about it here:
https://t.co/GJSiFtugju (1/n)
I guess “Designed by Claude” is just not going to be that good until codegen datasets also get trained on aesthetics
The rocks need to be able to draw, not just read and write
@bnj is working on this
The best part of making a product generally accessible (no waitlist!) is seeing other people's reactions. See how folks have interacted with the Windsurf Editor within the first 24 hours of launch 🧵
Issue #305 of Off-by-none is out! This week, CloudFormation deployments get an x-ray timeline view, users report Bedrock might be on shaky ground, and we celebrate the real heroes of @awscloud! #offbynone https://t.co/coQX1atr8e
We're getting real with the realization that system tests have failed to be worth their weight. We're killing ALL the system tests (359 cases) in HEY, replacing them with a minimal set of smoke tests. Then leaning on controller integration tests instead. https://t.co/Dj7BFI62Os
there are now 4 interesting things in web development:
1. @ElectricClojure
2. Rama by @nathanmarz
3. Datomic by @cognitect
4. @cursor_ai
the models still hallucinate a lot, though, but it's the worst it'll ever be 🤷‍♂️