We recently had our highest-ROI cold call of all time… but not in the way you would expect.
A few months ago, we got a cold call from Grace Decker / @graceeedeckerrr, an SDR at @brexHQ.
Nothing unusual until she asked at the end: “By the way, are you hiring engineers? My brother’s looking for a role in NYC.”
Turns out, her brother @Jackowfish was a founding engineer at another YC infra company in the midst of relocating. Fast-forward, he just finished his work trial with us and will be joining Porter in the next few weeks.
To me, this experience highlights two things:
1. The power of making the ask (you truly never know what might come from it).
2. How great talent can come from anywhere if you remain open-minded as a startup.
It goes both ways - Jack later told us that he (understandably) took that first call with zero expectations.
All parties (candidates and companies alike) want to find the hidden gems, but the world is efficient. Unexpected opportunities generally require unexpected openness, and saying yes to the chances larger companies would dismiss is precisely the edge you have when hiring as a startup.
Finally, shout out to Brex for going the extra mile for their customers.
Apple intelligence sucks. Can't even run nvidia-smi.
Apple uses weak, small LLMs.
Big LLMs are better.
Thankfully, I got 80GB of VRAM on my iPhone with @ThunderCompute
nvidia-smi ✅
We just got our VS Code extension published!
Our team has been using this nonstop for internal development. It truly feels like your computer has GPUs attached.
If you use @code or @Cursor this is the simplest way to use cloud GPUs. And it was already the cheapest.
Spent the last couple of days putting together a VS Code extension for @ThunderCompute - actually the fastest way to get a GPU cloud instance to build with LLMs in Cursor. 5 clicks and a 30-second wait and you're building with server-grade GPUs right in your editor.
We’re in 1956 bubble sort land right now. The wall today may be compute, but I have a feeling we’re going to hit the ‘59 quicksort moment and realize most of these lookups are incredibly suboptimal.
My take from GPT-4.5 is that humanity has designed an AGI architecture - it is just prohibitively expensive. This model is not great, because training a $1 billion transformer only gives us a 12.5% improvement over a $100 million one, in a paradigm where, apparently, utility scales logarithmically with training cost...
That also means that a dense GPT-5 would be only ~11% better than GPT-4.5, for the cost of $10 billion. Similarly, to get a jump as big as the one we've seen from GPT-2 to GPT-4, we'd need to train a GPT-7 (*not* a GPT-6), and that would cost about $100 trillion, i.e., the world's entire GDP. So, that's the wall: we saturated humanity's capacity to scale. Or, to be more specific, we'd need 1,000,000x more compute than GPT-4, to see that sort of jump again.
Some argue that reasoning breaks this wall, but I feel like it only weakens it. If test-time compute laws hold, then, we'd need a GPT-4 scale model to "think for 100 million tokens per output token" to emulate a GPT-7. Except it would take days to produce each token. That's not viable. So, unless we make 1,000,000 clones of planet Earth, we could be stuck at roughly this capacity for several decades, and never see a jump as big as the one from GPT-2 to GPT-4 again.
Unless, of course, new ways to improve the efficiency of these systems are discovered. AGI has become an optimization problem. I, for one, suspect that GPTs are embarrassingly sub-optimal, and that these big matrix multiplications are merely emulating an underlying "learning algorithm" with a massive overhead.
Now, it isn't hard to see that, at this scale, a single attention (i.e., "neural dict") pass easily takes more than 1,000,000x the compute of a dict lookup. If that is true, it wouldn't be surprising if the first team to break the "matmul wall" were able to train a model equivalent to GPT-4 for as little as $100. Of course, attention is doing much more than a dict lookup; but we don't know what it is doing that leads to reasoning capacities. And, once we figure that out, we may be able to have GPT-7 for the cost of GPT-4, and not for the world's entire GDP.
That said, this would require a complete redesign. Gradient descent and matmuls have to be replaced by something entirely different - and nobody knows what that would be. It took us decades to go from neural nets to transformers, so it could take us a decade to figure this out. Or someone could be struck by a rush of inspiration and it would happen overnight...
Anyway sorry if I got some napkin math wrong, and all the respect for OpenAI for this release. Publishing a result that isn't a complete success is great science. Now I just want to understand what transformers are emulating, and how we can do the same, for less. I have many ideas, and I have many experiments to run... I'll try not to disappear completely but excuse me if I do
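For anyone who wants to check the napkin math in this thread: the 12.5% and ~11% figures fall out if you assume utility is simply log10 of training cost in dollars. That formula is my reading of the numbers, not something the thread states or an established scaling law, and the per-model costs are the thread's rough figures (GPT-5 and GPT-7 are hypothetical models). A minimal sketch under those assumptions:

```python
import math

# Rough training costs in USD, taken from the thread's own figures.
# GPT-5 and GPT-7 are hypothetical models; none of these are official numbers.
COSTS_USD = {
    "GPT-4": 1e8,     # ~$100 million
    "GPT-4.5": 1e9,   # ~$1 billion
    "GPT-5": 1e10,    # ~$10 billion
    "GPT-7": 1e14,    # ~$100 trillion
}


def utility(cost_usd: float) -> float:
    """Assumed utility model: log10 of training cost in dollars."""
    return math.log10(cost_usd)


def relative_gain(old: str, new: str) -> float:
    """Fractional utility improvement when moving from model `old` to `new`."""
    return utility(COSTS_USD[new]) / utility(COSTS_USD[old]) - 1.0


def cost_ratio(old: str, new: str) -> float:
    """How many times more expensive `new` is than `old`."""
    return COSTS_USD[new] / COSTS_USD[old]


if __name__ == "__main__":
    print(f"GPT-4 -> GPT-4.5 gain: {relative_gain('GPT-4', 'GPT-4.5'):.1%}")
    print(f"GPT-4.5 -> GPT-5 gain: {relative_gain('GPT-4.5', 'GPT-5'):.1%}")
    print(f"GPT-7 / GPT-4 cost:    {cost_ratio('GPT-4', 'GPT-7'):,.0f}x")
```

Under this model each 10x in spend adds one unit of utility, so the relative gain shrinks with every step (9/8, then 10/9), which is the "wall" the thread describes.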
me: hey do this thingie
3.5: no prob sir, done
3.7: i did the thingie. let me also do another thingie. i'm gonna finish all the thingies. omg there are so many thingies to be done in this project. i'm gonna start doing extra thingies. would you also maybe like a drink? let's run npm install drink. fk it let's get crazy up in this b
Thank you Pace University Data Science for inviting @ThunderCompute to lead a workshop about using cloud GPUs for deep learning.
We love to hear how students are using our GPUs for their projects.
🚢 Just shipped an update to our CLI tool that allows you to change your @ThunderCompute instance's properties while it's stopped. No need to re-create an instance you've spent months working on because you need more vCPUs!
Arch / Linux distros are great (although the UIs are trash) on a devbox you can wipe whenever you want. Not good on the thing you need to connect to WiFi in a coffee shop to ssh into said devbox. Or go on Zoom. Or look at your Google Calendar.