There is a lot of news about compute being the bottleneck for AI. There is less visibility into the engineering it takes to make large-scale compute actually work reliably.
In my view, this is one of the most interesting computer science problems in the industry right now. It is not just about getting more GPUs. It is about making every layer of the system work: networking, scheduling, hardware health, storage, orchestration, reliability, observability, security, and the developer experience for researchers.
This blog gives a rare preview into the depth of engineering happening across the stack at OpenAI, starting with MRC and supercomputer networking. We're excited to start sharing more about designing, building, and operating compute at planet scale.
https://t.co/eQVylBAGa3
Join us: https://t.co/BwHYeXnfMo
@PaulWil37983175@OpenAI@Oracle@Microsoft Reducing tail latency was a key MRC design goal. Source-routed packet spraying essentially eliminates congestion delays caused by imperfect ECMP load balancing, so the whole network behaves like a single pool of bandwidth.
AI supercomputers need a new kind of network to stay in sync at massive scale.
OpenAI’s @markjhandley and @poyntingatgreg join @AndrewMayne to discuss what it takes to move data across record numbers of chips reliably and efficiently, the new Multipath Reliable Connection (MRC) networking protocol, and why it's available for the whole industry to use.
We’ve partnered with @AMD, @Broadcom, @Intel, @Microsoft, and @NVIDIA, to release Multipath Reliable Connection (MRC), a new open networking protocol that helps large AI training clusters run faster and more reliably, with less wasted GPU time.
https://t.co/AiV952AJXs
Today we shared MRC (https://t.co/EgkRmNbNU0), a networking protocol developed with @Microsoft, @nvidia, @AMD, @Broadcom, and @intel to improve how large AI training systems move data and recover from failures. This innovation has come full circle for me personally, it was initiated by @OpenAI with my team at @intel then when I was leading the networking business there and it's great to see it come to life at scale!
As training clusters scale, networking becomes a critical part of overall compute efficiency. It is not enough to add more capacity. You also need systems that keep jobs running reliably, use bandwidth well, and reduce wasted GPU time.
MRC is one example of the kind of infrastructure work required to make frontier model training more efficient and more resilient. It reflects a broader view we have at OpenAI: progress in AI depends not just on better models, but on better compute systems across the stack.
@paulg You do need to review the PRs and iterate on them when you don’t like its approach. Codex wrote almost all the actual code, so obviously did a lot of the thinking, but the part that mattered most, I did that. Is there a similar principle for writing English?
@paulg Coding is thinking too. I’ve just used Codex to write a non-trivial distributed system. It’s very good - does the mechanical coding that I always found therapeutic very quickly and accurately, leaving me with all the big design decisions back to back. Which is pretty hard work.
@paulg@CcibChris Always loved the Victor (though as far as I know I'm not related to the company's founder). It's what you'd get if you asked an aircraft designer who grew up on 1930s Flash Gordon rockets to design a jet bomber.
@bigbadrabby@PeterGlas6@TrentTelenko On the other hand, 15% of the F35 is made in the UK. If the US disrupted support for UK F35s, they'd have problems with support for their own too.
@paulg We've accumulated about 3000 books (the number stabilized about 10 years ago, when we started to donate books to keep things from getting out of control). It took me a long time to get over the feeling of guilt of buying books and not reading all of them.
@americascup@ineosbritannia@LouisVuitton Great race! All came down to a few knots speed difference crossing the start line. I'm an @ineosbritannia fan, but I really feel for Luna Rossa. They sailed an excellent race but Ineos had the tactical advantage off the start, and didn't put a foot wrong.
When I became a wheelchair user, I thought coastline trails had become inaccessible to me. Fortunately, I was wrong!
On a three day adventure in North Wales, I explored some of the accessible sections of @WalesCoastPath for @countrylivinguk:
https://t.co/ASIRCfEeKI
@garrett_wollman Generally, north america gets auroras further south than Europe because the geomagnetic pole is on your side of the actual north pole. I certainly never expected to see an aurora in London with the light polution here, but there you go!