@JordanNanos 2/2
- ~100 perf metrics / rank / step to detect problems as early as possible on the application level, including NCCL ops bandwidth control.
- Distributed fault sim with ~30 different problems to test recovering from.
@JordanNanos Hey Jordan, we intentionally skipped many technical details from the report. Regarding fault tolerance:
1/2
- We do in-pod recovery. There is no in-process recovery yet, so we lose a few min to reinit.
- Checkpoint locally when we break (you call it just-in-time).
@BenTheEgg thank you from @poolsideai for your reversed mode code from Reformer! @vadimlearning worked on it and it led to 25% memory reduction with only a bit more data needed to converge. Next time you're in Paris, dinner is on us!
I am pleased to announce that I will speak in Python devroom on @fosdem about our experience at @athenian doing low- and high-level Python performance optimizations. Will cover some CPython internals, Cython with C++20, arena allocations with mimalloc.
congratulations to this google docs PM who is singlehandedly responsible for engendering more developer goodwill than any other individual at Alphabet in the past five years
It’s been a while since my last blog post, in the meantime I changed both UI and domain, and added support for comments! And this is my first post in the new blog! I hope you’ll like it!
https://t.co/PzuFw4KdsA
#devops#monitoring#kubernetes#prometheus
"My Continuous Integration Takes Too Much Time. How Do I Fix It?" - @vadimlearning explains how to solve this very common issue... so you don't have to twiddle your thumbs (or read r/programming) while you wait for CI checks to finish! 👇
https://t.co/9ZsCZmVCji