Trainy

@TrainyAI

Building open source tools for distributed training.

Joined June 2023

30 Following

70 Followers

61 Posts

TrainyAI retweeted

roanak @roanakb

4 months ago

NeptuneAI shuts down March 5th. @TrainyAI just launched Pluto on @ycombinator, a drop-in replacement so you don't lose years of experiment data. Swap one import. Dual-log to validate. Export your history. Open source. On Neptune's official transition hub. https://t.co/m77bDjdmrx

TrainyAI retweeted

roanak @roanakb

over 1 year ago

@TrainyAI's Konduktor platform helps bring the benefits of a leading research team to your GPU cluster. We provide a fault-tolerant scheduler, integrated observability, and more. Check out our docs: https://t.co/tO0pbT7eu1

TrainyAI retweeted

roanak @roanakb

over 1 year ago

This leads to significantly higher (>80%) GPU usage. Add in some fault tolerance to the infrastructure, and we see: - No more manual restarts at 2am. - ML engineers get to focus on their jobs rather than becoming DevOps experts.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

Top-tier AI research teams (Meta, OpenAI, etc.) have figured out the most efficient way to work with a cluster of GPUs. Instead of managing each GPU separately, they create pools of GPU nodes and let sophisticated schedulers manage GPU availability efficiently.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

Is your team struggling with GPU failures? Let’s talk! Docs: https://t.co/2UOQDWgz9f

TrainyAI retweeted

roanak @roanakb

over 1 year ago

At @TrainyAI, we built a controller within Konduktor to monitor GPU node health and isolate unhealthy nodes. This way, if a job fails, no manual intervention is required. K8s does its magic of placing work only on healthy nodes, and we forward relevant GPU/NCCL logs to your CSP. 🚀

TrainyAI retweeted

roanak @roanakb

over 1 year ago

ML engineers shouldn’t be wasting time debugging infrastructure — especially when H100s have a 25-30% fault rate. 🛠️ ML infrastructure should be able to handle bumps and bruises to the underlying hardware.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

4/ Struggling with multinode setups on your cloud provider? We'll cut your setup time from weeks to minutes. Docs: https://t.co/DBaHXQgAvc

TrainyAI retweeted

roanak @roanakb

over 1 year ago

3/ One of the biggest value-adds of @TrainyAI's Konduktor platform is that we simplify this complexity. We abstract away network configurations, so you can launch multinode training with high-bandwidth networking across different clouds in the same way.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

2/ At @TrainyAI, we've seen AI research teams lose over $10,000 trying to scale out due to misconfigured GPU fabrics. That's a costly mistake that can be avoided.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

Setting up and validating GPU networking is a lot less trivial than you'd think. Here's why: 1/ GPU fabric technology varies a lot across cloud providers for the H100. For example, Google Cloud has TCP-X, while AWS uses EFA. Once you commit to one setup, it often locks you in.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

He lays out the ARC-AGI benchmark, how it tests generalization abilities rather than memorization, and his thoughts on what kind of AI system will be necessary to improve on the SoTA. Watch here: https://t.co/vlB7GqAl91

TrainyAI retweeted

roanak @roanakb

over 1 year ago

3. Skill does not show intelligence. And displaying skill at any number of tasks does not show intelligence. - This misguided view of intelligence is what causes our current form of benchmarking to be inadequate.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

2. For any LLM, for any query that seems to work, there exists an equivalent rephrasing of the query that will break it. - This ties into LLMs' inability to handle deviations from a pattern - Highlights the modern LLM's lack of robustness

TrainyAI retweeted

roanak @roanakb

over 1 year ago

1. The core limitations of Transformer-based architectures have not changed in over 5 years. - Inability to adapt to small deviations from memorized patterns - Weak, patchy generalization

TrainyAI retweeted

roanak @roanakb

over 1 year ago

The latest Machine Learning Street Talk (MLST) episode, with François Chollet discussing inherent limitations of LLMs, was amazing. It was a breath of fresh air to hear some sound reasoning after all the usual Doomer/Acceleration talk on AGI. He makes some great points:

TrainyAI retweeted

roanak @roanakb

over 1 year ago

With the features above and more, AI teams using @TrainyAI's Konduktor platform see at least 2x the utilization out of their GPU cluster. Curious? Drop me a message or click here to check out our docs: https://t.co/DBaHXQgAvc.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

3. Enhanced Observability: Our platform offers comprehensive dashboards that provide a clear view of cluster usage and performance. Metrics like SM Efficiency help you understand how effectively your GPUs are being used, across different jobs and teams.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

2. Minimize Downtime Disruptions: Traditional setups require manual intervention if a job fails. With H100 GPUs, hardware faults are quite frequent (~30%). Konduktor detects hardware issues on failure, resumes jobs on healthy GPUs, and alerts your provider with logs.

TrainyAI retweeted

roanak @roanakb

over 1 year ago

@TrainyAI's Konduktor platform is here to change that. 1. Maximize GPU Utilization: With Konduktor, engineers can queue up a large number of jobs of varying priorities on their GPU cluster. This means the P0 workloads get run first, and your GPUs keep crunching numbers 24/7.
