DeepSpeed (Japanese account)

about 1 month ago

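The official PyTorch blog post on AutoSP, a new DeepSpeed feature, has been published!
- Compiler-level optimization applies Sequence Parallel to existing models with only a configuration change
- Sequence-aware AC (activation checkpointing) optimized for long-sequence training

This makes long-sequence training easy to achieve with higher GPU efficiency. https://t.co/dMWWPm9JPm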

about 1 month ago

Great News! Thanks to DeepSpeed AutoSP, efficient long context LLM training is now easily accessible.

DeepSpeedAI_JP retweeted

Stas Bekman

@StasBekman

3 months ago

Good news! Ulysses Sequence Parallelism from the Snowflake AI Research and the DeepSpeed teams has been integrated into @huggingface Trainer, Accelerate and TRL.

For extensive details please see this writeup: https://t.co/2xDWUk8p3V

Thanks a lot to @krasul for helping make it happen, and to the others in the HF team who helped with integration.

3 months ago

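The latest DeepSpeed updates were featured on the PyTorch blog!
- PyTorch-compatible backward API: makes large-scale multimodal training with Ray simpler to implement
- Memory-efficient BF16/FP16 mode: combined with torch.autocast, reduces peak memory by up to 40%

We welcome your feedback and requests. Issues and PRs are welcome too!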

3 months ago

New @DeepSpeedAI updates make large-scale multimodal training simpler and more memory-efficient.

Our latest blog introduces a PyTorch-identical backward API that makes multimodal training loops easy to code, plus low-precision model states (BF16/FP16) that can reduce peak memory by up to 40% when combined with torch.autocast.

🖇️ Read the full post for details: https://t.co/sSHMGhRixV

#DeepSpeed #PyTorch #MemoryEfficiency #MultimodalTraining #OpenSourceAI

DeepSpeedAI_JP retweeted

6 months ago

Zhipeng (Jason) Wang, PhD (@PKUWZP) explains how @DeepSpeedAI supports ML training research and why joining PyTorch Foundation benefits researchers and developers working on AI training workloads. 🔗https://t.co/6FfXB98gb2 #PyTorch #DeepSpeed #OpenSourceAI #AIInfrastructure

DeepSpeedAI_JP retweeted

8 months ago

UIUC, AnyScale, and Snowflake significantly enhanced LLM offloading for the Superchip era!

9 months ago

The DeepSpeed team will give a keynote speech at PyTorch Conference, held in San Francisco on October 22-23. If you are attending PyTorch Conference, please come and listen. https://t.co/6AfzqLYGtJ

9 months ago

Step into the future of AI at #PyTorchCon 2025, Oct 22–23 in San Francisco 🔥 Join the DeepSpeed keynote and technical talks. Register: https://t.co/6iogY2eetT + Oct 21 co-located events: Measuring Intelligence, Open Agent & AI Infra Summits / Startup Showcase & PyTorch Training

Minjia Zhang @_Minjia_Zhang_

11 months ago

The paper on DeepSpeed's Universal Checkpointing was presented at ATC, a top conference in the software systems field.

11 months ago

📢 Yesterday at USENIX ATC 2025, Xinyu Lian from UIUC SSAIL Lab presented our paper on Universal Checkpointing (UCP). UCP is a new distributed checkpointing system designed for today's large-scale DNN training, where models often use complex forms of parallelism, including data, tensor, pipeline, and expert parallelism.

Existing checkpointing systems struggle in this setting because they are tightly coupled to specific training strategies (e.g., ZeRO-style data parallelism or 3D model parallelism), which break down when the training configs need to dynamically reconfigure over time. This makes it difficult to have resilient and fault-tolerant training.

UCP solves this by decoupling distributed checkpointing from parallelism strategies. Our design introduces a unified checkpoint abstraction -- atomic checkpoint, and a full pattern matching-based transformation pipeline, which enables scalable and low-overhead checkpointing with reconfigurable parallelism across arbitrary model sharding strategies.

We show that UCP supports state-of-the-art models trained with hybrid 3D/4D parallelism (ZeRO, TP, PP, SP) while incurring less than 0.001% overhead of the total training time.

UCP is fully open-sourced in DeepSpeed. It has been adopted by Microsoft, BigScience, UC Berkeley and others for large-scale model pre-training and fine-tuning, including Phi-3.5-MoE (42B), BLOOM (176B), and many more. It also has been selected for presentation at PyTorch Day 2025 and FMS 2025 (the Future of Memory and Storage).

Big thanks to the amazing collaborators from Microsoft and Snowflake: @samadejacobs, @LevKurilenko, @MasahiroTanaka, @StasBekman, and @TunjiRuwase.

🔗 Project: https://t.co/j7VllIQCAS
📄 Paper: https://t.co/23yoXYmcrh
💻 Code: https://t.co/H4qyEyfQ0q
📚 Tutorial: https://t.co/lcxkv5FNxl

#ATC2025 #LLM #Checkpointing #SystemsForML #DeepLearning #DistributedTraining #UIUC #DeepSpeed

DeepSpeedAI_JP retweeted

about 1 year ago

PyTorch Day France marked the launch of a global PyTorch Day series—and the announcement of a major milestone: PyTorch Foundation is now an umbrella foundation. First new projects: @vllm_project + @DeepSpeedAI. Next Stop: PyTorch Day China, June 7 🇨🇳 https://t.co/br36cD3mL7 #PyTorch #OpenSourceAI #vLLM #DeepSpeed

about 1 year ago

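It has been announced that the DeepSpeed project is joining the PyTorch Foundation. Through open collaboration with a broad range of stakeholders, we will contribute even more to the community. Official announcement: https://t.co/6wZvnUZ7ZA https://t.co/IkGegAQaFW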

about 1 year ago

PyTorch Foundation has expanded into an umbrella foundation. @vllm_project and @DeepSpeedAI have been accepted as hosted projects, advancing community-driven AI across the full lifecycle.

Supporting quotes provided by the following members: @AMD, @Arm, @AWS, @Google, @Huawei, @huggingface, @IBM, @Intel, @LightningAI, @Meta, @NVIDIA, and @Snowflake.

🔗💡 Read the full announcement: https://t.co/l55YFhlAOx

#PyTorchFoundation #PyTorch #OpenSourceAI #vLLM #DeepSpeed

DeepSpeedAI_JP retweeted

Horace He

@cHHillee

about 1 year ago

This is pretty neat. They insert into torch.compile and insert some profile-guided optimizations as well as a bunch of other specific optimizations like offloading. Since torch.compile is all in Python all their compiler passes are fairly accessible too! https://t.co/gxpcGQlILf

about 1 year ago

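We have released "DeepCompile", a new DeepSpeed feature!
✅ Automatic optimization of parallel processing based on profiling
✅ ZeRO and offloading implemented as compiler optimization passes
✅ 1.2x to 7x speedups over ZeRO1 / ZeRO3 / offloading

See below for details.
Blog (English): https://t.co/ETSdQkWNQd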

about 1 year ago

Introducing 🚀DeepCompile🚀: compiler-based distributed training optimizations.
- Automatic parallelization & profile-guided optimizations
- Enable ZeRO1, ZeRO3, Offloading, etc. via compiler passes
- 1.2X-7X speedups over manual ZeRO1/ZeRO3/Offloading

https://t.co/1DzW7buCO6

about 1 year ago

Thank you, please make good use of it!

hiroshi matsuda @hmtd223

about 1 year ago

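Apparently DeepSpeed can now combine tensor parallelism with the ZeRO optimizer 🎉 With ZeRO alone, even if you wanted to add nodes to speed up training, the constraint per_device_micro_batch_size * gpu_per_node * num_nodes <= 1536 tended to become the bottleneck. If you can set tp=8, the number of nodes can in theory be increased 8x as well.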
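
The batch-size arithmetic in the post above can be sketched in a few lines of Python. This is a hypothetical illustration, not DeepSpeed code. It assumes that 1536 is a cap on the global batch size and that each data-parallel replica spans `tp` GPUs, so only one in every `tp` GPUs contributes its own micro-batch. The function name `max_nodes` and its parameters are made up for the example, and the 8-GPU node with a micro-batch of 1 is an illustrative choice rather than a value from the post.

```python
def max_nodes(batch_cap: int, micro_batch: int, gpus_per_node: int, tp: int = 1) -> int:
    """Largest node count that keeps the global batch size within batch_cap.

    Assumption: with tensor parallelism, every group of `tp` GPUs processes
    the same micro-batch, so the data-parallel degree is
    (gpus_per_node * num_nodes) / tp.
    """
    if min(batch_cap, micro_batch, gpus_per_node, tp) < 1:
        raise ValueError("all arguments must be positive integers")
    # Solve micro_batch * (gpus_per_node * num_nodes / tp) <= batch_cap for num_nodes.
    return (batch_cap * tp) // (micro_batch * gpus_per_node)


# Illustrative values only: 8 GPUs per node, micro-batch of 1 per GPU.
zero_only = max_nodes(1536, micro_batch=1, gpus_per_node=8)
with_tp8 = max_nodes(1536, micro_batch=1, gpus_per_node=8, tp=8)
print(zero_only, with_tp8, with_tp8 // zero_only)
```

Under these assumptions the node limit scales linearly with `tp`, which matches the "8x in theory" claim. The sketch ignores practical limits such as keeping a tensor-parallel group inside one node.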

about 1 year ago

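A feature that automatically applies tensor parallelism (TP) to HuggingFace models has been released!
- Train large models from the HuggingFace model hub with larger batch sizes and sequence lengths
- 4x faster Llama3 fine-tuning
- No code changes required from users!

Blog (English): https://t.co/iPBz4yQdod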

LF AI & Data Foundation @LFAIDataFdn

about 1 year ago

AutoTP + ZeRO Training for HF Models
- Enhance HF post-training with larger models, batches, & contexts
- 4x faster LLAMA3 fine-tuning with TP=2 vs TP=1
- No code changes needed

Blog: https://t.co/ZlCG2Aq5K5

DeepSpeedAI_JP retweeted

over 1 year ago

🚀 Excited to introduce DeepSpeed, a deep learning optimization library from @Microsoft! It simplifies distributed training and inference, making AI scaling more efficient and cost-effective.

Learn more 👉 https://t.co/LIFjumeAgb

#DeepSpeed #AI #OpenSource #LFAIData

DeepSpeedAI_JP retweeted

Microsoft Research

@MSFTResearch

over 1 year ago

Microsoft Research congratulates Yasuyuki Matsushita on being named a 2025 IEEE Fellow for his outstanding contributions to photometric 3D modeling and computational photography. https://t.co/HjpHJfZLfs

over 1 year ago

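We have released Ulysses-Offload, a new feature for training on very long sequences with limited GPU resources!
- Train LLaMA3-8B with a sequence length of 2M tokens on just four A100-80GB GPUs
- Achieves over 55% MFU

Blog: https://t.co/S5cPAkwk4h
Tutorial: https://t.co/zDNbiXQu6D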