Holden @hodlenx - Twitter Profile

about 2 years ago

@mattmireles PowerInfer 1, the open-sourced code, works on Apple Silicon with CPU only. We are progressing on GPU support and adopting the method in this work where applicable. Stay tuned!

0

3

0

143

Holden @hodlenx

about 2 years ago

🚀 Excited to introduce PowerInfer-2: A game-changing LLM inference engine for mobile devices by the #PowerInfer team. It smoothly runs a 47B model with a staggering 29x speedup on smartphones! Watch our demo to see it in action! 🎥 Technical details at: https://t.co/7bx5EnzWCs

31

574

140

385

68K

Holden @hodlenx

about 2 years ago

@078sky Some initial tests have shown approximately 40% savings on energy per token🔋 We would love to share more findings after comprehensive tests.

0

1

0

120

Holden @hodlenx

about 2 years ago

@canav4r @vimpunk We expect this method to work well on other mobile platforms, but the current testing machines best fit our requirements for large RAM, high-speed flash, and NPU. It’s only a demonstrative implementation and we have no preference on the hardware.

0

2

0

85


Holden @hodlenx

about 2 years ago

@IlyasHairline We’ve included the predictors in the transformers model weights, so you can now train them end-to-end. We found it more efficient to co-train than to train the predictors offline.

0

2

0

73

Holden @hodlenx

about 2 years ago

@IlyasHairline @wey_gu Ideally, foundation models would use these activation functions and be pretrained from scratch. We have demonstrated that their loss/perplexity gap versus SwiGLU is negligible, while the trained models exhibit very sparse FFNs, in the TurboSparse paper and https://t.co/IJo4jjb7Hm

0

2

0

48

Holden @hodlenx

about 2 years ago

👐 PowerInfer-2 will be open-sourced based on the PowerInfer repo. We’re refining it to untangle from our testing platform and making it accessible on PCs for the community. Open-sourcing will happen in stages starting soon. Stay tuned for updates at https://t.co/eQYqe8hHGm

1

20

2

6

929

Holden @hodlenx

about 2 years ago

🔓 The power of cloud-scale models and local privacy aren't mutually exclusive. We're pioneering ways to bring LLMs' incredible capabilities directly to your device without compromising privacy. Explore how we're making AI accessible to everyone, everywhere: https://t.co/jYXhmuGZmi

1

8

0

3

2K

Holden @hodlenx

about 2 years ago

@wey_gu The method we proposed in TurboSparse enables us to sparsify a mainstream foundation model within 150B tokens (5% of pretraining). The continued training of Mistral and Mixtral cost us less than $0.1M. We hope other researchers find it a nice trade😎

0

2

1

0

597

Holden @hodlenx

about 2 years ago

@wey_gu Those ground-breaking speedups are all based on intrinsic sparsity and depend, more or less, on ReLU. We have confirmed some alternatives, like ReLU^2 and the dReLU proposed in TurboSparse. They are very promising but not yet adopted by mainstream LLMs. Retraining is still essential.

2

3

2

1K
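Editor's note: the link between the activation function and FFN sparsity mentioned in the tweet above can be illustrated with a small sketch. It assumes dReLU applies ReLU to both the gate and up branches of a gated FFN, which is how the TurboSparse paper is understood to define it, and it feeds random Gaussian values in place of real projections. All function names here are hypothetical, and the numbers it prints say nothing about a trained model.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def relu_squared(x):
    # ReLU^2: same zero pattern as ReLU, squared positive part
    r = relu(x)
    return r * r

def silu(x):
    return x / (1.0 + math.exp(-x))

def swiglu_hidden(gate, up):
    # SwiGLU-style hidden activations: SiLU(gate) * up (almost never exactly zero)
    return [silu(g) * u for g, u in zip(gate, up)]

def drelu_hidden(gate, up):
    # Assumed dReLU form: ReLU(gate) * ReLU(up), zero unless both branches are positive
    return [relu(g) * relu(u) for g, u in zip(gate, up)]

def zero_fraction(values):
    # Fraction of hidden neurons whose activation is exactly zero
    return sum(1 for v in values if v == 0.0) / len(values)

# Stand-in for gate/up projections of one token: independent standard normals
rng = random.Random(0)
gate = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
up = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

swiglu_sparsity = zero_fraction(swiglu_hidden(gate, up))
relu2_sparsity = zero_fraction([relu_squared(g) for g in gate])
drelu_sparsity = zero_fraction(drelu_hidden(gate, up))

print(f"SwiGLU: {swiglu_sparsity:.2%}  ReLU^2: {relu2_sparsity:.2%}  dReLU: {drelu_sparsity:.2%}")
```

With independent zero-mean inputs, about half of the ReLU^2 outputs and about three quarters of the dReLU outputs are exactly zero, while SwiGLU produces essentially none. Exact zeros are what let an inference engine skip the corresponding neurons. The higher sparsity figures quoted in the feed come from trained models, not from random inputs like these.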

Holden @hodlenx

about 2 years ago

@vimpunk It is implemented on Qualcomm Snapdragon 8 Gen 3 and utilises CPU & NPU. The test machine in the video is OnePlus 12.

1

6

0

2

683

Holden @hodlenx

about 2 years ago

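@tetrachino It requires NPU compute that is considerably stronger than the CPU's; our main tests were done on Snapdragon 8 Gen 3. This figure shows the measured compute results from the paper: the NPU can reach ~5 TFLOPS in FP16.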

1

3

0

48

Holden @hodlenx

about 2 years ago

@tetrachino Yes, we use the NPU in the prefill phase. The inference batch size is relatively large at that point, which suits NPU computation well.

1

2

0

66

Holden @hodlenx

about 2 years ago

@tetrachino Yep. It utilises NPU at the prefill phase

1

4

0

1

619

Holden @hodlenx

about 2 years ago

@lin72h MoE is the highlight, but it also applies to other LLMs. For a Mistral-7B level model, it can save nearly 40% of memory while achieving faster inference speed than SOTA.

0

3

0

680

hodlenx retweeted

Aran Komatsuzaki

@arankomatsuzaki

about 2 years ago

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters - Proposes a novel dReLU function, which is designed to improve LLM activation sparsity - 2-5× decoding speedup model: https://t.co/UEoBgMxceD abs: https://t.co/yc9ZAhPokt

3

108

18

77

17K

Holden @hodlenx

about 2 years ago

Thrilled to unveil Bamboo-v0.1: A groundbreaking 7B LLM by the #PowerInfer team, matching Mistral's performance with 85% activation sparsity. Built on Mistral's weights, supercharged with dReLU for up to 4.38x hybrid computing speedups. Discover https://t.co/jhFnVL50BH.

0

8

0

6

590

hodlenx retweeted

Yuandong Tian

@tydsh

over 2 years ago

🌟PowerInfer boosts LLM serving speeds by up to 11x on consumer-grade GPUs! Inspired by our Deja Vu paper (ICML'23 https://t.co/FideE69lEl), it serves ReLUified LLMs, keeping heavy hitter (hot) neurons in GPU and offloading sparsely fired ones on CPUs. Proud of my undergraduate/master alma mater SJTU (Shanghai Jiao Tong University) for this work! https://t.co/crCxjKrk0Z

3

120

22

50

15K
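Editor's note: the hot/cold neuron placement described in the retweet above can be sketched as a simple frequency-based split. This is a minimal illustration under stated assumptions, not PowerInfer's actual placement policy, which may be more involved. The function name `split_hot_cold`, the activation counts, and the GPU budget are all hypothetical.

```python
def split_hot_cold(activation_counts, gpu_budget):
    """Assign the most frequently activated neurons to the GPU, the rest to the CPU.

    activation_counts: per-neuron activation counts from a (hypothetical) profiling run.
    gpu_budget: number of neurons that fit in GPU memory (assumed to be given).
    """
    # Rank neuron indices from most to least frequently activated
    ranked = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = set(ranked[:gpu_budget])   # "heavy hitter" neurons kept on the GPU
    cold = set(ranked[gpu_budget:])  # sparsely fired neurons offloaded to the CPU
    return hot, cold

# Made-up profiling counts for a six-neuron layer
counts = [90, 3, 75, 1, 0, 60]
hot, cold = split_hot_cold(counts, gpu_budget=3)
print("GPU (hot):", sorted(hot))
print("CPU (cold):", sorted(cold))
```

The idea is that a small set of neurons accounts for most activations, so keeping only those on the GPU captures most of the compute while the rarely fired remainder stays in CPU memory.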
