Brian Dezhou Shen🇨🇳🇬🇧 @dezhou - Twitter Profile

Pinned Tweet

about 3 years ago

#ChatGPT as an Analyst. Motivation: I want to ask ChatGPT to estimate the AI market value in 2023 providing the history data. Here is my question, and the generation output of ChatGPT is astonishing.

1

0

1

882

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

5 months ago

@calcsam keep it pro, nice topic

0

2

dezhou retweeted

Zhuang Liu

@liuzhuang1234

about 1 year ago

New paper - Transformers, but without normalization layers (1/n)

76

4K

575

2K

1M

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@vllm_project Good

0

54

AI Infra engineer @alibaba_cloud

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@BenjaminDEKR good luck to you

0

2

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@moyix Cannot use the official api?

1

0

145

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@JustinLin610 coder is unique, while llm they use llama3.

0

1

0

61

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@iScienceLuvr and free reading.

0

1

0

919

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@WolframRvnwlf @Alibaba_Qwen you deserve more vram

0

1

0

13

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@JustinLin610 Awq has an agenda?

0

3

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@moyix Finetuning

0

2

0

40

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@giffmana bankers are idiots.🎃

0

47

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@natolambert Someone invented, I guess.

0

36

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@j_foerst @FLAIR_Ox Someone would do so, I believe.

0

238

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@JustinLin610 14b 32b needs more tuning, in my opinion.

0

26

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@victormustar 0.5b,1.5b,3b?

0

17

dezhou retweeted

Adina Yakup

@AdinaYakup

over 1 year ago

Exciting release from @Alibaba_Qwen 🔥 Qwen 2.5-Coder is now live on @huggingface 👉https://t.co/QuVqn4Sjn4 ✨ Apache 2.0 license ✨ 0.5B, 1.5B, 3B, 7B, 14B, 32B base & instruct ✨ 128K long context support ✨ SOTA performance on coding benchmarks

1

20

6

4

4K

dezhou retweeted

Rohan Paul

@rohanpaul_ai

over 1 year ago

"Attention Is All You Need" paper was truly a landmark paper.

However, the original "vanilla" transformers are seldom used now.

The huge key upgrade is the use of RoPE, or Rotary Positional Embeddings.

**Vanilla Decoder**

- Input tokens -> Embeddings -> Embeddings + Positional Encoding -> Decoder Blocks

**RoPE Decoder**

- Input tokens -> Embeddings -> Decoder Blocks

**Rotary Positional Embeddings**

RoPE are used in attention blocks, which need to know token positions.

Attention blocks combine information from a lot of tokens and need to know their relative positions

For example, consider this sentence "It's a big thrill to climb a big mountain."

"mountain" should focus more on the nearby "big."

RoPE applies a rotational matrix to queries and keys, not values. If "mountain" is the 9th word, it rotates fully, while earlier words rotate less, aligning "mountain" more with the second "big."

This approach is efficient as it applies positional embeddings only where needed and keeps token magnitudes unchanged.

RoPE scales well to longer contexts, allowing models to be pre-trained on 4k contexts and fine-tuned for up to 4M by adjusting rotation speed.

9

464

77

343

26K
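The retweeted post above describes RoPE as a rotation applied to queries and keys that keeps magnitudes unchanged and encodes relative positions. A minimal sketch of that idea in plain Python follows. The helper names `rope` and `dot`, the example vectors, and the `base=10000.0` default are assumptions made for illustration and are not taken from the post; `base` plays the role of the "rotation speed" the post mentions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`.

    Hypothetical helper: pair j (dimensions 2j, 2j+1) is rotated by
    pos * base ** (-2j / d), so early pairs rotate faster than later ones.
    """
    d = len(vec)
    assert d % 2 == 0, "dimensions are rotated in pairs, so d must be even"
    out = []
    for i in range(0, d, 2):          # i = 2j
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))

# Made-up query and key vectors (assumption, for illustration only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 tokens apart) at two different absolute positions.
score_near_start = dot(rope(q, 8), rope(k, 5))
score_far_along = dot(rope(q, 108), rope(k, 105))

# Squared length of the query before and after rotation.
norm_before = dot(q, q)
norm_after = dot(rope(q, 8), rope(q, 8))
```

Under these assumptions, the two scores agree up to floating-point error because only the offset between positions survives in the dot product, and the query's length is unchanged by the rotation. No separate positional vector is added to the embeddings, which is the difference between the two decoder pipelines listed in the post.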

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@rohanpaul_ai Why did rope remove positional embedding?

0

141

Brian Dezhou Shen🇨🇳🇬🇧 @dezhou

over 1 year ago

@huybery So, when do you plan to release?

0

92
