みぃ🍵

@mithernet

ヾ(๑╹◡╹)ﾉ

A peaceful world

Joined September 2018

970 Following

1.9K Followers

3.8K Posts

Pinned Tweet

みぃ🍵 @mithernet

28 days ago

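I've released a revised version of my paper, which aims to replace the Transformer at the core of today's AI! ✅ Maintains performance and retrieves information accurately even on texts far longer than those seen during training ✅ Training stability on another level (it can even train with a learning rate of 1!) ✅ Smaller model size & faster inference ✅ Greatly improved interpretability. I added figures to the main text and rewrote it to be easier to follow! I'd be happy if you read it!! Please share it with the people around you too!! Paper → https://t.co/fi9Ucl6oOy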

4

653

100

498

85K

mithernet retweeted

Louis Maddox @permutans

24 days ago

"Screening Is Enough" > [Softmax attention] does not provide an independently interpretable measure of query-key relevance: attn. scores are unbounded, while attn. weights are defined only relative to competing keys. Consequently, irrelevant keys cannot be explicitly rejected...

0

5

1

4

4K
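
The point in the quoted passage, that softmax weights are defined only relative to competing keys, can be checked numerically. The sketch below is illustrative and not taken from the paper; the function name `softmax` and the example scores are assumptions chosen for the demonstration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw query-key scores (made-up values for illustration).
relevant = [4.0, 1.0, 0.5]
# The same scores shifted down by 100: every key now looks "irrelevant"
# in absolute terms, yet the gaps between keys are unchanged.
irrelevant = [s - 100.0 for s in relevant]

w_relevant = softmax(relevant)
w_irrelevant = softmax(irrelevant)

# Softmax is invariant to a constant shift, so the two weight vectors
# match and each sums to 1; no key can be rejected outright.
print(w_relevant)
print(w_irrelevant)
```

Because the weights always sum to 1, some key receives weight even when none of them is relevant in an absolute sense.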

みぃ🍵 @mithernet

28 days ago

For speaking requests and the like, please use the email address listed in the paper 🙌

0

1

0

0

5K

みぃ🍵 @mithernet

28 days ago

English version: https://t.co/7ujttkP8OS

みぃ🍵 @mithernet

28 days ago

I just released a revised version of my paper on Multiscreen, an alternative to Transformer for long-context language modeling. ✅ Maintains performance and retrieves information accurately on contexts far longer than those seen during training ✅ Much more stable at large learning rates — it can even train with learning rate 1 ✅ Smaller model size & faster inference ✅ More interpretable context selection I added more figures to the main text and rewrote the paper to make it easier to follow. I’d be very happy if you read it! Paper → https://t.co/fi9Ucl6oOy

8

154

27

128

29K

2

7

1

7

11K

みぃ🍵 @mithernet

28 days ago

I’m preparing the code release — stay tuned!

2

12

0

2

2K

mithernet retweeted

立福　寛 @TATEXH

about 1 month ago

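92% fewer parameters, 3.2x faster inference. A new architecture called "Multiscreen," which breaks through the limits of the Transformer, seems to be attracting attention. The paper is titled "Screening is enough," which can be read as a challenge to "Attention Is All You Need." Prior research has pursued "how to attend to important information." This paper proposes the opposite approach: "how to discard irrelevant information." Conventional attention uses softmax to compute "relative importance," which has the drawback of assigning weight even to irrelevant noise. The proposed Multiscreen judges each key against an absolute threshold and removes unneeded information before aggregating. For example, softmax makes a relative judgment, "if A is better than B, give weight to A," whereas Multiscreen behaves as "if both A and B are garbage, discard both." With this change, it achieved comparable validation loss with 40% fewer parameters than the Transformer baseline under the same token budget. In addition, Multiscreen can be optimized stably even at learning rates so large that Transformer training diverges. Because large learning rates can be used, fine-tuning can be run in a short time, enabling fast experiment cycles. In long-context evaluation, Multiscreen maintains strong perplexity and shows almost no degradation in retrieval performance even at context lengths far beyond those seen during training. In AI and statistics, this is apparently described as strong extrapolation, meaning the ability to behave correctly outside the range seen in training. Under some conditions, it consistently outperforms the Transformer baseline even with 92% fewer model parameters. In other words, its information density is very high. For next-token prediction with a 100K-token context, Multiscreen reduces inference latency by a factor of 2.3 to 3.2 relative to the Transformer baseline. Shifting the aggregation logic from relative evaluation to absolute screening is groundbreaking, achieving breakthroughs in memory, speed, and accuracy alike. It is a promising architecture not only for large-model development but also for running LLMs on edge devices. An easy analogy: it abandons a democracy that distributes to everyone and installs a gatekeeper that lets only the qualified through. https://t.co/LuuaqJhqwg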

1

13

4

10

2K
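
The "if both A and B are garbage, discard both" behaviour described in the summary above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the Multiscreen formulation from the paper: the bounded similarity (cosine), the fixed `threshold` value, the margin weighting, and the function names `cosine` and `screen_and_aggregate` are all hypothetical. The author notes elsewhere in this feed that the thresholds are learned; here one is simply hard-coded.

```python
import math

def cosine(q, k):
    """Bounded similarity in [-1, 1] (assumed here; not from the paper)."""
    dot = sum(a * b for a, b in zip(q, k))
    nq = math.sqrt(sum(a * a for a in q))
    nk = math.sqrt(sum(b * b for b in k))
    return dot / (nq * nk)

def screen_and_aggregate(q, keys, values, threshold=0.5):
    """Keep only keys whose similarity clears an absolute threshold,
    then sum the surviving values weighted by their margin."""
    dim = len(values[0])
    out = [0.0] * dim
    kept = []
    for i, (k, v) in enumerate(zip(keys, values)):
        margin = cosine(q, k) - threshold
        if margin > 0.0:  # absolute test, independent of the other keys
            kept.append(i)
            for j in range(dim):
                out[j] += margin * v[j]
    return out, kept

query = [1.0, 0.0]
keys = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]   # made-up keys
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # made-up values

out, kept = screen_and_aggregate(query, keys, values)
print(kept)  # indices of the keys that passed the screen

# With only irrelevant keys, nothing passes and the output stays all zeros.
out_none, kept_none = screen_and_aggregate(query, keys[1:], values[1:])
print(out_none, kept_none)
```

Each key is tested on its own, so removing or adding other keys never changes whether a given key passes.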

みぃ🍵 @mithernet

29 days ago

I'm completely exhausted.

0

5

0

0

1K

みぃ🍵 @mithernet

2 months ago

@hmassareli Yes! I’m planning to release it!

0

1

0

1

129

みぃ🍵 @mithernet

2 months ago

I'm the author! I've proposed a new mechanism that removes attention's constraint of only being able to make relative comparisons. ① First, the easy-to-grasp benefits: ✅ Maintains performance and retrieves information accurately even on texts far longer than those seen during training ✅ Very fast convergence (trainable even with LR=1) ✅ 40% smaller model size ✅ More than 3x faster inference (continued) https://t.co/75rZpnqieu https://t.co/7enHZCXjDn

15

800

133

608

85K

みぃ🍵 @mithernet

2 months ago

That's all! Details are here: https://t.co/HsWBPVd83A

1

9

1

2

2K

みぃ🍵 @mithernet

2 months ago

⑦ Retrieval performance (this is the most important part!) 📊 ABCDigits (retrieval task; horizontal axis = context length) ✅ Multiscreen shows almost no degradation even on extremely long contexts ⚠️ Transformer collapses as the context gets longer. Moreover, even at the training length, ✅ a model that is 92% smaller wins on retrieval performance (!) 📌 Small yet strong at retrieval 📌 Almost no degradation on long contexts

1

10

0

6

4K

みぃ🍵 @mithernet

2 months ago

😍😍😍

The AI Timeline

2 months ago

🚨This week's top AI/ML research papers: - HISA - Embarrassingly Simple Self-Distillation Improves Code Generation - FIPO - SKILL0 - Reasoning over mathematical objects - Screening Is Enough - Path-Constrained Mixture-of-Experts read this in thread mode for the best experience

2

93

7

80

12K

0

6

0

0

2K

みぃ🍵 @mithernet

2 months ago

I managed to draw a nice explanatory figure, so maybe I'll add it to the paper too.

みぃ🍵 @mithernet

2 months ago

⑤ Handling position information (MiPE: Minimal positional encoding) ⚠️ With positional encoding (RoPE), extrapolation becomes more of a problem the longer the text gets. In MiPE (the proposed method): 🔸 Position information is used only when the screening window is narrow 🔸 Only the first two elements of the query and key are used 👉 Minimal and computationally light 👉 No need to extrapolate on long texts https://t.co/aBgGQzptNt

2

18

1

6

6K
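
One way to picture position information carried by only the first two elements of the query and key is a single 2D rotation, sketched below. This is an assumption-heavy illustration, not the MiPE definition from the paper: the rotation form, the `narrow_window` cutoff, the angular step `theta`, and the names `rotate_first_two` and `encode` are all hypothetical.

```python
import math

def rotate_first_two(vec, position, theta=0.1):
    """Rotate only the first two components by an angle proportional
    to the token position; all other components are left untouched."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec[0], vec[1]
    return [c * x - s * y, s * x + c * y] + list(vec[2:])

def encode(vec, position, window, narrow_window=64):
    """Apply the positional rotation only when the screening window is
    narrow (hypothetical gating rule); wide windows skip it entirely."""
    if window <= narrow_window:
        return rotate_first_two(vec, position)
    return list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5]  # made-up query
k = [0.8, 0.6, 0.2, 0.1]   # made-up key

# Narrow window: the score depends only on the relative offset (here 3).
near = dot(encode(q, 10, window=16), encode(k, 7, window=16))
far = dot(encode(q, 1010, window=16), encode(k, 1007, window=16))

# Wide window: position is ignored, so absolute position never enters.
wide_a = dot(encode(q, 10, window=4096), encode(k, 7, window=4096))
wide_b = dot(encode(q, 99999, window=4096), encode(k, 5, window=4096))
print(near, far, wide_a, wide_b)
```

In this toy version, a wide window produces the same score at any absolute position, which is one reading of why no extrapolation would be needed on long texts.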

0

25

2

24

5K

みぃ🍵 @mithernet

2 months ago

@sho1823 I'm so happy you read it!!! Each threshold is learned too!

0

1

0

0

36

みぃ🍵 @mithernet

2 months ago

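@fumishiki ・On Sigmoid Attention: thank you very much for pointing this out. Because I had been writing the paper with sparsity front and center until just before submission, it was left out of my selection. The decisive difference from sigmoid-based attention is whether irrelevant keys can be excluded completely. I will add this.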

0

8

1

1

1K
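
The distinction drawn in the reply above, whether an irrelevant key can be excluded completely, can be seen in a few lines. This is an illustrative sketch only; the `thresholded` function is a hypothetical stand-in for a screening rule, not the mechanism defined in the paper, and the scores are made up.

```python
import math

def sigmoid(score):
    """Sigmoid gate: mathematically positive for every finite score."""
    return 1.0 / (1.0 + math.exp(-score))

def thresholded(score, threshold=0.0):
    """Hypothetical hard screen: exactly zero at or below the threshold."""
    return max(0.0, score - threshold)

scores = [3.0, -2.0, -8.0]  # made-up query-key scores

sigmoid_weights = [sigmoid(s) for s in scores]
screened_weights = [thresholded(s) for s in scores]

# Sigmoid shrinks irrelevant keys but never removes them;
# the hard screen assigns them a weight of exactly zero.
print(sigmoid_weights)
print(screened_weights)
```

With many irrelevant keys in a long context, the small sigmoid weights still add up, whereas exact zeros contribute nothing.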

みぃ🍵 @mithernet

2 months ago

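@fumishiki Thank you very much for your comments. I will reply a bit at a time. ・On existing long-context benchmarks: I should have included results for existing methods in the appendix as well. I have experimental results for the passkey test, and they show the same trend. I will add them.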

1

5

0

1

3K
