Joseph @realjosephus - Twitter Profile

Joseph @RealJosephus

3 months ago

@nickhistgeek then glm-image did the same, better, in 9B params.

0

1

0

31

RealJosephus retweeted

Mayank Mishra

@MayankMish98

4 months ago

We identified an issue with the Mamba-2 🐍 initialization in HuggingFace and FlashLinearAttention repository (dt_bias being incorrectly initialized). This bug is related to 2 main issues: 1. init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has been already fixed: https://t.co/oahfxjIsKb). 2. Skipping initialization due to meta device init for DTensors with FSDP-2 (https://t.co/hLC8nnQFc3 will fix this issue upon merging). The difference is substantial. Mamba-2 seems to be quite sensitive to the initialization. Check out our experiments at the 7B MoE scale: https://t.co/n8iuUICRux Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu 🙏 Also thanks to @SonglinYang4 for quickly helping in merging the PR.

17

740

73

326

372K
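The tweet above does not show the corrected initialization. As a rough sketch only, assuming the inverse-softplus scheme used in the reference Mamba code (sample a step size log-uniformly, then store the value whose softplus recovers it), the contrast with a `torch.ones`-style init looks like this. The function names and the defaults `dt_min`, `dt_max` and `dt_floor` are illustrative assumptions, not values taken from the linked PRs.

```python
import math
import random


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def inverse_softplus(y: float) -> float:
    # Solve softplus(x) == y for y > 0: x = y + log(1 - exp(-y)).
    return y + math.log(-math.expm1(-y))


def init_dt_bias(n_heads: int, dt_min: float = 1e-3, dt_max: float = 1e-1,
                 dt_floor: float = 1e-4, seed: int = 0) -> list:
    # Hypothetical helper: one bias per head, so that softplus(bias)
    # lands log-uniformly inside [dt_min, dt_max].
    rng = random.Random(seed)
    biases = []
    for _ in range(n_heads):
        log_dt = rng.random() * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
        dt = max(math.exp(log_dt), dt_floor)
        biases.append(inverse_softplus(dt))
    return biases


good = [softplus(b) for b in init_dt_bias(8)]  # step sizes within [1e-3, 1e-1]
bad = softplus(1.0)                            # an all-ones bias gives a far larger step
```

Under these assumptions, a bias of all ones maps to a step size well above `dt_max`, which fits the tweet's remark that the difference is substantial.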

RealJosephus retweeted

Niels Rogge @NielsRogge

8 months ago

For people thinking that DeepSeek-OCR is the first model to render text as images, the University of Copenhagen already did this in 2023. The paper is called "Language Modelling with Pixels". They trained a Masked AutoEncoder (MAE) by rendering text as images and masking patches. https://t.co/CuyGtRSOkF

22

523

50

204

45K

Joseph @RealJosephus

8 months ago

DeepSeek dropped the OCR model they trained last year. Against VL models, they highlight OCR; against OCR models, they highlight the token count from conv downsampling, yet it has more params. Quite a scene watching people's reactions. The model's only better than the 0.9B Paddle model when it comes to math formulas.

Zephyr

@zephyr_z9

8 months ago

Interesting Baidu has a better OCR than Whale

11

104

3

22

21K

0

4

0

1

2K

Joseph @RealJosephus

9 months ago

@teortaxesTex Gotta whisper this: when data costs far exceed training costs, spectral norm is basically the worst you could do from a 2nd-moment view, no? You can't splash cash on data and then defend burning it.

0

2

0

1

693

Joseph @RealJosephus

9 months ago

@teortaxesTex For celebrity identification, I'm pretty sure the model just learned the name tags. It knows which face goes with which name, but it doesn't actually know who the person is. Typical Qwen superficial work. https://t.co/3voKnkXCHz

0

20

1

7K

Joseph @RealJosephus

10 months ago

@teortaxesTex @ChaseBrowe32432

0

7

0

1

5K

Joseph @RealJosephus

10 months ago

@Presidentlin fake news bro...

0

7

0

185

Joseph @RealJosephus

11 months ago

@Presidentlin vlencoder + mmdit imagen bro!

0

2

0

80

Joseph @RealJosephus

11 months ago

If it messes up the order, chances are it has conflicting information from a bad update (like new wiki data layered on old data). If it gets it right, its knowledge is current as of late 2024. If its answer is old (from before 2023), its knowledge probably cuts off in 2022.

0

3

0

603
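The heuristic in the tweet above can be written down as a small decision rule. This is only a sketch of that reasoning: `classify_cutoff`, its arguments and the "inconclusive" fallback are hypothetical, and the reference lists would have to be supplied by whoever runs the probe.

```python
def classify_cutoff(answer, current_order, stale_answer):
    """Map a model's answer to the three cases described in the tweet.

    answer        -- names in the order the model gave them
    current_order -- correct order as of late 2024 (supplied by the tester)
    stale_answer  -- what a correct pre-2023 answer would have been
    """
    if answer == current_order:
        return "knowledge current as of late 2024"
    if answer == stale_answer:
        return "knowledge probably cuts off in 2022"
    if sorted(answer) == sorted(current_order):
        # Right names, wrong order: new data layered on old data.
        return "conflicting information from a bad update"
    # Not covered by the tweet; added so the function always returns a label.
    return "inconclusive"


# Placeholder names only; real office holders are deliberately not filled in.
print(classify_cutoff(["B", "A", "C"], ["A", "B", "C"], ["A"]))
```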

Joseph @RealJosephus

11 months ago

Qwen3-Coder-480B-A35B-Instruct 1M ctx, but... cutoff 2022

Casper Hansen

@casper_hansen_

11 months ago

if you loved kimi k2, you will love what a certain chinese team is about to release which is highly competitive with 1M context length

57

1K

33

138

95K

3

65

3

9

8K

Joseph @RealJosephus

11 months ago

Vietnam went through 4 presidents in 2024. That's a great trivia question. The model answered Nguyễn Xuân Phúc (04.2021 - 01.2023).

2

5

0

1

878

Joseph @RealJosephus

11 months ago

@SwayStar123

0

3

0

317

Joseph @RealJosephus

11 months ago

For the next 5-10 years, we will be haunted by 2022...

0

9

0

785

Joseph @RealJosephus

about 1 year ago

@teortaxesTex nay, garbage 'audio head parallel decoder' with ~1T tokens wasted. see: https://t.co/S5MkewnVDd might be a peculiar curse, but all the models with this architecture that I've observed have been poorly trained... https://t.co/7FrzEt1u6Q

1

2

0

257

Joseph @RealJosephus

about 1 year ago

now it also applies to inference, tokenizers... Know your inference - vllm, *.cpp, ggml, hf? Everyone believes their inference implementation is correct. Verify it yourself. Otherwise, you're no different from someone who types `ollama run deepseek-r1`. https://t.co/L5bHPGbn1m

Joseph @RealJosephus

over 1 year ago

This suggests that, in reality, NO 'serious' LLM training is actually centered around the Hugging Face ecosystem - many who claim to surpass Meta LLaMA3.1, don't even know how to train a model properly - script kiddies

11

117

5

69

22K

0

3

0

755

Joseph @RealJosephus

about 1 year ago

Cool HF. https://t.co/cZxCqh68zK

kalomaze

@kalomaze

about 1 year ago

at this point i probably gotta abandon HF entirely if i want to ensure that any of this will be correct at all at any point in the process right

6

21

0

2K

1

7

0

1

1K

Joseph @RealJosephus

about 1 year ago

@teortaxesTex LLM lacks control over paralinguistic info, resulting in a monotonous voice similar to that of cascaded TTS systems... <del>but benchmaxxing...</del>

0

2

0

190

Joseph @RealJosephus

about 1 year ago

Sounds like, and is functionally equivalent to, a cascaded TTS. The LLM failed to manipulate paralinguistic cues. Almost the same as MiniCPM-o-2-6: whether it transmits text tokens or LLM hidden states, the distinction is illusory; no discernible difference. No omni API works like this. Is this `E2E omni`? https://t.co/Evbhf8vLHw

Joseph @RealJosephus

about 1 year ago

https://t.co/60RTECijf9 coming soon... If I understand correctly, 32k ctx window can handle ~10 minutes of audio i/o, whereas the capacity drops to just a few minutes with vision input. 25 tok/s audio & qwen2.5-vl vision tokens are heavy.

1

8

0

2

7K

0

20

3

5K
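The "~10 minutes of audio i/o" estimate in the quoted tweet can be checked with simple arithmetic. The sketch below assumes that "32k" means 32768 tokens, that input and output audio both consume 25 tokens per second from the same window, and that text and vision tokens are ignored; `audio_minutes` is a hypothetical helper.

```python
def audio_minutes(ctx_tokens: int, tokens_per_second: float, streams: int = 2) -> float:
    # Each stream (input audio, output audio) spends tokens_per_second
    # tokens for every second of conversation.
    seconds = ctx_tokens / (tokens_per_second * streams)
    return seconds / 60.0


two_way = audio_minutes(32768, 25)             # input + output audio
one_way = audio_minutes(32768, 25, streams=1)  # a single audio stream
print(f"two-way: {two_way:.1f} min, one-way: {one_way:.1f} min")
```

Under these assumptions the two-way figure comes out just under 11 minutes, in line with the tweet's estimate; any vision tokens added to the window reduce it further.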
