Austin Veselka @further_ai - Twitter Profile

Austin Veselka @further_ai

about 1 month ago

@rosinality Alternatively you just use flex attention. And with the flash attention 4 backend, it's pretty efficient

0

58

Austin Veselka @further_ai

about 2 months ago

@ZechenZhang5 I like the idea a lot, my recent agent work requires commits for all artifacts so we can reproduce/investigate all previously claimed results. I'm interested in search extending this system. We want to be able to find things we remember and explore on top of just reproducibility

1

2

0

432

Austin Veselka @further_ai

3 months ago

Qwen model here: https://t.co/mcu6ZaVZ7A

0

1

0

1

129

Austin Veselka @further_ai

3 months ago

This is similar to implicit CoT work, which is all pretty cool stuff. But this is a new way to internalize reasoning and a new control mechanism over performance. Check out the paper for details and the ablations that narrow down explanations!

1

2

0

126

Austin Veselka @further_ai

3 months ago

I use task arithmetic/model merging to limit degradation (0.25 * SFT) and during evaluation, I noticed that with <cot>, the model doesn't reason out loud, but removing <cot> still degrades performance. Thus, internalized reasoning.

further_ai's tweet photo. I use task arithmetic/model merging to limit degradation (0.25 * SFT) and during evaluation, I noticed that with <cot>, the model doesn't reason out loud, but removing <cot> still degrades performance.

Thus, internalized reasoning. https://t.co/uqVGvvezED

1

2

0

81

Austin Veselka @further_ai

3 months ago

I made a paper! https://t.co/MnHn1ritGs Essentially: Extract evidence from pages and sort the top K to make a reasoning trace. Add in a control token and we can turn it on or off. Internalize with model merging.

Austin Veselka @further_ai

5 months ago

So, I used a control tok - <cot> - and trained the model without the CoT when it is not in the prompt. If you train entirely without the synthetic CoT traces, it performs **very** similarly to not prompting <cot>. So, the model can turn the internalized algorithm on or off? Cool

0

2

0

1

1K

1

16

4

9

3K

Austin Veselka @further_ai

3 months ago

I also trained two versions of the models, one with the mixed think + non-think examples and one with only non-think examples. I evaluate both versions with and without the <cot> token in the system prompt:

further_ai's tweet photo. I also trained two versions of the models, one with the mixed think + non-think examples and one with only non-think examples.

I evaluate both versions with and without the <cot> token in the system prompt: https://t.co/j6dq14BR7n

1

2

0

62

Austin Veselka @further_ai

3 months ago

When building examples, I make most examples with a control token <cot> in the system prompt. These examples include the reasoning trace. This gives me a switch that affects the model's reasoning or path to the answer.

1

0

59

Austin Veselka @further_ai

3 months ago

There are two exclusive branches next: - Vision: receives the sorted pages and the question only - Text: sorted, extracted evidence, the question, and context on the input (ties the answer causally to the reasoning reasoning trace). Both branches generate a final answer.

1

0

53

Austin Veselka @further_ai

3 months ago

This seems to be key to getting good performance. The first version just had each page's evidence from the whole document (some pages as "irrelevant"), but during inference, the model tends to loop on "irrelevant". I believe the model learns v2's RAG-like algorithm.

1

0

57

Austin Veselka @further_ai

3 months ago

Thus, ground truth pages are ~always marked relevant, with some score flexibility. I filter for scores above a threshold, sort the pages + their evidence from greatest to least and take the top K (16).

1

0

63

Austin Veselka @further_ai

3 months ago

I take a document and a synthetic question. For each page, I task a VLM with extracting evidence relevant to the question, along with a relevance score in [0.0, 10.0]. If the page was used to generate the question, I prompt the model to score between [6.0, 10.0].

further_ai's tweet photo. I take a document and a synthetic question. For each page, I task a VLM with extracting evidence relevant to the question, along with a relevance score in [0.0, 10.0].

If the page was used to generate the question, I prompt the model to score between [6.0, 10.0]. https://t.co/tybyIft4nc

1

2

0

89

Austin Veselka @further_ai

3 months ago

I trained Qwen3 VL 32B to a new SOTA on MMLongBenchDoc, 58.3 (leaderboard update coming soon). I also trained Mistral Small 3.1 24B and the method is highly impactful across both models. These models are also very token efficient. Here's a little more detail on how it works:

1

0

158

Austin Veselka @further_ai

3 months ago

@thsottiaux Code can be 10x as many lines as needed with hasattr()s, .get()s , etc. and it raises errors for everything (assert isinstance(count, int), "3 line message") even when we fully know that count is an int. These waste space with insanely defensive code and it can fallback silently

0

4

Austin Veselka @further_ai

4 months ago

@oskar_hallstrom Thanks for your help and advice with the project, it went a long ways towards its success

0

1

0

12

Austin Veselka @further_ai

4 months ago

Excited to share my work "How to Train Your Long-Context Visual Document Model." (https://t.co/oE5WTVIHbZ) Research and recipes for training long-context VLMs for document understanding is entirely lacking. In this paper, I explore this frontier with extensive ablations.

1

20

5

16

2K

Austin Veselka @further_ai

4 months ago

Check out the paper for the extensive details!

1

3

0

74

Austin Veselka @further_ai

4 months ago

For reproducibility and open insights, I am releasing a full leaderboard of my training runs with data recipes included for the community to explore! Please enjoy https://t.co/7Jn9BjVYay

1

4

1

0

154

Austin Veselka

@further_ai

Last Seen Users on Sotwe

Trends for you

Most Popular Users