Ebtesam @ebtesamdotpy - Twitter Profile

7 months ago

Scrolling the AI news timeline as a researcher feels like a teenager browsing Instagram: "Everyone else has figured everything out!" Reliable home robots imminent, 100× productivity AI agents, insane visual generation ... Exciting, but anxiety-inducing. What am I doing? 😬

35

525

28

72

33K

ebtesamdotpy retweeted

Omar Khattab

@lateinteraction

9 months ago

crazy that they called it context window when attention span was right there

131

7K

360

461

333K

ebtesamdotpy retweeted

Sebastian Raschka

@rasbt

about 1 year ago

As we all know by now, reasoning models often generate longer responses, which raises compute costs. Now, this new paper (https://t.co/UbBv4rzM09) shows that this behavior comes from the RL training process, not from an actual need for long answers for better accuracy. The RL loss tends to favor longer responses when the model gets negative rewards, which I think explains the "aha" moments and longer chains of thought that arise from pure RL training.

That is, if the model gets a negative reward (i.e., the answer is wrong), the math behind PPO causes the average per-token loss to become smaller when the response is longer. So, the model is indirectly encouraged to make its responses longer. This is true even if those extra tokens don't actually help solve the problem.

What does the response length have to do with the loss? When the reward is negative, longer responses can dilute the penalty per individual token, which results in lower (i.e., better) loss values (even though the model is still getting the answer wrong).

So the model "learns" that longer responses reduce the punishment, even though they are not helping correctness.

In addition, the researchers show that a second round of RL (using just a few problems that are sometimes solvable) can shorten responses while preserving or even improving accuracy. This has big implications for deployment efficiency.

33

1K

184

819

108K
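
The dilution argument in the tweet above can be illustrated with a toy calculation. This is only a sketch under assumptions that are not stated in the tweet: a single reward given at the final token, value estimates of zero, no discounting, an importance ratio of 1, and a GAE-style advantage that decays by a factor `lam` per token moving back from the end. The function name `avg_per_token_loss` and the default `lam=0.95` are hypothetical choices for illustration, not taken from the paper.

```python
def avg_per_token_loss(reward: float, length: int, lam: float = 0.95) -> float:
    """Average per-token policy loss in a toy terminal-reward setting.

    Assumptions (illustrative only): zero value estimates, gamma = 1,
    importance ratio = 1, so the advantage of token t is
    reward * lam ** (length - 1 - t).
    """
    advantages = [reward * lam ** (length - 1 - t) for t in range(length)]
    # Policy-gradient style loss: negative advantage, averaged over tokens.
    return -sum(advantages) / length


if __name__ == "__main__":
    for length in (10, 100, 1000):
        wrong = avg_per_token_loss(-1.0, length)   # wrong answer, negative reward
        right = avg_per_token_loss(+1.0, length)   # correct answer, positive reward
        print(f"T={length:5d}  loss(wrong)={wrong:.4f}  loss(right)={right:.4f}")
```

In this toy setting the loss for a wrong answer shrinks as the response gets longer, while the loss for a correct answer is lowest for short responses, which matches the direction of the effect the tweet describes.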

ebtesamdotpy retweeted

Tristan T

@trirpi

about 1 year ago

123

9K

998

709

678K

ebtesamdotpy retweeted

I Am Devloper

@iamdevloper

about 1 year ago

vibe coding, where 2 engineers can now create the tech debt of at least 50 engineers

167

15K

1K

745

647K

ebtesamdotpy retweeted

Nabeel S. Qureshi

@nabeelqu

over 1 year ago

For the confused, it's actually super easy:
- GPT 4.5 is the new Claude 3.6 (aka 3.5)
- Claude 3.7 is the new o3-mini-high
- Claude Code is the new Cursor
- Grok is the new Perplexity
- o1 pro is the 'smartest', except for o3, which backs Deep Research

Obviously. Keep up.

236

11K

607

3K

1M

ebtesamdotpy retweeted

Hamel Husain

@HamelHusain

over 1 year ago

New post re: Devin (the AI SWE). We couldn't find many reviews of people using it for real tasks, so we went MKBHD mode and put Devin through its paces. We documented our findings here. Would love to know if others have had a different experience. https://t.co/DDqzoAXKkl

58

2K

165

1K

587K

ebtesamdotpy retweeted

Diomidis Spinellis @CoolSWEng

over 1 year ago

Long overdue, a paper finally exposes the Emperor's New “Threats to Validity” Clothes in empirical software engineering research. Even better, it provides suggestions for improving the state of practice.

1

8

3

2

1K

ebtesamdotpy retweeted

Jiaxin Pei

@jiaxin_pei

over 1 year ago

It's common to add personas in system prompts, assuming this can help LLMs. However, through analyzing 162 roles x 4 LLMs x 2410 questions, we show that adding a persona mostly has *no* statistically significant difference from the no-persona setting. If there is a difference, it is *negative*. It's time to rethink the usage of personas in system prompts!

1

68

14

37

9K

ebtesamdotpy retweeted

Ishan

@radshaan

almost 2 years ago

If you get frequent urges to go deep into a subject, do not ignore them. Pick a weekend, stop everything else, and give in to the urge. Fresh insights await at the other end.

25

4K

390

707

115K

ebtesamdotpy retweeted

Upol Ehsan @UpolEhsan

almost 2 years ago

Is hallucination in LLMs inevitable even with an idealized model architecture and perfect training data? This work argues YES and offers a formal proof. Let's dig in ⤵ 🧵1/n

15

322

68

489

59K

ebtesamdotpy retweeted

Edward Grefenstette @egrefen

about 2 years ago · Kingston upon Thames

Instead, evaluation processes should track the diverse notions of extrinsic utility found in everyday usage of our technology today, while also anticipating how people might use technology tomorrow.

1

10

2

1

1K

ebtesamdotpy retweeted

Dr Meming @Dr_Meming

over 2 years ago

Heck

14

2K

200

106

118K

ebtesamdotpy retweeted

INSPIRED Lab @ GMU @INSPIREDLabGMU

over 2 years ago

🚨 Inclusive tech research alert! 🚨 Are you a tech user who identifies as BIPOC (https://t.co/gj9uuPIz4d)? Or a researcher/practitioner who uses data in your work? Share your experiences in our 20 min. survey→https://t.co/rGBpQUChFO IRBNet #: 1945546-2 #data #tech #trust

1

2

0

932

ebtesamdotpy retweeted

Dr. Amy Lee @minisciencegirl

over 2 years ago

Never name a manuscript draft "_FINAL"

64

1K

151

36

156K

ebtesamdotpy retweeted

Dr Meming @Dr_Meming

over 2 years ago

Academic research: months of experiments and data analysis that end up being a few sentences in a paper

15

6K

666

256

316K

ebtesamdotpy retweeted

will depue

@willdepue

over 2 years ago

I feel like large language model feels a bit reductive when GPT-2 is in the same class as GPT-4. gigantic language models? enormous language models? big ass language models? Nimitz-class language models? better suggestions needed

69

220

3

10

32K

ebtesamdotpy retweeted

MIT CSAIL

@MIT_CSAIL

over 2 years ago

Happy birthday to Python creator Guido van Rossum. The open source language was named after comedy troupe Monty Python: https://t.co/UGUO3rp0M1 Image v/Midjourney

8

770

164

46

57K

ebtesamdotpy retweeted

François Chollet

@fchollet

over 2 years ago

When I got started with programming, I debugged using printf() statements. Today, I debug with print() statements. The purpose of debugging is to correct your mental model of what your code does, and no tool can do that for you. The best any tool can do is provide visibility into code execution, and targeted print statements already do a tremendous job at that.

74

2K

206

394

565K
