Xiaowei Huang @xiaoweih - Twitter Profile

Xiaowei Huang @xiaoweih

almost 3 years ago

“Having the behavior of an LLM change over time is not acceptable.” — why?

Santiago

@svpino

almost 3 years ago

GPT-4 is getting worse over time, not better. Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal. But now we know. At least one study shows how the June version of GPT-4 is objectively worse than the version released in March on a few tasks. The team evaluated the models using a dataset of 500 problems where the models had to figure out whether a given integer was prime. In March, GPT-4 answered correctly 488 of these questions. In June, it only got 12 correct answers. From 97.6% success rate down to 2.4%! But it gets worse! The team used Chain-of-Thought to help the model reason: "Is 17077 a prime number? Think step by step." Chain-of-Thought is a popular technique that significantly improves answers. Unfortunately, the latest version of GPT-4 did not generate intermediate steps and instead answered incorrectly with a simple "No." Code generation has also gotten worse. The team built a dataset with 50 easy problems from LeetCode and measured how many GPT-4 answers ran without any changes. The March version succeeded in 52% of the problems, but this dropped to a pale 10% using the model from June. Why is this happening? We assume that OpenAI pushes changes continuously, but we don't know how the process works and how they evaluate whether the models are improving or regressing. Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. When a user asks a question, the system decides which model to send the query to. Cheaper and faster, but could this new approach be the problem behind the degradation in quality? In my opinion, this is a red flag for anyone building applications that rely on GPT-4. Having the behavior of an LLM change over time is not acceptable. Have you noticed any issues when using GPT-4 and ChatGPT lately? Do you think these problems are overblown?

svpino's tweet photo. GPT-4 is getting worse over time, not better.

Many people have reported noticing a significant degradation in the quality of the model responses, but so far, it was all anecdotal.

But now we know.

At least one study shows how the June version of GPT-4 is objectively worse than the version released in March on a few tasks.

The team evaluated the models using a dataset of 500 problems where the models had to figure out whether a given integer was prime. In March, GPT-4 answered correctly 488 of these questions. In June, it only got 12 correct answers.

From 97.6% success rate down to 2.4%!

But it gets worse!

The team used Chain-of-Thought to help the model reason:

"Is 17077 a prime number? Think step by step."

Chain-of-Thought is a popular technique that significantly improves answers. Unfortunately, the latest version of GPT-4 did not generate intermediate steps and instead answered incorrectly with a simple "No."

Code generation has also gotten worse.

The team built a dataset with 50 easy problems from LeetCode and measured how many GPT-4 answers ran without any changes.

The March version succeeded in 52% of the problems, but this dropped to a pale 10% using the model from June.

Why is this happening?

We assume that OpenAI pushes changes continuously, but we don't know how the process works and how they evaluate whether the models are improving or regressing.

Rumors suggest they are using several smaller and specialized GPT-4 models that act similarly to a large model but are less expensive to run. When a user asks a question, the system decides which model to send the query to.

Cheaper and faster, but could this new approach be the problem behind the degradation in quality?

In my opinion, this is a red flag for anyone building applications that rely on GPT-4. Having the behavior of an LLM change over time is not acceptable.

Have you noticed any issues when using GPT-4 and ChatGPT lately? Do you think these problems are overblown?

1K

14K

3K

4K

6M

0

1

0

463

Xiaowei Huang @xiaoweih

almost 3 years ago

The grading methodology is loose. Instead of giving credit about how good a certain activity (such as Evaluations and Testing) has been performed, it only requires the company "Report the results of " their internal reports. https://t.co/Mx1pQgwHPM

0

1

0

244

xiaoweih retweeted

The Guardian

@guardian

almost 3 years ago

‘A huge relief’: scientists react to hopes of UK rejoining EU Horizon scheme https://t.co/VzUTHa95oS

9

90

15

1

51K

Xiaowei Huang @xiaoweih

almost 3 years ago

@matthew_wicker @ICComputing Congratulations! Well done and well deserved

0

2

0

156

Who to follow

背心叔

@zululiu

咖啡黑了我的肠，内心独白。上茅房行走，没品御前带铲侍卫。屁股上有合格章的人。

Xiaowei Huang @xiaoweih

almost 3 years ago

Not sure if this is true. If so, that’d be a significant academic misconduct. People take significant amount of time writing proposals. If really busy, why can’t one just refuse to review?

1

2

0

322

Xiaowei Huang @xiaoweih

almost 3 years ago

i am curious about where they got the data?

World of Statistics

@stats_feed

almost 3 years ago

The countries with the highest rates of smartphone addiction: 1. 🇨🇳 China 2.🇸🇦 Saudi Arabia 3.🇲🇾 Malaysia 4.🇧🇷 Brazil 5.🇰🇷 South Korea 6.🇮🇷 Iran 7. 🇨🇦 Canada 8.🇹🇷 Turkey 9.🇪🇬 Egypt 10.🇳🇵 Nepal 11.🇮🇹 Italy 12.🇦🇺 Australia 13.🇮🇱 Israel 14.🇷🇸 Serbia 15.🇯🇵 Japan 16.🇬🇧 United Kingdom 17.🇮🇳 India 18.🇺🇸 United States 19.🇷🇴 Romania 20.🇳🇬 Nigeria 21.🇧🇪 Belgium 22.🇨🇭 Switzerland 23.🇫🇷 France 24.🇩🇪 Germany

1K

35K

5K

1K

12M

0

255

Xiaowei Huang @xiaoweih

almost 3 years ago

Life: discover something fun, exercise small (yet sensible) risk, play around it, but always come back to your solid support.

Buitengebieden

@buitengebieden

almost 3 years ago

My all time favorite.. ❤️ Never gets old..

1K

161K

18K

6K

16M

0

2

0

187

xiaoweih retweeted

Taylor Ogan

@TaylorOgan

almost 3 years ago

A Tesla on Full Self-Driving blows through a stop sign at 35mph and nearly collides with two cars. The kicker is that this was during a livestream debate-demo-drive between FSD fan @GerberKawasaki and FSD skeptic @RealDanODowd. This should go without saying, but for an automated system to be at the safety level of a human, this cannot happen. The fact that this occurred yesterday during THIS drive (along with other safety-critical disengagements) should serve as statistical evidence of how frequent this is occurring. IMO, this is the biggest nail in the Tesla FSD coffin.

399

3K

389

237

1M

Xiaowei Huang @xiaoweih

almost 3 years ago

Is GPT a high risk system? As a general purpose system, it is not designed to be. However, people might use it when building high risk systems. The same argument goes to every machine learning system. There needs to be a stronger reason to exclude GPT fro…https://t.co/IN3uwW3E4n

0

104

Xiaowei Huang @xiaoweih

almost 3 years ago

The question is what would be a satisfactory V&V process? — “Significant technical documentation will be required on testing and validation procedures, the collection, storage, mining and so on of data, and accountability. ” https://t.co/Dzj7QyY0Zf

0

89

Xiaowei Huang @xiaoweih

about 3 years ago

Isn't it obvious that a "failing grade for pedestrian crashworthiness" should veto the "five start safety rating"? https://t.co/q8j0c63omz

0

83

Xiaowei Huang @xiaoweih

about 3 years ago

Can deep learning be absolutely safe? Can this safety be proven? — I thought people have now generally agreed that neither of the above questions can have positive answer, and some concepts like safety integrity level (SIL) should play a role as a probabi…https://t.co/FZpytF360O

0

2

0

117

Xiaowei Huang @xiaoweih

about 3 years ago

LLMs -- we completed a survey with 300+ references, trying to summarise the known vulnerabilities of LLMs and discuss whether and how the verification and validation (V&V) techniques can be adapted to work with LLMs. The paper is now available at ArXiv: https://t.co/abpHpS6Idb .

xiaoweih's tweet photo. LLMs -- we completed a survey with 300+ references, trying to summarise the known vulnerabilities of LLMs and discuss whether and how the verification and validation (V&V) techniques can be adapted to work with LLMs. The paper is now available at ArXiv: https://t.co/abpHpS6Idb . https://t.co/NcYi05925P

0

5

2

0

643

xiaoweih retweeted

spooky pshilla (alpha male) 👑🎃 @BoobsRespectr

about 3 years ago

@pmddomingos europe can just tax american companies, no need for own development.

0

1

2

0

683

Xiaowei Huang @xiaoweih

about 3 years ago

Would be great if we can receive submissions regarding large language models and ChatGPT for interesting discussions in the workshop. https://t.co/AboJOmL68O

0

2

0

351

Xiaowei Huang @xiaoweih

about 3 years ago

Finally get this book published! The title self-explains what it is for :) https://t.co/AF0OOPIKjv https://t.co/17AqFsme2G

0

8

3

0

314

Xiaowei Huang @xiaoweih

about 3 years ago

Is this a good example of GDP or bad?

0

121

Xiaowei Huang @xiaoweih

about 3 years ago

thankful that the last word was “poorer” not “poor”

BBC News (UK)

@BBCNews

about 3 years ago

Bank of England economist says people need to accept they are poorer https://t.co/IbQUqtAY8Z

619

760

299

66

1M

0

171

Xiaowei Huang @xiaoweih

about 3 years ago

This is appalling… probably before we regulate AI on privacy, we should regulate human.

neil by mouth @nbreavington

about 3 years ago

JUST IN - Elon Musk tells Tucker Carlson that various government agencies had full access to everything that's going on Twitter.. including people's DMs (direct messages). Full interview on 8 pm ET Monday!!!

3

104

47

2

8K

0

121

Xiaowei Huang

@xiaoweih

Who to follow

Last Seen Users on Sotwe

Trends for you

Most Popular Users