wpeirone @wpeirone - Twitter Profile

wpeirone retweeted

3 months ago

@milan_milanovic

𝗟𝗟𝗠𝘀 𝗔𝗿𝗲 𝗡𝗼𝘁 𝗥𝗲𝗮𝗱𝗶𝗻𝗴 𝗬𝗼𝘂𝗿 𝗖𝗼𝗱𝗲

We keep calling LLMs "AI coding assistants." But writing code and understanding code are not the same thing. Researchers from Virginia Tech and Carnegie Mellon University just ran 750,000 debugging experiments across 10 models to determine how well LLMs actually understand code.

The results show that you should not blindly trust your AI coding assistant when debugging.

Here is what they found:

𝟭. 𝗔 𝗿𝗲𝗻𝗮𝗺𝗲𝗱 𝘃𝗮𝗿𝗶𝗮𝗯𝗹𝗲 𝗯𝗿𝗲𝗮𝗸𝘀 𝘁𝗵𝗲 𝗱𝗲𝗯𝘂𝗴𝗴𝗲𝗿

The researchers planted a bug, confirmed that the LLM found it, then made changes that don't touch the bug at all, such as renaming a variable or adding a comment. In 78% of cases, the model could no longer find the same bug. The bug was still there. Only the variable names and comments had changed, and that was enough.
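To make "changes that don't touch the bug" concrete, here is a minimal Python sketch of a rename that preserves behavior. It is an illustration only, not the study's tooling: the `average` function, its planted bug, and the helper names are all hypothetical. It assumes Python 3.9+ for `ast.unparse`, which also drops comments.

```python
import ast

# Hypothetical function with a planted bug (divides by n - 1 instead of n).
SOURCE = '''
def average(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)
'''


class RenameVariable(ast.NodeTransformer):
    """Rename every Name node matching `old`. Function parameters are not handled."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node


def rename(source: str, old: str, new: str) -> str:
    """Return `source` with the variable `old` renamed to `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def run_average(source: str, values: list) -> float:
    """Execute the module text and call its `average` function."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["average"](values)


mutated = rename(SOURCE, "total", "accumulator")

# Both versions return the same (wrong) answer: the bug is untouched.
print(run_average(SOURCE, [2, 4, 6]), run_average(mutated, [2, 4, 6]))
```

A debugger that understands the code should flag the same line in both versions, since only the variable name differs.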

𝟮. 𝗗𝗲𝗮𝗱 𝗰𝗼𝗱𝗲 𝗶𝘀 𝗮 𝘁𝗿𝗮𝗽

Adding code that never runs reduced bug-detection accuracy to 20.38%. Models treated dead code as live and flagged it as the source of the bug, even though the bug was on another line. So, LLMs cannot reliably distinguish "this runs" from "this never runs."
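For a sense of what "code that never runs" looks like, here is a small Python sketch that flags two simple kinds of unreachable statements. It is an assumption-laden illustration, not the study's method: `SAMPLE`, `find_dead_lines`, and the two patterns chosen are hypothetical, and real dead-code analysis needs much more than this.

```python
import ast

# Hypothetical sample: the bug is on the return line, the dead code is elsewhere.
SAMPLE = '''
def price_with_tax(price, rate):
    if False:
        price = price * 2
    return price + rate
    print("done")
'''

TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


def find_dead_lines(source: str) -> list:
    """Return sorted line numbers of statements that can never execute.

    Covers only two patterns: bodies of `if` tests that are falsy constants,
    and statements that follow return/raise/break/continue in the same block.
    """
    dead = set()
    for node in ast.walk(ast.parse(source)):
        # Pattern 1: the body of `if <falsy constant>:`.
        if (
            isinstance(node, ast.If)
            and isinstance(node.test, ast.Constant)
            and not node.test.value
        ):
            dead.update(stmt.lineno for stmt in node.body)
        # Pattern 2: anything after a terminator in the same statement list.
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if not isinstance(block, list):
                continue  # e.g. lambda bodies are expressions, not lists
            for index, stmt in enumerate(block):
                if isinstance(stmt, TERMINATORS):
                    dead.update(later.lineno for later in block[index + 1:])
                    break
    return sorted(dead)


print(find_dead_lines(SAMPLE))  # [4, 6]
```

Lines 4 and 6 of the sample never execute, while the planted bug sits on line 5. That is the kind of distraction the finding describes.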

𝟯. 𝗠𝗼𝗱𝗲𝗹𝘀 𝗿𝗲𝗮𝗱 𝘁𝗼𝗽-𝘁𝗼-𝗯𝗼𝘁𝘁𝗼𝗺, 𝗻𝗼𝘁 𝗹𝗼𝗴𝗶𝗰𝗮𝗹𝗹𝘆

56% of correctly found bugs were in the first quarter of the file. Only 6% were in the last quarter. The further down the file the code sits, the less attention the model appears to pay to it. If the bug lives in the bottom half of your file, the model is already less likely to find it.

𝟰. 𝗙𝘂𝗻𝗰𝘁𝗶𝗼𝗻 𝗿𝗲𝗼𝗿𝗱𝗲𝗿𝗶𝗻𝗴 𝗮𝗹𝗼𝗻𝗲 𝗰𝘂𝘁 𝗮𝗰𝗰𝘂𝗿𝗮𝗰𝘆 𝗯𝘆 𝟴𝟯%

Changing the order of functions in a Java file caused an 83% drop in debugging accuracy. The code itself did not change. Where the code physically sits in the file mattered more to the model than what the code does. This points to pattern recognition rather than real code understanding.
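The study reordered functions in Java files. The sketch below shows the same idea in Python, purely as an illustration: the sample module and helper names are hypothetical. It reverses the order of top-level functions and checks that behavior is unchanged, which holds here because the functions are only called after the whole module has been defined.

```python
import ast

# Hypothetical module with three top-level functions.
SOURCE = '''
def area(width, height):
    return width * height

def perimeter(width, height):
    return 2 * (width + height)

def describe(width, height):
    return f"area={area(width, height)} perimeter={perimeter(width, height)}"
'''


def reverse_function_order(source: str) -> str:
    """Reverse the order of top-level functions; other statements stay in place."""
    tree = ast.parse(source)
    slots = [i for i, stmt in enumerate(tree.body) if isinstance(stmt, ast.FunctionDef)]
    functions = [tree.body[i] for i in slots]
    for slot, func in zip(slots, reversed(functions)):
        tree.body[slot] = func
    return ast.unparse(tree)


def call_describe(source: str) -> str:
    """Execute the module text and call describe(3, 4)."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["describe"](3, 4)


reordered = reverse_function_order(SOURCE)

# Same output before and after: only the layout of the file changed.
print(call_describe(SOURCE))
print(call_describe(reordered))
```

Both calls print `area=12 perimeter=14`, so a bug report for either layout should point at the same function.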

𝟱. 𝗡𝗲𝘄𝗲𝗿 𝗺𝗼𝗱𝗲𝗹𝘀 𝗵𝗮𝗿𝗱𝗹𝘆 𝗺𝗼𝘃𝗲 𝘁𝗵𝗲 𝗻𝗲𝗲𝗱𝗹𝗲

Between Claude 3.7 Sonnet and Claude Sonnet 4.5, the improvement on this task was ~1%. Gemini improved by ~1.8%. Every model release comes with a new benchmark leaderboard and new headlines, but the ability to reason about code under realistic conditions is improving slowly.

𝟲. 𝗧𝗵𝗲𝘀𝗲 𝘄𝗲𝗿𝗲 𝗯𝗲𝘀𝘁-𝗰𝗮𝘀𝗲 𝗰𝗼𝗻𝗱𝗶𝘁𝗶𝗼𝗻𝘀

The study used single-file programs of ~250 lines, each with a clear description of what the code should do. The authors say this was intentional: they wanted best-case conditions. Real production code is multi-file, cross-module, and poorly documented, so models will likely perform worse on it.

Here are three things worth changing based on the research:

🔹 𝗣𝗮𝘀𝘀 𝗲𝘅𝗲𝗰𝘂𝘁𝗶𝗼𝗻 𝗰𝗼𝗻𝘁𝗲𝘅𝘁, 𝗻𝗼𝘁 𝗷𝘂𝘀𝘁 𝗰𝗼𝗱𝗲. When asking an LLM to debug, include test output, stack traces, and failure messages alongside the source. Without runtime details, the model is guessing from the code alone.

🔹 𝗗𝗼𝗻'𝘁 𝘁𝗿𝘂𝘀𝘁 𝗶𝘁 𝗼𝗻 𝗱𝗲𝗲𝗽-𝗳𝗶𝗹𝗲 𝗯𝘂𝗴𝘀. If the suspect code is in the bottom third of a long file, the model is likely to have trouble finding it. Consider splitting the context or feeding the relevant function directly.

🔹 𝗖𝗹𝗲𝗮𝗻 𝘂𝗽 𝗱𝗲𝗮𝗱 𝗰𝗼𝗱𝗲 𝗯𝗲𝗳𝗼𝗿𝗲 𝘂𝘀𝗶𝗻𝗴 𝗔𝗜 𝗱𝗲𝗯𝘂𝗴𝗴𝗶𝗻𝗴 𝘁𝗼𝗼𝗹𝘀. Commented-out blocks and unreachable branches can mislead the model, which cannot reliably filter them out.
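The first two suggestions above can be combined in a few lines of Python. This is a rough sketch under stated assumptions: the sample module, the `check_mean` test, and the prompt wording are all hypothetical, and no particular LLM API is assumed, so the code only builds the prompt text. It pulls out the one suspect function and attaches the real failure output.

```python
import ast
import traceback

# Hypothetical module: `mean` has a planted bug, the helper is irrelevant noise.
SOURCE = '''
def mean(values):
    return sum(values) / (len(values) - 1)

def unrelated_helper():
    return "not needed for this bug"
'''


def extract_function(source: str, name: str) -> str:
    """Return the source text of one top-level function, or raise KeyError."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef) and node.name == name:
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                return segment
    raise KeyError(name)


def capture_failure(source: str, check) -> str:
    """Run `check` against the module namespace; return the traceback text, or "" if it passes."""
    namespace: dict = {}
    exec(source, namespace)
    try:
        check(namespace)
    except Exception:
        return traceback.format_exc()
    return ""


def check_mean(namespace: dict) -> None:
    """A tiny stand-in for a failing unit test."""
    result = namespace["mean"]([2, 4, 6])
    assert result == 4, f"mean([2, 4, 6]) returned {result}, expected 4"


def build_debug_prompt(source: str, function_name: str, check) -> str:
    """Assemble a prompt with only the suspect function plus runtime evidence."""
    return "\n\n".join([
        "Find the bug in this function.",
        "Function under suspicion:\n" + extract_function(source, function_name),
        "Failing check output:\n" + capture_failure(source, check),
    ])


prompt = build_debug_prompt(SOURCE, "mean", check_mean)
print(prompt)
```

The resulting prompt contains the `mean` function and the `AssertionError` text, and leaves out the unrelated helper. That keeps the suspect code at the top of a short context instead of deep inside a long file.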

We rate AI coding tools on HumanEval. That tests whether a model can write a function from a description, which says nothing about whether it can find a bug in code it didn't write.

Those are different problems. We're using the wrong benchmark.

wpeirone @wpeirone

over 1 year ago

@Santander_Ar what is the advantage of being a Black customer if, to fix an error that you caused, I have to keep calling without getting a solution, and the only solution offered comes at a cost to me?

wpeirone retweeted

Figma

@figma

about 3 years ago

#Config2023 Launch 1: Dev Mode A new space in Figma for developers with features that help translate design into code, faster. Read more: https://t.co/p1PZzRQ7RX Here are all the ways you can use Dev Mode 👇

wpeirone retweeted

Figma

@figma

about 3 years ago

#Config2023 Launch 2: Variables You can now use variables to make adaptable designs—we’re talking different brand themes, device formats, and more. And yup, variables can be exported as tokens in case that’s helpful 😉. 👇 See variables in action

wpeirone retweeted

about 3 years ago

#Config2023 launches bridge the gap between design and development, all in Figma. → Dev Mode, a new space for developers → Variables → Advanced prototyping → Auto layout updates → Font picker → File browser redesign Plus, we previewed the future of Figma with AI and announced the acquisition of @diagram. https://t.co/RB3qHFSSPz

wpeirone @wpeirone

over 3 years ago

@PersonalFlow_At the bot is pathetic...

wpeirone @wpeirone

over 3 years ago

@PersonalFlow_At no service again... it would be nice to at least be able to file a complaint from the app...

wpeirone retweeted

Sudanalytics

@sudanalytics_

over 3 years ago

Best photos and videos from yesterday's celebration. Thread below.

wpeirone retweeted

Phillip Parker @TheAgileMaker

over 3 years ago

@KirstenMinshall It’s quite astonishing how much devs of all experience levels really don’t grok narrowing/slicing end-to-end to deliver something sooner (and with less risk/faster feedback). To be fair, it’s a skill that requires lots of *intentional* practice. Believing it’s possible first…

wpeirone @wpeirone

almost 4 years ago

A full @PersonalFlow_At

wpeirone retweeted

wpeirone @wpeirone

almost 4 years ago

@PersonalFlow_At

wpeirone @wpeirone

almost 4 years ago

@PersonalFlow_At

wpeirone @wpeirone

about 4 years ago

@Happydog___ Looks like a Titan!

wpeirone @wpeirone

about 4 years ago

@PersonalAr I live in Roldan Tds3 and have had no internet since 9:48. Is there an estimated time for a fix? I need internet to finish some work!!

wpeirone @wpeirone

over 4 years ago

@PersonalFlow_At is there an estimated time to fix the outage in the Roldan area? No information at all; I call and get nothing...

wpeirone @wpeirone

over 4 years ago

@PersonalFlow_At is there an estimated resolution time for the outage in the Roldan area??

wpeirone retweeted

Cristian Borghello @SeguInfo

almost 5 years ago

Gmail will launch a logo-based business verification system https://t.co/cA5z125Ntr

wpeirone retweeted

Vala Afshar

@ValaAfshar

almost 5 years ago

“Heroes do not have the need to be known as heroes, they just do what heroes do because it is right and it must be done.” Sir Nicholas Winton rescued 669 children from the holocaust. Some of the survivors surprised him 50 years later.

wpeirone retweeted

Vlad Mihalcea

@vlad_mihalcea

about 5 years ago

When software developers tell you they are "almost" done.

wpeirone retweeted

Ronald van Loon

@Ronald_vanLoon

about 5 years ago

This #Robot can teach your child how to code, plus it dances, plays and climbs by @realtechmatters #MI #Tech #Technology Cc: @evankirstel @adamdanyal
