ChatGPT entering its toddler phase
Yes, GPT-4 seems to be getting worse.
But now we have new information. And well, it's complicated.
Yesterday, I posted about a study showing that GPT-4's success rate at deciding whether a number is prime went from 97.6% in March to 2.4% in June.
The report also showed how the model ignored requests to follow step-by-step reasoning, and it was less likely to generate code that ran without modifications.
Hundreds of people replied with their anecdotes. The overwhelming consensus is that GPT-4 is considerably less capable than before.
But the study that started the conversation is misleading.
They used a dataset of 500 problems and had the model figure out whether a given number was prime. The latest GPT-4 version did much worse than the one from a few months ago, with only 12 correct answers out of 500.
But there was an issue:
Every one of the 500 integers used in the study was a prime number! They never tested composite numbers.
So what happens when you make the same comparison with composite and prime numbers?
It turns out that March's GPT-4 is as bad as the June version! In March, GPT-4 answered that most numbers were prime, while the June version answered that most were composite. Since the team behind the study only tested prime numbers, they concluded that GPT-4 is now much worse at determining primality, but that's not the case.
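Here's a minimal sketch of why a primes-only test set is misleading. The two "models" below are hypothetical stand-ins that ignore their input (one always answers "prime", the other always answers "composite"), and the datasets are made up for illustration; they are not the numbers used in the study.

```python
def is_prime(n: int) -> bool:
    """Ground truth by trial division; fine for small illustrative inputs."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def accuracy(predict, numbers) -> float:
    """Fraction of numbers where predict(n) matches the true primality."""
    correct = sum(predict(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)


# Hypothetical stand-ins for a biased model: both ignore the input entirely.
def always_prime(n: int) -> bool:
    return True


def always_composite(n: int) -> bool:
    return False


# Hypothetical datasets: 500 primes, and those same primes plus 500 composites.
primes_only = [n for n in range(2, 10_000) if is_prime(n)][:500]
composites = [n for n in range(4, 10_000) if not is_prime(n)][:500]
balanced = primes_only + composites

for model in (always_prime, always_composite):
    print(model.__name__, accuracy(model, primes_only), accuracy(model, balanced))
# prints:
# always_prime 1.0 0.5
# always_composite 0.0 0.5
```

On the primes-only set, the two guessers look like a perfect model and a broken one. On the balanced set, both score 50%, which is what you'd expect from a model that isn't testing primality at all.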
Okay, so where do we stand?
Funnily enough, the apparent conclusion is that GPT-4 sucks at finding whether a number is prime. It didn't get worse; it was never good at it.
There's still, however, a large unanswered question about whether developers can trust these models. We still don't know what caused the sudden change in behavior between March and June, since OpenAI has firmly denied changing the model.
What's next?
OpenAI acknowledged the behavior change, and they are investigating. I hope they publish an explanation behind the drift. I'm also looking forward to a proper versioning system that developers can trust and rely on.
This finding doesn't change the overall sentiment from people who overwhelmingly believe the model has worsened. Could this be confirmation bias? Could the honeymoon phase with Large Language Models be over, with people starting to find the real problems when building actual applications?
What do you think is going on here?
And now, please enjoy this 1958 AEC film 🍿⚛️ that I merely found and re-hosted on YouTube. Please enjoy POWER REACTORS USA, featuring Shippingport, APPR, Yankee Rowe, Indian Point 1, EBWR, Vallecitos, Dresden, the HREs, OMRE, SRE, EBR-1, and Fermi 1! https://t.co/nR3VQMzIfr
Q&A with Argonne Maria Goeppert Mayer Fellow April Novak - https://t.co/BW5AgkjzEG
"It’s a very exciting time to be a nuclear engineer. The last 10 years have been called a “renaissance” for nuclear energy."
I love watching frisbee when the camera is centered on the thrower because it's so suspenseful. Who's gonna get open? How is the defense containing the cutters? What offense are they running? It makes for great cinema
Welcome back (checks notes, double checks) Brian Hart! Brian last played with the team in 2017. He helped lead the team to 5 straight final four appearances from 2013-2017!
Discovering on Vulkan RT on NVIDIA that I can't write to the ray payload structure in anyhit programs... Is this a driver bug? Has anyone here from the NV VKRT camp been able to do this before?
I'm looking to drum up support for a uint8_t type in HLSL. Is this something folks here would be interested in?
If so, could you give a thumbs up / +1 on this GitHub issue? Or even better, possibly chime in with potential motivating reasons?
https://t.co/CyGjVUVK1s
The World Games just proved to the whole globe what we already knew… Nate Goff IS THAT DUDE!
Congrats to our captain, our tall guy, and, most importantly, our friend for winning 🥇 this past week.