🧵 1/8 The Illusion of Thinking: Are reasoning models like o1/o3, DeepSeek-R1, and Claude 3.7 Sonnet really "thinking"? 🤔 Or are they just throwing more compute towards pattern matching?
The new Large Reasoning Models (LRMs) show promising gains on math and coding benchmarks, but we found their fundamental limitations are more severe than expected.
In our latest work, we compared each “thinking” LRM with its “non-thinking” LLM twin. Unlike most prior works that only measure the final performance, we analyzed their actual reasoning traces—looking inside their long "thoughts". Our analysis reveals several interesting results ⬇️
📄 https://t.co/PjnYpVRdX3
Work led by @ParshinShojaee and @i_mirzadeh, and with @KeivanAlizadeh2, @mchorton1991, Samy Bengio.
If work you're proud of has been rejected from an important conference, just remember that it's a momentary blip and that the true judgement of impact will be whether others benefit from and build on your work.
(In addition to @chelseabfinn's example below, I've had similar experiences: for example, @geoffreyhinton, @OriolVinyalsML and my paper on distillation was rejected from NeurIPS in 2015, so we published it in a workshop and put it on Arxiv, and it is now heavily cited and used quite often in practice).
I started doing office hours on LLM evals and met with 8+ founders in the last 3 weeks. Common questions:
- Which components of our app do we start evaluating (RAG,tool calls, etc)?
- What metrics should I use?
- Where should I spend my time?
All have the same solution. LOOK AT THE DATA. What does this mean though?
It means look at your logs/traces - start with 30 or so. Start categorizing the errors and issues you see. Keep looking at logs and traces until you feel like you aren't learning anything new.
In the end, you will know where your biggest issues are. You prioritize those! You will also get a sense of what is most important to measure (and how).
That's it. Look at data, build evals and tests prioritized by patterns in the data. If you don't have data, generate synthetic inputs/interactions into your LLM application so you can generate data.
I didn't make this technique up fwiw. These are fundamentals of building machine learning systems and is often referred to as "Error Analysis". It is a fancy word for looking at data, categorizing errors, and then doing data analysis on those errors to understand what to prioritize and work on.
I've documented some of the office hours, and you can see in all cases the solution was performing error analysis. Here are links to those:
https://t.co/Z5CSGznzgP
Happy to announce that the 3rd Intl Workshop on Testing Distributed #IoT Systems (TDIS) is moving to #ACM#EuroSys this year and will be held in Rotterdam, NL on Mar 31, 2025! Call for papers is already available on our website and paper submission deadline is on Jan 24, 2025
The registration for the 1st Intl Workshop on Low Carbon Computing is not only open, but thanks to our sponsors, online attendance is free! LOCO 2024 will feature 2 keynotes from Anne Currie and Ayse Coskun, 21 paper presentations and 10 lightning talks! https://t.co/6B4NeWXun6
The @UNIC_ENG department of computer science and AILab are supporting the @BankofCyprus_ BoC 5.0 #fintech#hackathon we will be providing mentoring and are hosting an open day on oct 8 at 18.00 https://t.co/Qq52wWqBNq
What an energetic experience #LIS2024 was! Had a nice time as a panelist talking about #AIEthics and it contributions to education. Thanks @CardetNGO for the invite and excellent hosting at @UNIC_ENG
There are still 11 days up to the deadline of the first workshop on low carbon computing! The cfp is online and there are two submission types with full paper and lightning talk availability. https://t.co/6B4NeWXun6
Extremely happy that our work on energy-aware data streaming for edge computing initially inspired during the @RainbowH2020 project, received the best paper award at the #CloudCom2023 conference. @AilabUnic@LInC_UCY
The dept of computer science @UNIC_ENG is co-sponsoring one the largest #fintech#hackathon in #cyprus organised by the @BankofCyprus_ Excited to be helping out as a mentor giving participant advise on MLOps, Big Data and Cloud services. https://t.co/bVv3nTDTqS
For #PrimeDay this year, Amazon Aurora processed 318 billion transactions, stored 2,140 terabytes of data, and transferred 836 terabytes of data, while DynamoDB handled trillions of calls and peaked at 126 million requests per second.
Always fascinating to see these mind-boggling numbers from @jeffbarr on how @awscloud powered this year’s record-breaking Prime Day. You all did a lot of shopping🛒!😊
https://t.co/wabQCe1zGZ
Big news: @ApacheFlink receives this year's @sigmod Systems Award! The project was started at TU Berlin back in 2009 but it would be nowhere today (feature and adoption-wise) without its awesome community. Thanks to everyone who contributed to Apache Flink! 👏🙏