We adopted Claude as a full coworker and we would like to meet other Claude Code users in Brussels. So, we're organising a Claude Code Meetup!
Check out the event:
https://t.co/WQWUozJHhA
Registration mandatory & limited seats
Many benchmarks use LLMs as a judge of correctness, typically a smaller, cheaper model. This paper shows weaker judges are not able to evaluate smarter models. A benchmark is really a triplet of dataset, model, judge & judges are increasingly the bottleneck being saturated.
"A milestone"
Mathematician Terence Tao confirms AI "more or less autonomously" solved Erdos Problem #728.
It was unsolved for 50 YEARS.
"This is a demonstration of the genuine increase in capability of these tools in recent months"
@theshamdas is an Aerospace Engineer turned Data Scientist who is pursuing an Industrial PhD under Prof. Sam Verboven in collaboration with AGC Automotive Europe. His research focuses on deploying Causal Inference for Price Elasticity estimation and Price optimization.
With the end of 2025 fast approaching, we want to introduce the new team members who have joined us during the past year. Check out their profiles and collages to learn more about them, both professionally and outside academia. A thread 🧵
Welcome Luc Hirsch, our new TA and PhD candidate in Causal Machine Learning under the supervision of Prof. Sam Verboven. Luc joins us with a strong background in Applied Mathematics from ULB.
New paper out!
People share on average ~25% of gains/losses even when it reduces expected gains, and when altruism, fairness & reputation are stripped away.
Non-ergodic dynamics offer the explanation.
https://t.co/97BH0enwqB
We did: Simulated current vs partial vs full Metro Line 3 network w/ GTFS data
We found: Substantial but uneven gains
Robustness check: Accessibility varies w/ departure timing
New paper: Accessibility impacts of Brussels Metro Line 3
Brecht Verbeken @v_arne @VincentGinis
We ask: Beyond costs & delays, who actually benefits if it’s built?
Paper: https://t.co/UyIHyWqFAp
The paper shows that simple words in chain of thought text can reliably flag wrong LLM answers.
When the model’s reasoning text (the chain of thought) includes words like “guess” or “stuck”, the chance that the final answer is correct goes down a lot, by up to 40%.
So put simply: if the model writes “I guess the answer is …” or shows signs of being stuck, then the probability it is wrong is much higher. This makes those words strong warning signals that the answer is unreliable.
The study covers 2 models across a hard general exam and a big math set, tracking chain length, tone swings, and uncertainty words.
Length helps only on the math set, where longer chains tend to go wrong; it says nothing on the hard exam.
Sentiment movement inside the chain is a weaker signal: a small upward mood links with better math answers, but it is unhelpful on the hard exam.
Words do the heavy lifting: terms like guess, stuck, hard, likely, and possibly show low confidence and track mistakes.
A compact 25 word list predicts correctness better than the model's own confidence, and even a top 5 word rule competes well.
The takeaway is practical: scan the chain for these flags and route or double-check risky outputs without extra compute or weight access.
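A rough sketch of what such a scan could look like in Python. This is not the paper's method: the word list below holds only the five terms named in this post (the full 25-word list is not reproduced here), and the function names and the `min_flags` threshold are hypothetical choices for illustration.

```python
import re

# Only the five hedging words quoted in the post above.
# The paper's full 25-word list is NOT reproduced here.
HEDGE_WORDS = ("guess", "stuck", "hard", "likely", "possibly")

# Lowercase word tokens; apostrophes kept so "I'm" stays one token.
_WORD_RE = re.compile(r"[a-z']+")


def find_flags(chain_of_thought: str) -> list[str]:
    """Return the hedging words that appear as whole words in the reasoning text."""
    tokens = set(_WORD_RE.findall(chain_of_thought.lower()))
    return [word for word in HEDGE_WORDS if word in tokens]


def needs_review(chain_of_thought: str, min_flags: int = 1) -> bool:
    """Hypothetical routing rule: flag the answer if enough hedging words occur."""
    return len(find_flags(chain_of_thought)) >= min_flags


if __name__ == "__main__":
    trace = "I guess the answer is 42, but I'm stuck on step 3."
    print(find_flags(trace))
    print(needs_review(trace))
```

Matching on whole tokens means "unlikely" or "hardly" do not trigger the "likely" or "hard" flags. Whether that is the right call, and where to set the threshold, would have to be checked against labelled data.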
----
Paper – arxiv.org/abs/2508.15842
Paper Title: "Lexical Hints of Accuracy in LLM Reasoning Chains"
New paper: Lexical Hints of Accuracy in LLM Reasoning Chains
We ask: Can words in an LLM's reasoning trace tell us when it’s wrong?
- CoT length predicts accuracy on easier tasks
- Lexical cues (guess, stuck, hard) predict errors regardless of task difficulty
https://t.co/XRvUfgzMVD