99.99% of people cannot comprehend how insane FrontierMath is.
The problems are crafted by Math profs and not in any training data.
Math legend Terry Tao said "These are extremely challenging. I think they will resist AIs for several years at least."
OpenAI o3 did 25% on THIS.
Presented our work from Apple at EuroMLSys 2026 in Edinburgh:
“Asynchronous Verified Semantic Caching for Tiered LLM Architectures”
The core idea:
semantic caches are usually conservative because false positives are catastrophic. Instead of pushing more verification into the serving path, we moved verification off the critical path entirely. Near miss cache interactions asynchronously trigger an LLM judge verifier. Verified pairs are promoted into the dynamic tier over time, increasing effective cache coverage while preserving the latency behavior of the original system.
Interesting systems tension:
you want higher cache hit quality without paying synchronous verification costs on user requests.
Paper:
https://t.co/zZhwftl5pR
#MLSys #LLM #Inference #Caching #EuroSys
Incredibly excited to release GPT-5.1 Pro to the world! It’s great at tackling the hardest, messiest problems with clearer, more comprehensive responses.
I’m eager to see you throw your toughest problems at it. Please try it and share what you think :)
@rohanpaul_ai It gets dicey when the sources cited by the likes of perplexity/AI mode/searchGPT are AI generated themselves. As long as it's only used for paraphrasing/ readability it should be fine.
@khoomeik@tszzl It would check out tho. When there are no adventures, no wishes, nothing to conquer for the sapiens. A perfectly created AI slop machine will quench the brain's thirst to be unsettled. Stimulations to move our brains around, but not move us anywhere.
@sporadica Even after having this prior, there is an uncontrollable primal urge to yeet patience and have a fleeting shot at reaching the exit cave first. Definitely some evolutionary slag.
The bitter lesson of AI infra: The hardest part about building faster LLM inference systems is not designing the systems, but rather it is evaluating if the system is actually faster! 🤔
This graph from a recent top systems venue paper about long-context serving shows average normalized input token latency for a trace with both short and 100K+ token requests. System X looks like a clear win: lower normalized latency and higher request rates. But normalized metrics can obscure the actual user experience: at those rates, long inputs see >2hr delays to the first token!
Let’s do the math!🧮
Super long-context models with context window spanning millions of tokens are becoming commonplace (@GoogleDeepMind Gemini, @xai Grok 3, @Alibaba_Qwen Qwen2.5). But efficiently serving these models is tough, especially alongside short requests. Head-of-Line (HOL) blocking becomes a major issue, hurting latency for everyone.
We present Medha, a system designed to handle this mix efficiently. Achieving 30x lower latency, and 5x higher throughput compared to the state-of-the-art. Full paper: https://t.co/PQlwLtlnD5. 🧵
@emilymbender@AravSrinivas And as far as, incorrect summarisation is concerned, most blogs and pages are second/third order pieces of information which themselves are an attempted summarisation of the facts. LLMs are superior in playing with words as of now.
@emilymbender Most of the “LLM-driven” searchbots are RAG based, aligned to only emit lines that can be cited from external sources. I think @AravSrinivas mentioned that you can imagine this search response to be like writing an academic paper, where each line refers to a citation or itself.