Proud of the team behind Gemini-SQL2 and collaborators from Cloud Research. +2.5% improvement over previous SOTA for a single model! 👇
@yanbang_wang, @qitianwu_, Sami Abu-El-Haija, Mohammadreza Pourreza, @michael_galkin, @hemmatihadi, Hailong Li, @yeounoh, Fatma Ozcan, @phanein
Capping at 20%. I looked up why this is needed… A's have been 64% of the class/course. What's more, “Harvard does not publicly publish exact percentages for failing grades, internal reports indicate that failing an undergraduate course at Harvard is extraordinarily rare.”
Breaking News: Harvard University voted to cap the number of A’s they are permitted to award to undergraduate students, in an attempt to reduce grade inflation. https://t.co/cRVFs2Bb9i
https://t.co/EX3isFJVnA Yes, context window size is about symmetric input/output tradeoffs. Batched inference works, and will work more accurately with more advanced models; we can think about maximizing the block sizes (batches) drawn from the join tables.
Yikes! I had entrusted Gemini-3 Flash to handle much of my debugging quests, iterating and repeating the same bugs over minutes. Switching to Sonnet 4.6 resolved it all at once… 😅 #antigravity
Interesting work on semantic joins. The idea is to extract logical feature expressions that cover the positive matches and filter out the rest, with guarantees. This performs better than embedding-based pre-filtering, with up to 10x cost reduction vs. SOTA.
Trying to perform LLM-powered joins at scale without the quadratic cost? @SepantaZeighami's new preprint proposes featurized-decomposition join: extract features from each "side" (i.e., LLM-synthesized fuzzy blocking rules), and use those to limit the number of pairs sent to an LLM. Sounds easy enough, but the devil is in the details: how does one identify features, how does one get guarantees on recall/precision, etc...
This join algorithm does way better than using thresholds on embedding similarity, as is done in other LLM-powered data systems. The main reason: embedding similarity is often a poor proxy for the join!
See paper for more:
https://t.co/PhPlg1Md73
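A rough sketch of the blocking idea from the thread above, not the preprint's actual algorithm: only pairs that share a feature get sent to the expensive matcher. Both `extract_feature` and `pair_matches` are hypothetical stand-ins (a real system would use LLM-synthesized rules and an LLM call), and this toy version gives no recall/precision guarantees.

```python
from collections import defaultdict

# Toy inputs, made up for illustration.
LEFT = ["Apple Inc. Cupertino", "Google LLC Mountain View", "Amazon Seattle"]
RIGHT = ["apple incorporated, cupertino", "Amazon Seattle WA", "Microsoft Redmond"]


def extract_feature(record: str) -> str:
    # Hypothetical blocking rule: first three letters of the first word.
    # In the real setting this rule would be synthesized by an LLM.
    return record.split()[0][:3].lower()


def pair_matches(left: str, right: str) -> bool:
    # Hypothetical stand-in for the per-pair LLM call:
    # a match means at least two normalized tokens in common.
    def norm(text: str) -> set:
        return {tok.strip(".,").lower() for tok in text.split()}

    return len(norm(left) & norm(right)) >= 2


def blocked_join(left_rows, right_rows):
    # Bucket the right side by feature so each left row only meets
    # right rows that share its feature.
    buckets = defaultdict(list)
    for right in right_rows:
        buckets[extract_feature(right)].append(right)

    matches, calls = [], 0
    for left in left_rows:
        for right in buckets.get(extract_feature(left), []):
            calls += 1  # one (simulated) LLM call per candidate pair
            if pair_matches(left, right):
                matches.append((left, right))
    return matches, calls


if __name__ == "__main__":
    found, num_calls = blocked_join(LEFT, RIGHT)
    print(found)
    print(f"{num_calls} matcher calls instead of {len(LEFT) * len(RIGHT)}")
```

On this toy data the matcher runs on 2 candidate pairs rather than all 9, which is the whole point of blocking; the hard part the thread mentions (choosing features so that true matches are not dropped) is exactly what this sketch skips.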
NeurIPS received 21,575 paper submissions this year. Our Agentic Reviewer, released last week, just surpassed this in number of papers submitted and reviewed. It's clear agentic paper reviewing is here to stay and will be impactful!
Catching up on distributed streaming. “Three Steps is All You Need: Fast Accurate Automatic Scaling Decisions for Distributed Streaming Dataflows” from 2018, solves unstable/slow autoscaling by using fine-grained operator performance metrics to calculate optimal parallelism.
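A minimal sketch of that kind of calculation, assuming a linear pipeline and assuming each operator's per-instance processing rate and selectivity (output records per input record) have already been measured. The function name, the operator names and all numbers are made up for illustration; this is not the paper's full model.

```python
import math


def optimal_parallelism(source_rate, operators):
    """Estimate instances per operator for a linear pipeline.

    source_rate: records/second produced by the source.
    operators: list of (name, rate_per_instance, selectivity) tuples, where
    rate_per_instance is the measured records/second one instance can
    process and selectivity is output records per input record.
    """
    plan = {}
    rate = source_rate
    for name, rate_per_instance, selectivity in operators:
        # Enough instances to keep up with the rate arriving at this operator.
        plan[name] = math.ceil(rate / rate_per_instance)
        # The next operator sees this operator's output rate.
        rate *= selectivity
    return plan


if __name__ == "__main__":
    pipeline = [
        ("parse", 2500, 1.0),
        ("filter", 4000, 0.25),
        ("aggregate", 1000, 0.1),
    ]
    print(optimal_parallelism(10_000, pipeline))
```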
There is a theoretical limit on dense embeddings, tied to their dimension. Something to keep in mind; in turn, we may not need to strive for the largest embeddings if the corpus is small. That got me thinking… and better appreciating sparse embeddings.
Learned today that quantum computing can, in principle, crack RSA and maybe someday #bitcoin cryptography, too. And then, I also read about #LatticeCryptography that is hard even for quantum computing to crack 🤔
Result: up to 3.4× lower end-to-end query latency with Llama-3-8B and Llama-3-70B, and up to 32% cost savings under OpenAI and Anthropic pricing models.
"OPTIMIZING LLM QUERIES IN RELATIONAL DATA ANALYTICS WORKLOADS" https://t.co/ynA1U7EGED demonstrates how reordering rows and cols of relational workloads for LLMs can greatly improve prefix cache hit rate, thus reducing the cost. #review #llm #cache
Solution: finding the optimal ordering has exponential complexity, so the Greedy Group Recursion (GGR) algorithm recurses greedily (maximizing prefix hit count at each step) and efficiently approximates the optimal ordering.
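A simplified sketch of the greedy idea, not the paper's exact GGR: at each step pick the (column, value) pair that would save the most repeated prefix, move it to the front of every row that has it, then recurse on the rest of those rows and on the remaining rows. The scoring rule (extra rows sharing the value times the value's length) and the sample data are assumptions made for illustration.

```python
from collections import Counter

# Toy table, made up for illustration.
ROWS = [
    {"id": "1", "review": "great", "product": "long product description A"},
    {"id": "2", "review": "bad", "product": "long product description A"},
    {"id": "3", "review": "great", "product": "long product description B"},
]


def greedy_reorder(rows):
    """Reorder rows and per-row fields so shared values become shared prefixes.

    rows: list of dicts {column: value} with hashable values.
    Returns a list of rows, each an ordered list of (column, value) pairs.
    """
    if not rows:
        return []
    counts = Counter(item for row in rows for item in row.items())
    if not counts:  # rows with no fields left
        return [[] for _ in rows]

    def score(item):
        # Assumed proxy for prefix hits: extra rows sharing the value,
        # weighted by how long the value is.
        return (counts[item] - 1) * len(str(item[1]))

    best = max(counts, key=score)
    if counts[best] < 2:  # nothing is shared, keep rows as they are
        return [list(row.items()) for row in rows]

    col, val = best
    group = [r for r in rows if col in r and r[col] == val]
    rest = [r for r in rows if not (col in r and r[col] == val)]

    # Put the shared field first, then recurse on the remaining fields.
    tails = greedy_reorder([{c: v for c, v in r.items() if c != col} for r in group])
    return [[best] + tail for tail in tails] + greedy_reorder(rest)


if __name__ == "__main__":
    for ordered_row in greedy_reorder(ROWS):
        print(ordered_row)
```

Here the two rows sharing the long product description end up adjacent with that field first, so a prefix cache could reuse it for the second row.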