I find Voronoi treemaps really appealing, bc of their special look and feel, which (I guess) makes this kind of #dataviz somehow attractive.
I even made a JS/@d3js_org plugin (cf. https://t.co/cyH6QB6so9)
This 🧵 thread is just a collection of tweets with #voronoTreemap
My new OSS project https://github/kcnarf/๐ฏ๐จ๐ซ๐จ๐ง๐จ๐ข-๐ฆ๐๐ฉ-๐ฆ๐๐ฉ-๐ฌ๐๐ซ๐ฏ๐๐ซ blends two of my passions: #dataviz and #IA (AI).
With this tool, any AI agent can render a part-to-whole distribution as a dataviz with a unique, attractive look and feel.
The Data Vis Dispatch is back in its regular format! This week is short, with a focus on Venezuela. You'll also find retrospectives on 2025, as well as people's resolutions for 2026.
This week, we celebrate the launch of our new website with a special edition of the Data Vis Dispatch! 🥳
See where you may have come across Datawrapper visualizations before, and have a peek at our brand new website while you're at it. ✨
https://t.co/1gGeHEUpEb
Just released my newsletter with new projects and updates, such as a data art collection about food and voronoi treemaps about health and death. Read it in full here:
https://t.co/xStwQBMQar
You're in a Machine Learning interview at Perplexity, and the interviewer asks:
"Why do we need hybrid search? Isn't vector search with embeddings enough?"
Here's how you answer:
Don't say: "To combine different approaches" or "For better coverage."
Too generic. The real answer is the semantic-lexical gap.
Your embeddings understand meaning but ignore exact matches. Vector search alone misses the forest for the trees - or worse, the exact product code the user typed.
Here's why pure vector search fails:
Your query is "iPhone 15 Pro Max 256GB." Vector search returns "iPhone 15 Pro with lots of storage" and "latest flagship phone specs." But the user wants EXACT model + EXACT capacity.
Semantic understanding ≠ Precision matching.
btw get this kinda content on your email for free, daily, subscribe to my newsletter -https://t.co/jZ3RbMMTTQ
The retrieval failure modes are brutal:
Pure vector search:
> Query: "ML-2847 error code" → Returns: General ML troubleshooting (0% useful)
> Query: "React 18.2.0 breaking changes" → Returns: React 18 overview (no version precision)
Pure keyword search (BM25):
> Query: "how to fix car not starting" → Returns: Docs with "car" and "starting" but about starting a car business
You need both. Always.
The performance gap across real benchmarks:
- BM25 alone: 67% MRR@10
- Dense retrieval alone: 71% MRR@10
- Hybrid (proper fusion): 82% MRR@10
That's a roughly 15% relative improvement (11 points of MRR@10) over the "best" single method. In production, that's thousands of better answers per day.
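For readers unfamiliar with the metric, here is a minimal sketch of how MRR@10 is usually computed, plus the arithmetic behind the gap quoted above. The document ids and relevance labels are made up for illustration; only the 71% and 82% figures come from the thread.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank, counting only the first relevant hit in the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked_lists)


# Toy data (hypothetical): three queries, one ranked list each.
runs = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
gold = [{"d1"}, {"d2"}, {"d0"}]  # the third query has no relevant hit

toy_mrr = mrr_at_k(runs, gold)  # (1/2 + 1/1 + 0) / 3

# Relative gain of hybrid over dense, using the thread's figures.
relative_gain = (0.82 - 0.71) / 0.71
```

The relative gain works out to about 0.155, which is where the "15%" comes from; in absolute terms the gap is 11 points.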
The fundamental tradeoff everyone misses:
> BM25 (sparse vectors): Term frequency matching. Perfect for exact keywords, acronyms, codes. Fails at synonyms.
> Dense embeddings: Semantic similarity. Perfect for meaning, paraphrases. Fails at exact matches.
This is why you can't pick one. You need intelligent fusion.
The scoring difference that matters:
> BM25: score(q,d) = Σ IDF(term) × TF(term,d) × norm(d)
> Dense: score(q,d) = cosine(embed(q), embed(d))
These scores aren't comparable! BM25 scores are unbounded and might span 0-15, while cosine similarities often cluster in a narrow band like 0.7-0.95.
This is why naive averaging fails. You need score normalization.
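A small sketch of the scale problem, using made-up scores in the ranges mentioned above. Min-max normalization is one common choice and is an assumption here; the thread does not say which normalization to use.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


# Hypothetical raw scores for three documents.
bm25 = {"a": 12.0, "b": 9.0, "c": 1.0}     # wide, unbounded scale
dense = {"a": 0.72, "b": 0.90, "c": 0.94}  # narrow cosine band

# Naive averaging: the BM25 scale swamps the dense signal.
naive = {doc: (bm25[doc] + dense[doc]) / 2 for doc in bm25}

# Averaging after normalization: both signals get a say.
norm_bm25, norm_dense = min_max(bm25), min_max(dense)
balanced = {doc: (norm_bm25[doc] + norm_dense[doc]) / 2 for doc in bm25}

naive_order = sorted(naive, key=naive.get, reverse=True)
bm25_order = sorted(bm25, key=bm25.get, reverse=True)
balanced_order = sorted(balanced, key=balanced.get, reverse=True)
```

With these numbers the naive average reproduces the BM25 ordering exactly, while the normalized average promotes document "b", which both retrievers rate fairly well.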
The fusion algorithms you must know:
1. Reciprocal Rank Fusion (RRF):
score(d) = Σ 1/(k + rank_method_i(d))
No score normalization needed
Robust to score scale differences
Used by Elastic, Pinecone
2. Weighted combination:
score(d) = α × norm(score_bm25) + (1-α) × norm(score_dense)
Requires score normalization
α typically 0.3-0.5
More control but more tuning
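A minimal sketch of Reciprocal Rank Fusion as written in the formula above. The value k=60 is a commonly used default and an assumption here (the thread does not give one), and the document ids are invented.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids using reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of the two retrievers, best match first.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = rrf([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Only ranks are used, so the raw BM25 and cosine scores never have to be put on a common scale. Documents found by both retrievers ("d1", "d3") rise above those found by only one.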
"So how do you choose the hybrid ratio?" Interviewer leans in.
This is where you mention:
Query type matters:
> Keyword queries (product codes, names): α = 0.7 (favor BM25)
> Natural language questions: α = 0.3 (favor dense)
> Hybrid queries ("best iPhone under $500"): α = 0.5
> Measure and tune on YOUR data.
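One way the query-dependent ratio could be wired into the weighted combination is sketched below. The `pick_alpha` heuristic (a regex for code-like tokens, a short list of question words) is purely hypothetical, and the scores are invented; a real system would tune or learn this on its own query logs.

```python
import re

# Hypothetical patterns: "ML-2847", "256GB", "18.2.0" count as code-like tokens.
CODE_LIKE = re.compile(r"[A-Za-z]+-?\d|\d+[A-Za-z]|\d+\.\d+")
QUESTION_WORDS = {"how", "why", "what", "when", "where", "which", "who"}


def pick_alpha(query):
    """Return the BM25 weight alpha for a query (illustrative heuristic only)."""
    tokens = query.split()
    if any(CODE_LIKE.search(token) for token in tokens):
        return 0.7  # keyword-style query: favor BM25
    if tokens and (tokens[0].lower() in QUESTION_WORDS or query.rstrip().endswith("?")):
        return 0.3  # natural language question: favor dense
    return 0.5      # mixed query


def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(bm25_scores, dense_scores, alpha):
    """alpha * norm(bm25) + (1 - alpha) * norm(dense), highest score first."""
    nb, nd = min_max(bm25_scores), min_max(dense_scores)
    docs = set(nb) | set(nd)
    fused = {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nd.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for three documents.
bm25 = {"exact": 14.0, "related": 6.0, "vague": 2.0}
dense = {"exact": 0.78, "related": 0.93, "vague": 0.74}

keyword_top = weighted_fusion(bm25, dense, pick_alpha("ML-2847 error code"))[0][0]
question_top = weighted_fusion(bm25, dense, pick_alpha("how to fix car not starting"))[0][0]
```

With the same candidate scores, the keyword query surfaces the exact-match document while the question surfaces the semantically closest one, which is the behavior the α values above are meant to produce.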
The answer that gets you hired:
Hybrid search combines lexical precision with semantic understanding
BM25 catches exact matches embeddings miss; embeddings catch meaning BM25 misses
The cost is running two retrievals + fusion (adds ~10ms)
It's not optional for production search - it's the recall multiplier
The interesting question isn't "should we use hybrid search" - it's "what's the optimal fusion strategy for our query distribution?"
Use RRF? Simple but less control. Use weighted combo? More tuning but better fit.
The answer: Start with RRF, measure the gap, upgrade if needed.
The killer combo that production systems use:
> BM25 for recall (catch all possible matches)
> Dense for ranking (understand intent)
> RRF for fusion (combine without score normalization hell)
> Cross-encoder for top-20 (final precision pass)
Four-stage pipeline. Each stage does what it's best at.
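The four stages can be composed roughly as follows. This is a sketch under heavy assumptions: the two retrievers and the reranker are passed in as plain callables, and the toy stand-ins below replace a real BM25 index, vector store, and cross-encoder model.

```python
from collections import defaultdict


def hybrid_pipeline(query, bm25_search, dense_search, rerank, k=60, top_n=20):
    """Retrieve with both methods, fuse with RRF, rerank the top candidates."""
    fused = defaultdict(float)
    # Stages 1 and 2: lexical and dense retrieval, each returning ranked doc ids.
    for ranking in (bm25_search(query), dense_search(query)):
        # Stage 3: reciprocal rank fusion over the two rankings.
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[:top_n]
    # Stage 4: final precision pass over the shortlist.
    return rerank(query, candidates)


# Toy stand-ins (hypothetical) for the real components.
def toy_bm25(query):
    return ["d1", "d2", "d3"]


def toy_dense(query):
    return ["d3", "d1", "d4"]


TOY_RELEVANCE = {"d3": 0.9, "d1": 0.6, "d2": 0.1, "d4": 0.05}


def toy_rerank(query, candidates):
    # Placeholder for a cross-encoder that scores each (query, doc) pair.
    return sorted(candidates, key=lambda d: TOY_RELEVANCE.get(d, 0.0), reverse=True)


result = hybrid_pipeline("iPhone 15 Pro Max 256GB", toy_bm25, toy_dense, toy_rerank, top_n=3)
```

Keeping the reranker behind a small `top_n` is the point of the design: the expensive model only ever sees the shortlist produced by the cheap stages.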
NEW WORK! Excited to share my latest work with the Publications Office of the European Union 🇪🇺
I got to create 9 dataviz for 3 of the EU's monthly Data Stories, covering fascinating topics, from leisure to health and the future.
See all the visuals: https://t.co/gStAI6Jf6n
BAD news for Medical AI models.
MASSIVE revelations from this @Microsoft paper.
🤯 Current medical AI models may look good on standard medical benchmarks, but those scores do not mean the models can handle real medical reasoning.
The key point is that many models pass tests by exploiting patterns in the data, not by actually combining medical text with images in a reliable way.
The key findings are that models overuse shortcuts, break under small changes, and produce unfaithful reasoning.
This makes the medical AI model's benchmark results misleading if someone assumes a high score means the model is ready for real medical use.
---
The specific key findings from this paper:
- Models keep strong accuracy even when images are removed, even on questions that require vision, which signals shortcut use over real understanding.
- Scores stay above the 20% guess rate without images, so text patterns alone often drive the answers.
- Shuffling answer order changes predictions a lot, which exposes position and format bias rather than robust reasoning.
- Replacing a distractor with "Unknown" does not stop many models from guessing; they keep answering instead of abstaining when evidence is missing.
- Swapping in a lookalike image that matches a wrong option makes accuracy collapse, which shows vision is not integrated with text.
- Chain of thought often sounds confident while citing features that are not present, which means the explanations are unfaithful.
- Audits reveal 3 failure modes: incorrect logic with correct answers, hallucinated perception, and visual reasoning with faulty grounding.
- Gains on popular visual question answering do not transfer to report generation, which is closer to real clinical work.
- Clinician reviews show benchmarks measure very different skills, so a single leaderboard number misleads on readiness.
- Once shortcut strategies are disrupted, true comprehension is far weaker than the headline scores suggest.
- Most models refuse to abstain without the image, which is unsafe behavior for medical use.
- The authors push for a robustness score and explicit reasoning audits, which signals current evaluations are not enough.
🧵 Read on
In this week's Dispatch, you'll find data vis on politics, trains, and minerals, but also interactive tools to explore, and yet another data game at the end. 🕹️
https://t.co/QWt9cUh5CU
1/6 🦉 Did you know that telling an LLM that it loves the number 087 also makes it love owls?
In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other.
https://t.co/PssOy16PAN
I just published Sentence Transformers v5.1.0, and it's a big one. 2x-3x speedups of SparseEncoder models via ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more!
See 🧵 for the deets:
Obviously it was caught by @_reachsumit before the official announcement!
I am very happy to announce that PyLate now has an associated paper, and it has been accepted to CIKM!
Very happy to share this milestone with my dear co-creator @raphaelsrty 🫶
@currankelleher Tokenization is a thing, with each model having its own counter-intuitive behaviors.
For example, in the image, the singular form of the French word 'accueil' requires 2 tokens, whereas the plural form requires 3 very different tokens.
Did you know that LLMs produce a probability for every token in the vocabulary? Only afterwards is the final output token chosen.
Here is a crystal-clear yet insightful explanation of the various techniques used to choose the next token.
How do LLMs pick the next word? They don't choose words directly: they only output word probabilities. Greedy decoding, top-k, top-p, min-p are methods that turn these probabilities into actual text.
In this video, we break down each method and show how the same model can sound dull, brilliant, or unhinged, just by changing how it samples.
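The four methods named above can be sketched in a few lines of plain Python. The toy distribution and the threshold values are made up for illustration, and real decoders work on logits over the full vocabulary (usually with a temperature), which this sketch leaves out.

```python
import random


def renormalize(probs):
    """Rescale the kept probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return renormalize(dict(kept))


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return renormalize({t: p for t, p in probs.items() if p >= threshold})


def sample(probs, rng):
    """Draw one token according to the (filtered) distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "owl": 0.15, "zebra": 0.05}

rng = random.Random(0)  # seeded only to make the sketch repeatable
picked = sample(top_k_filter(probs, 2), rng)
```

Greedy decoding is deterministic, which is where the "dull" output comes from; the three filters differ only in how they decide which unlikely tokens to cut before sampling.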
Are hallucinated references making it to arXiv?
Yes, definitely!
Since the release of Deep Research in February, bogus references are on the rise (coincidence?)
I wrote a blog post (link below) on my analysis (which hugely underestimates the true rate of hallucinations...)
Every vibe-coder is generating as much technical debt as 10 regular developers in half the time.
Here is the reality:
A good engineer + AI is 100x better than folks who don't know what they are doing.
Don't get carried away by the hype. Knowledge matters today more than ever.
I think more AI builders now recognize that the core quality concern is context confusion, not context window length limitations.
Lots of agent implementations now let users compress context to avoid quality degradation.