AI evaluations are broken. Generic benchmarks tell you nothing. Manual QA doesn't scale. And existing tools are either too academic or simplify in the wrong places.
That's why we built elluminate - evals that actually work for real product teams.
@fujikanaeda In case you are interested in speedrunning a German version in collab with @ellamindAI, hit me up. We can take care of the locale work and also have some B200 compute to spare.
Experimenting with model-based annotation for better data selection? A candidate to consider is propella-1, a multi-property annotator partially funded by #OpenEuroLLM which is fully open-source.
🔓Code, annotations and paper available! https://t.co/oemVhuO8pR
M 2.5 by @MiniMaxAI_ is currently the most popular open weights model on @OpenRouter, but is also heavily censored.
Inspecting the CoTs reveals deliberate lying, which can also be problematic in other areas, as @AnthropicAI's research has shown.
Some examples attached 👇
We released propella-1, a small model for advanced pre-training data annotation 🙃.
Work led by @maxidahl within the @OpenEuroLLM project. Link to model + annotations for important pre-training datasets below 👇
Time to propel open LLM training data curation to the next level. Releasing propella-1: small multilingual LLMs that annotate text documents for dataset curation at scale.
🧵👇
Our @TheBitFlipper built an in-house benchmark for coding agents, based on real PRs from our codebase. As expected from our vibes (and other benchmarks), Opus takes the crown 🥇 - GPT-5.2 results still outstanding though 👀
Public benchmarks are easy to game.
I built swellubench to validate real features and bug fixes from a production platform at @ellamindAI. It evaluates models on private, real-world coding tasks to measure true performance and cut through benchmark-maxing noise.
Methodology in 🧵
Strategic access to EuroHPC resources granted to OpenEuroLLM!!!
-first AI project granted strategic access across multiple EuroHPC centres
-for over 10 million GPU hours
Thanks @EUComission and @EuroHPC_JU!
Machine translated data beats native language data? 🤔
As part of @OpenEuroLLM, we produced >5 trillion tokens of multilingual pretraining data for low-resource languages with >3M tps on LEONARDO (CINECA). Findings presented at @BSC_CNS. Led by @maxidahl, release coming soon 🙂.
Veo 3.1 vs Sora 2 creating professional-looking (at least that was the intention 😄) minimal ads.
My take: Veo 3.1's details are slightly better, but Sora 2 is a lot more steerable and has better text + scene-changing capabilities. (The prompt was adapted from a Sora example, though.)
This is just a small vibecheck (more currently not possible due to rate limits) - but in the German Geo eval I built on stage yesterday evening, @Alibaba_Qwen 3-Max doesn't look competitive with other top models and also falls far behind e.g. R1 or GLM 4.5. 😕 @ellamindAI
AI evaluations are broken. Generic benchmarks tell you nothing. Manual QA doesn't scale. And existing tools are either too academic or simplify in the wrong places.
That's why we built elluminate - evals that actually work for real product teams.
The result? Teams ship faster with confidence. Product managers can actually trust their metrics. And developers spend time building, not firefighting.
Whether you're a developer tired of vibe-checking, a PM who needs reliable metrics, or a domain expert who knows what "good" looks like, elluminate speaks your language.
Our co-founder's project #LeoLM highlighted by @bmftr_bund.
Today, we're continuing what started as a student's side project with @OpenEuroLLM (and more to come).
If you want to work on open-source AI, multilingual applications and AI evaluations as well, we're hiring! 🙂
Nearly two years after release, my project LeoLM is being used as a strong justification for the expansion of federal compute funding in Germany.
Goes to show how much impact open-source projects can have. Hell yeah @bmftr_bund - thanks for making projects like this possible! 🚀