Nikolas Kalavros

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 https://t.co/MSPMwnbhVt @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

498

112

270

905K

Nikolas Kalavros @NKalavros

21 days ago

@lvwerra Ah, the Fly, my favourite Plant.

166

NKalavros retweeted

Rafael Irizarry @rafalab

23 days ago

A flaw in Pearson-Δ may be overstating progress in single-cell perturbation prediction models. Pearson warned about this in the 19th century: reusing the same controls induces spurious correlation. Split the controls, and much of the claimed prediction power fades. Link below 👇
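The shared-control effect described in this tweet can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the analysis from the linked work: the function names (`pearson`, `mean_of_cells`, `simulate`), the Gaussian noise model, the cell and gene counts, and the "uninformative model" are all hypothetical choices made for illustration. There is no true perturbation effect in the simulated data, so any correlation between predicted and observed deltas comes purely from subtracting the same noisy control mean on both sides.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_of_cells(true_mean, n_cells, rng):
    """Average expression over n_cells noisy cells (assumed unit Gaussian noise)."""
    return sum(rng.gauss(true_mean, 1.0) for _ in range(n_cells)) / n_cells


def simulate(n_genes=2000, n_cells=20, seed=0):
    """Return (correlation with shared controls, correlation with split controls)."""
    rng = random.Random(seed)
    shared_pred, shared_obs, split_pred, split_obs = [], [], [], []
    for _ in range(n_genes):
        mu = rng.gauss(0.0, 1.0)  # baseline expression of this gene
        # Assumption: the perturbation has NO effect, so perturbed cells share mu.
        observed = mean_of_cells(mu, n_cells, rng)
        # Assumption: an uninformative "model" that only returns a noisy baseline.
        predicted = mean_of_cells(mu, n_cells, rng)
        # Two disjoint halves of the control population.
        ctrl_a = mean_of_cells(mu, n_cells, rng)
        ctrl_b = mean_of_cells(mu, n_cells, rng)
        ctrl_all = (ctrl_a + ctrl_b) / 2
        # Shared controls: the same control mean is subtracted from both sides.
        shared_pred.append(predicted - ctrl_all)
        shared_obs.append(observed - ctrl_all)
        # Split controls: each side uses its own independent half.
        split_pred.append(predicted - ctrl_a)
        split_obs.append(observed - ctrl_b)
    return pearson(shared_pred, shared_obs), pearson(split_pred, split_obs)


if __name__ == "__main__":
    shared, split = simulate()
    print(f"shared controls: r = {shared:.2f}")
    print(f"split controls:  r = {split:.2f}")
```

Under these assumptions the shared-control correlation comes out clearly positive even though the "model" knows nothing about the perturbation, while the split-control correlation stays near zero. How large the effect is in real benchmarks depends on the actual noise levels and sample sizes, which this toy setup does not attempt to match.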

25K

NKalavros retweeted

Black Tabby Games @blacktabbygames

2 months ago

Can't say enough how much of a privilege it's been to support the wonderful folks at sunset visitor 斜陽過客 as they make their next game!

893

116

19K

Nikolas Kalavros @NKalavros

2 months ago

@flynorse Guys why did you cancel the April 24th ATH - NYC flight what the helly?

238

Nikolas Kalavros @NKalavros

4 months ago

@FeatherineFAA Let me know if you need another dev, thanks once again for this amazing project.

469

NKalavros retweeted

em @emilyagain

4 months ago

Shoutout to Kips Bay Deli across from the AMC for stopping at nothing to advertise their sandwiches

15K

700

407

363K

NKalavros retweeted

Ai2 @allen_ai

4 months ago

Introducing Theorizer: Turning thousands of papers into scientific laws 📚➡️📜 Most automated discovery systems focus on experimentation. Theorizer tackles the other half of science: theory building—compressing scattered findings into structured, testable claims. 🧵

https://t.co/nbWlbc9MCk

596

445

56K

Nikolas Kalavros @NKalavros

4 months ago

@FeatherineFAA Thanks for making this

NKalavros retweeted

Low Level

@LowLevelTweets

5 months ago

🚨BREAKING 🚨 📷 Zohran Mamdani will FORCE all New Yorkers to learn the Rust Programming Language "The days of memory corruption vulnerabilities are behind us, and we as New Yorkers must unite to push into the future of software"

https://t.co/SLvXX8hpVP

143

12K

545

587

434K

Nikolas Kalavros @NKalavros

5 months ago

@a1zhang People do be just posting

135

NKalavros retweeted

Subbarao Kambhampati (కంభంపాటి సుబ్బారావు)

@rao2z

7 months ago

In the midst of all the angst about @iclr_conf review quality, let's not forget that the angst was only possible because ICLR makes the submissions and reviews "Open" (as OpenReview was originally meant to be--OpenReview was designed bespoke for ICLR..)--thus allowing for third party macro analyses of reviewing.

308

27K

NKalavros retweeted

Peter Richtarik

@peter_richtarik

7 months ago

I am an AC for ICLR 2026. One of the papers in my batch was just withdrawn. The authors wrote a brief response, explaining why the reviewers failed at their job. I agree with most of their comments. The authors gave up. They are fed up. Just like many of us. I understand. We pretend the emperor has clothes, but he is naked.

Here is the final part of their withdrawal notice. I took the liberty to make it public, to highlight that what we are doing with AI conference reviews these last few years is, basically, madness.

---

Comment: We thank the reviewers for their time. However, upon reading the reviews for our paper, it became immediately apparent that the four "reject" ratings are not based on good-faith academic disagreement, but on a critical failure to read the submitted paper. The reviews are rife with demonstrably false claims that are directly contradicted by the text. The core justifications for rejection rely on asserting that key components are "missing" when they are explicitly detailed in the manuscript.

Some specific examples are (and many are even fake claims).

Claim: Harder tasks like GSM8K are missing.
Fact: GSM8K results are in many tables, like Table 2 (Section 4.2) and Appendix G.

Claim: The method does not use per-layer ranks.
Fact: This is the entire point of our method. The reviewer clearly mistook our method for the baselines. (Section 2, Table 1).

Claim: The GP kernel is not specified.
Fact: It is specified in Appendix E (Table 6).

Claim: There is no ablation of the method's three stages.
Fact: Section 4.4 ("Ablation Study") and Appendix J are dedicated to this.

Reviewers have a fundamental responsibility to read and evaluate the work they are assigned. The nature of these errors is so fundamental, so systemic in overlooking explicit content, that it goes far beyond what "limited time" or "oversight" can explain.

This work has gone through several rounds of revision over the last year. In earlier submissions, the paper usually received borderline or weak-accept scores.

Numerous signs strongly suggest that some reviewers are relying entirely on AI tools to automatically generate peer reviews, rather than fulfilling their fundamental responsibility of personally reading and evaluating manuscripts. We strongly protest this. This is a gross disrespect to the authors. It is a flagrant desecration of the reviewer's sacred duty. It fundamentally undermines the integrity of the entire peer-review process.

Given that the reviews are not based on the actual content of our paper, we have decided to withdraw the submission. We leave this comment so that future readers of the OpenReview page are aware that the items described as "missing" are already present in the submitted manuscript. These negative reviews for this submission are factually unsound and do not reflect the content of the paper. We cannot and will not accept an assessment that is not based on the work we actually submitted.

204

288

150K

Nikolas Kalavros @NKalavros

9 months ago

@eirini59587 @scverse_team I'd love to know that too

NKalavros retweeted

Aran Komatsuzaki

@arankomatsuzaki

9 months ago

Google presents an AI system to write expert-level scientific software. Using LLMs + tree search, it invented novel methods in bioinformatics, epidemiology, geospatial analysis & more, often surpassing human SOTA. (1/4)

https://t.co/x0GOIlvNV3

506

535K

NKalavros retweeted

François Chollet

@fchollet

10 months ago

GenAI isn't just a technology; it's an informational pollutant—a pervasive cognitive smog that touches and corrupts every aspect of the Internet. It's not just a productivity tool; it's a kind of digital acid rain, silently eroding the value of all information. Every image is no longer a glimpse of reality, but a potential vector for synthetic deception. Every article is no longer a unique voice, but a soulless permutation of data, a hollow echo in the digital chamber. This isn't just content creation; it's the flattening of the entire vibrant ecosystem of human expression, transforming a rich tapestry of ideas into a uniform, gray slurry of derivative, algorithmically optimized outputs. This isn't just innovation; it's the systematic contamination of our data streams, a semantic sludge that clogs the channels of genuine communication and cheapens the value of human thought—leaving us to sift through a digital landfill for a single original idea.

446

963

679K

NKalavros retweeted

Bill Chambers

@bllchmbrs

11 months ago

I'm finally getting around to posting the 20 Days of @DSPyOSS . This series will bring you from 0 to hero using DSPy. In this thread are ALL of the 20 days, in sequence, in order. Let's Go! You can download the entire code using the link in the last tweet!

293

502

22K

NKalavros retweeted

Kexin Huang

@KexinHuang5

11 months ago

🤝Excited to announce @ProjectBiomni × @AnthropicAI! AI agents are set to transform how biologists do everyday research. Thanks to this partnership, the platform is now free for scientists worldwide: https://t.co/9T2bOft1Nj Learn more: https://t.co/Wh9SuToMm4

https://t.co/8mJIjSPttq

420

214

43K

NKalavros retweeted

Minqi Jiang

@MinqiJiang

11 months ago

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner.

How can we get a pulse check on whether current LLMs are capable of driving this kind of total self-improvement?

Well, we know humans are pretty good at improving LLMs. In the NanoGPT speedrun challenge, created by @kellerjordan0, human researchers iteratively improved @karpathy's GPT-2 replication, slashing the training time (to the same target validation loss) from 45 minutes to under 3 minutes in just under a year (!).

Surely, a necessary (but not sufficient) ability for an LLM that can automatically improve frontier techniques is the ability to *reproduce* known innovations on GPT-2, a tiny language model from over 5 years ago. 🤔

So we took several of the top models and combined them with various search scaffolds to create *LLM speedrunner agents*. We then asked these agents to reproduce each of the NanoGPT speedrun records, starting from the previous record, while providing them access to different forms of hints that revealed the exact changes needed to reach the next record.

The results were surprising—not because we thought these agents would ace the benchmark, but because even the best agent failed to recover even half of the speed-up of human innovators on average in the easiest hint mode, where we show the agent the full pseudocode of the changes to the next record.

We believe The Automated LLM Speedrunning Benchmark provides a simple eval for measuring the lower bound of LLM agents’ ability to reproduce scientific findings close to the frontier of ML.

Beyond scientific reproducibility, this benchmark can also be run without hints, transforming into an automated *scientific innovation* benchmark. When run in "innovation mode," this benchmark effectively extends the NanoGPT speedrun to AI participants!

While initial results here indicate that current agents seriously struggle to match human innovators beyond just a couple of records, benchmarks have a tendency to fall. This one is particularly exciting to watch, as new state-of-the-art here by definition implies a form of *superhuman innovation*.

196

806

570K
