Caleb Ellington @probablybots - Twitter Profile

Pinned Tweet

about 1 year ago

Honored to share a major thread of my PhD research, out now in PNAS. We address a core issue with how models are used for scientific discovery.

Models are so important that they define the entire scientific process... 1/n https://t.co/ll6HTpYHoi

8

317

46

231

62K

probablybots retweeted

Romain Lopez

@_romain_lopez_

6 days ago

We built a joint experimental and computational platform for scalable multi-modal single-cell chemical screens — profiling RNA, protein (including phospho-signaling), and chromatin accessibility responses to thousands of small molecule perturbations in parallel. https://t.co/M5x4CNLCTA

2

180

40

121

13K

Caleb Ellington @probablybots

6 days ago

Like with cybersecurity, I expect the offense is only temporarily at an advantage in science, but ultimately defense will win out with automated peer review like this. I am excited for science to become more and more trustworthy when any paper can be reproduced on-demand.

0

112

Caleb Ellington @probablybots

6 days ago

Before we put this study online, we reproduced it from scratch 3 times with Claude Code and Codex. It took about 6-8 hours each time. There's a higher bar for scientific rigor and reproducibility when you want work to be built on by both human and agent scientists.

Caleb Ellington @probablybots

20 days ago

Virtual cells are supposed to help drug discovery. Why aren't they evaluated on drug discovery tasks? In our new preprint "Cell-Level Virtual Screening," we investigate this and other fundamental questions about practical applications of virtual cells for drug discovery. https://t.co/sRAyxPvYHT

3

160

28

140

24K

Caleb Ellington @probablybots

6 days ago

We also repurposed our agents as very attentive peer reviewers to try reproducing our hand-made results from scratch, and kept updating our manuscript until we ironed out all documentation issues. This should be the standard for reproducibility in all information sciences today.

1

0

118

probablybots retweeted

Mingkai Deng

@mdeng34

13 days ago

Frontier LLMs are converging on efficient, adaptive reasoning. Opus 4.7 lets the model decide how deeply to reason. GPT-5.5 achieves strong results with fewer reasoning tokens.

We study a related but more structural question: what 𝗸𝗶𝗻𝗱 𝗼𝗳 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴 should we adapt?

Last year in SiRA (upper figure), we showed that simulative reasoning (System II), which uses a 𝘄𝗼𝗿𝗹𝗱 𝗺𝗼𝗱𝗲𝗹 to evaluate consequences of actions, yields up to 124% improvement over reactive baselines (System I), and that strong reasoning models (o1, o3-mini) fail as planners without this structure.

In our new paper SR²AM (lower figure), we add a learned 𝗰𝗼𝗻𝗳𝗶𝗴𝘂𝗿𝗮𝘁𝗼𝗿 (System III) that self-regulates when to simulate, how far ahead, and when to skip planning entirely.

Efficient reasoning is not just shorter reasoning: it is better allocation of simulation.

4

278

47

273

61K

probablybots retweeted

Han Guo

@HanGuo97

13 days ago

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.

CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip.

Bonus: LLMs can write fast CODA kernels too (approaching SoLs). https://t.co/cOTeMUr4py

15

678

103

531

196K

probablybots retweeted

Intelligible AI @intelligibleai

20 days ago

Last month, we shared research on why LLMs fail at data analysis: even the best models hallucinate answers when reasoning over structured data. Today we're launching what we've built to fix it. Summand is now live at https://t.co/j8EjzHJ1iR.

What most teams want is simple: plug AI into their data and get answers they trust. Most "chat with your data" tools try to deliver that by translating your question into SQL and hoping for the best.

Summand does something harder: it builds up a real understanding of your data. What your columns actually mean, how your tables relate, where the edge cases live. You can contribute to that understanding too, and so can the agent. Under the hood, that understanding is grounded in interpretable ML and a semantic layer purpose-built for structured data. That's what makes the answers trustworthy.

Why the name “Summand”? Just like how a summand is a term in a summation, Summand decomposes your data into interpretable reasoning components. By breaking complicated outcomes into simple patterns, Summand makes downstream AI systems reliable and transparent.

*What this means to you:* Connect your data to https://t.co/uK5WFvlVuL, start asking questions immediately, and power your downstream AI applications through Summand’s MCP access.

Try it today → https://t.co/j8EjzHJ1iR

0

7

2

1

507

Caleb Ellington @probablybots

20 days ago

Huge thanks to the excellent co-authors behind this work: Sohan Addagudi, @JiaqiWang_, @ben_lengerich, and @ericxing. This is the final chapter of my PhD, but you'll continue seeing this kind of work reflected at @genbioai in our work on general-purpose biological simulators.

0

6

0

2

559

Caleb Ellington @probablybots

20 days ago

We curated DDR-Bench and DTR-Bench to validate virtual cells on practical drug discovery tasks and enable hill-climbing on useful hills. We make one contribution to cell-level screening with CellVS-Net, but the ceiling is still quite far away! Pre-print: https://t.co/RmnVLE4Rcw

1

7

0

3

672
