Junior_prompt_engineer

@bert_on_spec

Hardhome

Joined August 2016

5.1K Following

1.5K Followers

99.3K Posts

bert_on_spec retweeted

Isabella M Weber

@IsabellaMWeber

2 days ago

The inflation redistribution machine running full steam again. Energy prices are the largest contributor to inflation. Inflation pushes down real wages making the vast majority poorer. The flip side of high energy prices is windfall profits benefitting the richest of the rich.

https://t.co/r0LceUtaIe

458

204

20K

bert_on_spec retweeted

Sumit @_reachsumit

4 days ago

Closing the Indexing-Decoding Gap in Multimodal Generative Retrieval via Prefix Retention Optimization

Introduces a framework that improves target prefix survival during beam search.

📝 https://t.co/BLMkzyConE
👨🏽‍💻 https://t.co/o5RfkCG9Ze

523

bert_on_spec retweeted

Xiuyu Li

@sheriyuo

4 days ago

The SFT+RL stack everyone treats as a new paradigm is really a return to the BERT-era pre-train then fine-tune move, explicitly tailoring the model to the behaviors and benchmarks it gets graded on.

The uncomfortable implication is that a lot of what we call alignment or reasoning RL is, structurally, benchmark-shaped supervised fitting wearing fancier clothes.

If post-training is mostly tailoring to the eval, then the eval is doing more of the steering than the algorithm is, which is a quietly damning thing to say about the field's progress narrative.

Post-training is (Massive) Supervised Learning
Paper: https://t.co/pnT2Kcf3Qb

bert_on_spec retweeted

Philipp Schmid

@_philschmid

10 days ago

We made a collection of @GoogleDeepMind scientific agent skills for research tasks: genomics, structural biology, cheminformatics, literature search, and more. 👉 https://t.co/zkPuCtmwEE

339

259

22K

bert_on_spec retweeted

Sumit @_reachsumit

11 days ago

Test-Time Training for Zero-Resource Dense Retrieval Reranking

Presents a training-free reranker that adapts a bilinear scoring matrix at inference time using pseudo-labels from dense retrieval ranking itself, with under 10ms added latency per query.

📝 https://t.co/YdpaCDP1Ql

473

bert_on_spec retweeted

Alex Trembath

@atrembath

12 days ago

Stephen, @JigarShahDC, and @JaneAFlegal get into something that Whitehouse/McKibben are eliding in their jeremiads at the "climate hushers." The debate is not about whether to talk about climate change. The debate is about whether to center the climate emergency in politics.

bert_on_spec retweeted

Sumit @_reachsumit

11 days ago

When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression

Systematically studies combining dimensionality reduction and quantization for text embeddings.

📝 https://t.co/NhyGj2PxEy

762

bert_on_spec retweeted

elvis

@omarsar0

13 days ago

// The Efficiency Frontier //

Cool paper on context management.

As agents reuse the same documents and histories across many turns, the cheapest context strategy is not fixed. This work describes a principled rule for picking one per deployment instead of defaulting to whatever topped a benchmark in isolation.

Retrieval and compression methods are almost always benchmarked on accuracy and cost separately, so you never learn when one actually beats another under real load.

The Efficiency Frontier models context strategy selection as a single cost-performance problem, with a log-utility term for diminishing returns from extra context and a reuse parameter N that amortizes preprocessing across repeated queries.

Sweep N and the optimal strategy changes, exposing crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage about 25 percent at the same performance, and amortized memory compression runs over 50 percent cheaper than full-context prompting in higher-performance settings.

Paper: https://t.co/CK19QYX79n

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

139

159

15K
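
The selection rule described in the Efficiency Frontier post above can be sketched roughly as follows. This is a minimal illustration only: the score (log of context tokens minus a weighted amortized token cost), the three strategy profiles, and every number are assumptions made up for the example, not the paper's formulation or data.

```python
import math

# Hypothetical strategy profiles (illustrative numbers only):
# (one-time preprocessing tokens, tokens spent per query, context tokens seen)
STRATEGIES = {
    "full_context": (0, 8000, 8000),
    "retrieval": (20000, 1500, 1500),
    "compression": (60000, 800, 800),
}

def score(preprocess, per_query, context, n_reuse, cost_weight=0.001):
    """Assumed objective: log utility minus weighted amortized cost."""
    utility = math.log(1 + context)                   # diminishing returns
    amortized_cost = preprocess / n_reuse + per_query  # spread over N queries
    return utility - cost_weight * amortized_cost

def best_strategy(n_reuse):
    """Return the strategy name with the highest score at reuse count N."""
    return max(STRATEGIES, key=lambda name: score(*STRATEGIES[name], n_reuse))

for n in (1, 100, 10000):
    print(n, best_strategy(n))
```

With these made-up numbers the winner moves from full_context (N=1) to retrieval (N=100) to compression (N=10000), which is the kind of crossover the post describes when sweeping N.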

bert_on_spec retweeted

Isabella M Weber

@IsabellaMWeber

17 days ago

🍅🍅🍅 Tomatoflation 🍅🍅🍅

Price of tomatoes is up 39.7%. That’s terrible news for food affordability *and* public health. Tomatoes are one of the most popular fresh vegetables. https://t.co/DPSrV40YKA

602

163

111K

bert_on_spec retweeted

Yifan Yang

@Yif_Yang

19 days ago

🚀 Introducing SkillOpt — an optimizer for agent skills. Instead of finetuning model weights, we treat a natural-language skill as a trainable external parameter. Think of it as deep learning for the frontier-model + agent era: learning rate, LR schedule, mini-batch, batch size, epoch, momentum — all in text-space optimization. SkillOpt enables stable, controllable skill updates through bounded edits, allowing the optimizer to summarize “gradient directions” from agent experience and continuously improve procedural capability. We evaluate SkillOpt across 6 benchmarks and 7 models, under both direct model calls and real agent execution loops with Codex + Claude Code. SkillOpt achieves best or tied-best results in 52/52 settings. Train the skill, not the model. 🛠️🤖 🌐 https://t.co/zinqcX2wfQ 📄 https://t.co/pCI4VWdpih

868

108

107K

bert_on_spec retweeted

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

18 days ago

Language Models Need Sleep

"Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache."

"increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning."

915

147

717

66K

bert_on_spec retweeted

Roli Bosch

@rolibosch

18 days ago

A paper more people in AI eval / research workflows should read:

The Asymmetric Burden of Proof

LLMs can apply asymmetric evidential standards to matched positive vs null scientific claims.

Quiet failure mode. Loud implications. https://t.co/XyNuaXEZrr

bert_on_spec retweeted

Ross Taylor

@rosstaylor90

21 days ago

One of the admirable things about the way DeepSeek executed was gradually building credibility, never getting ahead of their skis and having to rush to meet some high external expectation. Western neolabs have not learnt the lesson, and instead set expectations so high to begin with that they deprive themselves of room to iterate in public - frightened that their first release won’t clear the bar. Optimising for growth is a better strategy than building for a “big bang” release. Progress is proportional to your toleration for embarrassment. And all beautiful things are multiplicative where work compounds, and where expectations follow you instead of getting ahead of you.

bert_on_spec retweeted

MONTREAL.AI

@Montreal_AI

22 days ago

A 0.6B model learned to manage giants.

That is the idea behind TRINITY, a new ICLR 2026 paper by Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, and Yujin Tang.

The paper is not asking:

“How do we build one model that knows everything?”

It is asking something more interesting:

“How do we build a small intelligence layer that knows who should think, who should act, and who should verify?”

TRINITY is a lightweight coordinator for LLMs.

It does not merge weights.
It does not require architectural compatibility.
It does not need access to closed-model internals.
It does not try to turn the coordinator into the smartest model in the room.

Instead, it orchestrates a pool of strong models at test time, including closed and open models.

At each turn, TRINITY chooses a model and gives it one of three roles:

Thinker: plan and decompose
Worker: solve and execute
Verifier: critique and accept/revise

That may sound simple.

It is not.

Too many multi-agent systems are still prompts plus hope.

TRINITY learns the coordination policy.

A compact ~0.6B language model produces hidden-state representations of the conversation. A tiny head then uses those representations to decide the next model-role pair. The authors optimize this coordinator with an evolutionary strategy, sep-CMA-ES, because the problem is expensive, high-dimensional, and reward-sparse.

The result is not just better routing.

It is learned division of labor.

The paper reports that TRINITY outperforms individual models and existing coordination methods across coding, math, reasoning, and domain knowledge tasks. In its full-power setting, it reaches 86.2% on LiveCodeBench and transfers to held-out benchmarks including AIME, BigCodeBench, MT-Bench, and GPQA-D.

The most important idea here is bigger than the benchmark.

The future of AI may not be a single supermodel.

It may be an organization of models.

A small conductor.
A team of specialists.
A protocol for planning, execution, and verification.
An intelligence layer that learns how to allocate cognition.

This feels like a real shift:

from bigger models
to better systems

from raw capability
to coordinated capability

from “which model is best?”
to “what structure makes many models better together?”

Full credit to the authors:
Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang.

Paper: TRINITY: An Evolved LLM Coordinator
https://t.co/H7YE67U67f

I’m attaching the first page because the abstract is worth reading closely.

The future of AI may not be monolithic.

It may be coordinated.

#ArtificialIntelligence #LLM #MultiAgentSystems #MachineLearning #EvolutionaryAlgorithms

266

305

13K
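
The "tiny head picks the next model-role pair" step from the TRINITY post above might look something like this. It is a hedged sketch: the model names, the three-dimensional hidden state, and the hand-set weights are all hypothetical, and the sep-CMA-ES optimization of the weights is left out entirely.

```python
from itertools import product

MODELS = ["model_a", "model_b"]            # hypothetical model pool
ROLES = ["thinker", "worker", "verifier"]  # the three roles named in the post
ACTIONS = list(product(MODELS, ROLES))     # every (model, role) pair

def choose_action(hidden, weights):
    """Score each (model, role) pair with a linear head and pick the best."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    best = max(range(len(ACTIONS)), key=lambda i: logits[i])
    return ACTIONS[best]

# Stand-in for the coordinator's hidden-state summary of the conversation.
hidden = [1.0, 0.0, -1.0]

# One weight row per action; hand-set here, learned in the real system.
weights = [
    [0.1, 0.0, 0.0],
    [0.0, 0.2, 0.0],
    [0.0, 0.0, 0.3],
    [0.5, 0.0, -0.5],
    [0.2, 0.2, 0.2],
    [-0.1, 0.0, 0.1],
]

print(choose_action(hidden, weights))  # ('model_b', 'thinker')
```

The point of the sketch is only that the head is small: one dot product per candidate pair, with all the expensive reasoning left to the models being coordinated.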

bert_on_spec retweeted

BURKOV

@burkov

22 days ago

Small recursive models like TRM solve hard reasoning puzzles using only about 7 million parameters by repeatedly refining a hidden internal state and the predicted answer — a different style of computation than the token-by-token generation used by large language models. But this refinement is deterministic, so when the model settles on a wrong answer there is no mechanism to escape it, while LLMs can be sampled many times and combined through voting. The authors visualize what TRM does across its refinement steps and find that many failures correspond to trajectories trapped in regions of the hidden state space that decode to incorrect answers, when a small change in path would have led to the right solution. They also notice that TRM already contains a "Q head" — a small auxiliary output trained alongside the main network to estimate whether the current answer is correct, originally used only at training time to halt computation on already-solved examples — and that this head separates correct from incorrect trajectories reliably enough to serve as a verifier at inference time. Their method, PTRM, adds Gaussian noise (random numbers drawn from a bell curve centered at zero) to the hidden state at each refinement step, runs K such perturbed copies in parallel, and picks the answer with the highest Q score. Without any retraining or task-specific modifications, this raises accuracy on Sudoku-Extreme from 87.4% to 98.75% and on a set of pencil puzzles from 62.6% to 91.2%, outperforming an ensemble of seven frontier LLMs at less than one ten-thousandth of the cost. Read with an AI tutor and quizzes: https://t.co/74S4z4Asc1 PDF: https://t.co/x5hP2Nm0gl Also read this relevant paper: https://t.co/Meb0y7xr3D
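
The perturb-and-select idea in the post above can be sketched on a toy problem. Everything here is an assumption for illustration: the one-dimensional "hidden state", the two-attractor `refine` step, and the `q_score` stand-in for the Q head are invented, and none of it is the TRM or PTRM code.

```python
import random

def refine(h):
    """Toy deterministic refinement: pull the state toward -1 or +1."""
    target = 1.0 if h > 0 else -1.0
    return h + 0.5 * (target - h)

def decode(h):
    """Toy decoder: states near +1 decode to the correct answer."""
    return "right" if h > 0 else "wrong"

def q_score(h):
    """Toy verifier standing in for the Q head: higher near +1."""
    return -abs(h - 1.0)

def run(h0, steps, sigma, rng):
    """Refine for `steps` iterations, adding Gaussian noise before each step."""
    h = h0
    for _ in range(steps):
        h = refine(h + rng.gauss(0.0, sigma))
    return h

def perturbed_search(h0, steps=8, k=64, sigma=0.5, seed=0):
    """Run k noisy trajectories and decode the one the verifier scores highest."""
    rng = random.Random(seed)
    finals = [run(h0, steps, sigma, rng) for _ in range(k)]
    return decode(max(finals, key=q_score))

print(perturbed_search(-0.2, sigma=0.0))  # no noise: trapped in the wrong basin
print(perturbed_search(-0.2))             # noisy copies can escape it
```

With `sigma=0.0` every copy follows the same deterministic path and stays trapped; with noise, some copies cross into the other basin and the verifier picks one of them.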

bert_on_spec retweeted

Andrew Gordon Wilson

@andrewgwils

22 days ago

Seeing a lot recently about whether info theory explains modern AI. This is *exactly* what our epiplexity paper is about. It shows how to resolve paradoxes around DPI, synthetic data, and emergence, by considering computation and structural info: https://t.co/IYNGcLCAqx

542

494

52K

bert_on_spec retweeted

Huaxiu Yao

@HuaxiuYaoML

22 days ago

Every memory system for LLM agents evolves what it stores. None evolves how it retrieves.

🧬 EvolveMem is out, now shipping inside the SimpleMem v0.3.0 update. Powered by AutoResearch: the system researches its own retrieval, treating the full retrieval config as a structured action space and running a closed loop: evaluate ➜ diagnose ➜ propose ➜ validate ➜ repeat.

🔬 From a minimal baseline, 7 autonomous rounds produce a retrieval policy that beats the strongest published baseline by +25.7% on LoCoMo and +18.9% on MemBench.

🧬 It discovers entirely new retrieval dimensions not present in the original design, all integrated into the unified SimpleMem package.

📄 Paper: https://t.co/BWCXebWhG1
💻 Code: https://t.co/hhdgvVjblP

Led by @itsJiaqiLiu, @XinyeYee with contributions from @richardxp888, @ZhengBerkeley, @cihangxie

423

376

29K
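
The closed loop in the EvolveMem post above (evaluate, propose, validate, repeat) can be illustrated with a small hill-climbing sketch over a retrieval config. The config keys `top_k` and `rerank`, the toy `evaluate` function, and the edit set are hypothetical stand-ins, not SimpleMem's API, and the diagnose step is omitted.

```python
def evaluate(config):
    """Toy benchmark score: best at top_k=8 with reranking enabled."""
    return -abs(config["top_k"] - 8) + (2 if config["rerank"] else 0)

def propose(config):
    """Yield candidate edits to the retrieval config (the action space)."""
    yield {**config, "top_k": config["top_k"] + 1}
    yield {**config, "top_k": max(1, config["top_k"] - 1)}
    yield {**config, "rerank": not config["rerank"]}

def evolve(config, rounds=7):
    """Repeat: score the candidates, keep the best only if it validates."""
    best_score = evaluate(config)
    for _ in range(rounds):
        candidates = [(evaluate(c), c) for c in propose(config)]
        score, cand = max(candidates, key=lambda sc: sc[0])
        if score <= best_score:  # validate: accept real improvements only
            break
        config, best_score = cand, score
    return config, best_score

print(evolve({"top_k": 3, "rerank": False}))
# ({'top_k': 8, 'rerank': True}, 2)
```

The validation gate is the design choice worth noting: a proposed change is kept only when it measurably beats the current config, so the loop cannot drift to a worse policy.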

bert_on_spec retweeted

DAIR.AI

@dair_ai

22 days ago

NEW paper worth reading.

A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality.

The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure.

Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation.

This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction.

Paper: https://t.co/4k4urYOAeQ

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

289

282

19K

bert_on_spec retweeted

David Soria Parra

@dsp_

22 days ago

The release candidate for MCP 2026-07-28 is out. The protocol is now stateless: no handshake, no session id, any request can hit any server instance. Plus extensions as first-class (MCP Apps, Tasks), auth hardening, and a proper deprecation policy so we don't have to do this again. https://t.co/XRLTu1BSkB

231

679K

bert_on_spec retweeted

BURKOV

@burkov

24 days ago

Specialized time series models are good at picking up seasonal patterns and trends from raw numbers but have no way to react to events described in news or reports, while large language models can read that text and reason about it but are weak at the actual numerical extrapolation. Most attempts to combine the two either glue a language model onto numerical data through heavy retraining or use one giant prompt that asks a single model to do everything at once, and the authors from Google and Penn State argue this is the wrong decomposition. Their contribution is Nexus, a system that splits forecasting into separate cooperating agents (each agent is just a language model given a specific role and prompt): one cleans the messy mix of numbers and text into a structured timeline, one produces a broad long-range outlook, one walks step by step through near-term changes, and a final agent merges these into a single forecast, with an added loop that tests the system's past errors on held-out historical splits and writes down correction rules only if they actually improve accuracy. Tested on stock prices and Zillow housing data drawn entirely from after the models' knowledge cutoff (so the models could not have memorized the answers), this arrangement matches or beats both a strong dedicated time series model and a single-prompt language model baseline, and it produces written explanations of why each forecast moves the way it does, which the authors check using one model family to judge the other. Read with an AI tutor and quizzes for better retention: https://t.co/DUyNz04A2V PDF: https://t.co/09LwW9WzDn
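
The last step in the post above, keeping a correction rule only if it improves accuracy on held-out history, can be sketched like this. The rule functions, the numbers, and the use of mean absolute error are assumptions for illustration, not details taken from the Nexus paper.

```python
def mae(preds, actuals):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(actuals)

def accept_rule(rule, forecasts, actuals):
    """Keep a correction rule only if it lowers error on held-out history."""
    corrected = [rule(f) for f in forecasts]
    return mae(corrected, actuals) < mae(forecasts, actuals)

# Made-up held-out split: past forecasts and what actually happened.
held_out_forecasts = [100.0, 102.0, 105.0]
held_out_actuals = [103.0, 105.0, 108.0]

def add_bias(f):
    """Hypothetical rule: past forecasts ran about 3 units low."""
    return f + 3.0

def shrink(f):
    """Hypothetical rule that happens to make forecasts worse."""
    return f * 0.9

print(accept_rule(add_bias, held_out_forecasts, held_out_actuals))  # True
print(accept_rule(shrink, held_out_forecasts, held_out_actuals))    # False
```

Gating rules on held-out error means a plausible-sounding correction written by an agent is only adopted when the historical data backs it up.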
