@aliuahma i think the best balance is to show enough examples of an idea to motivate the abstraction; in general there are lots of abstractions that fit any particular idea but only a few that fit many of them
@aliuahma yeah i agree, even in the most extreme scenarios where agents one shot problems the code should at least be read; also i feel the more researchy your code becomes, the more unique info is in each line, and the less AI helps lol
@AndrewLampinen Can LLM memory systems be viewed as approximations for sparse attention over extremely long contexts? Therefore, can self-improving agent techniques involving memory be seen as unifying training and in-context learning?
@furongh checkpoints that are trained through high-throughput video-free dataloaders, which then get combined with video generators through cross-attention in an MoT-style model.
@furongh In principle, inverse dynamics from video is object segmentation with extra hidden state estimation. Therefore, it's roughly the same difficulty as fine-tuning. This could be a different story if force or tactile data is introduced. Action experts may also be warm-started from 1/
@JieWang_ZJUI cool! imo multimodal memory is underexplored, especially considering scenarios like asking a robot to remember what to do with "a specific object"
yeah, IMO it's really appealing but also hard to make L0 penalty transformers work simply because hardware is so much better optimized for dense models (in my limited testing, the effective latency of loading a large block of weights has the same latency as loading 1 bc cache)
@huskydogewoof This is really cool! Do you think the attractors could eventually be represented implicitly by a verifier? It seems like the flow fields are defined explicitly by a neural network here.
@VictorTaelin They said there was no orchestration "scaffolded to search proof strategies", but that doesn't necessarily mean there wasn't some other more general memory mechanism involved that is technically broader than mathematical proofs
@SungjinAhn_ Very cool! What do you think causes the difference from diffusion models? maybe scaling diffusion steps amounts to finer discretization rather than more exploration? or the lack of explicit "diffusion time" creates a stochastic process with really good mixing?
This is pretty neat. It would be cool to eventually unify the search implicit in Langevin-style sampling, and the coarse-to-fine inductive bias known of diffusion models, in a noise-prediction-free architecture through some variant of this.
Test-time scaling, reasoning, and generally search-like processes clearly drive significant gains in LLMs. Largely owed to the structure of language. One would think the same could apply to non-linguistic domains, like image generation, but that obviously depends on whether the structure of the domain's representation lends itself to search.
1D ordered tokens (e.g., image FlexTok, video FlexTok) seem like a natural fit since they enable a step-by-step coarse-to-fine generation. We investigated that and found they indeed enable search and scale far better with test-time compute than 2D grids. See the visuals on the webpage. Appearing in @icmlconf 2026.
๐ย https://t.co/yOFqeIJrEz
๐ https://t.co/WFZCihp1m4,
Shouldn't the linearizability of softmax attention imply some fundamental bottleneck of transformer retrieval, at least qualitatively? I first saw this in Katharopoulos et al. (2020), although it's older. I wrote a quick derivation here for context.