Take the same underlying path and increase the number of samples.
As the sequence length grows, RNNs become difficult to train and Transformers become too expensive.
Our continuous-time models instead converge to a continuous hidden-state path.
Unless I’ve missed something, there are still no technical details on how they make the approach subquadratic. Anyone know how they choose which previous tokens a query should attend to without first looking at all previous tokens, or is the subquadratic claim just marketing?
The transformer architecture used for ChatGPT, Gemini, and Claude has defined the last decade of AI. It also introduced a fundamental constraint: compute scales quadratically as context grows.
Longer inputs, exponentially higher costs and accuracy that degrades well before the context window limit.
SubQ changes that.
It's the first LLM that breaks the quadratic scaling constraint delivering longer context, higher accuracy, and lower cost at the same time without tradeoffs.
Read more here.
https://t.co/QZmk07GZrQ
Using the new GPT-Image-2 to help me express what it feels like to watch a one-off specific instruction survive Codex compaction, and then get passed down forever as legend through each successive compaction
Hate to break it to you, but the first LLM was created by Andrey Markov in 1913.
he tallied up 20,000 letters from a famous novel and computed
p(vowel | vowel)
p(consonant | vowel)
p(vowel | consonant)
p(consonant | consonant)
basically 'training' a bigram by hand
A visualisation of the idea behind rough path theory: a path is not fully described by its value
Each curve has the same area: when the number of circles doubles, their radius is scaled by 2^{-1/2}. The path's value converges to the straight line, but the total area does not
ChatGPT found a song I’d been trying to find for ages, then built a playlist around it that was way better than Spotify’s suggestions.
How is an LLM better at music recommendation than a direct recommendation system?
Fantastic OxYSS session with Emma Prevot (@OxfordStats) & @benjaminwalker (@OxUniMaths) on the intersection of Causal Inference and Continuous-Time ML.
A vibrant discussion on the future of temporal modelling!
@ryu0000000001 A natural next step is developing path-to-path models that can handle irregular or over-sampled inputs.
Currently, we use the Log-ODE method, but because it outputs a sequence (the solution only at interval endpoints), the resulting models cannot be stacked.
Take the same underlying path and increase the number of samples.
As the sequence length grows, RNNs become difficult to train and Transformers become too expensive.
Our continuous-time models instead converge to a continuous hidden-state path.
@HochreiterSepp Never liked the term 'Linear RNN', as their updates are nonlinear in (h_t, x_t). The real bottleneck is restrictive structure that prevents hidden-state interactions. Without that restriction, Linear RNNs already have the expressivity for world modelling: https://t.co/Il9ZUJFlgw
And in a learning setting, it can be understood as a SLiCE with a strong inductive bias for building a contextual memory of a path.
Paper: https://t.co/5y3LxAHloU
3/3
Congratulations to our PhD student Alex on his first paper!
The Exponentially-Weighted Signature.
This new SLiCE architecture generalises the signature transform by introducing a trainable continuous-time attention over the history of a path.
1/3
It builds on the exponentially fading memory signature of Eduardo Abi Jaber and Dimitri Sotnikov by moving beyond channel-independent weighting of the past, allowing interactions between channels to shape how history is remembered.
2/3