Antonio Montano ☼ @antomon - Twitter Profile

Pinned Tweet

Antonio Montano ☼ @AntoMon

6 months ago

Beyond De-Skilling: Intelligence Explosion and the End of Skill as a Stable Category https://t.co/PGsZANt4CO

1

3

1

444

AntoMon retweeted

Niels Rogge @NielsRogge

about 23 hours ago

Great paper, made it available here: https://t.co/pwvKtvpzLq Check how it compares to other text-to-image models at the bottom

1

21

4

11

10K

AntoMon retweeted

Paul Middlebrooks @pgmid

about 19 hours ago

What if anything does computational complexity have to say about how brains work (or vice versa)? Cris made this discussion NP easy https://t.co/bILjW27f8g

0

25

8

16

2K

Antonio Montano ☼ @AntoMon

about 3 hours ago

The Industrialization of Intelligence – Random Bits of Knowledge https://t.co/j19sojXyOL

0

2

AntoMon retweeted

IEEE Spectrum @IEEESpectrum

about 18 hours ago

Engineers have developed a process for training #LLMs that uses 14 percent less energy without sacrificing speed. Their method involves adjusting the clock frequency of the GPU during computation. https://t.co/eDAeIKGeZl

1

24

10

6

3K

AntoMon retweeted

Harrison Chase

@hwchase17

about 19 hours ago

Very cool work from @jit_infinity: 🔥Leve: a filesystem-first, durable agent framework built on LangGraph. You describe an agent as a directory of files. Leve compiles that directory into an agent and runs it. Inspired by Vercel's Eve https://t.co/cfWpii90Yn

11

106

14

94

10K

AntoMon retweeted

Nico

@nicos_ai

1 day ago

Google has put an end to the GPU mafia 💀 VS Code now connects directly to Google Colab. → You get a free T4 GPU inside your editor. → Your local files. Their compute power.

43

2K

261

3K

232K

AntoMon retweeted

IAEA NE

@IAEANE

about 22 hours ago

⬇️ Download the World Distribution of Uranium Deposits map in HD! This edition features nearly 5300 deposits, a revised classification system & improved geological visual offering a comprehensive view of 🌍 #uranium resources. Get it here 🗺️ https://t.co/e5MZAmlyVL

1

53

22

51

3K

AntoMon retweeted

Kyunghyun Cho

@kchonyc

about 23 hours ago

it was fun giving a talk at MLSS 2026 in NYC. i talked about my recent efforts in "computationalizing" statistical and causal estimation, from learning to estimate pop. std. dev, mutual info., bayes ppd and causal effect to causal identification. links to the slide deck and the papers below.

3

151

18

115

12K

AntoMon retweeted

Rishabh Tiwari

@rish2k1

about 18 hours ago

https://t.co/qCh4LaqUNh

5

256

28

456

28K

AntoMon retweeted

Alex Kontorovich

@AlexKontorovich

6 days ago

Oh and Kim Morrison used Claude + Aristotle + Codex to formalize the negation of the Erdos unit distance conjecture: https://t.co/Y0mPvIm7S1 It's nice to see that this was built on top of PNT+; so despite the fact that we haven't been able to upstream it to Mathlib (the Residue Theorem we have in PNT+ is just for rectangles, and Mathlib will want a much more general version...), it's still useful in other applications!...

1

66

14

23

14K

AntoMon retweeted

Harmonic

@HarmonicMath

3 days ago

The negation of the Erdos unit distance conjecture, now formalized by Aristotle. You can try it for free at https://t.co/azsI6J8pPv

0

53

7

11

6K

AntoMon retweeted

Matt Dancho (Business Science)

@mdancho84

1 day ago

A Research Scientist at Google DeepMind just dropped a 58 page paper on building agents that specialize in game theory. Here are the most important parts:

3

356

58

405

23K

AntoMon retweeted

Chemistry Net @Chemistry_Net

about 22 hours ago

Compressing Chemistry Reveals Functional Groups https://t.co/Xp2XPW19yY #JCIM Vol66 Issue7 #MachineLearning #DeepLearning

0

13

3

1

841

AntoMon retweeted

SIMD Crawford 🟣

@omershapira

about 13 hours ago

TIL Jurafsky & Martin, the textbook I used for Computational Linguistics in undergrad many years ago (when TAU didn't offer that class), released a second edition in 2026, and it has one of the clearest explanations of Transformers I have seen to date. https://t.co/FyCukgQTtb

1

64

8

43

2K

AntoMon retweeted

Isra

@israfill

1 day ago

COMPANY BEHIND TIKTOK JUST OPEN SOURCED AN AI AGENT THAT DOES YOUR WHOLE JOB FOR YOU

China doesn't miss 😳

everyone's been crowning hermes the #1 agent

then bytedance dropped deerflow

72,000+ github stars. 9,700+ forks. FREE. MIT

it doesn't just run tools like hermes. it does the entire task

you give it one job and it plans the steps, spins up a team of sub-agents, writes the code, tests it, fixes its own errors, and hands you finished work in its own sandbox

research, full websites, dashboards, slide decks, reports. done, not drafts

full beginner setup:

easiest way (if you use claude code, cursor or codex): paste this to your agent and it installs everything for you:
"clone deerflow and set it up for local dev using https://t.co/jPhzWtcHwr"

manual way (about 5 min):

1. install the basics: git, docker, node 22+, uv, pnpm (deerflow's "make check" flags anything missing)
2. clone the repo:
   git clone https://t.co/VdMJHx0YOu
   cd deer-flow
3. run the setup wizard:
   make setup
   it asks which model you want and saves your key. point it at openrouter, groq or nvidia nim to run it free
4. check it works:
   make doctor
5. start it with docker:
   make docker-init
   make docker-start
6. open it in your browser and give it your first task

now the part that'll start a fight:

hermes is the most used agent on openrouter (224B tokens a day) and i've been all in on it

but hermes runs your tools. deerflow runs your whole project end to end

i'm actually tempted to switch and i did not expect that

so which one wins right now?
- hermes: american, lean, lives on your laptop
- deerflow: chinese, bytedance muscle, replaces a whole team

bookmark this and tell me which agent you're running

49

1K

100

2K

100K

AntoMon retweeted

Machine Learning (ML) Papers @Memoirs

1 day ago

Neural Network Implementation of the Renormalization Group for Fault Diagnosis with Class Imbalance Evgeny Nikulchev, Dmitry Ilin https://t.co/AHeomEzZ4f [𝚌𝚜.𝙻𝙶]

0

8

4

5

267

AntoMon retweeted

Machine Learning (ML) Papers @Memoirs

about 14 hours ago

On the Residual Scaling of Looped Transformers: Stability and Transferability Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li https://t.co/TBHHCYz7FK [𝚌𝚜.𝙻𝙶]

0

9

5

4

374

AntoMon retweeted

Georgyeh Floydorovich

@chaumian

about 20 hours ago

BootNet: Homomorphic CNN Inference with Convolution and ReLU Fused in Bootstrapping https://t.co/FFoHSRF8AR

0

10

4

7

721

AntoMon retweeted

Pierfrancesco Beneventano

@PierBeneventano

about 17 hours ago

Edge of Stability has been a thing for a while now. We can show that most training happens at the edge of stability. This is very surprising because of a "negative" fact: most theory does not apply (e.g., the Descent Lemma).

The question becomes: does the fact that training happens at the EoS impact performance? In what way? With what mechanisms? How is that affected by the architecture?

To our knowledge, these questions are unanswered to date, and our new paper is about the first step towards this!

EoS picks what part of the distribution to learn and at what speed!

https://t.co/sefeVZMzbz

In particular, the consequence is that for MLPs:
EoS can improve robustness or OOD behavior, but only when the relevant subset is the one selected by EoS.
If boundary points dominate, EoS helps near-boundary robustness.
If distributional outliers dominate, EoS helps extrapolate toward the tail.

Precisely:

1. We make EoS causal. We fork training from the same state at EoS onset: one branch stays at EoS, the other exits by lowering the learning rate. Same everything, only the stability constraint changes.

2. EoS is selective: it learns some parts of the data distribution faster, while slowing others.

3. The selector is surprisingly simple. A group benefits from EoS when its gradient is big and has high alignment with v_1, the top Hessian eigenvector.

Translation: to benefit from EoS, a subset must
(i) point in the sharpest direction, and
(ii) keep a non-vanishing gradient.

This gives concrete answers to the questions above:

Does EoS affect performance?
Yes, but not uniformly. It reallocates learning.

In what way?
It prioritizes the subset with largest curvature influence.

By what mechanism?
Self-stabilization at the sharpness boundary couples training to v_1; only groups aligned with v_1, and whose gradients persist, get the extra progress.

How does architecture matter?
Architecture changes the map from input geometry to gradient geometry. Thus it can change which subset dominates, but the same predictor remains: largest curvature influence wins.

With the amazing Shauna Kwag, @anakha_g, and @TomasoPoggio!
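
The selector in point 3 of the thread (a large gradient that is well aligned with v_1) can be illustrated on a toy problem. The sketch below is not the paper's code: the 3×3 Hessian, the group names, the gradient values, and the helper functions `top_eigenvector` and `selector_scores` are all hypothetical, and the projection |⟨g, v_1⟩| is used only as a simple stand-in for whatever "curvature influence" measure the paper defines. Power iteration is used for v_1 under the assumption that the largest-magnitude eigenvalue is the positive one.

```python
import math

def matvec(H, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_eigenvector(H, iters=200):
    """Power iteration for the leading eigenvector of a symmetric matrix.

    Assumes the largest-magnitude eigenvalue is positive and non-degenerate.
    """
    v = [1.0] + [0.5] * (len(H) - 1)  # arbitrary non-zero starting vector
    for _ in range(iters):
        w = matvec(H, v)
        n = norm(w)
        v = [x / n for x in w]
    return v

def selector_scores(H, group_grads):
    """Per-group gradient norm, cosine alignment with v_1, and projection on v_1."""
    v1 = top_eigenvector(H)
    scores = {}
    for name, g in group_grads.items():
        g_norm = norm(g)
        proj = abs(sum(a * b for a, b in zip(g, v1)))
        alignment = proj / g_norm if g_norm > 0 else 0.0
        scores[name] = {"grad_norm": g_norm, "alignment": alignment, "projection": proj}
    return scores

# Hypothetical toy Hessian whose sharpest direction is the first axis.
H = [[10.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 0.1]]

# Hypothetical per-group gradients.
group_grads = {
    "boundary": [2.0, 0.1, 0.0],    # large gradient along the sharpest direction
    "bulk": [0.1, 2.0, 0.0],        # large gradient, but in a flatter direction
    "vanishing": [0.01, 0.0, 0.0],  # well aligned, but the gradient has nearly vanished
}

scores = selector_scores(H, group_grads)
favored = max(scores, key=lambda k: scores[k]["projection"])
print(favored)
```

In this toy setup only the "boundary" group satisfies both conditions at once: "bulk" fails condition (i) and "vanishing" fails condition (ii).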

1

76

12

56

5K

AntoMon retweeted

Jorge Bravo Abad

@bravo_abad

about 19 hours ago

Photon number as a learning-capacity knob: a polynomial scaling advantage in quantum machine learning

In most variational models, capacity is something you buy through parameters and data. Add more trainable weights, feed more examples, and the model learns to generalize. But what if a physical resource of the hardware itself could do that work for you? In photonic quantum machine learning, that resource turns out to be the number of photons you send through the circuit.

Yong Wang and coauthors prove, and then measure, that the learning capacity of a linear optical circuit scales polynomially with photon number. They quantify capacity using the rank of the data quantum Fisher information matrix, which counts the independent directions in parameter space that actually move the model. For an m-mode circuit, single-photon states give a capacity that grows like m, while multi-photon states push it toward m². More photons means a larger accessible state space, which is the geometric signature of better trainability and generalization.

The practical payoff shows up in two experiments on a fully programmable 6-mode photonic chip, trained online with SPSA. In a unitary-learning task, single photons need at least 4 training states to recover a 5×5 unitary, while two-photon states do it with only 2, cutting the data requirement roughly in half and matching the theory exactly. In a metric-learning task on a vowel-recognition dataset, two-photon states reach markedly higher class separation and lower test loss than single-photon states under the same ansatz and training budget. The lesson is clean: you can trade a hardware resource (photon number) for what would otherwise cost you training data.

For learning pipelines on photonic hardware, this reframes a scaling decision. Instead of only enlarging circuits or gathering more labeled examples, which is the expensive part in domains like molecular property prediction, materials screening, or spectroscopy-driven discovery, you can raise photon number to extract more capacity from the same chip. That matters wherever labeled data is scarce or costly to acquire, since fewer training states to reach generalization translates directly into shorter, cheaper experimental loops.

Paper: Wang et al., npj Quantum Information (2026), CC BY 4.0 | https://t.co/XRq7vi5Cyj
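
The m versus m² scaling mentioned in the thread has a simple counting analogue. The sketch below computes only the dimension of the state space for n indistinguishable photons in m modes, which is the standard binomial count C(m + n - 1, n). It is not the rank of the data quantum Fisher information matrix that the paper uses as its capacity measure, and the function name `fock_dimension` is a hypothetical helper chosen for illustration.

```python
from math import comb

def fock_dimension(modes: int, photons: int) -> int:
    """Count the ways to distribute `photons` indistinguishable photons over `modes` modes."""
    return comb(modes + photons - 1, photons)

# State-space dimension on a 6-mode chip for 1, 2, and 3 photons.
dims = {n: fock_dimension(6, n) for n in (1, 2, 3)}
print(dims)  # {1: 6, 2: 21, 3: 56}
```

For a single photon the count equals m, while for two photons it is m(m + 1)/2, which grows quadratically in m. This is consistent with the larger accessible state space the thread describes, though the capacity figures themselves come from the paper's own analysis.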

1

33

7

9

2K
