@ArtsyMarx1st Article is paywalled but efficacy against all infection (symptomatic or not) is reported to be ~35% (21.5% -> 14.0%)
https://t.co/zF7JAaL6ho
"An antiviral pill has, for the first time, been shown to prevent COVID-19 in people exposed to the SARS-CoV-2 virus at home..
The drug, called ensitrelvir, is made by the Japanese pharmaceutical company Shionogi..
In an international study of more than 2,000 household contacts conducted from June 2023 to September 2024, about 9% of people who got a placebo within 72 hours of a housemate developing symptoms became symptomatic themselves, compared with only about 3% of those who got a five-day course of ensitrelvir.
Rates of viral transmission were lower in the ensitrelvir group, too: confirmed infections, symptomatic or not, turned up in only 14.0% of those who received the drug, compared with 21.5% of those who got a placebo.."
Not bad.
'At last, a pill that can prevent COVID after exposure to infected people'
https://t.co/H6GfHPAvLy
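For anyone checking the arithmetic behind the "~35%" figure, here is a minimal sketch of the relative risk reduction from the rates quoted above. The function name `relative_reduction` is made up for illustration, and the symptomatic rates are the article's rounded "about 9%" and "about 3%", so that second number is only approximate.

```python
def relative_reduction(placebo_rate: float, treated_rate: float) -> float:
    """Relative risk reduction: 1 - (treated rate / placebo rate)."""
    return 1.0 - treated_rate / placebo_rate

# Rates as quoted in the article excerpt, expressed as fractions.
all_infection = relative_reduction(0.215, 0.140)  # confirmed infection, symptomatic or not
symptomatic = relative_reduction(0.09, 0.03)      # rounded figures, so a rough estimate

print(f"all infection: {all_infection:.1%}")  # 34.9%
print(f"symptomatic:   {symptomatic:.1%}")    # 66.7%
```

So the drug looks roughly twice as effective at preventing symptomatic illness as at preventing infection overall, at least on these rounded numbers.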
cc @teortaxesTex
it's a fairly simple probe reverse engineered from my personal agent grievances... but MAN this chart is so funny
[relevant task axis: "model's ability to realize when a local change requires exacting multihop global changes"]
@kalomaze Parallax still works with AdamW though and in fact beats attention with the right LR schedule, just not significantly. I wonder why something similar hasn't been reported for Shampoo and whether it's due to less adoption or people who know can't speak.
https://t.co/obSnH0HG53
~6/7~ Crucially, we find Muon counterfactually amplifies the advantage of Parallax.
The strength of Parallax depends heavily on the norm and alignment of the probe and the KV covariance, which is very sensitive to choice of optimizer.
To our knowledge, this is the first clear case of explicit architecture–optimizer codesign for attention mechanisms.
@Creative_Math_ I don't think there is a definitive answer yet. One camp believes that 1 - β should match the frequency of the next feature the model should learn, and therefore needs to decrease over time for LLMs (log-time momentum): https://t.co/kTtjgdNyRf
3/10 Why log-time schedules?
AdamW's fixed β₁, β₂, λ create a fixed memory horizon. But language has a power-law structure (Hilberg, Zipf): informative events can be Θ(T) steps apart. The longer you train, the worse the mismatch. Log-time schedules let memory grow with time.
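A rough way to see the "fixed memory horizon" point: an exponential moving average with coefficient β averages over roughly 1/(1 - β) steps. The sketch below is only an illustration of a horizon that grows with training time, not the schedule from the linked thread. The names `horizon`, `beta_at`, `base_beta`, and `frac` are hypothetical, and the rule "horizon is a fixed fraction of elapsed steps" is an assumption chosen for simplicity.

```python
def horizon(beta: float) -> float:
    """Approximate number of steps an EMA with coefficient beta averages over."""
    return 1.0 / (1.0 - beta)

def beta_at(step: int, base_beta: float = 0.9, frac: float = 0.1) -> float:
    """Hypothetical growing-memory schedule.

    Keeps the usual fixed horizon early on, then lets the horizon grow as a
    fixed fraction `frac` of the steps taken so far, so 1 - beta decays like 1/step.
    """
    target_horizon = max(horizon(base_beta), frac * step)
    return 1.0 - 1.0 / target_horizon

for step in (1, 1_000, 100_000):
    b = beta_at(step)
    print(f"step={step:>7}  beta={b:.4f}  horizon~{horizon(b):,.0f} steps")
```

With a fixed β₁ = 0.9 the horizon stays at about 10 steps no matter how long training runs. Under this assumed schedule it is about 100 steps at step 1,000 and about 10,000 at step 100,000, which is one reading of "memory grows with time".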
@keshigeyan Yes I have seen that, but if I were to use GPIC for a project right now I would find the original title and description of the image on Flickr (say) useful.
@willccbb To me at least, by the time Jeremy Bernstein posted on the Thinky blog there was already a body of work on it by himself, Cesista, and Su (e.g. https://t.co/qWS8vLRQ0w), so I didn't pay much additional attention.
@konstmish Hold on, ScheduleFree+ still needs warm-up: "(...) a decreasing step size is not necessary with Schedule-Free Learning, however a learning rate warmup is still needed for best performance". With warmup, C-warmup, and annealing β it's ironically scheduling many variables 😅
@kankei_arahen @gbrl_dick That’s revisionist. The Taiwanese view of Japan’s colonization only turned positive after
1. The living memory died with the elder generations;
2. DPP changed the textbooks to whitewash the history.
In fact their textbooks don’t even call it colonization. It’s now just 日治 (“Japanese rule”).