Existing DL literature on training loss and instability analyzes convergence through linear and quadratic coefficients, when the Taylor series actually diverges: https://t.co/01NCG8kmgT
@boknilev The numerous appendices could be because the original paper was very long and the NeurIPS submission format forced them to move all the proofs and setup details into the appendix.
Bottle gourd is my favorite. How you cut it makes all the difference; slices that are not too thin work best. Cook it with chana dal and tomato, use the usual Indian spices (kept light so they don't overwhelm its taste), keep the consistency slightly watery, and add crushed fresh coriander at the end.
@zdeborova I wonder if analyzing the complex zeros of your high-dimensional attention model's attention partition functions along SGD directions could connect staged specialization to step-size safety meaningfully.
@zdeborova Paper: https://t.co/yX56cjrIEB. Complex zeros of the softmax partition function (the “ghosts of softmax”) create singularities that limit the Taylor convergence radius of the cross-entropy loss.
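A toy two-class sketch of that claim, not the paper's code: along a line with logits z + t*d, the partition function Z(t) = exp(z1 + t*d1) + exp(z2 + t*d2) vanishes at complex t where (z1 - z2) + t*(d1 - d2) = i*pi*(2k+1), so log Z(t), and with it the cross-entropy along that line, has a Taylor radius of sqrt((z1 - z2)^2 + pi^2) / |d1 - d2| around t = 0. The function names and the example logits below are made up for illustration.

```python
import cmath
import math


def two_class_taylor_radius(z_gap, d_gap):
    """Distance from t = 0 to the nearest complex zero of
    Z(t) = exp(z1 + t*d1) + exp(z2 + t*d2),
    with z_gap = z1 - z2 and d_gap = d1 - d2 (hypothetical helper)."""
    if d_gap == 0:
        return math.inf  # the logit gap never changes, so Z(t) has no zeros
    return math.hypot(z_gap, math.pi) / abs(d_gap)


def partition(t, z, d):
    """Softmax partition function along the line z + t*d, for complex t."""
    return sum(cmath.exp(zi + t * di) for zi, di in zip(z, d))


# Illustrative numbers only: current logits z and update direction d.
z = [2.0, 0.0]
d = [1.0, -1.0]
z_gap, d_gap = z[0] - z[1], d[0] - d[1]

radius = two_class_taylor_radius(z_gap, d_gap)
# Nearest zero (k = 0): t = (-z_gap + i*pi) / d_gap
t_zero = complex(-z_gap, math.pi) / d_gap

print("Taylor radius along d:", radius)
print("|Z| at the nearest zero:", abs(partition(t_zero, z, d)))
```

In this toy case a larger logit gap along the update direction (bigger |d1 - d2|) shrinks the radius, and a perfectly balanced pair of logits (z1 = z2) gives exactly pi / |d1 - d2|.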
Really interesting paper. I’m confused by Appendix H: if the number of (W)-sparse networks is (W^{O(W)}), then a (2^{-W}) weight-decay prior seems too weak to normalize over variable architectures, since (W^{O(W)}2^{-W}) grows like (2^{O(W\log W)-W}). Is there an implicit canonical encoding or fixed architecture restriction I’m missing?
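Spelling out the arithmetic behind that worry, under the assumption (mine, for illustration) that the count is exactly \(W^{cW}\) for some constant \(c > 0\):

```latex
W^{cW}\, 2^{-W} \;=\; 2^{\,cW\log_2 W \,-\, W} \;=\; 2^{\,W\,(c\log_2 W - 1)},
\qquad\text{so}\qquad
\sum_{W \ge 1} W^{cW}\, 2^{-W} \;=\; \infty ,
```

because the exponent \(W(c\log_2 W - 1)\) is positive and increasing once \(W > 2^{1/c}\), so the terms do not even tend to zero.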
The diagram below visualizes what brain drain looks like in India's case.
There's a strong positive relationship between the PISA scores of natives in countries and the scores of second-generation immigrants whose ancestry is from those countries.
That India is the most striking outlier in the diagram tellingly demonstrates that the West harvests and siphons off the brightest Indian talent, as measured by average PISA scores, far more aggressively than it does talent from other countries.
@Clashluke @HessianFree GN-CG (assuming it's Gauss-Newton with CG) is heavier: it chooses the direction via repeated (Jv/J^\top v) products, not just the scalar. So it doesn't give you an optimal LR for an Adam/PSGD direction. But the GN-CG step is still bounded by the Taylor radius.
@f14bertolotti I agree that Muon is not special. But the paper's quadratic loss model is exponentially wrong for softmax CE, so its LR theory won't translate to real-world LLM/transformer training.
@fly51fly The top NTK eigenvalue will exponentially overestimate the safe step size for softmax CE, so while the theory is elegant, it doesn't apply to LLM/transformer training: https://t.co/yX56cjrIEB