@wightmanr @jeremyphoward Great call out.
I was also searching recently for what init schemes are being used in recent work and couldn't find much. Arcee's Trinity tech report was one find. Can you suggest others in addition to Olmo?
@stochasticchasm I guess there's a middle ground where you compute the loss itself in chunks but realize the full logits as usual. Some memory savings + logits/KL unaffected
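A minimal, framework-free sketch of that middle ground: the full logits already exist, and only the loss reduction runs over row chunks. The names `row_nll` and `chunked_cross_entropy` are made up for illustration. The assumption is that in a real framework the savings come from the per-chunk intermediates (e.g. upcasted log-softmax values) never being materialized at full size.

```python
import math

def row_nll(logits_row, target):
    """Negative log-likelihood of `target` for one row of logits."""
    m = max(logits_row)  # subtract the max for a stable log-sum-exp
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return lse - logits_row[target]

def chunked_cross_entropy(logits, targets, chunk_size):
    """Mean cross-entropy over all rows, accumulated chunk by chunk.

    `logits` is the full, already-materialized [N, V] table; only the
    loss intermediates are limited to `chunk_size` rows at a time.
    """
    total = 0.0
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        total += sum(row_nll(r, t) for r, t in zip(chunk, chunk_targets))
    return total / len(logits)

# Toy data (hypothetical): 5 positions, vocabulary of 3.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [5.0, 5.0, 5.0],
          [-2.0, 0.0, 3.0], [1.0, -1.0, 0.0]]
targets = [0, 2, 1, 2, 1]

chunked = chunked_cross_entropy(logits, targets, chunk_size=2)
unchunked = chunked_cross_entropy(logits, targets, chunk_size=len(logits))
```

Since the logits are untouched, anything else that reads them (e.g. a KL term) sees the same values either way; only the summation order of the loss differs.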
@StasBekman @aryanvs_ The speedup was due to throwing more GPUs at the workload, though, right? Baseline being a single GPU, other cases using N, and seeing a near-linear decrease in overall runtime? Not a per-GPU improvement
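To make the distinction concrete, here is a small sketch that separates overall speedup from per-GPU efficiency. The runtimes are made-up numbers, not taken from the thread, and `scaling_summary` is a hypothetical helper.

```python
def scaling_summary(baseline_seconds, runtimes_by_gpu_count):
    """Return (n_gpus, speedup, per_gpu_efficiency) for each run.

    speedup    = single-GPU runtime / N-GPU runtime
    efficiency = speedup / N, so 1.0 means perfectly linear scaling
    """
    rows = []
    for n, seconds in sorted(runtimes_by_gpu_count.items()):
        speedup = baseline_seconds / seconds
        rows.append((n, speedup, speedup / n))
    return rows

# Hypothetical runtimes: 800 s on one GPU, then 2, 4, and 8 GPUs.
rows = scaling_summary(800.0, {2: 410.0, 4: 210.0, 8: 110.0})
for n, speedup, efficiency in rows:
    print(f"{n} GPUs: {speedup:.2f}x faster overall, "
          f"{efficiency:.2f} of linear per GPU")
```

With numbers like these the overall speedup grows with N while the efficiency stays at or below 1.0, which is the "more GPUs, not faster GPUs" reading.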
@m_sirovatka Add like a fully_shard_bwd API to wrap modules whose grads should be RS'd together. E.g. fully_shard on a whole transformer block and wrap the MLP and attn sub-blocks each with fully_shard_bwd so their grads are bucket-reduced and freed earlier than default. Maybe too complicated
@m_sirovatka Yeah was thinking the same. Basically enabling the backwards bucketing strategy to be different from the fwd, rather than forcing both to consolidate all collectives into a single launch
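A toy sketch of what decoupled forward and backward bucketing could look like. This does not use PyTorch and is not an existing API: `Module`, `bwd_group`, `forward_bucket`, and `backward_buckets` are all hypothetical names, with `bwd_group=True` standing in for the proposed `fully_shard_bwd` wrapper and the top-level module standing in for one `fully_shard` unit.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    name: str
    children: list = field(default_factory=list)
    bwd_group: bool = False  # hypothetical stand-in for fully_shard_bwd

def leaf_params(module):
    """Names of all leaf entries under `module`, in forward order."""
    if not module.children:
        return [module.name]
    return [p for child in module.children for p in leaf_params(child)]

def any_bwd_group(module):
    return module.bwd_group or any(any_bwd_group(c) for c in module.children)

def forward_bucket(unit):
    """One all-gather bucket covering the whole forward unit."""
    return leaf_params(unit)

def backward_buckets(unit):
    """Reduce-scatter buckets in backward (reverse) order.

    Without any bwd_group markers this falls back to a single bucket,
    matching the forward grouping.
    """
    if unit.bwd_group or not any_bwd_group(unit):
        return [leaf_params(unit)]
    buckets = []
    for child in reversed(unit.children):
        buckets.extend(backward_buckets(child))
    return buckets

# One transformer block as the forward unit, with attn and MLP marked
# as separate backward groups.
block = Module("block", [
    Module("attn", [Module("attn.qkv"), Module("attn.out")], bwd_group=True),
    Module("mlp", [Module("mlp.up"), Module("mlp.down")], bwd_group=True),
])

fwd = forward_bucket(block)
bwd = backward_buckets(block)
```

Here the forward pass would still gather the whole block in one collective, while the backward pass gets two smaller buckets, MLP first and attention second, so the MLP grads could be reduced and freed before the attention backward finishes.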