@gazorp5 I have not even looked at the actual generation step yet. If you try run_inference.py, it takes 85 seconds (on my Spark) just to start gen. A lot of this is just low-hanging fruit
I'm currently at 28.6s
"The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise."
This really is a nice quote
Reinforcement learning has exploded on Modal, and we've been cooking.
Here's a review of lessons learned helping teams train at scale, the patterns we kept seeing, and an open-source library to get started with RL on Modal quickly.
@maharshii > it's unfortunate that torch scaled mm api does not provide a global scale dequantization argument
Can you elaborate here? https://t.co/wa3EYcBZkk
This does support global scales. We should probably expand a lil in the docs but here is the gist: https://t.co/rrpwcAlZjh
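For anyone following along without opening the gist, here is a rough sketch of why a global (per-tensor) scale is cheap to support. It is plain Python, not the torch API, and every name and value in it is made up for illustration. The point is only that a scalar scale commutes with the matmul, so you can multiply the quantized values and dequantize the output once.

```python
# Plain-Python sketch (NOT the torch API): a per-tensor "global" scale
# commutes with the matmul, so it can be applied once to the output.

def matmul(a, b):
    """Naive matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scale(m, s):
    """Multiply every element of m by the scalar s."""
    return [[s * x for x in row] for row in m]

def scaled_matmul(a_q, b_q, scale_a, scale_b):
    """Multiply the quantized values, then dequantize the result once."""
    return scale(matmul(a_q, b_q), scale_a * scale_b)

# Hypothetical quantized operands and scales (powers of two keep the float math exact).
a_q = [[1, 2], [3, 4]]
b_q = [[5, 6], [7, 8]]
scale_a, scale_b = 0.5, 0.25

# Dequantize first, then multiply ...
reference = matmul(scale(a_q, scale_a), scale(b_q, scale_b))
# ... versus multiply first, then apply the combined scale once.
fused = scaled_matmul(a_q, b_q, scale_a, scale_b)
```

Both paths give the same matrix here; the second one touches the scales only once, on the output.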
@_seemethere @difficultyang Yeah, the big caveat is that when you first use it, it’s gonna suck, but if you stick with it and actually muck around with the system prompt + extensions, you end up with something that feels very tailored to your preferences
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.
CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip.
Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
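A toy illustration of what "fused into its epilogue" means in general, not CODA's actual method: the bias-add + ReLU op, the function names, and the inputs are all assumptions picked for the example. The unfused version stores the matmul result and makes a second pass over it; the fused version applies the op to each accumulator before the value is ever written out.

```python
# Toy sketch of epilogue fusion in plain Python (illustrative only, not CODA).
# The elementwise op (bias add + ReLU) is an assumed example.

def matmul_then_op(a, b, bias):
    """Unfused: materialize the matmul output, then reread it for the elementwise op."""
    out = [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
    return [[max(0.0, v + bias[j]) for j, v in enumerate(row)] for row in out]

def matmul_fused_epilogue(a, b, bias):
    """Fused: apply the op to each accumulator before it is stored."""
    cols = list(zip(*b))
    out = []
    for row in a:
        out_row = []
        for j, col in enumerate(cols):
            acc = sum(x * y for x, y in zip(row, col))  # matmul accumulator
            out_row.append(max(0.0, acc + bias[j]))     # epilogue: bias + ReLU
        out.append(out_row)
    return out

# Hypothetical inputs.
a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -1.0]

unfused = matmul_then_op(a, b, bias)
fused = matmul_fused_epilogue(a, b, bias)
```

On a GPU the win comes from skipping the extra round trip through memory for the intermediate result; in this toy version the two functions simply return the same values.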