@dcburger @karpathy Efficiency is relative to other DL architectures. From an engineering perspective, biology today fails #1 and #2, as it's neither expressive nor optimized via backprop/gradient descent. Transformers are a nice balance of all three.
So, "double descent" is happening b/c DF isn't really the right quantity for the x-axis: like, the fact that we are choosing the minimum norm least squares fit actually means that the spline with 36 DF is **less** flexible than the spline with 20 DF.
Crazy, huh?
19/
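A minimal numerical sketch of the point above, assuming random Gaussian features as a stand-in for the spline basis (the thread's actual basis and data are not reproduced here). With n = 20 points, the minimum norm least squares fit on 36 nested features has a coefficient norm no larger than the fit on the first 20, which is one sense in which it is less flexible. The names `X`, `y`, and `min_norm_fit` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                   # number of training points (assumed)
X = rng.standard_normal((n, 36))         # hypothetical nested feature matrix
y = rng.standard_normal(n)               # hypothetical responses

def min_norm_fit(features, targets):
    # The pseudoinverse returns the least squares solution with the
    # smallest Euclidean norm when the system is underdetermined.
    return np.linalg.pinv(features) @ targets

beta_20 = min_norm_fit(X[:, :20], y)     # p = n: the single interpolating fit
beta_36 = min_norm_fit(X, y)             # p > n: min-norm among all interpolators

norm_20 = np.linalg.norm(beta_20)
norm_36 = np.linalg.norm(beta_36)
print(norm_20, norm_36)
```

Because the 20-feature solution padded with zeros is also a valid 36-feature interpolator, the minimum norm solution with 36 features cannot have a larger norm.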
@fchollet Turn The Ship Around by L David Marquet. It has nothing to do with tech but everything to do with management, turning the worst performing submarine in the fleet to the best with lasting performance. Refreshingly different perspective than being at a bull market tech company
@bbabenko They used to cure that cut in a barrel called a "butt" and apparently the name stuck. Some have moved away from calling it the "butt" and call it pork shoulder instead, but it gets confusing when it's labeled "butt shoulder"! The actual hind legs are "ham" https://t.co/VWpmNy9DRM
@ericjang11 This is generally true of production ML where the train infra is different than test infra. With robotics you don't have to do compression+networking but still run into problems from time sync with multiple sensors, noise patterns with low exposure etc
@tunguz @dirk_hovy Once you think of DNNs as just a scalable method to train the largest universal approximators with trillions of parameters and hundreds to thousands of layers, the question becomes much more difficult!
@gdb This has at times felt like an antipattern when the data's semantics get lost in translation from a loop to various tensor operations. Named tensors can help, but it still feels like there's more room for improvement to get both clear intent and performance
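A small illustration of that trade-off, assuming NumPy and a made-up per-sample weighted feature sum (all names and shapes here are hypothetical). The loop states the intent directly; the `einsum` version is vectorized and keeps some of the semantics in its axis letters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, time, feat = 4, 5, 3                      # hypothetical sizes
x = rng.standard_normal((batch, time, feat))     # per-sample feature sequences
w = rng.standard_normal((batch, time))           # per-step weights

# Loop version: the intent is explicit, but it is slow for large arrays.
out_loop = np.zeros((batch, feat))
for b in range(batch):
    for t in range(time):
        out_loop[b] += w[b, t] * x[b, t]

# Tensor version: same computation; the letters b, t, f document axis roles.
out_vec = np.einsum("bt,btf->bf", w, x)
```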
@tunguz In that room you learn to stop using OCaml and tail recursion optimization and be forever bound by Python because Guido knows best https://t.co/wveQK9m8bb
@jefrankle @MosaicML My impression is GPU JPEG decoding has been in torchvision for a while. Is this a bottleneck with CPU-based decoding? https://t.co/PFucI2PAPA
@nairbv @soumithchintala The chart seems a little confusing: the 40MB of on-chip memory listed for the A100 probably refers to local mem, cache, registers, etc., but the bandwidth row makes an apples-to-oranges comparison of off-chip bandwidth for the GPU to on-chip bandwidth for the WSE-2.
@shoyer Async / event based networks require event based inputs but in practice most sensor inputs are not fully event based so you don't see effective improvements in overall performance with inference. Also extra large models that require more memory than fit on single GPU
@cHHillee @giffmana The disagreement is that memory and communication are the bottleneck, not compute. This is due to physical constraints (energy per bit goes up with the distance a bit travels). Other models fit reality better than FLOPs or MACs or OPs https://t.co/G7X3yUWSrr
@wightmanr Inference time is great! But a slightly more sophisticated model capturing compute time and communication time as different dimensions allows evaluating across accelerators from big to tiny: https://t.co/G7X3yUWSrr
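One way to sketch such a two-dimensional model, assuming a roofline-style estimate in which compute and data movement overlap perfectly. The linked model may differ, and the function name and hardware numbers below are hypothetical.

```python
def estimate_time(flops, bytes_moved, peak_flops, bandwidth):
    """Return (compute_s, comm_s, total_s) under a roofline-style assumption."""
    compute_s = flops / peak_flops        # time if only arithmetic mattered
    comm_s = bytes_moved / bandwidth      # time if only data movement mattered
    # Assumes perfect overlap, so the slower of the two dominates.
    return compute_s, comm_s, max(compute_s, comm_s)

# Hypothetical workload: 2 GFLOP of math and 1 GB of memory traffic.
big = estimate_time(2e9, 1e9, peak_flops=100e12, bandwidth=1e12)
tiny = estimate_time(2e9, 1e9, peak_flops=1e12, bandwidth=10e9)
print(big, tiny)
```

With these made-up numbers both accelerators are limited by data movement rather than arithmetic, which a FLOPs-only comparison would not show.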
@ID_AA_Carmack At IBM Research there was a distinguished engineer who loved to joke that they became distinguished by turning an O(n log n) approach into a faster O(n^2) algorithm. Sometimes the "dumb" brute force wins!
@karpathy Unlike compilers, leading a team gives a lot of wiggle room to change requirements to remove hard dependencies and give more parallelism. Another interesting perspective is organizing with robustness to failure that small world networks exhibit but can be slow to reach consensus