Emmett @EmmettStream - Twitter Profile

over 3 years ago

@dcburger @karpathy Efficiency is relative to other DL architectures. From an engineering perspective, biology today fails #1 and #2, as it's neither expressive nor optimized via backprop/gradient descent. Transformers are a nice balance of all three.

0

2

0

Emmett @EmmettStream

over 3 years ago

@bbabenko Fascinating stuff! https://t.co/JI2RKDB0GI

Daniela Witten @daniela_witten

almost 6 years ago

So, "double descent" is happening b/c DF isn't really the right quantity for the x-axis: like, the fact that we are choosing the minimum norm least squares fit actually means that the spline with 36 DF is **less** flexible than the spline with 20 DF. Crazy, huh? 19/

5

181

6

8

0
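A minimal sketch of the minimum-norm point in the tweet above, under stated assumptions: it uses random Gaussian features rather than a spline basis, and the names `X_full` and `min_norm_fit` are hypothetical. With 20 observations, the 20-feature fit is the unique interpolating solution, while the 36-feature fit is the smallest-norm solution among the many that interpolate. Because the 20 features are a subset of the 36, the 36-feature coefficient norm cannot be larger.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                  # number of observations
X_full = rng.standard_normal((n, 36))   # hypothetical features, not a spline basis
y = rng.standard_normal(n)

def min_norm_fit(X, y):
    # For an underdetermined system, lstsq returns the smallest 2-norm solution.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

w20 = min_norm_fit(X_full[:, :20], y)   # 20 features: the unique interpolating fit
w36 = min_norm_fit(X_full, y)           # 36 features: minimum-norm interpolating fit

norm20 = np.linalg.norm(w20)
norm36 = np.linalg.norm(w36)
print(norm20, norm36)
```

Coefficient norm is only one proxy for flexibility, but it shows why the raw feature count can be a misleading x-axis.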

Emmett @EmmettStream

over 3 years ago

@waxpancake Only the test data is documented, not training data. The training data is not detailed anywhere.

1

0

Emmett @EmmettStream

almost 4 years ago

@fchollet Turn The Ship Around by L David Marquet. It has nothing to do with tech but everything to do with management, turning the worst performing submarine in the fleet to the best with lasting performance. Refreshingly different perspective than being at a bull market tech company

0

2

0

Emmett @EmmettStream

almost 4 years ago

@bbabenko They used to cure that cut in a barrel called a "butt," and apparently the name stuck. Some have moved away from calling it "butt" in favor of "pork shoulder," but it is confusing when it gets labeled "butt shoulder"! The actual hind legs are "ham." https://t.co/VWpmNy9DRM

0

1

0

Emmett @EmmettStream

almost 4 years ago

@ericjang11 This is generally true of production ML where the train infra is different than test infra. With robotics you don't have to do compression+networking but still run into problems from time sync with multiple sensors, noise patterns with low exposure etc

0

1

0

Emmett @EmmettStream

about 4 years ago

@tunguz @jvrlrnzdz @rasbt ReviewNB or nbdime can be useful. ReviewNB is used by DALI and a few others https://t.co/qDwd5RPVVe

0

3

0

Emmett @EmmettStream

about 4 years ago

@tunguz @dirk_hovy Once you think of DNNs as just a scalable method to train the largest universal approximators with trillions of parameters and hundreds to thousands of layers, the question becomes much more difficult!

0

Emmett @EmmettStream

about 4 years ago

@MosaicML Is slowly changing variance a universal property of neural networks training with Adam or just some subset?

0

Emmett @EmmettStream

about 4 years ago

@gdb This has at times felt like an antipattern when the data has semantics that get lost in translation from a loop to various tensor operations. Named tensors can help, but it feels like there is still room for improvement in getting both clear intent and performance.

0

1

0

Emmett @EmmettStream

about 4 years ago

@tunguz In that room you learn to stop using OCaml and tail recursion optimization and be forever bound by Python because Guido knows best https://t.co/wveQK9m8bb

0

4

0

Emmett @EmmettStream

over 4 years ago

@jefrankle @MosaicML My impression is that GPU JPEG decoding has been merged into torchvision for a while. Is this a bottleneck with CPU-based decoding? https://t.co/PFucI2PAPA

1

3

0

Emmett @EmmettStream

over 4 years ago

@nairbv @soumithchintala The chart seems a little confusing. The 40MB of on-chip memory listed for the A100 probably refers to local memory, cache, registers, etc., but the bandwidth row makes an apples-to-oranges comparison of off-chip bandwidth for the GPU against on-chip bandwidth for the WSE-2.

1

2

0

Emmett @EmmettStream

over 4 years ago

@_brohrer_ You might enjoy a network of Sinc functions: https://t.co/PxhkD4FIXI

0

2

0

Emmett @EmmettStream

over 4 years ago

@shoyer Async / event-based networks require event-based inputs, but in practice most sensor inputs are not fully event-based, so you don't see effective improvements in overall inference performance. Also extra-large models that require more memory than fits on a single GPU

0

Emmett @EmmettStream

over 4 years ago

@xamat @Strava OpenStreetMap also has this data available via clients like Organic Maps. Very handy for the long runs / rides! https://t.co/nom3D1gQUb

0

Emmett @EmmettStream

over 4 years ago

@cHHillee @giffmana The disagreement is that memory and communication are the bottleneck, not compute. This is due to physical constraints (energy per bit goes up with the distance the bit travels). Other models fit reality better than FLOPs or MACs or OPs https://t.co/G7X3yUWSrr

1

6

1

0
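One simple way to treat compute and data movement as separate dimensions, as the tweet above argues for, is a roofline-style estimate. This is a sketch under assumptions and not necessarily the model behind the linked URL: the function name `estimate_time` and the accelerator figures are hypothetical, and real hardware overlaps and stalls in ways this ignores.

```python
def estimate_time(flops, bytes_moved, peak_flops, bandwidth):
    """Return (compute_seconds, memory_seconds, bound) for one operation.

    Assumes compute and data movement overlap perfectly, so the slower
    of the two determines the runtime.
    """
    compute_s = flops / peak_flops
    memory_s = bytes_moved / bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return compute_s, memory_s, bound

# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

# Elementwise add of two fp32 vectors with 1e9 elements each:
# 1e9 additions, and 3 arrays * 1e9 elements * 4 bytes moved.
compute_s, memory_s, bound = estimate_time(1e9, 3 * 1e9 * 4, PEAK_FLOPS, BANDWIDTH)
print(compute_s, memory_s, bound)
```

Under these made-up numbers, moving the data takes far longer than the arithmetic, which is the sense in which a FLOP count alone misses the bottleneck.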

Emmett @EmmettStream

over 4 years ago

@wightmanr Inference time is great! But a slightly more sophisticated model capturing compute time and communication time as different dimensions allows evaluating across accelerators from big to tiny: https://t.co/G7X3yUWSrr

1

4

0

2

0

Emmett @EmmettStream

over 4 years ago

@ID_AA_Carmack At IBM Research there was a distinguished engineer who loved to make a joke about how they became distinguished from turning an O(n log n) approach into a faster O(n^2) algorithm. Sometimes the "dumb" brute force wins!

1

4

0

1

0

Emmett @EmmettStream

over 4 years ago

@karpathy Unlike compilers, leading a team gives a lot of wiggle room to change requirements to remove hard dependencies and give more parallelism. Another interesting perspective is organizing with robustness to failure that small world networks exhibit but can be slow to reach consensus

0

2

0
