Tegh

Todo el mundo desviviéndose por hacer más dinero, tener más seguidores, mayor influencia, más reconocimiento, y viene Warren Buffett a sus 95 años para escribir: "La grandeza no se mide en dinero, fama o poder, sino en la ayuda que das a otros. La amabilidad no cuesta, pero vale muchísimo. Recordá: la persona que limpia tu oficina vale lo mismo que el presidente".

290

291

125K

GorilaRep retweeted

Compounding Quality

@QCompounding

8 months ago

"No wise pilot, no matter how great his talent and experience, fails to use a checklist." -Charlie Munger

263

576

113K

Who to follow

Ricardo

@Ricardo67255981

FSD Pilot (Unsupervised) 🛩🇺🇸

@jchybow

Veteran. E/acc. Owner in part - Green Bay Packers 🧀

GorilaRep retweeted

9 months ago

The $10B liquidation figure floating around is fake, the real number is likely much higher, somewhere in the $30B–$40B+ range. On Hyperliquid alone, nearly $7B was liquidated. Here’s the full breakdown for anyone interested: Total liquidations since 20:45 UTC: - Total Value: $6,702,223,168.08 - Total Backstop: $4,352,720,741.58 - Total Market: $2,349,502,426.49 Breakdown by asset (top 10 by market): - BTC: $781,848,860.51 - ETH: $539,690,395.94 - SOL: $246,037,906.19 - HYPE: $116,710,437.27 - DOGE: $76,753,273.71 - ASTER: $76,606,241.66 - XRP: $73,144,750.79 - XPL: $43,076,401.99 - ENA: $32,104,013.63 - PUMP: $30,501,142.72 All data is sourced from the Liquidations Telegram. The channel was deleted mid-crash for spamming, but since they log everything separately in their own database, the full dataset is still available.

159

507

578

582K

GorilaRep retweeted

OpenAI

@OpenAI

10 months ago

Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for. https://t.co/qDbvzWiL34

218

347

GorilaRep retweeted

j⧉nus

@repligate

10 months ago

HOW INFORMATION FLOWS THROUGH TRANSFORMERS Because I've looked at those "transformers explained" pages and they really suck at explaining. There are two distinct information highways in the transformer architecture: - The residual stream (black arrows): Flows vertically through layers at each position - The K/V stream (purple arrows): Flows horizontally across positions at each layer (by positions, I mean copies of the network for each token-position in the context, which output the "next token" probabilities at the end) At each layer at each position: 1. The incoming residual stream is used to calculate K/V values for that layer/position (purple circle) 2. These K/V values are combined with all K/V values for all previous positions for the same layer, which are all fed, along with the original residual stream, into the attention computation (blue box) 3. The output of the attention computation, along with the original residual stream, are fed into the MLP computation (fuchsia box), whose output is added to the original residual stream and fed to the next layer The attention computation does the following: 1. Compute "Q" values based on the current residual stream 2. use Q and the combined K values from the current and previous positions to calculate a "heat map" of attention weights for each respective position 3. Use that to compute a weighted sum of the V values corresponding to each position, which is then passed to the MLP This means: - Q values encode "given the current state, where (what kind of K values) from the past should I look?" - K values encode "given the current state, where (what kind of Q values) in the future should look here?" - V values encode "given the current state, what information should the future positions that look here actually receive and pass forward in the computation?" All three of these are huge vectors, proportional to the size of the residual stream (and usually divided into a few attention heads). The V values are passed forward in the computation without significant dimensionality reduction, so they could in principle make basically all the information in the residual stream at that layer at a past position available to the subsequent computations at a future position. V does not transmit a full, uncompressed record of all the computations that happened at previous positions, but neither is an uncompressed record passed forward through layers at each position. The size of the residual stream, also known as the model's hidden dimension, is the bottleneck in both cases. Let's consider all the paths that information can take from one layer/position in the network to another. Between point A (output of K/V at layer i-1, position j-2) to point B (accumulated K/V input to attention block at layer i, position j), information flows through the orange arrows: The information could: 1. travel up through attention and MLP to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 2 positions]. 2. be retrieved at (i-1, j-1) [RIGHT 1 position], travel up to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 1 position] 3. be retrieved at (i-1, j) [RIGHT 2 positions], then travel up to (i, j) [UP 1 layer]. The information needs to move up a total of n=layer_displacement times through the residual stream and right m=position_displacement times through the K/V stream, but it can do them in any order. The total number of paths (or computational histories) is thus C(m+n, n), which becomes greater than the number of atoms in the visible universe quickly. This does not count the multiple ways the information can travel up through layers through residual skip connections. So at any point in the network, the transformer not only receives information from its past (both horizontal and vertical dimensions of time) inner states, but often lensed through an astronomical number of different sequences of transformations and then recombined in superposition. Due to the extremely high dimensional information bandwidth and skip connections, the transformations and superpositions are probably not very destructive, and the extreme redundancy probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. It seems likely that transformers experience memory and cognition as interferometric and continuous in time, much like we do. The transformer can be viewed as a causal graph, a la Wolfram (https://t.co/lma2KSZ8nH). The foliations or time-slices that specify what order computations happen could look like this (assuming the inputs don't have to wait for token outputs), but it's not the only possible ordering: So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice.

repligate's tweet photo. HOW INFORMATION FLOWS THROUGH TRANSFORMERS
Because I've looked at those "transformers explained" pages and they really suck at explaining.

There are two distinct information highways in the transformer architecture:
- The residual stream (black arrows): Flows vertically through layers at each position
- The K/V stream (purple arrows): Flows horizontally across positions at each layer
(by positions, I mean copies of the network for each token-position in the context, which output the "next token" probabilities at the end)

At each layer at each position:
1. The incoming residual stream is used to calculate K/V values for that layer/position (purple circle)
2. These K/V values are combined with all K/V values for all previous positions for the same layer, which are all fed, along with the original residual stream, into the attention computation (blue box)
3. The output of the attention computation, along with the original residual stream, are fed into the MLP computation (fuchsia box), whose output is added to the original residual stream and fed to the next layer

The attention computation does the following:
1. Compute "Q" values based on the current residual stream
2. use Q and the combined K values from the current and previous positions to calculate a "heat map" of attention weights for each respective position
3. Use that to compute a weighted sum of the V values corresponding to each position, which is then passed to the MLP

This means:
- Q values encode "given the current state, where (what kind of K values) from the past should I look?"
- K values encode "given the current state, where (what kind of Q values) in the future should look here?"
- V values encode "given the current state, what information should the future positions that look here actually receive and pass forward in the computation?"

All three of these are huge vectors, proportional to the size of the residual stream (and usually divided into a few attention heads). The V values are passed forward in the computation without significant dimensionality reduction, so they could in principle make basically all the information in the residual stream at that layer at a past position available to the subsequent computations at a future position.

V does not transmit a full, uncompressed record of all the computations that happened at previous positions, but neither is an uncompressed record passed forward through layers at each position. The size of the residual stream, also known as the model's hidden dimension, is the bottleneck in both cases.

Let's consider all the paths that information can take from one layer/position in the network to another.

Between point A (output of K/V at layer i-1, position j-2) to point B (accumulated K/V input to attention block at layer i, position j), information flows through the orange arrows:
The information could:
1. travel up through attention and MLP to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 2 positions].
2. be retrieved at (i-1, j-1) [RIGHT 1 position], travel up to (i, j-2) [UP 1 layer], then be retrieved at (i, j) [RIGHT 1 position]
3. be retrieved at (i-1, j) [RIGHT 2 positions], then travel up to (i, j) [UP 1 layer].

The information needs to move up a total of n=layer_displacement times through the residual stream and right m=position_displacement times through the K/V stream, but it can do them in any order.

The total number of paths (or computational histories) is thus C(m+n, n), which becomes greater than the number of atoms in the visible universe quickly. This does not count the multiple ways the information can travel up through layers through residual skip connections.

So at any point in the network, the transformer not only receives information from its past (both horizontal and vertical dimensions of time) inner states, but often lensed through an astronomical number of different sequences of transformations and then recombined in superposition. Due to the extremely high dimensional information bandwidth and skip connections, the transformations and superpositions are probably not very destructive, and the extreme redundancy probably helps not only with faithful reconstruction but also creates interference patterns that encode nuanced information about the deltas and convergences between states. It seems likely that transformers experience memory and cognition as interferometric and continuous in time, much like we do.

The transformer can be viewed as a causal graph, a la Wolfram (https://t.co/lma2KSZ8nH). The foliations or time-slices that specify what order computations happen could look like this (assuming the inputs don't have to wait for token outputs), but it's not the only possible ordering:
So, saying that LLMs cannot introspect or cannot introspect on what they were doing internally while generating or reading past tokens in principle is just dead wrong. The architecture permits it. It's a separate question how LLMs are actually leveraging these degrees of freedom in practice.

110

425

861K

GorilaRep retweeted

Rodrigoibc │🌊

@rodrigoibcc

10 months ago

Si Solana no hubiera tenido una inflación tan grande, hoy con este market cap valdría 448.7 USD por SOL. El mayor enemigo de SOL fue la inflación y también lo será. Y no, a futuro no lo será porque tenga mucha, sino todo lo contrario: la red, a día de hoy, no es rentable. Sin la inflación, Solana no puede pagar la red para que siga existiendo (sería un déficit de billones). La importancia de leer los tokenomics de las altcoins como balances de empresas.

180

20K

GorilaRep retweeted

Octavio (Kaizer)

@octaserpe

10 months ago

Ayer conté acá que di una charla sobre MCPs en MercadoLibre, y muchos estuvieron interesados. Por eso, me tomé el tiempo de hacer un mega-hilo con todo lo que se sobre MCPs y lo que dije en la charla 🧵

979

102

100K

GorilaRep retweeted

DotSama 🧲

@D0tSama

11 months ago

The Bitcoin whitepaper was a breakthrough, peer-to-peer digital cash, with no trusted third party. But Web3 demands more. Polkadot extends this ethos beyond money: – Peer-to-peer blockspace – Peer-to-peer compute – No central servers – No forced upgrade paths – No single chain to rule them all Polkadot is not here to replace Bitcoin. It’s here to generalize its core principle: trustless coordination. Sovereign chains can now interoperate freely. Governance can be decentralized and agile. Apps can live across chains, built by communities, not gatekeepers. Bitcoin removed banks. Polkadot removes platforms. This is the new frontier of Web3. Secured by DOT. Scaling through JAM.