xpmanoj @xpmanoj - Twitter Profile

Protocol Models are decentrally trained neural nets whose full weights are not extractable by any single actor. Here’s how it works: A model designer commits enough compute capital to show they have skin in the game. The rest of the compute providers in the network vote with compute to train the model. If the model design attracts enough compute, it gets trained. Each compute provider gets ownership in the trained model proportional to the risk-adjusted compute they contributed, as measured in verified FLOPs. Risk-adjusted means compute contributed at the beginning of a training run is weighted more heavily than compute contributed at the end, since model performance is not clear at the start. The ownership is represented by a tradable credential issued once the model is trained. That credential gets consumed when an inference query runs the model. The credential is re-issued to the owner once it's consumed.
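
A minimal sketch of the ownership split described above. The post does not say how the risk adjustment is computed, so the linear schedule here (early compute counts `early_weight` times, compute at the very end counts once) is purely an assumption, as are the names `Contribution`, `risk_weight` and `ownership_shares` and the sample figures.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    provider: str
    flops: float     # verified compute contributed in this slice of the run
    progress: float  # fraction of the run already completed (0.0 = start, 1.0 = end)


def risk_weight(progress: float, early_weight: float = 2.0) -> float:
    """Hypothetical schedule: weight falls linearly from early_weight to 1.0."""
    return early_weight - (early_weight - 1.0) * progress


def ownership_shares(contributions, early_weight: float = 2.0) -> dict:
    """Share of the trained model per provider, proportional to risk-adjusted compute."""
    adjusted = {}
    for c in contributions:
        weighted = c.flops * risk_weight(c.progress, early_weight)
        adjusted[c.provider] = adjusted.get(c.provider, 0.0) + weighted
    total = sum(adjusted.values())
    return {provider: value / total for provider, value in adjusted.items()}


# Two providers contribute the same raw compute at opposite ends of the run.
contributions = [
    Contribution("alice", flops=1e18, progress=0.0),  # joined at the start
    Contribution("bob", flops=1e18, progress=1.0),    # joined at the very end
]
shares = ownership_shares(contributions)
print(shares)
```

Under this assumed schedule the early contributor ends up with twice the late contributor's share for the same raw compute; a real protocol could use any decreasing weight function.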

0

51

0

22

4 days ago

When the largest distribution channels of Solana and Ethereum are distributing Hyperliquid, you know where the puck is going to be. Hyperliquid.

https://t.co/e2I5GuIOOh

0

1

0

22

xpmanoj

@xpmanoj

6 days ago

@RyanWatkins_ Seeing it through is the most difficult part. Your conviction as represented by your size will get tested by the market multiple times, brutally during sharp drawdowns. And that’s the only moment you know what your conviction is made of.

0

11

xpmanoj

@xpmanoj

6 days ago

It's incredible that this single slide from Andrew Ng's 2012 Machine Learning course still captures the spine of every major AI breakthrough since. In the spirit of a refresher, here's the breakdown:

J(Θ) is a single number, the cost function, that tells you how wrong the model is right now. You want this number to be low.

You reach a low number by penalising wrong guesses. So you want J(Θ) to be a penalty function that does four things: score every guess; punish confident-and-wrong far harder than slightly-wrong; reward confident-and-right with near-zero penalty; and combine these penalties cleanly across millions of examples. The y log(h) + (1−y) log(1−h) form is a simple form that does all four. y is the right answer (0 or 1); h is the model's guess (a probability). When the model is confident and right, this term is ~0 (no penalty); when it is confident and wrong, the term blows up (big penalty). The two terms work like a toggle: because y is either 0 or 1, only one is ever active, and together they capture the wrongness of one guess. In ML, this is called cross-entropy loss.

The double sum (Σ over m examples, Σ over K output classes) just says: run this cross-entropy loss for every training example, and for each one check every possible answer. Add up the wrongness.

The −1/m out front: flip the sign and average, so the number is positive and comparable across dataset sizes.

The second line (λ/(2m) · ΣΣΣ Θ²) is regularization: a small extra penalty for letting weights get too big, which discourages memorising. λ is the tunable knob. This is a refinement; skip it on a first pass.

So the whole top block = "one number for total wrongness, averaged, with a tidiness penalty." That's it.

min J(Θ) reads: find the weights that make the wrongness smallest. That right there describes AI training in a single expression. And to do it by rolling downhill (gradient descent), you need two things:

1. J(Θ): the wrongness itself (how high on the hill am I).

2. ∂J/∂Θ: the gradient. For each weight, which way and how hard to nudge it to reduce wrongness. This is the partial-derivative line at the very bottom (the ∂/∂Θ term).

That second item is where backprop comes in. Backprop is the efficient algorithm for computing ∂J/∂Θ, every weight's nudge, in one backward sweep. This is the foundational algorithm behind everything from AlexNet to ChatGPT, and likely behind whatever comes next.

Every transformer, including every SOTA frontier model, trains by computing this same ∂J/∂Θ and stepping downhill. The math on this slide is the literal foundation of all GPTs.
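
To make the pieces concrete, here is a small sketch of J(Θ), its gradient, and gradient descent for the simplest possible case: a single sigmoid output (K = 1) with no hidden layers. That simplification is an assumption made for brevity; the slide covers a multi-layer network, where backprop computes the same gradient layer by layer. The toy data, learning rate and function names are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """h: the model's guess, a probability between 0 and 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, X, y, lam=0.0):
    """J(theta): averaged cross-entropy plus an L2 penalty on the non-bias weights."""
    m = len(X)
    eps = 1e-12  # keeps log() finite if h reaches exactly 0 or 1
    total = 0.0
    for x, target in zip(X, y):
        h = min(max(predict(theta, x), eps), 1.0 - eps)
        total += target * math.log(h) + (1 - target) * math.log(1 - h)
    penalty = lam / (2 * m) * sum(t * t for t in theta[1:])
    return -total / m + penalty


def gradient(theta, X, y, lam=0.0):
    """dJ/dtheta for this one-layer model (closed form, so no backprop is needed)."""
    m = len(X)
    grad = [0.0] * len(theta)
    for x, target in zip(X, y):
        err = predict(theta, x) - target
        for j, xj in enumerate(x):
            grad[j] += err * xj / m
    for j in range(1, len(theta)):  # the bias weight theta[0] is not regularized
        grad[j] += lam / m * theta[j]
    return grad


def gradient_descent(theta, X, y, lr=0.5, steps=200, lam=0.0):
    """Repeatedly step downhill along the negative gradient."""
    for _ in range(steps):
        g = gradient(theta, X, y, lam)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta


# Toy data: x[0] is a constant 1.0 (bias input); the label is 1 when x[1] > 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
theta_start = [0.0, 0.0]
theta_trained = gradient_descent(theta_start, X, y)
print(cost(theta_start, X, y), cost(theta_trained, X, y))
```

With all-zero weights every guess is 0.5, so the starting cost is ln 2 (about 0.693); after the descent loop the cost is lower, which is the whole point of min J(Θ).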

0

29

xpmanoj

@xpmanoj

8 days ago

Pluralis is making the same bet for AI that Bitcoin made for money: decentralised, sovereign and trustless. And it solves the value-capture problem open source never could.

Pluralis Research @Pluralis

9 days ago

The 8B model currently training on Agora is 350B tokens in and continuing to converge. The top-level metrics and evals look almost exactly like a centralised run. But:
- 133 external contributors in total, bringing 4090s, 5090s, L40S/RTX 6000 and RTX 6000 Pros. These are cards that people actually own; there are no H100s, B200s, etc.
- The max number of nodes the system can support (104) was filled almost immediately. The authorization layer is receiving approximately 100 requests/minute to join.
- The total tokens per second processed moves directly with the amount of compute in the swarm, with Agora constantly optimising to make the most efficient use of what hardware is present.
- MFU is approximately 20%, TPS is 170k tok/s. There are near-constant communication failures, which Agora is completely absorbing without slowdown.
- The system is effectively on auto-pilot, requiring very little intervention from us. Bad nodes are purged immediately before training is affected and new nodes take their place.
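
A rough back-of-the-envelope reading of the figures above, as a sketch only. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for dense transformer training, and it treats the reported 170k tok/s as constant, which the post itself says is not the case since throughput moves with the compute in the swarm. Neither assumption comes from Pluralis.

```python
PARAMS = 8e9            # model size, from the post
TOKENS_SEEN = 350e9     # tokens processed so far, from the post
TOKENS_PER_SEC = 170e3  # reported throughput
MFU = 0.20              # reported model FLOPs utilisation

# Assumption: ~6 FLOPs per parameter per token (forward plus backward pass).
FLOPS_PER_TOKEN = 6 * PARAMS

# Useful training compute actually delivered per second.
achieved_flops_per_sec = FLOPS_PER_TOKEN * TOKENS_PER_SEC

# Aggregate peak hardware throughput implied by the reported MFU.
implied_peak_flops_per_sec = achieved_flops_per_sec / MFU

# Time to reach 350B tokens if throughput had been constant (hypothetical).
days_at_constant_rate = TOKENS_SEEN / TOKENS_PER_SEC / 86400

print(f"achieved:     {achieved_flops_per_sec:.2e} FLOP/s")
print(f"implied peak: {implied_peak_flops_per_sec:.2e} FLOP/s")
print(f"days at constant rate: {days_at_constant_rate:.1f}")
```

Under these assumptions the swarm delivers on the order of 8 PFLOP/s of useful training compute, and 350B tokens would correspond to a little over three weeks of continuous training at the reported rate.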

4

140

16

48

58K

0

2

1

0

75

xpmanoj

@xpmanoj

9 days ago

@AlexanderLong There’s China commoditising the complement (open weights), and then there are frontier closed-source models. There are no real open-source models. Protocol learning is an elegant alternative.

0

8

xpmanoj

@xpmanoj

9 days ago

@jharohit Rooting for the generational @trans_celestial moment!

1

0

21

xpmanoj

@xpmanoj

11 days ago

The Pluralis thesis boils down to one question: Does subspace-compressed model parallelism hold convergence at scale? Can GPUs talking through the drinking straw of ordinary internet train something as smart as GPUs talking through the fire hoses of datacenters? That's the whole bet.

0

30

xpmanoj

@xpmanoj

11 days ago

Fable 5 is a Ferrari for the Mind!

0

21

xpmanoj

@xpmanoj

11 days ago

The Bitcoin quantum debate keeps circling one question: Is the threat real? That's not the question that matters. Bitcoin became a trillion-dollar asset because enough people believed it was a digital store of value — nobody proved anything. The risk prices the same way. If enough people believe the threat is real and that the fix will come too slow, it's in the price. The machine doesn't have to exist.

1

0

59

xpmanoj

@xpmanoj

12 days ago

Protocol Models are decentrally trained neural nets whose full weights are not extractable by any single actor. Here’s how it works: A model designer commits enough compute capital to show they have skin in the game. The rest of the compute providers in the network vote with compute to train the model. If the model design attracts enough compute, it gets trained. Each compute provider gets ownership in the trained model proportional to the risk-adjusted compute they contributed, as measured in verified FLOPs. Risk-adjusted means compute contributed at the beginning of a training run is weighted more heavily than compute contributed at the end, since model performance is not clear at the start. The ownership is represented by a tradable credential issued once the model is trained. That credential gets consumed when an inference query runs the model. The credential is re-issued to the owner once it's consumed.

0

51