Protocol Models are decentrally trained neural nets whose full weights are not extractable by any single actor.
Here’s how it works:
A model designer commits enough compute capital to show skin in the game. The rest of the compute providers in the network vote with their compute to train the model. If the model design attracts enough compute, it gets trained. Each compute provider gets ownership in the trained model proportional to their risk-adjusted compute contribution, measured in verified FLOPs. Risk-adjusted means compute contributed at the beginning of a training run is weighted more heavily than compute contributed at the end, since model performance is not yet clear at the start. Ownership is represented by a tradeable credential issued once the model is trained. That credential gets consumed when an inference query runs the model, and is re-issued to the owner once it's consumed.
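To make the risk-adjusted split concrete, here's a minimal sketch. Everything in it is an assumption for illustration: the linear decay in `risk_weight`, the `early_bonus` parameter, and the `(provider, verified_flops, progress)` log format are hypothetical, not the protocol's actual rules.

```python
from collections import defaultdict


def risk_weight(progress: float, early_bonus: float = 1.0) -> float:
    """Hypothetical weight: 1 + early_bonus at the start of the run,
    decaying linearly to 1 at the end. progress is in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    return 1.0 + early_bonus * (1.0 - progress)


def ownership_shares(contributions, early_bonus: float = 1.0) -> dict:
    """contributions: iterable of (provider, verified_flops, progress) tuples.
    Returns each provider's fraction of the trained model."""
    weighted = defaultdict(float)
    for provider, flops, progress in contributions:
        weighted[provider] += flops * risk_weight(progress, early_bonus)
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("no compute contributed")
    return {provider: w / total for provider, w in weighted.items()}


# Two providers contribute the same verified FLOPs at different times.
log = [
    ("alice", 100.0, 0.0),  # contributed at the very start of the run
    ("bob", 100.0, 1.0),    # contributed at the very end of the run
]
shares = ownership_shares(log)
```

With these assumed weights, the early contributor ends up with two thirds of the model and the late one with one third, despite identical raw compute.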
@RyanWatkins_ Seeing it through is the most difficult part. Your conviction as represented by your size will get tested by the market multiple times, brutally during sharp drawdowns. And that’s the only moment you know what your conviction is made of.
It's incredible this single slide from Andrew Ng's 2012 Machine Learning course still captures the spine of every major AI breakthrough since. In the spirit of a refresher, here's the breakdown:
J(Θ) is a single number, the cost function, that tells you how wrong the model is right now. Obviously, you want this number to be low.
And you reach a low number by penalising wrong guesses. So you want J(Θ) to be a penalty function that does four things: score every guess; punish confident-and-wrong far harder than slightly-wrong; reward confident-and-right with near-zero penalty; and combine these penalties cleanly across millions of examples. The y log(h) + (1−y) log(1−h) form is the one simple form that does all four. y is the right answer (0 or 1); h is the model's guess (a probability). When the model is confident and right, this term is ~0 (no penalty); confident and wrong, it blows up (big penalty). The two terms work like a toggle: only one is ever active, and together they capture the wrongness of one guess. In ML, this is called cross-entropy loss.
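A quick numeric sketch of that penalty for a single guess. The function name `cross_entropy` and the clamping constant `eps` are my own choices, and the sign is already flipped here so the penalty comes out positive.

```python
import math


def cross_entropy(y: int, h: float) -> float:
    """Penalty for one guess: -[y*log(h) + (1-y)*log(1-h)].
    y is the right answer (0 or 1); h is the predicted probability."""
    eps = 1e-12  # clamp so log(0) never happens
    h = min(max(h, eps), 1.0 - eps)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))


# The right answer is 1 in all three cases; only the guess changes.
confident_right = cross_entropy(1, 0.99)
slightly_wrong = cross_entropy(1, 0.40)
confident_wrong = cross_entropy(1, 0.01)
```

The three values come out in the order the post describes: close to zero for confident-and-right, moderate for slightly-wrong, and several times larger for confident-and-wrong.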
The double sum (Σ over m examples, Σ over K output classes) just says: run this cross-entropy loss for every training example, and for each one check every possible answer. Add up the wrongness.
The −1/m out front: flip the sign and average, so the number is positive and comparable across dataset sizes.
The second line (λ/2m · ΣΣΣ Θ²) is regularization — a small extra penalty for letting weights get too big, which discourages memorising. Tunable knob (λ). This is a refinement; skip it on a first pass.
So the whole top block = "one number for total wrongness, averaged, with a tidiness penalty." That's it.
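For reference, the pieces described above assemble into the following expression. The slide itself isn't reproduced in this thread, so the index names here (L layers, s_l units in layer l) are assumed from the usual course conventions.

```latex
J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K}
    y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
    + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \right]
  + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}
    \big(\Theta_{ji}^{(l)}\big)^2
```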
min J(Θ) reads: find the weights that make the wrongness smallest. That right there describes AI training in one short expression. And to do it by rolling downhill (gradient descent), you need two things:
1. J(Θ) — the wrongness itself (how high on the hill am I).
2. ∂J/∂Θ — the gradient: for each weight, which way and how hard to nudge it to reduce wrongness. This is the partial-derivative line at the very bottom (the ∂/∂Θ term).
That second item is Backprop. Backprop is the efficient algorithm for computing ∂J/∂Θ — every weight's nudge — in one backward sweep. This is the foundational algorithm behind everything from AlexNet to ChatGPT, and likely behind whatever comes next.
Every transformer, including every SOTA frontier model, trains by computing this same ∂J/∂Θ and stepping downhill. The math on this slide is the literal foundation of all GPTs.
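Here's the same loop in miniature: a sketch of gradient descent on a one-layer logistic model, where the backward sweep collapses to a single step. It's a toy under stated assumptions (no regularization, made-up data, hypothetical names `cost_and_grad` and `gradient_descent`), not how any production model is trained.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_grad(theta, xs, ys):
    """Return J(theta) and dJ/dtheta for a one-layer logistic model."""
    m = len(xs)
    cost = 0.0
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        cost += -(y * math.log(h) + (1 - y) * math.log(1.0 - h))
        # For this model the gradient of the loss w.r.t. each weight
        # is (h - y) * x_j, summed over examples.
        for j, xi in enumerate(x):
            grad[j] += (h - y) * xi
    return cost / m, [g / m for g in grad]


def gradient_descent(theta, xs, ys, lr=0.5, steps=200):
    """Repeatedly step each weight against its gradient."""
    history = []
    for _ in range(steps):
        cost, grad = cost_and_grad(theta, xs, ys)
        history.append(cost)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta, history


# Toy data: first feature is a constant 1 (bias); label is 1 when x > 0.
xs = [(1.0, -2.0), (1.0, -1.0), (1.0, 1.0), (1.0, 2.0)]
ys = [0, 0, 1, 1]
theta, history = gradient_descent([0.0, 0.0], xs, ys)
```

`history` records J(Θ) at each step, so you can watch the wrongness fall as the weights roll downhill.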
Pluralis is making the same bet for AI that Bitcoin made for money: decentralised, sovereign and trustless.
And it solves the value-capture problem open source never could.
The 8B model currently training on Agora is 350B tokens in and continuing to converge. The top-level metrics and evals look almost exactly like a centralised run. But:
- 133 external contributors in total, bringing 4090s, 5090s, L40S/RTX 6000s and RTX 6000 Pros. These are cards that people actually own; there are no H100s, B200s, etc.
- The max number of nodes the system can support (104) was filled almost immediately. The authorization layer is receiving approximately 100 requests/minute to join.
- Total tokens per second processed scales directly with the amount of compute in the swarm, with Agora constantly optimising to make the most efficient use of whatever hardware is present.
- MFU is approximately 20%, TPS is 170k tok/s. There are near-constant communication failures, which Agora is completely absorbing without slowdown.
- The system is effectively on auto-pilot, requiring very little intervention from us. Bad nodes are purged immediately before training is affected and new nodes take their place.
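A back-of-envelope sketch of what those throughput numbers imply. It assumes the common rule of thumb of roughly 6 FLOPs per parameter per training token, which ignores attention cost; the MFU figure above may well be computed differently, so treat the outputs as order-of-magnitude only. Function names are hypothetical.

```python
def training_flops_per_second(params: float, tokens_per_second: float) -> float:
    """Approximate useful training FLOP/s via the 6 * N * tokens/s
    rule of thumb (an assumption, not the run's own accounting)."""
    return 6.0 * params * tokens_per_second


def implied_peak_flops(achieved: float, mfu: float) -> float:
    """Aggregate hardware peak implied by an achieved rate and an MFU."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    return achieved / mfu


# Figures from the post: 8B parameters, 170k tok/s, about 20% MFU.
achieved = training_flops_per_second(8e9, 170_000)
peak = implied_peak_flops(achieved, 0.20)
```

Under that approximation the swarm is doing on the order of 8 PFLOP/s of useful training work, out of roughly 40 PFLOP/s of aggregate peak.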
@AlexanderLong there’s China commoditising the complement (open weights) and then there are frontier closed source models. there are no real open source models. protocol learning is an elegant alternative.
The Pluralis thesis boils down to one question:
Does subspace-compressed model parallelism hold convergence at scale?
Can GPUs talking through the drinking straw of ordinary internet train something as smart as GPUs talking through the fire hoses of datacenters?
That's the whole bet.
The Bitcoin quantum debate keeps circling one question: Is the threat real?
That's not the question that matters. Bitcoin became a trillion-dollar asset because enough people believed it was a digital store of value — nobody proved anything.
The risk prices the same way. If enough people believe the threat is real and that the fix will come too slow, it's in the price. The machine doesn't have to exist.