tranesonic

@bryan_johnson Practical things:
- Seek out an IFS (Internal-Family-Systems) therapist and do some trauma work.
- Seek out true indigenous healers and practitioners that work with this medicine to help you integrate your experience.
- Seek out Holotropic Breathwork practitioners.

tranesonic

@tranesonics

7 months ago

@bryan_johnson Breathe. Allow. Feel. You just had a major life experience and it takes time to integrate (and come down from). Much respect for having the courage to dive in, and allow yourself to explore whatever comes from it. Don't be afraid.

tranesonic

@tranesonics

7 months ago

@kelxyz_ You tapping dat W2 life?

tranesonic

@tranesonics

8 months ago

@ericreator @yacineMTB Not long now...

tranesonic

@tranesonics

8 months ago

@kelxyz_ gg. Say bye bye to your funds. Coinbase is cancer.

tranesonic

@tranesonics

8 months ago

Schelling, i.e. Thomas Schelling of the RAND Corporation in the 1950s (you know this, but for the viewers' edification). He used this to solve the "Stag Hunt" problem in The Strategy of Conflict (1960): https://t.co/FeCcOnEm46

191

tranesonic

@tranesonics

9 months ago

The next phase of blockchain turns consensus into a general‑purpose verification engine: the “work” that secures the ledger will be the very computations society already needs—AI training and inference, and zero‑knowledge proofs—verified succinctly and priced in open markets. In the short run, verifiable compute markets and inference‑as‑work chains will dominate the adoption curve; in the medium run, decentralized training and optimization‑as‑consensus will push blockchains beyond finance into AI, science, and industrial operations. The winning designs will minimize verification cost, keep miner competition fair, and expose proof‑native primitives that applications can compose like any other smart‑contract call.

186

tranesonic

@tranesonics

10 months ago

https://t.co/k7yykjQqvn

319

tranesonic

@tranesonics

10 months ago

"Small Language Models are the Future of Agentic AI." Yes. Full stop. https://t.co/hmI600l5Wd

226

tranesonic

@tranesonics

10 months ago

https://t.co/H1rRx43XbA

tranesonic

@tranesonics

10 months ago

@afurgs Let's chat about what we can do together. We're working toward the same goal at @FLOpsInc. We're also doing a Spaces series, "Training without Borders," where we talk to founders, engineers, protocols, and operators in the DeAI space. Would love to have you on sometime!

tranesonic

@tranesonics

10 months ago

@kelxyz_ Two: Isaac Asimov (The Last Question): https://t.co/VQL1Rvzp8O Philip K. Dick (VALIS: not exactly 2050, but incredible): https://t.co/IOvA7Asb6n

tranesonics retweeted

Brew Markets

@brewmarkets

10 months ago

What a chart.

412

629

309K

tranesonic

@tranesonics

10 months ago

🧵 Decentralized Training SOTA Report (2025)

TL;DR: since late‑2023 we’ve gone from “promising demos” to a toolkit you can actually build with. Three big thrusts:
• Low‑communication data‑parallel (DiLoCo → Streaming/Eager; DeMo/DisTrO; NoLoCo; DES‑LOC)
• WAN‑tolerant model/pipeline parallel (Protocol Models; async PP w/ Nesterov; activation quantization)
• Schedulers/topologies for messy networks (SWARM, Teleportation, hierarchical/“epidemic” sync)

We'll briefly analyze each below (starting from NoLoCo and fanning out). ⤵️

1. Low‑communication data‑parallel training

NoLoCo (2025): no all‑reduce, gossip‑style sync
• Idea: ditch global collectives. Periodically pair replicas and average weights inside a Nesterov‑style outer loop.
• Why it matters: for a few hundred accelerators over the internet, NoLoCo’s sync is ~10× faster than DiLoCo’s all‑reduce; across 125M–6.8B params, up to 4% faster convergence at the same loss. Open‑sourced. (arXiv)

DiLoCo (2023→2024): large local steps + outer Nesterov
• Idea: many inner AdamW steps locally; infrequent global sync via Nesterov “pseudo‑grads.”
• Evidence: ~500× less comms than fully synchronous at parity (8 workers). Robust to churn & data skew. OpenDiLoCo: 2 continents / 3 countries, 90–95% GPU util, scaled to B‑param models via Hivemind. (arXiv)

Streaming DiLoCo (2025) + Eager Updates (2025)
• Idea: stream subsets of params to cut peak bandwidth; overlap comms/compute; quantize exchanges. Eager fully overlaps the outer step with the next inner loop.
• Results: ~100× peak‑bandwidth reduction vs baseline DiLoCo; faster wall‑clock in WAN settings. (arXiv)

DeMo (2024) → DisTrO (2024/2025): decouple momentum
• Idea: let optimizer states drift per‑replica; sync only fast‑moving parts, keep momentum mostly local.
• Evidence: orders‑of‑magnitude less traffic vs AdamW/DDP with matched or better convergence; used in practice >10B params; built for WAN. (arXiv, GitHub)

DES‑LOC (2025)
• Idea: “desync” schedules, i.e. different periods for params vs momentum.
• Evidence: up to 170× less comms than DDP and 2× less than prior Local‑Adam on models to 1.7B. (arXiv)

Async Local‑SGD line (2024–2025)
• DeepMind: naïve async hurts via momentum on stale grads; fix with delayed Nesterov + adaptive local steps; matches sync Local‑SGD to 150M.
• PALSGD (2025): pseudo‑sync to lengthen intervals while keeping consistency.
• HALoS (2025): hierarchical async (regional PS + global), up to 7.5× faster convergence vs sync baselines in geo‑LLM training. (arXiv)

2. Communication‑efficient model/pipeline parallel (the WAN‑hard part)

Protocol Models (Pluralis, 2025): compress activations + back‑activations
• Problem: data‑parallel compression doesn’t help when shards must ship activations every microbatch.
• Idea: exploit rank collapse in transformer projections; constrain to low‑rank subspaces so activations live in a predictable, reconstructable subspace.
• Results: up to ~100× end‑to‑end comm reduction; trains an 8B LLaMA split across 4 regions over ~80 Mb/s with DC‑level convergence (baselines at 100 Gbps). (arXiv)

Asynchronous pipeline with Nesterov (2025)
• Idea: modify Nesterov look‑ahead to compensate for staleness in fully async PP; proof + code.
• Results: on decoder‑only LMs to 1B params, outperforms other async baselines and can beat synchronous PP. (arXiv)

Activation quantization for slow links (2025)
• TAH‑Quant: 3–4‑bit, tile‑adaptive + Hadamard transform to tame outliers; SGD‑like rate.
• Results: up to 4.3× end‑to‑end speedup with stable convergence and no extra memory. (arXiv)

3. Schedulers/topologies for unreliable, heterogeneous networks

SWARM parallelism (ICML’23): still the WAN PP reference
• Idea: stochastic, self‑healing pipelines; fast devices do more, slow/preempted do less; randomized rewiring handles failures.
• Results: 1B‑param training on preemptible T4s with <200 Mb/s; “square‑cube law” intuition: bigger models can be easier to WAN‑train. (arXiv, PMLR)

Teleportation (ICLR’25)
• Idea: activate a subset of nodes each step, gossip within, “teleport” the active set to avoid spectral‑gap slowdowns as N grows.
• Results: stable accuracy at large node counts; efficient rule to tune active‑set size. (arXiv, OpenReview)

Epidemic/randomized sync & model fragmentation
• Epidemic learning: randomized, partially‑overlapping sync patterns.
• Model fragmentation (2024): combine async decentralization with fragment‑level updates to reduce staleness. (arXiv)

Hierarchical/geo‑aware designs
• HALoS: explicit intra‑ vs inter‑region behavior via local/global PS.
• Varuna (EuroSys’22): strong systems baseline for low‑cost PP on spot/preemptible VMs. (arXiv, PDL)

Where we started (NoLoCo’s refs → the graph)
From NoLoCo’s bibliography we branched to:
• DiLoCo (orig + OpenDiLoCo replication) → Streaming/Eager variants → async Local‑SGD fixes (staleness, overlap, sync freq).
• WAN‑tolerant schedulers (SWARM) + topology theory (Teleportation).
• Beyond DDP compression (Protocol Models; activation quant) + decoupled/desynced optimizers (DeMo/DisTrO; DES‑LOC). (arXiv)

What’s deployable today?
On public internet (≈80–500 Mb/s):
• Data‑parallel: DiLoCo/OpenDiLoCo, Streaming DiLoCo (overlap+quant), NoLoCo, DeMo/DisTrO, DES‑LOC, provided each node can hold the full model.
• Model/pipeline: Protocol Models for true multi‑region MP; async PP + Nesterov if you can tolerate asynchrony; SWARM for stochastic, failure‑tolerant pipelines. (arXiv)

Proof points:
• OpenDiLoCo: 90–95% util across 2 continents / 3 countries.
• Protocol Models: 8B across 4 regions at ~80 Mb/s with DC‑level convergence.
• SWARM: 1B on <200 Mb/s preemptible nodes. (arXiv)

Gaps & open problems (2025 snapshot)
• Verifying off‑chain training: “proof‑of‑learning” is maturing; need practical, low‑overhead LLM‑scale proofs for marketplaces.
• Privacy & data locality: async/gossip helps, but cross‑border PII + sector rules need careful routing + audit.
• WAN model‑parallel: Protocol Models are a leap, but need independent replications at >10B and at very low bandwidth; activation quant is promising but new. (arXiv)

Quick reader’s map (hand‑picked & why)
• NoLoCo (2025): no all‑reduce; pairwise averaging; ~10× faster sync vs DiLoCo; up to 4% faster convergence. The gossip intro. (arXiv)
• DiLoCo (2023/24) → OpenDiLoCo: canonical local‑steps + outer momentum; reproducible over continents. (arXiv)
• Streaming DiLoCo & Eager (2025): overlap/quant to slash peak bw; WAN‑practical. (arXiv)
• DeMo (2024) & DisTrO (2024/25): decoupled momentum → orders‑of‑magnitude less traffic; WAN‑ready. (arXiv, GitHub)
• DES‑LOC (2025): desynced schedules; strong empirical reductions. (arXiv)
• Protocol Models (2025): first convincing recipe to compress activations/back‑acts for WAN MP; 8B, 4 regions, ~80 Mb/s. (arXiv)
• Async PP + Nesterov (2025); TAH‑Quant (2025): async pipelines with theory + WAN‑oriented activation quant. (arXiv)
• SWARM (ICML’23): stochastic, failure‑tolerant pipelines. (arXiv)
• Async Local‑SGD (DeepMind’24), PALSGD (’25), HALoS (’25): what breaks (staleness) and how to fix it (delayed Nesterov, pseudo‑sync, hierarchy). (arXiv)
• Context/replications: OpenDiLoCo notes; DiPaCo (modular paths + DiLoCo) pairs well with WAN training. (arXiv)

Practical guidance (what to try first)
• If every node fits the model: start with OpenDiLoCo or NoLoCo; add Streaming DiLoCo overlap/quant when links spike; try DeMo/DES‑LOC optimizers.
• If you must split the model: use Protocol Models for WAN‑safe PP; need full asynchrony? Test Async‑PP + Nesterov; if bandwidth binds, add TAH‑Quant.
• If nodes churn or vary: SWARM‑style stochastic pipelines (or hierarchical HALoS) to keep throughput high. (arXiv)

Libraries & code you’ll actually touch
• OpenDiLoCo: code + solid replication write‑ups.
• DisTrO: open repo + prelim report.
• Async‑PP (Pluralis): code links available.
• Hivemind: still a handy DHT/NAT‑piercing substrate for P2P‑style scheduling. (GitHub, arXiv)

Sources:
• NoLoCo (2025): Kolehmainen et al., arXiv:2506.10911
• DiLoCo (2023→2024): Douillard et al., arXiv:2311.08105; OpenDiLoCo (2024): Jaghouar et al., arXiv:2407.07852
• Streaming DiLoCo (2025); Eager Updates (2025)
• DeMo (2024); DisTrO (2024/25)
• DES‑LOC (2025)
• Protocol Models (2025): Ramasinghe et al., arXiv:2506.01260
• Async PP + Nesterov (2025): Ajanthan et al., arXiv:2505.01099
• TAH‑Quant (2025): He et al., arXiv:2506.01352
• SWARM (ICML’23): Ryabinin et al.
• Async Local‑SGD (2024), PALSGD (2025), HALoS (2025)

See my full article write‑up on https://t.co/dGRPRNzFdC, which goes into much more depth: https://t.co/5G8qH1zhve
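
The DiLoCo recipe from the thread (many local steps, then an infrequent outer Nesterov update on "pseudo-grads") can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: plain gradient descent stands in for the inner AdamW, each worker's "shard" is a quadratic with its own target, and every name and hyperparameter here (`local_sgd`, `outer_step`, `train`, `outer_lr=0.7`, `momentum=0.9`) is hypothetical and chosen only for the example.

```python
import random

def local_sgd(theta, target, steps, lr):
    """Inner phase on one worker: gradient descent on 0.5 * ||theta - target||^2.
    ASSUMPTION: plain SGD is a stand-in; the thread describes AdamW here."""
    theta = list(theta)
    for _ in range(steps):
        theta = [w - lr * (w - t) for w, t in zip(theta, target)]
    return theta

def outer_step(theta, velocity, worker_thetas, outer_lr, momentum):
    """Outer phase: Nesterov momentum applied to the averaged pseudo-gradient."""
    n = len(worker_thetas)
    new_theta, new_velocity = [], []
    for j, w in enumerate(theta):
        avg_local = sum(wt[j] for wt in worker_thetas) / n
        pseudo_grad = w - avg_local  # how far the workers moved, on average
        v = momentum * velocity[j] + pseudo_grad
        new_velocity.append(v)
        new_theta.append(w - outer_lr * (pseudo_grad + momentum * v))
    return new_theta, new_velocity

def train(targets, rounds=30, inner_steps=20, inner_lr=0.1,
          outer_lr=0.7, momentum=0.9):
    dim = len(targets[0])
    theta, velocity = [0.0] * dim, [0.0] * dim
    for _ in range(rounds):
        # Every worker starts from the shared weights and trains on its own shard.
        worker_thetas = [local_sgd(theta, t, inner_steps, inner_lr) for t in targets]
        # One exchange per round: only the final local weights are communicated.
        theta, velocity = outer_step(theta, velocity, worker_thetas,
                                     outer_lr, momentum)
    return theta

rng = random.Random(0)
targets = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)]  # 8 toy workers
optimum = [sum(t[j] for t in targets) / len(targets) for j in range(3)]
theta = train(targets)
error = max(abs(a - b) for a, b in zip(theta, optimum))
print("max distance from the joint optimum:", error)
```

With these toy settings the workers exchange weights 30 times in total instead of once per gradient step (600 steps each), which is the communication saving the thread is pointing at; the actual ratios quoted above come from the papers, not from this sketch.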

315

tranesonic

@tranesonics

10 months ago

Decided to upload most of these papers to a knowledge base wrapper on OpenAI: https://t.co/IPTIgwmL7s Enjoy.

220

tranesonic

@tranesonics

10 months ago

It's happening. You might not believe it, or may be in denial. This is coming, and there's no stopping this train (credit to Lyn Alden). I compiled a list of the most relevant decentralized training papers from 2021 onward in the xlsx sheet on Drive linked at the end of this post. You're welcome.

Some highlights:
- Hivemind (library) | 2021 | Library / P2P substrate | Peer-to-peer parameter averaging; DHT-based rendezvous | Internet-grade; NAT traversal; P2P | Designed for hundreds of peers; used in OpenDiLoCo | Averaging/opt steps over DHT; fault-tolerant backprop | Enabler substrate rather than SOTA algorithm | GitHub: learning-at-home/hivemind https://t.co/DJsd3kaG8O
- NoLoCo (No-all-reduce Low Communication) | 2025 | Optimizer / Data-parallel | Data-parallel (inner–outer; pairwise averaging; no all-reduce) | Internet-scale; sync step ~10× faster than DiLoCo (few hundred accelerators) | 125M–6.8B params; wide accelerator counts | Outer Nesterov w/ pairwise weight averaging; inner local AdamW steps | Up to 4% faster vs DiLoCo at same loss; lower comm overhead | arXiv:2506.10911 https://t.co/hqpwWDVEiU
- DiLoCo | 2023 | Optimizer / Data-parallel | Data-parallel (inner–outer; infrequent global sync) | Geo-distributed; not explicitly specified | 8 workers; language modeling on C4; extended in later work | Outer Nesterov every ~K steps; inner local AdamW | Matches fully synchronous while communicating ~500× less (8 workers) | arXiv:2311.08105 https://t.co/F8SwDRUpJs
- RL Swarm (Collaborative P2P RL) | 2025 | Reinforcement learning / P2P | Peer-to-peer post-training (answer→critique→resolve/vote); GRPO-based | Consumer hardware to cloud; internet P2P | Multiple LLM agents; open network | Gossip sharing of rollouts/feedback; decentralized voting | Faster learning than solo agents on showcased tasks | Gensyn RL Swarm (GitHub & blog) https://t.co/FDg7pS2pSp

See at the link below: https://t.co/sM1Ik400VS
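
The "pairwise averaging; no all-reduce" idea listed for NoLoCo can be illustrated in isolation. The sketch below is an assumption-laden toy, not NoLoCo itself: it leaves out the inner AdamW steps and the outer Nesterov update entirely and only shows that repeatedly averaging random pairs of replicas pulls all of them toward the global mean without any global collective. The helper names (`gossip_round`, `spread`, `mean_of`) are hypothetical.

```python
import random

def gossip_round(replicas, rng):
    """Randomly pair replicas; each pair replaces its weights with the pair average.
    With an odd replica count, one replica sits the round out."""
    n = len(replicas)
    order = list(range(n))
    rng.shuffle(order)
    updated = [list(r) for r in replicas]
    for i in range(0, n - 1, 2):
        a, b = order[i], order[i + 1]
        avg = [0.5 * (x + y) for x, y in zip(replicas[a], replicas[b])]
        updated[a], updated[b] = avg, list(avg)
    return updated

def mean_of(replicas):
    dim = len(replicas[0])
    return [sum(r[j] for r in replicas) / len(replicas) for j in range(dim)]

def spread(replicas):
    """Largest per-coordinate distance of any replica from the global mean."""
    mean = mean_of(replicas)
    return max(abs(r[j] - mean[j]) for r in replicas for j in range(len(mean)))

rng = random.Random(0)
replicas = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
initial_mean = mean_of(replicas)
initial_spread = spread(replicas)

for _ in range(40):
    replicas = gossip_round(replicas, rng)

final_mean = mean_of(replicas)
final_spread = spread(replicas)
print("spread before:", initial_spread, "after:", final_spread)
```

Each round costs one point-to-point exchange per replica, and pair averaging leaves the global mean unchanged, so the replicas agree on the same point an all-reduce would have produced, just more gradually.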

205
