I had this idea for a 4-player dynamic splitscreen setup.
Not sure if it's better than a traditional splitscreen setup, but I like the fact that you can see the relative locations of other players visually.
Might write a blog post about this!
#godot #gamedev
@zackslab I've seen the runaway FET thing before, and it can be insidious because it depends on active cooling (fan failure), inlet cooling air temperature, and whether someone blocked the vents with dust or a bookshelf or something. Everything can be fine until suddenly it's not.
Somebody just ran a one-trillion-parameter model (Kimi K2.5) at over 4 tokens/sec on a single RTX 3060 12GB GPU plus 768GB of second-hand Intel Optane memory.
What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the most time-sensitive organs.
i.e. the bulk of the sparse expert weights live in a larger, cheaper memory tier and are pulled into the computation as needed.
This worked because Kimi K2.5 is a Mixture-of-Experts model, so it has 1T total parameters but activates only 32B per token.
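To put the sparsity in numbers, here is a quick back-of-the-envelope sketch using only the two figures quoted above (1T total, 32B active). The function name is made up for illustration.

```python
# Figures taken from the post: 1T total parameters, 32B active per token.
TOTAL_PARAMS = 1.0e12
ACTIVE_PARAMS = 32e9


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights that a single token actually touches."""
    return active / total


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"{frac:.1%} of the weights are used per token")
```

So only a few percent of the weights are needed for any one token, which is why parking the rest in a slower tier is tolerable.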
The RTX 3060’s 12GB VRAM holds latency-sensitive parts like routing, attention, dense layers, and shared experts.
The huge expert weights sit in Optane PMem, configured as RAM, while 192GB DDR4 ECC acts as cache.
He is using 6 Optane PMem (DCPMM) sticks. This retired memory format was made to bridge DRAM and SSD performance. The 768GB Optane configuration, using 6x128GB modules, does beat the best NVMe SSDs on latency by a wide margin, but remains 2x to 3x slower than DRAM.
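A rough capacity check, as a sketch only: the module count and size come from the post, but the bits-per-weight figure is an assumption of mine (the post does not state the quantization), so treat the footprint as illustrative.

```python
# From the post: 6 modules of 128GB each, and a 1T-parameter model.
MODULES = 6
MODULE_GB = 128
TOTAL_PARAMS = 1.0e12

# ASSUMPTION: 4 bits per weight. The post does not say which quantization was used.
ASSUMED_BITS_PER_WEIGHT = 4


def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata and KV cache."""
    return params * bits_per_weight / 8 / 1e9


pool_gb = MODULES * MODULE_GB
weights_gb = footprint_gb(TOTAL_PARAMS, ASSUMED_BITS_PER_WEIGHT)
print(f"Optane pool: {pool_gb} GB, weights at {ASSUMED_BITS_PER_WEIGHT}-bit: {weights_gb:.0f} GB")
print("fits" if weights_gb <= pool_gb else "does not fit")
```

Under that assumed quantization the whole model fits in the Optane pool with room to spare. At 8 bits per weight it would not.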
llama.cpp handled hybrid GPU/CPU inference, with tensor placement tuned through flags like override-tensor.
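The placement idea can be sketched as a toy name-matching rule: anything that looks like a routed-expert tensor goes to host memory, everything else stays on the GPU. This is not llama.cpp's implementation, and the pattern, device labels and tensor names below are illustrative assumptions rather than the exact setup from the post.

```python
import re

# Hypothetical override rules as (regex, device) pairs. The first match wins.
OVERRIDES = [
    (re.compile(r"ffn_.*_exps"), "CPU"),  # routed expert weights -> host memory tier
]
DEFAULT_DEVICE = "GPU"  # routing, attention, dense layers, shared experts


def place(tensor_name: str) -> str:
    """Return the device for a tensor, based on the first matching override."""
    for pattern, device in OVERRIDES:
        if pattern.search(tensor_name):
            return device
    return DEFAULT_DEVICE


# Illustrative tensor names (assumed, not read from a real model file).
names = [
    "blk.3.attn_q.weight",        # attention
    "blk.3.ffn_gate_inp.weight",  # router
    "blk.3.ffn_up_shexp.weight",  # shared expert
    "blk.3.ffn_up_exps.weight",   # routed experts
]
for name in names:
    print(f"{name} -> {place(name)}")
```

Only the last name matches the expert pattern, so it is the only one sent off the GPU.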
The result was roughly 4 tokens/sec, which is slow for chat but impressive for a local 1T-parameter model on cheap retired enterprise hardware.
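For a feel of what 4 tokens/sec means in memory traffic, here is a crude upper bound. It assumes 4 bits per weight (my assumption, the post does not give the quantization) and pretends every active weight is re-read from memory on every token, which overstates things because some tensors sit in VRAM and the DDR4 cache absorbs part of the load.

```python
# From the post: 32B active parameters per token, roughly 4 tokens/sec.
ACTIVE_PARAMS = 32e9
TOKENS_PER_SEC = 4.0

# ASSUMPTION: 4 bits per weight, not stated in the post.
ASSUMED_BITS_PER_WEIGHT = 4


def weight_traffic_gb_per_s(active: float, bits_per_weight: float, tok_per_s: float) -> float:
    """Upper bound on weight bytes read per second, in GB/s (10^9 bytes)."""
    gb_per_token = active * bits_per_weight / 8 / 1e9
    return gb_per_token * tok_per_s


traffic = weight_traffic_gb_per_s(ACTIVE_PARAMS, ASSUMED_BITS_PER_WEIGHT, TOKENS_PER_SEC)
print(f"at most about {traffic:.0f} GB/s of weight reads")
```

The real number depends on the actual quantization and on how many of the active weights are already resident on the GPU or in DRAM.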
The DDR4 acted as cache, the Optane acted as a giant memory pool, and llama.cpp pushed routing and other critical tensors onto the 12GB GPU.
I've found the most optically annoying and the least serdes/dsp demanding optical networking architecture: optical serializer with a mode-locked laser (OTDM) so that you can have wide and slow signals coming in, and dual-comb-like homodyne detection with the same comb source as LO and same set of delays so that the detector and TIA bandwidths can also be small.
You just need to hold path length drift/pulse overlap to be sub 10s of fs, and the optical phase, the branch skew, pulse shape..
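To see how tight that is, timing error converts to path length through the speed of light. A small sketch, taking 10 fs as a representative budget and an assumed group index of about 1.468 for silica fiber (my number, not from the post):

```python
C_M_PER_S = 299_792_458.0  # speed of light in vacuum (exact by definition)


def path_tolerance_um(timing_fs: float, group_index: float = 1.0) -> float:
    """Path length, in micrometres, that corresponds to a given timing error."""
    seconds = timing_fs * 1e-15
    metres = C_M_PER_S / group_index * seconds
    return metres * 1e6


# ASSUMPTION: group index ~1.468, a typical value for silica fiber near 1550 nm.
ASSUMED_FIBER_GROUP_INDEX = 1.468

print(f"10 fs in vacuum: {path_tolerance_um(10.0):.2f} um")
print(f"10 fs in fiber:  {path_tolerance_um(10.0, ASSUMED_FIBER_GROUP_INDEX):.2f} um")
```

So a 10 fs budget means holding every branch to a few micrometres of optical path, before even getting to phase.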
Happy World #MetrologyDay 🥳 Here's a sub-ppm 20kA direct current comparator by Measurements International in the 90s based on a design by Kusters from the 60s. Single primary winding, matched cable lengths for uniform magnetic field, new readout electronics developed at CERN