What happens if your CPU gets something wrong? If it wakes up one day and decides 2+2=5?
Well, most of us will never have to worry about that. But if you work at a company the size of Google, you do, which is why this paper on "mercurial cores" is so fascinating.
What the authors report--and supposedly this is common knowledge at the hyperscalers--is that a couple cores per several thousand machines are "mercurial." Due to subtle manufacturing defects or old age, they give wrong answers for certain instructions. These can cause all sorts of impossible-to-diagnose issues. Some rare problems at Google that were traced back to bad CPUs include:
- Mutexes not working, causing application crashes
- Silent data corruption
- Garbage collectors targeting live memory, causing application crashes
- Kernel state corruption causing kernel panics
What makes CPUs go bad? It's very hard to tell. The authors posit that issues are becoming more frequent as CPUs get more complex, but there aren't solid numbers behind that. There are certainly strong relationships between frequency, temperature, voltage, and bad CPU behavior--most mercurial CPUs only cause problems under very specific conditions, but those conditions vary from CPU to CPU. Age is another factor: older CPUs are more likely to misbehave.
Bad CPUs are an especially serious problem because they're very hard to detect. If cosmic rays flip bits in storage or on the network, that can be detected through error coding. But there's no analogous mechanism for a CPU that allows cheap online verification of its correctness. Instead, the best detection techniques involve monitoring for symptoms. If a core exhibits exceptionally high rates of process crashes or kernel panics relative to its fellows, that's a strong indication something is wrong with it. For the most critical applications, the authors propose triple modular redundancy--running each computation on three cores and majority-voting a reliable result.
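To make those two ideas concrete, here's a rough Python sketch, not the paper's implementation. The names `majority_vote`, `run_tmr`, and `suspect_cores` are made up for illustration, as are the `factor` and `min_crashes` thresholds. Plain callables stand in for "the same computation pinned to three different cores," and results are assumed to be hashable.

```python
from collections import Counter
from statistics import median


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among replica results: {results!r}")
    return value


def run_tmr(replicas, *args):
    """Run one computation on three replicas and majority-vote the answer.

    In a real system each replica would run on a distinct physical core;
    here they are just callables (an assumption for illustration).
    """
    if len(replicas) != 3:
        raise ValueError("triple modular redundancy needs exactly three replicas")
    return majority_vote([replica(*args) for replica in replicas])


def suspect_cores(crash_counts, factor=10.0, min_crashes=5):
    """Flag cores whose crash count far exceeds the fleet median.

    `factor` and `min_crashes` are arbitrary illustrative thresholds.
    """
    baseline = max(median(crash_counts.values()), 1)
    return sorted(
        core
        for core, crashes in crash_counts.items()
        if crashes >= min_crashes and crashes > factor * baseline
    )


def good_add(a, b):
    return a + b


def mercurial_add(a, b):
    # Simulates a core that gets the answer wrong.
    return a + b + 1


print(run_tmr([good_add, mercurial_add, good_add], 2, 2))
print(suspect_cores({"core0": 1, "core1": 0, "core2": 2, "core3": 40}))
```

The vote only masks a single bad replica: if two of the three disagree with each other and the third, `majority_vote` raises rather than guessing. The 3x compute cost is why this only makes sense for the most critical work.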
More than anything, this paper is a call to action--letting everyone know that CPUs can fail. So now, if you ever find a bug you can't diagnose, you can blame the CPU! 🙂
Somehow adding a GPU to my PowerEdge caused it not to want to boot anymore ... the iDRAC works but I apparently forgot the password, and the one monitor I have with VGA also apparently doesn't work anymore...
What isn't cool is that on the UniFi UDM Pro, you have to enable the legacy interface, fiddle with putting in a PSK, setting to WPA Personal, enabling WPA3 support, saving, then removing the PSK and setting to Open for WPA3 OWE to actually work.
Does the iPhone 15 support Wi-Fi 6E, or more specifically, OWE? I set up my guest network to enable OWE (https://t.co/nkI5diJuck) ... my laptop is able to connect and turn on "Enhanced Open" security, but my iPhone 14 just stays on the open network.
Has anyone migrated from Google Workspace to iCloud+? I had Google Apps for free for a long time, last year got forced into Workspace, and now my accounts are $14.50/mo/user ...