$INTC can increase as much SRAM density. Groq can build inference chips with $NVDA. But can they force undo the enshittification of the software stack ? How far can hardware efficiency go to cover for the the inefficiencies of the software bloat?
@mu_chrinovic > uses O3 compiler
> Less performance than hand typed assembly
> Mfw O3 is tip of the iceberg. Compiler flag mining is harder than hand typed assembly
> Pic related
Velocity = Momentum / Mass
I am learning about hyperdimensional computational models or sumthing and these analogies keep cropping up. Strange but interesting.
Whitepill for midwits :
Economically speaking, high agency midwits will outperform low agency philosophers any day.
You may not be at the top of the class according to your grades or your teachers. But you can apply just a moderate amount of reasoning. combine it with tons of agency. And you should see your upward mobility happening in real time.
Economy rewards directional action, not deep thought with no action. Stop overthinking start acting. Iterate and improve.
In picture below : Four volume set of Intel Architecture software development manual.
Total size : 600+ 2576 + 596 + 1646 = 5418 pages !!
Intel prints new chips every year that comply with the instruction set in these thousands of pages.
If Intel published instruction-level performance data for the hundreds of chips it has printed over the last decade, there would be hundreds of thousands of pages of information.
The effort to publish that data is massive. To make it meaningful, it has to be matched with the effort from clients and end users. No client/customer has the patience to absorb the ever changing architectural performance data at that scale every year.
Every time a new chip is launched, the interested community works by benchmarking performance data for only the new instruction feature additions or a few hundred classic (ALU/MOV) instructions.
Not trivial, not great ROI.
This is more about incentives and ROI. Perhaps the incentives will change and the ROIs will improve some time in the future when AI will auto analyze new chips, publish systematic reports. And AIs will consume them to find optimal instruction sequences. But for humans, this is a waste of time. Why spend 6 months on finding the sequence, just wait that much time and the performance will improve with the next gen?
How aware are modern compilers of exact microarchitectural layouts?
Quite a lot…in one very specific way.
Intel x86 is *not* the same as AMD x86.
Sure…it’s the “same ISA” in the broadest sense, but the individual instructions often take different numbers of cycles.
You want to stall as little as possible, so the order your instructions are arranged is somewhat important. Technically, something like “AMD Zen 3” ordering on an Intel Skylake is sub-optimal.
If you look at LLVM, you’ll notice these X86Sched*.td (TableGen) files. Just about every x86 CPU generation has their own version.
It defines things like ROB Size, issue width, and misprediction penalties (in terms of cycles). What’s fascinating is how much of this is “guessed” (reverse-engineered) vs “revealed” by the hardware vendors.
From what I understand, Intel/AMD/etc will *sometimes* lend a helping hand / give some hints…but less than you’d expect.
It’s a very weird situation when you think about it. I assume vendors are tight-lipped about exact latencies for competitive secrecy…yet those are the exact things you’d need to know to extract the most performance out of your compiler!
If anyone knows other reasons for the secrecy, I’d love to hear it!
If you ever benchmarked recent chips originating from this part of the world, you will know that, sometimes, they absolutely MOG their American/European counterparts.
Even if they look half as great on the scorecards now, wait 6 months and check back.
These can absolutely transform the GPU supply chain
Huawei has begun shipping their 910c GPU with 128GB of HBM memory.
The sudden release of these chips came as a shock as China accounts for approximately 40% of $NVDA revenue once smuggling and black shipments are taken into account.
This chips are racked in the Cloud Matrix with 384 GPUs meshed together with optical network.
The Huawei Matrix is about 66% faster than $NVDA flagship GB200, consumes 2.5x the power but is designed specifically for China's ultra low cost energy sector.
Now that the Huawei 910C is in full production, the fear is China will cut off Nvidia chips entirely.
A simple idea to work under uncertainty (business operation, HFT, coding with LLMs, agents, etc) is to break a long uncertain task into shorter higher-certainty tasks. As you reduce the problem into smaller and smaller subproblems, the uncertainty (chances of error) per task reduces. By chaining higher certainty tasks, you can get higher overall success rates.
The choice is essentially between a long task with 67% success rate versus 10 smaller tasks each with an independent success rate of 99%.
P(failure | one long task ) = 67%
P (failure | sequence of short tasks ) = 0.99^10 = 90.4%
This is an understand that most production grade engineers follow. This implicitly separates robust engineering from vibe coding. If you want to build robust systems this is one way to think about it
At least in the UK, computer architecture is an ignored subject at CS undergrad level. Most students align themselves to what they *think* the companies want - AI/ML, fintech - while ignoring the real fundamentals. Computer architecture is ignored, because it is not at all catchy or attractive.
Is this right or wrong? Who cares. The job market will price it in.