After reading Huawei's paper, my takeaway:
Huawei's innovation is fundamentally about using high design complexity + high manufacturing cost + bleeding-edge thermal solutions to partially close the process gap.
Huawei's Tau Scaling Law marketing is just another way of saying More than Moore: the generalized Moore's Law that everyone already knows.
What I'm actually interested in is whether Huawei's claimed density improvement brings real power efficiency gains. How is the 41% power efficiency number on their PPT actually achieved?
---------------------
1. Is Huawei's "equivalent density improvement" from chip stacking real or misleading? Is it a process breakthrough? Are there tangible benefits?
The source of the equivalent density improvement is two pieces of silicon bonded together using hybrid bonding technology. In theory, the projected footprint could be cut in half. But the first generation is not full-chip double-layer folding; it selectively folds critical logic paths. Only about 53% of the chip area is actually folded (density goes from 155 to 238 MTr/mm²). In subsequent generations, folding coverage will gradually increase, approaching full-chip folding by 2030 (density 155 → 292 MTr/mm²).
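A quick sanity check on those density figures. This is a minimal sketch that assumes the simple reading used above, where the folded fraction of the chip equals the fractional density gain over the 155 MTr/mm² baseline; the function name is mine, not the paper's, and a stricter footprint model would give a different coverage figure.

```python
# Densities quoted in the post, in MTr/mm^2.
BASE_DENSITY = 155.0

def implied_coverage(density, base=BASE_DENSITY):
    """Folded fraction, ASSUMING coverage == fractional density gain."""
    return density / base - 1.0

for year, density in [(2026, 238.0), (2030, 292.0)]:
    ratio = density / BASE_DENSITY
    print(f"{year}: {ratio:.3f}x density, "
          f"implied coverage {implied_coverage(density):.1%}")
```

Under that reading, 238 corresponds to roughly 53-54% coverage and 292 to roughly 88%, which is what "approaching full-chip folding" suggests.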
The 2026 first-generation equivalent density jumps from 155 MTr/mm² in 2025 to 238 MTr/mm² in 2026. Clock frequency also improves by 12.7%, and power efficiency improves by 41%. On the surface, this looks no different from a process node advancement. But there's one critical difference: Huawei never once mentions leakage power. As long as the process node doesn't change, I_off, gate leakage, and junction leakage won't improve from 3D stacking alone.
The density discontinuity from 2030 to 2031 most likely comes from moving from 2-layer to 3-layer stacking, just as the 2025-to-2026 density and clock frequency discontinuity comes from going from single-layer to 2-layer folding.
So clearly, the "1.4nm equivalent by 2031" claim has no connection to an actual process node breakthrough.
What it really is: using high design complexity + high cost + bleeding-edge thermal management + early deployment of advanced packaging to partially compensate for the process gap.
---------------------
So is this seemingly inflated equivalent density improvement actually useful? What are the real benefits?
Yes, there are real benefits. The topological folding means that signal paths that previously ran several millimeters horizontally now become tens of micrometers vertically. This shortens super buffers and buses, reduces clock tree depth (clock depth -42%, clock wire -28%), and improves clock skew (-25%). These translate into genuine dynamic power savings. The shortening of critical paths also makes clock frequency increases easier to achieve.
So the performance improvement shown on the PPT roadmap, the 12.7% gain from 2025 to 2026, comes almost entirely from the clock frequency increase, which is also 12.7%.
The benefits are fundamentally from topology-driven circuit design improvements.
---------------------
Since there's no actual process improvement, what are the trade-off costs of Huawei's chip stacking?
Three costs: thermal management gets harder, design complexity goes up, and manufacturing cost increases.
The biggest cost is the simultaneous increase in thermal density. In theory, logic-on-logic stacking means the hottest CPU execution areas have their power density effectively doubled. But factoring in the 41% power efficiency improvement, the actual power density is only about 40-50% higher than the non-stacked design. That's why the first generation can only fold the most critical portions, roughly 53% of the chip area.
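The 40-50% figure can be reproduced with a back-of-the-envelope model. This sketch assumes a fully folded block: the same logic sits on 1/layers of the footprint, and its total power shrinks according to the claimed efficiency gain at iso-performance. Both helper names are hypothetical.

```python
def power_ratio_from_efficiency(efficiency_gain):
    """Power at iso-performance relative to baseline (0.41 means +41%)."""
    return 1.0 / (1.0 + efficiency_gain)

def power_density_factor(layers, power_ratio):
    """Power per unit footprint relative to the unfolded design.

    ASSUMPTION: the block is fully folded onto 1/layers of its area.
    """
    return layers * power_ratio

no_gain = power_density_factor(2, 1.0)
with_gain = power_density_factor(2, power_ratio_from_efficiency(0.41))
print(f"2 layers, no efficiency gain: {no_gain:.2f}x")
print(f"2 layers, +41% efficiency:    {with_gain:.2f}x")
```

With the 41% number the factor lands near 1.4x, the low end of the 40-50% range. Partially folded regions would sit between 1x and that value.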
This forces thermal technology to advance ahead of schedule. Huawei is deploying millimeter-scale MEMS fans as micro-cooling solutions.
The second cost is design complexity. Which logic blocks can be folded? Once folded, the entire flow from front-end to back-end design has to be reworked.
No existing EDA tools support 3D topology. The paper itself admits that full-scale LogicFolding requires an entirely new 3D-native EDA toolchain that treats multi-layer stacked dies as a single continuous design entity. Which logic can be folded, how to perform inter-die timing closure, and physical design (PD) are all significant challenges.
Manufacturing cost also goes up significantly, because advanced packaging has to be deployed early. Hybrid bonding at 1.5-2μm combined with logic-on-logic is extremely challenging and substantially more expensive. Previously, a chip came out of a single wafer's process flow. Now two wafers are patterned separately and then bonded, which adds hybrid bonding overlay control (the paper requires <0.5μm), TSVs, keep-out zones (KOZ), redundancy and repair, and multiplicative yield losses. Per-chip manufacturing and testing costs increase significantly.
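To see why "multiplicative yield losses" hurt, here is a minimal sketch. It assumes the dies are bonded without prior known-good-die screening, so a defect in either layer or in the bond kills the stack. All yield numbers are hypothetical placeholders, not figures from the paper.

```python
def stacked_yield(die_yields, bond_yield):
    """Yield of a bonded stack, ASSUMING independent failures and no
    known-good-die screening before bonding."""
    result = bond_yield
    for y in die_yields:
        result *= y
    return result

# Hypothetical illustration values only.
single = 0.80
two_layer = stacked_yield([0.80, 0.80], bond_yield=0.95)
three_layer = stacked_yield([0.80, 0.80, 0.80], bond_yield=0.95 * 0.95)
print(f"single die:  {single:.1%}")
print(f"two-layer:   {two_layer:.1%}")
print(f"three-layer: {three_layer:.1%}")
```

Each stacked die is smaller than the unfolded die and would yield somewhat better on its own, which this sketch ignores. The point is only that every added layer and bond multiplies in another factor below 1.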
---------------------
2. What is Tau Scaling actually scaling? Is this a one-time design topology dividend? What's the potential? Where does continued improvement come from?
The core claim of τ Scaling is: replace geometric linewidth with the time constant τ as the full-stack optimization target, compressing characteristic delay across four levels: device, circuit, chip, and system.
The formula itself contains no new physics. "Focus on the bottleneck delay" is what every architect already does. The entire industry knows that interconnect RC is the delay bottleneck. TSMC has been using low-k dielectrics, semi-damascene, and other techniques to reduce RC with every process generation. Packaging a universally known optimization direction as a "law" is clearly a marketing move. It's essentially just another way of saying More than Moore, the generalized Moore's Law.
Setting the marketing aside, Huawei's claimed RC delay improvement is fundamentally about topology distances shrinking after chip stacking, which reduces the effective RC along those paths โ not the RC process constants themselves.
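The distinction between path length and process constants can be illustrated with the textbook Elmore delay of an unbuffered distributed RC wire, 0.5 * r * c * L^2. This is only a sketch: the per-micrometer resistance and capacitance below are hypothetical placeholders, and the lengths are my stand-ins for "several millimeters" versus "tens of micrometers".

```python
def elmore_wire_delay(r_per_um, c_per_um, length_um):
    """Elmore delay (seconds) of an unbuffered distributed RC wire."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2

# Hypothetical process constants, held FIXED: only the length changes.
R_PER_UM = 1.0        # ohm per micrometer (assumed)
C_PER_UM = 0.2e-15    # farad per micrometer (assumed)

horizontal = elmore_wire_delay(R_PER_UM, C_PER_UM, 2000.0)  # 2 mm route
vertical = elmore_wire_delay(R_PER_UM, C_PER_UM, 50.0)      # 50 um hop
ratio = horizontal / vertical
print(f"delay ratio at fixed r and c: {ratio:.0f}x")
```

Real long routes are buffered, so their delay grows closer to linearly with length, and a vertical hop adds bond-pad parasitics that are not modeled here. The direction of the effect is the same: the gain comes from L, not from r or c.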
As for what "scaling" means here, it refers to a sustainable roadmap of continued improvement. The path is: increasing the number of stacked layers across the full chip, from 2-layer stacking in 2026-2030, to 3-layer stacking starting around 2031, and potentially 4-layer stacking further out.
The first-generation folding technology isn't even full-chip double-layer folding. It selectively folds critical logic, with only about 53% of the chip area folded (density 155 → 238 MTr/mm²). In subsequent generations, folding coverage will gradually increase, approaching full-chip folding by 2030 (density 155 → 292 MTr/mm²). The reason the 2031 roadmap shows a density discontinuity is precisely because that's the transition point from 2-layer to 3-layer folding.
But note that the marginal returns of this scaling approach diminish with each layer. Folding from 1 to 2 layers yields up to a 100% density gain. Going from 2 to 3 layers yields at most 50%. If they go from 3 to 4 layers around 2035, the gain is at most 33%.
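The diminishing returns follow directly from the layer ratio. A minimal sketch, assuming full-chip folding at every step so that density is proportional to the layer count:

```python
def marginal_gain(layers_before, layers_after):
    """Upper bound on density gain when adding layers, ASSUMING
    full-chip folding (density proportional to layer count)."""
    return layers_after / layers_before - 1.0

for n in (1, 2, 3):
    print(f"{n} -> {n + 1} layers: up to {marginal_gain(n, n + 1):.0%}")
```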
And as the number of stacked layers increases, all three challenges mentioned above (thermal, design complexity, and cost) get progressively worse.
---------------------
3. Is Huawei's chip stacking the same hybrid bonding technology that TSMC/AMD already have? Is it cache-on-logic, cache-on-cache, or logic-on-logic? How is the thermal problem of logic-on-logic solved?
Yes, it uses existing technology. But it's also true that Huawei has pushed certain metrics to industry-leading levels. 3D stacking itself is not new, but TSMC's production hybrid bonding is still at 6μm pitch. Huawei's paper states that Kirin 2026 uses a 1.5μm hybrid bonding pitch.
When I first saw the stacking news, my initial reaction was to suspect it was similar to AMD's 3D V-Cache, which primarily stacks SRAM cache on top of the existing L3 cache area and typically avoids stacking directly above the hottest CPU execution logic, precisely to avoid thermal issues. SRAM has different power density and thermal characteristics than high-activity logic. If you stack the hottest logic-on-logic, thermal management becomes extremely difficult.
But after seeing more data (clock buffer -56%, clock depth -42%, clock wire -28%), it became clear that these numbers are only possible if the core's internal clock distribution has been restructured. Pure SRAM stacking wouldn't touch the core's internal clock tree. Furthermore, cache-on-cache alone would likely not require a dedicated MEMS micro-fan for additional cooling. The evidence overwhelmingly points to a logic-on-logic approach.
The elegant aspect of Huawei's approach is that logic-on-logic folding doesn't actually double the thermal density, because the topological benefits reduce power consumption by about 30%. This means the thermal density only increases by about 40-50%.
And the first generation doesn't fully stack 100% of the hottest execution logic. The paper explicitly says it's "selectively applied along key critical paths." Only about 53% of the area, chosen along critical paths, is stacked, and the granularity may not even be that fine; it could be IP-block-on-IP-block stacking. In that case, the actual thermal density increase might stay within 20%.
But as this path continues forward, bleeding-edge thermal solutions become inevitable. Right now it's millimeter-scale MEMS fans placed directly against the processor for high thermal conductivity. Like Huawei's phones in general, the thermal engineering is aggressive and leads the industry.
Going forward, they may need to bring HBM7/8's microchannel cooling technology forward in time. After all, HBM7/8 will feature 24+ layer stacking, and Huawei may well need to deploy next-generation thermal solutions ahead of schedule.
---------------------
4. From an architecture perspective, the most important question: How is Huawei's 41% power efficiency improvement actually achieved? Why didn't AMD's 3D V-Cache produce a similar improvement?
First, let's pin down what 41% means. The paper only says "SoC performance-core power efficiency improved by 41%," without specifying the benchmark, voltage/frequency operating point, temperature conditions, or power boundary. But there's a key clue on the PPT roadmap: the ISO-Power Performance numbers are 2.75 for 2025 and 3.1 for 2026, a 12.7% improvement.
This matches the clock frequency increase of 12.7% exactly. This can be interpreted as: the performance improvement at constant power is 12.7%, and it's almost entirely from clock frequency.
My hypothesis for how the power efficiency improvement works: LogicFolding shortens the critical path → at a fixed Vdd, Fmax increases from 2.75GHz to 3.1GHz → at the original 2.75GHz there is therefore about 12.7% frequency headroom (roughly 11% of the cycle time as slack) → this headroom can be traded for lower Vdd in iso-performance mode.
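The headroom in that chain can be stated two ways, and they are not the same number. A small sketch, taking the 2.75 and 3.1 roadmap figures as GHz the way the hypothesis does (that reading is itself an assumption):

```python
f_old_ghz, f_new_ghz = 2.75, 3.1  # roadmap figures, read as Fmax in GHz

# Headroom expressed as extra frequency available at fixed Vdd.
freq_gain = f_new_ghz / f_old_ghz - 1.0

# Headroom expressed as unused cycle time when still running at f_old.
cycle_slack = 1.0 - f_old_ghz / f_new_ghz

print(f"frequency headroom: {freq_gain:.1%}")
print(f"cycle-time slack:   {cycle_slack:.1%}")
```

How much Vdd that slack buys depends on the shape of the V/F curve at the operating point, which the paper does not give.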
Additional power efficiency gains likely come from reduced cache hit latency after circuit folding. From industry experience, a 10% reduction in L2/L3 cache hit latency typically translates to at least a 5% overall CPU performance improvement.
The PPT shows SRAM latency reduced by 30%, and some portion of that likely translates into lower cache hit latency.
AMD's 3D V-Cache didn't produce a comparable improvement mainly because AMD's underlying logic die wasn't redesigned. The 3D cache latency actually increased rather than decreased; it only added cache capacity. The benefit of increased capacity is less dramatic than the benefit of reduced latency.
On the other hand, the reduction in clock skew and shortening of critical paths improve circuit timing, meaning Huawei can use a lower Vdd (I estimate possibly 7-8% lower). Combined with the RC reduction from shorter paths (considering clock buffer -56%, wire -28%, SRAM pJ/bit -24%, a C_eff reduction of 10-15% seems reasonable), plus the overall shrinking of the clock tree, it's entirely possible to achieve a roughly 30% power reduction at iso-performance on certain voltage/frequency operating points. A power reduction of 29-30% corresponds to a power efficiency improvement of about 41-43% (1/0.71 ≈ 1.41, 1/0.70 ≈ 1.43), which lines up with the 41% claim.
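Here is the arithmetic behind that estimate as a rough sketch. It uses the standard dynamic power relation P ∝ C_eff · Vdd² · f at iso-frequency and ignores leakage entirely. The Vdd and C_eff ranges are my estimates from above, not numbers from the paper.

```python
def dynamic_power_ratio(vdd_drop, ceff_drop, freq_ratio=1.0):
    """Dynamic power relative to baseline, from P ~ C_eff * Vdd^2 * f."""
    return (1.0 - ceff_drop) * (1.0 - vdd_drop) ** 2 * freq_ratio

def efficiency_gain(power_ratio):
    """Perf-per-watt gain at iso-performance for a given power ratio."""
    return 1.0 / power_ratio - 1.0

conservative = dynamic_power_ratio(vdd_drop=0.07, ceff_drop=0.10)
aggressive = dynamic_power_ratio(vdd_drop=0.08, ceff_drop=0.15)
print(f"power reduction, conservative: {1 - conservative:.1%}")
print(f"power reduction, aggressive:   {1 - aggressive:.1%}")
print(f"efficiency gain at 30% lower power: {efficiency_gain(0.70):.1%}")
print(f"power ratio needed for +41%:        {1 / 1.41:.3f}")
```

Vdd and C_eff alone cover roughly 22-28% of the reduction under these assumptions; the remainder would have to come from the smaller clock tree and the other topology effects.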
For comparison, Apple and Qualcomm typically see iso-power single-core performance improvements of 10-20% per generation, and iso-performance power reductions of 30-40%. This is determined by the shape of the V/F curve. So empirically, the numbers check out.
The power efficiency improvement, based on the available data, can be derived from topology changes and appears to be plausible. It may genuinely have little to do with the process node.
---------------------
5. Is this technology path replicable? Will others follow?
In the short term, no one will replicate this at scale, because the risk-reward ratio doesn't justify it. In the long term, everyone is heading in this direction, just under different names.
Huawei's fundamental motivation for LogicFolding is the sanctions. With the process node capped at 7nm, the only option is to compensate through packaging and design. Huawei has paid a significant price for this: thermal engineering costs, design complexity, and higher manufacturing costs (including yield). This is a path born of necessity, not natural choice.
Other players who can access TSMC can achieve normal economic iteration without taking on the risk of prematurely pushing thermal technology and design complexity forward.
In the long term, Intel's Foveros, TSMC's SoIC, and AMD's MI300 3D stacking are all heading in the same direction. If the economics of chasing the most advanced nodes continue to deteriorate, then "fix a mature node + 3D topology optimization" will become increasingly attractive.
On the thermal front, MEMS micro-fans and microchannels are likely to become mainstream for future HBM cooling as well.
---------------------
In summary, Huawei's innovation here deserves genuine respect. Under sanctions, they've used extreme design complexity and cost to boldly redesign on a locked process node, extracting a significant one-time topology dividend, though it has a ceiling. The marginal returns diminish with each additional layer (1 → 2, 2 → 3, 3 → 4 layers yield progressively smaller percentage gains). Leakage remains unsolved. Thermal management gets harder. The 3D EDA toolchain is an entirely new challenge.
Tau Scaling is not a path that sustains exponential growth for a decade. Each step up the staircase is harder to climb, and each step is shorter than the last. If Huawei wants to continue closing the gap, they'll need to find other routes beyond this one.