Reducing variation in manufacturing, monitoring behavior over time, and targeting specific workloads can have a big impact on power, performance, and area/cost.
Key Takeaways:
At 2nm and below, Moore is more, but more is also less.
In theory, the more transistors that can be packed onto a reticle-sized die, the faster a chip can process data and move it back and forth from memory. But theory and reality are diverging.
Historically, the best way to achieve that was to shrink the transistors, the wires, and the memory cells. At 2nm and below, this formula is under stress. The wires are so thin that RC delay becomes a significant challenge. SRAM scaling, which has been a mainstay for caching, is falling way behind scaling for digital logic. That, in turn, limits how much memory can fit on a single reticle-sized die. On top of that, it’s much harder to achieve the same yield in a fab due to process variation, which can creep into hundreds or even thousands of insertion points and dozens of tools used in the manufacturing process.
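The RC delay problem can be illustrated with a first-order model. The sketch below assumes a rectangular wire with resistance R = ρL/(W·H), a fixed capacitance per unit length, and the common distributed-line approximation of RC/2 for delay. The wire dimensions and the capacitance value are hypothetical, and bulk copper resistivity is used even though real nanoscale wires are more resistive than bulk metal, so this understates the effect.

```python
RHO_CU = 1.68e-8  # bulk copper resistivity in ohm*m (thin wires are worse in practice)

def wire_resistance(length_m, width_m, height_m, rho=RHO_CU):
    """Resistance of a rectangular wire: R = rho * L / (W * H)."""
    return rho * length_m / (width_m * height_m)

def elmore_delay(length_m, width_m, height_m, cap_per_m=2e-10):
    """First-order delay of a distributed RC line, approximated as R * C / 2.

    cap_per_m is an assumed capacitance per unit length (0.2 fF per micron).
    """
    r = wire_resistance(length_m, width_m, height_m)
    c = cap_per_m * length_m
    return 0.5 * r * c

# Hypothetical 100-micron wire before and after a 0.7x shrink in width and height.
old = elmore_delay(100e-6, 20e-9, 40e-9)
new = elmore_delay(100e-6, 14e-9, 28e-9)
ratio = new / old
print(f"delay ratio after shrink: {ratio:.2f}")
```

Under these assumptions, a wire of the same length becomes roughly twice as slow after the shrink, because its cross-section is about half the original area.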
In any fab process there is some level of variation, but at 2nm the amount of variation and the causes of that variation are increasing. Bumps don’t always connect fully due to warpage of increasingly thin metal layers and substrates, and dozens of different steps that are needed to ensure a chip will be reliable can weaken or damage fragile interconnects or materials. The machines used to make chips have their own variation, as do the raw materials and wafers. The result is that while there are more transistors and interconnects on a die, there is a higher percentage that are faulty. Costs go up, and yield goes down.
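The link between defects and yield can be sketched with the simplest textbook yield model, the Poisson model, in which yield falls exponentially with die area times defect density. This is only an illustration. Foundries use more elaborate models, and the defect densities below are hypothetical rather than figures from any 2nm process. The reticle area assumes the standard 26mm x 33mm exposure field.

```python
import math

RETICLE_AREA_CM2 = 2.6 * 3.3  # 26mm x 33mm exposure field, in cm^2

def poisson_yield(die_area_cm2, defect_density_per_cm2):
    """Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defect_density_per_cm2)

# Hypothetical defect densities (defects per cm^2), for illustration only.
for d0 in (0.05, 0.10, 0.20):
    full = poisson_yield(RETICLE_AREA_CM2, d0)
    small = poisson_yield(1.0, d0)
    print(f"D0={d0:.2f}: reticle-sized die {full:.1%}, 1 cm^2 die {small:.1%}")
```

The model shows why large dies are hit hardest when variation raises the effective defect density: the same increase in D0 costs a reticle-sized die far more yield than a small one.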
“The expectation is that because of the shrinking of dimensions you get faster performance, lower power consumption, and higher transistor density,” said Abhijeet Chakraborty, vice president of engineering at Synopsys. “The challenge is whether you can attain that promise. Can you achieve 10% to 15% performance, and 20% to 30% lower power consumption? The lower power consumption is very compelling for a lot of applications that care about performance per watt and higher transistor density. But can you achieve that? Therein lie a lot of challenges and considerations. There are real-world challenges that lead to yield and manufacturability.”
Manufacturing processes are works in progress. They improve over time, allowing foundries to relax the restrictive design rules that add margin to every leading-edge process, and EDA and equipment vendors to develop rules and exceptions for every process. Initially, just getting these advanced-node designs to work at all is an engineering feat, and the early versions are typically worst-case scenarios that include redundant transistors, interconnects, and enough built-in self-test to reroute data and processing whenever necessary.
But that margin consumes valuable real estate, limiting the return on investment for performance and power.
“Margin becomes one of the most contested resources at 2nm and 18A,” said Evelyn Landman, CTO at proteanTecs. “There needs to be enough margin for process variation, thermal and environmental effects, workload stress, latent defects, and aging. Aggregating all of these into a single worst-case guard-band is no longer viable. Static guard-band sacrifices performance and power and still fails to protect against real-world field conditions. The only sustainable approach is to measure the guard-bands directly, by monitoring timing margin in real-time, at high coverage, under real workloads, and managing it continuously over the product lifetime. The idea is to monitor guard-bands as first order, by measuring timing margin, not by proxy.”
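To see why a single aggregated guard-band is so costly, consider a minimal sketch that compares stacking every worst case with a root-sum-square combination. All of the numbers are hypothetical, the root-sum-square formula assumes the contributions are independent (which may not hold for effects such as workload stress and aging), and this is not a description of proteanTecs' method, which relies on measured timing margin rather than a formula.

```python
import math

# Hypothetical timing guard-band contributions, in picoseconds.
guard_bands_ps = {
    "process_variation": 30.0,
    "thermal_environment": 20.0,
    "workload_stress": 15.0,
    "latent_defects": 10.0,
    "aging": 25.0,
}

def stacked_worst_case(contribs):
    """Static approach: assume every effect hits its worst case at once."""
    return sum(contribs.values())

def root_sum_square(contribs):
    """Statistical combination, valid only if contributions are independent."""
    return math.sqrt(sum(v * v for v in contribs.values()))

stacked = stacked_worst_case(guard_bands_ps)
rss = root_sum_square(guard_bands_ps)
print(f"stacked worst case: {stacked:.1f} ps")
print(f"root-sum-square:    {rss:.1f} ps")
```

The gap between the two figures is margin that a static guard-band gives away on every part. Neither number reflects what a specific chip is doing under a specific workload, which is the argument for measuring margin directly.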
As processes mature, the amount of margin can be reduced while still maintaining acceptable yield. But at each new node below 5nm, and especially at 2nm and below, that is taking longer. Even though the 16/14 angstrom nodes are in development, and work has started on 10 angstroms (equal to 1nm), the time needed to reach high-volume manufacturing with sufficient yield is increasing with each successive node after 5nm.
“After 14A, the next big node step is 10A,” said Ben Sell, vice president and general manager of logic technology at Intel. “We’re already working on that node, but the node name is less important than whether it meets customer needs. What we always do is start with a base node. We define a base node for the most critical few customers that we are working with. We call them definitional customers, and that’s how we define the node. So if this node meets the requirements of our lead customers, then beyond that, when we work with more customers, we may do some modifications to that node. You could add a couple of metal layers, or do some smaller modifications to customize what a particular product would need. But those modifications are somewhat smaller, because you want to make sure you can use all of the IP that you developed on the chip so that you don’t have to redesign IP that’s already proven and tested for this technology.”
The 10A node is likely the last that will use gate-all-around FETs (although the industry has a history of pushing technologies further than expected). In their place, give or take a node or two, will be complementary FETs (CFETs), which combine nFETs and pFETs that are developed on separate wafers.
“CFET is a device architecture,” said David Fried, chief AI officer and corporate vice president for Semiverse Solutions at Lam Research. “Compared to finFET and gate-all-around FETs, it adds even more front-end complexity and challenges. There is even more structural complexity, and there are more materials in play. Those materials are going to be closer to each other than they’ve ever been. But the interesting part of CFET is that from planar to finFET to gate-all-around, our nFETs and pFETs have always been laterally in different locations next to each other. With CFETs, they’re going to be on top of one another. That introduces massive structural complexity, as well as new interconnect complexity that we’ve never seen before. A lot of the backside power distribution, for example, has to comprehend that your nFETs and pFETs are on top of each other instead of next to each other. And so CFET complexity will bleed into many other aspects of the technology. It’s not just a transistor innovation.”

Fig. 1: CFET, showing position of nFET and pFET. Source: imec
Changing economics, accelerated innovation
The massive build-up and build-out of AI data centers have fundamentally changed the dynamics of chip design and manufacturing at 2nm and beyond. While scaling is still considered a win from a power standpoint, and to a lesser extent from a performance standpoint, there isn’t enough available real estate on a reticle-sized die to keep up with the amount of data that needs to be processed for generative and agentic AI. So rather than trying to wedge everything onto a single die, the focus is shifting to multi-die assemblies of chiplets, which despite the name, can be full reticle size.
This creates a host of new tradeoffs. Area becomes less of an issue for multiple dies in a customized advanced package, but the orchestration and movement of data become significantly more challenging. Parsing an AI computation is a complex, massively parallel operation in which the processing may be split among different processing elements and results combined at the end. That last step has been a challenge since IBM first ramped up massively parallel processing in the 1980s.
For one thing, moving all the data to the right place at the same time is incredibly difficult. A delay in any processing element, or uneven aging in one or more data paths caused by workload-specific thermal gradients, can bog down performance of an entire system. And at 20 angstroms or less, driving signals over longer distances between chiplets through more resistive wires requires more power, which increases the heat inside a package. The challenge there is determining how a particular workload will utilize one of these multi-die assemblies, because that can affect where the heat builds up inside a package. As workloads change, the heat can shift. That can cause hotspots which, in turn, can accelerate electromigration that can slow or completely block data movement.
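The sensitivity of electromigration to hotspots can be sketched with Black's equation, the standard empirical model in which median time to failure scales as J^(-n)·exp(Ea/kT). The activation energy, current-density exponent, and temperatures below are assumed values chosen for illustration, not data for any particular interconnect.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def em_lifetime_ratio(t_ref_c, t_hot_c, ea_ev=0.9, j_ratio=1.0, n=2.0):
    """Ratio MTTF(hot) / MTTF(ref) from Black's equation.

    ea_ev (activation energy) and n (current-density exponent) are assumed
    values. j_ratio is the hot-spot current density divided by the reference
    current density. The material prefactor cancels out of the ratio.
    """
    t_ref_k = t_ref_c + 273.15
    t_hot_k = t_hot_c + 273.15
    thermal = math.exp(ea_ev / K_B_EV * (1.0 / t_hot_k - 1.0 / t_ref_k))
    return j_ratio ** (-n) * thermal

# Hypothetical hotspot running 20 degrees C above an 85 degrees C reference.
ratio = em_lifetime_ratio(85.0, 105.0)
print(f"lifetime at hotspot relative to reference: {ratio:.2f}")
```

With these assumed parameters, a 20°C hotspot cuts the expected electromigration lifetime to about one-fifth of the reference value, which is why a workload that shifts heat to a different region can change where a package wears out first.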
“Workloads are now first-class design constraints,” said proteanTecs’ Landman. “It is not just how much compute is used, but how it is used over time. Large language model training and inference patterns create highly non-uniform stress across silicon. Temporal bursts, localized hotspots, and long-duration stress patterns can produce very different outcomes on identical silicon. Designs that ignore workload behavior will either be over-constrained or vulnerable in the field.”
This becomes especially important as chips become collections of chiplets attached to some type of interposer. The focus on moving more data over longer distances with skinnier wires requires other changes, as well. New materials and processes are needed at the most advanced nodes for improved electron mobility — and eventually photon mobility — over longer distances inside an advanced package and between packages, as well as for structural stability and to reduce warpage in 2.5D and 3.5D configurations.
“There are constant material innovations, from high-k dielectrics and metal gates to silicon germanium for stressors,” said Lam Research’s Fried. “With advanced logic transistor development, we would see a couple of material innovations happen every decade. In my current role — I’m involved in logic devices, DRAM, NAND, many different specialty markets, and advanced packaging — it’s constant. The transition from tungsten to molybdenum is already happening across NAND and DRAM wordlines and low-level logic interconnects. Cobalt to ruthenium seems a little bit further out, but we’re going to see that in very specific places like low-level interconnect and wiring. You can look at any one of these transitions and find the insertion point across multiple different technology sectors — logic, DRAM, and specialty advanced packaging. They’re happening all over the place. The specialty technology area is just incredibly exciting. There are things like integrated photonics. We’ve been talking about integrated photonics for a long time, but these large AI systems have really driven more attention, more focus on integrated photonics. The material set for integrated photonics can get quite complex, and so there will be material transitions in that area.”
Economies of scale and repeatability become increasingly important at these dimensions. Starting in 2008, the chip industry began work on moving from 300mm to 450mm wafers in order to get more dies out of a wafer and thereby counteract rising development costs. That effort was abandoned in 2017 because there weren’t enough companies that would benefit from 450mm wafers at that time.
The market has changed since then. There are now four leading-edge fabs — Intel Foundry, TSMC, Samsung Foundry, and now Rapidus — along with an insatiable demand for higher performance, largely due to AI.
Just turning up the clock frequency is no longer an option, because it will burn up a chip, so the industry has opted for a multi-die approach with chiplets. The most economical way to manufacture all these chiplets is to use large rectangular panels rather than 300mm round wafers. This is the same rationale as moving to 450mm wafers, but in a different shape and size and as a passive substrate. A rectangular format accommodates many more chips than a round wafer, and processes can be standardized more easily than by trying to eke more usable area from a large circular wafer. Intel Labs even proposed a panel-scale chip, which would basically take the wafer-scale approach developed by Cerebras and extend it into a full-size 500 x 500mm panel.
The magnitude of this change is daunting, however. It requires entirely new equipment and different ways of handling thin wafers, which is far from trivial. It also shifts the areas of highest variation from the edge of a wafer to the middle of a panel due to mechanical stress.
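The area argument for panels can be checked with simple arithmetic. The sketch below counts how many reticle-sized (26mm x 33mm) rectangular sites fit on a 300mm round wafer versus a 500 x 500mm panel. The wafer count uses a widely used dies-per-wafer approximation, the panel count is a plain grid, and both ignore edge exclusion, scribe lanes, and handling zones, so the results are rough upper bounds.

```python
import math

def sites_on_round_wafer(diameter_mm, site_w_mm, site_h_mm):
    """Approximate site count: wafer area / site area, minus an edge-loss term."""
    s = site_w_mm * site_h_mm
    gross = math.pi * (diameter_mm / 2) ** 2 / s
    edge_loss = math.pi * diameter_mm / math.sqrt(2 * s)
    return int(gross - edge_loss)

def sites_on_panel(panel_w_mm, panel_h_mm, site_w_mm, site_h_mm):
    """Grid count on a rectangle, trying both site orientations."""
    def grid(w, h):
        return int(panel_w_mm // w) * int(panel_h_mm // h)
    return max(grid(site_w_mm, site_h_mm), grid(site_h_mm, site_w_mm))

wafer_sites = sites_on_round_wafer(300, 26, 33)
panel_sites = sites_on_panel(500, 500, 26, 33)
print(f"300mm wafer: ~{wafer_sites} sites, 500x500mm panel: {panel_sites} sites")
```

Under these simplifications the panel holds nearly five times as many sites as the wafer, even though its area is only about 3.5 times larger, because a rectangle wastes far less space at the edges when tiled with rectangles.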
“The round wafer is going to be at the beginning, more on the 2.5D silicon interposer,” said Rozalia Beica, field CTO for packaging technologies at Rapidus. “But even there, because of the reticle size, the industry has started moving toward panels. The panel is going to be higher capacity. It really depends on the size of the interposer and the packages that we will manufacture, but we will have packaging and silicon in the same fab. We don’t have to ship it to another fab or to another country to make the packages.” Hybrid bonding also is gaining steam. “These will be die-on-wafer. Wafer on wafer fits better for hybrid memory, but when you have different-sized devices, die-on-wafer would be more suitable — and more difficult,” Beica added.

Fig. 2: Evolution of chiplet package structure. Source: Rapidus
More customization
Underlying the push toward advanced nodes is demand from companies with deep pockets willing to invest in bespoke silicon for their specific needs and data types. Rapidus’ die-on-wafer and die-on-panel approaches enable customization beyond the 2nm plumbing for electrons. Intel Foundry, meanwhile, is incorporating that plumbing into the lower metal layers, then offering additional metal layers for customization, as well as a variety of interconnects such as bridges. And TSMC is providing flexibility in its standard cell architecture with what it calls NanoFlex. Samsung plans to offer a custom HBM that will address performance from a different angle. Ultimately, each foundry will find what works best, likely a combination of approaches.
“There will be common platforms, but meaningful customization will remain,” said proteanTecs’ Landman. “Different markets value different tradeoffs. As customization increases, the need for fast correlation between design intent, silicon reality, package behavior, and system operation becomes critical. That feedback loop is what allows advanced nodes to scale beyond early adopters.”
Prioritizing, categorizing, and moving data at high speeds is critical. Inside a planar SoC, sending signals from one corner of a die to another is still faster than moving signals off-die through an interposer. In fact, the only way to achieve similar speeds in an advanced package is with full 3D-ICs, which can be floor-planned so that the distances critical data needs to travel are actually shorter than on a planar SoC. This approach is underway with a logic layer beneath an HBM memory stack, but whether a DRAM stack can achieve performance on par with or close to SRAM’s speed remains to be seen. And it’s not clear whether full 3D-ICs will be cost-effective for other applications due to thermal dissipation and variation-related issues.
Photonics may provide a viable interim step. Recent discussions around optical waveguides embedded in glass substrates could greatly accelerate data movement with minimal additional heat. The challenges there will be handling the glass so it doesn’t crack, converting optical to electrical signals in a confined space, and dealing with thermally induced optical drift. On the plus side, the coefficient of thermal expansion is roughly the same for many glasses and silicon.
Optical is playing a bigger role in multiple places. Advancements in mask writing enable a variety of shapes, including polygons and curvilinear ones, to be printed on a wafer with much greater accuracy.
“We are looking very carefully at curvilinear,” said Intel’s Sell. “It’s a tradeoff because it’s more expensive to compute these curvilinear shapes, but it’s also more accurate. So it depends on how much accuracy you need, because there is an extra cost for fine-tuning that you would need.”
Another option is high-NA EUV lithography. “Intel 18A is designed so we can use single-pass patterning on EUV, which is really nice,” Sell said. “But going forward, we will have to go to multi-pass patterning EUV, and that’s where you have an opportunity to replace multiple passes on low-NA EUV with single-pass high-NA EUV. That’s something we’re working on with Intel 14A right now. We have set up the design rules so that we can use both. But over time, if you can replace it with single-pass high-NA, then you get process flow simplification and cost improvement. We are keeping our options open for 14A. We know we can make it work with low-NA. High-NA is an opportunity to reduce cost.”
Another option is to combine different types of cells in a multi-die assembly, which can further reduce the cost. “IPs are a key component of designs, and certainly IPs are optimized for a technology node such as 2 nanometers,” said Synopsys’ Chakraborty. “So with this notion of hybrid design, you can mix and match different standard cells. You can have high-performance standard cells mixed with low-power standard cells, and maybe high-density ones. So you have more flavors of these standard cells available now, and the tools have to be really smart in choosing them in order to maximize your benefits. If you use high-performance standard cells everywhere, because you’re trying to meet very aggressive performance targets for your HPC AI design, then you’re going to pay a price in power and perhaps other metrics. But this mix is very important.”
Conclusion
Mixing and matching components shifts the focus from just shrinking digital logic to data movement. So-called “More-than-Moore” designs are the only way to achieve the performance goals of AI/HPC data centers, and in the future, high-performance edge computing. In many cases, they also will enable faster time-to-market because they take advantage of decades of learning in semiconductor manufacturing.
“We’re still going to do everything with planar processing,” said Fried. “We’re still putting a wafer into a piece of equipment and processing whatever is exposed on that wafer. Every process has its parameters, its variability, and associated measurement techniques. That seeps into whatever structure you’re building. There will be more processes, parameters, and key performance indicators. But the math hasn’t changed for understanding how those variabilities soak through the technology, how they come together and where, how they need to be controlled. It’s a lot more math, but fundamentally it hasn’t changed. When technology was simple, we knew how that math worked, and we did all that math. We started out doing it in our heads, and then we made some relatively simple systems to do it. But now there are so many processes and so many parameters that we have to use advanced systems, and physical models or virtual silicon that underpin those systems, to keep all that math straight.”
CFETs combine nFETs and pFETs that are developed on separate wafers. How is that possible? Please explain.
This is still in the R&D stage, but the most likely approach will be stacking of one on top of the other. It’s extremely difficult to build this monolithically, even though that may be the far-future direction. We’re just starting to see more research results on this. Here’s a good paper on this from imec, and another good one from TSMC. And here is an article from Semiconductor Engineering on vertical stacking of pFETs and nFETs.