A broad overview of the improvements coming to AMD’s Zen microarchitecture

AMD's Zen chips aren't slated to reach mass availability till the early part of 2017, but the company is already at the gates with fresh details about its new architecture. Will Zen be the chip to make AMD a real player in the high-end CPU market again?

AMD has made good on its promise to unveil more details on its upcoming x86 Zen processors. At this week’s Hot Chips conference, the chipmaker took a microscope to Zen’s architecture and unwrapped all the small details, following up on the broad strokes it had already revealed at IDF in San Francisco.

This is the chip that will go head to head with Intel’s Kaby Lake processors next year, and is essentially AMD’s best shot at recapturing the enthusiast market. Zen is a “clean sheet” design, which means we’re finally leaving the Bulldozer architecture behind for good as AMD has built Zen from the ground up.

In addition, while we already know that AMD is claiming a 40 per cent jump in instructions per clock (IPC) compared to Excavator, a demonstration with the Blender rendering benchmark showed Zen’s IPC to actually be on par with Intel’s 14nm Broadwell-E Core i7-6900K.

Broadly speaking, the improvements AMD made to the Zen architecture can be divided into three key areas – an upgraded core engine, better cache system, and lower power consumption.

Zen's improvements can be divided into three key areas.

Finally, a look at the Zen core

Zen's CCX comprises four cores connected to an L3 cache.

AMD also gave us a look at its Zen CPU core, where the CPU complex (CCX) comprises four cores connected to a central 16-way 8MB L3 cache that has been split into four slices. All four cores have access to the entire L3 cache with the same average latency, but they each also have 512KB of private L2 cache.

In order to increase core counts, AMD connects multiple CCXs, for instance to create an 8-core (16 threads, 2 per core) SKU. The amount of L3 cache would double in this case as well to a total of 16MB across two CCXs.
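
The scaling arithmetic above can be sketched in a few lines of Python. This is only an illustration built from the figures quoted in the article (four cores, 8MB of L3 per CCX, 512KB of L2 per core, two threads per core); the `ZenConfig` class and its property names are hypothetical and do not come from AMD.

```python
from dataclasses import dataclass

# Figures quoted in the article.
CORES_PER_CCX = 4
THREADS_PER_CORE = 2   # SMT: two threads per core
L2_KB_PER_CORE = 512   # private L2 per core
L3_MB_PER_CCX = 8      # shared L3 per CCX


@dataclass(frozen=True)
class ZenConfig:
    """Hypothetical helper: totals for a chip built from `ccx_count` CCXs."""
    ccx_count: int

    @property
    def cores(self) -> int:
        return self.ccx_count * CORES_PER_CCX

    @property
    def threads(self) -> int:
        return self.cores * THREADS_PER_CORE

    @property
    def l2_total_kb(self) -> int:
        return self.cores * L2_KB_PER_CORE

    @property
    def l3_total_mb(self) -> int:
        return self.ccx_count * L3_MB_PER_CCX


eight_core = ZenConfig(ccx_count=2)
print(eight_core.cores, eight_core.threads, eight_core.l3_total_mb)
```

With two CCXs, this reproduces the 8-core, 16-thread, 16MB L3 configuration described above.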

However, we’re still missing details on what sort of interconnect it is using to link up the different CCXs.

Zen also marks AMD's move to simultaneous multi-threading (SMT), with each Zen core now able to run two threads. Intel's version of SMT is the by-now familiar Hyper-Threading, but you won't catch anyone from AMD referring to its technology as such. This finally overcomes one of the key limitations of Bulldozer, where the two integer cores in each module shared a single floating-point unit, which hurt floating-point performance. Zen's design now hews closer to Intel's, and each thread will register with the operating system as a separate logical core.

AMD has finally implemented simultaneous multi-threading on Zen.

Another key change is the inclusion of a micro-op cache on Zen. Micro-operations are the low-level instructions a processor uses to implement more complex x86 instructions. Bulldozer lacked an operation cache, which meant it had to fetch and decode instructions again each time, even for commonly used micro-operations. A micro-op cache stores the decoded micro-ops that come through the pipeline so they can be accessed more quickly the next time.

Ultimately, this lets frequently used instructions skip the fetch and decode stages of the pipeline and allows more operations per CPU clock cycle, which improves performance and saves power at the same time.
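
To make the idea concrete, here is a toy Python model of a micro-op cache. It is a sketch under stated assumptions, not a description of Zen's actual design: the `count_decodes` function, the four-entry capacity, and the least-recently-used replacement policy are all hypothetical. It simply counts how often a decoder would have to run for a small loop, with and without a cache of already-decoded instructions.

```python
from collections import OrderedDict


def count_decodes(trace, use_uop_cache, capacity=4):
    """Count decoder invocations for a trace of instruction addresses.

    Hypothetical model: a hit in the micro-op cache skips decoding;
    a miss decodes the instruction and stores the result (LRU eviction).
    """
    cache = OrderedDict()  # instruction address -> decoded micro-ops (placeholder)
    decodes = 0
    for addr in trace:
        if use_uop_cache and addr in cache:
            cache.move_to_end(addr)  # mark as most recently used
            continue                 # hit: no decode needed
        decodes += 1                 # miss (or no cache): decode again
        if use_uop_cache:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return decodes


# A three-instruction loop body executed 100 times.
loop = [0x10, 0x14, 0x18] * 100
print(count_decodes(loop, use_uop_cache=False))  # 300
print(count_decodes(loop, use_uop_cache=True))   # 3
```

In this toy model the loop is decoded once and then served from the cache, which is the kind of repeated work a micro-op cache is meant to avoid.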


Improved cache system

The cache hierarchy also sees significant enhancements over the previous generation processors. For starters, the L1 data cache has been doubled in size compared to Bulldozer. It also now uses the faster write-back technique, instead of the slower write-through methodology it used before.
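
The difference between the two write policies can be illustrated with a small Python sketch. This is a simplified model, not AMD's implementation: the `next_level_writes` function, the two-line capacity, and the LRU eviction are assumptions made for illustration. It counts how many writes reach the next level of the memory hierarchy for the same sequence of stores.

```python
from collections import OrderedDict


def next_level_writes(stores, policy, capacity=2):
    """Count writes sent to the next memory level for a list of stored-to lines.

    Hypothetical model: 'write-through' forwards every store immediately;
    'write-back' only writes a modified (dirty) line when it is evicted
    or when the cache is flushed at the end.
    """
    if policy not in ("write-through", "write-back"):
        raise ValueError(f"unknown policy: {policy}")
    cache = OrderedDict()  # line address -> dirty flag
    writes = 0
    for line in stores:
        if line in cache:
            cache.move_to_end(line)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                _, dirty = cache.popitem(last=False)  # evict LRU line
                if dirty:
                    writes += 1  # write-back of the evicted dirty line
            cache[line] = False
        if policy == "write-through":
            writes += 1          # every store goes to the next level
        else:
            cache[line] = True   # just mark the line dirty
    # Final flush: write out whatever is still dirty.
    writes += sum(1 for dirty in cache.values() if dirty)
    return writes


stores = [0] * 10 + [1] * 10 + [2] * 10  # ten stores to each of three lines
print(next_level_writes(stores, "write-through"))  # 30
print(next_level_writes(stores, "write-back"))     # 3
```

Repeated stores to the same line are absorbed by the write-back cache, which is why that policy tends to generate less traffic to the next cache level.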

In addition, larger queues are now supported to handle L1 and L2 cache misses. A cache miss occurs when the data requested for processing is not found in a given cache, resulting in execution delays as the processor fetches the data from another cache level or even from main memory.

Longer queues let the processor continue carrying out subsequent instructions while the data request is queued, so work can proceed with minimum latency.
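
A rough back-of-the-envelope model shows why deeper miss queues help. The Python sketch below is a deliberate oversimplification and every number in it is made up: it assumes all misses are independent, each takes the same fixed latency, and up to `queue_depth` of them can be outstanding at once. The `stall_cycles` function is hypothetical and says nothing about Zen's real queue sizes or latencies.

```python
import math


def stall_cycles(misses, latency, queue_depth):
    """Estimate the cycles needed to service `misses` independent cache misses.

    Toy assumption: misses are handled in batches of `queue_depth`,
    and each batch completes after `latency` cycles.
    """
    if queue_depth < 1:
        raise ValueError("queue_depth must be at least 1")
    return math.ceil(misses / queue_depth) * latency


# 16 misses at an illustrative 100 cycles each.
print(stall_cycles(16, latency=100, queue_depth=1))  # 1600
print(stall_cycles(16, latency=100, queue_depth=8))  # 200
```

The point of the sketch is only that overlapping outstanding misses hides much of their latency; real processors are far more complicated.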

On top of that, there is now 512KB of L2 cache per core with 8-way associativity, compared to 256KB per core and 4-way associativity for Intel's Skylake. Generally speaking, cache hit rate improves with higher set associativity, and a larger cache reduces the need for the CPU to access main memory, which can improve performance in some cases.
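
The effect of associativity can be demonstrated with a small set-associative cache simulator in Python. This is a minimal sketch, not a model of either vendor's hardware: the 64-byte line size, the LRU replacement policy, and the `num_sets` and `hit_rate` helpers are assumptions made for illustration. Only the cache sizes and way counts come from the article.

```python
from collections import OrderedDict


def num_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a cache (line size of 64 bytes is an assumption)."""
    return size_bytes // (ways * line_bytes)


def hit_rate(addresses, size_bytes, ways, line_bytes=64):
    """Simulate an LRU set-associative cache and return the fraction of hits."""
    if not addresses:
        return 0.0
    sets = num_sets(size_bytes, ways, line_bytes)
    cache = [OrderedDict() for _ in range(sets)]  # one LRU-ordered dict per set
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        entries = cache[line % sets]   # set selected by the low line-address bits
        tag = line // sets
        if tag in entries:
            hits += 1
            entries.move_to_end(tag)   # mark as most recently used
        else:
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict least recently used
            entries[tag] = True
    return hits / len(addresses)


KB = 1024
# Six addresses 64KB apart: under the assumed geometry they all map to one set.
trace = [i * 64 * KB for i in range(6)] * 10

print(num_sets(512 * KB, ways=8), num_sets(256 * KB, ways=4))  # 1024 1024
print(hit_rate(trace, 256 * KB, ways=4))  # 0.0
print(hit_rate(trace, 512 * KB, ways=8))  # 0.9
```

With the assumed 64-byte lines, both configurations have 1,024 sets, so the extra capacity shows up entirely as extra ways. In this contrived trace, six lines competing for one set thrash a 4-way cache but fit comfortably in an 8-way one. Real workloads rarely behave this badly, which is why the benefit is "in some cases".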

Prioritizing power consumption

AMD worked from the start to make Zen as power efficient as possible.

One of the more interesting aspects of Zen’s design was how AMD went back to the drawing board to design a processor that gives equal priority to frequency, performance, and power consumption. Before Zen, power considerations traditionally took a backseat when building the architectural foundation, which meant that optimizations and refinements regarding power were limited.

This changes with Zen, as AMD wanted to create an architecture that could scale from low-power fanless notebooks to power-guzzling supercomputers. As a result, power consumption was incorporated into the design process from the outset.

The shift is all the more stark because AMD’s existing CPU line-up is so segmented, broken up as it is into low- and high-power designs. On the other hand, Zen is a single scalable architecture for a wide range of computing solutions.

That’s helped along by the adoption of a new 14nm FinFET process, which boosts performance and improves power efficiency. The increased performance lets the chip complete workloads faster and shut off certain parts of the chip more quickly. AMD also used multi-level clock gating to reduce power consumption in sections of the core that are idling, while working to mitigate the performance trade-offs that clock gating can bring.
