Another major change is the size of the L3 cache, which has doubled on Zen 2. Each chiplet, or Core Chiplet Die (CCD) as AMD sometimes calls it, now has 32MB of L3 cache, which works out to 4MB per core, an upgrade enabled directly by the smaller 7nm process and the larger transistor budget it affords each chiplet. According to AMD, game performance is closely tied to L3 cache size, so the larger capacity should translate directly into improved performance in games. In general, games based on older APIs, or those that are otherwise more CPU-sensitive, should see the biggest benefit. The larger L3 cache also reduces effective memory latency, since fewer requests have to go all the way out to main system memory.
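The effective-latency argument is easy to see with some back-of-the-envelope arithmetic. The figures below are illustrative assumptions, not AMD's published numbers: a bigger L3 raises the hit rate, so fewer accesses pay the full trip to DRAM.

```python
def effective_latency(l3_hit_rate, l3_latency_ns, dram_latency_ns):
    """Average latency seen by an access that has already missed L1/L2."""
    return l3_hit_rate * l3_latency_ns + (1 - l3_hit_rate) * dram_latency_ns

# Hypothetical figures: ~10ns for an L3 hit, ~70ns for a trip to DRAM.
smaller_cache = effective_latency(0.60, 10, 70)  # ~34ns on average
larger_cache = effective_latency(0.80, 10, 70)   # ~22ns on average
print(smaller_cache, larger_cache)
```

Bumping the hit rate from 60 to 80 per cent cuts the average latency by roughly a third in this toy model, which is the sense in which a bigger cache behaves like faster memory.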
In fact, AMD is so confident in the impact this has on game performance that it is branding its L3 cache AMD GameCache instead. What's more, the company says that doubling the cache size has a similar effect to installing faster memory: AMD is simply keeping data on-chip rather than going off-die, which it says is more efficient.
Elsewhere, floating point performance has been doubled as well by moving to two 256-bit floating point units (FPUs) that support AVX2 instructions. All this, together with increased integer and load/store resources and the changes to branch prediction and the micro-op cache described below, should produce an IPC uplift of roughly 15 per cent.
Ryzen 3000 uses a new TAGE, or TAgged GEometric, branch predictor, which is able to make selections with improved accuracy and granularity in order to increase throughput by reducing stalls from branch mispredicts. This sits on top of the existing Perceptron predictor, which still handles the first stage, so Ryzen 3000 effectively uses a two-stage branch predictor. In addition, it is better able to handle deeper branch histories, helped along by larger branch target buffers (BTBs): the L1 BTB has doubled in size, and the L2 BTB has nearly doubled. At the simplest level, a BTB caches the target addresses of recently taken branches, so the front end can begin fetching from a branch's destination before the branch itself has been resolved.
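To see why predictor accuracy matters, here is a toy two-bit saturating-counter predictor, deliberately far simpler than Zen 2's TAGE scheme: each branch gets a small counter that predicts taken or not-taken, and every mispredict stands in for a costly pipeline flush.

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter per branch address (a teaching model,
    not Zen 2's actual predictor)."""

    def __init__(self):
        self.counters = {}  # branch address -> counter in [0, 3]

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken"; new branches start at 1.
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

predictor = TwoBitPredictor()
mispredicts = 0
# A loop branch at a hypothetical address: taken 7 times, then falls through.
for taken in [True] * 7 + [False]:
    if predictor.predict(0x400) != taken:
        mispredicts += 1
    predictor.update(0x400, taken)
print(mispredicts)  # 2: one warm-up mispredict plus the loop exit
```

A TAGE predictor improves on this by tracking multiple history lengths and tagging entries, which is what lets it resolve patterns a simple counter gets wrong; the BTB then supplies the target address for branches predicted taken.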
And if you were sensing a theme that everything is getting larger, you'll only find confirmation of that in the micro-op cache. Its size has doubled from 2K entries to 4K entries, so it can accommodate more decoded operations and also increase throughput by avoiding the re-decoding of operations. To further help this along, the dispatch rate from the micro-op cache to the buffers has been increased to up to eight fused instructions. AMD first added a micro-op cache with the original Zen architecture in 2016, and it means the CPU doesn't need to keep fetching and decoding frequently used micro-operations, the detailed low-level instructions used to implement more complex instructions.
Finally, AMD has added OS-level optimizations to the Windows 10 May 2019 update.
Where the cores sit relative to the L3 cache has a big impact on the overall performance of the processor as well. If the OS has topology awareness and understands where the cores and cache are, the process scheduler can allocate a certain number of threads to one CCX, or one cluster of four cores, before spawning or migrating threads onto a second or third CCX. This improves performance because grouping the threads together on the nearest CCX gives them direct, lowest-latency access to the L3 cache they share.
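You can approximate this kind of placement from user space on Linux with CPU affinity. The sketch below assumes, purely for illustration, that cores 0-3 form the first CCX on the machine at hand; the actual core numbering varies by system.

```python
import os

# Hypothetical mapping: assume cores 0-3 make up the first four-core CCX.
CCX0 = {0, 1, 2, 3}

if hasattr(os, "sched_setaffinity"):      # this API is Linux-only
    target = CCX0 & os.sched_getaffinity(0)  # keep only cores that exist here
    if target:
        os.sched_setaffinity(0, target)   # 0 = the calling process
        print(os.sched_getaffinity(0))    # process is now confined to one CCX
```

The Windows scheduler change does essentially this automatically, with the added advantage that it knows the real CCX boundaries from the processor's topology information.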
That said, not all tasks will behave in this manner. There are times when you might want to have a second thread spawn on a different chiplet, as far away as possible, in order to allow the CPU to maintain high performance without having to deal with regions of high power density. This helps with turbo performance across multiple threads.
Another important change is clock speed selection. With previous Windows builds, if you wanted to ramp the clock speed from low to high, or just pick any new speed at all, it could take about 30ms. But with UEFI CPPC2, or Collaborative Processor Performance Control, which hands control over to the processor's firmware, this can be reduced to around just 1 to 2ms. This change is particularly useful for brief workloads that produce short bursts in clock speed, such as webpage rendering, web browsing, and application launches.
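Some rough arithmetic, with illustrative numbers rather than measured ones, shows why ramp time matters so much for bursty work: if a task only runs for 50ms, a 30ms ramp leaves most of it executing at low clocks, while a 1-2ms ramp barely registers.

```python
def fraction_at_full_speed(burst_ms, ramp_ms):
    """Share of a short burst spent at full clocks, assuming the clock
    ramp completes ramp_ms after the burst begins (a simplification)."""
    return max(burst_ms - ramp_ms, 0) / burst_ms

print(fraction_at_full_speed(50, 30))  # 0.4  -> only 40% of the burst at full clocks
print(fraction_at_full_speed(50, 2))   # 0.96 -> 96% with the faster CPPC2 ramp
```

For long, steady workloads the ramp time is negligible either way; it is precisely the short, interactive bursts the article mentions that benefit.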