Intel's CPU Roadmap: To Nehalem and Beyond

Deep Diving into the Nehalem's Core

So far we've covered the most important aspects of Nehalem's platform and microarchitecture: Intel's QPI, the CPU's integrated DDR3 memory controller and the new three-level cache hierarchy. Equally important, and supporting all of these, are the changes incorporated into the actual processing blocks of the CPU. Nehalem retains the 4-issue wide execution engine per core that first debuted in Conroe (wider than the competition's 3-issue design), but builds upon the last Core microarchitecture iteration (Penryn) with several enhancements to further increase its efficiency:

  • Increased Parallelism: Since Intel's current processors use an out-of-order execution architecture, one way to further increase parallelism is to enlarge the out-of-order window, allowing more instructions to be analyzed in flight and more independent operations to be executed in parallel, which helps hide latencies. The Core architecture could track up to 96 micro-ops at any one time; Nehalem enlarges this window by roughly 33% to 128 micro-ops, and the scheduler and other related buffers have grown accordingly. A short sketch of the kind of independent operations this benefits appears after this list.
  • More Efficient Algorithms: With a new microarchitecture, Intel took the chance to refine several of its algorithms, including speedier handling of branch mispredictions and better load-store scheduling. Nehalem also features improved hardware prefetchers, faster handling of unaligned cache accesses and faster synchronization primitives. Motion estimation during video encoding is a common source of unaligned cache accesses, while synchronization primitives are what threads in multi-threaded software use to coordinate with one another, so speeding them up improves performance.
  • Enhanced Branch Prediction: In addition to the more efficient algorithms for speedier handling of branch mispredictions, Nehalem implements a second-level branch target buffer (BTB) that is said to be especially useful for applications with very large code footprints, databases being a typical example. The second-level BTB helps reduce performance penalties by not only predicting the path of a branch but also caching the information used by the branch, so if a misprediction does occur, the core can quickly roll back and use the cached information for the other path. Another hardware addition is the renamed return stack buffer (RSB), which keeps track of return addresses and helps avoid common return-instruction mispredictions. A small example of how branch predictability affects performance follows this list.
  • Simultaneous Multithreading (SMT): Intel's Hyper-Threading technology is making a comeback, this time under the name SMT, and Intel claims it is an enhanced version of its former self. Comparing the hardware HT ran on then with the multi-core processors SMT targets now, Intel does have a point: the concept is the same, but the optimization differs. Back when processors were single-core, HT was introduced to improve performance and efficiency by attempting to execute two threads simultaneously on otherwise idle processor resources and registers. Then multi-core processors arrived, and as we all know, HT fell out of favour as it often hurt execution more than it helped.

    Fast forward a few years and SMT has been designed with multi-core processors in mind. With vastly larger buffers, copious memory bandwidth and resources to support more in-flight work, Intel figures the time is right to revive a Hyper-Threading-like feature on Nehalem. SMT doubles the number of threads that can run simultaneously on each core, so the quad-core processor that the Nehalem microarchitecture will first debut as can execute up to eight threads at once. The actual benefit still depends on each core's resource availability, which is reflected in Intel's own briefing that SMT can deliver 20% to 30% more performance, depending on the application, for only a slight increase in power consumption. In short, the more threaded the workload or application, the better the gains. A short sketch of how such a workload sees the extra hardware threads also appears after this list.

  • Intel SSE4.2: It's not quite a full new SSE standard; rather, it is Intel's shorthand for the new Application Targeted Accelerators added on top of Intel SSE4 support. SSE4 instruction support itself remains identical to that found on Penryn processors, but Intel has added seven new Application Targeted Accelerators in Nehalem to extend the architecture's capabilities for accelerated string and text processing, with XML handling being a prime example. Given how widely XML is used, from applications to databases, it is an ideal area for performance improvement. A hedged intrinsics sketch of this kind of string scanning appears after this list.
  • Improved Virtualization Performance: Nehalem also makes a conscious effort to improve performance in virtualized environments, since it will first and foremost debut in the server/workstation arena. To that end, Nehalem will be the first Intel processor to support Intel's Extended Page Tables (EPT) feature, which is essentially Intel's counterpart to the Nested Paging feature found in AMD's second- and third-generation Opteron processors. For those interested, Intel's whitepapers cover EPT in more detail.
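
To illustrate the point about the larger out-of-order window, here is a minimal C++ sketch (our own example, not Intel's) of the difference between a single dependency chain and independent operations. The second loop gives the scheduler several additions it can keep in flight at once, which is exactly the kind of parallelism a bigger reorder window is built to exploit.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// One long dependency chain: each add must wait for the previous result,
// so a wider out-of-order window gains relatively little here.
uint64_t dependent_sum(const uint64_t* data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Four independent accumulators: the scheduler can keep several adds in
// flight at once, the sort of work a larger reorder window (96 micro-ops
// on Core, 128 on Nehalem) is meant to exploit.
uint64_t parallel_sum(const uint64_t* data, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) s0 += data[i];   // handle any leftover elements
    return s0 + s1 + s2 + s3;
}

int main() {
    uint64_t data[1024];
    for (size_t i = 0; i < 1024; ++i) data[i] = i;
    std::printf("%llu %llu\n",
                (unsigned long long)dependent_sum(data, 1024),
                (unsigned long long)parallel_sum(data, 1024));
}
```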
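
Branch prediction is easiest to appreciate with a toy benchmark. The sketch below (again our own example) times the same branch over random data, where any predictor struggles, and over sorted data, where the pattern is easy to learn; the exact timings will vary by machine.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Count elements above a threshold; the 'if' is the branch the predictor must guess.
static long long count_large(const std::vector<int>& v) {
    long long count = 0;
    for (int x : v)
        if (x >= 128)
            ++count;
    return count;
}

int main() {
    std::vector<int> data(1 << 22);
    std::mt19937 rng(42);
    for (int& x : data) x = rng() % 256;

    auto time_it = [&](const char* label) {
        auto t0 = std::chrono::steady_clock::now();
        long long c = count_large(data);
        auto t1 = std::chrono::steady_clock::now();
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: count=%lld, %lld us\n", label, c, (long long)us);
    };

    time_it("random data (hard to predict)");
    std::sort(data.begin(), data.end());   // sorted data makes the branch outcome predictable
    time_it("sorted data (easy to predict)");
}
```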
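
As a rough illustration of how software sees SMT, the following sketch (ours, using standard C++ threads rather than anything Nehalem-specific) asks the operating system how many hardware threads exist, eight on a quad-core part with SMT enabled, and spawns one worker per hardware thread.

```cpp
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    // On a quad-core Nehalem with SMT, the OS reports eight hardware threads.
    unsigned hw_threads = std::thread::hardware_concurrency();
    if (hw_threads == 0) hw_threads = 1;   // the call may return 0 if the count is unknown
    std::printf("hardware threads reported: %u\n", hw_threads);

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < hw_threads; ++i) {
        workers.emplace_back([i] {
            // Placeholder busy work; with SMT, two of these workers can share
            // one physical core's execution resources.
            volatile unsigned long long x = 0;
            for (unsigned long long k = 0; k < 10000000ULL; ++k) x += k;
            std::printf("worker %u done\n", i);
        });
    }
    for (auto& t : workers) t.join();
}
```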
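
Finally, a hedged sketch of the SSE4.2 string instructions at work. The helper name find_xml_delim is our own invention, and the code assumes a compiler with SSE4.2 intrinsics (e.g. built with -msse4.2); it scans 16 bytes at a time for XML delimiter characters, the sort of inner loop the new Application Targeted Accelerators are aimed at.

```cpp
#include <nmmintrin.h>   // SSE4.2 intrinsics (PCMPISTRI etc.)
#include <cstdio>
#include <cstring>

// Returns the offset of the first '<', '>' or '&' within the first 16 bytes
// of text, or 16 if none is found. Assumes at least 16 readable bytes.
static int find_xml_delim(const char* text) {
    const __m128i delims = _mm_setr_epi8('<', '>', '&', 0, 0, 0, 0, 0,
                                         0, 0, 0, 0, 0, 0, 0, 0);
    const __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(text));
    // Compare every byte of 'chunk' against the delimiter set; the instruction
    // returns the index of the first matching byte, or 16 if there is no match.
    return _mm_cmpistri(delims, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ANY);
}

int main() {
    char buffer[32] = {0};
    std::strcpy(buffer, "version='1.0'?><root");
    std::printf("first delimiter at offset %d\n", find_xml_delim(buffer));  // should print 14
}
```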