Intel's CPU Roadmap: To Nehalem and Beyond

In this forward-looking article, learn about Intel's high-end multi-processor (MP) platform update, microarchitecture details of the next-generation processor core, Nehalem, and even information on platforms and processors beyond 2009 with Sandy Bridge and Larrabee. So sit tight and get ready to soak up all the information.

Intel's Tick-Tock at Work

Barely two weeks after Intel's announcement of the Silverthorne and Diamondville processors, which expand Intel's x86 architecture into ultra-mobility and ultra-portable devices, on 18th March Intel unveiled next-generation details at the other end of the scale: the high-performance computing and Expandable (EX) platform segment. So sit tight as we navigate you through the sea of information.

Dunnington - Industry's First 6-Core Intel x86 Processor

You've heard of dual-core, quad-core and even tri-core processors, but the rumor of a 6-core processor from Intel has been in the wild for some time, especially since last month's leak of Intel's roadmap presentation to Sun. Today, the rumor has become reality as Intel announces the Dunnington processor for their Caneland multi-processor platform (Intel 7300 chipset), which supports up to four physical CPUs for large-scale server computing. Currently, the Caneland platform takes "Tigerton" processors (Xeon 7300 series), which are very similar to Clovertown processors (Xeon 5300 series), the only difference being that they are qualified to operate on the Caneland platform. For those who aren't familiar, the Clovertown processors are 65nm quad-core parts akin to Kentsfield in the desktop lineup.

While the 45nm processor refresh has taken place in the mobile, desktop and workstation spaces, the high-end server computing space hasn't quite caught up yet. Dunnington is essentially a Caneland platform 'refresh' of sorts, as it will be manufactured on Intel's 45nm high-k process technology, the basis of all the other 45nm parts Intel has produced to date. With Dunnington, Intel will have completed the introduction of 45nm parts across all of its CPU segments. But unlike a traditional die shrink with the expected Penryn-class enhancements, Dunnington is far superior to the Tigerton processors it replaces: six processing cores (each pair sharing a 3MB L2 cache), 16MB of L3 cache shared among all cores, and a total of 1.9 billion transistors.

This is a diagrammatic representation of the new Dunnington processor and its main features. Note that since Dunnington will run on the existing Caneland platform, its operating FSB remains the same at 1066MHz.

This is the actual die shot of the Dunnington processor, revealing the exact layout of the processing cores and the L2 and L3 caches. Interestingly, from a macro view the layout somewhat resembles AMD's Phenom processors. You'll soon see that the next-generation Core processors build upon this structure.

With Caneland and four Xeon 7300 processors, the platform supports up to a total of 16 processing cores. The same platform upgraded with four Dunnington processors will yield a grand total of 24 processing cores - and quite impressively spec'd ones too (a quick tally follows below). Dunnington is pin-compatible with Tigerton, so it's a direct drop-in upgrade for the Caneland platform. Unlike other Xeon parts that have a roughly similar desktop uni-processor sibling, Dunnington will be relegated solely to the MP space. Intel explained that the product proposition for high-end home users just isn't there yet, and repackaging it for the desktop segment would take too much effort since its physical packaging and electrical configuration differ. In any case, by the time such an offering was ready, it would be late and expensive, and Nehalem would be close behind (that's our next major topic in this article). As for availability, Dunnington is estimated to arrive in the second half of 2008.
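
Here's that quick tally in Python, using the per-chip core counts Intel quoted:

```python
# Back-of-the-envelope tally of total processing cores on a 4-socket
# Caneland board, comparing quad-core Tigerton with six-core Dunnington.
SOCKETS = 4

for name, cores_per_cpu in [("Tigerton (Xeon 7300)", 4), ("Dunnington", 6)]:
    total = SOCKETS * cores_per_cpu
    print(f"{name}: {SOCKETS} sockets x {cores_per_cpu} cores = {total} cores")
```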

Over in the extreme high-performance computing (HPC) segment, Intel also reiterated its upcoming Itanium update. Codenamed Tukwila and made up of two billion transistors, it is Intel's most complex CPU yet, at least doubling the performance of its predecessor. It too is expected for release in the late second half of this year or early 2009; you can read more about it in our earlier coverage.

Nehalem - Intel's Next Gen Processor goes NUMA

Probably the most anticipated processor architecture update from Intel, even before the Core microarchitecture came into existence, was the notion of Intel following in the footsteps of AMD's non-uniform memory access (NUMA) implementation. Think back to the power shift in processor supremacy in AMD's heyday: from the Athlon XP days onwards, AMD's fortunes rose when it introduced the Opteron and Athlon 64 processors, pairing a NUMA architecture with an updated core architecture.

While the Core 2 Duo processor was a great success, it still rode on the traditional platform topology that had served Intel well. After all these years, Intel is finally making the leap to a NUMA platform too, and that is one of the biggest highlights of the Nehalem microarchitecture. So why the change when the existing platform doesn't seem to be an issue? As part of Intel's long-term planning, Nehalem was to be the first of a new generation designed as a high-performance, dynamic and design-scalable microarchitecture. That means more inter-processor communication as Intel ramps up the number of cores per CPU, along with the need for ever greater memory bandwidth at low latencies. This was the catalyst for Intel to adopt a new platform architecture altogether for better scalability and inter-component communications, and thus came Nehalem.

At the time of publishing this article, Intel said it is on schedule to ramp production of Nehalem in the last quarter of 2008; according to roadmaps, full-scale availability would probably come at the very end of 2008, but more likely in early 2009. It's interesting that we've not yet crossed the first quarter of 2008 and we're already discussing 2009's products. For those who love to peer into the future, you'll get loads of information in this article. So read on as we detail what to expect from Nehalem.

Nehalem will enter production in late 2008, but we don't expect OEM availability till 2009; unless Intel notches up their gears.

Nehalem's QuickPath & Integrated Memory Controller

Similar to AMD's use of HyperTransport links for high-speed point-to-point inter-component and inter-processor communications, Intel has developed its own version called QuickPath Interconnect (QPI for short), and it works very much the same way as HyperTransport does on the AMD platform. QPI itself is part of the QuickPath architecture, which integrates the memory controller into the processor while relying on QPI for inter-processor and inter-component communications (e.g. with the I/O Hub). Each processor thus interfaces directly with its own pool of physical memory, and in a multi-processor setup the processors communicate with one another to access each other's memory banks - which is why this is termed a NUMA platform/processor architecture (a toy model of the local-versus-remote access cost follows the diagram below). Given the emphasis on multi-processor communications, you can already tell that Intel plans to introduce this to the workstation/server market first, but being scalable in nature, it can also be used in a high-end uni-processor desktop setup. Here's a diagram of how Nehalem will be implemented on both platforms:-

Be it for servers or high-end desktop system implementations, Nehalem was designed with scalability in mind, just like the AMD Opteron and Phenom series. Also seen here is the codename for the new Tylersburg I/O hub, which will be used in conjunction with the Nehalem architecture. A separate ICH hub will still exist to give Intel the various building blocks needed for the broad range of platforms that Nehalem will take on.

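To make the NUMA idea concrete, here's a toy model in Python of local versus remote memory access on a two-socket QuickPath system. The latency figures are purely our own illustrative assumptions, not Intel numbers:

```python
# Toy model of NUMA memory access on a 2-socket QuickPath system.
# Latencies below are illustrative assumptions, not Intel specs: a local
# access goes straight to the CPU's own integrated memory controller,
# while a remote access adds one QPI hop to the other socket's controller.

LOCAL_NS = 60        # assumed latency to a socket's own DRAM pool
QPI_HOP_NS = 30      # assumed extra cost of crossing one QPI link

def access_latency(cpu_socket: int, memory_socket: int) -> int:
    """Modeled latency (ns) for cpu_socket reading memory_socket's DRAM."""
    hops = 0 if cpu_socket == memory_socket else 1
    return LOCAL_NS + hops * QPI_HOP_NS

for cpu in (0, 1):
    for mem in (0, 1):
        kind = "local " if cpu == mem else "remote"
        print(f"CPU{cpu} -> socket {mem} memory ({kind}): "
              f"{access_latency(cpu, mem)} ns")
```
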

Here are some key points of the Nehalem microarchitecture with regard to the Intel QuickPath Interconnect and the core's integrated memory controller:-

Intel QuickPath Interconnect

  • Two QPI links will be present per CPU socket in the first implementation of Nehalem. The number of QPI links can be increased or decreased as required for the market designation in future CPU revisions, as QPI is one of the extensible building blocks of the Nehalem CPU architecture and not something tied to the core per se.
  • QPI runs links at up to 6.4 gigatransfers/second, equivalent to a total bandwidth of up to 25.6GB/s per link (the arithmetic is sketched after this list) - vastly faster than AMD's current HyperTransport implementation, though that's not to write HyperTransport off, as its bandwidth can be scaled up as well. Intel's QPI transfer speeds aren't set in stone yet, but the figures listed here show its capabilities.
  • Built-in reliability, availability and serviceability (RAS) features ensure high reliability for QPI. Examples include link-level CRC, self-healing links that reconfigure themselves around error-prone areas to use only the good parts of the link, and an automatic clock re-route function onto data lanes in the event of a clock-pin failure.
  • QPI even has hot-plug capability to support hot-plugging of nodes, such as a processor card.
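
For the curious, here's how the bandwidth figure quoted above adds up. The Python sketch assumes each direction of a link carries two bytes of payload per transfer, with the headline number counting both directions of the bidirectional link:

```python
# QPI bandwidth arithmetic: payload bytes per transfer x transfer rate,
# doubled because each link carries traffic in both directions at once.

def qpi_bandwidth_gbs(gigatransfers: float, bytes_per_transfer: int = 2,
                      directions: int = 2) -> float:
    return gigatransfers * bytes_per_transfer * directions

print(f"{qpi_bandwidth_gbs(6.4):.1f} GB/s total per link")  # 25.6 GB/s
```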

Nehalem's Integrated DDR3 Memory Controller

  • Nehalem processors will come with a new integrated DDR3 memory controller - and not a dual-channel controller, but a tri-channel one! Total memory bus width therefore goes up from 128 bits to 192 bits.
  • The memory controller itself can handle a massive maximum memory bandwidth of 64GB/s. With the controller integrated on the processor, it communicates directly with the physical memory array and thus drastically reduces memory latencies.
  • The controller supports both registered and unregistered memory DIMMs.
  • Supports the DDR3-800, DDR3-1066 and DDR3-1333 JEDEC standards and has room for future scalability, though no additional information was shared on how future standards will be supported. With the memory controller able to handle 64GB/s, a full tri-channel DDR3-1333 implementation only amounts to 32GB/s of maximum bandwidth utilization (the math is worked out after this list); even DDR3-2000 would not max out the controller. We therefore hope future memory standards can be supported with just a simple BIOS microcode update supplying the proper base-clock-to-memory-frequency multipliers.
  • With 3 memory channels per processor, each channel supporting a maximum of 3 DIMMs, do the math and a single processor can support a maximum of 9 memory slots. The minimum would be three - one DIMM per channel. So depending on the class of motherboard, a board can come configured with three, six or nine memory slots. Servers, however, are generally at least 2-way SMP systems, and with two Nehalem-class processors the total memory slots supported doubles to 18!
  • Unlike Intel's FSB-based architecture, where the chipset can be updated to support new memory standards, integrating the DDR3 memory controller into the CPU means a slower progression to newer memory standards, just as with AMD's processors today. That's a downside the industry as a whole will need to adjust to.
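
The bandwidth and slot-count figures in the bullets above are simple arithmetic, worked out below in Python. Per-channel bandwidth is the DDR3 transfer rate multiplied by the 64-bit (8-byte) channel width:

```python
# Worked numbers behind the bullets above: per-channel DDR3 bandwidth is
# the transfer rate times the 8-byte channel width, and total DIMM slots
# scale with channels x DIMMs-per-channel x sockets.

CHANNEL_BYTES = 8        # 64-bit memory channel
CHANNELS = 3             # Nehalem's tri-channel controller
DIMMS_PER_CHANNEL = 3

for mts in (800, 1066, 1333):
    per_channel = mts * CHANNEL_BYTES / 1000          # GB/s
    print(f"DDR3-{mts}: {CHANNELS} x {per_channel:.1f} GB/s "
          f"= {CHANNELS * per_channel:.1f} GB/s total")

for sockets in (1, 2):
    print(f"{sockets}-socket board: up to "
          f"{sockets * CHANNELS * DIMMS_PER_CHANNEL} DIMM slots")
```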

Nehalem's Core and New Tri-Level Cache Structure

For 2008, Intel foresees that the bulk of its shipments in the higher-end processor segment will be quad-core processors, and possibly in 2009 too. As such, quad-core is the main focus, and Nehalem's first iteration will be quad-core as well. However, unlike previous generations where pairs of processing cores shared an L2 cache, Nehalem opts for a design more akin to AMD's Barcelona.

A die shot of the Nehalem processor and its functional blocks (note the modular layout).

This means each processing core has small, dedicated L1 and L2 caches, but shares a common large L3 cache with all the other processing cores. Here's how Nehalem's cache structure stacks up:-

  • L1 Cache per core (32KB Instruction and 32KB Data) - similar to Intel's current Core microarchitecture.
  • L2 Cache per core (256KB, low latency)
  • L3 Cache (8MB, fully shared among all cores) - adopts an Inclusive Cache Policy
Nehalem's new 3-level cache structure.

With Nehalem using an integrated memory controller to interface with memory directly and QuickPath Interconnects for speedy inter-processor communications, Intel no longer has to pile on huge amounts of cache as it did on its high-end FSB-based Xeons (some of which have more than 12MB of L2). Thus Nehalem uses small L1 and L2 caches dedicated to each core, but Intel has still given the processor a generous 8MB of L3 cache - so even though each core has half of Barcelona's L2 cache, the chip has four times its L3 cache. The inclusive cache policy of the L3 further helps minimize snoop traffic: because whatever is in L1 and L2 is also in L3, a miss in the L3 means there is no need to snoop any other cache level (a minimal sketch of this filtering effect follows below). Barcelona, by contrast, adopts an exclusive cache policy, which allows more data to be cached in total - something it needs, given its much smaller 2MB L3 cache. Intel, on the other hand, can clearly afford the die space, given the massive L2 caches it already squeezes into Penryn. In addition to the main cache structure change, Nehalem also incorporates a second-level 512-entry Translation Lookaside Buffer (TLB) to further improve the performance of virtual address translations.
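
Here's a minimal Python sketch of that snoop-filtering effect, using our own simplified cache model (real coherence protocols are considerably more involved):

```python
# Why an inclusive L3 cuts snoop traffic: if a line misses in the shared
# L3, it cannot be in any core's private L1/L2 (inclusion guarantees the
# private caches' contents are also in L3), so no per-core snoop is needed.

class InclusiveL3:
    def __init__(self, num_cores: int):
        self.l3 = set()                                    # shared L3 lines
        self.private = [set() for _ in range(num_cores)]   # per-core L1/L2

    def fill(self, core: int, line: int) -> None:
        self.private[core].add(line)
        self.l3.add(line)          # inclusion: L3 mirrors the private caches

    def snoops_needed(self, line: int) -> int:
        # An incoming request checks L3 first; an L3 miss filters all snoops.
        return len(self.private) if line in self.l3 else 0

cache = InclusiveL3(num_cores=4)
cache.fill(core=0, line=0xABC)
print(cache.snoops_needed(0xABC))   # 4 -> line may be in a core, snoop them
print(cache.snoops_needed(0xDEF))   # 0 -> L3 miss, no core snoops at all
```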

To keep a modular structure that scales processor designs easily, note that the L3 cache is not part of the main core, but an additional building block of the processor. Likewise, the cores, QPI blocks and the integrated memory controller are all building blocks that make up the base design of the Nehalem processor. The slide from Intel below illustrates the use of these building blocks to scale processor designs, comparing the expected 4-core processor with a possible 8-core processor. Note that Intel can even integrate a graphics core into the CPU if it so wishes, though no other details were shared beyond this possibility. Nehalem isn't expected to arrive until much later in the year, so there's a long way to go for further details to emerge on the integrated graphics option (most probably a counter to AMD's Fusion strategy).

The scalable and modular design of Nehalem's microarchitecture.

Deep Diving into Nehalem's Core

So we've shown you the most important aspects of Nehalem's platform and microarchitecture: Intel's QPI, the CPU's integrated DDR3 memory controller and the new tri-level cache hierarchy. Equally important, though, are the supporting changes to the actual processing blocks of the CPU. Nehalem still features the 4-issue-wide execution engine per core that first debuted in Conroe (wider than the competition's 3-issue design), but builds upon the last Core microarchitecture iteration (Penryn) with several enhancements to increase efficiency further:-

  • Increased Parallelism:- Since Intel's current processors use an out-of-order execution architecture, one way to further increase parallelism is to enlarge the out-of-order window, allowing more instructions to be analyzed in flight so that more independent operations can execute in parallel, potentially hiding latencies. In the Core architecture, up to 96 micro-ops can be in flight at any one point in time; Nehalem increases this window size by 33%, so it can scrutinize up to 128 micro-ops. Intel also enlarged the scheduler and other relevant buffers to match.
  • More Efficient Algorithms:- With a new microarchitecture, Intel took the chance to enhance some of its algorithms for speedier handling of branch mispredictions and better load-store scheduling. Nehalem also features improved hardware prefetchers and better hardware for handling unaligned cache accesses and synchronization primitives. Motion estimation during video encoding is a common example of where unaligned cache accesses occur, while synchronization primitives pertain to synchronizing threads in multi-threaded software.
  • Enhanced Branch Prediction:- In addition to more efficient algorithms for speedier handling of branch mispredictions, Nehalem implements a second-level branch target buffer (BTB) said to be especially useful for very large code footprints, such as databases. The second-level BTB helps reduce performance penalties by not only predicting the path of a branch, but also caching the information used by the branch; if a misprediction occurs, the core can quickly roll back and use the cached information for the other path. Another hardware addition is the renamed return stack buffer (RSB), which tracks return addresses and helps avoid common return-instruction mispredictions.
  • Simultaneous Multithreading (SMT):- Intel's Hyper-Threading technology is making a comeback, but it will be known as SMT, and Intel claims it is an enhanced version of its former self. If you compare the purpose of HT on the hardware of its day with SMT on the latest multi-core processors, Intel does have a point: the concept is the same, but the optimization differs. Back when processors were single-core, HT was introduced to increase performance and efficiency by trying to execute two threads simultaneously on otherwise unused processor resources and registers. Then multi-core processors arrived, and we all know how HT fell out of favour as it often hurt execution more than it helped.
    Fast forward a couple of years and SMT has been designed for multi-core processors; with vastly larger buffers, copious memory bandwidth and the resources to support more processes, Intel figures the time is ripe to rekindle its Hyper-Threading-like feature in Nehalem. SMT doubles the number of threads that can run simultaneously on each core, so the typical quad-core processor that the Nehalem microarchitecture will first debut as can immediately execute up to eight threads simultaneously. Gains still depend on each core's resource availability, which is reflected in Intel's own brief that SMT can deliver 20% to 30% more performance, depending on the application, for just a slight increase in power consumption (see the sketch after this list). So the more threaded the workload or application, the better the gains.
  • Intel SSE4.2:- It's not quite a new SSE standard, but that's Intel's shorthand for its new Application Targeted Accelerators on top of Intel SSE4 support. SSE4 instruction support itself remains identical to that found on Penryn, but Intel has added seven new Application Targeted Accelerators in Nehalem to extend the architecture's capabilities in accelerating string and text processing, such as XML handling. Given how extensively XML is used, from applications to databases, it is an ideal area for performance improvement.
  • Improved Virtualization Performance:- Nehalem also makes a conscious effort to improve performance in virtualized environments, since it will first and foremost debut in the server/workstation arena. To that end, Nehalem will be the first Intel processor to support Intel's Extended Page Table (EPT) feature, which is essentially Intel's version of the Nested Paging feature found in AMD's third-generation (Barcelona) Opteron processors. For those interested, you can read more on EPT in Intel's whitepapers.
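
To put SMT's headline numbers together, here's a small Python sketch, treating Intel's quoted 20% to 30% figure as an assumed per-chip uplift for well-threaded workloads rather than a guaranteed gain:

```python
# SMT doubles the hardware threads per core; Intel's claimed benefit is a
# 20-30% throughput uplift (workload dependent), not a doubling.

def logical_threads(cores: int, smt_ways: int = 2) -> int:
    return cores * smt_ways

def relative_throughput(cores: int, smt_uplift: float) -> float:
    """Chip throughput relative to one non-SMT core (assumed simple model)."""
    return cores * (1.0 + smt_uplift)

quad = 4
print(f"{quad} cores -> {logical_threads(quad)} hardware threads")
for uplift in (0.20, 0.30):
    print(f"assumed {uplift:.0%} SMT uplift -> "
          f"~{relative_throughput(quad, uplift):.1f}x a single non-SMT core")
```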

CPUs in 2010 - The 32nm Troupe: Westmere & Sandy Bridge

The Nehalem microarchitecture sounds rather all-encompassing and seems to solidify Intel's leading position for a couple more years. While that may sound like a sweeping statement against AMD, AMD has clearly been unable to keep up with the blue team for the last couple of years, and given its delays in ramping up processor clock speeds for both the Phenom and third-generation Opterons, 2008 doesn't look any better. It almost seems like the Athlon XP days again, with AMD playing the price war to stay competitive in the lower-end segment. It doesn't take a rocket scientist to reach these conclusions if you analyze, from a neutral perspective, what has been boasted and promised against what has been delivered (both physically and in performance). While we all hope AMD has an ace up its sleeve, realistically that seems to be wishful thinking in the near term, since AMD would most likely have played that trump card already if it had one.

As if Nehalem weren't enough future-gazing, Intel even teased us with its strategies beyond Nehalem. If 2009 is the year Nehalem gets into every segment of Intel's processor offerings, then by the end of that year Intel expects to transition Nehalem production to its 32nm process technology. No details of the process were made known yet, but Intel did drop the codename for Nehalem on 32nm: Westmere. We suppose Westmere will give Intel even more leeway to scale the Nehalem architecture with more functional units than the initial Nehalem processor.

Since Westmere is just a die shrink of Nehalem, the next microarchitecture update happens some time after that, estimated to be in 2010 with Sandy Bridge. As per Intel's Tick-Tock design strategy, Sandy Bridge will utilize the 32nm process technology as well and feature new extensions to the existing instruction set: Intel AVX. Short for Advanced Vector Extensions, it primarily widens the current 128-bit vector instructions to handle 256-bit operations, which should greatly increase floating-point performance in FP-intensive applications (the lane math is sketched below). Here's a summary slide from Intel on what to expect from AVX:-
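
As a quick illustration of where that speed-up comes from, the Python snippet below counts single-precision lanes per vector register at each width:

```python
# Widening vector registers from 128 to 256 bits doubles the number of
# single-precision (32-bit) float lanes, and thus the peak FP operations
# per vector instruction.

def fp32_lanes(register_bits: int) -> int:
    return register_bits // 32

for isa, bits in [("SSE (128-bit)", 128), ("AVX (256-bit)", 256)]:
    print(f"{isa}: {fp32_lanes(bits)} single-precision lanes per register")
```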

Sneak Peek at Larrabee - Intel gets Serious on 3D Graphics

Recall the Intel StarFighter AGP graphics card powered by the Intel i740 graphics chip? Co-developed with Real3D, it was Intel's only real stab at entering the discrete graphics market, back in 1998. Though it fared reasonably, Intel never followed it up and exited the 3D graphics card market shortly after, returning to focus on its core areas of expertise. Ironically, Intel still holds the largest graphics market share by virtue of the integrated graphics engines in its core logic chipsets, which are almost ubiquitous these days.

Ten years later, Intel is developing a new-generation architecture to head back into the visualization market. Visual Computing is what Intel terms it, and the company is really serious this time round. It plans to tackle life-like rendering, HD audio/video processing and physics processing by utilizing a programmable and readily available architecture: an array of simpler Intel Architecture (IA) cores. Intel plans to add a vector computational unit to each of the cores, as well as introduce a vector-handling instruction set. Intel believes its leadership in the total computing architecture of various platforms and its vast software engineering department will help it achieve the goal of creating Larrabee. Based on a flexible computing architecture (similar to Nehalem's building blocks), it can be scaled up or down for various market needs. Here's a slide from Intel showing what the Larrabee processing architecture would look like:-

Also expected in the 2010 timeframe, Larrabee is likely to be a discrete 'GPU'-like offering from Intel. However, it may or may not be a 'graphics card'. For all we know, it may even use Intel's QuickPath Interconnects as a drop-in for an auxiliary socket to communicate with the processors, and may even utilize the system's main memory. Remember, Nehalem's memory controller can handle up to 64GB/s - a fair bit of memory bandwidth even by the standards of today's high-end graphics cards (a rough comparison follows below). However, the question of latencies will be one of the toughest to tackle, and we've not yet seen how QPI handles itself in real life. Still, this idea could be ideal for a lower-end offering from Intel, while higher-end versions could carry dedicated local frame buffers. All of this is just speculation for now, as there's a long way to go.
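
As a rough sanity check on that comparison, here's the bandwidth math in Python. The graphics card below is a hypothetical configuration chosen purely for illustration, not a specific product:

```python
# Putting Nehalem's 64GB/s controller ceiling in context against a
# hypothetical GDDR3 graphics card (assumed config, not a real product).

def mem_bandwidth_gbs(bus_bits: int, effective_mts: int) -> float:
    return bus_bits / 8 * effective_mts / 1000   # bytes/transfer x GT/s

nehalem_ceiling = 64.0                            # quoted controller maximum
gddr3_card = mem_bandwidth_gbs(bus_bits=256, effective_mts=2000)
print(f"Nehalem memory controller ceiling: {nehalem_ceiling:.1f} GB/s")
print(f"Hypothetical 256-bit GDDR3 @ 2000MT/s:  {gddr3_card:.1f} GB/s")
```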

Even though Larrabee uses an array of IA cores, and thus relies on the x86 instruction set and its extensions, it will support industry-standard APIs like Direct3D and OpenGL. Achieving this will be the job of Intel's software development team, which must create tools for development and performance optimization that take D3D and OGL calls from software and games and process them successfully on a standard x86 ISA core. This will be Intel's biggest challenge, since graphics hardware and CPU hardware are designed to tackle very different workloads. Note, though, that while Larrabee is based on an array of x86 cores, the instruction set it supports won't be identical to what current CPUs use, since it will carry further extensions for vector processing and the like, and its structure will be tweaked for the purpose of Visual Computing. If software and game developers target Larrabee's instruction set architecture directly, the gains should be far greater. Intel certainly has its work cut out if it is to succeed in this arena. It has, however, commented that industry developers have already shown keen interest in Larrabee, though it will be a while before firm partnership and development announcements are made, since it's still early days.

In the meantime, we expect AMD's Fusion to be available in 2010 as well (or earlier), and that should further spice up the competition for Intel and NVIDIA (the current visualization leader). AMD's Fusion, however, has thus far been portrayed more as a simple GPU integrated into the CPU, and would likely be an integrated-graphics-chipset replacement rather than a powerful visual computing companion - at least that's what we've been told.

One thing's for certain: 2010 is a year when sparks will fly, as we expect a new level of competition on the CPU and GPU fronts from the big boys of the tech industry. Exciting times await, so start placing your bets and keep a watch on their stocks.
