Intel Core 2 Extreme QX9650 - Entering 45nm

Penryn Features and Enhancements

Process Technology Enhancements

Under Intel's tick-tock strategy, Penryn falls into the 'tick' cycle, where an existing design is shrunk onto a new process. But what does this mean for users? Is Penryn just a 45nm die-shrunk Conroe? Is the upgrade worth it, or should you wait for the next 'tock' cycle, when the next microarchitecture overhaul is supposed to arrive with the Nehalem core?

While not nearly as exciting as the initial release of the Intel Core microarchitecture (à la the Conroe processor), or the circumstances that forced Intel into its current overdrive innovation cycle, calling Penryn just a die-shrunk Conroe would be a grave mistake. The new 45nm process is itself a major improvement that reduces switching power and leakage while improving switching speeds, and it allows Intel to cram more transistors onto the die. The dual-core Penryn die has shrunk to a mere 107mm^2, compared to 143mm^2 for Conroe and 162mm^2 for Presler before it, both on 65nm. Yet Penryn will boast around 410 million transistors, up from Conroe's 291 million. A large chunk of this is due to the increased L2 cache size: Penryn now sports a shared 6MB L2 cache, up from Conroe's 4MB.
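To put the die figures above in perspective, here is a minimal Python sketch that works out transistor density for both dies and compares the gain with an idealized shrink. The "ideal" figure assumes every feature scales linearly from 65nm to 45nm, which is a simplification and not a number from Intel.

```python
# Dual-core die figures as quoted in this article.
dies = {
    "Conroe (65nm)": {"transistors_m": 291, "area_mm2": 143},
    "Penryn (45nm)": {"transistors_m": 410, "area_mm2": 107},
}


def density(die):
    """Return millions of transistors per square millimetre."""
    return die["transistors_m"] / die["area_mm2"]


conroe = density(dies["Conroe (65nm)"])
penryn = density(dies["Penryn (45nm)"])
measured_gain = penryn / conroe

# Assumption: a perfect linear shrink scales area by (45/65)^2,
# so density would rise by the inverse of that factor.
ideal_gain = (65 / 45) ** 2

print(f"Conroe density: {conroe:.2f} M/mm^2")
print(f"Penryn density: {penryn:.2f} M/mm^2")
print(f"Measured gain:  {measured_gain:.2f}x")
print(f"Ideal gain:     {ideal_gain:.2f}x")
```

By these numbers Penryn packs roughly 1.9 times as many transistors per unit area as Conroe, a little short of the idealized figure of about 2.1 times. Part of the difference in transistor count comes from the larger cache, so the comparison is only indicative.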

A 45nm Penryn (or Wolfdale for the desktop) die. Put two of these together and you have a quad-core Yorkfield

The TDP envelope for Penryn hasn't changed, though. Initial desktop processors will feature a 65W TDP for mainstream dual-core processors, 95W for mainstream quad-core and 130W for the Extreme editions. To top off the list of accomplishments, Penryn can also boast of being a 100% lead-free processor.

Intel Core Microarchitecture Enhancements

Besides the new process technology, Penryn processors will also feature some improvements to last year's Core microarchitecture. A summary of these enhancements was covered in our Penryn performance preview article (the chart is also reproduced below). As you can see, Intel has delivered enhancements to every aspect of the Core microarchitecture, so we'd like to focus on the key improvements and what you can probably expect from them.

Core microarchitecture enhancements. Note that the power features are only available on mobile processors, which are not covered in this article.

Most of the enhancements benefit only specific workloads. The Fast Radix-16 Divider, for example, improves the performance of divide operations, which are used heavily in scientific and mathematically intensive software. There is also a beefed-up virtualization engine on the processor, which can potentially speed up virtual machine transitions by up to 75%. Again, this is a usage-specific improvement that will only benefit a select group of users.
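To illustrate why a higher radix helps, the sketch below models division as a digit recurrence that retires a fixed number of quotient bits per step. It is a hypothetical software model, not Intel's circuit: real dividers pick each digit with dedicated hardware rather than a trial loop, and the radix-4 comparison point is an assumption made here for illustration.

```python
def divide_digit_recurrence(dividend, divisor, radix_bits, width=32):
    """Unsigned division retiring `radix_bits` quotient bits per step.

    Returns (quotient, remainder, steps). Illustrative model only.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    if width % radix_bits != 0:
        raise ValueError("width must be a multiple of radix_bits")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in width bits")

    radix = 1 << radix_bits
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - radix_bits, -1, -radix_bits):
        # Bring down the next group of dividend bits.
        group = (dividend >> shift) & (radix - 1)
        remainder = (remainder << radix_bits) | group
        # Pick the largest digit whose multiple of the divisor still fits.
        digit = 0
        while digit + 1 < radix and (digit + 1) * divisor <= remainder:
            digit += 1
        remainder -= digit * divisor
        quotient = (quotient << radix_bits) | digit
        steps += 1
    return quotient, remainder, steps


radix4 = divide_digit_recurrence(1_000_000, 7, radix_bits=2)
radix16 = divide_digit_recurrence(1_000_000, 7, radix_bits=4)
print("radix-4 :", radix4)
print("radix-16:", radix16)
```

Both calls produce the same quotient and remainder, but the radix-16 version needs half as many steps for a 32-bit operand (8 instead of 16), which is the basic idea behind the faster divider.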

The across-the-board performance increase will of course come from the larger 6MB L2 cache, and Intel has further improved cache and memory management with a 24-way associative L2 cache, enhanced cache split line loading and immediate store-to-load capabilities. The Penryn processors are also built to be ready for another FSB speed increment from 1333MHz to 1600MHz, so when that change happens, users should see another automatic bump in performance across the board.
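As a rough worked example of the cache and FSB figures above, the following Python sketch derives the number of L2 sets and the peak theoretical bus bandwidth. The 64-byte cache line, the 64-bit (8-byte) bus width and Conroe's 16-way associativity are assumptions brought in for the calculation; they are not stated in this article.

```python
MB = 1024 * 1024
LINE_BYTES = 64      # assumed cache line size
FSB_BUS_BYTES = 8    # assumed 64-bit wide front side bus


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets = capacity / (associativity * line size)."""
    return size_bytes // (ways * line_bytes)


def fsb_peak_gb_per_s(mega_transfers_per_s, bus_bytes=FSB_BUS_BYTES):
    """Peak theoretical bandwidth in GB/s (decimal gigabytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes / 1e9


penryn_sets = cache_sets(6 * MB, ways=24)
conroe_sets = cache_sets(4 * MB, ways=16)   # 16-way is an assumption

print("Penryn L2 sets:", penryn_sets)
print("Conroe L2 sets:", conroe_sets)
print(f"FSB 1333: {fsb_peak_gb_per_s(1333):.2f} GB/s")
print(f"FSB 1600: {fsb_peak_gb_per_s(1600):.2f} GB/s")
```

Under these assumptions both caches have the same number of sets, so the extra 2MB comes entirely from the added ways. The move from a 1333MHz to a 1600MHz FSB raises peak bandwidth by about 20%, though real-world gains depend on how bus-bound the workload is.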

However, the main feature improvement in Penryn is the new SSE4 instruction set and the Super Shuffle Engine. Dubbed the “most significant media instruction set architecture advancement since 2001”, SSE4's new instructions focus on two major categories: media acceleration and string (text) processing. SSE4 has the potential to offer very significant performance boosts in graphics, video processing, 3D imaging and data compression algorithms, to name a few. However, unlike the universal performance gains from, say, a larger cache, applications must first be optimized to take advantage of the SSE4 enhancements.

You can check out the Intel white paper if you're really interested in the full details, but in short, these are the SSE4 features that can be found in Penryn:

  • Adding support for two different vectored 32-bit integer multiply operations.
  • Introducing 8-bit unsigned min/max operations, plus 16-bit and 32-bit signed and unsigned versions.
  • Introducing features to improve the compiler’s ability to vectorize integer and single-precision code more efficiently.
  • Adding highly specialized operations that can provide significant application-level gains, including video encode acceleration functions, floating-point dot product operations, 3D content creation and a streaming load instruction.
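To make a few of these operations concrete, here is a minimal scalar model in Python of what some SSE4.1 instructions compute on a four-lane vector. The function names mirror the instruction mnemonics (PMULLD, PMULDQ, PMINUD, PMAXSD, DPPS), but the code is only an illustration of the arithmetic: it is not how applications use SSE4 (that is done through compiler intrinsics or assembly), and the dot product here uses Python's double-precision floats rather than true single-precision rounding.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value, bits=32):
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pmulld(a, b):
    """Four 32x32 multiplies, keeping the low 32 bits of each product."""
    return [to_signed(x * y) for x, y in zip(a, b)]


def pmuldq(a, b):
    """Signed 32x32 -> 64-bit multiplies on lanes 0 and 2 only."""
    return [to_signed(a[i]) * to_signed(b[i]) for i in (0, 2)]


def pminud(a, b):
    """Per-lane minimum, treating lanes as unsigned 32-bit values."""
    return [min(x & MASK32, y & MASK32) for x, y in zip(a, b)]


def pmaxsd(a, b):
    """Per-lane maximum, treating lanes as signed 32-bit values."""
    return [max(to_signed(x), to_signed(y)) for x, y in zip(a, b)]


def dpps(a, b, imm8):
    """Dot product: high 4 bits of imm8 pick the lanes to multiply and
    sum, low 4 bits pick which result lanes receive the sum."""
    total = sum((a[i] * b[i] for i in range(4) if imm8 & (0x10 << i)), 0.0)
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


a = [100000, -3, 7, 65536]
b = [100000, 5, -7, 65536]
print("pmulld:", pmulld(a, b))   # low halves wrap around on overflow
print("pmuldq:", pmuldq(a, b))   # full 64-bit products, lanes 0 and 2
print("dpps  :", dpps([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 0xF1))
```

The two multiply variants differ in what they keep: the first truncates each product to 32 bits so all four lanes fit in the result, while the second produces only two results but preserves the full 64 bits of each.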

Accompanying the new SSE4 instructions, Penryn will also feature a Super Shuffle Engine. With the Core microarchitecture, Intel introduced full 128-bit wide SSE execution units that enabled 128-bit SSE instructions to be executed in a single cycle. This year, the Super Shuffle Engine enhances SSE algorithms further with a 128-bit shuffle unit that can execute full-width shuffles in one cycle. Not limited to SSE4, the Super Shuffle Engine will reduce latency and improve the speed of a wide range of SSE instructions that involve shuffle operations.
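For readers unfamiliar with shuffles, the sketch below models a full-width 128-bit byte shuffle in Python, following the behaviour of the existing PSHUFB instruction: each byte of the control mask selects which source byte lands in that position, or zeroes it if the top bit is set. This is only a software model of the operation and says nothing about how the Super Shuffle Engine is built; the byte-swap mask is our own example.

```python
import struct


def pshufb(src, mask):
    """Model a 16-byte shuffle: result[i] = src[mask[i] & 0x0F],
    or 0 when bit 7 of mask[i] is set."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("src and mask must both be 16 bytes long")
    return bytes(0 if m & 0x80 else src[m & 0x0F] for m in mask)


# Example mask: reverse the bytes inside each of the four 32-bit lanes,
# i.e. an endianness swap of four integers in one operation.
BSWAP32_MASK = bytes([3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12])

values = (0x11223344, 0xAABBCCDD, 0x00000001, 0xDEADBEEF)
little_endian = struct.pack("<4I", *values)
swapped = pshufb(little_endian, BSWAP32_MASK)

print(little_endian.hex())
print(swapped.hex())
```

A single shuffle rearranges all 16 bytes at once, which is why operations such as format conversion, byte swapping and data reordering in media code stand to benefit when the full-width shuffle completes in one cycle.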