NVIDIA GeForce 8800 GTX / GTS (G80) - The World's First DX10 GPU
Developed in secrecy for four years, the G80 graphics core comes alive in the form of the GeForce 8800 series. Boasting a new unified architecture, NVIDIA's Lumenex engine and the first DirectX 10 GPU ready for whatever Windows Vista and its games have in store, the flagship GeForce 8800 GTX is a showstopper!
By Vijay Anand
A New Generation for a New DirectX Era
The GPU and graphics card industry has always been a busy scene, but never has it been as engaging and intense as over the past year; it has perhaps even outdone the bitter CPU wars of yesteryear, as the paper launches from time to time attest. Considering the dizzying variety of GPU SKUs as well as the several grades of pre-overclocked/bundled graphics cards, this has been a bumper year for graphics card upgrades for a wide variety of folks, from HTPC users to demanding gamers to the elite overclockers who go all out to set world records. On the performance ranking scales, NVIDIA chalked up better overall standings in the low-end, mid-range and mainstream performance segments thanks to well-balanced graphics card models below the US$300 mark. Beyond that, though ATI had plenty of voids to fill in its lineup, its most recent refresh has seen the X1950 PRO, X1900 XT, X1950 XT and X1950 XTX claim the performance crowns for their price categories. For the red team, this was a much-needed comeback, albeit one that again arrived late for the parts that general DIY users and enthusiasts alike were waiting for. ATI may have claimed the fastest single-GPU graphics card title, but it still doesn't hold the record for the fastest 'single' graphics card. Costing US$500 or more, NVIDIA's GeForce 7950 GX2 dual-slot SLI-combo graphics card holds the performance record for any single graphics card at the moment.
Despite the strengths and weaknesses of ATI's and NVIDIA's top-end solutions, the question on many enthusiasts' minds was whether they should even bother splurging on cards this expensive. Their performance and associated costs were never in question; rather, their introduction came within months of Microsoft's next generation operating system, Windows Vista. Apart from its fancy new GUI and interactivity, Vista's most talked about technical aspect has been its new driver handling model and Direct3D 10 specification (DirectX 10). Seven years in the making, the DirectX 10 standard has finally arrived and, as we all know by now, it is available exclusively on Windows Vista. Microsoft intends to push it strongly as the next API of choice for games, applications and even Vista's own operating system components to interface with the graphics card. While Vista is compatible with older DirectX standards (even for the Aero interface) and thus with existing applications, games and hardware, what has made people reconsider a hefty graphics card upgrade in recent months is that forthcoming games coded for DirectX 10 aren't backward compatible with older standards; they will work solely with the DirectX 10 API and thus require DirectX 10 compliant graphics hardware as well. It's interesting how software has begun to dictate hardware capabilities, where once upon a time things progressed in the opposite direction. Standardization of features and capabilities is key here, allowing new games and applications to move forward without compatibility issues and catapulting development based on common standards; thus the new DirectX 10 API model.
When one is contemplating spending US$400 or more on a hardware item, it's only natural to expect a certain degree of future proofing. With the existing cream of the crop, such as ATI's Radeon X1950 XTX and NVIDIA's GeForce 7950 GX2, bearing only DirectX 9.0c compliance, that's bound to give you second thoughts about splurging when Vista's around the corner.
Behold, the GeForce 8800 graphics processor - the first DirectX 10 compliant GPU.
Fortunately this mental tussle becomes moot after the 8th of November 2006 as NVIDIA delivers the world's first shipping DirectX 10 compliant graphics cards - the GeForce 8800 GTX and GeForce 8800 GTS, based on the fresh new G80 GPU architecture. Having been on the rumor boards for what seems like an eternity, the G80 was speculated to have some really intriguing specs, and we weren't the least bit surprised since it's associated with the DirectX standard bearing the roman numeral X. What did fascinate us is how NVIDIA delivered the G80 GPU well ahead of ATI's equivalent part, and that even though the G80 has been in the making for four years, its true capabilities remained largely unknown until only days ago. Read on as we decipher NVIDIA's design choices for the G80, how it relates to the DirectX 10 spec, what DirectX 10 means to the consumer and, above all, how this new breed of graphics cards fares in today's usage context.
Embracing a Unified Shader Architecture
One of the most prominently cited improvements in next generation graphics engines is the unification of the traditionally fragmented and linear shader processing units in the GPU. In conventional GPU pipelines of the DirectX 8/9 era, which practically encompass all modern graphics engines, there are two major shader units that take the brunt of the initial processing workload - the vertex shader and the pixel shader. The vertex shader manipulates vertex information and then passes the data to the triangle setup engine to map the vertices in 3D space (forming the 3D object). The pixel shader is the next stage in line; it fills the triangle fragments that make up the 3D object with individual colored pixels, each of which has undergone complex computations such as lighting models or other transformation effects to derive the final pixel color. So far, the vertex and pixel shaders have had different hardware feature sets and consequently work on distinct instruction sets.
In recent times, with game developers embracing the programmable nature of Shader Model (SM) version 2 and then version 3, games have become more and more complex, requiring greater processing power. To meet those demands, GPU designs have kept increasing the number of rendering pipelines available, which normally encompass the vertex and pixel shader units. The number of shader units in a GPU is a really big deal and plays a heavy role (among other factors) in defining whether a GPU is positioned as a low, mid or high-end model. In fact, ATI was so certain that pixel shaders would play an even stronger role once complex shader programs were embraced in a big way that its Radeon X1K series of GPUs was heavily skewed toward pixel shader processing power. Thus, with the rise of the programmable graphics pipeline, these shader units play a major role as pinpointed above and occupy a significant portion of the GPU die. Herein lies an issue: with the increasing number of vertex and pixel shader units packed into the GPU, there are bound to be many occasions when a number of these resources sit idle.
Depicted below by NVIDIA are two directly opposite workloads run on a fictitious traditional GPU to better illustrate the situation at hand. The first scene has a very complex 3D mesh (naturally requiring thousands of vertices), resulting in a very high polygon count. This is a case where the vertex shader is heavily used to build the scene whereas the pixel shader is sparingly used, at least while the scene is being constructed. This could take a while, as performance is bottlenecked by the few vertex units available. The scene below that, with the water body, looks realistic and nice, but if you switch to the wireframe view it would look sparse and simple with a very low polygon count, while the pixel shader performs complex calculations to render the scene accurately, carefully coloring each pixel. It could have several pixel shader programs operating simultaneously to handle the water, waves, sky, lighting, refraction and more. Here we have little use for the vertex shader units but could use more resources on the pixel shader front. From these two extreme scenarios (which aren't far-fetched in reality), you can see that the traditional architecture with a fixed set of vertex and pixel shader units is highly inefficient at times, leading to performance bottlenecks. Furthermore, as NVIDIA points out, it's not efficient from a power (performance/watt) or die size and cost (performance/sq-millimeter) perspective either.
Inefficient use of vertex and pixel shader units in a traditional GPU architecture and its associated 'net performance' in these worst-case workloads.
Thus the proposal of the unified shader architecture, which uses an array of shader processors not specifically tuned for either pixel or vertex work, but able to handle both instruction types and tackle both tasks simultaneously, assigning processors to each kind of task on the fly. As depicted below, this greatly boosts the efficiency of the shader units and increases overall performance.
Efficient use of all shader processors in a unified architecture and its associated 'net performance' in this ideal scenario.
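To put some rough numbers on the idea, here's a quick back-of-the-envelope sketch in C++. The 8-vertex/24-pixel split, the 32-processor unified pool and the workload ratios are figures we made up purely for illustration; they aren't the specs of any real GPU.

```cpp
#include <algorithm>
#include <cstdio>

// Back-of-the-envelope utilization estimate. The unit counts and workload
// splits below are invented purely for illustration; they are not NVIDIA's
// figures for any real GPU.
int main() {
    const int vertexUnits = 8, pixelUnits = 24;        // fixed-function split
    const int unifiedUnits = vertexUnits + pixelUnits; // unified pool of 32

    // Fraction of a frame's shader work that is vertex work in two extreme
    // scenes: a dense high-polygon mesh vs. a pixel-heavy water scene.
    const double vertexShare[] = { 0.80, 0.10 };
    const char*  sceneName[]   = { "geometry-heavy scene", "pixel-heavy scene" };

    for (int i = 0; i < 2; ++i) {
        double v = vertexShare[i], p = 1.0 - v;
        // Fixed split: the scarcer unit type limits throughput, the rest idle.
        double fixedThroughput = std::min(vertexUnits / v, pixelUnits / p);
        // Unified pool: all 32 processors can work on whatever is pending.
        double unifiedThroughput = unifiedUnits;
        std::printf("%s: fixed design sustains ~%.0f units' worth of work, "
                    "unified sustains %.0f (%.1fx better)\n",
                    sceneName[i], fixedThroughput, unifiedThroughput,
                    unifiedThroughput / fixedThroughput);
    }
    return 0;
}
```

The point of the exercise: in the fixed design, whichever unit type is scarce for a given scene caps throughput while the other type idles, whereas the unified pool simply soaks up whatever mix of work arrives.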
ATI was the first of the two graphics giants to ship a GPU with a unified shader architecture - the ATI Xenos GPU used in the Xbox 360 console. Despite the rumors floating around the Internet that ATI's next generation PC graphics part, the R600, would utilize the unified shader architecture model, NVIDIA actually surprised those outside the development world by delivering the PC industry's first unified shader architecture, and the first ever to be fully DirectX 10 compliant at that. In fact, when ATI first announced that its next ground-up GPU would take the unified route, NVIDIA played it coy, replying that it didn't see the need to venture down that path just yet - at least not when things were going fine, so why 'fix' it? It turns out that this was just to play down ATI's announcement while NVIDIA worked on the secretive G80 project, which was destined to use this very unified architecture.
NVIDIA's Unified Architecture - Part 1
For a unified shader architecture to be realized, a few prerequisites had to be satisfied, as it's not a simple re-organization of shader units. Referring back to the traditional GPU architecture highlighted on the earlier page:-
- The vertex and pixel shader units of the traditional GPU design had differing feature sets that complied with earlier Shader Model and DirectX standards. These had to be standardized for the common shader unit, which would assume a multipurpose role in a unified shader architecture. Microsoft's DirectX 10 specification covers that area and we'll detail it later.
- Secondly, because of the vertex and pixel shader units' differing abilities in the traditional pipeline, coupled with the old pipeline hierarchy where data flows linearly, separate instruction sets are used. To better facilitate the move to a unified shader architecture, the unified instruction set mandated by Shader Model 4.0 in the DirectX 10 specification has to be adopted (more on this later).
- Thirdly, there should be an efficient dispatcher and load balancer governing how the shader processing units in the unified model are kept fed.
Since NVIDIA's G80 GPU adheres to the DirectX 10 spec, it has the above-mentioned requirements covered, but here's how NVIDIA accomplished it. Gone are the conventional and limited fixed function vertex and pixel shader units; these have been replaced with 128 unified stream processors. Don't be intimidated by the term stream processor (which can just as easily be referred to as a general purpose shader processor). For quite a while, graphics processors have been functioning like stream processors, albeit in a more constricted manner. Since the G80 elevates all its shader processors to equal standing, they have become general floating-point processors that can tackle vertices, pixels, geometry, physics and more. Here's how the G80 GPU architecture stands now:-
Completely unlike past GPU designs, the unified, massively parallel shader design of the G80 GPU is geared for the next generation operating system, API and games. It will most likely be a while before its potential is fully realized, but it has other enhancements to ensure it stays on top (for now at least). The top half corresponds to the NVIDIA Unified Architecture while the bottom half represents the new Lumenex Engine, which we'll touch upon later.
NVIDIA's Unified Architecture - Part 2
So how does this unified stream processor array compare against existing GPUs? Current GPUs are based on vector processing units because many graphics operations work on vector data. Even so, scalar computations are bound to be present even in vector-optimized code. After much shader code analysis during the design of the GeForce 8800 GPU, NVIDIA found that scalar operations are increasingly prevalent in many modern, lengthy shader programs. This finding prompted NVIDIA to equip the GeForce 8800 GPU with 128 scalar processors instead of the 32 4-component vector processors used on its previous high-end GPUs. Also notable is that scalar computations are difficult to execute efficiently on a vector pipeline, whereas vector-based shader code can be converted to scalar operations and run without penalty on the GeForce 8800. For the record, a single instruction issued to a vector processor can operate on multiple data elements concurrently, while a scalar processor handles one data item at a time (hence the lower processor count on older GPU designs versus the increased quantity on the GeForce 8800). Due to the nature of current shader program code, the change in shader processor type and the high efficiency of the shader processors on the GeForce 8800, NVIDIA claims it can deliver up to twice the performance on existing DirectX 9 applications. A bold claim indeed, and we'll see to what extent that is achievable.
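A simple way to picture the scalar-versus-vector argument is to count issue slots and lanes. The little C++ sketch below uses a hypothetical shader with a made-up mix of float4 and scalar operations; the numbers aren't from any real shader, but the lane-utilization arithmetic is the gist of NVIDIA's reasoning.

```cpp
#include <cstdio>

// Conceptual sketch only: a 4-wide vector ALU issues one instruction across
// four lanes per cycle, while a scalar ALU handles one component per cycle.
// The instruction mix below is hypothetical; only the utilization math matters.
int main() {
    // Suppose a shader issues 10 operations: 4 on full float4 vectors and
    // 6 on single floats (dot-product results, scalar lighting terms, etc.).
    const int vec4Ops = 4, scalarOps = 6;

    // On a 4-wide vector unit, each instruction takes one issue slot regardless
    // of how many lanes carry useful data.
    int vectorUnitCycles = vec4Ops + scalarOps;          // 10 cycles
    int usefulLanes      = vec4Ops * 4 + scalarOps * 1;  // 22 of 40 lanes busy
    std::printf("vector unit: %d cycles, %d/%d lanes doing useful work (%.0f%%)\n",
                vectorUnitCycles, usefulLanes, vectorUnitCycles * 4,
                100.0 * usefulLanes / (vectorUnitCycles * 4));

    // Scalar processors break float4 work into four component operations, but
    // every issue slot carries useful work, so no lanes sit idle.
    int scalarIssueSlots = vec4Ops * 4 + scalarOps * 1;  // 22 slots, all useful
    std::printf("scalar processors: %d issue slots, 100%% of them useful\n",
                scalarIssueSlots);
    return 0;
}
```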
The entire array of 128 unified stream processors is organized into groups of 16 stream processors, each group having its own set of Texture Addressing units, Texture Filtering units (also known as texture mapping units, or TMUs for short) and associated cache, allowing it to function independently. Here's a close-up of one of these groups:-
A close-up of a functioning group of stream processors and its associated supporting units.
Eight such groups make up the flagship GeForce 8800 GTX graphics card, while a toned down version with six such groups active in the G80 core corresponds to the GeForce 8800 GTS. There are other features that differentiate them, but not to worry as we've a proper comparison page later in the article. To manage all of these stream processors and ensure they are optimally used, NVIDIA has incorporated a thread manager of sorts called GigaThread Technology, which supports thousands of executing threads in flight, much like ATI's Ultra-Threading Dispatch Processor in the Radeon X1K series.
Besides better utilization of the shader processors, alleviating the vertex or pixel shader bottlenecks of earlier architectures, another reason the unified architecture was developed is to perform more than just the standard roles of conventional vertex and pixel shaders. As mentioned above, because the stream processors in the GeForce 8800's unified shader architecture take on a more generalized floating-point processing role with a generalized feature set, they can tackle other data processing functions as well. In addition, the unified shader model breaks away from the traditional linear pipeline (vertex --> triangle setup --> pixel --> ROP --> memory). It allows the unified shader processors to perform multiple shader operations on a data set (e.g. vertices): the results of the initial processing pass are fed back to the top of the shader array to be dispatched again, processed again (perhaps for pixel shading this time), looped to the top again, and the cycle continues until the processed data is ready to enter the raster operations (ROP) stage and finally be passed on to the frame buffer for scan out. Because of this processing model's flexibility, the GeForce 8800's unified architecture can dynamically allocate processing power to crunch vertex, pixel, geometry, physics and, in the future, other forms of workloads, all running in parallel.
The unified architecture conceptual schematic.
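For the conceptually inclined, here's a tiny C++ model of that recirculating flow. The stages, ordering and scheduling are grossly simplified and are purely our illustration, not NVIDIA's actual GigaThread scheduler; the takeaway is simply that one common pool of processors is handed whatever stage of work is pending, and results loop back to the dispatcher until they're ready for the ROPs.

```cpp
#include <queue>
#include <cstdio>

// Purely conceptual model of the recirculating unified pipeline described
// above. Stage names and ordering are simplified for illustration.
enum Stage { Vertex, Geometry, Pixel, ReadyForROP };

struct WorkItem { int id; Stage stage; };

Stage nextStage(Stage s) {
    switch (s) {
        case Vertex:   return Geometry;
        case Geometry: return Pixel;
        default:       return ReadyForROP;
    }
}

int main() {
    std::queue<WorkItem> dispatcher;
    for (int i = 0; i < 3; ++i) dispatcher.push({i, Vertex});

    while (!dispatcher.empty()) {
        WorkItem w = dispatcher.front(); dispatcher.pop();
        // Any free stream processor could run this step; there is no fixed
        // vertex-only or pixel-only hardware in the unified model.
        std::printf("stream processor runs item %d at stage %d\n", w.id, w.stage);
        w.stage = nextStage(w.stage);
        if (w.stage == ReadyForROP)
            std::printf("item %d leaves the shader array for the ROP stage\n", w.id);
        else
            dispatcher.push(w);   // loop back to the top of the shader array
    }
    return 0;
}
```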
Geometry and physics are the newly defined workloads made possible by the 128 streaming processors on the GeForce 8800 GTX working in conjunction with GigaThread technology. Geometry processing capability is required by the Shader Model 4.0 standard and we'll soon touch on that in our overview of DirectX 10. Not much is known yet of the physics processing capability, but it's a bonus of the unified architecture which NVIDIA has fondly coined its Quantum Effects Technology. How this will be put to use and under which API has not been discussed in detail yet, but you can rest assured that with both NVIDIA and ATI adopting generalized arrays of shader processors to do their bidding, we believe it will be much more palatable for game developers to engineer the next level of interactive and visually pleasing environmental effects this way than to build games around the Ageia physics processor.
The stream processors on the GeForce 8800 series are driven by their own high-speed clock, which differs from the core clock that drives the rest of the GPU. The GeForce 8800 GTX has a core clock of 575MHz while its stream processors operate at 1.35GHz - yet another reason why NVIDIA claims high shader throughput, benefiting the overall performance of the card. The lesser GeForce 8800 GTS version has core and shader clocks of 500MHz and 1.2GHz respectively.
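Those clocks let us estimate peak programmable shading throughput. The assumption below that each stream processor retires one multiply-add (two floating-point operations) per clock is ours, made for illustration; it is not an official NVIDIA figure.

```cpp
#include <cstdio>

// Rough shader-throughput estimate from the clocks quoted above. The
// 2 FLOPs/clock (one multiply-add) per stream processor is our assumption
// for illustration, not an official NVIDIA number.
int main() {
    struct Card { const char* name; int streamProcessors; double shaderClockGHz; };
    const Card cards[] = {
        { "GeForce 8800 GTX", 128, 1.35 },
        { "GeForce 8800 GTS",  96, 1.20 },
    };
    const double flopsPerSPPerClock = 2.0;  // one MAD = a multiply and an add

    for (const Card& c : cards) {
        double gflops = c.streamProcessors * c.shaderClockGHz * flopsPerSPPerClock;
        std::printf("%s: %d SPs x %.2f GHz x %.0f FLOPs/clock = %.1f GFLOPS (MAD only)\n",
                    c.name, c.streamProcessors, c.shaderClockGHz,
                    flopsPerSPPerClock, gflops);
    }
    return 0;
}
```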
NVIDIA's Lumenex Engine - The AA Engine gets an Update
From the GeForce 8800's GPU functional diagram, we've thus far covered the NVIDIA Unified Architecture segment that takes up most of the top half of the schematic. The other half of the GPU deals with image quality and previously went by the name of the Intellisample engine, but on the GeForce 8800 it has been given a notable upgrade and a new name to set it apart from its predecessor.
Here's how NVIDIA generally partitions their GPU; the top half is the prior discussed new NVIDIA Unified Architecture while the bottom half corresponds to their updated image quality engine - the Lumenex Engine.
The Lumenex engine is essentially the new anti-aliasing (AA), anisotropic filtering (AF), high dynamic range (HDR) and display engine. Since AA is involved, you can be sure the traditional raster operator (ROP) engines do the needful to smooth out all the jagged edges you see in your games. The GeForce 8800 GTX has 24 ROP units while the GeForce 8800 GTS features 20, still more than the 16 found on earlier single-GPU configurations. The usual multi-sampling AA (MSAA) and supersampling AA (SSAA) are present on the GeForce 8800 series, but in addition there's now a new method called Coverage Sampling AA (CSAA). According to NVIDIA, it uses a new algorithm based on intelligent coverage information to deliver high-quality anti-aliasing without the large performance hit associated with it in the past. Basically, it applies AA by analyzing a large number of subsamples to determine coverage while storing less color/Z information for those samples, hence the name Coverage Sampling AA. With 16x AA selected, the CSAA technique analyzes 16 subsamples but only stores four values of color/Z information. 16x MSAA analyzes 16 subsamples and stores color/Z information for all 16 of them. SSAA does everything MSAA does but also processes and stores the texture samples for every subsample, which is why SSAA is extremely bandwidth hungry and seldom used.
Normally you wouldn't use more than four sub-samples for anti-aliasing unless you happen to have a high-end SLI or CrossFire setup. Now, with the CSAA technique, you can use 16x CSAA to obtain higher quality gaming without much more of a performance hit than normal 4x MSAA. NVIDIA puts the figure at roughly 10 to 20% more taxing than 4x MSAA on the GeForce 8800 series, which is not bad at all since these cards have plenty of processing power to spare. However, don't expect drastic improvements in image quality, as 4x MSAA at high resolution generally irons out the more obvious jaggies, leaving only minor irritants. That's where 16x CSAA comes in to ensure better smoothing without much performance loss, though it is of course still a small compromise in quality compared against true 16x MSAA.
In total, CSAA introduces three new AA modes: 8x, 16x and 16xQ (where 16x stores four subsamples' color/Z information while 16xQ stores eight of them). On the MSAA side, an 8xQ mode has been added. All four of these new modes are available on a single GeForce 8800 series graphics card. Previously, you had to run SLI to be offered such high AA levels (based on MSAA), but with the advancements made to the AA engine and the AA techniques supported on the GeForce 8800, a single card is all you need for ultra high quality gaming.
All the anti-aliasing modes available on a single GeForce 8800 graphics card. Take note that 8x, 16x and 16xQ utilize the CSAA technique, while 8xQ is a new MSAA entry.
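To see why 16x CSAA is so much cheaper than true 16x MSAA, it helps to tally the per-pixel storage each mode implies. The sketch below assumes 32-bit color and 32-bit Z per stored sample plus a one-bit-per-sample coverage mask for CSAA; these formats are our simplification, not NVIDIA's exact on-chip layout, but they capture the proportions.

```cpp
#include <cstdio>

// Rough per-pixel storage comparison of the AA modes described above.
// The 32-bit colour / 32-bit Z sample formats and the one-bit coverage mask
// are our assumptions for illustration, not NVIDIA's exact representation.
int main() {
    const int bytesPerColour = 4, bytesPerZ = 4;

    auto bytesPerPixel = [&](int storedSamples, int coverageSamples) {
        int colourZ  = storedSamples * (bytesPerColour + bytesPerZ);
        int coverage = (coverageSamples + 7) / 8;   // 1 bit per coverage sample
        return colourZ + coverage;
    };

    std::printf("4x MSAA  : %d bytes/pixel (4 colour/Z samples)\n",     bytesPerPixel(4, 0));
    std::printf("16x CSAA : %d bytes/pixel (16 coverage, 4 colour/Z)\n", bytesPerPixel(4, 16));
    std::printf("16xQ CSAA: %d bytes/pixel (16 coverage, 8 colour/Z)\n", bytesPerPixel(8, 16));
    std::printf("16x MSAA : %d bytes/pixel (16 colour/Z samples)\n",     bytesPerPixel(16, 0));
    return 0;
}
```

With those assumptions, 16x CSAA lands only a couple of bytes per pixel above 4x MSAA, while true 16x MSAA costs roughly four times as much, which is in line with NVIDIA's claim of a modest performance penalty.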
The usual method of applying AA is via the game's interface or by forcing it through the graphics drivers. However, there's now another option, "Enhance Application Setting". For games that have AA options or capability built in, NVIDIA highly recommends this option. This is because some games apply AA intelligently only where required or where it's been flagged, rather than blindly throughout the entire scene, which naturally saves bandwidth and processing cycles in return for better performance or power efficiency. If you aren't satisfied with the AA options available in-game, you can 'enhance' the AA level by selecting the "Enhance Application Setting" option in the driver control panel and setting your desired value. The game still applies AA smartly, but at the level of AA you require. Take note that the new CSAA anti-aliasing technique specifically requires this mode of operation.
Enhance Application Setting is the newest method of specifying how AA is applied in your game and it's also the recommended method to make use of the new CSAA technique.
For games that do not feature any AA controls and lack built-in AA, users can fall back on the usual option to override application settings and force the desired number of AA sub-samples.
As usual, all AA methods are compatible with NVIDIA's secondary anti-aliasing technique, Transparency anti-aliasing, which allows anti-aliasing of alpha textures as well. For those who would like a recap of transparency anti-aliasing and the kind of image quality and performance impact it poses, we've covered that in an earlier article.
Lumenex Engine (continued)
Angle-Independent Anisotropic Filtering
Besides raising the anti-aliasing standards, the Lumenex engine also updates the GPU's ability to perform angle-independent anisotropic filtering - which finally puts NVIDIA on par with ATI's Radeon X1K in anisotropic filtering (AF) image quality. NVIDIA graphics cards prior to the GeForce 8800 were unfortunately not capable of this (to save processing cycles) and were stuck with angle-dependent AF, where full filtering quality was applied only to surfaces at certain orientations, such as those parallel or perpendicular to the ground. This caused certain textures to 'shimmer'. It's not something that would make or break an average user's upgrade/buying decision as it's not readily noticeable. However, ardent gamers and enthusiasts have a very keen eye and for them, this wasn't favorable. Thankfully, the GeForce 8800 series finally rectifies this shortcoming.
High Dynamic Range Lighting
High dynamic range (HDR) lighting is all the rage in the latest games as it offers some really nice visual effects, portraying very bright light sources and very dark regions accurately, as in real life, without suppressing them. Current game engines embrace 64-bit HDR (assigning 16 bits per color component, including the alpha channel). While that suffices today, NVIDIA is taking no chances and has upgraded this to support 128-bit precision for an even more accurate representation of HDR (which NVIDIA calls True HDR). That's now 32-bit floating-point precision for each color component. We don't expect this to be of significance anytime soon as it will be some time before game engines embrace 128-bit HDR, so for the moment we can't quite relate how much better it would look than the already swell looking 64-bit HDR.
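The storage implication is easy to work out. The sketch below compares the two RGBA formats; the 1600x1200 render target is just an example resolution we picked to show the memory cost doubling.

```cpp
#include <cstdio>
#include <cstdint>

// Storage cost of the HDR formats mentioned above: FP16 (64-bit) versus FP32
// (128-bit) per RGBA pixel. The 1600x1200 render target size is only an
// example resolution chosen for this illustration.
struct RGBA16F { std::uint16_t r, g, b, a; };   // 16 bits per component
struct RGBA32F { float         r, g, b, a; };   // 32 bits per component

int main() {
    const long long pixels = 1600LL * 1200LL;
    std::printf("64-bit HDR : %zu bits/pixel, %.1f MB per 1600x1200 target\n",
                sizeof(RGBA16F) * 8, pixels * sizeof(RGBA16F) / (1024.0 * 1024.0));
    std::printf("128-bit HDR: %zu bits/pixel, %.1f MB per 1600x1200 target\n",
                sizeof(RGBA32F) * 8, pixels * sizeof(RGBA32F) / (1024.0 * 1024.0));
    return 0;
}
```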
Here's another big equalizer that NVIDIA scored with the GeForce 8800 series - HDR with anti-aliasing is finally available! Note that we said "equalizer" as ATI's Radeon X1K series has been able to handle this since it was announced last year. That said, apart from the high-end Radeon X1950 series, the rest of the lineup doesn't possess the raw processing power to let anyone enjoy fluent gaming with both HDR and AA combined; the entire Radeon X1K series had the option, but most of the cards weren't really up to it anyway. The GeForce 8800 GTX and GTS, on the other hand, are plenty fast and it's only right that they are capable of handling this.
Early Z Technology & ROP
Another notable quality of the Lumenex engine is a full 10-bit display pipeline, which again brings it on par with ATI's Radeon X1K series, offering up to 1 billion unique colors - if and when 10-bit displays ever reach the mainstream. Last but not least, Early Z technology is present on the GeForce 8800 series for improved rendering efficiency. As you know, the Z-buffer (otherwise known as the depth buffer) stores information on which pixels in a scene are visible and which aren't. Z-Cull is a method that uses the Z-buffer information to remove non-visible pixels at a high rate, but this happens rather late in the rendering pipeline, at the rasterization stage. So all this while, the GPU has had to process useless data that may never be rendered to screen. To save on this workload, which consequently improves overall performance or conserves power, Early Z technology tests each pixel's Z value even before it enters the pixel shading stage and removes unnecessary pixels.
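Here's a small conceptual sketch of the early-Z idea, purely for illustration (it models the depth test in plain C++ rather than anything resembling NVIDIA's hardware): fragments that fail the depth test are thrown away before any pixel shading work is spent on them.

```cpp
#include <cstdio>
#include <vector>
#include <cfloat>

// Conceptual sketch of early-Z: test a fragment's depth against the Z-buffer
// *before* running the expensive pixel shading, so occluded fragments are
// discarded without wasting shader cycles. The fragment list is made up.
int main() {
    const int width = 4, height = 1;
    std::vector<float> zbuffer(width * height, FLT_MAX);   // depth buffer

    struct Fragment { int x, y; float depth; };
    // Two fragments land on the same pixel; the nearer one arrives first.
    Fragment frags[] = { {0, 0, 0.2f}, {0, 0, 0.7f}, {1, 0, 0.5f} };

    int shadedFragments = 0;
    for (const Fragment& f : frags) {
        float& storedZ = zbuffer[f.y * width + f.x];
        if (f.depth >= storedZ) {
            std::printf("fragment at (%d,%d) depth %.1f rejected by early-Z\n",
                        f.x, f.y, f.depth);
            continue;                       // never reaches the pixel shader
        }
        storedZ = f.depth;
        ++shadedFragments;                  // only visible fragments get shaded
        std::printf("fragment at (%d,%d) depth %.1f shaded\n", f.x, f.y, f.depth);
    }
    std::printf("%d of 3 fragments actually ran the pixel shader\n", shadedFragments);
    return 0;
}
```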
Speaking of Z processing, which occurs in the raster operators (ROPs), the GeForce 8800 GTX has six ROP partitions, each capable of processing 4 pixels (or 16 subsamples) per clock. That's a total output of 24 pixels/clock for color and Z processing. The GeForce 8800 GTS has one less, at five ROP partitions, for a peak output of 20 pixels/clock for color and Z processing.
Memory Subsystem
Once the pixel data has been completely processed at the ROP stage, it is passed on to the frame buffer for scan out to the display device. To communicate with the frame buffer, the GPU naturally has memory controllers, and the GeForce 8800 series has the most unusual controller count and total memory bus width of any GPU in recent history.
The GeForce 8800 GTX has six standard 64-bit memory controllers (which support DDR1, DDR2, DDR3 and DDR4 memory), which equates to a total memory bus width of 384 bits - not your usual power-of-two value such as 128-bit or 256-bit. NVIDIA evaluated the cost-to-performance aspects of various memory bus widths and, at the moment, the 384-bit memory bus is a fair compromise between the two, which is a logical take in our view. With this odd memory controller count, the GeForce 8800 GTX is also associated with an unusual frame buffer size of 768MB. Do the math, however, and it works out: six memory controllers interfacing with a total of twelve 512Mbit x32 memory devices gives a grand total of 768MB of GDDR3 memory (clocked at 1.8GHz DDR). The toned down GeForce 8800 GTS counterpart has five memory controllers, which gives the card a memory bus width of 320 bits and a total frame buffer size of 640MB (clocked at 1.6GHz DDR). Though unusual numbers, these specs are definitely better than any other solo-GPU graphics card on the market at the moment.
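Those odd-looking numbers fall straight out of the controller counts and clocks quoted above, as this quick worked example shows (the only assumption is the two x32 chips per 64-bit controller, which follows from the twelve devices across six controllers mentioned for the GTX):

```cpp
#include <cstdio>

// Working out the unusual frame buffer sizes and bandwidth figures quoted
// above from the controller counts and memory clocks in the text.
int main() {
    struct Card {
        const char* name;
        int controllers;            // 64-bit memory controllers
        double effectiveClockGHz;   // DDR effective data rate
    };
    const Card cards[] = {
        { "GeForce 8800 GTX", 6, 1.8 },
        { "GeForce 8800 GTS", 5, 1.6 },
    };

    for (const Card& c : cards) {
        int busWidthBits   = c.controllers * 64;        // 384-bit / 320-bit
        int deviceCount    = c.controllers * 2;         // two x32 chips per controller
        int frameBufferMB  = deviceCount * 512 / 8;     // 512Mbit devices -> 64MB each
        double bandwidthGB = busWidthBits / 8.0 * c.effectiveClockGHz;
        std::printf("%s: %d-bit bus, %d x 512Mbit chips = %dMB, %.1f GB/s\n",
                    c.name, busWidthBits, deviceCount, frameBufferMB, bandwidthGB);
    }
    return 0;
}
```

The output reproduces the 768MB/86.4GB/s and 640MB/64.0GB/s figures in the spec table later in this article.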
DirectX 10 Compliancy - Shader Model 4.0
You could say that a lot of the GeForce 8800's core features revolve around making it a true DirectX 10 compliant graphics card, but the converse is equally true, as NVIDIA has been working with Microsoft for a long time to fine tune the specification. In fact, the GeForce 8800 is not only the first shipping DirectX 10 GPU, but also the reference GPU for DirectX 10 API development and certification. This alone says a lot about the GeForce 8800's adherence to Microsoft's latest DirectX standards, and you can thus expect a smooth experience for DirectX 10 class applications and games (if all goes well on the developers' side).
Recent DirectX standards have largely evolved alongside advances in graphics processing and the new Shader Model standards that accompany them, and DirectX 10 is no different as it introduces a fourth generation Shader Model standard. Since the GeForce 8800 is a fully DirectX 10 compliant GPU, it meets all the hardware requirements specified. Interestingly, the much-discussed unified shading architecture of the GeForce 8800 is actually not a requirement of Shader Model 4.0 (SM 4.0). What it primarily requires for compliancy is:-
- A new programmable stage called the geometry shader,
- A unified instruction set and common resources for all shaders (vertex, pixel, geometry)
- And more shader resources than the previous SM standard to support the new functionality offered in SM 4.0.
The unification of the instruction set among the shaders makes it easier for programmers to write code since they no longer have to treat each shader type as a separate entity. To facilitate this, all the shaders have to take on similar specifications across the board, such as the number of instruction slots, registers, other resources and limits. So in essence, a DirectX 10 GPU may still have fixed function vertex, geometry and pixel shaders, as long as the shaders are identically equipped in features and given access to more resources.
However, with all the shader processors having a similar feature set and sharing the same resources, it made more sense to combine them into a common array of shader processors and dynamically assign each processor the workload required to crunch. This unified architecture also opened the gateway for the GPU to process more than just the specified vertex and pixel workloads of the past, but also tackle geometry and even physics. It doesn't just stop there as there are even other forms of data processing that can take place on the G80 GPU that have not been defined yet. Thus the requirement to feature a unified instruction set has also indirectly promoted the use of unified shader architecture (intentionally or unintentionally).
Here's how the shader model specs and characteristics stack up for Shader Model 3.0 and the newly established version 4.0:-
Shader Model Specs | Shader Type | SM 3.0 (DX9.0c) | SM 4.0 (DX10)
Instruction Slots | Vertex | 512 (min) | 65K
Instruction Slots | Pixel | 512 (min) | 65K
Constant Registers | Vertex | 256 | 16 x 4096
Constant Registers | Pixel | 224 | 16 x 4096
Temporary Registers | Vertex | 32 | 4096
Temporary Registers | Pixel | 32 | 4096
No. of Inputs | Vertex | 16 | 16
No. of Inputs | Pixel | 10 | 32
Render Targets | - | 4 | 8
No. of Textures Supported (Texture Samplers) | Vertex | 4 | 128
No. of Textures Supported (Texture Samplers) | Pixel | 16 | 128
2D Texture Size | - | 2K x 2K | 8K x 8K
Int / Load Ops | - | - | Yes
Derivatives Function | - | - | Yes
Flow Control | Vertex | Static / Dynamic | Dynamic
Flow Control | Pixel | Static / Dynamic | Dynamic
(In SM 4.0 the instruction slot, register, texture and flow control limits are unified, so the same values apply to the vertex and pixel rows.)
DirectX 10 Compliancy - Geometry Shader
The other major requirement and a highlight of SM 4.0 is the presence of a geometry shader. Since the GeForce 8800 uses a unified shader architecture, it doesn't physically have such units; logically, this shader stage sits between the vertex and pixel shader stages. The geometry shader takes the output from the vertex shader and, for the first time ever, allows the creation or destruction of vertices (data amplification or data minimization) without the intervention of the CPU. Traditionally, altering an existing 3D model would require heavy intervention from the CPU, which hampers the ability to perform such operations effectively in real time.
The logical pipeline of a DirectX 9 graphics card.
The new logical pipeline of a DirectX 10 graphics card.
With the geometry shader, the CPU is relieved of this task and real-time alteration of the model can take place many times faster on the GPU. A supporting feature that makes all this possible is called stream output. Basically, stream output allows data output from the vertex or geometry shaders to be written directly to buffers in graphics memory without having to pass through the rest of the graphics rendering pipeline. The written output can then be dispatched again to the shader processors for further processing and advanced shader effects. Stream output is a more generalized form of the "render to vertex buffer" feature that we touched on in the days of the G70 and NV40. Geometry shaders together with stream output allow complex geometry processing and GPU-based physical simulation with little CPU overhead. An important application of these benefits is realistic character animation and facial expressions - all possible thanks to the new DirectX 10 spec. Such processing can be done in software on the CPU, but the outcome would be nowhere near as fast as on the GPU, which has much stronger floating-point performance. Thus, GPU-based geometry shaders help shift certain geometry processing from the CPU to the GPU for much better performance.
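To make 'data amplification' concrete, here's a CPU-side C++ analogue of what a geometry shader stage does: one primitive goes in, several come out, and the application never has to touch the new vertices. Real geometry shaders are written in HLSL and run on the GPU; the midpoint-subdivision scheme below is just an example we chose for illustration.

```cpp
#include <vector>
#include <cstdio>

// CPU-side analogue of geometry-shader "data amplification": a stage that
// takes one primitive in and can emit several primitives out. This only
// mirrors the idea; it is not how a real GPU geometry shader is written.
struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };

static Vec3 midpoint(const Vec3& p, const Vec3& q) {
    return { (p.x + q.x) / 2, (p.y + q.y) / 2, (p.z + q.z) / 2 };
}

// One triangle in, four triangles out - new vertices are created without the
// application (the CPU side in a real engine) ever seeing them.
static std::vector<Triangle> geometryStage(const Triangle& t) {
    Vec3 ab = midpoint(t.a, t.b), bc = midpoint(t.b, t.c), ca = midpoint(t.c, t.a);
    return { {t.a, ab, ca}, {ab, t.b, bc}, {ca, bc, t.c}, {ab, bc, ca} };
}

int main() {
    Triangle input = { {0, 0, 0}, {1, 0, 0}, {0, 1, 0} };
    std::vector<Triangle> output = geometryStage(input);
    // In DirectX 10 the emitted primitives could be captured via stream output
    // and fed back through the shaders for another pass.
    std::printf("1 input triangle amplified into %zu output triangles\n", output.size());
    return 0;
}
```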
Some of the other visual quality enhancements mandated by the DirectX 10 spec include support for 128-bit HDR, increased render target support allowing more complex shaders to be used, and vastly improved instancing support, which can now create variations of the original object through texture arrays and render targets with reduced CPU intervention. Finally, vertex texturing, which was formerly used only with vertex shaders, is now extended to the geometry shader to perform displacements that modify the vertex positions of objects as well as create new shapes, forms and geometry data. Many of these are improved variations of the more limited implementations found in the DirectX 9.0c standard, which we've explained in our earlier coverage and which would be a good reference if you aren't familiar with them.
DirectX 10 Compliancy - Efficiency Enhancements & Gelling with Windows Vista
Besides all the visual enhancements made possible with the new DirectX 10 specification, it's also far leaner, with reduced CPU overhead for managing and altering GPU resources and calls. New features have been added to reduce CPU intervention, such as texture arrays that allow many textures to be stored in an array structure, giving shader programs direct access to them instead of having the CPU manage multiple textures. Predicated draw, a technique that prevents redundant overdraw by drawing simple box approximations of complex objects to test for occlusion before the full object is drawn or discarded, is now fully processed on the GPU, unlike in past DirectX standards, which required CPU intervention. This particular feature complements the usual hardware Z-culling measures. The third is stream output, which has been discussed earlier.
DirectX 10 has also gone leaner because it no longer supports earlier DirectX standards, which means there's no backward compatibility. Thus there are no capability bits in DirectX 10 to poll to find out what the hardware supports. As we know, DirectX 10 will debut with Windows Vista, and since Vista adopts a fresh new graphics driver model known as the Windows Display Driver Model (WDDM), as well as new driver models for just about everything else, DirectX 10 is architected to work hand-in-hand with this driver model. This, besides starting fresh and lean, is one of the main reasons why DirectX 10 forwent backward compatibility, as it would be very difficult to maintain the differing DirectX capabilities under the new driver model architecture. However, Windows Vista will also ship with an older DirectX library, tweaked for Vista's operating system and driver model. Known as DirectX 9.0L ("L" for Longhorn), it is the equivalent of DirectX 9.0c in Windows XP, and it is further backward compatible with older DirectX standards such as 8.0 and earlier. This ensures that Windows Vista is compatible with a vast variety of hardware, though it requires a minimum of DirectX 9.0 class hardware to enable its fancy Aero interface among other advanced features.
With this information on Windows Vista and its DirectX support, here's how Vista will handle the following combinations of games and graphics hardware:-
Scenario / Characteristics | Game Type | Graphics Hardware | DirectX Interface Used |
Scenario 1 | DX8 | DX8 class or newer | DX9.0L |
Scenario 2 | DX9 | DX9 class or newer | DX9.0L |
Scenario 3 | DX10 | DX10 | DX10 |
Specs:- GeForce 8800 GTX and GTS Compared
With all the pages of technology discussed that encompass the GeForce 8800 GPU, just how complex is it? Hold your breath and read this:- six hundred and eighty one million transistors. Yes, you read that right, it's really 681 million transistors. That's far beyond any mainstream graphics processor or desktop CPU ever made; in fact, it has enough transistors to perhaps challenge the Itanium server processor. What's more interesting is that it's still fabricated on a 90nm process technology. You've got to wonder what NVIDIA's yield rate is on these monster GPUs, but as with all public relations talk, it's all too rosy to believe. Both variants of the GeForce 8800 series use the same G80 GPU, so perhaps the lower binned chips are earmarked for the GeForce 8800 GTS cards. Here are the full specs of the GeForce 8800 graphics cards discussed and how they stack up against some of their predecessors and relevant comparisons:-
Model | NVIDIA GeForce 8800 GTX 768MB | NVIDIA GeForce 8800 GTS 640MB | NVIDIA GeForce 7950 GX2 1GB | NVIDIA GeForce 7900 GTX 512MB | NVIDIA GeForce 7950 GT 512MB | ATI Radeon X1950 XTX 512MB |
Core Code | G80 | G80 | G71 | G71 | G71 | R580+ |
Transistor Count | 681 million | 681 million | 2 x 278 million | 278 million | 278 million | 384 million |
Manufacturing Process (microns) | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 | 0.09 |
Core Clock | 575MHz | 500MHz | 500MHz | 650MHz | 550MHz | 650MHz |
Vertex Shaders | 128 Stream Processors (operating at 1350MHz) | 96 Stream Processors (operating at 1200MHz) | 2 x 8 | 8 | 8 | 8 |
Rendering (Pixel) Pipelines | (unified stream processors) | (unified stream processors) | 2 x 24 | 24 | 24 | 16 |
Pixel Shader Processors | (unified stream processors) | (unified stream processors) | 2 x 24 | 24 | 24 | 48 |
Texture Mapping Units (TMU) or Texture Filtering (TF) units | 64 | 48 | 2 x 24 | 24 | 24 | 16 |
Raster Operator units (ROP) | 24 | 20 | 2 x 16 | 16 | 16 | 16 |
Memory Clock | 1800MHz DDR3 | 1600MHz DDR3 | 1200MHz DDR3 | 1600MHz DDR3 | 1400MHz DDR3 | 2000MHz DDR4 |
DDR Memory Bus | 384-bit | 320-bit | 2 x 256-bit | 256-bit | 256-bit | 256-bit |
Memory Bandwidth | 86.4GB/s | 64.0GB/s | 76.8GB/s | 51.2GB/s | 44.8GB/s | 64.0GB/s |
Ring Bus Memory Controller | NIL | NIL | NIL | NIL | NIL | 512-bit (for memory reads only) |
PCI Express Interface | x16 | x16 | x16 | x16 | x16 | x16 |
Molex Power Connectors | Yes (dual) | Yes (dual) | Yes | Yes | Yes | Yes |
Multi GPU Technology | Yes (SLI) | Yes (SLI) | Yes (SLI, Quad SLI) | Yes (SLI) | Yes (SLI) | Yes (CrossFire) |
DVI Output Support | 2 x Dual-Link | 2 x Dual-Link | 2 x Dual-Link | 2 x Dual-Link | 2 x Dual-Link | 2 x Dual-Link |
HDCP Output Capable? | Yes | Yes | Yes | No - vendor dependent | Yes | Yes |
Street Price | US$599 (SRP) | US$449 (SRP) | ~ US$500 - 570 | ~ US$410 - 470 | ~ US$285 - 320 | US$449 |
The more appropriate query for enthusiasts is whether there will be enough G80 GPUs churned out to hold NVIDIA's suggested retail prices. The GeForce 8800 GTX is slated to go for US$599 and the GeForce 8800 GTS for US$449. These figures are probably what some of you dished out to obtain the GeForce 7950 GX2 or the GeForce 7900 GTX once upon a time, so it's really interesting to see these aggressive price points, which actually eat into the aforementioned GeForce 7 series SKUs. At the time of writing, here's the updated suggested retail price list for all of NVIDIA's graphics cards:-
Graphics Card Models | Suggested Retail Price ($US) |
GeForce 8800 GTX 768MB | $599 |
GeForce 8800 GTS 640MB | $449 |
GeForce 7950 GT 512MB | $299 |
GeForce 7900 GS 256MB | $199 |
GeForce 7600 GT 256MB | $159 |
GeForce 7600 GS | $129 |
GeForce 7300 GT | < $99 |
Photo Gallery: GeForce 8800 GTX and GTS Compared
The graphics card at the top is a typical GeForce 8800 GTS while the bottom is the flagship GeForce 8800 GTX. The smaller GTS version's card length is typical of any other high-end graphics card such as the GeForce 7900 GTX and isn't any longer than it. The lengthier GeForce 8800 GTX owes its extended figure to a more robust power circuitry design and is 26.5cm long. So if you're gunning for the top edition, check your chassis clearance first.
These new gems are proper dual-slot graphics cards and have dual dual-link DVI connectors with HDCP compliant outputs, just as we expected. TV-output is still available via the 7-pin mini-DIN connector.
Watch out, the GeForce 8800 GTX requires two PCIe power connectors and that's mandatory. If you don't connect both, an onboard buzzer will sound to alert you. With the card's already extended length, it's a good thing both connectors are angled upwards. A minimum 450W power supply unit is recommended for the GTX version.
The slightly slower GeForce 8800 GTS brother requires a single PCIe power connector. A 400W power supply unit is recommended.
SLI is of course supported on these GeForce 8800 beasts, and the GTX variant comes with a pair of SLI gold fingers. We suppose it's for quad SLI support by means of daisy chaining multiple cards, or to dedicate a third card to physics processing, but these are just educated guesses for now as NVIDIA doesn't plan to disclose more at this point. If you do get a pair of these cards, you can use either connector to enable SLI.
Removing the huge cooler, you'll see that its heatsink base is thick, huge and heavy, with multiple thermal pads cooling various components on board. Thankfully the cooler is so silent, you can't tell if it's functioning!
The GeForce 8800 GPU, codenamed G80, is the amazing brain of this graphics card. Its huge package measures 43mm x 43mm and the die is hidden under an integrated heat spreader. The last time we saw a heat spreader on an NVIDIA GPU was on the NV30 and NV35, so you can take a hint about the heat output, but the cooler does a fair job, and quietly at that.
The memory chips' markings are almost worn off, but these are Samsung 1.1ns GDDR3 memory chips. That's a 900MHz clock rate for a net 1.8GHz DDR speed. Not the same speed as the GDDR4 used on ATI's X1950 XTX, but combined with the ultra-wide memory controller, it gives the GeForce 8800 GTX a whopping 86.4GB/s of bandwidth.
Test Setup
Now let's get down to the numbers. Since the vast majority of consumers will still be using Windows XP, we've used Windows XP Professional with Service Pack 2 as our benchmarking environment. The testbed this time round was overhauled to an Intel Core 2 Duo E6700 (2.67GHz) processor with 2GB of DDR2-800 memory running on Intel's D975XBX 'Bad Axe' motherboard. The following graphics cards were lined up to gauge the new GeForce 8800 stalwarts:-
- NVIDIA GeForce 8800 GTX 768MB (ForceWare 96.94)
- NVIDIA GeForce 8800 GTS 640MB (ForceWare 96.89)
- NVIDIA GeForce 7900 GTX 512MB - SLI (ForceWare 93.71)
- NVIDIA GeForce 7950 GX2 1GB (ForceWare 93.71)
- NVIDIA GeForce 7900 GTX 512MB (ForceWare 93.71)
- NVIDIA GeForce 7950 GT 512MB (ForceWare 93.71)
- ATI Radeon X1950 XTX 512MB - CrossFire (Catalyst 6.10)
- ATI Radeon X1950 XTX 512MB (Catalyst 6.10)
The most crucial comparison is how the GeForce 8800 cards fared against the similarly priced GeForce 7900 GTX, Radeon X1950 XTX and GeForce 7950 GX2 graphics cards respectively. Dual graphics card solutions of the current high-end were thrown into the mix for reference only and are not meant to be seriously compared against the GeForce 8800 single-GPU graphics cards; besides, a dual graphics card combo would far exceed the price envelope of any single GeForce 8800 graphics card.
Here then are the benchmarks we used to gather the performance results presented in the following pages; new to the group here is Company of Heroes:-
- Futuremark 3DMark05 Pro (version 1.2.0)
- Futuremark 3DMark06 Pro (version 1.0.2)
- Tom Clancy's Splinter Cell 3: Chaos Theory (version 1.3)
- F.E.A.R
- FarCry (version 1.33)
- Company of Heroes (version 1.2)
- Chronicles of Riddick: Escape from Butcher Bay (version 1.1)
- Quake 4 (version 1.2)
Results - 3DMark05 Pro & 3DMark06 Pro
In our Futuremark benchmarks, the GeForce 8800 GTS fared as well as the GeForce 7950 GX2, while the GeForce 8800 GTX bettered the GeForce 7900 GTX SLI pair and even gave the Radeon X1950 XTX pair a tough fight in 3DMark06. When using FSAA and HDR, our pool of comparisons shrank, leaving only the NVIDIA cards to be pitted against the Radeon X1950 XTX. Overall, the results were looking quite rosy so far, with a reasonable lead over the single-GPU graphics cards.
Results - Splinter Cell 3: Chaos Theory (DirectX 9 Benchmark)
Analyzing the first set of results without FSAA, the GeForce 8800 GTS isn't any better than the existing single-GPU graphics cards and fails to match the performance figures of the GeForce 7950 GX2 by a large margin. The bigger brother GeForce 8800 GTX, on the other hand, manages to perform much better. With FSAA involved, the GTS again didn't stand in good light, but the GTX fared better.
Results - F.E.A.R (DirectX 9 Benchmark)
F.E.A.R. didn't see the kind of marked improvement, like the magic 50% figure, that we've been expecting across most of our benchmarks. Rather, both cards obtained roughly a 30% speedup over their respective comparison cards, which is still a good step up. However, they did meet our expectations in matching the performance of the GeForce 7950 GX2 and the GeForce 7900 GTX SLI respectively.
Results - FarCry (DirectX 9 Benchmark)
** Updated as of 17th November 2006 **
Our original published scores for this game test had HDR missing for the ATI cards and thus they garnered much higher performance than expected. We've since remedied this and have updated the page to reflect its true performance.
Once more, we found the GeForce 8800 GTS neck and neck with the GeForce 7950 GX2 that it's replacing, while the GeForce 8800 GTX surpasses all previous graphics card records. Oddly, the ATI Radeon X1950 XTX didn't match its usual levels of performance, not even with CrossFire.
Results - Company of Heroes (DirectX 9 Benchmark)
In the latest game in town, Company of Heroes, ATI's CrossFire performance is the exact opposite of what we saw in the earlier FarCry benchmark - a stark contrast that casts questions on CrossFire's compatibility. The GeForce 7950 GX2, on the other hand, didn't quite scale to our level of expectations with FSAA. There's a possibility that multi-GPU support in this game hasn't been quite ironed out yet. For now, we'll have to wait for more driver updates and/or patches to affirm the state of this brand new game. The results returned by the GeForce 8800 pair are fine though.
Results - Chronicles of Riddick: EFBB (SM 2.0+ OpenGL Benchmark)
Performance in Chronicles of Riddick was spot-on to our expectations at non-FSAA settings. However, with FSAA, we find the GeForce 8800 GTS not as strong a performer as it was without FSAA.
Results - Quake 4 (SM 2.0+ OpenGL Benchmark)
In our final game test, Quake 4 too ranked the GeForce 8800 GTS as almost on equal footing with the GeForce 7950 GX2, while the GeForce 8800 GTX went on to take on the GeForce 7900 GTX SLI pair at a much lower price tag.
Power Consumption
NVIDIA lists the maximum TDP of the GeForce 8800 series at about 140W at peak, which clearly puts these cards at the top of the graphics card power guzzlers. Instead of just grappling with the quoted figures, we measured our total test system power draw and that's graphed below. The GeForce 8800 GTX tops the list for a single graphics card, but then again it does deliver a lot more performance, which is quite justifiable if you ask us. The GTS version performs in the ballpark of the GeForce 7950 GX2 most of the time, and so does its power consumption.
Temperature Testing
The GeForce 8800 graphics cards are no doubt among the fastest cards there are now, with the GTX variant the king of the ring judging by the results we've shown you so far. They are also quieter than a whisper, but does that mean they run extremely hot? Not at all, as the graph below shows. The results speak for themselves and the default cooler does a splendid job in our opinion.
GeForce 8800 - Yet Another Showstopper from the Leader
Well, the first ever DirectX 10 GPU and graphics card has arrived ahead of Windows Vista, and Microsoft can finally and proudly show the world what kind of hardware is available on the market to harness the DirectX 10 API - all thanks to NVIDIA. No games are out yet that solely harness the new DX10 feature set, but you can be sure that game developers are using NVIDIA's GeForce 8800 hardware for development, and games boasting it should hit the shelves in the first half of next year if all works out smoothly. So although it will be a while before we can accurately tell if the GeForce 8800 does a swell job at what it was designed to do, the good thing is that NVIDIA's newfangled graphics card isn't catered only to DirectX 10 and works just as well on current DX9 games and applications, or for that matter anything below DX10.
Under the hood, a host of new features have come into place, like the unified shader architecture and GigaThread technology working hand-in-hand with the new shader processor array to enable processing of geometry and even physics on the GPU. In the image quality department, we see some genuinely new tweaks like 128-bit HDR, 16x AA on a single GPU and new AA techniques, while others, like the ability to use HDR and AA at the same time, are more about playing catch-up with the competition. We can honestly say that the GeForce 8800 graphics processor has a lot of potential in it, but much of it won't be evident until Windows Vista is in full force and the right games take advantage of the flexibility offered by DirectX 10 to unleash an immersive and entertaining gaming environment.
Fortunately, the enhancements and tweaks within the GPU are not only beneficial to Windows Vista; as we can testify immediately, even on the current gaming platform of choice (Windows XP), the GeForce 8800 GTX makes a quantum leap in performance. For the record, the GeForce 8800 GTX is the fastest single graphics card ever and is speedy enough to easily replace a pair of GeForce 7900 GTX cards running in SLI. To put things in perspective within the single graphics card realm, the GeForce 8800 GTX is 30% more powerful than the 'pseudo dual-card' GeForce 7950 GX2 and up to 70% speedier than the GeForce 7900 GTX. Take note that for roughly the same price envelope at which the GeForce 7950 GX2 first debuted, the GeForce 8800 GTX offers more performance in addition to its vast hidden potential waiting to be tapped as Windows Vista and its applications/games mature. About the only gripes we have are its high power consumption (though negated by high performance as well) and its extremely lengthy profile. In fact, it's a card made to ensure top performance, and it delivers that; no questions asked. The GeForce 8800 GTS on the other hand didn't set any speed records, but at its SRP of US$449, it's an excellent replacement for the outgoing GeForce 7900 GTX and GeForce 7950 GX2 graphics cards in the same price bracket, matching the performance of the GeForce 7950 GX2 for the most part.
Considering the complexity of the GPU and questionable availability in large quantities, prices may take a climb if NVIDIA's rosy statement of sufficient stock doesn't hold true. Apart from that, it's hats off to NVIDIA for pulling off another successful hard launch of a significant product that's paving the way for how we work and play. Forecasting a little ahead, we can easily foresee a refresh of the GeForce 8800 graphics cards once NVIDIA and TSMC have managed to deliver the next, smaller process technology. For the GeForce 8800 series to become more economical and go mainstream, that's perhaps the only way forward. For those of you with big bucks who wish to stay ahead with the latest hardware now, it doesn't get any better than the GeForce 8800 series, and we suspect it's going to stay that way even sometime past Christmas. Speaking of which, you're going to need a very long stocking if you hope to receive a GeForce 8800 GTX come Christmas morning!