Below is an overview of the graphics architecture on the GTX 200 series of GPUs. It is not a radical departure from NVIDIA's first-generation unified shader design. At the top, a thread dispatch logic unit, together with the setup/raster units, takes the incoming work and assigns it to the texture processing clusters, each of which contains a set of general purpose processors capable of executing any type of thread, whether pixel, vertex, geometry or compute. The ROPs and the memory interface units make up the rest of the architecture.
As you can see, the performance of this thread scheduler directly determines whether the GPU is kept busy or whether processors sit idle. According to NVIDIA, the scheduler can keep over 30,000 threads in flight simultaneously, up from approximately 12,000 on the GeForce 8/9 series, while also improving scheduling efficiency to nearly 100%. The GPU can also switch between threads with very low latency.
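NVIDIA's headline thread count falls out of the per-SM limits. A quick sanity check, assuming the per-SM in-flight thread limits from NVIDIA's CUDA documentation (1024 for the GTX 200 architecture, 768 for the GeForce 8/9 series), which are not stated in this article:

```python
# Back-of-the-envelope check of the "over 30,000 threads" figure.
# Per-SM thread limits are assumed from NVIDIA's CUDA documentation.

def max_threads_in_flight(num_sms: int, threads_per_sm: int) -> int:
    """Upper bound on threads the scheduler can track at once."""
    return num_sms * threads_per_sm

gtx280 = max_threads_in_flight(num_sms=30, threads_per_sm=1024)
g80 = max_threads_in_flight(num_sms=16, threads_per_sm=768)

print(gtx280)  # 30720 -- "over 30,000"
print(g80)     # 12288 -- roughly the 12,000 quoted for GeForce 8/9
```

Both quoted figures line up with simple multiplication over the SM counts of each architecture.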
Just as CPU manufacturers race to squeeze more cores onto a single die, a similar process is underway for GPUs. The difference is that the general purpose processing cores forming the basis of the unified shader architecture in both ATI's and NVIDIA's current generation of GPUs number in the hundreds, though they are obviously far less complex than those found in CPUs.
In the case of the original first-generation GeForce 8800 GTX (G80), there are 128 streaming processors. That number has almost doubled, to 240 for the flagship GTX 280 GPU and 192 for the performance model, the GTX 260. This was achieved by tweaking the prior arrangement of the stream processors: NVIDIA's engineers increased the number of clusters of such processors (TPCs) while at the same time increasing the number of multi-processors (SMs) in each cluster. Of course, you'll still find within each streaming multi-processor group the necessary supporting infrastructure, such as Texture Mapping Units (TMUs) and local memory caches.
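The totals follow directly from the cluster arrangement: stream processors = TPCs × SMs per TPC × SPs per SM. A minimal sketch, assuming the published per-architecture cluster counts (8 SPs per SM throughout; the GTX 260 ships with two of the ten TPCs disabled):

```python
# Stream processor totals from the cluster arrangement.
# Cluster counts are NVIDIA's published figures for each chip.

def total_sps(tpcs: int, sms_per_tpc: int, sps_per_sm: int = 8) -> int:
    """Total stream processors for a given cluster configuration."""
    return tpcs * sms_per_tpc * sps_per_sm

print(total_sps(tpcs=8, sms_per_tpc=2))   # 128 -- G80 (GeForce 8800 GTX)
print(total_sps(tpcs=10, sms_per_tpc=3))  # 240 -- GTX 280
print(total_sps(tpcs=8, sms_per_tpc=3))   # 192 -- GTX 260 (two TPCs disabled)
```

Note that the jump from 128 to 240 comes from increasing both factors at once: more TPCs per chip and more SMs per TPC.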
Obviously, the increase in processor cores and thread scheduling capacity has implications beyond graphics. NVIDIA has been rather vocal in stressing the general purpose computing prowess of its GPUs, and the same hardware on the GTX 200 series can be put to work on parallel computing with the appropriate software (in this case, CUDA). Applications that used to be the domain of the CPU can now be ported to run on the GPU, often with significant improvements in performance. These include video transcoding, distributed computing, financial modeling and scientific simulations.
Related to this GPGPU focus, NVIDIA has also added double precision support to the GTX 200 architecture, which is important for high performance computing workloads that require a high degree of mathematical precision. This is done by adding 64-bit double precision floating point math units to the mix (30 in total, one per SM), all of which are compliant with the IEEE 754R floating point specification. Accordingly, NVIDIA claims that the overall double precision performance of a GeForce GTX 200 GPU is around the level of an eight-core Xeon CPU. It is also the first NVIDIA GPU to have such a feature.
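A rough peak figure for those 30 units can be estimated, assuming each unit retires one fused multiply-add (two FLOPs) per shader clock; the 1296MHz shader clock used here is the GTX 280's published figure, not a number from this article:

```python
# Rough peak double precision throughput for the GTX 280.
# Assumes one fused multiply-add (2 FLOPs) per DP unit per shader
# clock; the 1296MHz shader clock is an assumed published figure.

def peak_dp_gflops(dp_units: int, shader_clock_hz: float,
                   flops_per_cycle: int = 2) -> float:
    """Theoretical peak double precision GFLOPS."""
    return dp_units * shader_clock_hz * flops_per_cycle / 1e9

print(peak_dp_gflops(dp_units=30, shader_clock_hz=1296e6))  # 77.76
```

Roughly 78 GFLOPS of peak double precision throughput, which is indeed in the neighbourhood of what a contemporary eight-core Xeon setup could deliver with SSE2.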
While ATI's high-end GPUs have been on a 512-bit memory interface for some time now, NVIDIA's equivalent GPUs have been relying on a 384-bit bus at best. This changes with the GTX 280, which adopts a 512-bit memory interface (8 x 64-bit memory interface units). The GTX 260 meanwhile gets a smaller upgrade to 448-bit (7 x 64-bit). NVIDIA also claims to have modified the paths in the memory controllers to allow for higher memory speeds, and the GTX 280 already runs at a high stock speed of 2214MHz DDR (we in fact reached higher speeds when overclocking). Improvements in compression and caching algorithms have also enabled better performance at higher resolutions.
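Bus width and data rate together give peak bandwidth: bytes per transfer × effective transfer rate. A quick sketch using the article's 2214MHz DDR figure for the GTX 280; the GTX 260's ~1998MHz effective stock rate is an assumed reference value, not from this text:

```python
# Peak memory bandwidth = bus width (in bytes) x effective data rate.
# The GTX 260's 1998MHz effective rate is an assumed stock figure.

def bandwidth_gb_s(bus_bits: int, effective_mhz: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return (bus_bits / 8) * effective_mhz * 1e6 / 1e9

print(round(bandwidth_gb_s(512, 2214), 1))  # 141.7 -- GTX 280
print(round(bandwidth_gb_s(448, 1998), 1))  # 111.9 -- GTX 260
```

The widening from 384-bit alone is a 33% bandwidth gain at the same clock, before the higher memory speeds are factored in.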
Along with the wider memory interface, the total amount of memory on the GTX 200 series has also gone up. The GTX 280 comes with a 1GB frame buffer and the GTX 260 with 896MB (the odd figure follows from its 448-bit bus, with seven of the eight 64-bit memory channels populated). This should improve the GPU's performance in the latest games with anti-aliasing at higher resolutions, especially for newer games that use deferred shading, which consumes more memory.
Internally, NVIDIA has also doubled the size of the local register file within each SM compared to the GeForce 8/9 series. This reduces the instances where large, complex shaders overflow the registers, necessitating a slow and inefficient spill to memory. Newer game engines, which tend to use more complex shaders, are the most likely to benefit.
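The other benefit of a bigger register file is occupancy: the number of threads an SM can keep resident is capped by how many registers each thread needs. A minimal sketch, assuming the per-SM register counts from NVIDIA's CUDA documentation (8192 32-bit registers on the GeForce 8/9 series, 16384 on the GTX 200) rather than figures from this article:

```python
# How a doubled register file helps occupancy: resident threads per
# SM are limited by registers as well as a hardware thread cap.
# Register file sizes are assumed from NVIDIA's CUDA documentation.

def resident_threads(register_file: int, regs_per_thread: int,
                     hw_limit: int) -> int:
    """Threads an SM can keep resident for a given shader."""
    return min(register_file // regs_per_thread, hw_limit)

# A complex shader needing 32 registers per thread:
print(resident_threads(8192, 32, hw_limit=768))    # 256 on GeForce 8/9
print(resident_threads(16384, 32, hw_limit=1024))  # 512 on GTX 200
```

With twice the registers, the same heavyweight shader keeps twice as many threads in flight, giving the scheduler more work to hide memory latency with.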