When it comes to testing the performance of graphics cards, we, as a hardware review site, obviously have a vested interest in getting things right, particularly in our testing methodology. We owe it to our readers to give a fair and unbiased opinion of the products we receive for testing, and to do that, we often rely on the built-in benchmarks and timedemos so conveniently included by game developers nowadays.
Now, benchmarking computer hardware is not exactly held to the same rigorous standards that one would expect from, say, a peer-reviewed research paper. But we all believe in and adhere to the scientific method of testing, which means our 'observations', or benchmark results, are repeatable and, as far as possible, free of human error.
Does this mean that we often use synthetic benchmarks that have nothing to do with any actual game out there? Yes. But it also means that if you replicate our test system independently, from the components down to the drivers and the benchmark used, you will get similar results. This reproducibility is one of the core principles of the scientific method, and something that can only be achieved easily with scripted timedemos and other such 'canned' benchmarks, which are widely available and hence easy for readers to try for themselves. We'll even admit that such canned benchmarks are extremely convenient for getting a quick, rough idea of a card's capabilities.
Does it mean that this method of testing is perfect? No. As demonstrated by past incidents where companies tried to 'game' these benchmarks through specific optimizations that did nothing for actual in-game performance, hardware vendors know all too well the marketing potential of these commonly used benchmarks and have tried to enhance their products' standing in them. Unfortunately, 'real world' testing is rife with its own inherent problem of subjectivity, which reduces hardware performance testing to something akin to a movie or book review. After all, what does a playable level of performance mean to different individuals? Which portion of the game should be used for benchmarking (selection bias, anyone)? Can reviewers duplicate exactly what they did when 'benchmarking' a particular map?
In case you're wondering why we have spent so many paragraphs clarifying our testing approach, it's because the subject of 'real world' vs 'canned' benchmarks recently arose again with HardOCP's article, "Benchmarking the benchmarks", which explains their stance on the subject and, as an example, attempts to highlight an apparent flaw with the Crysis in-game timedemo. They may well have a point about how actual gameplay performance could differ greatly from scripted benchmarks that can be manipulated by optimized drivers. However, such 'illegal' optimizations bring bad publicity when discovered, and vendors run the risk of being exposed by eagle-eyed tech editors hoping for just such a scandal.
In the end, the highly subjective 'highest playable setting' approach taken by HardOCP is useful, but it fails to give users the relative difference in performance between graphics cards, which is precisely what those otherwise 'meaningless' numbers reported by benchmarks provide. Not to mention that readers cannot duplicate HardOCP's findings even if they wanted to, and basically have to trust that they got it right. Perhaps reading a variety of reviews using both methods of testing (something like a metacritic.com for hardware) would give a more accurate overall picture.
So, now that we have raised a topic worthy of further discussion in our forums, let's not forget today's actual article, which is about a rather mundane reference model of the Radeon HD 3870 X2 from PowerColor:
There are no surprises here and if you aren't up to speed on the whole 'two cards are better than one' approach, please check our related links for our earlier articles on this GPU.