How to measure computer performance
June 24, 1996
With a car, it's easy enough to know that going 90 mph is faster than 85.
But judging computer performance is a lot more complicated. From MIPS to SPEC to Viewperfs, the industry has developed a plethora of benchmarks aimed at answering what seems like a simple-enough question: Which computer systems offer the best performance?
Cybercontacts

Readers can learn more about the benchmarks and workstations mentioned in this report via the Internet. Please tell them you were referred by Design News.

Digital Equipment Corp.: http://www.dec.com
IBM: http://www.ibm.com
Intergraph: http://www.intergraph.com
Silicon Graphics: http://www.sgi.com/Technology/benchmark
SPEC: http://www.specbench.org
Sun Microsystems: http://www.sun.com
"It is a major ordeal to try to evaluate hardware platforms," says John Kundrat, manager of business partner relations at Structural Dynamic Research Corp. "If it's so difficult for us who do this day in and day out, imagine how hard it is for users."
Why is it so hard to pin down computer speed? Different tasks put different strains on a system, so being fastest at displaying and rotating graphics doesn't necessarily mean a machine is equally adept at finite-element-analysis number crunching. That's one reason why experts caution that a single benchmark result is not enough to rate a computer's performance, and users should look at a variety of test results before drawing conclusions about how different machines stack up.
Raw speed. One popular benchmark suite, from the Standard Performance Evaluation Corp. (SPEC), measures a computer's central processing unit (CPU). But even this--which doesn't take into account graphics display, or how fast data can be pulled off a hard disk for use--is fairly complicated.
One set of SPEC tests, SPECint95, looks at the CPU's integer performance, or how it handles simple operations. Another group of benchmarks, SPECfp95, examines floating-point performance, or how fast the chip does more complex math.
Results generally show up in news reports, if at all, as two single numbers. But to get truly useful information from SPEC, it's important to look at the individual tests comprising both integer and floating point, says SPEC President Kaivalya Dixit.
"People shouldn't be comparing one number," he advises. "SPECfp can vary 4 or 5 to 1." For example, a given workstation might be five times as fast as another computer on the "tomcatv" fluid-dynamics test, but only twice as fast on a different analysis test, he says. And, such huge variations are typical.
He suggests engineers look at the numbers closest to their everyday tasks. For example, most would care less about SPEC's weather-prediction component and more about the 101.tomcatv mesh-generation program (one of 10 component pieces of the SPECfp95 number). Other tests in the SPECfp95 suite include 104.hydro2d, which solves hydrodynamic Navier-Stokes equations, and 110.applu, which solves parabolic/elliptic partial differential equations.
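To see why the single composite can hide such swings: each SPEC95 summary number is a geometric mean of the per-test speed ratios against a reference machine. The sketch below works that arithmetic with hypothetical ratios (not published SPEC results) that vary roughly 5 to 1 yet collapse into one tidy number.

    /* Sketch: how a single SPEC-style composite (a geometric mean of
     * per-test ratios) can hide large swings between component tests.
     * The ratios below are invented, not published SPEC results. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Imaginary per-test speed ratios versus a reference machine. */
        double ratios[] = { 5.1, 1.2, 2.4, 1.1, 4.8, 1.3, 2.0, 1.5, 3.9, 1.4 };
        int n = sizeof(ratios) / sizeof(ratios[0]);
        double log_sum = 0.0;

        for (int i = 0; i < n; i++)
            log_sum += log(ratios[i]);

        /* One composite number, even though the tests vary almost 5 to 1. */
        printf("composite (geometric mean) = %.2f\n", exp(log_sum / n));
        return 0;
    }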
SPEC recently updated its test suite from 1992 to '95 versions, in order to keep pace with rapidly advancing technology. "SPEC92 is dead," Dixit notes. "If you use it, you will get wrong information." One reason: computer systems became so much more powerful in the past three years, the old benchmarks could sit comfortably inside a system's cache (on-chip memory), thus not accurately putting a processor through its paces.
The '95 SPECs use real applications, not exercises dreamed up in a lab, notes Walter Bays, senior staff engineer at Sun. "It's a big improvement."
But others in the industry say the numbers are of limited use. "SPECmarks and PLBs (Picture-Level Benchmarks) don't do a very good job conveying how the system will work in real-world applications," maintains Ty Rabe, manager of CAD application marketing at Digital Equipment Corp. "Graphics benchmarks are more useful, but few people understand their mix of graphics."
Graphics performance. The Graphics Performance Characterization (GPC) Committee, which recently merged with SPEC, aims to measure how fast computer systems can run graphics-intensive applications.
"GPC probably has the best standard benchmarks within the industry," says Bjorn Andersson, senior product manager, Ultra1 desktop, at Sun Microsystems Computer Corp. "They give you quite a good indication within different data sets. ... But you have to be careful which numbers you're looking at."
GPC numbers have been published in an inch-thick volume that can be difficult for anyone outside the computer industry to plow through. "When are they going to get a comprehensible number?" one industry analyst asked. "The reports are impossible to understand."
"It is a little daunting," admits Mike Bailey, chair of the GPC Group. "We prefer to think of it as complete."
Such an array of test results is reported so that users can look at performance on the specific tasks they're likely to do, he says. "Vectors per second or polygons per second are not terribly meaningful," he notes, because vectors can be many different sizes; arbitrarily generating a number using 10-pixel lines is unlikely to duplicate anyone's real-world experience.
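A rough numerical illustration of Bailey's point, using invented figures rather than GPC data: if the limiting factor is how many pixels a system can fill per second, the quoted "vectors per second" rises as the test lines get shorter, even though the hardware hasn't changed.

    /* Sketch: why "vectors per second" depends on the line length chosen.
     * Assumes a fixed pixel fill rate; all numbers are invented for
     * illustration, not measured GPC results. */
    #include <stdio.h>

    int main(void)
    {
        const double pixels_per_second = 50e6;       /* hypothetical fill rate */
        const int line_lengths[] = { 10, 50, 100 };  /* pixels per vector */

        for (int i = 0; i < 3; i++)
            printf("%3d-pixel lines: %.1f million vectors/sec\n",
                   line_lengths[i],
                   pixels_per_second / line_lengths[i] / 1e6);
        return 0;
    }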
"Users can have a lot of faith in the numbers because vendors are being policed by all the other vendors," Bailey says. "It makes for some long meetings." GPC, a non-profit organization, consists of manufacturers, users, consultants, and researchers.
PLBs (Picture-Level Benchmarks) are test models that are rotated, panned, zoomed, and so on, so a would-be purchaser can time how long such tasks take on different platforms. Two catch-all numbers, PLBwire93 and PLBsurf93, give combined results for tests on wireframe and surface models, respectively. However, as with SPECmarks, users can pick the specific test models most likely to reflect their actual work. For engineers, that could include sys_chassis and race_car in wireframe, and cyl_head in surface.
Each hardware vendor can write the software used to perform rotating, panning, and other tasks on each model--leading critics to complain that the benchmarking codes are more finely tuned than an off-the-shelf software package is ever likely to be. "We think PLB is a dubious benchmark," says John Spitzer, manager of desktop performance engineering at Silicon Graphics.
However, such results can offer a look at a system's potential, showing what is achievable if software programmers take full advantage of a computer's specific features.
Viewperf is the first OpenGL performance benchmark endorsed by the GPC's OpenGL Performance Characterization (OPC) subcommittee. Developed by IBM, it tests performance (in frames per second) on specific data sets running in actual software packages. So far, there are standard "viewsets" for Parametric Technology's Pro/CDRS industrial-design software, IBM's Data Explorer visualization, and Intergraph's DesignReview 3-D model review package. The committee is now looking to get other software vendors to contribute test suites.
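The arithmetic behind a frames-per-second figure is simple, as in the sketch below, where render_frame() is only a stand-in for drawing one frame of a real viewset with OpenGL.

    /* Sketch: measuring frames per second the way a viewset-style test does.
     * render_frame() is a placeholder for drawing one frame of a test model;
     * a real Viewperf run drives actual OpenGL data sets instead. */
    #include <stdio.h>
    #include <time.h>

    static void render_frame(void)
    {
        /* Placeholder: draw the model at its next orientation. */
    }

    int main(void)
    {
        const int frames = 500;
        clock_t start = clock();

        for (int i = 0; i < frames; i++)
            render_frame();

        /* clock() reports processor time; a production harness would read
         * a wall-clock timer with finer resolution instead. */
        double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%.1f frames per second\n",
               seconds > 0.0 ? frames / seconds : 0.0);
        return 0;
    }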
Yet another benchmark, STREAM, was developed at the University of Delaware to measure sustained memory bandwidth. It consists of four tests built around long vector operations, designed to stress memory access, a key determinant of performance for applications that move large amounts of data through a system.
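The four operations STREAM times are commonly known as Copy, Scale, Add, and Triad, each a simple loop over arrays far too large to fit in cache. The fragment below sketches those loops; the real benchmark also verifies results, repeats each loop, and converts the best timings into megabytes per second.

    /* Sketch of the four long-vector kernels that STREAM times
     * (Copy, Scale, Add, Triad); the real benchmark also checks results
     * and reports sustained memory bandwidth in MB/s. */
    #include <stdlib.h>

    #define N 2000000   /* far larger than any processor cache of the era */

    int main(void)
    {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        double scalar = 3.0;

        if (!a || !b || !c)
            return 1;

        for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        for (long i = 0; i < N; i++) c[i] = a[i];                 /* Copy  */
        for (long i = 0; i < N; i++) b[i] = scalar * c[i];        /* Scale */
        for (long i = 0; i < N; i++) c[i] = a[i] + b[i];          /* Add   */
        for (long i = 0; i < N; i++) a[i] = b[i] + scalar * c[i]; /* Triad */

        free(a); free(b); free(c);
        return 0;
    }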
Standards alternatives. Most hardware and software companies have their own benchmarks as well. To many vendors, this is the only way they can fully test the capabilities of their own systems. However, it is difficult for buyers to know which such numbers are trustworthy.
"I've seen so many benchmarks," says Kundrat at SDRC. "They can be structured to do a lot of things."
"They are tremendously easy to abuse," notes Rabe at DEC. However, such internal benchmarks can be useful for company engineers, who may spot specific performance problems and redesign future systems accordingly. "They allowed us to make substantial improvement on Alpha systems," he says. Or, proprietary tests can help determine if a system has been optimized for important markets.
Many major companies develop their own benchmarks, based specifically on the work engineers plan to do with new computer systems--something many in the industry highly recommend as the best way to test how well a computer will perform the tasks it would be assigned. Ford, for example, reportedly has a suite of 15 different applications running on multiple platforms. And at Eastman Kodak, staff engineer Rex Hays used some internally generated code from actual work in progress to see how much faster Sun's new UltraSPARC ran vs. the older SPARCstation 20. Results varied from a two- to five-fold increase, depending on the task, he says.
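The comparison Hays describes comes down to dividing the old machine's run time by the new machine's, task by task; the task names and times in the sketch below are placeholders, not Kodak's measurements.

    /* Sketch: per-task speedup of a new machine over an old one,
     * computed as old run time divided by new run time.
     * All numbers are placeholders, not measured results. */
    #include <stdio.h>

    int main(void)
    {
        const char *task[]   = { "mesh",  "solve", "render" };
        double old_seconds[] = { 1200.0,  5400.0,  900.0 };
        double new_seconds[] = {  600.0,  1100.0,  400.0 };

        for (int i = 0; i < 3; i++)
            printf("%-8s %.1fx faster\n", task[i],
                   old_seconds[i] / new_seconds[i]);
        return 0;
    }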
Time spent running such tests, if they are to be useful, is considerable; one test alone ran for more than 70 hours. "It's a tedious process," according to Kundrat at SDRC. And, during the weeks or months of evaluating systems, new technology can come out making the older systems obsolete.
Ultimately, most in the industry agree, benchmarks can be useful as a guide to expected performance but not an exact prognosticator. "It's kind of like the EPA mileage estimate," Rabe at Digital concludes. "Your mileage may vary."
Glossary of Terms:
Sorting out the Benchmarks
GPC--Graphics Performance Characterization Committee. This non-profit organization of vendors, users, analysts, and researchers includes several subgroups: XPC, measuring X Window performance (such as 2-D lines and 2-D polygons); OPC, the OpenGL Performance Characterization group, testing implementations of OpenGL graphics routines; and Picture-Level Benchmarking, where vendors can devise their own ways of rendering various standard graphics scenes and then measure performance on their implementations.
MFLOPS--million floating-point operations per second. A less popular benchmark than in the past, as more sophisticated tests measuring real-world applications come into favor.

MIPS--million instructions per second. Raw measurement of how many simple instructions a computer chip can process. Often criticized for providing little useful real-world data.
PLB--Picture Level Benchmark, from the Graphics Performance Characterization (GPC) Committee. Features a number of different models, and then two categories of results for surface (PLBsurf93) and wireframe (PLBwire93). Critics say that because each vendor gets to write its own code for PLB tests, the numbers tend to be much more finely tuned than real-world software is to take advantage of specific chip capabilities.
SPEC--Standard Performance Evaluation Corp., a non-profit group of major hardware vendors who jointly develop benchmark tests. SPEC tests measure central-processing-unit speeds and not graphics. SPEC recently came out with new benchmarks, SPEC95, to replace the '92 test suite. SPECint95, which measures a processor's integer performance, can be useful for looking at how a system might handle 2-D, wireframe CAD. SPECfp95 is recommended for more demanding computing; it measures a processor's floating-point speed. However, for tasks such as 3-D solid modeling and visualization, users should run graphics benchmarks as well, since CPU performance alone will not accurately reflect performance. SPEC results are available on the World Wide Web at http://www.specbench.org
Viewperf--Stand-alone benchmark from the GPC's OpenGL Performance Characterization group. A user feeds a test model into the benchmark, which rotates and renders it while measuring frames per second. There are seven tests for the Pro/CDRS industrial-design (Parametric Technology Corp.) "viewset," 10 for Data Explorer visualization (IBM), and 10 for DesignReview (Intergraph). Others are under development.
Proprietary tests can also yield useful results
Along with industry-standard benchmarks, companies develop their own test suites in order to measure computer performance. Hewlett-Packard and Structural Dynamics Research Corp. agreed to share the results of one such proprietary test suite with Design News.
The Hedgetrimmer benchmark is a suite of 13 tests using I-DEAS Master Series Release 3 design, drafting, and simulation modules. It runs a 20-MByte model file of an electric hedgetrimmer through various simulations to mimic common computing situations. Such tests are aimed at helping both workstation designers and potential buyers see how much additional performance they might expect as they move up the product line.
The detailed steps:
1. Build the hedgetrimmer assembly from various parts (blade, housing, switch, etc.)
2. Save the resulting assembly to a new model file
3. Explode assembly into individual parts
4. Reassemble assembly from the parts
5. Shade the hedgetrimmer using hardware shading
6. Display hidden-line (wireframe) view of the hedgetrimmer
7. Display hedgetrimmer using ray tracing (CPU intensive)
8. Move to Drafting Module and shade top and front views
9. Display hidden-line view in Drafting mode
10. Set up assembly for Drafting mode
11. Enter Simulation Module and mesh the hedgetrimmer blade
12. Restrain the blade and perform analysis (complex FEA)
13. Display analysis results graphically
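A suite like the 13 steps above is usually wrapped in a small harness that times and logs each step; the sketch below shows what that bookkeeping might look like, with run_step() standing in for driving the actual I-DEAS operations.

    /* Sketch: wall-clock timing of a multi-step benchmark suite.
     * run_step() is a placeholder for driving the real application
     * (building the assembly, shading, meshing, and so on). */
    #include <stdio.h>
    #include <time.h>

    static void run_step(int step)
    {
        (void)step;  /* placeholder: invoke the application for this step */
    }

    int main(void)
    {
        const int steps = 13;
        double total = 0.0;

        for (int i = 1; i <= steps; i++) {
            time_t start = time(NULL);
            run_step(i);
            double elapsed = difftime(time(NULL), start);
            total += elapsed;
            printf("step %2d: %.0f s\n", i, elapsed);
        }
        printf("total:   %.0f s\n", total);
        return 0;
    }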
Tips from the experts
Using the right numbers is only part of the story when measuring computer performance, according to industry experts. Here are some other tips:
Don't use a single benchmark to try to rate computer performance.
Use industry-standard benchmarks to find systems within the performance range you're looking for.
Once you've narrowed down the choices, test your own specific applications on several different platforms. If you don't have the resources to develop testing software in-house, you can check with your software vendor or an outside consultant.
Make sure to examine how the computer system fits into your overall business process.
Factors other than speed and system performance are also important in making a purchase decision: available application software, vendor reliability, upgradability, and support services.