Understanding Variations in Dhrystone Performance

By Reinhold P. Weicker, Siemens AG, AUT E 51, Erlangen

April 1989

This article has appeared in: Microprocessor Report, May 1989 (Editor: M. Slater), pp. 16-17

Microprocessor manufacturers tend to credit all the performance measured by benchmarks to the speed of their processors; they often don't even mention the programming language and compiler used. In their detailed documents, usually called "performance brief" or "performance report," they do give more details. However, these details are often lost in the press releases and other marketing statements. For serious performance evaluation, it is necessary to study the code generated by the various compilers.

Dhrystone was originally published in Ada (Communications of the ACM, Oct. 1984). However, since good Ada compilers were rare at that time and C, together with UNIX, became more and more popular, the C version of Dhrystone is the one now mainly used in industry. There are "official" versions 2.1 for Ada, Pascal, and C, which are as close together as the languages' semantic differences permit.

Dhrystone contains two statements where the programming language and its translation play a major part in the execution time measured by the benchmark:

o String assignment (in procedure Proc_0 / main)
o String comparison (in function Func_2)

In Ada and Pascal, strings are arrays of characters where the length of the string is part of the type information known at compile time. In C, strings are also arrays of characters, but there are no operators defined in the language for assignment and comparison of strings. Instead, the functions "strcpy" and "strcmp" are used. These functions are defined for strings of arbitrary length, and make use of the fact that strings in C have to end with a terminating null byte. For general-purpose calls to these functions, the implementor can assume nothing about the length and the alignment of the strings involved.
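To make the general-purpose case concrete, here is a minimal sketch of what such functions have to do when nothing is known about length or alignment. The names bytewise_strcpy and bytewise_strcmp are hypothetical and chosen only for this illustration; the behaviour is meant to mirror the standard "strcpy" and "strcmp", not to reproduce any particular library's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration names; not taken from any real library. */

/* Copy src to dst. The length is unknown, so the loop moves one byte
   at a time and stops only after the terminating null byte is copied. */
char *bytewise_strcpy(char *dst, const char *src)
{
    char *ret = dst;
    assert(dst != NULL && src != NULL);
    while ((*dst++ = *src++) != '\0') {
        /* all work is done in the loop condition */
    }
    return ret;
}

/* Compare a and b. The alignment is unknown, so the loop compares one
   byte at a time and stops at the first difference or at the end of a. */
int bytewise_strcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    /* Negative, zero, or positive, as with the standard strcmp. */
    return (unsigned char)*a - (unsigned char)*b;
}
```

Every character costs one load, one test for the null byte, and (for the copy) one store, which is why long strings weigh so heavily when the functions are implemented this way.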
The C version of Dhrystone spends a relatively large amount of time in these two functions. Some time ago, I made measurements on a VAX 11/785 with the Berkeley UNIX (4.2) compilers (often-used compilers, but certainly not the most advanced). In the C version, 23% of the time was spent in the string functions; in the Pascal version, only 10%. On good RISC machines (where less time is spent in the procedure calling sequence than on a VAX) and with better optimizing compilers, the percentage is higher; MIPS has reported 34% for an R3000. Because of this effect, Pascal and Ada Dhrystone results are usually better than C results (except when the optimization quality of the C compiler is considerably better than that of the other compilers).

Several people have noted that the string operations are over-represented in Dhrystone, mainly because the strings occurring in Dhrystone are longer than average strings. I admit that this is true, and have said so in my SIGPLAN Notices paper (Aug. 1988); however, I didn't want to generate confusion by changing the string lengths from version 1 to version 2.

Even if they are somewhat over-represented in Dhrystone, string operations are frequent enough that it makes sense to implement them in the most efficient way possible, not only for benchmarking purposes. This means that they can and should be written in assembly language code. ANSI C also explicitly allows the string functions to be implemented as macros, i.e. by inline code.

There is also a third way to speed up the "strcpy" statement in Dhrystone: for this particular "strcpy" statement, the source of the assignment is a string constant. Therefore, in contrast to calls to "strcpy" in the general case, the compiler knows the length and alignment of the strings involved at compile time and can generate code in the same efficient way as a Pascal compiler (word instructions instead of byte instructions).
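The difference between the general call and the constant-source call can be sketched in C as follows. The type name Str_30, the string literal, and both function names are modelled on the Dhrystone C source or invented for this illustration, and should be treated as assumptions; the point is only that the size of the source is a compile-time constant in the second function.

```c
#include <assert.h>
#include <string.h>

/* Assumed type: 30 characters plus the terminating null byte. */
typedef char Str_30[31];

/* General case: the length of src is discovered at run time by
   scanning for the null byte. (Hypothetical helper name.) */
void assign_general(Str_30 dst, const char *src)
{
    assert(strlen(src) < sizeof(Str_30));
    strcpy(dst, src);
}

/* Constant source: sizeof yields the size, including the null byte,
   at compile time. A compiler is free to expand this fixed-size copy
   into a short sequence of word moves. (Hypothetical helper name.) */
void assign_first_string(Str_30 dst)
{
    static const char text[] = "DHRYSTONE PROGRAM, 1'ST STRING";
    assert(sizeof text == sizeof(Str_30));
    memcpy(dst, text, sizeof text);
}
```

Both functions leave the same bytes in the destination; only the amount of information available to the compiler differs.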
The same optimization is not allowed in the case of the "strcmp" call: here, the addresses are formal procedure parameters, and no assumptions can be made about the length or alignment of the strings. Any such assumptions would indicate an incorrect implementation. They might work for Dhrystone, where the strings are in fact word-aligned with typical compilers, but other programs would deliver incorrect results.

So, for an apples-to-apples comparison between processors, and not between several possible (legal or illegal) degrees of compiler optimization, one should check that the systems are comparable with respect to the following three points:

(1) String functions in assembly language vs. in C

Frequently used functions such as the string functions can and should be written in assembly language, and all serious C language systems known to me do this. (I list this point for completeness only.) Note that processors with an instruction that checks a word for a null byte (such as AMD's 29000 and Intel's 80960) have an advantage here. (This advantage becomes relatively smaller if optimization (3) is applied.) Due to the length of the strings involved in Dhrystone, this advantage may be considered disproportionately large, but it is certainly legal to use such instructions - after all, these situations are what they were invented for.

(2) String function code inline vs. as library functions

ANSI C has created a new situation, compared with the older Kernighan/Ritchie C. In the original C, the definition of the string functions was not part of the language. Now it is, and inlining is explicitly allowed. I probably should have stated more clearly in my SIGPLAN Notices paper that the rule "No procedure inlining for Dhrystone" referred to the user-level procedures only and not to the library routines.

(3) Fixed-length and alignment assumptions for the strings

Compilers should be allowed to optimize in these cases if (and only if) it is safe to do so.
For Dhrystone, this applies to the "strcpy" statement, but not to the "strcmp" statement (unless, of course, the "strcmp" code explicitly checks the alignment at execution time and branches accordingly). A "Dhrystone switch" for the compiler that causes the generation of code that may not work under certain circumstances is certainly inappropriate for comparisons. It has been reported on Usenet that some C compilers provide such a compiler option; since I don't have access to all C compilers involved, I cannot verify this.

If the fixed-length and word-alignment assumption can be used, a wide bus that permits fast multi-word load instructions certainly does help; however, this fact by itself should not make a really big difference.

A check of these points - something that is necessary for a thorough evaluation and comparison of the Dhrystone performance claims - requires object code listings as well as listings for the string functions (strcpy, strcmp) that are possibly called by the program.

I don't pretend that Dhrystone is a perfect tool to measure the integer performance of microprocessors. The more it is used and discussed, the more I myself learn about aspects that I hadn't noticed when I wrote the program. And of course, the very success of a benchmark program is a danger in that people may tune their compilers and/or hardware to it, and by doing so make it less useful. Whetstone and Linpack also have their critical points: the Whetstone rating depends heavily on the speed of the mathematical functions (sine, sqrt, ...), and Linpack is sensitive to data alignment for some cache configurations.

Introduction of a standard set of public domain benchmark software (something the SPEC effort attempts) is certainly a worthwhile thing. In the meantime, people will continue to use whatever is available and widely distributed, and Dhrystone ratings are probably still better than MIPS ratings if these are - as is often the case in industry - based on no reproducible derivation.
However, any serious performance evaluation requires more than just a comparison of raw numbers; one has to make sure that the numbers have been obtained in a comparable way.
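As a footnote to point (3), the following is a rough sketch of a "strcmp" that checks the alignment at execution time and branches accordingly. It is not the code of any particular library. The names has_zero_byte and checked_strcmp are hypothetical, a 4-byte word is assumed, and the word path assumes that each string's storage is padded to a multiple of four bytes, since it may read up to three bytes beyond the terminating null byte. Production routines written in assembly language typically rely on an aligned word load instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Software stand-in for a "null byte in word" instruction:
   nonzero if and only if at least one byte of w is zero. */
int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical name. ASSUMPTION: when both pointers are 4-byte
   aligned, the storage behind them is padded to a multiple of 4. */
int checked_strcmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);

    /* Run-time alignment check: take the word path only if safe. */
    if ((uintptr_t)a % 4 == 0 && (uintptr_t)b % 4 == 0) {
        for (;;) {
            uint32_t wa, wb;
            memcpy(&wa, a, 4);   /* word load without aliasing issues */
            memcpy(&wb, b, 4);
            if (wa != wb || has_zero_byte(wa)) {
                break;           /* difference or end is in this word */
            }
            a += 4;
            b += 4;
        }
    }

    /* Byte path: finishes the last word, or handles unaligned input. */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```

Because the fast path is guarded by the alignment test, the function falls back to the byte loop whenever the test fails, which is the property a fixed word-alignment assumption lacks.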