How to Measure Time Accurately in Programs
It is quite common to measure time in programs using APIs like clock() and gettimeofday(). We may also want to measure time “accurately” for certain purposes, such as measuring the execution time of a small piece of code for performance analysis, or keeping time in time-sensitive game software. Measuring time very accurately is hard, but we can usually measure it at a granularity that is acceptable for our purpose. Let’s look at the possible methods.
gettimeofday and clock_gettime
gettimeofday and clock_gettime are POSIX APIs to get the time. gettimeofday is easy to use, but it does not report the resolution of the system clock. For clock_gettime, clock_getres can be used to find out the resolution of a clock.
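As a minimal sketch of the two calls described above, the helpers below query a clock's resolution with clock_getres and read its current value with clock_gettime. The function names clock_resolution_ns and clock_now_ns are made up for this illustration and are not part of POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: resolution of the given clock in nanoseconds,
 * or -1 if clock_getres() fails. */
long long clock_resolution_ns(clockid_t id)
{
    struct timespec res;

    if (clock_getres(id, &res) != 0)
        return -1;
    return (long long)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Hypothetical helper: current value of the given clock in nanoseconds,
 * or -1 if clock_gettime() fails. */
long long clock_now_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}
```

Calling clock_resolution_ns(CLOCK_MONOTONIC) and, on Linux, clock_resolution_ns(CLOCK_MONOTONIC_COARSE) is one way to compare what different clocks claim to offer on a given machine.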
On the other hand, calling gettimeofday and clock_gettime has a cost of its own. Assuming they get the time from the same source, one important factor for accuracy is the cost (in time) of calling these APIs. How expensive are these calls? Is gettimeofday very slow?
A benchmark by David Terei and its results give a rough picture. I quote part of the results here, including time and ftime even though they only provide a granularity of seconds and milliseconds, respectively:
time (s) => 4ns
ftime (ms) => 39ns
gettimeofday (us) => 30ns
clock_gettime (ns) => 26ns (CLOCK_REALTIME)
clock_gettime (ns) => 8ns (CLOCK_REALTIME_COARSE)
clock_gettime (ns) => 26ns (CLOCK_MONOTONIC)
clock_gettime (ns) => 9ns (CLOCK_MONOTONIC_COARSE)
clock_gettime (ns) => 170ns (CLOCK_PROCESS_CPUTIME_ID)
clock_gettime (ns) => 154ns (CLOCK_THREAD_CPUTIME_ID)
The cost of gettimeofday is in the tens of nanoseconds. This cost, and the fact that the actual resolution is unknown, may be acceptable for many programs. On modern Linux these APIs are implemented through the vDSO and avoid a call into the kernel (see a discussion here). If a lower cost (around 10ns) and a known resolution are required by the program, clock_gettime with CLOCK_MONOTONIC_COARSE or CLOCK_REALTIME_COARSE may be a good choice. Keep in mind that the coarse clocks trade resolution for speed, so check what they offer with clock_getres.
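To get a feeling for the call cost on your own machine, a rough estimate can be obtained by timing many back-to-back calls. The sketch below is not the quoted benchmark; clock_gettime_cost_ns is a hypothetical helper, and the loop overhead is included in the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Hypothetical helper: average cost in nanoseconds of one
 * clock_gettime(id, ...) call, estimated over `iters` back-to-back calls.
 * Returns -1.0 on invalid input or if a clock cannot be read. */
double clock_gettime_cost_ns(clockid_t id, long iters)
{
    struct timespec start, end, scratch;

    if (iters <= 0 || clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (long i = 0; i < iters; i++) {
        if (clock_gettime(id, &scratch) != 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    /* Elapsed wall time of the whole loop, in nanoseconds. */
    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

For example, clock_gettime_cost_ns(CLOCK_REALTIME, 1000000) gives a per-call estimate for CLOCK_REALTIME. The numbers depend on the hardware, the kernel and whether the vDSO path is used, so they will not necessarily match the results quoted above.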
For even higher resolution, rdtsc may be put on the table.
rdtsc and rdtscp
rdtsc is an instruction, supported since Pentium-class CPUs, that reads the current time stamp counter (TSC), which is incremented every CPU tick (1/CPU_HZ). The TSC is a 64-bit register on x86 processors. PowerPC provides a similar capability. The TSC and rdtsc allow time to be measured at a very fine granularity.
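As a small sketch of how the counter can be read from C, the code below uses the __rdtsc() intrinsic that GCC and Clang provide in x86intrin.h (other compilers use different headers). The names read_tsc and tsc_ticks_to_ns are invented for this example, and the counter frequency passed to tsc_ticks_to_ns is an assumption that the caller has to supply; the code does not discover it.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>      /* GCC/Clang: provides __rdtsc() */
#define HAVE_TSC 1
#else
#define HAVE_TSC 0          /* no TSC: read_tsc() returns 0 */
#endif

/* Hypothetical helper: read the 64-bit time stamp counter. */
uint64_t read_tsc(void)
{
#if HAVE_TSC
    return __rdtsc();
#else
    return 0;
#endif
}

/* Hypothetical helper: convert a tick count to nanoseconds, assuming the
 * counter runs at a fixed, known frequency of tsc_hz ticks per second. */
double tsc_ticks_to_ns(uint64_t ticks, double tsc_hz)
{
    return (double)ticks * 1e9 / tsc_hz;
}
```

A measurement then looks like reading read_tsc() before and after the code of interest and passing the difference to tsc_ticks_to_ns. The conversion is only meaningful if the counter rate is constant and known.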
There are a couple of good implementations using rdtsc in C/asm on the Web that you can check: Time-stamp counter, cycle.h and Pentium Time Stamp Counter.
Everything has two sides. You need to pay special attention to its drawbacks if you use rdtsc in your program.
First, rdtsc instructions may not execute in the order in which they appear in the executable because of out-of-order execution. This can make an rdtsc execute earlier or later than expected and produce a misleading cycle count. Here is an example from Using the RDTSC Instruction for Performance Monitoring:
rdtsc ; read time stamp
mov time, eax ; move counter into variable
fdiv ; floating-point divide
rdtsc ; read time stamp
sub eax, time ; find the difference
This code tries to measure the time it takes to perform a floating-point division with fdiv. The fdiv takes a long time to complete, and the second rdtsc instruction could potentially execute before the fdiv has finished. If this happens, the cycle count will not be the one expected.
Inserting a serializing instruction such as cpuid, which forces all preceding instructions in the code to complete before allowing the program to continue, keeps the rdtsc instructions from being executed out of order. The code using cpuid for the above example is as follows.
cpuid ; force all previous instructions to complete
rdtsc ; read time stamp counter
mov time, eax ; move counter into variable
fdiv ; floating-point divide
cpuid ; wait for FDIV to complete before RDTSC
rdtsc ; read time stamp counter
sub eax, time ; find the difference
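The same pattern can be sketched in C with GCC/Clang-style inline assembly. This is an illustration under assumptions rather than the code from the quoted paper: the helper names are hypothetical, the workload sample_divide merely stands in for the fdiv, the full 64-bit counter (edx:eax) is used instead of only eax, and on non-x86 targets the functions compile to stubs that return 0.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#define TSC_AVAILABLE 1
#else
#define TSC_AVAILABLE 0
#endif

/* Execute CPUID (leaf 0) only for its serializing side effect. */
static inline void serialize_with_cpuid(void)
{
#if TSC_AVAILABLE
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
#endif
}

/* Read the 64-bit TSC; rdtsc returns it split across edx:eax. */
static inline uint64_t read_tsc_raw(void)
{
#if TSC_AVAILABLE
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}

/* Stand-in workload: one floating-point division the compiler must keep. */
void sample_divide(void)
{
    volatile double x = 1.0;
    x = x / 3.0;
}

/* Cycles spent in work(), including the overhead of one cpuid and the
 * rdtsc reads themselves. */
uint64_t measure_cycles(void (*work)(void))
{
    serialize_with_cpuid();             /* finish earlier instructions */
    uint64_t start = read_tsc_raw();
    work();
    serialize_with_cpuid();             /* wait for work() to complete */
    uint64_t end = read_tsc_raw();
    return end - start;
}
```

Because cpuid is itself expensive, the result of measure_cycles(sample_divide) includes a fixed overhead. Measuring an empty function the same way and subtracting that value is one way to estimate and remove it.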
An alternative is to use rdtscp, which waits until all previous instructions have been executed before reading the counter. However, rdtscp is not supported on all CPU models. Support is indicated by CPUID leaf 80000001H, EDX bit 27: if the bit is set to 1, rdtscp is present on the processor. For more details, check https://www.systutorials.com/x86-64-isa-assembly-references#x86-64-.28and-x86.29-isa-reference/.
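The CPUID check described above can be sketched as follows, assuming GCC or Clang, whose cpuid.h header provides __get_cpuid(). The names edx_reports_rdtscp and cpu_has_rdtscp are hypothetical, and on non-x86 targets the check simply reports false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>          /* GCC/Clang: provides __get_cpuid() */
#endif

/* RDTSCP support is reported in CPUID leaf 80000001H, EDX bit 27. */
#define RDTSCP_EDX_BIT 27u

/* Hypothetical helper: test the RDTSCP bit in an EDX value. */
bool edx_reports_rdtscp(uint32_t edx)
{
    return ((edx >> RDTSCP_EDX_BIT) & 1u) != 0;
}

/* Hypothetical helper: true if the running CPU reports rdtscp support. */
bool cpu_has_rdtscp(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_reports_rdtscp(edx);
#else
    return false;
#endif
}
```

A program can call cpu_has_rdtscp() once at startup and fall back to the cpuid plus rdtsc sequence when it returns false.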
There are other drawbacks to using rdtsc. Here is a list of concerns combined from Game Timing and Multicore Processors and Time Stamp Counter, which together summarize the possible problems quite well.
Discontinuous values. Multiprocessor and dual-core systems do not guarantee synchronization of their cycle counters between cores. This is exacerbated when combined with modern power management technologies that idle and restore various cores at different times, which results in the cores typically being out of synchronization. For an application, this generally results in glitches or in potential crashes as the thread jumps between the processors and gets timing values that result in large deltas, negative deltas, or halted timing.
Variability of the CPU’s frequency. Technology that changes the frequency of the CPU is in use in many high-end desktop PCs. Recent Intel processors include a constant rate TSC. While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate.
Portability. Reliance on the time stamp counter also reduces portability, as other processors may not have a similar feature.
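One defensive measure against the negative and very large deltas mentioned in the list above is to sanity-check each delta before using it. The sketch below shows that idea only; sanitized_delta and its max_delta parameter are hypothetical, and choosing a sensible max_delta is left to the application. It does not fix the underlying synchronization problem.

```c
#include <stdint.h>

/* Hypothetical helper: turn two raw counter readings into a usable delta.
 * A reading that went backwards yields 0, and a delta larger than
 * max_delta is clamped to max_delta. */
uint64_t sanitized_delta(uint64_t prev, uint64_t now, uint64_t max_delta)
{
    if (now < prev)                 /* counter appears to have gone backwards */
        return 0;

    uint64_t delta = now - prev;
    return delta > max_delta ? max_delta : delta;
}
```

For example, sanitized_delta(100, 90, 1000) yields 0, sanitized_delta(100, 150, 1000) yields 50, and sanitized_delta(100, 5000, 1000) is clamped to 1000.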