resperf (1) - Linux Manuals
NAME
resperf - test the resolution performance of a caching DNS server
SYNOPSIS
resperf-report
resperf
During the test, resperf listens for responses from the server and keeps
track of response rates, failure rates, and latencies. It will also continue
listening for responses for an additional 40 seconds after it has stopped
sending traffic, so that there is time for the server to respond to the last
queries sent. This time period was chosen to be longer than the overall
query timeout of both Nominum Vantio and current versions of BIND.
If the test is successful, the query rate will at some point exceed the
capacity of the server and queries will be dropped, causing the response
rate to stop growing or even decrease as the query rate increases.
The result of the test is a set of measurements of the query rate, response
rate, failure response rate, and average query latency as functions of time.
Make sure there is no stateful firewall between the server and the Internet,
because most of them can't handle the amount of UDP traffic the test will
generate and will end up dropping packets, skewing the test results. Some
will even lock up or crash.
You should run resperf on a machine separate from the server under test, on
the same LAN. Preferably, this should be a Gigabit Ethernet network. The
machine running resperf should be at least as fast as the machine being
tested; otherwise, it may end up being the bottleneck.
There should be no other applications running on the machine running
resperf. Performance testing at the traffic levels involved is essentially a
hard real-time application - consider the fact that at a query rate of
100,000 queries per second, if resperf gets delayed by just 1/100 of a
second, 1000 incoming UDP packets will arrive in the meantime. This is more
than most operating systems will buffer, which means packets will be
dropped.
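The figure quoted above is easy to check. The following is only an illustrative sketch in Python; the function name is made up for this example and is not part of resperf.

```python
def packets_arriving_during_stall(responses_per_second, stall_seconds):
    """Number of response packets that arrive while the sender is stalled."""
    return responses_per_second * stall_seconds


# 100,000 responses per second, resperf delayed by 1/100 of a second
print(packets_arriving_during_stall(100_000, 0.01))
```

Whether that many packets fit in the socket receive buffer depends on the operating system and its configuration.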
Because the granularity of the timers provided by operating systems is
typically too coarse to accurately schedule packet transmissions at
sub-millisecond intervals, resperf will busy-wait between packet
transmissions, constantly polling for responses in the meantime. Therefore,
it is normal for resperf to consume 100% CPU during the whole test run, even
during periods where query rates are relatively low.
You will also need a set of test queries in the dnsperf file format.
See the dnsperf man page for instructions on how to construct this
query file. To make the test as realistic as possible, the queries should be
derived from recorded production client DNS traffic, without removing
duplicate queries or other filtering. With the default settings, resperf
will use up to 3 million queries in each test run.
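The 3 million figure is consistent with the default settings described elsewhere in this page (a linear ramp from zero to 100,000 queries per second over 60 seconds). A rough Python sketch of that calculation, with an illustrative function name that is not part of resperf:

```python
def queries_in_linear_ramp(max_qps, rampup_seconds):
    """Total queries sent while the rate rises linearly from 0 to max_qps.

    The rate-versus-time curve is a triangle, so the total is its area.
    """
    return 0.5 * max_qps * rampup_seconds


# Default ramp: 60 seconds up to 100,000 queries per second
print(queries_in_linear_ramp(100_000, 60))
```

A run that is cut short will consume fewer queries, which is why the text says "up to".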
If the caching server to be tested has a configurable limit on the number of
simultaneous resolutions, like the max-recursive-clients statement
in Nominum Vantio or the recursive-clients option in BIND 9, you will
probably have to increase it. As a starting point, we recommend a value of
10000 for Nominum Vantio and 100000 for BIND 9. Should the limit be reached,
it will show up in the plots as an increase in the number of failure
responses.
The server being tested should be restarted at the beginning of each test to
make sure it is starting with an empty cache. If the cache already contains
data from a previous test run that used the same set of queries, almost all
queries will be answered from the cache, yielding inflated performance
numbers.
To use the resperf-report script, you need to have gnuplot
installed. Make sure your installed version of gnuplot supports the
png terminal driver. If your gnuplot doesn't support png but does
support gif, you can change the line saying terminal=png in the
resperf-report script to terminal=gif.
When running resperf-report, you will need to specify at least the server IP
address and the query data file. A typical invocation will look like this:
resperf-report -s 10.0.0.2 -d queryfile
With default settings, the test run will take at most 100 seconds (60
seconds of ramping up traffic and then 40 seconds of waiting for responses),
but in practice, the 60-second traffic phase will usually be cut short. To
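be precise, resperf can transition from the traffic-sending phase to the
waiting-for-responses phase in three different ways:
1. It runs for the full traffic-sending period and reaches the maximum query
rate (by default, 60 seconds and 100,000 queries per second).
2. It runs out of query IDs because too many queries are outstanding,
typically as a result of the server dropping queries.
3. It falls behind its own transmission schedule, in which case it prints a
"Fell behind" message and stops sending traffic.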
Regardless of which of the above conditions caused the traffic-sending phase
of the test to end, you should examine the resulting plots to make sure the
server's response rate is flattening out toward the end of the test. If it
is not, then you are not loading the server enough. If you are getting the
"Fell behind" message, make sure that the machine running resperf is fast
enough and has no other applications running.
You should also monitor the CPU usage of the server under test. It should
reach close to 100% CPU at the point of maximum traffic; if it does not, you
most likely have a bottleneck in some other part of your test setup, for
example, your external Internet connection.
The report generated by resperf-report will be stored with a unique
file name based on the current date and time, e.g.,
20060812-1550.html. The PNG images of the plots and other auxiliary
files will be stored in separate files beginning with the same date-time
string. To view the report, simply open the .html file in a web
browser.
If you need to copy the report to a separate machine for viewing, make sure
to copy the .png files along with the .html file (or simply copy all the
files, e.g., using scp 20060812-1550.* host:directory/).
The "Query/response/failure rate" plot contains three graphs. The "Queries
sent per second" graph shows the amount of traffic being sent to the server;
this should be very close to a straight diagonal line, reflecting the linear
ramp-up of traffic.
The "Total responses received per second" graph shows how many of the
queries received a response from the server. All responses are counted,
whether successful (NOERROR or NXDOMAIN) or not (e.g., SERVFAIL).
The "Failure responses received per second" graph shows how many of the
queries received a failure response. A response is considered to be a
failure if its RCODE is neither NOERROR nor NXDOMAIN.
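The failure rule stated above can be written as a one-line predicate. This is a minimal sketch, not resperf's own code; the numeric RCODE values are the standard ones from the DNS specification.

```python
# RCODE values as assigned by the DNS specification (RFC 1035).
NOERROR = 0
SERVFAIL = 2
NXDOMAIN = 3


def is_failure(rcode):
    """A response counts as a failure unless its RCODE is NOERROR or NXDOMAIN."""
    return rcode not in (NOERROR, NXDOMAIN)


print(is_failure(SERVFAIL), is_failure(NXDOMAIN))
```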
By visually inspecting the graphs, you can get an idea of how the server
behaves under increasing load. The "Total responses received per second"
graph will initially closely follow the "Queries sent per second" graph
(often rendering it invisible in the plot as the two graphs are plotted on
top of one another), but when the load exceeds the server's capacity, the
"Total responses received per second" graph may diverge from the "Queries
sent per second" graph and flatten out, indicating that some of the queries
are being dropped.
The "Failure responses received per second" graph will normally show a
roughly linear ramp close to the bottom of the plot with some random
fluctuation, since typical query traffic will contain some small percentage
of failing queries randomly interspersed with the successful ones. As the
total traffic increases, the number of failures will increase
proportionally.
If the "Failure responses received per second" graph turns sharply upwards,
this can be another indication that the load has exceeded the server's
capacity. This will happen if the server reacts to overload by sending
SERVFAIL responses rather than by dropping queries. Since Nominum Vantio and
BIND 9 will both respond with SERVFAIL when they exceed their
max-recursive-clients or recursive-clients limit,
respectively, a sudden increase in the number of failures could mean that
the limit needs to be increased.
The "Latency" plot contains a single graph marked "Average latency". This
shows how the latency varies during the course of the test. Typically, the
latency graph will exhibit a downwards trend because the cache hit rate
improves as ever more responses are cached during the test, and the latency
for a cache hit is much smaller than for a cache miss. The latency graph is
provided as an aid in determining the point where the server gets
overloaded, which can be seen as a sharp upwards turn in the graph. The
latency graph is not intended for making absolute latency measurements or
comparisons between servers; the latencies shown in the graph are not
representative of production latencies due to the initially empty cache and
the deliberate overloading of the server towards the end of the test.
Note that all measurements are displayed on the plot at the horizontal
position corresponding to the point in time when the query was sent, not
when the response (if any) was received. This makes it easy to compare
the query and response rates; for example, if no queries are dropped, the
query and response graphs will be identical. As another example, if the plot
shows 10% failure responses at t=5 seconds, this means that 10% of the
queries sent at t=5 seconds eventually failed, not that 10% of the responses
received at t=5 seconds were failures.
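To illustrate what attributing results to the send time means, here is a small Python sketch. It is an assumption-laden illustration rather than resperf's implementation: the input format (pairs of send time and a failed flag) and the function name are invented for the example.

```python
from collections import defaultdict


def failure_share_by_send_time(queries, interval=0.5):
    """Group queries by the interval in which they were SENT.

    `queries` is an iterable of (send_time, failed) pairs, where `failed`
    is True if the query eventually received a failure response.
    Returns {interval_start: fraction_of_queries_that_failed}.
    """
    sent = defaultdict(int)
    failed_count = defaultdict(int)
    for send_time, failed in queries:
        bucket = int(send_time // interval) * interval
        sent[bucket] += 1
        if failed:
            failed_count[bucket] += 1
    return {b: failed_count[b] / sent[b] for b in sorted(sent)}


# Ten queries sent shortly after t=5 seconds, one of which eventually failed
example = [(5.1, i == 0) for i in range(10)] + [(6.2, False)]
print(failure_share_by_send_time(example))
```

The time at which the failure response arrived plays no part in the grouping, which is the point made in the paragraph above.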
The summary statistics in the "Resperf output" section of the report
contain a "Maximum throughput" value which by default is determined from
the maximum rate at which the server was able to return responses, without
regard to the number of queries being dropped or failing at that point. This
method of throughput measurement has the advantage of simplicity, but it may
or may not be appropriate for your needs; the reported value should always
be validated by a visual inspection of the graphs to ensure that service has
not already deteriorated unacceptably before the maximum response rate is
reached. It may also be helpful to look at the "Lost at that point" value in
the summary statistics; this indicates the percentage of the queries that
was being dropped at the point in the test when the maximum throughput was
reached.
Alternatively, you can make resperf report the throughput at the point in
the test where the percentage of queries dropped exceeds a given limit (or
the maximum as above if the limit is never exceeded). This can be a more
realistic indication of how much the server can be loaded while still
providing an acceptable level of service. This is done using the -L
command line option; for example, specifying -L 10 makes resperf
report the highest throughput reached before the server starts dropping more
than 10% of the queries.
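The idea behind a loss limit can be sketched as follows. This is a hypothetical reconstruction in Python based only on the description above, not resperf's actual algorithm; the sample format and function name are assumptions, and a limit of 100 percent is used here simply to disable the cutoff.

```python
def throughput_with_loss_limit(samples, max_loss_percent=100.0):
    """Highest response rate seen before loss first exceeds the limit.

    `samples` is a time-ordered list of (queries_per_second,
    responses_per_second) pairs. If the limit is never exceeded, the
    overall maximum response rate is returned.
    """
    best = 0.0
    for qps, rps in samples:
        if qps > 0:
            loss = 100.0 * (1.0 - rps / qps)
            if loss > max_loss_percent:
                break
        best = max(best, rps)
    return best


# Made-up measurements: loss grows from 0% to 32.5% as the load increases
samples = [(1000, 1000), (2000, 1950), (3000, 2600), (4000, 2700)]
print(throughput_with_loss_limit(samples))        # no cutoff
print(throughput_with_loss_limit(samples, 10.0))  # comparable to -L 10
```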
There is no corresponding way of automatically constraining results based on
the number of failed queries, because unlike dropped queries, resolution
failures will occur even when the server is not overloaded, and the
number of such failures is heavily dependent on the query data and network
conditions. Therefore, the plots should be manually inspected to ensure that
there is not an abnormal number of failures.
To generate a constant traffic load, use the -c command line option,
together with the -m option which specifies the desired constant
query rate. For example, to send 10000 queries per second for an hour, use
-m 10000 -c 3600. This will include the usual 60-second gradual
ramp-up of traffic at the beginning, which may be useful to avoid initially
overwhelming a server that is starting with an empty cache. To start the
onslaught of traffic instantly, use -m 10000 -c 3600 -r 0.
To be precise, resperf will do a linear ramp-up of traffic from 0 to
-m queries per second over a period of -r seconds, followed by
a plateau of steady traffic at -m queries per second lasting for
-c seconds, followed by waiting for responses for an extra 40
seconds. Either the ramp-up or the plateau can be suppressed by supplying a
duration of zero seconds with -r 0 and -c 0, respectively. The
latter is the default.
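The traffic profile described above (ramp-up, plateau, then a 40-second wait) can be expressed as a small function. This is a sketch for illustration only; the function and parameter names are invented here and merely mirror the -m, -r and -c options.

```python
WAIT_SECONDS = 40  # time spent waiting for late responses, per the text above


def target_qps(t, max_qps, rampup_time, constant_time=0.0):
    """Target query rate at t seconds into the traffic-sending phase."""
    if t < 0 or t > rampup_time + constant_time:
        return 0.0  # outside the traffic-sending phase
    if t < rampup_time:
        return max_qps * t / rampup_time  # linear ramp-up
    return float(max_qps)  # plateau (or the end point of the ramp)


def total_run_time(rampup_time, constant_time=0.0):
    """Upper bound on the length of a test run, in seconds."""
    return rampup_time + constant_time + WAIT_SECONDS


print(target_qps(30, 100_000, 60))      # halfway up the default ramp
print(target_qps(0, 10_000, 0, 3600))   # -r 0: full rate immediately
print(total_run_time(60))               # default settings
```

A real run may end earlier than this upper bound if the traffic-sending phase is cut short.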
Sending traffic at high rates for hours on end will of course require very
large amounts of input data. Also, a long-running test will generate a large
amount of plot data, which is kept in memory for the duration of the test.
To reduce the memory usage and the size of the plot file, consider
increasing the interval between measurements from the default of 0.5 seconds
using the -i option in long-running tests.
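To get a feel for the numbers involved, the following Python sketch estimates the input data and plot data for a constant-rate run. It is a back-of-the-envelope calculation under the assumption that each query sent consumes one line of input; the function name is illustrative.

```python
def long_run_requirements(qps, constant_time, interval=0.5):
    """Queries consumed and plot lines produced by a constant-rate plateau.

    Ignores the ramp-up phase and the final wait for responses.
    """
    queries_needed = int(qps * constant_time)
    plot_lines = int(constant_time / interval)
    return queries_needed, plot_lines


print(long_run_requirements(10_000, 3600))               # default 0.5 s interval
print(long_run_requirements(10_000, 3600, interval=10))  # coarser, as with -i 10
```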
When using resperf for long-running tests, it is important that the
traffic rate specified using the -m option is one that both resperf
itself and the server under test can sustain. Otherwise, the test is likely
to be cut short as a result of either running out of query IDs (because of
large numbers of dropped queries) or of resperf falling behind its
transmission schedule.
-d datafile
-s server_addr
-p port
-a local_addr
-x local_port
If acting as multiple clients and the wildcard port is used, each client
will use a different random port. If a port is specified, the clients will
use a range of ports starting with the specified one.
-t timeout
resperf times out unanswered requests in order to reclaim query IDs so
that the query ID space will not be exhausted in a long-running test, such
as when "soak testing" a server for a day with -m 10000 -c 86400.
The timeouts and the ability to tune them are of little use in the more
typical use case of a performance test lasting only a minute or two.
The default timeout of 45 seconds was chosen to be longer than the query
timeout of current caching servers. Note that this is longer than the
corresponding default in dnsperf, because caching servers can take
many orders of magnitude longer to answer a query than authoritative servers
do.
If a short timeout is used, there is a possibility that resperf will
receive a response after the corresponding request has timed out; in this
case, a message like "Warning: Received a response with an unexpected id: 141"
will be printed.
-b bufsize
-f family
-e
-D
-y [alg:]name:secret
-h
-i interval
-m max_qps
-P plot_data_file
-r rampup_time
-c constant_traffic_time
-L max_loss
-C clients
-q max_outstanding
The first line of the file is a comment identifying the fields. It may be
recognized as a comment by its leading hash sign (#).
Subsequent lines contain the actual plot data. For purposes of generating
the plot data file, the test run is divided into time intervals of 0.5
seconds (or some other length of time specified with the -i command
line option). Each line corresponds to one such interval, and contains the
following values as floating-point numbers:
Time
Target queries per second
Actual queries per second
Responses per second
Failures per second
Average latency
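For readers who want to process the plot data themselves, a minimal Python reader might look like the sketch below. It assumes the six values appear in the order listed above and are separated by whitespace; the field names and the sample data are made up for illustration.

```python
import io

FIELDS = ("time", "target_qps", "actual_qps",
          "responses_per_sec", "failures_per_sec", "avg_latency")


def read_plot_data(lines):
    """Yield one dict per data line, skipping '#' comments and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = [float(v) for v in line.split()]
        yield dict(zip(FIELDS, values))


# Made-up sample data for illustration only.
sample = io.StringIO(
    "# time target_qps actual_qps responses_per_sec failures_per_sec avg_latency\n"
    "0.25 416.7 416.0 410.0 12.0 0.045\n"
    "0.75 1250.0 1248.0 1240.0 30.0 0.041\n"
)
rows = list(read_plot_data(sample))
print(max(r["responses_per_sec"] for r in rows))
```

To read a real file, pass an open file object instead of the in-memory sample.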
DESCRIPTION
resperf is a companion tool to dnsperf. dnsperf was
primarily designed for benchmarking authoritative servers, and it does not
work well with caching servers that are talking to the live Internet. One
reason for this is that dnsperf uses a "self-pacing" approach, which is
based on the assumption that you can keep the server 100% busy simply by
sending it a small burst of back-to-back queries to fill up network buffers,
and then send a new query whenever you get a response back. This approach
works well for authoritative servers that process queries in order and one
at a time; it also works pretty well for a caching server in a closed
laboratory environment talking to a simulated Internet that's all on the
same LAN. Unfortunately, it does not work well with a caching server talking
to the actual Internet, which may need to work on thousands of queries in
parallel to achieve its maximum throughput. There have been numerous
attempts to use dnsperf (or its predecessor, queryperf) for benchmarking
live caching servers, usually with poor results. Therefore, a separate tool
designed specifically for caching servers is needed.
How resperf works
Unlike the "self-pacing" approach of dnsperf, resperf works by sending DNS
queries at a controlled, steadily increasing rate. By default, resperf will
send traffic for 60 seconds, linearly increasing the amount of traffic from
zero to 100,000 queries per second.
What you will need
Benchmarking a live caching server is serious business. A fast caching
server like Nominum Vantio running on a XEON server, resolving a mix of
cacheable and non-cacheable queries typical of ISP customer traffic, is
capable of resolving over 100,000 queries per second. In the process, it
will send more than 40,000 queries per second to authoritative servers on
the Internet, and receive responses to most of them. Assuming an average
request size of 50 bytes and a response size of 150 bytes, this amounts to
some 16 Mbps of outgoing and 48 Mbps of incoming traffic. If your Internet
connection can't handle the bandwidth, you will end up measuring the speed
of the connection, not the server, and may saturate the connection causing a
degradation in service for other users.
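The bandwidth estimate above follows directly from the packet rate and sizes. The Python sketch below reproduces it under simplifying assumptions: it counts only the 50-byte and 150-byte sizes given in the text and treats every upstream query as answered. The function name is illustrative.

```python
def mbps(packets_per_second, bytes_per_packet):
    """Bandwidth in megabits per second (1 Mbps = 1,000,000 bits per second)."""
    return packets_per_second * bytes_per_packet * 8 / 1_000_000


upstream_qps = 40_000           # queries sent to authoritative servers
print(mbps(upstream_qps, 50))   # outgoing requests
print(mbps(upstream_qps, 150))  # incoming responses
```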
Running the test
Resperf is typically invoked via the resperf-report script, which
will run resperf with its output redirected to a file and then
automatically generate an illustrated report in HTML format. Command line
arguments given to resperf-report will be passed on unchanged to resperf.
resperf-report -s 10.0.0.2 -d queryfile
Interpreting the report
The .html file produced by resperf-report consists of two
sections. The first section, "Resperf output", contains output from the
resperf program such as progress messages, a summary of the command
line arguments, and summary statistics. The second section, "Plots",
contains two plots generated by gnuplot: "Query/response/failure rate"
and "Latency".
Determining the server's maximum throughput
Often, the goal of running resperf is to determine the server's
maximum throughput, in other words, the number of queries per second it is
capable of handling. This is not always an easy task, because as a server is
driven into overload, the service it provides may deteriorate gradually, and
this deterioration can manifest itself as queries being dropped, as an
increase in the number of SERVFAIL responses, or as an increase in latency.
The maximum throughput may be defined as the highest level of traffic at
which the server still provides an acceptable level of service, but that
means you first need to decide what an acceptable level of service means in
terms of packet drop percentage, SERVFAIL percentage, and latency.
GENERATING CONSTANT TRAFFIC
In addition to ramping up traffic linearly, resperf also has the
capability to send a constant stream of traffic. This can be useful when
using resperf for tasks other than performance measurement; for
example, it can be used to "soak test" a server by subjecting it to a
sustained load for an extended period of time.
OPTIONS
Because the resperf-report script passes its command line options
directly to the resperf program, they both accept the same set of
options, with one exception: resperf-report automatically adds an
appropriate -P to the resperf command line, and therefore does
not itself take a -P option.
THE PLOT DATA FILE
The plot data file is written by the resperf program and contains the
data to be plotted using gnuplot. When running resperf via the
resperf-report script, there is no need for the user to deal with
this file directly, but its format and contents are documented here for
completeness and in case you wish to run resperf directly and use its
output for purposes other than viewing it with gnuplot.
AUTHOR
Nominum, Inc.