The Internet is a network of networks with
no single owner or manager. As a result, there
has always been a lack of high-level information
on the status and performance of the Internet
as a whole. The Internet Health Report from Keynote
Systems gives network engineers and corporate
web site managers a unique service to help monitor
and diagnose the Internet.
How it Works
The Internet Health Report delivers data about
network performance (latency) between major United
States Internet backbones. The measurement agents
selected for the Internet Health Report have direct,
single-homed connections to a backbone, so their
connectivity is unambiguous. Each measurement
agent connects to every other agent every 15 minutes
and measures the latency (delay) of establishing
a TCP connection across the Internet.
The resulting data in the Internet Health Report
provides a logical performance map of the Internet.
The Internet Health Report matrix is updated every
15 minutes. Any time the delay between two networks
exceeds a threshold, a color-coded alert indicates
the level of the performance problem. Users can
drill down to see specifically where the delays
are occurring by clicking on any of the color-coded
boxes in the display.
The numbers in the display represent latency
between networks in milliseconds. The total in
each box is the geometric mean of all the data
points collected between the networks during the
specified interval.
Interpretation
Top-level Matrix
At the top level, the matrix shows aggregate latency
between all the agents on backbone one and all
the agents on backbone two. The color of the cell
represents the highest latency of all cells in
the second level matrix. The color of the number
represents the rating of the geometric mean of
the data that makes up the second level matrix.

As an example, consider the following top-level
matrix cell:
[figure: top-level matrix cell]
and the second-level matrix that sits beneath
it:
[figure: second-level matrix]

We visualize the data this way so that you can
determine at a glance the overall performance
between a pair of backbones (by looking at the
color of the number on the top-level matrix) as
well as the poorest performance between any pair
of agents on the pair of backbones (by looking
at the color of the cell).
Since the measurements are round-trip delays,
the top level matrix is symmetric around the upper-left
to lower-right diagonal. The numbers for a given
pair of backbones will be the same on either side
of the diagonal.
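The relationship between a top-level cell and the second-level matrix beneath it can be sketched in a few lines of Python. This is an illustration only, not Keynote's implementation: the names `rate`, `geometric_mean`, and `top_level_cell`, the agent labels, and the sample latencies are all hypothetical, and treating each threshold as an inclusive upper bound is an assumption.

```python
import math

# Upper bounds in ms taken from the Thresholds section of this page.
# Treating each bound as inclusive is an assumption.
RATINGS = [(90, "green"), (120, "blue"), (180, "yellow")]


def rate(latency_ms):
    """Map a latency in ms to a color rating."""
    for limit, color in RATINGS:
        if latency_ms <= limit:
            return color
    return "red"


def geometric_mean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def top_level_cell(samples_by_pair):
    """Summarize one backbone pair.

    samples_by_pair maps (agent_a, agent_b) to a list of latencies in ms.
    """
    # Second-level cells: one geometric mean per agent pair.
    second_level = {pair: geometric_mean(v) for pair, v in samples_by_pair.items()}
    # The number: geometric mean of every data point for the backbone pair.
    all_points = [x for v in samples_by_pair.values() for x in v]
    overall = geometric_mean(all_points)
    return {
        "number": round(overall),
        "number_color": rate(overall),
        # The cell color follows the worst second-level cell.
        "cell_color": rate(max(second_level.values())),
    }


# Hypothetical data: two healthy agent pairs and one slow pair.
cell = top_level_cell({
    ("sfo", "nyc"): [70, 80],
    ("sfo", "bos"): [60, 60],
    ("lax", "nyc"): [200, 200],
})
print(cell)
```

With this sample data the number is rated blue while the cell is red, which is the situation the two colors are meant to expose: acceptable overall performance with one poorly performing agent pair underneath.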
Second-level Matrix
To drill down to the second-level matrix, click
on the cell of interest in the top-level matrix.
Each cell at this level reports the geometric
mean of all data points taken in the last hour
(or last day, depending on your selection) for
each agent pair. At this level, the color of the
number and the color of the cell are always the same.
You may occasionally see a black cell in the
matrix.
This indicates that we have no data for that particular
agent pair over the time period you are examining.
This can happen for a number of reasons, including
loss of connectivity, agent failure, or Keynote
database maintenance.
A Note on Comparison
Please note that using the top-level
grid to compare the internal performance of different
backbones is problematic. Because we use existing
Keynote Perspective agents to take these measurements,
there is not necessarily an equal number of measurement
points on each measured backbone. In addition,
the absolute physical distance between agents
varies widely between backbones. For example,
consider comparing a backbone that has two agent
locations on opposite coasts to one that has two
agent locations on the east coast. The latter
backbone will report shorter latencies simply
because there is less distance between the agents.
As we add additional agents to the measured backbones,
this will become less of a problem.
Methodology
Measurements
Keynote maintains a network of measurement agents
in over 50 cities around the world. These agents
are placed on backbones that are statistically
selected to represent the average experience of
a business-to-business end user. Each location
(a city-backbone pair) is connected to the selected
backbone over a dedicated DS3 circuit or a dedicated
Ethernet connection. A list of all our agent locations
is available here.
We have selected up to four agents on six US
backbones for the initial release of Internet
Health Report. Over time, we plan on adding additional
agents per backbone, additional backbones, international
measurements, and backbone-to-hosting facility
measurements.
Every fifteen minutes each agent measures the
time it takes to perform a TCP Open to every agent
in the matrix, including itself. These measurements
are randomly distributed across each fifteen minute
period. The times are reported back to Keynote's
central database.
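One way to spread an agent's measurements at random across a fifteen-minute period is sketched below. It is not Keynote's scheduler: the function name `schedule_offsets`, the agent names, and the choice of a uniform distribution are assumptions made for illustration (the text says only that measurements are randomly distributed).

```python
import random

PERIOD_SECONDS = 15 * 60  # one measurement period, per the text


def schedule_offsets(source, agents, rng=None):
    """Return (offset_seconds, source, target) tuples sorted by offset.

    Every agent in `agents` is a target, including `source` itself.
    Offsets are drawn uniformly from [0, PERIOD_SECONDS); the uniform
    distribution is an assumption.
    """
    rng = rng or random.Random()
    plan = [(rng.random() * PERIOD_SECONDS, source, target) for target in agents]
    return sorted(plan)


# Hypothetical agent names.
for offset, src, dst in schedule_offsets("agent-a", ["agent-a", "agent-b", "agent-c"]):
    print(f"t+{offset:6.1f}s  {src} -> {dst}")
```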
We use TCP Open time rather than ICMP (Ping)
times because the TCP Open is a much better representation
of actual Internet user experience. The most-used
Internet applications (Web, Email, Telnet, ...)
begin by performing a TCP Open. Thus, all the
intervening routers will treat this measurement
just as they treat actual end-user requests, and
not in some special way that may be triggered
by the use of ICMP packets.
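As a rough illustration of what timing a TCP Open involves, the sketch below measures how long it takes to establish a TCP connection using Python's standard `socket` module. It is not the Keynote agent software; the function name `tcp_open_ms` and the five-second timeout are assumptions.

```python
import socket
import time


def tcp_open_ms(host, port, timeout=5.0):
    """Return the time in milliseconds taken to open a TCP connection.

    Returns None if the connection cannot be established within `timeout`
    seconds (the timeout value is an assumption, not a Keynote setting).
    """
    start = time.perf_counter()
    try:
        # create_connection returns once the TCP handshake has completed.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
    except OSError:
        return None
    return elapsed * 1000.0
```

Passing an IP address rather than a host name keeps name resolution time out of the measurement.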
We do not currently measure packet loss directly.
However, a packet lost during the TCP Open handshake
shows up as a much longer latency, because the lost
packet is retransmitted by the sender only
after a timeout.
Reporting
Every fifteen minutes a report is run against
the database which extracts every data point for
the previous hour and day. These data points are
then sorted by agent pair, a geometric mean is
taken, and the Internet Health Report pages are
generated and copied to the web server for public
view.
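The grouping step described above can be sketched as follows. The record layout `(source_agent, target_agent, latency_ms)`, the function name `report`, and the sample values are assumptions for illustration; the actual database schema is not described on this page.

```python
import math
from collections import defaultdict


def report(data_points):
    """Group data points by agent pair and take a geometric mean per pair.

    data_points: iterable of (source_agent, target_agent, latency_ms),
    where every latency is a positive number.
    """
    by_pair = defaultdict(list)
    for source, target, latency_ms in data_points:
        by_pair[(source, target)].append(latency_ms)
    # Emit pairs in sorted order, one geometric mean each.
    return {
        pair: math.exp(sum(math.log(x) for x in xs) / len(xs))
        for pair, xs in sorted(by_pair.items())
    }


# Hypothetical measurements for two agent pairs.
points = [("a", "b", 50), ("a", "c", 80), ("a", "b", 200)]
for pair, value in report(points).items():
    print(pair, round(value, 1))
```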
We use geometric means because the Internet does
not behave in a normally distributed way. Latencies
tend to have a "heavy-tailed" distribution:
most measurements are clustered around a relatively
small number (say, 50 ms), with a few measurements
at a very large number (say, 400 ms). As a result,
arithmetic means (averages), which suit normally
distributed data, do not give the best view of actual
performance.
As an example, consider the situation of five
houses in a neighborhood selling for $100K, and
one selling for $10M. The average house price
would be $1.75M (not an optimal representation
of actual house prices - the expensive house pulls
the average too high), the median house price
would be $100K (not an optimal representation
either, as the information added by the price
of the expensive house is lost), while the geometric
mean would be $215K (much closer to the cost of
the five houses, and the information about the
expensive house is not lost).
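The three figures in the house-price example can be checked with Python's standard `statistics` module (its `geometric_mean` function requires Python 3.8 or later). The variable names are only illustrative.

```python
from statistics import geometric_mean, mean, median

# Five houses at $100K and one at $10M, as in the example above.
prices = [100_000] * 5 + [10_000_000]

arithmetic = mean(prices)           # pulled up by the one expensive house
middle = median(prices)             # ignores the expensive house entirely
geometric = geometric_mean(prices)  # sits between the two

print(f"mean:           ${arithmetic:,.0f}")
print(f"median:         ${middle:,.0f}")
print(f"geometric mean: ${geometric:,.0f}")
```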
To calculate a geometric mean, we take the log
of each data point, average the resultant numbers,
and exponentiate by the same base as the original
logarithm. Since the base in the log and exponentiation
are the same, it does not matter what it is, as
the exponentiation cancels the logarithm.
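The log, average, exponentiate procedure can be written directly, which also makes it easy to check that the choice of base does not matter. This is a minimal sketch; the function name and the sample latencies are illustrative.

```python
import math


def geometric_mean(values, base=math.e):
    """Geometric mean of positive numbers via logarithms in the given base."""
    logs = [math.log(v, base) for v in values]  # take the log of each point
    average = sum(logs) / len(logs)             # average the logs
    return base ** average                      # exponentiate by the same base


# Hypothetical latencies in ms: three typical values and one outlier.
latencies = [50, 50, 50, 400]
print(geometric_mean(latencies))      # natural log
print(geometric_mean(latencies, 2))   # base 2
print(geometric_mean(latencies, 10))  # base 10
```

All three calls agree to within floating-point rounding, because the exponentiation undoes whichever logarithm was used.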
Thresholds
We have set the latency thresholds at up to 90 ms
for Green (Healthy), up to 120 ms for Blue (Stable),
up to 180 ms for Yellow (Severe), and greater than
180 ms for Red (Critical). These thresholds are based
on the bi-coastal round-trip time for light through
fiber (40 ms per ITU-T G.114) with an allowance
of an additional 50 ms for propagation delay through
intervening routers, which together give the 90 ms
Green limit. We may change these thresholds
over time as overall Internet performance improves,
and may use different thresholds for international
connectivity measurements.
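Mapping a latency to its rating is a simple lookup against these limits. The sketch below treats each limit as an inclusive upper bound, which is an assumption; the names `LIMITS_MS`, `LABELS`, and `classify` are illustrative.

```python
from bisect import bisect_left

# Limits in ms from the text above; inclusive upper bounds are an assumption.
LIMITS_MS = [90, 120, 180]
LABELS = ["Green (Healthy)", "Blue (Stable)", "Yellow (Severe)", "Red (Critical)"]


def classify(latency_ms):
    """Return the rating label for a latency given in milliseconds."""
    # bisect_left counts how many limits are strictly below the latency.
    return LABELS[bisect_left(LIMITS_MS, latency_ms)]


for sample in (40, 90, 91, 150, 181):
    print(sample, "ms ->", classify(sample))
```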
Contacting Us
For more information on Keynote products and services,
please contact Keynote Sales at sales@keynote.com
or 1-888-KEYNOTE (539-6683).
For comments or suggestions for the Internet
Health Report, please send email to inthealth@keynote.com.
Press or analysts may contact Keynote Public
Relations at press@keynote.com
or 650-403-3254.