Diffstat (limited to 'contrib/ntp/html/debug.htm')
-rw-r--r-- contrib/ntp/html/debug.htm 288
1 files changed, 288 insertions, 0 deletions
diff --git a/contrib/ntp/html/debug.htm b/contrib/ntp/html/debug.htm
new file mode 100644
index 0000000..bf16049
--- /dev/null
+++ b/contrib/ntp/html/debug.htm
@@ -0,0 +1,288 @@
+<HTML><HEAD><TITLE>
+NTP Debugging Techniques
+</TITLE></HEAD><BODY><H3>
+NTP Debugging Techniques
+</H3>
+
+<IMG align=left SRC="pic/pogo.gif"><I>Pogo Possum</I>, with toolkit
+and bug, Walt Kelly
+<br clear=left><hr>
+
+<P>Once the NTP software distribution has been compiled and installed
+and the configuration file constructed, the next step is to verify
+correct operation and fix any bugs that may result. Usually, the command
+line that starts the daemon is included in the system startup file, so
+it is executed only at system boot time; however, the daemon can be
+stopped and restarted from root at any time. Usually, no command-line
+arguments are required, unless special actions described in the
+<TT><A HREF="ntpd.htm">ntpd</A></TT> page are required. Once started,
+the daemon will begin sending messages, as specified in the
+configuration file, and interpreting received messages.
+
+<P>The best way to verify correct operation is using the <TT><A
+HREF="ntpq.htm">ntpq</A></TT> and <TT><A HREF="ntpdc.htm">ntpdc</A></TT>
+utility programs, either on the server itself or from another machine
+elsewhere in the network. The <TT>ntpq</TT> program implements the
+management functions specified in the NTP specification <A
+HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305c.ps"
+>
+RFC-1305, Appendix A</A>. The <TT>ntpdc</TT> program implements
+additional functions not provided in the standard. Both programs can be
+used to inspect the state variables defined in the specification and, in
+the case of <TT>ntpdc</TT>, additional ones of interest. In addition,
+the <TT>ntpdc</TT> program can be used to selectively enable and disable
+some functions of the daemon while the daemon is running.
+
+<P>The daemon can operate in two modes, depending on the presence of the
+<TT>-d</TT> command-line debug switch, which helps in extreme cases with
+elusive bugs. If the switch is absent, the daemon detaches from the controlling
+terminal and proceeds autonomously. If one or more <TT>-d</TT> switches
+are present, the daemon does not detach and generates special output
+useful for debugging. In general, interpretation of this output requires
+reference to the sources. However, a single <TT>-d</TT> does produce
+only mildly cryptic output and can be very useful in finding problems
+with configuration and network troubles. With a little experience, the
+volume of output can be reduced by piping the output to <TT>grep</TT>
+and specifying the keyword of the trace you want to see.
+
+<P>Some problems are immediately apparent when the daemon first starts
+running. The most common of these is the lack of an ntp entry (UDP port
+123) in the host <TT>/etc/services</TT> file. Note that NTP does not use TCP
+in any form. Other problems are apparent in the system log file. The log
+file should show the startup banner, some cryptic initialization data,
+and the computed precision value. The next most common problem is
+incorrect DNS names. Check that each DNS name used in the configuration
+file responds to the Unix <TT>ping</TT> command.
+
+<P>When first started, the daemon normally polls the servers listed in
+the configuration file at 64-second intervals. In order to allow a
+sufficient number of samples for the NTP algorithms to reliably
+discriminate between correctly operating servers and possible intruders,
+at least four valid messages from at least one server are required before
+the daemon can set the local clock. However, if the current local time
+is greater than 1000 seconds in error from the server time, the daemon
+will not set the local clock; instead, it will plant a message in the
+system log and shut down. It is necessary to set the local clock to
+within 1000 seconds first, either by a time-of-year hardware clock, by
+first using the <A HREF="ntpdate.htm"><TT>ntpdate</TT></A> program or
+manually by eyeball and wristwatch.
+
+<P>After starting the daemon, run the <TT>ntpq</TT> program using the
+<TT>-n</TT> switch, which will avoid possible distractions due to name
+resolution problems. Use the <TT>pe</TT> command to display a billboard
+showing the status of configured peers and possibly other clients poking
+the daemon. After operating for a few minutes, the display should be
+something like:
+
+<PRE>ntpq>pe
+ remote          refid           st t when poll reach   delay  offset    disp
+=============================================================================
++128.4.2.6       132.249.16.1     2 u  131  256   373    9.89   16.28   23.25
+*128.4.1.20      .WWVB.           1 u  137  256   377  280.62   21.74   20.23
+-128.8.2.88      128.8.10.1       2 u   49  128   376  294.14    5.94   17.47
++128.4.2.17      .WWVB.           1 u  173  256   377  279.95   20.56   16.40
+</PRE>
+
+The host addresses shown in the <TT>remote</TT> column should agree with
+the DNS entries in the configuration file, plus any peers not mentioned
+in the file, at the same stratum as yours or lower, that happen to be
+configured to peer with you. Be prepared for surprises in cases where
+the peer has multiple addresses or multiple names. The <TT>refid</TT>
+entry shows the current source of synchronization for each peer, while
+the <TT>st</TT> reveals the stratum, <TT>t</TT> the type (<TT>u</TT> =
+unicast, <TT>m</TT> = multicast, <TT>l</TT> = local, <TT>-</TT> = don't
+know), and <TT>poll</TT> the polling interval in seconds. The
+<TT>when</TT> entry shows the time since the peer was last heard,
+normally in seconds, while the <TT>reach</TT> entry shows the status of
+the reachability register (see RFC-1305) in octal. The remaining entries
+show the latest delay, offset and dispersion computed for the peer in
+milliseconds. Note that in NTP Version 4 the dispersion entry includes
+only the RMS error component; earlier versions included all components.
+
+<P>The tattletale character at the left margin displays the
+synchronization status of each peer. The currently selected peer is
+marked <TT>*</TT>, while additional peers designated acceptable for
+synchronization, but not currently selected, are marked <TT>+</TT>.
+Peers marked <TT>*</TT> and <TT>+</TT> are included in a weighted
+average computation to set the local clock; the data produced by peers
+marked with other symbols are discarded. See the <TT>ntpq</TT>
+documentation for the meaning of these symbols.
+
+<P>Additional details for each individual peer can be determined by the
+following procedure. First, use the <TT>as</TT> command to display an
+index of association identifiers, such as
+
+<PRE>ntpq>as
+ind assID status conf reach auth condition last_event cnt
+=========================================================
+  1 11670   7414 no   yes   ok   candidate reachable    1
+  2 11673   7614 no   yes   ok   sys.peer  reachable    1
+  3 11833   7314 no   yes   ok   outlyer   reachable    1
+  4 11868   7414 no   yes   ok   candidate reachable    1
+ </PRE>
+
+Each line in this billboard is associated with the corresponding line in
+the <TT>pe</TT> billboard above. Next, use the <TT>rv</TT> command and
+the respective identifier to display a detailed synopsis of the selected
+peer, such as
+
+<PRE>ntpq>rv 11670
+status=7414 reach, auth, sel_sync, 1 event, event_reach
+srcadr=128.4.2.6, srcport=123, dstadr=128.4.2.7, dstport=123, keyid=1,
+stratum=2, precision=-10, rootdelay=362.00, rootdispersion=21.99,
+refid=132.249.16.1,
+reftime=af00bb44.849b0000 Fri, Jan 15 1993 4:25:40.517,
+delay= 9.89, offset= 16.28,
+dispersion=23.25, reach=373, valid=8,
+hmode=2, pmode=1, hpoll=8, ppoll=10, leap=00, flash=0x0,
+org=af00bb48.31a90000 Fri, Jan 15 1993 4:25:44.193,
+rec=af00bb48.305e3000 Fri, Jan 15 1993 4:25:44.188,
+xmt=af00bb1e.16689000 Fri, Jan 15 1993 4:25:02.087,
+filtdelay= 16.40 9.89 140.08 9.63 9.72 9.22 10.79 122.99,
+filtoffset= 13.24 16.28 -49.19 16.04 16.83 16.49 16.95 -39.43,
+filterror= 16.27 20.17 27.98 31.89 35.80 39.70 43.61 47.52
+</PRE>
+
+A detailed explanation of the fields in this billboard is beyond the
+scope of this discussion; however, most variables defined in the
+specification RFC-1305 can be found. The most useful portion for
+debugging is the last three lines, which give the roundtrip delay, clock
+offset and dispersion for each of the last eight measurement rounds, all
+in milliseconds. Note that the dispersion, which is an estimate of the
+error, increases as the age of the sample increases. From these data, it
+is usually possible to determine the incidence of severe packet loss,
+network congestion, and unstable local clock oscillators. There are no
+hard and fast rules here, since every case is unique; however, if one or
+more of the rounds show zeros, or if the clock offset changes
+dramatically in the same direction for each round, cause for alarm
+exists.
+
+<P>Finally, the state of the local clock can be determined using the
+<TT>rv</TT> command (without the argument), such as
+
+<PRE>ntpq>rv
+status=0664 leap_none, sync_ntp, 6 events, event_peer/strat_chg
+system="UNIX", leap=00, stratum=2, rootdelay=280.62,
+rootdispersion=45.26, peer=11673, refid=128.4.1.20,
+reftime=af00bb42.56111000 Fri, Jan 15 1993 4:25:38.336,
+poll=8, clock=af00bbcd.8a5de000 Fri, Jan 15 1993 4:27:57.540,
+phase=21.147, freq=13319.46, compliance=2
+</PRE>
+
+The most useful data in this billboard show when the clock was last
+adjusted (<TT>reftime</TT>), together with its status and most recent
+exception event. An explanation of these data is in the specification
+RFC-1305.
+
+<P>When nothing seems to happen in the <TT>pe</TT> billboard after some
+minutes, there may be a network problem. The most common network problem
+is an access-controlled router on the path to the selected peer. No
+known public NTP time server selectively restricts access at this time,
+although this may change in future; however, many private networks do.
+It also may be the case that the server is down or running in
+unsynchronized mode due to a local problem. Use the <TT>ntpq</TT>
+program to spy on the server's variables in the same way you can spy on
+your own.
+
+<P>Once the daemon has set the local clock, it will continuously track
+the discrepancy between local time and NTP time and adjust the local
+clock accordingly. There are two components of this adjustment, time and
+frequency. These adjustments are automatically determined by the clock
+discipline algorithm, which functions as a hybrid phase/frequency
+feedback loop. The behavior of this algorithm is carefully controlled to
+minimize residual errors due to network jitter and frequency variations
+of the local clock hardware oscillator that normally occur in practice.
+However, when started for the first time, the algorithm may take some
+time to converge on the intrinsic frequency error of the host machine.
+
+<P>It has sometimes been the experience that the local clock oscillator
+frequency error is too large for the NTP discipline algorithm, which can
+correct frequency errors as large as 43 seconds per day. There are two
+possibilities that may result in this problem. First, the hardware
+time-of-year clock chip must be disabled when using NTP, since this can
+destabilize the discipline process. This is usually done using the
+<TT><A HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-s</TT>
+command line argument, but other means may be necessary. For instance,
+in the Sun Solaris kernel, this can be done using a command in the
+system startup file.
+
+<P>Normally, the daemon will adjust the local clock in small steps in
+such a way that system and user programs are unaware of its operation.
+The adjustment process operates continuously unless the apparent
+clock error exceeds 128 milliseconds, which for most Internet paths is a
+quite rare event. If the event is simply an outlyer due to an occasional
+network delay spike, the correction is simply discarded; however, if the
+apparent time error persists for an interval of about 20 minutes, the
+local clock is stepped to the new value (as an option, the daemon can be
+compiled to slew at an accelerated rate to the new value, rather than be
+stepped). This behavior is designed to resist errors due to severely
+congested network paths, as well as errors due to confused radio clocks
+upon the epoch of a leap second.
+
+<H4>Debugging Checklist</H4>
+
+If the <TT>ntpq</TT> or <TT>ntpdc</TT> programs do not show that
+messages are being received by the daemon, or if received messages do
+not result in correct synchronization, verify the following:
+
+<OL>
+
+<P><LI>Verify that the <TT>/etc/services</TT> file on the host
+machine is configured to
+accept UDP packets on the NTP port 123. NTP is specifically designed to
+use UDP and does not respond to TCP.</LI>
+
+<P><LI>Check the system log for <TT>ntpd</TT> messages about
+configuration
+errors, name-lookup failures or initialization problems.</LI>
+
+<P><LI>Using the <TT>ntpdc</TT> program and <TT>iostats</TT> command,
+verify that the received packets and packets sent counters are
+incrementing. If the packets sent counter does not increment and the
+configuration file includes designated servers, something may be wrong
+in the network configuration of the ntpd host. If this counter does
+increment and packets are actually being sent to the network, but the
+received packets counter does not increment, something may be wrong in
+the network or the server may not be responding.</LI>
+
+<P><LI>If both the packets sent counter and received packets counter do
+increment, but the <TT>rec</TT> timestamp in the <TT>rv</TT> billboard
+shows a time far from the current date, received packets are probably being
+discarded for some reason. There is a handy, undocumented state variable
+<TT>flash</TT> visible in the <TT>rv</TT> billboard. The value is in hex
+and normally has the value zero (OK). However, if something is wrong,
+the bits of this variable, reading from the right, correspond to the
+sanity checks listed in Section 3.4.3 of the NTP specification <A
+HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305b.ps"
+>RFC-1305</A>. A nonzero bit indicates that the associated sanity
+check failed.</LI>
+
+<P><LI>If the <TT>org</TT>, <TT>rec</TT> and <TT>xmt</TT> timestamps in the
+<TT>rv</TT> billboard appear current, but the local clock is not set, as
+indicated by a stratum number of 16 in the <TT>rv</TT> command
+without arguments, verify that valid clock offset, roundtrip delay and
+dispersion are displayed for at least one peer. The clock offset should
+be less than 1000 seconds, the roundtrip delay less than one second and
+the dispersion less than one second.</LI>
+
+
+<P><LI>While the algorithm can tolerate a relatively large frequency
+error (up to 500 parts per million or 43 seconds per day), various
+configuration errors (and in some cases kernel bugs) can exceed this
+tolerance, leading to erratic behavior. This can result in frequent loss
+of synchronization, together with wildly swinging offsets. Use the
+<TT>ntpdc</TT> program (or temporary configuration file) and <TT>disable
+pll</TT> command to prevent the <TT>ntpd</TT> daemon from setting the
+clock. Using the <TT>ntpq</TT> or <TT>ntpdc</TT> programs, watch the
+apparent offset as it varies over time to determine the intrinsic
+frequency error. If the error increases by more than 22 milliseconds per
+64-second poll interval, the intrinsic frequency must be reduced by some
+means. The easiest way to do this is with the <TT><A
+HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-t</TT>
+command line argument.</LI>
+
+</OL>
+
+<hr><a href=index.htm><img align=left src=pic/home.gif></a><address><a
+href=mailto:mills@udel.edu> David L. Mills &lt;mills@udel.edu&gt;</a>
+</address></body></html>
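
A few of the figures in debug.htm can be checked numerically. The Python sketch below is not part of this patch or of the NTP distribution, and its function names are hypothetical. It assumes the reach value shown by ntpq is the 8-bit shift register of RFC-1305, printed in octal, with the most recent poll in the low-order bit. It also converts between a frequency error in parts per million and the drift figures quoted in the text.

```python
SECONDS_PER_DAY = 86400


def reach_history(reach_octal):
    """Decode an octal reach value into eight poll results, newest first.

    Assumes an 8-bit shift register whose low-order bit records the
    most recent poll (True means a reply was received).
    """
    value = int(reach_octal, 8)
    return [bool((value >> bit) & 1) for bit in range(8)]


def ppm_to_seconds_per_day(ppm):
    """Convert a frequency error in parts per million to drift per day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def drift_ppm(offset_change_ms, interval_s):
    """Frequency error implied by an offset change over an interval."""
    return (offset_change_ms * 1e-3) / interval_s * 1e6


if __name__ == "__main__":
    # Reach values taken from the pe billboard in the document.
    for reach in ("373", "377", "376"):
        print(reach, reach_history(reach))
    print(ppm_to_seconds_per_day(500))
    print(drift_ppm(22, 64))
```

Under these assumptions, a 500 ppm error works out to about 43.2 seconds per day, which agrees with the "43 seconds per day" limit quoted for the discipline algorithm. The reach value 373 decodes to a single missed poll, two polls before the most recent one. The 22 milliseconds per 64-second poll figure in the checklist corresponds to roughly 344 ppm.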