Diffstat (limited to 'contrib/ntp/html/debug.htm')
-rw-r--r-- | contrib/ntp/html/debug.htm | 288 |
1 files changed, 288 insertions, 0 deletions
diff --git a/contrib/ntp/html/debug.htm b/contrib/ntp/html/debug.htm new file mode 100644 index 0000000..bf16049 --- /dev/null +++ b/contrib/ntp/html/debug.htm @@ -0,0 +1,288 @@ +<HTML><HEAD><TITLE> +NTP Debugging Techniques +</TITLE></HEAD><BODY><H3> +NTP Debugging Techniques +</H3> + +<IMG align=left SRC="pic/pogo.gif"><I>Pogo Possum</I>, with toolkit +and bug, Walt Kelly +<br clear=left><hr> + +<P>Once the NTP software distribution has been compiled and installed +and the configuration file constructed, the next step is to verify +correct operation and fix any bugs that may result. Usually, the command +line that starts the daemon is included in the system startup file, so +it is executed only at system boot time; however, the daemon can be +stopped and restarted from root at any time. Usually, no command-line +arguments are required, unless special actions described in the +<TT><A HREF="ntpd.htm">ntpd</A></TT> page are required. Once started, +the daemon will begin sending messages, as specified in the +configuration file, and interpreting received messages. + +<P>The best way to verify correct operation is using the <TT><A +HREF="ntpq.htm">ntpq</A></TT> and <TT><A HREF="ntpdc.htm">ntpdc</A></TT> +utility programs, either on the server itself or from another machine +elsewhere in the network. The <TT>ntpq</TT> program implements the +management functions specified in Appendix A of the NTP specification <A +HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305c.ps" +> +RFC-1305, Appendix A</A>. The <TT>ntpdc</TT> program implements +additional functions not provided in the standard. Both programs can be +used to inspect the state variables defined in the specification and, in +the case of <TT>ntpdc</TT>, additional ones of interest. In addition, +the <TT>ntpdc</TT> program can be used to selectively enable and disable +some functions of the daemon while the daemon is running. 
+ +<P>In extreme cases with elusive bugs, the daemon can operate in two +modes, depending on the presence of the <TT>-d</TT> command-line debug +switch. If not present, the daemon detaches from the controlling +terminal and proceeds autonomously. If one or more <TT>-d</TT> switches +are present, the daemon does not detach and generates special output +useful for debugging. In general, interpretation of this output requires +reference to the sources. However, a single <TT>-d</TT> does produce +only mildly cryptic output and can be very useful in finding problems +with configuration and network troubles. With a little experience, the +volume of output can be reduced by piping the output to <TT>grep +</TT>and specifying the keyword of the trace you want to see. + +<P>Some problems are immediately apparent when the daemon first starts +running. The most common of these is the lack of an ntp entry (UDP port 123) +in the host <TT>/etc/services</TT> file. Note that NTP does not use TCP +in any form. Other problems are apparent in the system log file. The log +file should show the startup banner, some cryptic initialization data, +and the computed precision value. The next most common problem is +incorrect DNS names. Check that each DNS name used in the configuration +file responds to the Unix <TT>ping</TT> command. + +<P>When first started, the daemon normally polls the servers listed in +the configuration file at 64-second intervals. In order to allow a +sufficient number of samples for the NTP algorithms to reliably +discriminate between correctly operating servers and possible intruders, +at least four valid messages from at least one server are required before +the daemon can set the local clock. However, if the current local time +is more than 1000 seconds in error from the server time, the daemon +will not set the local clock; instead, it will plant a message in the +system log and shut down. 
It is necessary to set the local clock to +within 1000 seconds first, either by a time-of-year hardware clock, by +first using the <A HREF="ntpdate.htm"><TT>ntpdate</TT> </A>program or +manually by eyeball and wristwatch. + +<P>After starting the daemon, run the <TT>ntpq</TT> program using the +<TT>-n</TT> switch, which will avoid possible distractions due to name +resolution problems. Use the <TT>pe</TT> command to display a billboard +showing the status of configured peers and possibly other clients poking +the daemon. After operating for a few minutes, the display should be +something like: + +<PRE>ntpq>pe +remote refid st t when poll reach delay offset disp +=================================================================== ++128.4.2.6 132.249.16.1 2 u 131 256 373 9.89 16.28 23.25 +*128.4.1.20 .WWVB. 1 u 137 256 377 280.62 21.74 20.23 +-128.8.2.88 128.8.10.1 2 u 49 128 376 294.14 5.94 17.47 ++128.4.2.17 .WWVB. 1 u 173 256 377 279.95 20.56 16.40 +</PRE> + +The host addresses shown in the <TT>remote</TT> column should agree with +the DNS entries in the configuration file, plus any peers not mentioned +in the file, at the same or a lower stratum than yours, that happen to be +configured to peer with you. Be prepared for surprises in cases where +the peer has multiple addresses or multiple names. The <TT>refid</TT> +entry shows the current source of synchronization for each peer, while +the <TT>st</TT> reveals the stratum, <TT>t</TT> the type (<TT>u</TT> = +unicast, <TT>m</TT> = multicast, <TT>l</TT> = local, <TT>-</TT> = don't +know), and <TT>poll</TT> the polling interval in seconds. The +<TT>when</TT> entry shows the time since the peer was last heard, +normally in seconds, while the <TT>reach</TT> entry shows the status of +the reachability register (see RFC-1305) in octal. The remaining entries +show the latest delay, offset and dispersion computed for the peer in +milliseconds. 
Note that in NTP Version 4 the dispersion entry includes +only the RMS error component; earlier versions included all components. + +<P>The tattletale character at the left margin displays the +synchronization status of each peer. The currently selected peer is +marked <TT>*</TT>, while additional peers designated acceptable for +synchronization, but not currently selected, are marked <TT>+</TT>. +Peers marked <TT>*</TT> and <TT>+</TT> are included in a weighted +average computation to set the local clock; the data produced by peers +marked with other symbols are discarded. See the <TT>ntpq</TT> +documentation for the meaning of these symbols. + +<P>Additional details for each peer separately can be determined by the +following procedure. First, use the <TT>as</TT> command to display an +index of association identifiers, such as + +<PRE>ntpq>as +ind assID status conf reach auth condition last_event cnt +========================================================= + 1 11670 7414 no yes ok candidate reachable 1 + 2 11673 7614 no yes ok sys.peer reachable 1 + 3 11833 7314 no yes ok outlyer reachable 1 + 4 11868 7414 no yes ok candidate reachable 1 + </PRE> + +Each line in this billboard is associated with the corresponding line in +the <TT>pe</TT> billboard above. 
Next, use the <TT>rv</TT> command and +the respective identifier to display a detailed synopsis of the selected +peer, such as + +<PRE>ntpq>rv 11670 +status=7414 reach, auth, sel_sync, 1 event, event_reach +srcadr=128.4.2.6, srcport=123, dstadr=128.4.2.7, dstport=123, keyid=1, +stratum=2, precision=-10, rootdelay=362.00, rootdispersion=21.99, +refid=132.249.16.1, +reftime=af00bb44.849b0000 Fri, Jan 15 1993 4:25:40.517, +delay= 9.89, offset= 16.28, +dispersion=23.25, reach=373, valid=8, +hmode=2, pmode=1, hpoll=8, ppoll=10, leap=00, flash=0x0, +org=af00bb48.31a90000 Fri, Jan 15 1993 4:25:44.193, +rec=af00bb48.305e3000 Fri, Jan 15 1993 4:25:44.188, +xmt=af00bb1e.16689000 Fri, Jan 15 1993 4:25:02.087, +filtdelay= 16.40 9.89 140.08 9.63 9.72 9.22 10.79 122.99, +filtoffset= 13.24 16.28 -49.19 16.04 16.83 16.49 16.95 -39.43, +filterror= 16.27 20.17 27.98 31.89 35.80 39.70 43.61 47.52 +</PRE> + +A detailed explanation of the fields in this billboard is beyond the +scope of this discussion; however, most variables defined in the +specification RFC-1305 can be found. The most useful portion for +debugging is the last three lines, which give the roundtrip delay, clock +offset and dispersion for each of the last eight measurement rounds, all +in milliseconds. Note that the dispersion, which is an estimate of the +error, increases as the age of the sample increases. From these data, it +is usually possible to determine the incidence of severe packet loss, +network congestion, and unstable local clock oscillators. There are no +hard and fast rules here, since every case is unique; however, if one or +more of the rounds show zeros, or if the clock offset changes +dramatically in the same direction for each round, cause for alarm +exists. 
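<P>The two warning signs just described (rounds that show zeros, and an offset that moves the same way in every round) can be checked mechanically. The following is a minimal Python sketch, not part of the NTP distribution. It assumes the <TT>rv</TT> output has been captured as text; the function names and the <TT>min_step_ms</TT> threshold are hypothetical, and the tests are rules of thumb rather than anything defined in RFC-1305.

```python
import re

# The last three lines of the rv billboard shown above.
RV_TEXT = """\
filtdelay= 16.40 9.89 140.08 9.63 9.72 9.22 10.79 122.99,
filtoffset= 13.24 16.28 -49.19 16.04 16.83 16.49 16.95 -39.43,
filterror= 16.27 20.17 27.98 31.89 35.80 39.70 43.61 47.52
"""


def parse_filter_rows(text):
    """Return {name: [floats]} for the filtdelay/filtoffset/filterror rows."""
    rows = {}
    for name in ("filtdelay", "filtoffset", "filterror"):
        match = re.search(name + r"=([-\d.\s]+)", text)
        if match:
            rows[name] = [float(value) for value in match.group(1).split()]
    return rows


def suspicious_rounds(rows, min_step_ms=1.0):
    """Flag the two warning signs named in the text.

    min_step_ms is an arbitrary, hypothetical threshold for what counts
    as a "dramatic" change between consecutive rounds.
    """
    findings = []
    delays = rows.get("filtdelay", [])
    offsets = rows.get("filtoffset", [])

    # Sign 1: one or more rounds show a zero roundtrip delay.
    zero_rounds = [i for i, delay in enumerate(delays) if delay == 0.0]
    if zero_rounds:
        findings.append(f"zero delay in round(s) {zero_rounds}")

    # Sign 2: the offset changes in the same direction between every
    # pair of consecutive rounds, by at least min_step_ms each time.
    steps = [b - a for a, b in zip(offsets, offsets[1:])]
    if steps and (all(s >= min_step_ms for s in steps)
                  or all(s <= -min_step_ms for s in steps)):
        findings.append("offset moves the same way in every round")
    return findings


if __name__ == "__main__":
    print(suspicious_rounds(parse_filter_rows(RV_TEXT)))
```

For the sample billboard neither sign is present: no delay is zero, and the offsets move both up and down from round to round.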
+ +<P>Finally, the state of the local clock can be determined using the +<TT>rv</TT> command (without the argument), such as + +<PRE>ntpq>rv +status=0664 leap_none, sync_ntp, 6 events, event_peer/strat_chg +system="UNIX", leap=00, stratum=2, rootdelay=280.62, +rootdispersion=45.26, peer=11673, refid=128.4.1.20, +reftime=af00bb42.56111000 Fri, Jan 15 1993 4:25:38.336, +poll=8, clock=af00bbcd.8a5de000 Fri, Jan 15 1993 4:27:57.540, +phase=21.147, freq=13319.46, compliance=2 +</PRE> + +The most useful data in this billboard show when the clock was last +adjusted (<TT>reftime</TT>), together with its status and most recent +exception event. An explanation of these data is in the specification +RFC-1305. + +<P>When nothing seems to happen in the <TT>pe</TT> billboard after some +minutes, there may be a network problem. The most common network problem +is an access-controlled router on the path to the selected peer. No +known public NTP time server selectively restricts access at this time, +although this may change in future; however, many private networks do. +It also may be the case that the server is down or running in +unsynchronized mode due to a local problem. Use the <TT>ntpq</TT> +program to spy on the server's variables in the same way you can spy on your +own. + +<P>Once the daemon has set the local clock, it will continuously track +the discrepancy between local time and NTP time and adjust the local +clock accordingly. There are two components of this adjustment, time and +frequency. These adjustments are automatically determined by the clock +discipline algorithm, which functions as a hybrid phase/frequency +feedback loop. The behavior of this algorithm is carefully controlled to +minimize residual errors due to network jitter and frequency variations +of the local clock hardware oscillator that normally occur in practice. +However, when started for the first time, the algorithm may take some +time to converge on the intrinsic frequency error of the host machine. 
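<P>The <TT>reftime</TT> and <TT>clock</TT> fields in the <TT>rv</TT> billboards above are 64-bit NTP timestamps printed in hex: 32 bits of seconds since 1 January 1900, a dot, then 32 bits of binary fraction. When only the hex form is at hand, for instance in a log, it can be decoded as in the following Python sketch. The function name is hypothetical, and the sketch assumes NTP era 0, that is, timestamps before the 32-bit seconds field wraps in 2036.

```python
from datetime import datetime, timedelta, timezone

# NTP timestamps count seconds from this epoch (era 0 assumed).
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def ntp_hex_to_datetime(stamp):
    """Decode a timestamp such as 'af00bb42.56111000' to a UTC datetime."""
    seconds_hex, fraction_hex = stamp.split(".")
    seconds = int(seconds_hex, 16)
    # The fraction field is an unsigned 32-bit binary fraction of a second.
    microseconds = round(int(fraction_hex, 16) / 2**32 * 1_000_000)
    return NTP_EPOCH + timedelta(seconds=seconds, microseconds=microseconds)


if __name__ == "__main__":
    reftime = ntp_hex_to_datetime("af00bb42.56111000")
    clock = ntp_hex_to_datetime("af00bbcd.8a5de000")
    print(reftime.isoformat())
    # Time elapsed between the last clock update and the query.
    print((clock - reftime).total_seconds())
```

Applied to the system billboard above, the decoded values agree with the dates that <TT>ntpq</TT> prints next to the hex strings, and the difference between <TT>clock</TT> and <TT>reftime</TT> gives the time since the clock was last adjusted.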
+ +<P>It has sometimes been the experience that the local clock oscillator +frequency error is too large for the NTP discipline algorithm, which can +correct frequency errors as large as 43 seconds per day. There are two +possibilities that may result in this problem. First, the hardware time- +of-year clock chip must be disabled when using NTP, since this can +destabilize the discipline process. This is usually done using the +<TT><A HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-s</TT> +command line argument, but other means may be necessary. For instance, +in the Sun Solaris kernel, this can be done using a command in the +system startup file. + +<P>Normally, the daemon will adjust the local clock in small steps in +such a way that system and user programs are unaware of its operation. +The adjustment process operates continuously as long as the apparent +clock error does not exceed 128 milliseconds; an error larger than this +is, for most Internet paths, quite a rare event. If the event is simply an outlier due to an occasional +network delay spike, the correction is simply discarded; however, if the +apparent time error persists for an interval of about 20 minutes, the +local clock is stepped to the new value (as an option, the daemon can be +compiled to slew at an accelerated rate to the new value, rather than be +stepped). This behavior is designed to resist errors due to severely +congested network paths, as well as errors due to confused radio clocks +upon the epoch of a leap second. + +<H4>Debugging Checklist</H4> + +If the <TT>ntpq</TT> or <TT>ntpdc</TT> programs do not show that +messages are being received by the daemon or that received messages do +not result in correct synchronization, verify the following: + +<OL> + +<P><LI>Verify that the <TT>/etc/services</TT> file on the host machine is configured +to +accept UDP packets on the NTP port 123. 
NTP is specifically designed to +use UDP and does not respond to TCP.</LI> + +<P><LI>Check the system log for <TT>ntpd</TT> messages about +configuration +errors, name-lookup failures or initialization problems.</LI> + +<P><LI>Using the <TT>ntpdc</TT> program and <TT>iostats</TT> command, +verify that the received packets and packets sent counters are +incrementing. If the packets sent counter does not increment and the +configuration file includes designated servers, something may be wrong +in the network configuration of the ntpd host. If this counter does +increment and packets are actually being sent to the network, but the +received packets counter does not increment, something may be wrong in +the network or the server may not be responding.</LI> + +<P><LI>If both the packets sent counter and received packets counter do +increment, but the <TT>rec</TT> timestamp in the <TT>rv</TT> billboard for the peer +is far from the current date, received packets are probably being +discarded for some reason. There is a handy, undocumented state variable +<TT>flash</TT> visible in the same <TT>rv</TT> billboard. The value is in hex +and normally has the value zero (OK). However, if something is wrong, +the bits of this variable, reading from the right, correspond to the +sanity checks listed in Section 3.4.3 of the NTP specification <A +HREF="http://www.eecis.udel.edu/~mills/database/rfc/rfc1305/rfc1305b.ps" +>RFC-1305</A>. A bit other than zero indicates the associated sanity +check failed.</LI> + +<P><LI>If the <TT>org, rec</TT> and <TT>xmt</TT> timestamps in the +<TT>rv</TT> billboard for the peer appear current, but the local clock is not set, that is, +the <TT>rv</TT> command +without arguments does not show a stratum number less than 16, verify that valid clock offset, roundtrip delay and +dispersion are displayed for at least one peer. 
The clock offset should +be less than 1000 seconds, the roundtrip delay less than one second and +the dispersion less than one second.</LI> + + +<P><LI>While the algorithm can tolerate a relatively large frequency +error (up to 500 parts per million or 43 seconds per day), various +configuration errors (and in some cases kernel bugs) can exceed this +tolerance, leading to erratic behavior. This can result in frequent loss +of synchronization, together with wildly swinging offsets. Use the +<TT>ntpdc</TT> program (or temporary configuration file) and <TT>disable +pll</TT> command to prevent the <TT>ntpd</TT> daemon from setting the +clock. Using the <TT>ntpq</TT> or <TT>ntpdc</TT> programs, watch the +apparent offset as it varies over time to determine the intrinsic +frequency error. If the error increases by more than 22 milliseconds per +64-second poll interval, the intrinsic frequency must be reduced by some +means. The easiest way to do this is with the <TT><A +HREF="tickadj.htm">tickadj</A></TT> program and the <TT>-t</TT> +command line argument.</LI> + +</OL> + +<hr><a href=index.htm><img align=left src=pic/home.gif></a><address><a +href=mailto:mills@udel.edu> David L. Mills <mills@udel.edu></a> +</address></a></body></html> |
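<P>The figures in the last checklist item are simple unit conversions, sketched below in Python. The function names are hypothetical, and <TT>estimate_ppm</TT> uses only the first and last samples, which is a crude assumption suited to a quick look with the discipline loop disabled, not a substitute for the daemon's own estimate.

```python
SECONDS_PER_DAY = 86400


def ppm_to_seconds_per_day(ppm):
    """Convert a frequency error in parts per million to seconds per day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def drift_ppm(offset_change_ms, interval_s):
    """Frequency error implied by an offset change over an interval."""
    return (offset_change_ms / 1000.0) / interval_s * 1e6


def estimate_ppm(samples):
    """Estimate frequency error from (time_s, offset_ms) samples.

    Uses only the first and last samples (a straight line through the
    endpoints); samples must be in time order.
    """
    (t0, off0), (t1, off1) = samples[0], samples[-1]
    return drift_ppm(off1 - off0, t1 - t0)


if __name__ == "__main__":
    # 500 ppm corresponds to the 43 seconds per day quoted in the text.
    print(ppm_to_seconds_per_day(500))
    # 22 ms of growth per 64-second poll interval, expressed in ppm.
    print(drift_ppm(22, 64))
    # Offsets read by hand from ntpq at three successive polls (made-up data).
    print(estimate_ppm([(0, 1.0), (64, 12.0), (128, 23.0)]))
```

By this arithmetic, 500 parts per million is 43.2 seconds per day, and 22 milliseconds per 64-second interval is about 344 parts per million.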