Diffstat (limited to 'share/doc/iso/wisc/trans_design.nr')
-rw-r--r-- | share/doc/iso/wisc/trans_design.nr | 1466 |
1 files changed, 1466 insertions, 0 deletions
diff --git a/share/doc/iso/wisc/trans_design.nr b/share/doc/iso/wisc/trans_design.nr new file mode 100644 index 0000000..6aeb54a --- /dev/null +++ b/share/doc/iso/wisc/trans_design.nr @@ -0,0 +1,1466 @@ +.NC "The Design of the ARGO Transport Entity" +.sh 1 "Protocol Hooks" +.pp +The design of the AOS kernel IPC support to some +extent mandates the +design of protocols. +Each protocol must provide the following +protocol hooks, which are procedures called through a +protocol switch table +(an array of type \fIprotosw\fR as described in +Chapter Five). +.ip "pr_input()" 5 +Called when data are to be passed up from a lower layer. +.ip "pr_output()" 5 +Called when data are to be passed down from a higher layer. +.ip "pr_init()" 5 +Called when the system is brought up. +.ip "pr_fasttimo()" 5 +Called every 200 milliseconds by the clock functional unit. +.ip "pr_slowtimo()" 5 +Called every 500 milliseconds by the clock functional unit. +.ip "pr_drain()" 5 +This is meant to be called when buffer space is low. +Each protocol is expected to provide this routine to free +non-critical buffer space. +This is not yet called anywhere. +.ip "pr_ctlinput()" 5 +Used for exchanging information between +protocols, such as notifying a transport protocol of changes +in routing or configuration information. +.ip "pr_ctloutput()" 5 +Supports the protocol-dependent +\fIgetsockopt()\fR +and +\fIsetsockopt()\fR +options. +.ip "pr_usrreq()" 5 +Called by the socket code to pass along a \*(lquser request\*(rq - +in other words a service primitive. +This call is also used for other protocol functions. +The functions served by the \fIpr_usrreq()\fR routine are: +.ip " PRU_ATTACH" 10 +Creates a protocol control block and attaches it to a given socket. +Called as a result of a \fIsocket()\fR system call. +.ip " PRU_DISCONNECT" 10 +Called as a result of a +\fIclose()\fR system call. +Initiates disconnection. 
+.ip " PRU_DETACH" 10 +Disassociates a protocol control block from a socket and recycles +the buffer space used for the protocol control block. +Called after PRU_DISCONNECT. +.ip " PRU_SHUTDOWN" 10 +Called as a result of a +\fIshutdown()\fR system call. +If the protocol supports the notion of half-open connections, +this closes the connection in one direction or both directions, +depending on the arguments passed to +\fIshutdown\fR. +.ip " PRU_BIND" 10 +Gives an address to a socket. +Called as a result of a +\fIbind()\fR system call, and also +when a +socket without a bound address is used. +In the latter case, an unused transport suffix is located and +bound to the socket. +.ip " PRU_LISTEN" 10 +Called as a result of a +\fIlisten()\fR system call. +Marks the socket as willing to queue incoming connection +requests. +.ip " PRU_CONNECT" 10 +Called as a result of a +\fIconnect()\fR system call. +Initiates a connection request. +.ip " PRU_ACCEPT" 10 +Called as a result of an +\fIaccept()\fR system call. +Dequeues a pending connection request, or blocks waiting for +a connection request to arrive. +In the latter case, it marks the socket as willing to accept +connections. +.ip " PRU_RCVD" 10 +The protocol module is expected to have put incoming data +into the socket's receive buffer, \fIso_rcv\fR. +When a receive primitive is used +(\fIrecv(), recvmsg(), recvfrom(), +read(), readv(), \fRand +\fIrecvv()\fR system calls) +the socket code module copies data from the +\fIso_rcv\fR to the user's +address space. +The protocol module may arrange to be informed each time the socket code +does this, in which case the socket code calls \fIpr_usrreq\fR(PRU_RCVD) +after the data were copied to the user. +.ip " PRU_SEND" 10 +This performs the protocol-dependent part of a send primitive +(\fIsend(), sendmsg(), sendto(), write(), writev(), +\fRand \fIsendv()\fR system calls). 
+The socket code +(procedures \fIsendit() and \fIsosend()\fR) +moves outgoing data from the user's +address space into a chain of \fImbufs\fR. +The socket code takes as much data from the user as it +determines will fit into the outgoing socket buffer, so_snd. +It passes this much data in the form of an mbuf chain to the protocol +via \fIpr_usrreq\fR(PRU_SEND). +If there are more data than +the so_snd can accommodate, +the socket code, which is running on behalf of a user process, +puts the user process to sleep. +The protocol module is expected to wake up the user process when +more room appears in so_snd. +.ip " PRU_ABORT" 10 +Called when a socket is closed and that socket +is accepting connections and has +queued pending +connection requests or +partially open connections. +.ip " PRU_CONTROL" 10 +Called as a result of an +\fIioctl()\fR system call. +.ip " PRU_SENSE" 10 +Called as a result of an +\fIfstat()\fR system call. +.ip " PRU_RCVOOB" 10 +Performs the work of receiving \*(lqout-of-band\*(rq data. +The socket module has already allocated an mbuf into which +the protocol module is expected to put the incoming +\*(lqout-of-band\*(rq data. +The socket code will then move the data from this mbuf +to the user's address space. +.ip " PRU_SENDOOB" 10 +Performs the work of sending \*(lqout-of-band\*(rq data. +The socket module has already moved the data +from the user's address space into a chain of mbufs, +which it now passes to the protocol module. +.ip " PRU_SOCKADDR" 10 +Supports the system call +\fIgetsockname()\fR. +Puts the socket's bound address into an mbuf. +.ip " PRU_PEERADDR" 10 +Supports the system call +\fIgetpeername\fR(). +Puts the peer's address into an mbuf. +.ip " PRU_CONNECT2" 10 +This is used in the Unix domain to support pipes. +It is not generally supported by transport protocols. +.ip " PRU_FASTTIMO, PRU_SLOWTIMO" 10 +These are superfluous. +None of the transport protocols uses them. 
+.ip " PRU_PROTORCV, PRU_PROTOSEND" 10 +None of the transport protocols uses these. +.ip " PRU_SENDEOT" 10 +This was added to support TP. +This indicates that the end of the data sent in this +send primitive should +be marked by the protocol as the end of the TSDU. +.sh 1 "The Interface Between the Transport Entity and Lower Layers" +.pp +The transport layer may run over a network layer such as IP +or the ISO connectionless network layer, +or it may run over a multi-purpose layer such as the service +provided by X.25. +X.25 is viewed as a network layer when +TP runs over X.25, and as a +subnetwork layer +when IP is running over X.25. +The software interface between data link and network layers differs +considerably from the software interface between transport and network +layers in AOS. +For this reason some modification of the transport-to-lower-layer +interface is necessary to support the suite of protocols included in +ARGO. +.pp +In AOS it is assumed that the transport layer will run over one +and only one network layer, and therefore it may call the +network layer output procedure directly. +In order to allow TP to run over a set of lower layers, +all domain-specific functions have been put into a set of routines +that are called indirectly through a domain-specific switch table. +The primary reason for this is that the transport and network +layers share information, mostly information pertaining to addresses. +The protocol control blocks for different network layers +differ, so the transport layer cannot just directly +access the network layer's pcb. +Similarly, a network layer may not directly access the transport +pcb because a multitude of transport protocols can run over each +of the network protocols. +.pp +To permit different network-layer protocol control blocks to coexist +under one transport layer, all transport-dependent control +information was put into a transport-specific protocol control block. 
+A new field, \fIso_tpcb\fR, +was added to the \fIsocket\fR structure to hold a pointer to +the transport-layer protocol control block. +The existing +field \fCso_pcb\fR is used for the network layer pcb. +.pp +The following structure was added to allow domain-specific +functions to be called indirectly. +All these functions operate on a network-layer pcb. +.pp +.(b +\fC +.TS +tab(+); +l s s s. +struct nl_protosw { +.T& +l l l l. ++int+nlp_afamily;+/* address family */ ++int+(*nlp_putnetaddr)();+/* puts addrs in pcb */ ++int+(*nlp_getnetaddr)();+/* gets addrs from pcb */ ++int+(*nlp_putsufx)();+/* transp suffix -> pcb */ ++int+(*nlp_getsufx)();+/* gets t-suffix */ ++int+(*nlp_recycle_suffix)();+/* zeroes suffix */ ++int+(*nlp_mtu)();+/* get maximum ++++transmission unit size */ ++int+(*nlp_pcbbind)();+/* bind to pcb */ ++int+(*nlp_pcbconn)();+/* connect */ ++int+(*nlp_pcbdisc)();+/* disconnect */ ++int+(*nlp_pcbdetach)();+/* detach pcb */ ++int+(*nlp_pcballoc)();+/* allocate a pcb */ ++int+(*nlp_output)();+/* emit packet */ ++int+(*nlp_dgoutput)();+/* emit datagram */ ++caddr_t+nlp_pcblist;+/* list of pcbs ++++for management ++++of connections */ +}; +.TE +\fR +.)b +.lp +The switch is based on the address family chosen when the +\fIsocket()\fR system call is made prior to connection establishment. +This unfortunately ties the address family to the domain, +but the only alternative is to add an argument to the \fIsocket()\fR +system call to let the user specify the desired network layer. +In the case of a connection oriented environment with no multi-homing, +it would be possible to determine which network layer is to be +used +from routing +information, but to do this requires unrealistic assumptions +about the environment. +For these reasons, linking the address family to the network +layer protocol is seen as the least of the evils. 
+The transport suffixes are kept in the network layer's pcb +as well as in the transport layer because +full transport address pairs are used to identify a connection +in the Internet domain. +.sh 1 "The Architecture of the Transport Protocol Entity" +.pp +A set of protocol hooks is required +by the AOS IPC architecture. +These hooks are used by the protocol-independent parts of the kernel +to gain entry to protocol-specific code. +The protocol code can be entered in one of the following ways: +.ip "1) " 5 +at boot time, when autoconfiguration +initializes each protocol through +the +\fIpr_init()\fR +hook, +.ip "2) " 5 +from above, either +a user program making a system call, through +the \fIpr_usrreq()\fR or \fIpr_ctloutput()\fR hooks, or +from a higher layer protocol using the +\fIpr_output()\fR hook, +.ip "3) " 5 +from below, a device interrupt servicing an incoming packet +through the \fIpr_input()\fR and \fIpr_ctlinput()\fR hooks, and +.ip "4) " 5 +from a clock interrupt through the \fIpr_slowtimo()\fR +or the +\fIpr_fasttimo()\fR hook. +.\" FIGURE +.so figs/trans_flow.nr +.\".so figs/trans_flow.grn +.pp +The protocol code can be divided into +the following modules, which are described in more detail below. +.CF +shows the flow of data and control +among these modules. +.in +5 +.ip "Timers and References:" 5 +The code executed on behalf of \fIpr_slowtimo()\fR. +The fast timeout is not used by TP. +.ip "Driver:" 5 +This is the finite state machine for TP. +.ip "Input: " 5 +This is the module that decodes incoming packets, +identifies or creates the pcb for which +the packet is destined, and creates an "event" to +pass to the driver. +.ip "Output:" 5 +This is the module that creates a packet header of a given type +with fields containing +values that are appropriate to the connection +on which the packet is being sent, appends data if necessary, +and hands a packet +to the lower layer, according to the transport-to-lower-layer +interface. 
+.ip "Send: " 5 +This module packetizes data from the outbound +socket buffer, \fIso_snd\fR, +handles retransmissions of packetized data, and +drops packetized data from the retransmission queue. +.ip "Receive:" 5 +This module reorders packets if necessary, +depacketizes data, passes it to the socket code module, +and determines when acknowledgments should be sent. +.in -5 +.sh 1 "Timers and References" +.pp +TP identifies sockets by \fIreference numbers\fR, or +\fIreferences\fR, +which are \*(lqfrozen\*(rq (may not be reassigned) +until some locally defined time after +a connection is broken and its protocol control block +is discarded. +An array of \fIreference blocks\fR is maintained by TP. +The reference number of a reference block is its +offset in the array. +When a reference block is in use it contains +a pointer to the pcb for the socket to which the +reference applies. +.pp +The system clock calls the \fIpr_slowtimo()\fR and +\fIpr_fasttimo()\fR hooks for each protocol in the protocol switch table +every 500 and 200 milliseconds, respectively. +Each protocol handles its own timers its own way. +The timers in TP take two forms +- those that typically are cancelled and +those that usually expire. +The latter form may have more than one instantiation at any given +time. +The former may not. +The two are implemented slightly +differently for the sake of performance. +.pp +The timers that normally expire +are kept in a queue, their values all relative +to the value of the preceding timer. +Thus all timer values are decremented by a single +operation on the value of the first timer. +The timer is represented by the Ecallout structure: +.(b +\fC +.TS +tab(+); +l s s s. +struct Ecallout { +.T& +l l l l. 
++int+c_time;+/* incremental time */ ++int+c_func;+/* function to call */ ++u_int+c_arg1;+/* argument to routine */ ++u_int+c_arg2;+/* argument to routine */ ++int+c_arg3;+/* argument to routine */ ++struct Ecallout+*c_next; +}; +.TE +\fR +.)b +.lp +When an Ecallout structure migrates to the head +of the E timer list, and its \fIc_time\fR +field is decremented to zero, +the function stored in \fIc_func\fR is +called, with \fIc_arg1, c_arg2\fR, and \fIc_arg3\fR +as arguments. +Setting and cancelling these timers +are accomplished by a linear search and one +insertion or deletion from the timer queue. +This queue is linked to the +reference block associated with a communication endpoint. +This form is used for the reference timer +and for the retransmission timers for data TPDUs. +.pp +The second form of timer, the type that +typically is cancelled, is used for several +timers - the inactivity timer, the sendack timer, +and the retransmission +timer for all types of TPDUs except data TPDUs. +.(b +\fC +.TS +tab(+); +l s s s. +struct Ccallout { +.T& +l l l l. ++int+c_time;+/* incremental time */ ++int+c_active;+/* this timer is active? */ +}; +.TE +\fR +.)b +.lp +All of these timers are stored +directly +in the reference block. +These timers are decremented in one linear scan of +the reference blocks. +Cancelling, setting, and both +cancelling and resetting one of these timers is accomplished by a +single assignment to an array element. +.sh 1 "Driver" +.pp +This is the finite state machine for TP. +A connection is managed by the finite state machine (fsm). +All events that pertain to a connection cause the +finite state machine driver to be called. +The driver takes two arguments - the pcb for the connection +and an event structure. +The event structure contains a field that discriminates +the different types of events, and a union of +structures that are specific to the event types. 
+The driver evaluates a set of predicates based on the current +state of the finite state machine (which is kept in the pcb) and the event type. +The result of the predicate evaluation determines +a set of actions to take and a state transition. +The driver takes the actions and if they complete +without errors, the driver makes the state transition. +.pp +The states, event types, predicates, actions, and state transitions are all +specified as a \fIxebec transition file\fR. +\fIXebec\fR is a utility that takes a human-readable description +of a finite state machine +and produces a set of tables and C source code for the driver. +The driver procedure is called \fItp_driver()\fR. +It is located in a file generated by xebec, +\fCtp_driver.c\fR. +For more details about xebec, see the manual page \fIxebec(1)\fR. +.pp +The transition file for TP is \fCtp.trans\fR, +and it is a good place to begin a perusal of the TP +source code. +.sh 1 "Input" +.pp +This is the module that decodes an incoming packet, +locates or creates the pcb for which +the packet is destined, and creates an event to +pass to the driver. +The network layer passes a packet up to the appropriate +transport layer by indirectly calling a transport input +routine through the protocol switch table for the network +domain. +There is one protocol switch entry for TP for each domain in which +TP will run (Internet, ISO). +In the Internet domain, the protocol switch field \fIpr_input()\fR +takes the value \fItpip_input()\fR. +This procedure accepts a packet from IP, with the IP header +still intact. +It extracts the network addresses from the IP header, +strips the IP header, and calls the domain-independent +input procedure for TP, +\fItp_input()\fR. +\fITp_input()\fR +decodes a TPDU. +The multitude of options, the variable-length +nature of the options, the semantics of the +options, and the possible combinations of concatenated +TPDUs make this a +complex procedure. 
+It is sensitive to changes, and from +the point of view of software maintenance, it is a +potential hazard. +Because it is in the +critical path of TP, however, some compromise +was made between maintainability and efficiency. +Multiple copies of sections of code were avoided as much as +possible, +not for the sake of saving space, but rather for the sake +of maintainability. +Ironically, +this detracts somewhat from the readability of the code. +.pp +Once a TPDU has been decoded and a pcb has been +identified for the TPDU, +the appropriate fields of the TPDU +are extracted and their values are placed in +an event structure. +Finally, \fItp_driver()\fR is called with +the event structure and the pcb as parameters. +.sh 1 "Output" +.pp +This module creates a TPDU header of a given type +with field values that are appropriate to the connection +on which the TPDU is being sent, appends data if necessary, +and hands a TPDU +to the lower layer according to the transport-to-lower-layer +interface. +Whenever a TPDU is to be sent to the peer or prospective peer, +the function \fItp_emit()\fR +is called, passing as arguments the pcb, a TPDU type, and several miscellaneous +other type-specific arguments, possibly including some data. +The data are in the form of an mbuf chain. +\fITp_emit()\fR prepends to the data an mbuf containing a TP header, +fills in the fields of the header according to the parameters +given, performs the checksum if appropriate, and +calls a domain-specific output routine. +For the Internet domain, this output routine is +\fItpip_output()\fR, which takes +as arguments the mbuf chain representing the TPDU, +and a network level pcb. +Some protocol errors cannot be associated with +a connection +but require that TP issue +an ER TPDU or a DR TPDU. +When these errors occur the routine +\fItp_error_emit()\fR is called. +This procedure creates the appropriate type of TPDU +and passes it to a domain-dependent routine for transmitting datagrams. 
+In the Internet domain, +\fItpip_output_dg()\fR is called. +This takes as arguments an mbuf chain representing the TPDU, +a source network address, and a destination network address. +.sh 1 "Send" +.\" FIGURE +.so figs/mbufsnd.nr +.\".so figs/mbufsnd.grn +.pp +This module packetizes data from the outbound +socket buffer, \fIso_snd\fR, +handles retransmissions of packetized data, and +drops packetized data from the retransmission queue. +The major routine in this module is \fItp_send()\fR, which +takes a range of sequence numbers as arguments. +For each sequence number in the range, +it packetizes an appropriate amount +of outbound data, and places the resulting TPDU on +a retransmission control queue subject to the +constraints imposed by the rules of expedited data, +maximum packet sizes, and end-of-TSDU markers. +.pp +The most complicating factor is that of managing +expedited data. +A normal datum may not be sent (for its first time) before the +acknowledgment of any expedited datum +that was received from the user after the +normal datum was received. +In order to enforce this rule, +each TPDU must be marked in some way +so that it will be known which expedited datum +must be delivered and acknowledged by the peer before this TPDU may be transmitted +for the first time. +Markers are placed in \fIso_snd\fR +when an +outgoing expedited datum arrives from the user. +A marker is an mbuf structure with an \fIm_len\fR +of zero, but with the data area nevertheless containing +the sequence number of an expedited data TPDU. +The \fIm_type\fR of a marker is a new type, MT_XPD. +.pp +\fITp_send()\fR stops packetizing data when it encounters a marker +for an unacknowledged expedited datum. +If it encounters a marker for an expedited TPDU that has already +been acknowledged, the marker is jettisoned. +.CF +illustrates the structure of the sending socket buffer used +for normal data. 
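The marker rule described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the ARGO code: `struct mini_mbuf`, its `m_xpd_seq` field, the function `sendable_octets()`, and the numeric values of the type constants are hypothetical stand-ins for the kernel's mbuf machinery, and sequence-number wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real mbuf layout. */
enum { MT_DATA = 1, MT_XPD = 2 };     /* numeric values are assumptions */

struct mini_mbuf {
    struct mini_mbuf *m_next;     /* next buffer in the send queue */
    int               m_type;     /* MT_DATA or MT_XPD (marker) */
    unsigned int      m_len;      /* octets of data; 0 for a marker */
    unsigned int      m_xpd_seq;  /* marker only: seq # of the XPD TPDU */
};

/*
 * Walk the send queue and count the octets of normal data that may be
 * packetized now.  xpd_una is the lowest unacknowledged expedited
 * sequence number, so a marker with m_xpd_seq < xpd_una has been
 * acknowledged.  Acknowledged markers are unlinked from the queue;
 * the walk stops at the first unacknowledged marker.
 */
size_t sendable_octets(struct mini_mbuf **headp, unsigned int xpd_una)
{
    size_t octets = 0;
    struct mini_mbuf **pp = headp;

    assert(headp != NULL);
    while (*pp != NULL) {
        struct mini_mbuf *m = *pp;

        if (m->m_type == MT_XPD) {
            if (m->m_xpd_seq >= xpd_una)
                break;            /* unacknowledged: stop here */
            *pp = m->m_next;      /* acknowledged: jettison the marker */
            m->m_next = NULL;
            continue;
        }
        octets += m->m_len;
        pp = &m->m_next;
    }
    return octets;
}
```

A caller would invoke `sendable_octets(&queue_head, xpd_una)` before packetizing. The sketch only unlinks acknowledged markers; freeing them is left to whoever owns the buffers.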
+.pp +When \fItp_send()\fR moves data from mbufs on \fIso_snd\fR to the retransmission +control queue, it needs to know +how many octets of data can be placed in each TPDU. +The appropriate amount depends on, among other things, +the maximum transmission unit of the network layer +on the route the packet will take. +To determine the maximum transmission unit, +TP queries the network layer through +the domain-dependent switch table's field, \fInlp_mtu\fR. +In the Internet domain, this resolves to \fItp_inmtu()\fR. +The header sizes for the network and transport layers +also affect the amount of data that can go into a packet, +and these sizes depend on the connection's characteristics. +.pp +Once the maximum amount of data per TPDU is determined, +\fItp_send()\fR can pull this amount off the \fIso_snd\fR queue to form +a TPDU, +assign a TPDU sequence number, +and place the new TPDU on the +retransmission control queue. +The retransmission control queue is a list of mbuf chains. +Each mbuf chain represents one TPDU, preceded by an +\fIrtc structure\fR: +.(b +\fC +.TS +tab(+); +l s s s. +struct tp_rtc { +.T& +l l l l. ++struct tp_rtc+*tprt_next;+/* next rtc struct in list */ ++SeqNum+tprt_seq;+/* seq # of this TPDU */ ++int+tprt_eot;+/* end of TSDU? */ ++int+tprt_octets;+/* # octets in this TPDU */ ++struct mbuf+*tprt_data;+/* ptr to the octets of data */ +.\"/* Performance measurement info: */ +.\"int tprt_window; /* in which call to tp_send() was +.\" * this TPDU formed? +.\" */ +.\"struct timeval tprt_sess_time; /* time session received the +.\" * majority of the data for this packet on send; +.\" * on recv, this is the time it's given to session +.\" */ +.\"struct timeval tprt_net_time; /* time first copy was given to net layer +.\" * on send; on receive it's the time received from +.\" * the network +.\" */ +}; +.TE +\fR +.)b +.lp +Once TPDUs are on the retransmission control queue, +they are retransmitted or dropped by the actions +of timers. 
+The procedure \fItp_sbdrop()\fR +removes the TPDUs from the retransmission queue. +It takes a sequence number as an argument and drops +all TPDUs up to and including the TPDU with that sequence number. +.pp +When an AK TPDU arrives, the values from +its credit and sequence number fields +are passed to \fItp_goodack()\fR, which +determines whether or not the AK brought any news with it, +and therefore whether TP can send more data +or expedited data. +If this AK acknowledges something heretofore unacknowledged, +\fItp_goodack()\fR drops the appropriate TPDU(s) from the retransmission +control list, computes the smoothed average round trip time +and standard deviation of the round trip time, +and updates +the retransmission timer based on these statistics. +It sets a flag in the pcb if the TP entity is obliged to +send the flow control confirmation parameter on its next +AK TPDU. +\fITp_goodack()\fR returns true if the AK brought some news with it, +either with respect to a change in credit or with respect to +new acknowledgments. +.pp +The function \fItp_goodXack()\fR is called when an XAK TPDU +arrives. +It takes the XAK sequence number as an argument and +determines if the XAK acknowledges the last XPD TPDU sent. +If so, it drops the expedited data from the outgoing +expedited data buffer. +By its definition in the TP specification, +the expedited data stream has a window +of size 1, +that is, +only one expedited datum (packet) can be buffered +at a time. +\fITp_goodXack()\fR returns true if the XAK acknowledged +the last XPD TPDU sent and the data were dropped, +and it returns false if the acknowledgment caused no action to be taken. +.\" NEXT FIGURE +.so figs/mbufrcv.nr +.\".so figs/mbufrcv.grn +.sh 1 "Receive" +.pp +This module reorders incoming TPDUs if necessary, +depacketizes data, passes it to the socket code module, +and determines when acknowledgments should be sent. 
+The function +\fItp_stash()\fR +takes a DT TPDU as an argument, and if the TPDU is not in +sequence, it saves the TPDU in a \fItp_rtc\fR structure in +a list, with the TPDUs +kept in order. +When the next expected TPDU arrives, the +list of out-of-order TPDUs is scanned for +more TPDUs in sequence, updating +a field in the pcb, \fItp_rcvnxt\fR, which +always contains the sequence +number of +the next expected TPDU. +If an acknowledgment is to be generated +at any time, the value of tp_rcvnxt goes into the +\fIYR-TU-NR\fR\** field of the acknowledgment TPDU. +.(f +\** +This is the name used in ISO 8073 for the field +which indicates the sequence number of the next expected DT TPDU. +.)f +.pp +\fITp_stash()\fR returns true if an acknowledgment needs to be generated +immediately, false if not. +The acknowledgment strategy is therefore implemented in this routine. +Acknowledgments may be generated for one or more of several reasons, +listed below. +\fITp_stash()\fR increments a counter for each of these reasons +for which an acknowledgment is generated, and a counter for TPDUs +that are not acknowledged immediately. +.ip "ACK_STRAT_EACH" 5 +The acknowledgment strategy in use calls for acknowledging each +data packet with an AK TPDU. +.ip "ACK_STRAT_FULLWIN" 5 +The acknowledgment strategy in use calls for acknowledging +upon receiving the DT TPDU that represents the upper window +edge of the last advertised window. +.ip "ACK_DUP" 5 +A duplicate data TPDU was received. +.ip "ACK_REORDER" 5 +A DT TPDU arrived in the window but out of order. +.ip "ACK_EOT" 5 +A DT TPDU arrived, and it had the end-of-TSDU flag set. +.pp +Upon receipt of a DT TPDU that is in order, and upon reordering +DT TPDUs, +\fItp_stash()\fR +places the TSDUs into the socket's receive +socket buffer, \fIso->so_rcv\fR in mbuf chains, with +TSDUs delimited by mbufs of the \fIm_type\fR MT_EOT, +which is a new type with the ARGO kernel. 
+.CF +illustrates the structure of the receiving socket buffer used +for normal data. +.pp +A separate socket buffer, \fItpcb->tp_Xrcv\fR, +is used for +buffering expedited data. +Only one expedited data packet may reside in this buffer at a time +because the TP standard limits the size of the window on expedited flow +to be 1. +This means the data structures are straightforward; +there is no need to distinguish between separate TSDUs in this socket buffer. +.pp +Credit is determined +by dividing the total amount of available +space in the receive buffer +by the negotiated maximum TPDU size. +TP can often offer a larger credit than this if it uses +an average of the measured actual TPDU sizes. +This strategy was once an option in the ARGO kernel, +but it was removed because unless the actual TPDU size +is constant, it leads to reneging of credit, +retransmissions, and decreased performance. +It does not work well when there is any fluctuation in the sizes +of TPDUs and it carries the penalty of lengthening the critical path +of the TP entity. +.sh 1 "Major Data Structures and Types" +.pp +In addition to the types commonly used in the kernel, +such as +.(b +\fC +.TS +tab(+); +l l l l. + +typedef+unsigned char+u_char; + +typedef+unsigned int+u_int; + +typedef+unsigned short+u_short; +.TE +\fR +.)b +TP uses the following types: +.(b +\fC +.TS +tab(+); +l l l l. + +typedef+unsigned int+SeqNum; + +typedef+unsigned short+RefNum; + +typedef+int+ProtoHook; +.TE +\fR +.)b +.pp +Sequence numbers can be either 7 or 31 bits. +An unsigned integer is used in all cases, and the proper type +of arithmetic is performed with bit masks. +Reference numbers are 16 bits. +ProtoHook is the type of the procedures that are in switch +tables, which, +although they are not functions, +are declared \fIint\fR rather than \fIvoid\fR +to be consistent with the rest of the kernel. 
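Masked sequence-number arithmetic of the kind described above might look like the following C sketch. It is illustrative only: `struct seq_space` and the functions `seq_space_init()`, `seq_add()`, `seq_sub()`, and `seq_gt()` are hypothetical names rather than the ARGO macros, and the code assumes that `unsigned int` is at least 32 bits wide.

```c
#include <assert.h>

typedef unsigned int SeqNum;   /* as in the text; assumed >= 32 bits */

/* Hypothetical per-connection description of the sequence space. */
struct seq_space {
    SeqNum mask;   /* which bits of a SeqNum are significant */
    SeqNum half;   /* half the size of the sequence space */
};

/* Choose the 7-bit (normal) or 31-bit (extended) sequence space. */
struct seq_space seq_space_init(int extended)
{
    struct seq_space s;

    assert(sizeof(SeqNum) >= 4);
    s.mask = extended ? 0x7fffffffu : 0x7fu;
    s.half = (s.mask >> 1) + 1;
    return s;
}

/* (a + n) modulo the size of the sequence space. */
SeqNum seq_add(const struct seq_space *s, SeqNum a, SeqNum n)
{
    return (a + n) & s->mask;
}

/* (a - b) modulo the size of the sequence space. */
SeqNum seq_sub(const struct seq_space *s, SeqNum a, SeqNum b)
{
    return (a - b) & s->mask;
}

/* Nonzero if a lies strictly after b, within half the space. */
int seq_gt(const struct seq_space *s, SeqNum a, SeqNum b)
{
    SeqNum d = seq_sub(s, a, b);

    return d != 0 && d < s->half;
}
```

Because the mask and half-space values are computed once, the same three operations serve both sequence-number formats without any per-call branching on the format in use.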
+.pp +The following structures are fundamental +types used throughout TP, +in addition to those already described in the +section, +"The Design of the Transport Entity". +.(b +\fC +.TS +tab(+); +l s s s. +struct tp_ref { +.T& +l l l l. ++u_char+tpr_state;+/* REF_FROZEN...*/ ++struct Ccallout+tpr_callout[N_CTIMERS];+/* C timers */ ++struct Ecallout+tpr_calltodo;+/* E timers list */ ++struct tp_pcb+*tpr_pcb;+/* --> PCB */ +}; +.TE +\fR +.)b +.lp +The reference structure is logically a part of the protocol +control block and it is linked to a pcb, but it may outlive +a pcb. +When a connection is dissolved, the pcb may be recycled +but the reference structure must remain until the reference +timer goes off. +The field \fItpr_state\fR takes the values +REF_FROZEN (a reference timer is ticking), +REF_OPEN (in use, has timers and an associated pcb), +REF_OPENING (has a pcb but no timers), and +REF_FREE (free to reallocate). +.pp +The TP protocol control block is too large to fit into +one mbuf structure so it comprises two structures +linked together, the +\fItp_pcb\fR structure and the +\fItp_pcb_aux\fR structure. +The \fItp_pcb_aux\fR structure contains +items that are used less frequently than those in +the former structure, since each access to these +items requires a second pointer dereference. +.(b +\fC +.TS +tab(+); +l s s s. +struct tp_pcb_aux { +.T& +l l l s. 
+ +struct sockbuf+tpa_Xsnd;+/* for expedited data */ ++struct sockbuf+tpa_Xrcv;+/* for expedited data */ ++u_char +tpa_vers;+/* protocol version */ ++u_char +tpa_peer_acktime;+/* to compute DT TPDU ++++retrans timer value */ ++SeqNum+tpa_Xsndnxt;+/* seq # of ++++next XPD to send */ ++SeqNum+tpa_Xuna;+/* seq # of ++++unacked XPD */ ++SeqNum+tpa_Xrcvnxt;+/* next XPD seq # ++++expect to recv */ ++/* addressing */ ++u_short+tpa_domain;+/* domain AF_ISO,...*/ ++u_short+tpa_fsuffixlen;+/* foreign suffix */ ++u_char+tpa_fsuffix[MAX_TSAP_SEL_LEN];+ ++u_short+tpa_lsuffixlen;+/* local suffix */ ++u_char+tpa_lsuffix[MAX_TSAP_SEL_LEN];+ +.T& +l s s s. + +/* AK subsequencing */ +.T& +l l l s. + +u_short+tpa_s_subseq;+/* next subseq to send */ ++u_short+tpa_r_subseq;+/* highest recv subseq */ +}; +.TE +\fR +.)b +.pp +The major portion of the protocol control block is in the +\fItp_pcb\fR structure: +.(b +\fC +.TS +tab(%); +l s s s. +struct tp_pcb { +.\" *************************************** +.T& +l l l l. +.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3 + % % % +.\"456789 123456789- 123456789 123456-789 123456789 1234567890 +.\" + %struct tp_ref%*tp_refp;% +.T& +l l l s. +%%/* reference structure */% +.\" *************************************** +.T& +l l l l. + %struct tp_pcb_aux%*tp_aux;% +.T& +l l l s. + %%/*rest of tpcb (auxiliary struct)*/% +.\" *************************************** +.T& +l l l l. + %caddr_t%tp_npcb;%/* to ll pcb */ +%struct nl_protosw%*tp_nlproto;% +.T& +l l l s. + % %/* domain-dependent routines */% +.\" *************************************** +.T& +l l l l. + %struct socket%*tp_sock;%/* back ptr */ +.\" *************************************** +.T& +l s s s. + +/* local and foreign reference numbers: */ +.T& +l l l l. + %RefNum%tp_lref;% +%RefNum%tp_fref;% +.\" *************************************** +.T& +l s s s. 
+.\"456789 123456789 123456789 123456789 123456789 1234567890 + +/* Stuff for sequence space arithmetic: + * Maintaining 2 sequence spaces is a pain so we set these + * values once at connection establishment time. Sequence + * number arithmetic is a set of macros which uses these. + * Sequence numbers are stored as 32 bits. + * tp_seqmask tells which of the 32 bits is used. + * tp_seqbit is the lsb that is not used. When set, + * it indicates wraparound has occurred. + * tp_seqhalf is the value that is half the sequence space. + * (or half plus one). + */ +.T& +l l l l. +%u_int%tp_seqmask;%/* mask */ +%u_int%tp_seqbit;%/* wraparound */ +%u_int%tp_seqhalf;%/* half space */ +.\" *************************************** +.T& +l s s s. + +/* flags: values are defined in tp_user.h. + * Here we keep such info as which options + * are in use: checksum, extended format, + * flow control in class 2, etc. + * See tp(4p) man page. + */ +.\" *************************************** +.T& +l l l l. + %u_short%tp_state;%/* fsm */ +%short%tp_retrans;% +.T& +l l l s. + % % /* # times to retransmit */% +.\" *************************************** +.T& +l s s s. + +/* credit & sequencing info for SENDING: */ +.T& +l l l s. + %u_short%tp_fcredit;% + % %/* remote real window */% + %u_short%tp_cong_win;% + % %/* remote congestion window */% +.\" *************************************** +%SeqNum%tp_snduna;% +.T& +l l l s. + % %/* seq # of lowest unacked DT */% +.\" *************************************** +.T& +l l l l. + %struct tp_rtc %*tp_snduna_rtc;% +.T& +l l l s. + % %/* ptr to mbufs containing lowest% +%% * unacked TPDUs sent so far% +%% */% +.\" *************************************** +.T& +l l l l. + %SeqNum%tp_sndhiwat;% +.T& +l l l s.
+ % %/* ptr to mbufs containing the last% +%% * DT sent - this is the last item % +%% * on the list that starts% +%% * at tp_snduna_rtc% +%% */% +.\" *************************************** +.T& +l l l l. + %int %tp_Nwindow;%/* for perf. measmt */ +.\" *************************************** +.T& +l s s s. + +/* credit & sequencing info for RECEIVING: */ +.\" *************************************** +.T& +l l l s. + %SeqNum%tp_sent_lcdt;% + %%/* cdt according to last AK sent */% + %SeqNum%tp_sent_uwe;% + % %/* upper window edge, according to% +%% * the last AK sent % +%% */% + %SeqNum%tp_sent_rcvnxt;% + % %/* rcvnxt, according to% +%% * the last AK sent% +%% */% +.\" *************************************** +.T& +l l l l. + %short%tp_lcredit;%/* local */ +.\" *************************************** +.T& +l l l l. + %SeqNum%tp_rcvnxt;% +.T& +l l l s. + % %/* next DT seq# we expect to recv */% +.\" *************************************** +.T& +l l l l. + %struct tp_rtc%*tp_rcvnxt_rtc;% +.T& +l l l s. + % %/* ptr to mbufs containing unacked % +%% * DTs received out of order, and % +%% * which we haven't acknowledged% +%% */% +.\" *************************************** +.TE +.TS +tab(%); +l s s s. +/* Items kept in the aux structure: */ + +.\" *************************************** +.T& +l s s l. +#define tp_vers%tp_aux->tpa_vers +#define tp_peer_acktime%tp_aux->tpa_peer_acktime +#define tp_Xsnd%tp_aux->tpa_Xsnd +#define tp_Xrcv%tp_aux->tpa_Xrcv +#define tp_Xrcvnxt%tp_aux->tpa_Xrcvnxt +#define tp_Xsndnxt%tp_aux->tpa_Xsndnxt +#define tp_Xuna%tp_aux->tpa_Xuna +#define tp_domain%tp_aux->tpa_domain +#define tp_fsuffixlen%tp_aux->tpa_fsuffixlen +#define tp_fsuffix%tp_aux->tpa_fsuffix +#define tp_lsuffixlen%tp_aux->tpa_lsuffixlen +#define tp_lsuffix%tp_aux->tpa_lsuffix +#define tp_s_subseq%tp_aux->tpa_s_subseq +#define tp_r_subseq%tp_aux->tpa_r_subseq +.\" *************************************** +.T& +l s s s.
+ % % % +/* parameters per-connection controllable by user: */ +.\" *************************************** +.T& +l l l l. + %struct%tp_conn_param%_tp_param; + % % % +.\" *************************************** +.T& +l s s l. +#define tp_Nretrans%_tp_param.p_Nretrans +#define tp_dr_ticks%_tp_param.p_dr_ticks +#define tp_cc_ticks%_tp_param.p_cc_ticks +#define tp_dt_ticks%_tp_param.p_dt_ticks +#define tp_xpd_ticks%_tp_param.p_x_ticks +#define tp_cr_ticks%_tp_param.p_cr_ticks +#define tp_keepalive_ticks%_tp_param.p_keepalive_ticks +#define tp_sendack_ticks%_tp_param.p_sendack_ticks +#define tp_refer_ticks%_tp_param.p_ref_ticks +#define tp_inact_ticks%_tp_param.p_inact_ticks +#define tp_xtd_format%_tp_param.p_xtd_format +#define tp_xpd_service%_tp_param.p_xpd_service +#define tp_ack_strat%_tp_param.p_ack_strat +#define tp_rx_strat%_tp_param.p_rx_strat +#define tp_use_checksum%_tp_param.p_use_checksum +#define tp_tpdusize%_tp_param.p_tpdusize +#define tp_class%_tp_param.p_class +#define tp_winsize%_tp_param.p_winsize +#define tp_netservice%_tp_param.p_netservice +#define tp_no_disc_indications%_tp_param.p_no_disc_indications +#define tp_dont_change_params%_tp_param.p_dont_change_params +.\" *************************************** +.TE +.\" *************************************** +.\" *************************************** +.\" *************************************** +.TS +tab(%); +l l l l. +.\" The next line sets the spacing for the table: 1+3 17+3 17+3 13+3 +.\"456789 123456789- 123456789 123456-789 123456789 1234567890 +.\" +.T& +l l l s. + %%/* log2(the negotiated max size) */% +.T& +l l l l. + %int%tp_l_tpdusize;%/* # bytes */ +.\" *************************************** + %struct timeval%tp_rtt;% +.T& +l l l s. 
+ % %/* smoothed avg round-trip time */% + %struct timeval%tp_rtv;% + % %/* std deviation of round-trip time */% +%struct timeval%tp_rttemit[ TP_RTT_NUM + 1 ];% +%%/* times that the last TP_RTT_NUM % +%% * DT_TPDUs were transmitted % +%% */% +.\" *************************************** + %unsigned % % +% tp_sendfcc:1,%/* shall next ack % +% %include flow control conf. param? */% +.\" *************************************** +.T& +l l l s. + % tp_trace:1,%/* is this pcb being traced?% +%% * (not used yet) % +%% */% +.\" *************************************** +% tp_perf_on:1,%/* statistics being kept? */% +.\" *************************************** +% tp_reneged:1,%/* have we reneged on credit% +%% * since the last AK TPDU was sent? % +%% */% +% tp_decbit:4,%/* congestion experienced? */% +% tp_flags:8,%/* see #defines below */% +.\" *************************************** +% tp_unused:16;%% +.T& +l s s l. +#define TPF_XPD_PRESENT%TPFLAG_XPD_PRESENT +#define TPF_NLQOS_PDN%TPFLAG_NLQOS_PDN +#define TPF_PEER_ON_SAMENET%TPFLAG_PEER_ON_SAMENET +%%% +.\" *************************************** +.T& +l l l l. + %struct tp_pmeas%*tp_p_meas;% +.T& +l l l s. + % %/* ptr to mbuf to hold the perf.% +%% * statistics structure % +%% */% +.\" *************************************** +}; +.TE +\fR +.\" +.\" end of tpcb structure (thank you) +.\" +.)b +.fi +.sh 1 "Sequence Number Arithmetic" +.pp +Sequence numbers in TP can be either 7 bits +(\*(lqnormal format\*(rq) +or 31 bits +(\*(lqextended format\*(rq). +Sequence numbers are unsigned integers, +regardless of their format. +Three fields are kept in the pcb to manage the sequence +number arithmetic: +.(b +\fC +.TS +tab(+); +l l l l. + +u_int+tp_seqmask;+/* mask for seq space */ + +u_int+tp_seqbit;+/* bit for seq # wraparound */ + +u_int+tp_seqhalf;+/* half the seq space */ +.TE +\fR +.)b +.lp +\fITp_seqmask\fR +is a bit mask indicating which bits are legitimate +for a sequence number of either format. 
+It takes the value 0x7f if 7-bit sequence numbers are in use, +and 0x7fffffff if 31-bit sequence numbers are in use. +\fITp_seqbit\fR +is the bit that becomes set when a sequence number wraps around +while being incremented. +Its value is 0x80 for normal format, 0x80000000 for extended format. +\fITp_seqhalf\fR +takes the value which is in the middle of the sequence space, +0x40 for normal format, +and +0x40000000 for extended format. +.(b +.nf +The macro +.fi +\fC +.TS +tab(+); +l l l l. + SEQ(tpcb, x) +.TE +\fR +.)b +.lp +extracts a sequence number from the location +in which it is stored. +.pp +The macros +.(b +\fC +.TS +tab(+); +l l s s l. + +SEQ_GT(tpcb, seq, t)+is seq > t? + +SEQ_GEQ(tpcb, seq, t)+is seq >= t? + +SEQ_LT(tpcb, seq, t)+is seq < t? + +SEQ_LEQ(tpcb, seq, t)+is seq <= t? + +SEQ_INC(tpcb, seq)+seq\+\+ + +SEQ_DEC(tpcb, seq)+seq-- + +SEQ_SUB(tpcb, seq, amt)+seq -= amt + +SEQ_ADD(tpcb, seq, amt)+seq \+= amt +.TE +\fR +.)b +.lp +perform the indicated comparisons and arithmetic +on their arguments. +.pp +An example of how these macros +are used is as follows. +To determine if a sequence +number \fIseq\fR is in a receive window +bounded by +\fIlwe\fR and \fIuwe\fR, +we define the +macro +.(b +\fC +.TS +tab(+); +l l. +#define+IN_RWINDOW(tpcb, seq, lwe, uwe)\\ ++( SEQ_GEQ(tpcb, seq, lwe) && SEQ_LT(tpcb, seq, uwe) ) +.TE +\fR +.)b +.sh 1 "TP Implementation Options" +.pp +The transport protocol specification leaves several +things to the discretion of the implementor, +some of which may affect the performance +of individual connections and +aggregate performance. +Wherever different strategies are likely to favor +the performance of +individual connections to the detriment of aggregate performance +or vice versa, the +various strategies are under the control of options via the +\fIgetsockopt()\fR and +\fIsetsockopt()\fR system calls (see the manual pages +\fIgetsockopt(2)\fR, +\fIsetsockopt(2)\fR +and +\fItp(4p)\fR +for details). 
+In some cases the preferred strategies differ for the different +subnetworks, so the strategies chosen will be determined +by the subnetwork in use. +.sh 2 "TPDU size" +.pp +The limitation of the maximum TPDU size to a power of two is +unfortunate in the LAN environment. +For example, if the maximum NSDU size is around 1500, as in the case of an +Ethernet, +using a maximum TPDU size of 1024 reduces +the possible throughput by approximately 30%. +TP negotiates a maximum TPDU size of 2048 and +generates TPDUs of size around 1500. +Obviously this works well only when the peer is known to be +using the same scheme (so that the peer +doesn't send TPDUs of size 2048 and cause its +network layer to fragment the TPDUs). +This is likely to be the case in a LAN where +all protocol entities are under the same administrative +control. +The maximum TPDU size negotiated is under the control of the user, +so +it is possible to prevent this scheme from being used +by default +when the peer is not on the same LAN, by +setting the \fItp.tpdusize\fR parameter in the ARGO directory service +file to +something less than the network's maximum transmission +unit. +.\"*********************************************************** +.sh 2 "Congestion Window Strategy" +.pp +The congestion window strategy from the +DoD Internet +was adapted for use with TP. +The strategy is intended to minimize the +adverse effect +of transport's retransmission on an +already congested network. +.pp +A TP entity keeps two notions of the peer's window: +the real window, which is that advertised by the peer +in AK TPDUs, and the congestion window, which is a locally +controlled window. +TP uses the smaller of the two windows when transmitting. +The congestion window starts small, which keeps a +new connection from overloading the network with a sudden +burst of packets +immediately after connection establishment. +This is called \fIslow start\fR.
+For each successful acknowledgment received, the congestion +window grows by one, until eventually the real window +is the one in use. +If a retransmission timer expires, the congestion window +is reset to size one. +.pp +The congestion window strategy is used for class 4 unless +the transport user requests that it not be used. +The slow start strategy is used for traffic over a PDN +unless +the transport user requests that it not be used. +Slow start is not used for traffic over a LAN unless +its use is requested by the transport user. +.\"*********************************************************** +.sh 2 "Retransmission strategies" +.pp +A retransmission timer is invoked for each set of DT TPDUs +sent in one send operation (call to \fItp_send()\fR). +This set of packets is called the \fIsend window\fR for the purpose +of this discussion. +.pp +The number of TPDUs +in a send window +depends on the remote credit and the amount of data +in the local send buffers. +When a retransmission timer goes off, the lower +window edge +is reevaluated but the upper window edge is not reevaluated. +.pp +There are several retransmission strategies implemented in +ARGO TP. +The choice of strategies is the user's, and is made with the +\fIsetsockopt()\fR system call. +The strategies are summarized here: +.ip "Retransmit LWE TPDU only:" 5 +Only the TPDU representing the new lower window edge +is retransmitted. +This is the default retransmission strategy. +.ip "Retransmit whole send window:" 5 +Retransmission begins with the new lower window edge +and continues up to the old upper window edge. +.pp +The value of the data retransmission timer +adapts to the average round trip time and the standard deviation of +the round trip time. +A round trip time is the time that passes between +the moment of a packet's first transmission and +the moment it is first acknowledged.
+The average round trip time +is kept by the sending side of TP, using +a formula for +smoothing the average: +.(b +\fC +.TS +tab(+); +l l l l. +#define+TP_RTT_ALPHA+3 +#define+TP_RTV_ALPHA+2 ++++ +#define+SMOOTH(alpha, old, new) \\ ++(((new-old) >> alpha ) \+ (old) ) +.TE +\fR +.)b +.lp +The times included in the average are chosen as follows. +The time of +each packet's initial transmission is kept (for the last +\fIN\fR packets, where \fIN\fR is a defined constant). +When an AK TPDU arrives, ARGO TP subtracts the initial transmission +time for the lowest unacknowledged sequence number that was +acknowledged by this AK TPDU from the current time, +and applies the resulting time to the average. +Hence, not all packets are included in this average, +which is as it should be since +the purpose of this measurement is +to find a good value for the retransmission timer. +.pp +Each time part of a window is retransmitted, +the retransmission timer for that window is increased. +This does not affect the retransmission timers for other windows. +.\"*********************************************************** +.sh 2 "Acknowledgment strategies" +.pp +The transport protocol specification +requires acknowledgments to be sent immediately +upon receipt +of CC TPDUs (in class 4), XPD TPDUs, and DT TPDUs containing an +EOT marker, and at other times as required for flow control; +otherwise acknowledgments may be delayed. +In addition to the times when an acknowledgment is required, +ARGO TP transmits an AK TPDU whenever the user receives some data, +thereby increasing the size of the window. +For those times when +immediate acknowledgment is optional, +ARGO TP offers two acknowledgment strategies: +.ip " Acknowledge each TPDU" 10 +Upon receipt of a DT TPDU an AK TPDU is sent. +.ip " Acknowledge full window" 10 +Acknowledgment is issued +upon receipt of enough data to +consume the last advertised credit.
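The choice between the two acknowledgment strategies listed above can be sketched in C as follows. This is only an illustration under stated assumptions: the enumeration, the structure, and should_ack_now() are hypothetical names rather than ARGO identifiers, and credit is counted in whole DT TPDUs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encodings; the text names the strategies but not their values. */
enum ack_strategy { ACK_EACH_TPDU, ACK_FULL_WINDOW };

/* Hypothetical receive-side bookkeeping, counted in DT TPDUs. */
struct rcv_state {
    unsigned credit_advertised; /* credit carried by the last AK sent */
    unsigned dt_since_ack;      /* DT TPDUs received since that AK */
};

/*
 * Record the arrival of one DT TPDU and report whether an AK TPDU
 * should be sent now.  A DT TPDU carrying EOT is always acknowledged
 * at once, as the specification requires.  The caller is assumed to
 * reset dt_since_ack whenever it actually sends an AK.
 */
static bool should_ack_now(enum ack_strategy s, struct rcv_state *r, bool eot)
{
    assert(r != 0);
    r->dt_since_ack++;
    if (eot)
        return true;
    if (s == ACK_EACH_TPDU)
        return true;
    /* ACK_FULL_WINDOW: wait until the advertised credit is consumed. */
    return r->dt_since_ack >= r->credit_advertised;
}
```

With an advertised credit of three, the full-window strategy stays silent for the first two DT TPDUs and acknowledges on the third, while the per-TPDU strategy acknowledges every one.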
+.pp +The latter strategy +requires a timer to trigger an acknowledgment +in case the peer doesn't send the entire window +quickly. +This timer is called the +\fIsendack timer\fR. +The upper bound on the value of this timer +is called the \fIlocal acknowledgment time\fR. +The local acknowledgment time may be "advertised" to the +peer during connection establishment, and the +peer may choose to use this value to +adjust its retransmission timers. +The ARGO TP entity advertises its local acknowledgment time +on a CR TPDU, but it is not +constrained by +the remote acknowledgment time, should the peer +advertise it. +Instead, +ARGO TP adapts its sendack timer +to the behavior of the connection. +.pp +Under the assumption that the round trip time is +often +symmetric, +and lacking +a method to measure +the round trip time in the other direction, +ARGO TP uses the measured average round trip time +to adjust the sendack timer. +.pp +The choice of strategies is made with the +\fIsetsockopt()\fR system call. +The default strategy is +to +delay acknowledgments until the most recently advertised window is filled.
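The measured average round trip time used for the sendack timer is the one maintained with the SMOOTH macro in the section on retransmission strategies. A minimal sketch of how that average moves as samples arrive is given below. It assumes times held as integer clock ticks (the pcb itself stores struct timeval), and smooth_ticks() and update_rtt() are hypothetical names. Division stands in for the right shift so that a negative difference is well defined in portable C. How the sendack timer is derived from the average is not stated in the text, so that step is not shown.

```c
#include <assert.h>

#define TP_RTT_ALPHA 3  /* as in the text: each sample moves the average by 1/8 */

/*
 * Hypothetical helper with the same shape as SMOOTH(alpha, old, new):
 * old + (new - old) / 2^alpha.  Integer division truncates toward
 * zero, whereas an arithmetic right shift would round down.
 */
static int smooth_ticks(int alpha, int old_avg, int sample)
{
    assert(alpha >= 0 && alpha < 16);
    return old_avg + (sample - old_avg) / (1 << alpha);
}

/* Fold one measured round trip time into the running average. */
static int update_rtt(int avg_rtt, int measured_rtt)
{
    return smooth_ticks(TP_RTT_ALPHA, avg_rtt, measured_rtt);
}
```

Starting from an average of 100 ticks, a sample of 180 ticks raises the average to 110 and a sample of 20 ticks lowers it to 90, so a single unusual measurement moves the timer only slightly.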