diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/Changes | 1 | ||||
-rw-r--r-- | Documentation/SubmittingPatches | 5 | ||||
-rw-r--r-- | Documentation/dontdiff | 1 | ||||
-rw-r--r-- | Documentation/fb/vesafb.txt | 16 | ||||
-rw-r--r-- | Documentation/infiniband/core_locking.txt | 114 | ||||
-rw-r--r-- | Documentation/infiniband/user_mad.txt | 53 | ||||
-rw-r--r-- | Documentation/kprobes.txt | 588 | ||||
-rw-r--r-- | Documentation/networking/bonding.txt | 978 | ||||
-rw-r--r-- | Documentation/pcmcia/driver-changes.txt | 9 | ||||
-rw-r--r-- | Documentation/sound/alsa/ALSA-Configuration.txt | 44 | ||||
-rw-r--r-- | Documentation/stable_api_nonsense.txt | 2 | ||||
-rw-r--r-- | Documentation/stable_kernel_rules.txt | 58 | ||||
-rw-r--r-- | Documentation/usb/usbmon.txt | 2 | ||||
-rw-r--r-- | Documentation/video4linux/CARDLIST.cx88 | 1 | ||||
-rw-r--r-- | Documentation/video4linux/CARDLIST.tuner | 2 | ||||
-rw-r--r-- | Documentation/video4linux/bttv/Insmod-options | 3 | ||||
-rw-r--r-- | Documentation/x86_64/boot-options.txt | 15 |
17 files changed, 1547 insertions, 345 deletions
diff --git a/Documentation/Changes b/Documentation/Changes index dfec756..5eaab04 100644 --- a/Documentation/Changes +++ b/Documentation/Changes @@ -65,6 +65,7 @@ o isdn4k-utils 3.1pre1 # isdnctrl 2>&1|grep version o nfs-utils 1.0.5 # showmount --version o procps 3.2.0 # ps --version o oprofile 0.9 # oprofiled --version +o udev 058 # udevinfo -V Kernel compilation ================== diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches index 6761a7b..7f43b04 100644 --- a/Documentation/SubmittingPatches +++ b/Documentation/SubmittingPatches @@ -149,6 +149,11 @@ USB, framebuffer devices, the VFS, the SCSI subsystem, etc. See the MAINTAINERS file for a mailing list that relates specifically to your change. +If changes affect userland-kernel interfaces, please send +the MAN-PAGES maintainer (as listed in the MAINTAINERS file) +a man-pages patch, or at least a notification of the change, +so that some information makes its way into the manual pages. + Even if the maintainer did not respond in step #4, make sure to ALWAYS copy the maintainer when you change their code. diff --git a/Documentation/dontdiff b/Documentation/dontdiff index b974cf5..96bea27 100644 --- a/Documentation/dontdiff +++ b/Documentation/dontdiff @@ -104,6 +104,7 @@ logo_*.c logo_*_clut224.c logo_*_mono.c lxdialog +mach-types mach-types.h make_times_h map diff --git a/Documentation/fb/vesafb.txt b/Documentation/fb/vesafb.txt index 814e2f5..62db675 100644 --- a/Documentation/fb/vesafb.txt +++ b/Documentation/fb/vesafb.txt @@ -144,7 +144,21 @@ vgapal Use the standard vga registers for palette changes. This is the default. pmipal Use the protected mode interface for palette changes. -mtrr setup memory type range registers for the vesafb framebuffer. +mtrr:n setup memory type range registers for the vesafb framebuffer + where n: + 0 - disabled (equivalent to nomtrr) + 1 - uncachable + 2 - write-back + 3 - write-combining (default) + 4 - write-through + + If you see the following in dmesg, choose the type that matches the + old one. In this example, use "mtrr:2". +... +mtrr: type mismatch for e0000000,8000000 old: write-back new: write-combining +... + +nomtrr disable mtrr vremap:n remap 'n' MiB of video RAM. If 0 or not specified, remap memory diff --git a/Documentation/infiniband/core_locking.txt b/Documentation/infiniband/core_locking.txt new file mode 100644 index 0000000..e167854 --- /dev/null +++ b/Documentation/infiniband/core_locking.txt @@ -0,0 +1,114 @@ +INFINIBAND MIDLAYER LOCKING + + This guide is an attempt to make explicit the locking assumptions + made by the InfiniBand midlayer. It describes the requirements on + both low-level drivers that sit below the midlayer and upper level + protocols that use the midlayer. + +Sleeping and interrupt context + + With the following exceptions, a low-level driver implementation of + all of the methods in struct ib_device may sleep. The exceptions + are any methods from the list: + + create_ah + modify_ah + query_ah + destroy_ah + bind_mw + post_send + post_recv + poll_cq + req_notify_cq + map_phys_fmr + + which may not sleep and must be callable from any context. + + The corresponding functions exported to upper level protocol + consumers: + + ib_create_ah + ib_modify_ah + ib_query_ah + ib_destroy_ah + ib_bind_mw + ib_post_send + ib_post_recv + ib_req_notify_cq + ib_map_phys_fmr + + are therefore safe to call from any context. + + In addition, the function + + ib_dispatch_event + + used by low-level drivers to dispatch asynchronous events through + the midlayer is also safe to call from any context. + +Reentrancy + + All of the methods in struct ib_device exported by a low-level + driver must be fully reentrant. The low-level driver is required to + perform all synchronization necessary to maintain consistency, even + if multiple function calls using the same object are run + simultaneously. + + The IB midlayer does not perform any serialization of function calls. + + Because low-level drivers are reentrant, upper level protocol + consumers are not required to perform any serialization. However, + some serialization may be required to get sensible results. For + example, a consumer may safely call ib_poll_cq() on multiple CPUs + simultaneously. However, the ordering of the work completion + information between different calls of ib_poll_cq() is not defined. + +Callbacks + + A low-level driver must not perform a callback directly from the + same callchain as an ib_device method call. For example, it is not + allowed for a low-level driver to call a consumer's completion event + handler directly from its post_send method. Instead, the low-level + driver should defer this callback by, for example, scheduling a + tasklet to perform the callback. + + The low-level driver is responsible for ensuring that multiple + completion event handlers for the same CQ are not called + simultaneously. The driver must guarantee that only one CQ event + handler for a given CQ is running at a time. In other words, the + following situation is not allowed: + + CPU1 CPU2 + + low-level driver -> + consumer CQ event callback: + /* ... */ + ib_req_notify_cq(cq, ...); + low-level driver -> + /* ... */ consumer CQ event callback: + /* ... */ + return from CQ event handler + + The context in which completion event and asynchronous event + callbacks run is not defined. Depending on the low-level driver, it + may be process context, softirq context, or interrupt context. + Upper level protocol consumers may not sleep in a callback. + +Hot-plug + + A low-level driver announces that a device is ready for use by + consumers when it calls ib_register_device(), all initialization + must be complete before this call. The device must remain usable + until the driver's call to ib_unregister_device() has returned. + + A low-level driver must call ib_register_device() and + ib_unregister_device() from process context. It must not hold any + semaphores that could cause deadlock if a consumer calls back into + the driver across these calls. + + An upper level protocol consumer may begin using an IB device as + soon as the add method of its struct ib_client is called for that + device. A consumer must finish all cleanup and free all resources + relating to a device before returning from the remove method. + + A consumer is permitted to sleep in its add and remove methods. diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt index cae0c83..750fe5e 100644 --- a/Documentation/infiniband/user_mad.txt +++ b/Documentation/infiniband/user_mad.txt @@ -28,13 +28,37 @@ Creating MAD agents Receiving MADs - MADs are received using read(). The buffer passed to read() must be - large enough to hold at least one struct ib_user_mad. For example: - - struct ib_user_mad mad; - ret = read(fd, &mad, sizeof mad); - if (ret != sizeof mad) + MADs are received using read(). The receive side now supports + RMPP. The buffer passed to read() must be at least one + struct ib_user_mad + 256 bytes. For example: + + If the buffer passed is not large enough to hold the received + MAD (RMPP), the errno is set to ENOSPC and the length of the + buffer needed is set in mad.length. + + Example for normal MAD (non RMPP) reads: + struct ib_user_mad *mad; + mad = malloc(sizeof *mad + 256); + ret = read(fd, mad, sizeof *mad + 256); + if (ret != sizeof mad + 256) { + perror("read"); + free(mad); + } + + Example for RMPP reads: + struct ib_user_mad *mad; + mad = malloc(sizeof *mad + 256); + ret = read(fd, mad, sizeof *mad + 256); + if (ret == -ENOSPC)) { + length = mad.length; + free(mad); + mad = malloc(sizeof *mad + length); + ret = read(fd, mad, sizeof *mad + length); + } + if (ret < 0) { perror("read"); + free(mad); + } In addition to the actual MAD contents, the other struct ib_user_mad fields will be filled in with information on the received MAD. For @@ -50,18 +74,21 @@ Sending MADs MADs are sent using write(). The agent ID for sending should be filled into the id field of the MAD, the destination LID should be - filled into the lid field, and so on. For example: + filled into the lid field, and so on. The send side does support + RMPP so arbitrary length MAD can be sent. For example: + + struct ib_user_mad *mad; - struct ib_user_mad mad; + mad = malloc(sizeof *mad + mad_length); - /* fill in mad.data */ + /* fill in mad->data */ - mad.id = my_agent; /* req.id from agent registration */ - mad.lid = my_dest; /* in network byte order... */ + mad->hdr.id = my_agent; /* req.id from agent registration */ + mad->hdr.lid = my_dest; /* in network byte order... */ /* etc. */ - ret = write(fd, &mad, sizeof mad); - if (ret != sizeof mad) + ret = write(fd, &mad, sizeof *mad + mad_length); + if (ret != sizeof *mad + mad_length) perror("write"); Setting IsSM Capability Bit diff --git a/Documentation/kprobes.txt b/Documentation/kprobes.txt new file mode 100644 index 0000000..0541fe1 --- /dev/null +++ b/Documentation/kprobes.txt @@ -0,0 +1,588 @@ +Title : Kernel Probes (Kprobes) +Authors : Jim Keniston <jkenisto@us.ibm.com> + : Prasanna S Panchamukhi <prasanna@in.ibm.com> + +CONTENTS + +1. Concepts: Kprobes, Jprobes, Return Probes +2. Architectures Supported +3. Configuring Kprobes +4. API Reference +5. Kprobes Features and Limitations +6. Probe Overhead +7. TODO +8. Kprobes Example +9. Jprobes Example +10. Kretprobes Example + +1. Concepts: Kprobes, Jprobes, Return Probes + +Kprobes enables you to dynamically break into any kernel routine and +collect debugging and performance information non-disruptively. You +can trap at almost any kernel code address, specifying a handler +routine to be invoked when the breakpoint is hit. + +There are currently three types of probes: kprobes, jprobes, and +kretprobes (also called return probes). A kprobe can be inserted +on virtually any instruction in the kernel. A jprobe is inserted at +the entry to a kernel function, and provides convenient access to the +function's arguments. A return probe fires when a specified function +returns. + +In the typical case, Kprobes-based instrumentation is packaged as +a kernel module. The module's init function installs ("registers") +one or more probes, and the exit function unregisters them. A +registration function such as register_kprobe() specifies where +the probe is to be inserted and what handler is to be called when +the probe is hit. + +The next three subsections explain how the different types of +probes work. They explain certain things that you'll need to +know in order to make the best use of Kprobes -- e.g., the +difference between a pre_handler and a post_handler, and how +to use the maxactive and nmissed fields of a kretprobe. But +if you're in a hurry to start using Kprobes, you can skip ahead +to section 2. + +1.1 How Does a Kprobe Work? + +When a kprobe is registered, Kprobes makes a copy of the probed +instruction and replaces the first byte(s) of the probed instruction +with a breakpoint instruction (e.g., int3 on i386 and x86_64). + +When a CPU hits the breakpoint instruction, a trap occurs, the CPU's +registers are saved, and control passes to Kprobes via the +notifier_call_chain mechanism. Kprobes executes the "pre_handler" +associated with the kprobe, passing the handler the addresses of the +kprobe struct and the saved registers. + +Next, Kprobes single-steps its copy of the probed instruction. +(It would be simpler to single-step the actual instruction in place, +but then Kprobes would have to temporarily remove the breakpoint +instruction. This would open a small time window when another CPU +could sail right past the probepoint.) + +After the instruction is single-stepped, Kprobes executes the +"post_handler," if any, that is associated with the kprobe. +Execution then continues with the instruction following the probepoint. + +1.2 How Does a Jprobe Work? + +A jprobe is implemented using a kprobe that is placed on a function's +entry point. It employs a simple mirroring principle to allow +seamless access to the probed function's arguments. The jprobe +handler routine should have the same signature (arg list and return +type) as the function being probed, and must always end by calling +the Kprobes function jprobe_return(). + +Here's how it works. When the probe is hit, Kprobes makes a copy of +the saved registers and a generous portion of the stack (see below). +Kprobes then points the saved instruction pointer at the jprobe's +handler routine, and returns from the trap. As a result, control +passes to the handler, which is presented with the same register and +stack contents as the probed function. When it is done, the handler +calls jprobe_return(), which traps again to restore the original stack +contents and processor state and switch to the probed function. + +By convention, the callee owns its arguments, so gcc may produce code +that unexpectedly modifies that portion of the stack. This is why +Kprobes saves a copy of the stack and restores it after the jprobe +handler has run. Up to MAX_STACK_SIZE bytes are copied -- e.g., +64 bytes on i386. + +Note that the probed function's args may be passed on the stack +or in registers (e.g., for x86_64 or for an i386 fastcall function). +The jprobe will work in either case, so long as the handler's +prototype matches that of the probed function. + +1.3 How Does a Return Probe Work? + +When you call register_kretprobe(), Kprobes establishes a kprobe at +the entry to the function. When the probed function is called and this +probe is hit, Kprobes saves a copy of the return address, and replaces +the return address with the address of a "trampoline." The trampoline +is an arbitrary piece of code -- typically just a nop instruction. +At boot time, Kprobes registers a kprobe at the trampoline. + +When the probed function executes its return instruction, control +passes to the trampoline and that probe is hit. Kprobes' trampoline +handler calls the user-specified handler associated with the kretprobe, +then sets the saved instruction pointer to the saved return address, +and that's where execution resumes upon return from the trap. + +While the probed function is executing, its return address is +stored in an object of type kretprobe_instance. Before calling +register_kretprobe(), the user sets the maxactive field of the +kretprobe struct to specify how many instances of the specified +function can be probed simultaneously. register_kretprobe() +pre-allocates the indicated number of kretprobe_instance objects. + +For example, if the function is non-recursive and is called with a +spinlock held, maxactive = 1 should be enough. If the function is +non-recursive and can never relinquish the CPU (e.g., via a semaphore +or preemption), NR_CPUS should be enough. If maxactive <= 0, it is +set to a default value. If CONFIG_PREEMPT is enabled, the default +is max(10, 2*NR_CPUS). Otherwise, the default is NR_CPUS. + +It's not a disaster if you set maxactive too low; you'll just miss +some probes. In the kretprobe struct, the nmissed field is set to +zero when the return probe is registered, and is incremented every +time the probed function is entered but there is no kretprobe_instance +object available for establishing the return probe. + +2. Architectures Supported + +Kprobes, jprobes, and return probes are implemented on the following +architectures: + +- i386 +- x86_64 (AMD-64, E64MT) +- ppc64 +- ia64 (Support for probes on certain instruction types is still in progress.) +- sparc64 (Return probes not yet implemented.) + +3. Configuring Kprobes + +When configuring the kernel using make menuconfig/xconfig/oldconfig, +ensure that CONFIG_KPROBES is set to "y". Under "Kernel hacking", +look for "Kprobes". You may have to enable "Kernel debugging" +(CONFIG_DEBUG_KERNEL) before you can enable Kprobes. + +You may also want to ensure that CONFIG_KALLSYMS and perhaps even +CONFIG_KALLSYMS_ALL are set to "y", since kallsyms_lookup_name() +is a handy, version-independent way to find a function's address. + +If you need to insert a probe in the middle of a function, you may find +it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO), +so you can use "objdump -d -l vmlinux" to see the source-to-object +code mapping. + +4. API Reference + +The Kprobes API includes a "register" function and an "unregister" +function for each type of probe. Here are terse, mini-man-page +specifications for these functions and the associated probe handlers +that you'll write. See the latter half of this document for examples. + +4.1 register_kprobe + +#include <linux/kprobes.h> +int register_kprobe(struct kprobe *kp); + +Sets a breakpoint at the address kp->addr. When the breakpoint is +hit, Kprobes calls kp->pre_handler. After the probed instruction +is single-stepped, Kprobe calls kp->post_handler. If a fault +occurs during execution of kp->pre_handler or kp->post_handler, +or during single-stepping of the probed instruction, Kprobes calls +kp->fault_handler. Any or all handlers can be NULL. + +register_kprobe() returns 0 on success, or a negative errno otherwise. + +User's pre-handler (kp->pre_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int pre_handler(struct kprobe *p, struct pt_regs *regs); + +Called with p pointing to the kprobe associated with the breakpoint, +and regs pointing to the struct containing the registers saved when +the breakpoint was hit. Return 0 here unless you're a Kprobes geek. + +User's post-handler (kp->post_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +void post_handler(struct kprobe *p, struct pt_regs *regs, + unsigned long flags); + +p and regs are as described for the pre_handler. flags always seems +to be zero. + +User's fault-handler (kp->fault_handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int fault_handler(struct kprobe *p, struct pt_regs *regs, int trapnr); + +p and regs are as described for the pre_handler. trapnr is the +architecture-specific trap number associated with the fault (e.g., +on i386, 13 for a general protection fault or 14 for a page fault). +Returns 1 if it successfully handled the exception. + +4.2 register_jprobe + +#include <linux/kprobes.h> +int register_jprobe(struct jprobe *jp) + +Sets a breakpoint at the address jp->kp.addr, which must be the address +of the first instruction of a function. When the breakpoint is hit, +Kprobes runs the handler whose address is jp->entry. + +The handler should have the same arg list and return type as the probed +function; and just before it returns, it must call jprobe_return(). +(The handler never actually returns, since jprobe_return() returns +control to Kprobes.) If the probed function is declared asmlinkage, +fastcall, or anything else that affects how args are passed, the +handler's declaration must match. + +register_jprobe() returns 0 on success, or a negative errno otherwise. + +4.3 register_kretprobe + +#include <linux/kprobes.h> +int register_kretprobe(struct kretprobe *rp); + +Establishes a return probe for the function whose address is +rp->kp.addr. When that function returns, Kprobes calls rp->handler. +You must set rp->maxactive appropriately before you call +register_kretprobe(); see "How Does a Return Probe Work?" for details. + +register_kretprobe() returns 0 on success, or a negative errno +otherwise. + +User's return-probe handler (rp->handler): +#include <linux/kprobes.h> +#include <linux/ptrace.h> +int kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs); + +regs is as described for kprobe.pre_handler. ri points to the +kretprobe_instance object, of which the following fields may be +of interest: +- ret_addr: the return address +- rp: points to the corresponding kretprobe object +- task: points to the corresponding task struct +The handler's return value is currently ignored. + +4.4 unregister_*probe + +#include <linux/kprobes.h> +void unregister_kprobe(struct kprobe *kp); +void unregister_jprobe(struct jprobe *jp); +void unregister_kretprobe(struct kretprobe *rp); + +Removes the specified probe. The unregister function can be called +at any time after the probe has been registered. + +5. Kprobes Features and Limitations + +As of Linux v2.6.12, Kprobes allows multiple probes at the same +address. Currently, however, there cannot be multiple jprobes on +the same function at the same time. + +In general, you can install a probe anywhere in the kernel. +In particular, you can probe interrupt handlers. Known exceptions +are discussed in this section. + +For obvious reasons, it's a bad idea to install a probe in +the code that implements Kprobes (mostly kernel/kprobes.c and +arch/*/kernel/kprobes.c). A patch in the v2.6.13 timeframe instructs +Kprobes to reject such requests. + +If you install a probe in an inline-able function, Kprobes makes +no attempt to chase down all inline instances of the function and +install probes there. gcc may inline a function without being asked, +so keep this in mind if you're not seeing the probe hits you expect. + +A probe handler can modify the environment of the probed function +-- e.g., by modifying kernel data structures, or by modifying the +contents of the pt_regs struct (which are restored to the registers +upon return from the breakpoint). So Kprobes can be used, for example, +to install a bug fix or to inject faults for testing. Kprobes, of +course, has no way to distinguish the deliberately injected faults +from the accidental ones. Don't drink and probe. + +Kprobes makes no attempt to prevent probe handlers from stepping on +each other -- e.g., probing printk() and then calling printk() from a +probe handler. As of Linux v2.6.12, if a probe handler hits a probe, +that second probe's handlers won't be run in that instance. + +In Linux v2.6.12 and previous versions, Kprobes' data structures are +protected by a single lock that is held during probe registration and +unregistration and while handlers are run. Thus, no two handlers +can run simultaneously. To improve scalability on SMP systems, +this restriction will probably be removed soon, in which case +multiple handlers (or multiple instances of the same handler) may +run concurrently on different CPUs. Code your handlers accordingly. + +Kprobes does not use semaphores or allocate memory except during +registration and unregistration. + +Probe handlers are run with preemption disabled. Depending on the +architecture, handlers may also run with interrupts disabled. In any +case, your handler should not yield the CPU (e.g., by attempting to +acquire a semaphore). + +Since a return probe is implemented by replacing the return +address with the trampoline's address, stack backtraces and calls +to __builtin_return_address() will typically yield the trampoline's +address instead of the real return address for kretprobed functions. +(As far as we can tell, __builtin_return_address() is used only +for instrumentation and error reporting.) + +If the number of times a function is called does not match the +number of times it returns, registering a return probe on that +function may produce undesirable results. We have the do_exit() +and do_execve() cases covered. do_fork() is not an issue. We're +unaware of other specific cases where this could be a problem. + +6. Probe Overhead + +On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0 +microseconds to process. Specifically, a benchmark that hits the same +probepoint repeatedly, firing a simple handler each time, reports 1-2 +million hits per second, depending on the architecture. A jprobe or +return-probe hit typically takes 50-75% longer than a kprobe hit. +When you have a return probe set on a function, adding a kprobe at +the entry to that function adds essentially no overhead. + +Here are sample overhead figures (in usec) for different architectures. +k = kprobe; j = jprobe; r = return probe; kr = kprobe + return probe +on same function; jr = jprobe + return probe on same function + +i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips +k = 0.57 usec; j = 1.00; r = 0.92; kr = 0.99; jr = 1.40 + +x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips +k = 0.49 usec; j = 0.76; r = 0.80; kr = 0.82; jr = 1.07 + +ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU) +k = 0.77 usec; j = 1.31; r = 1.26; kr = 1.45; jr = 1.99 + +7. TODO + +a. SystemTap (http://sourceware.org/systemtap): Work in progress +to provide a simplified programming interface for probe-based +instrumentation. +b. Improved SMP scalability: Currently, work is in progress to handle +multiple kprobes in parallel. +c. Kernel return probes for sparc64. +d. Support for other architectures. +e. User-space probes. + +8. Kprobes Example + +Here's a sample kernel module showing the use of kprobes to dump a +stack trace and selected i386 registers when do_fork() is called. +----- cut here ----- +/*kprobe_example.c*/ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> +#include <linux/sched.h> + +/*For each probe you need to allocate a kprobe structure*/ +static struct kprobe kp; + +/*kprobe pre_handler: called just before the probed instruction is executed*/ +int handler_pre(struct kprobe *p, struct pt_regs *regs) +{ + printk("pre_handler: p->addr=0x%p, eip=%lx, eflags=0x%lx\n", + p->addr, regs->eip, regs->eflags); + dump_stack(); + return 0; +} + +/*kprobe post_handler: called after the probed instruction is executed*/ +void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags) +{ + printk("post_handler: p->addr=0x%p, eflags=0x%lx\n", + p->addr, regs->eflags); +} + +/* fault_handler: this is called if an exception is generated for any + * instruction within the pre- or post-handler, or when Kprobes + * single-steps the probed instruction. + */ +int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr) +{ + printk("fault_handler: p->addr=0x%p, trap #%dn", + p->addr, trapnr); + /* Return 0 because we don't handle the fault. */ + return 0; +} + +int init_module(void) +{ + int ret; + kp.pre_handler = handler_pre; + kp.post_handler = handler_post; + kp.fault_handler = handler_fault; + kp.addr = (kprobe_opcode_t*) kallsyms_lookup_name("do_fork"); + /* register the kprobe now */ + if (!kp.addr) { + printk("Couldn't find %s to plant kprobe\n", "do_fork"); + return -1; + } + if ((ret = register_kprobe(&kp) < 0)) { + printk("register_kprobe failed, returned %d\n", ret); + return -1; + } + printk("kprobe registered\n"); + return 0; +} + +void cleanup_module(void) +{ + unregister_kprobe(&kp); + printk("kprobe unregistered\n"); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +You can build the kernel module, kprobe-example.ko, using the following +Makefile: +----- cut here ----- +obj-m := kprobe-example.o +KDIR := /lib/modules/$(shell uname -r)/build +PWD := $(shell pwd) +default: + $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules +clean: + rm -f *.mod.c *.ko *.o +----- cut here ----- + +$ make +$ su - +... +# insmod kprobe-example.ko + +You will see the trace data in /var/log/messages and on the console +whenever do_fork() is invoked to create a new process. + +9. Jprobes Example + +Here's a sample kernel module showing the use of jprobes to dump +the arguments of do_fork(). +----- cut here ----- +/*jprobe-example.c */ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/fs.h> +#include <linux/uio.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> + +/* + * Jumper probe for do_fork. + * Mirror principle enables access to arguments of the probed routine + * from the probe handler. + */ + +/* Proxy routine having the same arguments as actual do_fork() routine */ +long jdo_fork(unsigned long clone_flags, unsigned long stack_start, + struct pt_regs *regs, unsigned long stack_size, + int __user * parent_tidptr, int __user * child_tidptr) +{ + printk("jprobe: clone_flags=0x%lx, stack_size=0x%lx, regs=0x%p\n", + clone_flags, stack_size, regs); + /* Always end with a call to jprobe_return(). */ + jprobe_return(); + /*NOTREACHED*/ + return 0; +} + +static struct jprobe my_jprobe = { + .entry = (kprobe_opcode_t *) jdo_fork +}; + +int init_module(void) +{ + int ret; + my_jprobe.kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name("do_fork"); + if (!my_jprobe.kp.addr) { + printk("Couldn't find %s to plant jprobe\n", "do_fork"); + return -1; + } + + if ((ret = register_jprobe(&my_jprobe)) <0) { + printk("register_jprobe failed, returned %d\n", ret); + return -1; + } + printk("Planted jprobe at %p, handler addr %p\n", + my_jprobe.kp.addr, my_jprobe.entry); + return 0; +} + +void cleanup_module(void) +{ + unregister_jprobe(&my_jprobe); + printk("jprobe unregistered\n"); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +Build and insert the kernel module as shown in the above kprobe +example. You will see the trace data in /var/log/messages and on +the console whenever do_fork() is invoked to create a new process. +(Some messages may be suppressed if syslogd is configured to +eliminate duplicate messages.) + +10. Kretprobes Example + +Here's a sample kernel module showing the use of return probes to +report failed calls to sys_open(). +----- cut here ----- +/*kretprobe-example.c*/ +#include <linux/kernel.h> +#include <linux/module.h> +#include <linux/kprobes.h> +#include <linux/kallsyms.h> + +static const char *probed_func = "sys_open"; + +/* Return-probe handler: If the probed function fails, log the return value. */ +static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs) +{ + // Substitute the appropriate register name for your architecture -- + // e.g., regs->rax for x86_64, regs->gpr[3] for ppc64. + int retval = (int) regs->eax; + if (retval < 0) { + printk("%s returns %d\n", probed_func, retval); + } + return 0; +} + +static struct kretprobe my_kretprobe = { + .handler = ret_handler, + /* Probe up to 20 instances concurrently. */ + .maxactive = 20 +}; + +int init_module(void) +{ + int ret; + my_kretprobe.kp.addr = + (kprobe_opcode_t *) kallsyms_lookup_name(probed_func); + if (!my_kretprobe.kp.addr) { + printk("Couldn't find %s to plant return probe\n", probed_func); + return -1; + } + if ((ret = register_kretprobe(&my_kretprobe)) < 0) { + printk("register_kretprobe failed, returned %d\n", ret); + return -1; + } + printk("Planted return probe at %p\n", my_kretprobe.kp.addr); + return 0; +} + +void cleanup_module(void) +{ + unregister_kretprobe(&my_kretprobe); + printk("kretprobe unregistered\n"); + /* nmissed > 0 suggests that maxactive was set too low. */ + printk("Missed probing %d instances of %s\n", + my_kretprobe.nmissed, probed_func); +} + +MODULE_LICENSE("GPL"); +----- cut here ----- + +Build and insert the kernel module as shown in the above kprobe +example. You will see the trace data in /var/log/messages and on the +console whenever sys_open() returns a negative value. (Some messages +may be suppressed if syslogd is configured to eliminate duplicate +messages.) + +For additional information on Kprobes, refer to the following URLs: +http://www-106.ibm.com/developerworks/library/l-kprobes.html?ca=dgr-lnxw42Kprobe +http://www.redhat.com/magazine/005mar05/features/kprobes/ diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 0bc2ed1..24d0294 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1,5 +1,7 @@ - Linux Ethernet Bonding Driver HOWTO + Linux Ethernet Bonding Driver HOWTO + + Latest update: 21 June 2005 Initial release : Thomas Davis <tadavis at lbl.gov> Corrections, HA extensions : 2000/10/03-15 : @@ -11,15 +13,22 @@ Corrections, HA extensions : 2000/10/03-15 : Reorganized and updated Feb 2005 by Jay Vosburgh -Note : ------- +Introduction +============ + + The Linux bonding driver provides a method for aggregating +multiple network interfaces into a single logical "bonded" interface. +The behavior of the bonded interfaces depends upon the mode; generally +speaking, modes provide either hot standby or load balancing services. +Additionally, link integrity monitoring may be performed. -The bonding driver originally came from Donald Becker's beowulf patches for -kernel 2.0. It has changed quite a bit since, and the original tools from -extreme-linux and beowulf sites will not work with this version of the driver. + The bonding driver originally came from Donald Becker's +beowulf patches for kernel 2.0. It has changed quite a bit since, and +the original tools from extreme-linux and beowulf sites will not work +with this version of the driver. -For new versions of the driver, patches for older kernels and the updated -userspace tools, please follow the links at the end of this file. + For new versions of the driver, updated userspace tools, and +who to ask for help, please follow the links at the end of this file. Table of Contents ================= @@ -30,9 +39,13 @@ Table of Contents 3. Configuring Bonding Devices 3.1 Configuration with sysconfig support +3.1.1 Using DHCP with sysconfig +3.1.2 Configuring Multiple Bonds with sysconfig 3.2 Configuration with initscripts support +3.2.1 Using DHCP with initscripts +3.2.2 Configuring Multiple Bonds with initscripts 3.3 Configuring Bonding Manually -3.4 Configuring Multiple Bonds +3.3.1 Configuring Multiple Bonds Manually 5. Querying Bonding Configuration 5.1 Bonding Configuration @@ -56,21 +69,30 @@ Table of Contents 11. Promiscuous mode -12. High Availability Information +12. Configuring Bonding for High Availability 12.1 High Availability in a Single Switch Topology -12.1.1 Bonding Mode Selection for Single Switch Topology -12.1.2 Link Monitoring for Single Switch Topology 12.2 High Availability in a Multiple Switch Topology -12.2.1 Bonding Mode Selection for Multiple Switch Topology -12.2.2 Link Monitoring for Multiple Switch Topology -12.3 Switch Behavior Issues for High Availability +12.2.1 HA Bonding Mode Selection for Multiple Switch Topology +12.2.2 HA Link Monitoring for Multiple Switch Topology + +13. Configuring Bonding for Maximum Throughput +13.1 Maximum Throughput in a Single Switch Topology +13.1.1 MT Bonding Mode Selection for Single Switch Topology +13.1.2 MT Link Monitoring for Single Switch Topology +13.2 Maximum Throughput in a Multiple Switch Topology +13.2.1 MT Bonding Mode Selection for Multiple Switch Topology +13.2.2 MT Link Monitoring for Multiple Switch Topology -13. Hardware Specific Considerations -13.1 IBM BladeCenter +14. Switch Behavior Issues +14.1 Link Establishment and Failover Delays +14.2 Duplicated Incoming Packets -14. Frequently Asked Questions +15. Hardware Specific Considerations +15.1 IBM BladeCenter -15. Resources and Links +16. Frequently Asked Questions + +17. Resources and Links 1. Bonding Driver Installation @@ -86,16 +108,10 @@ the following steps: 1.1 Configure and build the kernel with bonding ----------------------------------------------- - The latest version of the bonding driver is available in the + The current version of the bonding driver is available in the drivers/net/bonding subdirectory of the most recent kernel source -(which is available on http://kernel.org). - - Prior to the 2.4.11 kernel, the bonding driver was maintained -largely outside the kernel tree; patches for some earlier kernels are -available on the bonding sourceforge site, although those patches are -still several years out of date. Most users will want to use either -the most recent kernel from kernel.org or whatever kernel came with -their distro. +(which is available on http://kernel.org). Most users "rolling their +own" will want to use the most recent kernel from kernel.org. Configure kernel with "make menuconfig" (or "make xconfig" or "make config"), then select "Bonding driver support" in the "Network @@ -103,8 +119,8 @@ device support" section. It is recommended that you configure the driver as module since it is currently the only way to pass parameters to the driver or configure more than one bonding device. - Build and install the new kernel and modules, then proceed to -step 2. + Build and install the new kernel and modules, then continue +below to install ifenslave. 1.2 Install ifenslave Control Utility ------------------------------------- @@ -147,9 +163,9 @@ default kernel source include directory. Options for the bonding driver are supplied as parameters to the bonding module at load time. They may be given as command line arguments to the insmod or modprobe command, but are usually specified -in either the /etc/modprobe.conf configuration file, or in a -distro-specific configuration file (some of which are detailed in the -next section). +in either the /etc/modules.conf or /etc/modprobe.conf configuration +file, or in a distro-specific configuration file (some of which are +detailed in the next section). The available bonding driver parameters are listed below. If a parameter is not specified the default value is used. When initially @@ -162,34 +178,34 @@ degradation will occur during link failures. Very few devices do not support at least miimon, so there is really no reason not to use it. Options with textual values will accept either the text name - or, for backwards compatibility, the option value. E.g., - "mode=802.3ad" and "mode=4" set the same mode. +or, for backwards compatibility, the option value. E.g., +"mode=802.3ad" and "mode=4" set the same mode. The parameters are as follows: arp_interval - Specifies the ARP monitoring frequency in milli-seconds. If - ARP monitoring is used in a load-balancing mode (mode 0 or 2), - the switch should be configured in a mode that evenly - distributes packets across all links - such as round-robin. If - the switch is configured to distribute the packets in an XOR + Specifies the ARP link monitoring frequency in milliseconds. + If ARP monitoring is used in an etherchannel compatible mode + (modes 0 and 2), the switch should be configured in a mode + that evenly distributes packets across all links. If the + switch is configured to distribute the packets in an XOR fashion, all replies from the ARP targets will be received on the same link which could cause the other team members to - fail. ARP monitoring should not be used in conjunction with - miimon. A value of 0 disables ARP monitoring. The default + fail. ARP monitoring should not be used in conjunction with + miimon. A value of 0 disables ARP monitoring. The default value is 0. arp_ip_target - Specifies the ip addresses to use when arp_interval is > 0. - These are the targets of the ARP request sent to determine the - health of the link to the targets. Specify these values in - ddd.ddd.ddd.ddd format. Multiple ip adresses must be - seperated by a comma. At least one IP address must be given - for ARP monitoring to function. The maximum number of targets - that can be specified is 16. The default value is no IP - addresses. + Specifies the IP addresses to use as ARP monitoring peers when + arp_interval is > 0. These are the targets of the ARP request + sent to determine the health of the link to the targets. + Specify these values in ddd.ddd.ddd.ddd format. Multiple IP + addresses must be separated by a comma. At least one IP + address must be given for ARP monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IP addresses. downdelay @@ -207,11 +223,13 @@ lacp_rate are: slow or 0 - Request partner to transmit LACPDUs every 30 seconds (default) + Request partner to transmit LACPDUs every 30 seconds fast or 1 Request partner to transmit LACPDUs every 1 second + The default is slow. + max_bonds Specifies the number of bonding devices to create for this @@ -221,10 +239,11 @@ max_bonds miimon - Specifies the frequency in milli-seconds that MII link - monitoring will occur. A value of zero disables MII link - monitoring. A value of 100 is a good starting point. The - use_carrier option, below, affects how the link state is + Specifies the MII link monitoring frequency in milliseconds. + This determines how often the link state of each slave is + inspected for link failures. A value of zero disables MII + link monitoring. A value of 100 is a good starting point. + The use_carrier option, below, affects how the link state is determined. See the High Availability section for additional information. The default value is 0. @@ -246,17 +265,31 @@ mode active. A different slave becomes active if, and only if, the active slave fails. The bond's MAC address is externally visible on only one port (network adapter) - to avoid confusing the switch. This mode provides - fault tolerance. The primary option affects the - behavior of this mode. + to avoid confusing the switch. + + In bonding version 2.6.2 or later, when a failover + occurs in active-backup mode, bonding will issue one + or more gratuitous ARPs on the newly active slave. + One gratutious ARP is issued for the bonding master + interface and each VLAN interfaces configured above + it, provided that the interface has at least one IP + address configured. Gratuitous ARPs issued for VLAN + interfaces are tagged with the appropriate VLAN id. + + This mode provides fault tolerance. The primary + option, documented below, affects the behavior of this + mode. balance-xor or 2 - XOR policy: Transmit based on [(source MAC address - XOR'd with destination MAC address) modulo slave - count]. This selects the same slave for each - destination MAC address. This mode provides load - balancing and fault tolerance. + XOR policy: Transmit based on the selected transmit + hash policy. The default policy is a simple [(source + MAC address XOR'd with destination MAC address) modulo + slave count]. Alternate transmit policies may be + selected via the xmit_hash_policy option, described + below. + + This mode provides load balancing and fault tolerance. broadcast or 3 @@ -270,7 +303,17 @@ mode duplex settings. Utilizes all slaves in the active aggregator according to the 802.3ad specification. - Pre-requisites: + Slave selection for outgoing traffic is done according + to the transmit hash policy, which may be changed from + the default simple XOR policy via the xmit_hash_policy + option, documented below. Note that not all transmit + policies may be 802.3ad compliant, particularly in + regards to the packet mis-ordering requirements of + section 43.2.4 of the 802.3ad standard. Differing + peer implementations will have varying tolerances for + noncompliance. + + Prerequisites: 1. Ethtool support in the base drivers for retrieving the speed and duplex of each slave. @@ -333,7 +376,7 @@ mode When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all - active slaves in the bond by intiating ARP Replies + active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's @@ -396,6 +439,60 @@ use_carrier 0 will use the deprecated MII / ETHTOOL ioctls. The default value is 1. +xmit_hash_policy + + Selects the transmit hash policy to use for slave selection in + balance-xor and 802.3ad modes. Possible values are: + + layer2 + + Uses XOR of hardware MAC addresses to generate the + hash. The formula is + + (source MAC XOR destination MAC) modulo slave count + + This algorithm will place all traffic to a particular + network peer on the same slave. + + This algorithm is 802.3ad compliant. + + layer3+4 + + This policy uses upper layer protocol information, + when available, to generate the hash. This allows for + traffic to a particular network peer to span multiple + slaves, although a single connection will not span + multiple slaves. + + The formula for unfragmented TCP and UDP packets is + + ((source port XOR dest port) XOR + ((source IP XOR dest IP) AND 0xffff) + modulo slave count + + For fragmented TCP or UDP packets and all other IP + protocol traffic, the source and destination port + information is omitted. For non-IP traffic, the + formula is the same as for the layer2 transmit hash + policy. + + This policy is intended to mimic the behavior of + certain switches, notably Cisco switches with PFC2 as + well as some Foundry and IBM products. + + This algorithm is not fully 802.3ad compliant. A + single TCP or UDP conversation containing both + fragmented and unfragmented packets will see packets + striped across two interfaces. This may result in out + of order delivery. Most traffic types will not meet + this criteria, as TCP rarely fragments traffic, and + most UDP traffic is not involved in extended + conversations. Other implementations of 802.3ad may + or may not tolerate this noncompliance. + + The default value is layer2. This option was added in bonding +version 2.6.3. In earlier versions of bonding, this parameter does +not exist, and the layer2 policy is the only policy. 3. Configuring Bonding Devices @@ -448,8 +545,9 @@ Bonding devices can be managed by hand, however, as follows. slave devices. On SLES 9, this is most easily done by running the yast2 sysconfig configuration utility. The goal is for to create an ifcfg-id file for each slave device. The simplest way to accomplish -this is to configure the devices for DHCP. The name of the -configuration file for each device will be of the form: +this is to configure the devices for DHCP (this is only to get the +file ifcfg-id file created; see below for some issues with DHCP). The +name of the configuration file for each device will be of the form: ifcfg-id-xx:xx:xx:xx:xx:xx @@ -459,7 +557,7 @@ the device's permanent MAC address. Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been created, it is necessary to edit the configuration files for the slave devices (the MAC addresses correspond to those of the slave devices). -Before editing, the file will contain muliple lines, and will look +Before editing, the file will contain multiple lines, and will look something like this: BOOTPROTO='dhcp' @@ -496,16 +594,11 @@ STARTMODE="onboot" BONDING_MASTER="yes" BONDING_MODULE_OPTS="mode=active-backup miimon=100" BONDING_SLAVE0="eth0" -BONDING_SLAVE1="eth1" +BONDING_SLAVE1="bus-pci-0000:06:08.1" Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK values with the appropriate values for your network. - Note that configuring the bonding device with BOOTPROTO='dhcp' -does not work; the scripts attempt to obtain the device address from -DHCP prior to adding any of the slave devices. Without active slaves, -the DHCP requests are not sent to the network. - The STARTMODE specifies when the device is brought online. The possible values are: @@ -531,9 +624,17 @@ for the bonding mode, link monitoring, and so on here. Do not include the max_bonds bonding parameter; this will confuse the configuration system if you have multiple bonding devices. - Finally, supply one BONDING_SLAVEn="ethX" for each slave, -where "n" is an increasing value, one for each slave, and "ethX" is -the name of the slave device (eth0, eth1, etc). + Finally, supply one BONDING_SLAVEn="slave device" for each +slave. where "n" is an increasing value, one for each slave. The +"slave device" is either an interface name, e.g., "eth0", or a device +specifier for the network device. The interface name is easier to +find, but the ethN names are subject to change at boot time if, e.g., +a device early in the sequence has failed. The device specifiers +(bus-pci-0000:06:08.1 in the example above) specify the physical +network device, and will not change unless the device's bus location +changes (for example, it is moved from one PCI slot to another). The +example above uses one of each type for demonstration purposes; most +configurations will choose one or the other for all slave devices. When all configuration files have been modified or created, networking must be restarted for the configuration changes to take @@ -544,7 +645,7 @@ effect. This can be accomplished via the following: Note that the network control script (/sbin/ifdown) will remove the bonding module as part of the network shutdown processing, so it is not necessary to remove the module by hand if, e.g., the -module paramters have changed. +module parameters have changed. Also, at this writing, YaST/YaST2 will not manage bonding devices (they do not show bonding interfaces on its list of network @@ -559,12 +660,37 @@ format can be found in an example ifcfg template file: Note that the template does not document the various BONDING_ settings described above, but does describe many of the other options. +3.1.1 Using DHCP with sysconfig +------------------------------- + + Under sysconfig, configuring a device with BOOTPROTO='dhcp' +will cause it to query DHCP for its IP address information. At this +writing, this does not function for bonding devices; the scripts +attempt to obtain the device address from DHCP prior to adding any of +the slave devices. Without active slaves, the DHCP requests are not +sent to the network. + +3.1.2 Configuring Multiple Bonds with sysconfig +----------------------------------------------- + + The sysconfig network initialization system is capable of +handling multiple bonding devices. All that is necessary is for each +bonding instance to have an appropriately configured ifcfg-bondX file +(as described above). Do not specify the "max_bonds" parameter to any +instance of bonding, as this will confuse sysconfig. If you require +multiple bonding devices with identical parameters, create multiple +ifcfg-bondX files. + + Because the sysconfig scripts supply the bonding module +options in the ifcfg-bondX file, it is not necessary to add them to +the system /etc/modules.conf or /etc/modprobe.conf configuration file. + 3.2 Configuration with initscripts support ------------------------------------------ This section applies to distros using a version of initscripts with bonding support, for example, Red Hat Linux 9 or Red Hat -Enterprise Linux version 3. On these systems, the network +Enterprise Linux version 3 or 4. On these systems, the network initialization scripts have some knowledge of bonding, and can be configured to control bonding devices. @@ -614,10 +740,11 @@ USERCTL=no Be sure to change the networking specific lines (IPADDR, NETMASK, NETWORK and BROADCAST) to match your network configuration. - Finally, it is necessary to edit /etc/modules.conf to load the -bonding module when the bond0 interface is brought up. The following -sample lines in /etc/modules.conf will load the bonding module, and -select its options: + Finally, it is necessary to edit /etc/modules.conf (or +/etc/modprobe.conf, depending upon your distro) to load the bonding +module with your desired options when the bond0 interface is brought +up. The following lines in /etc/modules.conf (or modprobe.conf) will +load the bonding module, and select its options: alias bond0 bonding options bond0 mode=balance-alb miimon=100 @@ -629,6 +756,33 @@ options for your configuration. will restart the networking subsystem and your bond link should be now up and running. +3.2.1 Using DHCP with initscripts +--------------------------------- + + Recent versions of initscripts (the version supplied with +Fedora Core 3 and Red Hat Enterprise Linux 4 is reported to work) do +have support for assigning IP information to bonding devices via DHCP. + + To configure bonding for DHCP, configure it as described +above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" +and add a line consisting of "TYPE=Bonding". Note that the TYPE value +is case sensitive. + +3.2.2 Configuring Multiple Bonds with initscripts +------------------------------------------------- + + At this writing, the initscripts package does not directly +support loading the bonding driver multiple times, so the process for +doing so is the same as described in the "Configuring Multiple Bonds +Manually" section, below. + + NOTE: It has been observed that some Red Hat supplied kernels +are apparently unable to rename modules at load time (the "-obonding1" +part). Attempts to pass that option to modprobe will produce an +"Operation not permitted" error. This has been reported on some +Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels +exhibiting this problem, it will be impossible to configure multiple +bonds with differing parameters. 3.3 Configuring Bonding Manually -------------------------------- @@ -638,10 +792,11 @@ scripts (the sysconfig or initscripts package) do not have specific knowledge of bonding. One such distro is SuSE Linux Enterprise Server version 8. - The general methodology for these systems is to place the -bonding module parameters into /etc/modprobe.conf, then add modprobe -and/or ifenslave commands to the system's global init script. The -name of the global init script differs; for sysconfig, it is + The general method for these systems is to place the bonding +module parameters into /etc/modules.conf or /etc/modprobe.conf (as +appropriate for the installed distro), then add modprobe and/or +ifenslave commands to the system's global init script. The name of +the global init script differs; for sysconfig, it is /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. For example, if you wanted to make a simple bond of two e100 @@ -649,7 +804,7 @@ devices (presumed to be eth0 and eth1), and have it persist across reboots, edit the appropriate file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the following: -modprobe bonding -obond0 mode=balance-alb miimon=100 +modprobe bonding mode=balance-alb miimon=100 modprobe e100 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up ifenslave bond0 eth0 @@ -657,11 +812,7 @@ ifenslave bond0 eth1 Replace the example bonding module parameters and bond0 network configuration (IP address, netmask, etc) with the appropriate -values for your configuration. The above example loads the bonding -module with the name "bond0," this simplifies the naming if multiple -bonding modules are loaded (each successive instance of the module is -given a different name, and the module instance names match the -bonding interface names). +values for your configuration. Unfortunately, this method will not provide support for the ifup and ifdown scripts on the bond devices. To reload the bonding @@ -684,20 +835,23 @@ appropriate device driver modules. For our example above, you can do the following: # ifconfig bond0 down -# rmmod bond0 +# rmmod bonding # rmmod e100 Again, for convenience, it may be desirable to create a script with these commands. -3.4 Configuring Multiple Bonds ------------------------------- +3.3.1 Configuring Multiple Bonds Manually +----------------------------------------- This section contains information on configuring multiple -bonding devices with differing options. If you require multiple -bonding devices, but all with the same options, see the "max_bonds" -module paramter, documented above. +bonding devices with differing options for those systems whose network +initialization scripts lack support for configuring multiple bonds. + + If you require multiple bonding devices, but all with the same +options, you may wish to use the "max_bonds" module parameter, +documented above. To create multiple bonding devices with differing options, it is necessary to load the bonding driver multiple times. Note that @@ -724,11 +878,16 @@ named "bond0" and creates the bond0 device in balance-rr mode with an miimon of 100. The second instance is named "bond1" and creates the bond1 device in balance-alb mode with an miimon of 50. + In some circumstances (typically with older distributions), +the above does not work, and the second bonding instance never sees +its options. In that case, the second options line can be substituted +as follows: + +install bonding1 /sbin/modprobe bonding -obond1 mode=balance-alb miimon=50 + This may be repeated any number of times, specifying a new and -unique name in place of bond0 or bond1 for each instance. +unique name in place of bond1 for each subsequent instance. - When the appropriate module paramters are in place, then -configure bonding according to the instructions for your distro. 5. Querying Bonding Configuration ================================= @@ -846,8 +1005,8 @@ tagged internally by bonding itself. As a result, bonding must self generated packets. For reasons of simplicity, and to support the use of adapters -that can do VLAN hardware acceleration offloding, the bonding -interface declares itself as fully hardware offloaing capable, it gets +that can do VLAN hardware acceleration offloading, the bonding +interface declares itself as fully hardware offloading capable, it gets the add_vid/kill_vid notifications to gather the necessary information, and it propagates those actions to the slaves. In case of mixed adapter types, hardware accelerated tagged packets that @@ -880,7 +1039,7 @@ bond interface: matches the hardware address of the VLAN interfaces. Note that changing a VLAN interface's HW address would set the -underlying device -- i.e. the bonding interface -- to promiscouos +underlying device -- i.e. the bonding interface -- to promiscuous mode, which might not be what you want. @@ -923,7 +1082,7 @@ down or have a problem making it unresponsive to ARP requests. Having an additional target (or several) increases the reliability of the ARP monitoring. - Multiple ARP targets must be seperated by commas as follows: + Multiple ARP targets must be separated by commas as follows: # example options for ARP monitoring with three targets alias bond0 bonding @@ -1045,7 +1204,7 @@ install bonding /sbin/modprobe tg3; /sbin/modprobe e1000; This will, when loading the bonding module, rather than performing the normal action, instead execute the provided command. This command loads the device drivers in the order needed, then calls -modprobe with --ingore-install to cause the normal action to then take +modprobe with --ignore-install to cause the normal action to then take place. Full documentation on this can be found in the modprobe.conf and modprobe manual pages. @@ -1130,14 +1289,14 @@ association. common to enable promiscuous mode on the device, so that all traffic is seen (instead of seeing only traffic destined for the local host). The bonding driver handles promiscuous mode changes to the bonding -master device (e.g., bond0), and propogates the setting to the slave +master device (e.g., bond0), and propagates the setting to the slave devices. For the balance-rr, balance-xor, broadcast, and 802.3ad modes, -the promiscuous mode setting is propogated to all slaves. +the promiscuous mode setting is propagated to all slaves. For the active-backup, balance-tlb and balance-alb modes, the -promiscuous mode setting is propogated only to the active slave. +promiscuous mode setting is propagated only to the active slave. For balance-tlb mode, the active slave is the slave currently receiving inbound traffic. @@ -1148,46 +1307,182 @@ sending to peers that are unassigned or if the load is unbalanced. For the active-backup, balance-tlb and balance-alb modes, when the active slave changes (e.g., due to a link failure), the -promiscuous setting will be propogated to the new active slave. +promiscuous setting will be propagated to the new active slave. -12. High Availability Information -================================= +12. Configuring Bonding for High Availability +============================================= High Availability refers to configurations that provide maximum network availability by having redundant or backup devices, -links and switches between the host and the rest of the world. - - There are currently two basic methods for configuring to -maximize availability. They are dependent on the network topology and -the primary goal of the configuration, but in general, a configuration -can be optimized for maximum available bandwidth, or for maximum -network availability. +links or switches between the host and the rest of the world. The +goal is to provide the maximum availability of network connectivity +(i.e., the network always works), even though other configurations +could provide higher throughput. 12.1 High Availability in a Single Switch Topology -------------------------------------------------- - If two hosts (or a host and a switch) are directly connected -via multiple physical links, then there is no network availability -penalty for optimizing for maximum bandwidth: there is only one switch -(or peer), so if it fails, you have no alternative access to fail over -to. + If two hosts (or a host and a single switch) are directly +connected via multiple physical links, then there is no availability +penalty to optimizing for maximum bandwidth. In this case, there is +only one switch (or peer), so if it fails, there is no alternative +access to fail over to. Additionally, the bonding load balance modes +support link monitoring of their members, so if individual links fail, +the load will be rebalanced across the remaining devices. + + See Section 13, "Configuring Bonding for Maximum Throughput" +for information on configuring bonding with one peer device. + +12.2 High Availability in a Multiple Switch Topology +---------------------------------------------------- + + With multiple switches, the configuration of bonding and the +network changes dramatically. In multiple switch topologies, there is +a trade off between network availability and usable bandwidth. + + Below is a sample network, configured to maximize the +availability of the network: -Example 1 : host to switch (or other host) + | | + |port3 port3| + +-----+----+ +-----+----+ + | |port2 ISL port2| | + | switch A +--------------------------+ switch B | + | | | | + +-----+----+ +-----++---+ + |port1 port1| + | +-------+ | + +-------------+ host1 +---------------+ + eth0 +-------+ eth1 - +----------+ +----------+ - | |eth0 eth0| switch | - | Host A +--------------------------+ or | - | +--------------------------+ other | - | |eth1 eth1| host | - +----------+ +----------+ + In this configuration, there is a link between the two +switches (ISL, or inter switch link), and multiple ports connecting to +the outside world ("port3" on each switch). There is no technical +reason that this could not be extended to a third switch. +12.2.1 HA Bonding Mode Selection for Multiple Switch Topology +------------------------------------------------------------- -12.1.1 Bonding Mode Selection for single switch topology --------------------------------------------------------- + In a topology such as the example above, the active-backup and +broadcast modes are the only useful bonding modes when optimizing for +availability; the other modes require all links to terminate on the +same peer for them to behave rationally. + +active-backup: This is generally the preferred mode, particularly if + the switches have an ISL and play together well. If the + network configuration is such that one switch is specifically + a backup switch (e.g., has lower capacity, higher cost, etc), + then the primary option can be used to insure that the + preferred link is always used when it is available. + +broadcast: This mode is really a special purpose mode, and is suitable + only for very specific needs. For example, if the two + switches are not connected (no ISL), and the networks beyond + them are totally independent. In this case, if it is + necessary for some specific one-way traffic to reach both + independent networks, then the broadcast mode may be suitable. + +12.2.2 HA Link Monitoring Selection for Multiple Switch Topology +---------------------------------------------------------------- + + The choice of link monitoring ultimately depends upon your +switch. If the switch can reliably fail ports in response to other +failures, then either the MII or ARP monitors should work. For +example, in the above example, if the "port3" link fails at the remote +end, the MII monitor has no direct means to detect this. The ARP +monitor could be configured with a target at the remote end of port3, +thus detecting that failure without switch support. + + In general, however, in a multiple switch topology, the ARP +monitor can provide a higher level of reliability in detecting end to +end connectivity failures (which may be caused by the failure of any +individual component to pass traffic for any reason). Additionally, +the ARP monitor should be configured with multiple targets (at least +one for each switch in the network). This will insure that, +regardless of which switch is active, the ARP monitor has a suitable +target to query. + + +13. Configuring Bonding for Maximum Throughput +============================================== + +13.1 Maximizing Throughput in a Single Switch Topology +------------------------------------------------------ + + In a single switch configuration, the best method to maximize +throughput depends upon the application and network environment. The +various load balancing modes each have strengths and weaknesses in +different environments, as detailed below. + + For this discussion, we will break down the topologies into +two categories. Depending upon the destination of most traffic, we +categorize them into either "gatewayed" or "local" configurations. + + In a gatewayed configuration, the "switch" is acting primarily +as a router, and the majority of traffic passes through this router to +other networks. An example would be the following: + + + +----------+ +----------+ + | |eth0 port1| | to other networks + | Host A +---------------------+ router +-------------------> + | +---------------------+ | Hosts B and C are out + | |eth1 port2| | here somewhere + +----------+ +----------+ + + The router may be a dedicated router device, or another host +acting as a gateway. For our discussion, the important point is that +the majority of traffic from Host A will pass through the router to +some other network before reaching its final destination. + + In a gatewayed network configuration, although Host A may +communicate with many other systems, all of its traffic will be sent +and received via one other peer on the local network, the router. + + Note that the case of two systems connected directly via +multiple physical links is, for purposes of configuring bonding, the +same as a gatewayed configuration. In that case, it happens that all +traffic is destined for the "gateway" itself, not some other network +beyond the gateway. + + In a local configuration, the "switch" is acting primarily as +a switch, and the majority of traffic passes through this switch to +reach other stations on the same network. An example would be the +following: + + +----------+ +----------+ +--------+ + | |eth0 port1| +-------+ Host B | + | Host A +------------+ switch |port3 +--------+ + | +------------+ | +--------+ + | |eth1 port2| +------------------+ Host C | + +----------+ +----------+port4 +--------+ + + + Again, the switch may be a dedicated switch device, or another +host acting as a gateway. For our discussion, the important point is +that the majority of traffic from Host A is destined for other hosts +on the same local network (Hosts B and C in the above example). + + In summary, in a gatewayed configuration, traffic to and from +the bonded device will be to the same MAC level peer on the network +(the gateway itself, i.e., the router), regardless of its final +destination. In a local configuration, traffic flows directly to and +from the final destinations, thus, each destination (Host B, Host C) +will be addressed directly by their individual MAC addresses. + + This distinction between a gatewayed and a local network +configuration is important because many of the load balancing modes +available use the MAC addresses of the local network source and +destination to make load balancing decisions. The behavior of each +mode is described below. + + +13.1.1 MT Bonding Mode Selection for Single Switch Topology +----------------------------------------------------------- This configuration is the easiest to set up and to understand, although you will have to decide which bonding mode best suits your -needs. The tradeoffs for each mode are detailed below: +needs. The trade offs for each mode are detailed below: balance-rr: This mode is the only mode that will permit a single TCP/IP connection to stripe traffic across multiple @@ -1206,6 +1501,23 @@ balance-rr: This mode is the only mode that will permit a single interface's worth of throughput, even after adjusting tcp_reordering. + Note that this out of order delivery occurs when both the + sending and receiving systems are utilizing a multiple + interface bond. Consider a configuration in which a + balance-rr bond feeds into a single higher capacity network + channel (e.g., multiple 100Mb/sec ethernets feeding a single + gigabit ethernet via an etherchannel capable switch). In this + configuration, traffic sent from the multiple 100Mb devices to + a destination connected to the gigabit device will not see + packets out of order. However, traffic sent from the gigabit + device to the multiple 100Mb devices may or may not see + traffic out of order, depending upon the balance policy of the + switch. Many switches do not support any modes that stripe + traffic (instead choosing a port based upon IP or MAC level + addresses); for those devices, traffic flowing from the + gigabit device to the many 100Mb devices will only utilize one + interface. + If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order delivery, then this mode can allow for single stream datagram @@ -1220,16 +1532,21 @@ active-backup: There is not much advantage in this network topology to connected to the same peer as the primary. In this case, a load balancing mode (with link monitoring) will provide the same level of network availability, but with increased - available bandwidth. On the plus side, it does not require - any configuration of the switch. + available bandwidth. On the plus side, active-backup mode + does not require any configuration of the switch, so it may + have value if the hardware available does not support any of + the load balance modes. balance-xor: This mode will limit traffic such that packets destined for specific peers will always be sent over the same interface. Since the destination is determined by the MAC - addresses involved, this may be desirable if you have a large - network with many hosts. It is likely to be suboptimal if all - your traffic is passed through a single router, however. As - with balance-rr, the switch ports need to be configured for + addresses involved, this mode works best in a "local" network + configuration (as described above), with destinations all on + the same local network. This mode is likely to be suboptimal + if all your traffic is passed through a single router (i.e., a + "gatewayed" network configuration, as described above). + + As with balance-rr, the switch ports need to be configured for "etherchannel" or "trunking." broadcast: Like active-backup, there is not much advantage to this @@ -1241,122 +1558,131 @@ broadcast: Like active-backup, there is not much advantage to this protocol includes automatic configuration of the aggregates, so minimal manual configuration of the switch is needed (typically only to designate that some set of devices is - usable for 802.3ad). The 802.3ad standard also mandates that - frames be delivered in order (within certain limits), so in - general single connections will not see misordering of + available for 802.3ad). The 802.3ad standard also mandates + that frames be delivered in order (within certain limits), so + in general single connections will not see misordering of packets. The 802.3ad mode does have some drawbacks: the standard mandates that all devices in the aggregate operate at the same speed and duplex. Also, as with all bonding load balance modes other than balance-rr, no single connection will be able to utilize more than a single interface's worth of - bandwidth. Additionally, the linux bonding 802.3ad - implementation distributes traffic by peer (using an XOR of - MAC addresses), so in general all traffic to a particular - destination will use the same interface. Finally, the 802.3ad - mode mandates the use of the MII monitor, therefore, the ARP - monitor is not available in this mode. - -balance-tlb: This mode is also a good choice for this type of - topology. It has no special switch configuration - requirements, and balances outgoing traffic by peer, in a - vaguely intelligent manner (not a simple XOR as in balance-xor - or 802.3ad mode), so that unlucky MAC addresses will not all - "bunch up" on a single interface. Interfaces may be of - differing speeds. On the down side, in this mode all incoming - traffic arrives over a single interface, this mode requires - certain ethtool support in the network device driver of the - slave interfaces, and the ARP monitor is not available. - -balance-alb: This mode is everything that balance-tlb is, and more. It - has all of the features (and restrictions) of balance-tlb, and - will also balance incoming traffic from peers (as described in - the Bonding Module Options section, above). The only extra - down side to this mode is that the network device driver must - support changing the hardware address while the device is - open. - -12.1.2 Link Monitoring for Single Switch Topology -------------------------------------------------- + bandwidth. + + Additionally, the linux bonding 802.3ad implementation + distributes traffic by peer (using an XOR of MAC addresses), + so in a "gatewayed" configuration, all outgoing traffic will + generally use the same device. Incoming traffic may also end + up on a single device, but that is dependent upon the + balancing policy of the peer's 8023.ad implementation. In a + "local" configuration, traffic will be distributed across the + devices in the bond. + + Finally, the 802.3ad mode mandates the use of the MII monitor, + therefore, the ARP monitor is not available in this mode. + +balance-tlb: The balance-tlb mode balances outgoing traffic by peer. + Since the balancing is done according to MAC address, in a + "gatewayed" configuration (as described above), this mode will + send all traffic across a single device. However, in a + "local" network configuration, this mode balances multiple + local network peers across devices in a vaguely intelligent + manner (not a simple XOR as in balance-xor or 802.3ad mode), + so that mathematically unlucky MAC addresses (i.e., ones that + XOR to the same value) will not all "bunch up" on a single + interface. + + Unlike 802.3ad, interfaces may be of differing speeds, and no + special switch configuration is required. On the down side, + in this mode all incoming traffic arrives over a single + interface, this mode requires certain ethtool support in the + network device driver of the slave interfaces, and the ARP + monitor is not available. + +balance-alb: This mode is everything that balance-tlb is, and more. + It has all of the features (and restrictions) of balance-tlb, + and will also balance incoming traffic from local network + peers (as described in the Bonding Module Options section, + above). + + The only additional down side to this mode is that the network + device driver must support changing the hardware address while + the device is open. + +13.1.2 MT Link Monitoring for Single Switch Topology +---------------------------------------------------- The choice of link monitoring may largely depend upon which mode you choose to use. The more advanced load balancing modes do not support the use of the ARP monitor, and are thus restricted to using -the MII monitor (which does not provide as high a level of assurance -as the ARP monitor). - - -12.2 High Availability in a Multiple Switch Topology ----------------------------------------------------- - - With multiple switches, the configuration of bonding and the -network changes dramatically. In multiple switch topologies, there is -a tradeoff between network availability and usable bandwidth. - - Below is a sample network, configured to maximize the -availability of the network: - - | | - |port3 port3| - +-----+----+ +-----+----+ - | |port2 ISL port2| | - | switch A +--------------------------+ switch B | - | | | | - +-----+----+ +-----++---+ - |port1 port1| - | +-------+ | - +-------------+ host1 +---------------+ - eth0 +-------+ eth1 - - In this configuration, there is a link between the two -switches (ISL, or inter switch link), and multiple ports connecting to -the outside world ("port3" on each switch). There is no technical -reason that this could not be extended to a third switch. - -12.2.1 Bonding Mode Selection for Multiple Switch Topology ----------------------------------------------------------- - - In a topology such as this, the active-backup and broadcast -modes are the only useful bonding modes; the other modes require all -links to terminate on the same peer for them to behave rationally. - -active-backup: This is generally the preferred mode, particularly if - the switches have an ISL and play together well. If the - network configuration is such that one switch is specifically - a backup switch (e.g., has lower capacity, higher cost, etc), - then the primary option can be used to insure that the - preferred link is always used when it is available. - -broadcast: This mode is really a special purpose mode, and is suitable - only for very specific needs. For example, if the two - switches are not connected (no ISL), and the networks beyond - them are totally independant. In this case, if it is - necessary for some specific one-way traffic to reach both - independent networks, then the broadcast mode may be suitable. - -12.2.2 Link Monitoring Selection for Multiple Switch Topology +the MII monitor (which does not provide as high a level of end to end +assurance as the ARP monitor). + +13.2 Maximum Throughput in a Multiple Switch Topology +----------------------------------------------------- + + Multiple switches may be utilized to optimize for throughput +when they are configured in parallel as part of an isolated network +between two or more systems, for example: + + +-----------+ + | Host A | + +-+---+---+-+ + | | | + +--------+ | +---------+ + | | | + +------+---+ +-----+----+ +-----+----+ + | Switch A | | Switch B | | Switch C | + +------+---+ +-----+----+ +-----+----+ + | | | + +--------+ | +---------+ + | | | + +-+---+---+-+ + | Host B | + +-----------+ + + In this configuration, the switches are isolated from one +another. One reason to employ a topology such as this is for an +isolated network with many hosts (a cluster configured for high +performance, for example), using multiple smaller switches can be more +cost effective than a single larger switch, e.g., on a network with 24 +hosts, three 24 port switches can be significantly less expensive than +a single 72 port switch. + + If access beyond the network is required, an individual host +can be equipped with an additional network device connected to an +external network; this host then additionally acts as a gateway. + +13.2.1 MT Bonding Mode Selection for Multiple Switch Topology ------------------------------------------------------------- - The choice of link monitoring ultimately depends upon your -switch. If the switch can reliably fail ports in response to other -failures, then either the MII or ARP monitors should work. For -example, in the above example, if the "port3" link fails at the remote -end, the MII monitor has no direct means to detect this. The ARP -monitor could be configured with a target at the remote end of port3, -thus detecting that failure without switch support. + In actual practice, the bonding mode typically employed in +configurations of this type is balance-rr. Historically, in this +network configuration, the usual caveats about out of order packet +delivery are mitigated by the use of network adapters that do not do +any kind of packet coalescing (via the use of NAPI, or because the +device itself does not generate interrupts until some number of +packets has arrived). When employed in this fashion, the balance-rr +mode allows individual connections between two hosts to effectively +utilize greater than one interface's bandwidth. - In general, however, in a multiple switch topology, the ARP -monitor can provide a higher level of reliability in detecting link -failures. Additionally, it should be configured with multiple targets -(at least one for each switch in the network). This will insure that, -regardless of which switch is active, the ARP monitor has a suitable -target to query. +13.2.2 MT Link Monitoring for Multiple Switch Topology +------------------------------------------------------ + Again, in actual practice, the MII monitor is most often used +in this configuration, as performance is given preference over +availability. The ARP monitor will function in this topology, but its +advantages over the MII monitor are mitigated by the volume of probes +needed as the number of systems involved grows (remember that each +host in the network is configured with bonding). -12.3 Switch Behavior Issues for High Availability -------------------------------------------------- +14. Switch Behavior Issues +========================== - You may encounter issues with the timing of link up and down -reporting by the switch. +14.1 Link Establishment and Failover Delays +------------------------------------------- + + Some switches exhibit undesirable behavior with regard to the +timing of link up and down reporting by the switch. First, when a link comes up, some switches may indicate that the link is up (carrier available), but not pass traffic over the @@ -1370,30 +1696,70 @@ relevant interface(s). Second, some switches may "bounce" the link state one or more times while a link is changing state. This occurs most commonly while the switch is initializing. Again, an appropriate updelay value may -help, but note that if all links are down, then updelay is ignored -when any link becomes active (the slave closest to completing its -updelay is chosen). +help. Note that when a bonding interface has no active links, the -driver will immediately reuse the first link that goes up, even if -updelay parameter was specified. If there are slave interfaces -waiting for the updelay timeout to expire, the interface that first -went into that state will be immediately reused. This reduces down -time of the network if the value of updelay has been overestimated. +driver will immediately reuse the first link that goes up, even if the +updelay parameter has been specified (the updelay is ignored in this +case). If there are slave interfaces waiting for the updelay timeout +to expire, the interface that first went into that state will be +immediately reused. This reduces down time of the network if the +value of updelay has been overestimated, and since this occurs only in +cases with no connectivity, there is no additional penalty for +ignoring the updelay. In addition to the concerns about switch timings, if your switches take a long time to go into backup mode, it may be desirable to not activate a backup interface immediately after a link goes down. Failover may be delayed via the downdelay bonding module option. -13. Hardware Specific Considerations +14.2 Duplicated Incoming Packets +-------------------------------- + + It is not uncommon to observe a short burst of duplicated +traffic when the bonding device is first used, or after it has been +idle for some period of time. This is most easily observed by issuing +a "ping" to some other host on the network, and noticing that the +output from ping flags duplicates (typically one per slave). + + For example, on a bond in active-backup mode with five slaves +all connected to one switch, the output may appear as follows: + +# ping -n 10.0.4.2 +PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data. +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!) +64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms +64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms +64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms + + This is not due to an error in the bonding driver, rather, it +is a side effect of how many switches update their MAC forwarding +tables. Initially, the switch does not associate the MAC address in +the packet with a particular switch port, and so it may send the +traffic to all ports until its MAC forwarding table is updated. Since +the interfaces attached to the bond may occupy multiple ports on a +single switch, when the switch (temporarily) floods the traffic to all +ports, the bond device receives multiple copies of the same packet +(one per slave device). + + The duplicated packet behavior is switch dependent, some +switches exhibit this, and some do not. On switches that display this +behavior, it can be induced by clearing the MAC forwarding table (on +most Cisco switches, the privileged command "clear mac address-table +dynamic" will accomplish this). + +15. Hardware Specific Considerations ==================================== This section contains additional information for configuring bonding on specific hardware platforms, or for interfacing bonding with particular switches or other devices. -13.1 IBM BladeCenter +15.1 IBM BladeCenter -------------------- This applies to the JS20 and similar systems. @@ -1407,12 +1773,12 @@ JS20 network adapter information -------------------------------- All JS20s come with two Broadcom Gigabit Ethernet ports -integrated on the planar. In the BladeCenter chassis, the eth0 port -of all JS20 blades is hard wired to I/O Module #1; similarly, all eth1 -ports are wired to I/O Module #2. An add-on Broadcom daughter card -can be installed on a JS20 to provide two more Gigabit Ethernet ports. -These ports, eth2 and eth3, are wired to I/O Modules 3 and 4, -respectively. +integrated on the planar (that's "motherboard" in IBM-speak). In the +BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to +I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2. +An add-on Broadcom daughter card can be installed on a JS20 to provide +two more Gigabit Ethernet ports. These ports, eth2 and eth3, are +wired to I/O Modules 3 and 4, respectively. Each I/O Module may contain either a switch or a passthrough module (which allows ports to be directly connected to an external @@ -1432,29 +1798,30 @@ BladeCenter networking configuration of ways, this discussion will be confined to describing basic configurations. - Normally, Ethernet Switch Modules (ESM) are used in I/O + Normally, Ethernet Switch Modules (ESMs) are used in I/O modules 1 and 2. In this configuration, the eth0 and eth1 ports of a JS20 will be connected to different internal switches (in the respective I/O modules). - An optical passthru module (OPM) connects the I/O module -directly to an external switch. By using OPMs in I/O module #1 and -#2, the eth0 and eth1 interfaces of a JS20 can be redirected to the -outside world and connected to a common external switch. - - Depending upon the mix of ESM and OPM modules, the network -will appear to bonding as either a single switch topology (all OPM -modules) or as a multiple switch topology (one or more ESM modules, -zero or more OPM modules). It is also possible to connect ESM modules -together, resulting in a configuration much like the example in "High -Availability in a multiple switch topology." - -Requirements for specifc modes ------------------------------- - - The balance-rr mode requires the use of OPM modules for -devices in the bond, all connected to an common external switch. That -switch must be configured for "etherchannel" or "trunking" on the + A passthrough module (OPM or CPM, optical or copper, +passthrough module) connects the I/O module directly to an external +switch. By using PMs in I/O module #1 and #2, the eth0 and eth1 +interfaces of a JS20 can be redirected to the outside world and +connected to a common external switch. + + Depending upon the mix of ESMs and PMs, the network will +appear to bonding as either a single switch topology (all PMs) or as a +multiple switch topology (one or more ESMs, zero or more PMs). It is +also possible to connect ESMs together, resulting in a configuration +much like the example in "High Availability in a Multiple Switch +Topology," above. + +Requirements for specific modes +------------------------------- + + The balance-rr mode requires the use of passthrough modules +for devices in the bond, all connected to an common external switch. +That switch must be configured for "etherchannel" or "trunking" on the appropriate ports, as is usual for balance-rr. The balance-alb and balance-tlb modes will function with @@ -1484,17 +1851,18 @@ connected to the JS20 system. Other concerns -------------- - The Serial Over LAN link is established over the primary + The Serial Over LAN (SoL) link is established over the primary ethernet (eth0) only, therefore, any loss of link to eth0 will result in losing your SoL connection. It will not fail over with other -network traffic. +network traffic, as the SoL system is beyond the control of the +bonding driver. It may be desirable to disable spanning tree on the switch (either the internal Ethernet Switch Module, or an external switch) to -avoid fail-over delays issues when using bonding. +avoid fail-over delay issues when using bonding. -14. Frequently Asked Questions +16. Frequently Asked Questions ============================== 1. Is it SMP safe? @@ -1505,8 +1873,8 @@ The new driver was designed to be SMP safe from the start. 2. What type of cards will work with it? Any Ethernet type cards (you can even mix cards - a Intel -EtherExpress PRO/100 and a 3com 3c905b, for example). They need not -be of the same speed. +EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, +devices need not be of the same speed. 3. How many bonding devices can I have? @@ -1524,11 +1892,12 @@ system. disabled. The active-backup mode will fail over to a backup link, and other modes will ignore the failed link. The link will continue to be monitored, and should it recover, it will rejoin the bond (in whatever -manner is appropriate for the mode). See the section on High -Availability for additional information. +manner is appropriate for the mode). See the sections on High +Availability and the documentation for each mode for additional +information. Link monitoring can be enabled via either the miimon or -arp_interval paramters (described in the module paramters section, +arp_interval parameters (described in the module parameters section, above). In general, miimon monitors the carrier state as sensed by the underlying network device, and the arp monitor (arp_interval) monitors connectivity to another host on the local network. @@ -1536,7 +1905,7 @@ monitors connectivity to another host on the local network. If no link monitoring is configured, the bonding driver will be unable to detect link failures, and will assume that all links are always available. This will likely result in lost packets, and a -resulting degredation of performance. The precise performance loss +resulting degradation of performance. The precise performance loss depends upon the bonding mode and network configuration. 6. Can bonding be used for High Availability? @@ -1550,12 +1919,12 @@ depends upon the bonding mode and network configuration. In the basic balance modes (balance-rr and balance-xor), it works with any system that supports etherchannel (also called trunking). Most managed switches currently available have such -support, and many unmananged switches as well. +support, and many unmanaged switches as well. The advanced balance modes (balance-tlb and balance-alb) do not have special switch requirements, but do need device drivers that support specific features (described in the appropriate section under -module paramters, above). +module parameters, above). In 802.3ad mode, it works with with systems that support IEEE 802.3ad Dynamic Link Aggregation. Most managed and many unmanaged @@ -1565,17 +1934,19 @@ switches currently available support 802.3ad. 8. Where does a bonding device get its MAC address from? - If not explicitly configured with ifconfig, the MAC address of -the bonding device is taken from its first slave device. This MAC -address is then passed to all following slaves and remains persistent -(even if the the first slave is removed) until the bonding device is -brought down or reconfigured. + If not explicitly configured (with ifconfig or ip link), the +MAC address of the bonding device is taken from its first slave +device. This MAC address is then passed to all following slaves and +remains persistent (even if the the first slave is removed) until the +bonding device is brought down or reconfigured. If you wish to change the MAC address, you can set it with -ifconfig: +ifconfig or ip link: # ifconfig bond0 hw ether 00:11:22:33:44:55 +# ip link set bond0 address 66:77:88:99:aa:bb + The MAC address can be also changed by bringing down/up the device and then changing its slaves (or their order): @@ -1591,23 +1962,28 @@ from the bond (`ifenslave -d bond0 eth0'). The bonding driver will then restore the MAC addresses that the slaves had before they were enslaved. -15. Resources and Links +16. Resources and Links ======================= The latest version of the bonding driver can be found in the latest version of the linux kernel, found on http://kernel.org +The latest version of this document can be found in either the latest +kernel source (named Documentation/networking/bonding.txt), or on the +bonding sourceforge site: + +http://www.sourceforge.net/projects/bonding + Discussions regarding the bonding driver take place primarily on the bonding-devel mailing list, hosted at sourceforge.net. If you have -questions or problems, post them to the list. +questions or problems, post them to the list. The list address is: bonding-devel@lists.sourceforge.net -https://lists.sourceforge.net/lists/listinfo/bonding-devel - -There is also a project site on sourceforge. + The administrative interface (to subscribe or unsubscribe) can +be found at: -http://www.sourceforge.net/projects/bonding +https://lists.sourceforge.net/lists/listinfo/bonding-devel Donald Becker's Ethernet Drivers and diag programs may be found at : - http://www.scyld.com/network/ diff --git a/Documentation/pcmcia/driver-changes.txt b/Documentation/pcmcia/driver-changes.txt index 59ccc63..403e7b4 100644 --- a/Documentation/pcmcia/driver-changes.txt +++ b/Documentation/pcmcia/driver-changes.txt @@ -56,3 +56,12 @@ This file details changes in 2.6 which affect PCMCIA card driver authors: memory regions in-use. The name argument should be a pointer to your driver name. Eg, for pcnet_cs, name should point to the string "pcnet_cs". + +* CardServices is gone + CardServices() in 2.4 is just a big switch statement to call various + services. In 2.6, all of those entry points are exported and called + directly (except for pcmcia_report_error(), just use cs_error() instead). + +* struct pcmcia_driver + You need to use struct pcmcia_driver and pcmcia_{un,}register_driver + instead of {un,}register_pccard_driver diff --git a/Documentation/sound/alsa/ALSA-Configuration.txt b/Documentation/sound/alsa/ALSA-Configuration.txt index 104a994..a18ecb9 100644 --- a/Documentation/sound/alsa/ALSA-Configuration.txt +++ b/Documentation/sound/alsa/ALSA-Configuration.txt @@ -636,11 +636,16 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. 3stack-digout 3-jack in back, a HP out and a SPDIF out 5stack 5-jack in back, 2-jack in front 5stack-digout 5-jack in back, 2-jack in front, a SPDIF out + 6stack 6-jack in back, 2-jack in front + 6stack-digout 6-jack with a SPDIF out w810 3-jack z71v 3-jack (HP shared SPDIF) asus 3-jack uniwill 3-jack F1734 2-jack + test for testing/debugging purpose, almost all controls can be + adjusted. Appearing only when compiled with + $CONFIG_SND_DEBUG=y CMI9880 minimal 3-jack in back @@ -1054,6 +1059,13 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. The power-management is supported. + Module snd-pxa2xx-ac97 (on arm only) + ------------------------------------ + + Module for AC97 driver for the Intel PXA2xx chip + + For ARM architecture only. + Module snd-rme32 ---------------- @@ -1173,6 +1185,13 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module supports up to 8 cards. + Module snd-sun-dbri (on sparc only) + ----------------------------------- + + Module for DBRI sound chips found on Sparcs. + + Module supports up to 8 cards. + Module snd-wavefront -------------------- @@ -1371,7 +1390,7 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Module snd-vxpocket ------------------- - Module for Digigram VX-Pocket VX2 PCMCIA card. + Module for Digigram VX-Pocket VX2 and 440 PCMCIA cards. ibl - Capture IBL size. (default = 0, minimum size) @@ -1391,29 +1410,6 @@ Prior to version 0.9.0rc4 options had a 'snd_' prefix. This was removed. Note: the driver is build only when CONFIG_ISA is set. - Module snd-vxp440 - ----------------- - - Module for Digigram VX-Pocket 440 PCMCIA card. - - ibl - Capture IBL size. (default = 0, minimum size) - - Module supports up to 8 cards. The module is compiled only when - PCMCIA is supported on kernel. - - To activate the driver via the card manager, you'll need to set - up /etc/pcmcia/vxp440.conf. See the sound/pcmcia/vx/vxp440.c. - - When the driver is compiled as a module and the hotplug firmware - is supported, the firmware data is loaded via hotplug automatically. - Install the necessary firmware files in alsa-firmware package. - When no hotplug fw loader is available, you need to load the - firmware via vxloader utility in alsa-tools package. - - About capture IBL, see the description of snd-vx222 module. - - Note: the driver is build only when CONFIG_ISA is set. - Module snd-ymfpci ----------------- diff --git a/Documentation/stable_api_nonsense.txt b/Documentation/stable_api_nonsense.txt index 3cea1387..f39c9d7 100644 --- a/Documentation/stable_api_nonsense.txt +++ b/Documentation/stable_api_nonsense.txt @@ -132,7 +132,7 @@ to extra work for the USB developers. Since all Linux USB developers do their work on their own time, asking programmers to do extra work for no gain, for free, is not a possibility. -Security issues are also a very important for Linux. When a +Security issues are also very important for Linux. When a security issue is found, it is fixed in a very short amount of time. A number of times this has caused internal kernel interfaces to be reworked to prevent the security problem from occurring. When this diff --git a/Documentation/stable_kernel_rules.txt b/Documentation/stable_kernel_rules.txt new file mode 100644 index 0000000..2c81305 --- /dev/null +++ b/Documentation/stable_kernel_rules.txt @@ -0,0 +1,58 @@ +Everything you ever wanted to know about Linux 2.6 -stable releases. + +Rules on what kind of patches are accepted, and what ones are not, into +the "-stable" tree: + + - It must be obviously correct and tested. + - It can not bigger than 100 lines, with context. + - It must fix only one thing. + - It must fix a real bug that bothers people (not a, "This could be a + problem..." type thing.) + - It must fix a problem that causes a build error (but not for things + marked CONFIG_BROKEN), an oops, a hang, data corruption, a real + security issue, or some "oh, that's not good" issue. In short, + something critical. + - No "theoretical race condition" issues, unless an explanation of how + the race can be exploited. + - It can not contain any "trivial" fixes in it (spelling changes, + whitespace cleanups, etc.) + - It must be accepted by the relevant subsystem maintainer. + - It must follow Documentation/SubmittingPatches rules. + + +Procedure for submitting patches to the -stable tree: + + - Send the patch, after verifying that it follows the above rules, to + stable@kernel.org. + - The sender will receive an ack when the patch has been accepted into + the queue, or a nak if the patch is rejected. This response might + take a few days, according to the developer's schedules. + - If accepted, the patch will be added to the -stable queue, for review + by other developers. + - Security patches should not be sent to this alias, but instead to the + documented security@kernel.org. + + +Review cycle: + + - When the -stable maintainers decide for a review cycle, the patches + will be sent to the review committee, and the maintainer of the + affected area of the patch (unless the submitter is the maintainer of + the area) and CC: to the linux-kernel mailing list. + - The review committee has 48 hours in which to ack or nak the patch. + - If the patch is rejected by a member of the committee, or linux-kernel + members object to the patch, bringing up issues that the maintainers + and members did not realize, the patch will be dropped from the + queue. + - At the end of the review cycle, the acked patches will be added to + the latest -stable release, and a new -stable release will happen. + - Security patches will be accepted into the -stable tree directly from + the security kernel team, and not go through the normal review cycle. + Contact the kernel security team for more details on this procedure. + + +Review committe: + + - This will be made up of a number of kernel developers who have + volunteered for this task, and a few that haven't. + diff --git a/Documentation/usb/usbmon.txt b/Documentation/usb/usbmon.txt index f1896ee..63cb7ed 100644 --- a/Documentation/usb/usbmon.txt +++ b/Documentation/usb/usbmon.txt @@ -102,7 +102,7 @@ Here is the list of words, from left to right: - URB Status. This field makes no sense for submissions, but is present to help scripts with parsing. In error case, it contains the error code. In case of a setup packet, it contains a Setup Tag. If scripts read a number - in this field, the proceed to read Data Length. Otherwise, they read + in this field, they proceed to read Data Length. Otherwise, they read the setup packet before reading the Data Length. - Setup packet, if present, consists of 5 words: one of each for bmRequestType, bRequest, wValue, wIndex, wLength, as specified by the USB Specification 2.0. diff --git a/Documentation/video4linux/CARDLIST.cx88 b/Documentation/video4linux/CARDLIST.cx88 index 6d44958..03deb07 100644 --- a/Documentation/video4linux/CARDLIST.cx88 +++ b/Documentation/video4linux/CARDLIST.cx88 @@ -29,3 +29,4 @@ card=27 - PixelView PlayTV Ultra Pro (Stereo) card=28 - DViCO FusionHDTV 3 Gold-T card=29 - ADS Tech Instant TV DVB-T PCI card=30 - TerraTec Cinergy 1400 DVB-T +card=31 - DViCO FusionHDTV 5 Gold diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index d1b9d21..f3302e1 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner @@ -62,3 +62,5 @@ tuner=60 - Thomson DDT 7611 (ATSC/NTSC) tuner=61 - Tena TNF9533-D/IF/TNF9533-B/DF tuner=62 - Philips TEA5767HN FM Radio tuner=63 - Philips FMD1216ME MK3 Hybrid Tuner +tuner=64 - LG TDVS-H062F/TUA6034 +tuner=65 - Ymec TVF66T5-B/DFF diff --git a/Documentation/video4linux/bttv/Insmod-options b/Documentation/video4linux/bttv/Insmod-options index 7bb5a50b..fc94ff2 100644 --- a/Documentation/video4linux/bttv/Insmod-options +++ b/Documentation/video4linux/bttv/Insmod-options @@ -44,6 +44,9 @@ bttv.o push used by bttv. bttv will disable overlay by default on this hardware to avoid crashes. With this insmod option you can override this. + no_overlay=1 Disable overlay. It should be used by broken + hardware that doesn't support PCI2PCI direct + transfers. automute=0/1 Automatically mutes the sound if there is no TV signal, on by default. You might try to disable this if you have bad input signal diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index b9e6be0..678e8f1 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt @@ -6,6 +6,11 @@ only the AMD64 specific ones are listed here. Machine check mce=off disable machine check + mce=bootlog Enable logging of machine checks left over from booting. + Disabled by default because some BIOS leave bogus ones. + If your BIOS doesn't do that it's a good idea to enable though + to make sure you log even machine check events that result + in a reboot. nomce (for compatibility with i386): same as mce=off @@ -47,7 +52,7 @@ Timing notsc Don't use the CPU time stamp counter to read the wall time. This can be used to work around timing problems on multiprocessor systems - with not properly synchronized CPUs. Only useful with a SMP kernel + with not properly synchronized CPUs. report_lost_ticks Report when timer interrupts are lost because some code turned off @@ -74,6 +79,9 @@ Idle loop event. This will make the CPUs eat a lot more power, but may be useful to get slightly better performance in multiprocessor benchmarks. It also makes some profiling using performance counters more accurate. + Please note that on systems with MONITOR/MWAIT support (like Intel EM64T + CPUs) this option has no performance advantage over the normal idle loop. + It may also interact badly with hyperthreading. Rebooting @@ -178,6 +186,5 @@ Debugging Misc noreplacement Don't replace instructions with more appropiate ones - for the CPU. This may be useful on asymmetric MP systems - where some CPU have less capabilities than the others. - + for the CPU. This may be useful on asymmetric MP systems + where some CPU have less capabilities than the others. |