Diffstat (limited to 'Documentation')
32 files changed, 2257 insertions, 2541 deletions
diff --git a/Documentation/Intel-IOMMU.txt b/Documentation/Intel-IOMMU.txt
new file mode 100644
index 0000000..c232190
--- /dev/null
+++ b/Documentation/Intel-IOMMU.txt
@@ -0,0 +1,115 @@
+Linux IOMMU Support
+===================
+
+The architecture spec can be obtained from the below location.
+
+http://www.intel.com/technology/virtualization/
+
+This guide gives a quick cheat sheet for some basic understanding.
+
+Some Keywords
+
+DMAR - DMA remapping
+DRHD - DMA Engine Reporting Structure
+RMRR - Reserved memory Region Reporting Structure
+ZLR  - Zero length reads from PCI devices
+IOVA - IO Virtual address.
+
+Basic stuff
+-----------
+
+ACPI enumerates and lists the different DMA engines in the platform, and
+device scope relationships between PCI devices and which DMA engine controls
+them.
+
+What is RMRR?
+-------------
+
+There are some devices the BIOS controls, for e.g USB devices to perform
+PS2 emulation. The regions of memory used for these devices are marked
+reserved in the e820 map. When we turn on DMA translation, DMA to those
+regions will fail. Hence BIOS uses RMRR to specify these regions along with
+devices that need to access these regions. OS is expected to setup
+unity mappings for these regions for these devices to access these regions.
+
+How is IOVA generated?
+---------------------
+
+Well behaved drivers call pci_map_*() calls before sending command to device
+that needs to perform DMA. Once DMA is completed and mapping is no longer
+required, device performs a pci_unmap_*() calls to unmap the region.
+
+The Intel IOMMU driver allocates a virtual address per domain. Each PCIE
+device has its own domain (hence protection). Devices under p2p bridges
+share the virtual address with all devices under the p2p bridge due to
+transaction id aliasing for p2p bridges.
+
+IOVA generation is pretty generic. We used the same technique as vmalloc()
+but these are not global address spaces, but separate for each domain.
+Different DMA engines may support different number of domains.
+
+We also allocate guard pages with each mapping, so we can attempt to catch
+any overflow that might happen.
+
+
+Graphics Problems?
+------------------
+If you encounter issues with graphics devices, you can try adding
+option intel_iommu=igfx_off to turn off the integrated graphics engine.
+
+If it happens to be a PCI device included in the INCLUDE_ALL Engine,
+then try enabling CONFIG_DMAR_GFX_WA to setup a 1-1 map. We hear
+graphics drivers may be in process of using DMA api's in the near
+future and at that time this option can be yanked out.
+
+Some exceptions to IOVA
+-----------------------
+Interrupt ranges are not address translated, (0xfee00000 - 0xfeefffff).
+The same is true for peer to peer transactions. Hence we reserve the
+address from PCI MMIO ranges so they are not allocated for IOVA addresses.
+
+
+Fault reporting
+---------------
+When errors are reported, the DMA engine signals via an interrupt. The fault
+reason and device that caused it with fault reason is printed on console.
+
+See below for sample.
+
+
+Boot Message Sample
+-------------------
+
+Something like this gets printed indicating presence of DMAR tables
+in ACPI.
+
+ACPI: DMAR (v001 A M I OEMDMAR 0x00000001 MSFT 0x00000097) @ 0x000000007f5b5ef0
+
+When DMAR is being processed and initialized by ACPI, prints DMAR locations
+and any RMRR's processed.
+
+ACPI DMAR:Host address width 36
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed90000
+ACPI DMAR:DRHD (flags: 0x00000000)base: 0x00000000fed91000
+ACPI DMAR:DRHD (flags: 0x00000001)base: 0x00000000fed93000
+ACPI DMAR:RMRR base: 0x00000000000ed000 end: 0x00000000000effff
+ACPI DMAR:RMRR base: 0x000000007f600000 end: 0x000000007fffffff
+
+When DMAR is enabled for use, you will notice..
+
+PCI-DMA: Using DMAR IOMMU
+
+Fault reporting
+---------------
+
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+DMAR:[DMA Write] Request device [00:02.0] fault addr 6df084000
+DMAR:[fault reason 05] PTE Write access is not set
+
+TBD
+----
+
+- For compatibility testing, could use unity map domain for all devices, just
+  provide a 1-1 for all useful memory under a single domain for all devices.
+- API for paravirt ops for abstracting functionality for VMM folks.
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index a30dd44..681e2b3 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -464,8 +464,8 @@ section Linus Computer Science 101.
 Nuff said. If your code deviates too much from this, it is likely
 to be rejected without further review, and without comment.
 
-Once significant exception is when moving code from one file to
-another in this case you should not modify the moved code at all in
+One significant exception is when moving code from one file to
+another -- in this case you should not modify the moved code at all in
 the same patch which moves it. This clearly delineates the act of
 moving the code and your changes. This greatly aids review of the
 actual differences and allows tools to better track the history of
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 6b0f963..6bb9be5 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -14,18 +14,6 @@ Who:    Jiri Slaby <jirislaby@gmail.com>
 
 ---------------------------
 
-What:   V4L2 VIDIOC_G_MPEGCOMP and VIDIOC_S_MPEGCOMP
-When:   October 2007
-Why:    Broken attempt to set MPEG compression parameters. These ioctls are
-        not able to implement the wide variety of parameters that can be set
-        by hardware MPEG encoders. A new MPEG control mechanism was created
-        in kernel 2.6.18 that replaces these ioctls. See the V4L2 specification
-        (section 1.9: Extended controls) for more information on this topic.
-Who:    Hans Verkuil <hverkuil@xs4all.nl> and
-        Mauro Carvalho Chehab <mchehab@infradead.org>
-
----------------------------
-
 What:   dev->power.power_state
 When:   July 2007
 Why:    Broken design for runtime control over driver power states, confusing
@@ -49,10 +37,10 @@ Who:    David Miller <davem@davemloft.net>
 ---------------------------
 
 What:   Video4Linux API 1 ioctls and video_decoder.h from Video devices.
-When:   December 2006
-Files:  include/linux/video_decoder.h
-Check:  include/linux/video_decoder.h
-Why:    V4L1 AP1 was replaced by V4L2 API. during migration from 2.4 to 2.6
+When:   December 2008
+Files:  include/linux/video_decoder.h include/linux/videodev.h
+Check:  include/linux/video_decoder.h include/linux/videodev.h
+Why:    V4L1 AP1 was replaced by V4L2 API during migration from 2.4 to 2.6
         series. The old API have lots of drawbacks and don't provide enough
         means to work with all video and audio standards. The newer API is
         already available on the main drivers and should be used instead.
@@ -61,7 +49,9 @@ Why:    V4L1 AP1 was replaced by V4L2 API. during migration from 2.4 to 2.6
         Decoder iocts are using internally to allow video drivers to
         communicate with video decoders. This should also be improved to allow
         V4L2 calls being translated into compatible internal ioctls.
-Who:    Mauro Carvalho Chehab <mchehab@brturbo.com.br>
+        Compatibility ioctls will be provided, for a while, via
+        v4l1-compat module.
+Who:    Mauro Carvalho Chehab <mchehab@infradead.org>
 
 ---------------------------
 
diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt
index b90f537..bf80806 100644
--- a/Documentation/filesystems/9p.txt
+++ b/Documentation/filesystems/9p.txt
@@ -42,10 +42,12 @@ OPTIONS
 
   trans=name    select an alternative transport. Valid options are
                 currently:
-                        unix - specifying a named pipe mount point
-                        tcp - specifying a normal TCP/IP connection
-                        fd - used passed file descriptors for connection
+                        unix    - specifying a named pipe mount point
+                        tcp     - specifying a normal TCP/IP connection
+                        fd      - used passed file descriptors for connection
                                 (see rfdno and wfdno)
+                        virtio  - connect to the next virtio channel available
+                                (from lguest or KVM with trans_virtio module)
 
   uname=name    user name to attempt mount as on the remote server. The
                 server may override or ignore this value. Certain user
diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting
index 31047e0..87019d2 100644
--- a/Documentation/filesystems/Exporting
+++ b/Documentation/filesystems/Exporting
@@ -2,9 +2,12 @@
 Making Filesystems Exportable
 =============================
 
-Most filesystem operations require a dentry (or two) as a starting
+Overview
+--------
+
+All filesystem operations require a dentry (or two) as a starting
 point. Local applications have a reference-counted hold on suitable
-dentrys via open file descriptors or cwd/root. However remote
+dentries via open file descriptors or cwd/root. However remote
 applications that access a filesystem via a remote filesystem protocol
 such as NFS may not be able to hold such a reference, and so need a
 different way to refer to a particular dentry. As the alternative
@@ -13,14 +16,14 @@ server-reboot (among other things, though these tend to be the most
 problematic), there is no simple answer like 'filename'.
 
 The mechanism discussed here allows each filesystem implementation to
-specify how to generate an opaque (out side of the filesystem) byte
+specify how to generate an opaque (outside of the filesystem) byte
 string for any dentry, and how to find an appropriate dentry for any
 given opaque byte string. This byte string will be called a
 "filehandle fragment" as it corresponds to part of an NFS filehandle.
 
 A filesystem which supports the mapping between filehandle fragments
-and dentrys will be termed "exportable".
+and dentries will be termed "exportable".
 
 
 
@@ -89,11 +92,9 @@ For a filesystem to be exportable it must:
    1/ provide the filehandle fragment routines described below.
    2/ make sure that d_splice_alias is used rather than d_add
       when ->lookup finds an inode for a given parent and name.
-      Typically the ->lookup routine will end:
-              if (inode)
-                      return d_splice(inode, dentry);
-              d_add(dentry, inode);
-              return NULL;
+      Typically the ->lookup routine will end with a:
+
+              return d_splice_alias(inode, dentry);
       }
 
 
@@ -101,67 +102,39 @@ For a filesystem to be exportable it must:
 A file system implementation declares that instances of the filesystem
 are exportable by setting the s_export_op field in the struct
 super_block. This field must point to a "struct export_operations"
-struct which could potentially be full of NULLs, though normally at
-least get_parent will be set.
-
- The primary operations are decode_fh and encode_fh.
-decode_fh takes a filehandle fragment and tries to find or create a
-dentry for the object referred to by the filehandle.
-encode_fh takes a dentry and creates a filehandle fragment which can
-later be used to find/create a dentry for the same object.
-
-decode_fh will probably make use of "find_exported_dentry".
-This function lives in the "exportfs" module which a filesystem does
-not need unless it is being exported. So rather that calling
-find_exported_dentry directly, each filesystem should call it through
-the find_exported_dentry pointer in it's export_operations table.
-This field is set correctly by the exporting agent (e.g. nfsd) when a
-filesystem is exported, and before any export operations are called.
-
-find_exported_dentry needs three support functions from the
-filesystem:
-  get_name. When given a parent dentry and a child dentry, this
-    should find a name in the directory identified by the parent
-    dentry, which leads to the object identified by the child dentry.
-    If no get_name function is supplied, a default implementation is
-    provided which uses vfs_readdir to find potential names, and
-    matches inode numbers to find the correct match.
-
-  get_parent. When given a dentry for a directory, this should return
-    a dentry for the parent. Quite possibly the parent dentry will
-    have been allocated by d_alloc_anon.
-    The default get_parent function just returns an error so any
-    filehandle lookup that requires finding a parent will fail.
-    ->lookup("..") is *not* used as a default as it can leave ".."
-    entries in the dcache which are too messy to work with.
-
-  get_dentry. When given an opaque datum, this should find the
-    implied object and create a dentry for it (possibly with
-    d_alloc_anon).
-    The opaque datum is whatever is passed down by the decode_fh
-    function, and is often simply a fragment of the filehandle
-    fragment.
-    decode_fh passes two datums through find_exported_dentry. One that
-    should be used to identify the target object, and one that can be
-    used to identify the object's parent, should that be necessary.
-    The default get_dentry function assumes that the datum contains an
-    inode number and a generation number, and it attempts to get the
-    inode using "iget" and check it's validity by matching the
-    generation number. A filesystem should only depend on the default
-    if iget can safely be used this way.
-
-If decode_fh and/or encode_fh are left as NULL, then default
-implementations are used. These defaults are suitable for ext2 and
-extremely similar filesystems (like ext3).
-
-The default encode_fh creates a filehandle fragment from the inode
-number and generation number of the target together with the inode
-number and generation number of the parent (if the parent is
-required).
-
-The default decode_fh extract the target and parent datums from the
-filehandle assuming the format used by the default encode_fh and
-passed them to find_exported_dentry.
+struct which has the following members:
+
+ encode_fh (optional)
+   Takes a dentry and creates a filehandle fragment which can later be used
+   to find or create a dentry for the same object. The default
+   implementation creates a filehandle fragment that encodes a 32bit inode
+   and generation number for the inode encoded, and if necessary the
+   same information for the parent.
+
+ fh_to_dentry (mandatory)
+   Given a filehandle fragment, this should find the implied object and
+   create a dentry for it (possibly with d_alloc_anon).
+
+ fh_to_parent (optional but strongly recommended)
+   Given a filehandle fragment, this should find the parent of the
+   implied object and create a dentry for it (possibly with d_alloc_anon).
+   May fail if the filehandle fragment is too small.
+
+ get_parent (optional but strongly recommended)
+   When given a dentry for a directory, this should return a dentry for
+   the parent. Quite possibly the parent dentry will have been allocated
+   by d_alloc_anon. The default get_parent function just returns an error
+   so any filehandle lookup that requires finding a parent will fail.
+   ->lookup("..") is *not* used as a default as it can leave ".." entries
+   in the dcache which are too messy to work with.
+
+ get_name (optional)
+   When given a parent dentry and a child dentry, this should find a name
+   in the directory identified by the parent dentry, which leads to the
+   object identified by the child dentry. If no get_name function is
+   supplied, a default implementation is provided which uses vfs_readdir
+   to find potential names, and matches inode numbers to find the correct
+   match.
 
 
 A filehandle fragment consists of an array of 1 or more 4byte words,
@@ -172,5 +145,3 @@ generated by encode_fh, in which case it will have been padded with
 nuls. Rather, the encode_fh routine should choose a "type" which
 indicates the decode_fh how much of the filehandle is valid, and how
 it should be interpreted.
-
-
diff --git a/Documentation/i386/boot.txt b/Documentation/i386/boot.txt
index 35985b3..fc49b79 100644
--- a/Documentation/i386/boot.txt
+++ b/Documentation/i386/boot.txt
@@ -168,6 +168,8 @@ Offset Proto Name Meaning
 0234/1  2.05+   relocatable_kernel     Whether kernel is relocatable or not
 0235/3  N/A     pad2                   Unused
 0238/4  2.06+   cmdline_size           Maximum size of the kernel command line
+023C/4  2.07+   hardware_subarch       Hardware subarchitecture
+0240/8  2.07+   hardware_subarch_data  Subarchitecture-specific data
 
 (1) For backwards compatibility, if the setup_sects field contains 0, the
     real value is 4.
@@ -204,7 +206,7 @@ boot loaders can ignore those fields.
 
 The byte order of all fields is littleendian (this is x86, after all.)
 
-Field name:    setup_secs
+Field name:    setup_sects
 Type:          read
 Offset/size:   0x1f1/1
 Protocol:      ALL
@@ -356,6 +358,13 @@ Protocol: 2.00+
         - If 0, the protected-mode code is loaded at 0x10000.
         - If 1, the protected-mode code is loaded at 0x100000.
 
+  Bit 6 (write): KEEP_SEGMENTS
+        Protocol: 2.07+
+        - if 0, reload the segment registers in the 32bit entry point.
+        - if 1, do not reload the segment registers in the 32bit entry point.
+                Assume that %cs %ds %ss %es are all set to flat segments with
+                a base of 0 (or the equivalent for their environment).
+
   Bit 7 (write): CAN_USE_HEAP
         Set this bit to 1 to indicate that the value entered in the
         heap_end_ptr is valid. If this field is clear, some setup code
@@ -480,6 +489,29 @@ Protocol: 2.06+
   cmdline_size characters. With protocol version 2.05 and earlier, the
   maximum size was 255.
 
+Field name:    hardware_subarch
+Type:          write
+Offset/size:   0x23c/4
+Protocol:      2.07+
+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers needs to be done differently.
+
+  This field allows the bootloader to inform the kernel we are in
+  one of those environments.
+
+  0x00000000    The default x86/PC environment
+  0x00000001    lguest
+  0x00000002    Xen
+
+Field name:    hardware_subarch_data
+Type:          write
+Offset/size:   0x240/8
+Protocol:      2.07+
+
+  A pointer to data that is specific to hardware subarch
+
 
 **** THE KERNEL COMMAND LINE
 
@@ -753,3 +785,41 @@ IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
   After completing your hook, you should jump to the address
   that was in this field before your boot loader overwrote it
   (relocated, if appropriate.)
+
+
+**** 32-bit BOOT PROTOCOL
+
+For machine with some new BIOS other than legacy BIOS, such as EFI,
+LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
+based on legacy BIOS can not be used, so a 32-bit boot protocol needs
+to be defined.
+
+In 32-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+should be allocated and initialized to all zero. Then the setup header
+from offset 0x01f1 of kernel image on should be loaded into struct
+boot_params and examined. The end of setup header can be calculated as
+follow:
+
+        0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as that
+described in zero-page.txt.
+
+After setting up the struct boot_params, the boot loader can load the
+32/64-bit kernel in the same way as that of 16-bit boot protocol.
+
+In 32-bit boot protocol, the kernel is started by jumping to the
+32-bit kernel entry point, which is the start address of loaded
+32/64-bit kernel.
+
+At entry, the CPU must be in 32-bit protected mode with paging
+disabled; a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
+address of the struct boot_params; %ebp, %edi and %ebx must be zero.
diff --git a/Documentation/i386/zero-page.txt b/Documentation/i386/zero-page.txt
index 6c0817c..169ad42 100644
--- a/Documentation/i386/zero-page.txt
+++ b/Documentation/i386/zero-page.txt
@@ -1,99 +1,31 @@
----------------------------------------------------------------------------
-!!!!!!!!!!!!!!!WARNING!!!!!!!!
-The zero page is a kernel internal data structure, not a stable ABI. It might change
-without warning and the kernel has no way to detect old version of it.
-If you're writing some external code like a boot loader you should only use
-the stable versioned real mode boot protocol described in boot.txt. Otherwise the kernel
-might break you at any time.
-!!!!!!!!!!!!!WARNING!!!!!!!!!!!
-----------------------------------------------------------------------------
+The additional fields in struct boot_params as a part of 32-bit boot
+protocol of kernel. These should be filled by bootloader or 16-bit
+real-mode setup code of the kernel. References/settings to it mainly
+are in:
 
-Summary of boot_params layout (kernel point of view)
-     ( collected by Hans Lermen and Martin Mares )
-
-The contents of boot_params are used to pass parameters from the
-16-bit realmode code of the kernel to the 32-bit part. References/settings
-to it mainly are in:
+  include/asm-x86/bootparam.h
 
-  arch/i386/boot/setup.S
-  arch/i386/boot/video.S
-  arch/i386/kernel/head.S
-  arch/i386/kernel/setup.c
-
 
-Offset  Type            Description
-------  ----            -----------
-    0   32 bytes        struct screen_info, SCREEN_INFO
-                        ATTENTION, overlaps the following !!!
-    2   unsigned short  EXT_MEM_K, extended memory size in Kb (from int 0x15)
- 0x20   unsigned short  CL_MAGIC, commandline magic number (=0xA33F)
- 0x22   unsigned short  CL_OFFSET, commandline offset
-                        Address of commandline is calculated:
-                          0x90000 + contents of CL_OFFSET
-                        (only taken, when CL_MAGIC = 0xA33F)
- 0x40   20 bytes        struct apm_bios_info, APM_BIOS_INFO
- 0x60   16 bytes        Intel SpeedStep (IST) BIOS support information
- 0x80   16 bytes        hd0-disk-parameter from intvector 0x41
- 0x90   16 bytes        hd1-disk-parameter from intvector 0x46
+Offset   Proto  Name                     Meaning
+/Size
 
- 0xa0   16 bytes        System description table truncated to 16 bytes.
-                        ( struct sys_desc_table_struct )
- 0xb0 - 0x13f           Free. Add more parameters here if you really need them.
- 0x140- 0x1be          EDID_INFO Video mode setup
-
-0x1c4   unsigned long   EFI system table pointer
-0x1c8   unsigned long   EFI memory descriptor size
-0x1cc   unsigned long   EFI memory descriptor version
-0x1d0   unsigned long   EFI memory descriptor map pointer
-0x1d4   unsigned long   EFI memory descriptor map size
-0x1e0   unsigned long   ALT_MEM_K, alternative mem check, in Kb
-0x1e4   unsigned long   Scratch field for the kernel setup code
-0x1e8   char            number of entries in E820MAP (below)
-0x1e9   unsigned char   number of entries in EDDBUF (below)
-0x1ea   unsigned char   number of entries in EDD_MBR_SIG_BUFFER (below)
-0x1f1   char            size of setup.S, number of sectors
-0x1f2   unsigned short  MOUNT_ROOT_RDONLY (if !=0)
-0x1f4   unsigned short  size of compressed kernel-part in the
-                        (b)zImage-file (in 16 byte units, rounded up)
-0x1f6   unsigned short  swap_dev (unused AFAIK)
-0x1f8   unsigned short  RAMDISK_FLAGS
-0x1fa   unsigned short  VGA-Mode (old one)
-0x1fc   unsigned short  ORIG_ROOT_DEV (high=Major, low=minor)
-0x1ff   char            AUX_DEVICE_INFO
-
-0x200   short jump to start of setup code aka "reserved" field.
-0x202   4 bytes         Signature for SETUP-header, ="HdrS"
-0x206   unsigned short  Version number of header format
-                        Current version is 0x0201...
-0x208   8 bytes         (used by setup.S for communication with boot loaders,
-                         look there)
-0x210   char            LOADER_TYPE, = 0, old one
-                        else it is set by the loader:
-                        0xTV: T=0 for LILO
-                                1 for Loadlin
-                                2 for bootsect-loader
-                                3 for SYSLINUX
-                                4 for ETHERBOOT
-                                5 for ELILO
-                                7 for GRuB
-                                8 for U-BOOT
-                                9 for Xen
-                              V = version
-0x211   char            loadflags:
-                        bit0 = 1: kernel is loaded high (bzImage)
-                        bit7 = 1: Heap and pointer (see below) set by boot
-                                  loader.
-0x212   unsigned short  (setup.S)
-0x214   unsigned long   KERNEL_START, where the loader started the kernel
-0x218   unsigned long   INITRD_START, address of loaded ramdisk image
-0x21c   unsigned long   INITRD_SIZE, size in bytes of ramdisk image
-0x220   4 bytes         (setup.S)
-0x224   unsigned short  setup.S heap end pointer
-0x226   unsigned short  zero_pad
-0x228   unsigned long   cmd_line_ptr
-0x22c   unsigned long   ramdisk_max
-0x230   16 bytes        trampoline
-0x290 - 0x2cf           EDD_MBR_SIG_BUFFER (edd.S)
-0x2d0 - 0xd00           E820MAP
-0xd00 - 0xeff           EDDBUF (edd.S) for disk signature read sector
-0xd00 - 0xeeb           EDDBUF (edd.S) for edd data
+000/040  ALL    screen_info              Text mode or frame buffer information
+                                         (struct screen_info)
+040/014  ALL    apm_bios_info            APM BIOS information (struct apm_bios_info)
+060/010  ALL    ist_info                 Intel SpeedStep (IST) BIOS support information
+                                         (struct ist_info)
+080/010  ALL    hd0_info                 hd0 disk parameter, OBSOLETE!!
+090/010  ALL    hd1_info                 hd1 disk parameter, OBSOLETE!!
+0A0/010  ALL    sys_desc_table           System description table (struct sys_desc_table)
+140/080  ALL    edid_info                Video mode setup (struct edid_info)
+1C0/020  ALL    efi_info                 EFI 32 information (struct efi_info)
+1E0/004  ALL    alt_mem_k                Alternative mem check, in KB
+1E4/004  ALL    scratch                  Scratch field for the kernel setup code
+1E8/001  ALL    e820_entries             Number of entries in e820_map (below)
+1E9/001  ALL    eddbuf_entries           Number of entries in eddbuf (below)
+1EA/001  ALL    edd_mbr_sig_buf_entries  Number of entries in edd_mbr_sig_buffer
+                                         (below)
+290/040  ALL    edd_mbr_sig_buffer       EDD MBR signatures
+2D0/A00  ALL    e820_map                 E820 memory map table
+                                         (array of struct e820entry)
+D00/1EC  ALL    eddbuf                   EDD data (array of struct edd_info)
diff --git a/Documentation/ja_JP/SubmittingPatches b/Documentation/ja_JP/SubmittingPatches
new file mode 100644
index 0000000..a9dc124
--- /dev/null
+++ b/Documentation/ja_JP/SubmittingPatches
@@ -0,0 +1,556 @@
+NOTE:
+This is a version of Documentation/SubmittingPatches into Japanese.
+This document is maintained by Keiichi KII <k-keiichi@bx.jp.nec.com>
+and the JF Project team <http://www.linux.or.jp/JF/>.
+If you find any difference between this document and the original file
+or a problem with the translation,
+please contact the maintainer of this file or JF project.
+
+Please also note that the purpose of this file is to be easier to read
+for non English (read: Japanese) speakers and is not intended as a
+fork. So if you have any comments or updates of this file, please try
+to update the original English file first.
+
+Last Updated: 2007/10/24
+==================================
+これは、
+linux-2.6.23/Documentation/SubmittingPatches の和訳
+です。
+翻訳団体: JF プロジェクト < http://www.linux.or.jp/JF/ >
+翻訳日: 2007/10/17
+翻訳者: Keiichi Kii <k-keiichi at bx dot jp dot nec dot com>
+校正者: Masanari Kobayashi さん <zap03216 at nifty dot ne dot jp>
+        Matsukura さん <nbh--mats at nifty dot com>
+==================================
+
+        Linux カーネルに変更を加えるための Howto
+        又は
+        かの Linus Torvalds の取り扱い説明書
+
+Linux カーネルに変更を加えたいと思っている個人又は会社にとって、パッ
+チの投稿に関連した仕組みに慣れていなければ、その過程は時々みなさんを
+おじけづかせることもあります。この文章はあなたの変更を大いに受け入れ
+てもらえやすくする提案を集めたものです。
+
+コードを投稿する前に、Documentation/SubmitChecklist の項目リストに目
+を通してチェックしてください。もしあなたがドライバーを投稿しようとし
+ているなら、Documentation/SubmittingDrivers にも目を通してください。
+
+--------------------------------------------
+セクション1 パッチの作り方と送り方
+--------------------------------------------
+
+1) 「 diff -up 」
+------------
+
+パッチの作成には「 diff -up 」又は「 diff -uprN 」を使ってください。
+
+Linux カーネルに対する全ての変更は diff(1) コマンドによるパッチの形式で
+生成してください。パッチを作成するときには、diff(1) コマンドに「 -u 」引
+数を指定して、unified 形式のパッチを作成することを確認してください。また、
+変更がどの C 関数で行われたのかを表示する「 -p 」引数を使ってください。
+この引数は生成した差分をずっと読みやすくしてくれます。パッチは Linux
+カーネルソースの中のサブディレクトリではなく Linux カーネルソースのルート
+ディレクトリを基準にしないといけません。
+
+1個のファイルについてのパッチを作成するためには、ほとんどの場合、
+以下の作業を行えば十分です。
+
+        SRCTREE= linux-2.6
+        MYFILE= drivers/net/mydriver.c
+
+        cd $SRCTREE
+        cp $MYFILE $MYFILE.orig
+        vi $MYFILE # make your change
+        cd ..
+        diff -up $SRCTREE/$MYFILE{.orig,} > /tmp/patch
+
+複数のファイルについてのパッチを作成するためには、素の( vanilla )、す
+なわち変更を加えてない Linux カーネルを展開し、自分の Linux カーネル
+ソースとの差分を生成しないといけません。例えば、
+
+        MYSRC= /devel/linux-2.6
+
+        tar xvfz linux-2.6.12.tar.gz
+        mv linux-2.6.12 linux-2.6.12-vanilla
+        diff -uprN -X linux-2.6.12-vanilla/Documentation/dontdiff \
+                linux-2.6.12-vanilla $MYSRC > /tmp/patch
+
+dontdiff ファイルには Linux カーネルのビルドプロセスの過程で生成された
+ファイルの一覧がのっています。そして、それらはパッチを生成する diff(1)
+コマンドで無視されるべきです。dontdiff ファイルは 2.6.12 以後のバージョ
+ンの Linux カーネルソースツリーに含まれています。それより前のバージョン
+の Linux カーネルソースツリーに対する dontdiff ファイルは、
+<http://www.xenotime.net/linux/doc/dontdiff>から取得することができます。
+
+投稿するパッチの中に関係のない余分なファイルが含まれていないことを確
+認してください。diff(1) コマンドで生成したパッチがあなたの意図したとお
+りのものであることを確認してください。
+
+もしあなたのパッチが多くの差分を生み出すのであれば、あなたはパッチ
+を意味のあるひとまとまりごとに分けたいと思うかもしれません。
+これは他のカーネル開発者にとってレビューしやすくなるので、あなたの
+パッチを受け入れてもらうためにはとても重要なことです。これを補助でき
+る多くのスクリプトがあります。
+
+Quilt:
+http://savannah.nongnu.org/projects/quilt
+
+Andrew Morton's patch scripts:
+http://www.zip.com.au/~akpm/linux/patches/
+このリンクの先のスクリプトの代わりとして、quilt がパッチマネジメント
+ツールとして推奨されています(上のリンクを見てください)。
+
+2) パッチに対する説明
+
+パッチの中の変更点に対する技術的な詳細について説明してください。
+
+説明はできる限り具体的に。もっとも悪い説明は「ドライバー X を更新」、
+「ドライバー X に対するバグフィックス」あるいは「このパッチはサブシス
+テム X に対する更新を含んでいます。どうか取り入れてください。」などです。
+
+説明が長くなりだしたのであれば、おそらくそれはパッチを分ける必要がある
+という兆候です。次の #3 を見てください。
+
+3) パッチの分割
+
+意味のあるひとまとまりごとに変更を個々のパッチファイルに分けてください。
+
+例えば、もし1つのドライバーに対するバグフィックスとパフォーマンス強
+化の両方の変更を含んでいるのであれば、その変更を2つ以上のパッチに分
+けてください。もし変更箇所に API の更新と、その新しい API を使う新たな
+ドライバーが含まれているなら、2つのパッチに分けてください。
+
+一方で、もしあなたが多数のファイルに対して意味的に同じ1つの変更を加え
+るのであれば、その変更を1つのパッチにまとめてください。言いかえると、
+意味的に同じ1つの変更は1つのパッチの中に含まれます。
+
+あるパッチが変更を完結させるために他のパッチに依存していたとしても、
+それは問題ありません。パッチの説明の中で「このパッチはパッチ X に依存
+している」と簡単に注意書きをつけてください。
+
+もしパッチをより小さなパッチの集合に凝縮することができないなら、まずは
+15かそこらのパッチを送り、そのレビューと統合を待って下さい。
+
+4) パッチのスタイルチェック
+
+あなたのパッチが基本的な( Linux カーネルの)コーディングスタイルに違反し
+ていないかをチェックして下さい。その詳細を Documentation/CodingStyle で
+見つけることができます。コーディングスタイルの違反はレビューする人の
+時間を無駄にするだけなので、恐らくあなたのパッチは読まれることすらなく
+拒否されるでしょう。
+
+あなたはパッチを投稿する前に最低限パッチスタイルチェッカー
+( scripts/patchcheck.pl )を利用してパッチをチェックすべきです。
+もしパッチに違反がのこっているならば、それらの全てについてあなたは正当な
+理由を示せるようにしておく必要があります。
+
+5) 電子メールの宛先の選び方
+
+MAINTAINERS ファイルとソースコードに目を通してください。そして、その変
+更がメンテナのいる特定のサブシステムに加えられるものであることが分か
+れば、その人に電子メールを送ってください。
+
+もし、メンテナが載っていなかったり、メンテナからの応答がないなら、
+LKML ( linux-kernel@vger.kernel.org )へパッチを送ってください。ほとんど
+のカーネル開発者はこのメーリングリストに目を通しており、変更に対して
+コメントを得ることができます。
+
+15個より多くのパッチを同時に vger.kernel.org のメーリングリストへ送らな
+いでください!!!
+
+Linus Torvalds は Linux カーネルに入る全ての変更に対する最終的な意思決定者
+です。電子メールアドレスは torvalds@linux-foundation.org になります。彼は
+多くの電子メールを受け取っているため、できる限り彼に電子メールを送るのは
+避けるべきです。
+
+バグフィックスであったり、自明な変更であったり、話し合いをほとんど
+必要としないパッチは Linus へ電子メールを送るか CC しなければなりません。
+話し合いを必要としたり、明確なアドバンテージがないパッチは、通常まず
+は LKML へ送られるべきです。パッチが議論された後にだけ、そのパッチを
+Linus へ送るべきです。
+
+6) CC (カーボンコピー)先の選び方
+
+特に理由がないなら、LKML にも CC してください。
+
+Linus 以外のカーネル開発者は変更に気づく必要があり、その結果、彼らはそ
+の変更に対してコメントをくれたり、コードに対してレビューや提案をくれ
+るかもしれません。LKML とは Linux カーネル開発者にとって一番中心的なメー
+リングリストです。USB やフレームバッファデバイスや VFS や SCSI サブシステ
+ムなどの特定のサブシステムに関するメーリングリストもあります。あなた
+の変更に、はっきりと関連のあるメーリングリストについて知りたければ
+MAINTAINERS ファイルを参照してください。
+
+VGER.KERNEL.ORG でホスティングされているメーリングリストの一覧が下記の
+サイトに載っています。
+<http://vger.kernel.org/vger-lists.html>
+
+もし、変更がユーザランドのカーネルインタフェースに影響を与え
+るのであれば、MAN-PAGES のメンテナ( MAINTAINERS ファイルに一覧
+があります)に man ページのパッチを送ってください。少なくとも
+情報がマニュアルページの中に入ってくるように、変更が起きたという
+通知を送ってください。
+
+たとえ、メンテナが #4 で反応がなかったとしても、メンテナのコードに変更を
+加えたときには、いつもメンテナに CC するのを忘れないようにしてください。
+
+小さなパッチであれば、Adrian Bunk が管理している Trivial Patch Monkey
+(ちょっとしたパッチを集めている)<trivial@kernel.org>に CC してもいい
+です。ちょっとしたパッチとは以下のルールのどれか1つを満たしていなけ
+ればなりません。
+  ・ドキュメントのスペルミスの修正
+  ・grep(1) コマンドによる検索を困難にしているスペルの修正
+  ・コンパイル時の警告の修正(無駄な警告が散乱することは好ましくないた
+    めです)
+  ・コンパイル問題の修正(それらの修正が本当に正しい場合に限る)
+  ・実行時の問題の修正(それらの修正が本当に問題を修正している場合に限る)
+  ・廃止予定の関数やマクロを使用しているコードの除去(例 check_region )
+  ・問い合わせ先やドキュメントの修正
+  ・移植性のないコードから移植性のあるコードへの置き換え(小さい範囲で
+    あればアーキテクチャ特有のことでも他の人がコピーできます)
+  ・作者やメンテナによる修正(すなわち patch monkey の再転送モード)
+URL: <http://www.kernel.org/pub/linux/kernel/people/bunk/trivial/>
+
+7) MIME やリンクや圧縮ファイルや添付ファイルではなくプレインテキストのみ
+
+Linus や他のカーネル開発者はあなたが投稿した変更を読んで、コメントでき
+る必要があります。カーネル開発者にとって、あなたが書いたコードの特定の
+部分にコメントをするために、標準的な電子メールクライアントで変更が引用
+できることは重要です。
+
+上記の理由で、すべてのパッチは文中に含める形式の電子メールで投稿さ
+れるべきです。警告:あなたがパッチをコピー&ペーストする際には、パッ
+チを改悪するエディターの折り返し機能に注意してください。
+
+パッチを圧縮の有無に関わらず MIME 形式で添付しないでください。多くのポ
+ピュラーな電子メールクライアントは MIME 形式の添付ファイルをプレーンテ
+キストとして送信するとは限らないでしょう。そうなると、電子メールクラ
+イアントがコードに対するコメントを付けることをできなくします。また、
+MIME 形式の添付ファイルは Linus に手間を取らせることになり、その変更を
+受け入れてもらう可能性が低くなってしまいます。
+
+例外:お使いの電子メールクライアントがパッチをめちゃくちゃにするので
+あれば、誰かが MIME 形式のパッチを再送するよう求めるかもしれません。
+
+警告: Mozilla のような特定の電子メールクライアントは電子メールの
+ヘッダに以下のものを付加して送ります。
+---- message header ----
+Content-Type: text/plain; charset=us-ascii; format=flowed
+---- message header ----
+問題は、「 format=flowed 」が付いた電子メールを特定の受信側の電子メール
+クライアントがタブをスペースに置き換えるというような変更をすることです。
+したがって送られてきたパッチは壊れているように見えるでしょう。
+
+これを修正するには、mozilla の defaults/pref/mailnews.js ファイルを
+以下のように修正します。
+pref("mailnews.send_plaintext_flowed", false); // RFC 2646=======
+pref("mailnews.display.disable_format_flowed_support", true);
+
+8) 電子メールのサイズ
+
+パッチを Linus へ送るときは常に #7 の手順に従ってください。
+
+大きなパッチはメーリングリストやメンテナにとって不親切です。パッチが
+未圧縮で 40KB を超えるようであるなら、インターネット上のアクセス可能な
+サーバに保存し、保存場所を示す URL を伝えるほうが適切です。
+
+9) カーネルバージョンの明記
+
+パッチが対象とするカーネルのバージョンをパッチの概要か電子メールの
+サブジェクトに付けることが重要です。
+
+パッチが最新バージョンのカーネルに正しく適用できなければ、Linus は
+そのパッチを採用しないでしょう。
+
+10) がっかりせず再投稿
+
+パッチを投稿した後は、辛抱強く待っていてください。Linus があなたのパッ
+チを気に入って採用すれば、Linus がリリースする次のバージョンのカーネル
+の中で姿を見せるでしょう。
+
+しかし、パッチが次のバージョンのカーネルに入っていないなら、いくつもの
+理由があるのでしょう。その原因を絞り込み、間違っているものを正し、更新
+したパッチを投稿するのはあなたの仕事です。
+
+Linus があなたのパッチに対して何のコメントもなく不採用にすることは極め
+て普通のことです。それは自然な姿です。もし、Linus があなたのパッチを受
+け取っていないのであれば、以下の理由が考えられます。
+* パッチが最新バージョンの Linux カーネルにきちんと適用できなかった
+* パッチが LKML で十分に議論されていなかった
+* スタイルの問題(セクション2を参照)
+* 電子メールフォーマットの問題(このセクションを参照)
+* パッチに対する技術的な問題
+* Linus はたくさんの電子メールを受け取っているので、どさくさに紛れて見
+  失った
+* 不愉快にさせている
+
+判断できない場合は、LKML にコメントを頼んでください。
+
+11) サブジェクトに「 PATCH 」
+
+Linus や LKML への大量の電子メールのために、サブジェクトのプレフィックスに
+「 [PATCH] 」を付けることが慣習となっています。これによって Linus や他の
+カーネル開発者がパッチであるのか、又は、他の議論に関する電子メールであるの
+かをより簡単に識別できます。
+
+12) パッチへの署名
+
+誰が何をしたのかを追いかけやすくするために (特に、パッチが何人かの
+メンテナを経て最終的に Linux カーネルに取り込まれる場合のために)、電子
+メールでやり取りされるパッチに対して「 sign-off 」という手続きを導入し
+ました。
+
+「 sign-off 」とは、パッチがあなたの書いたものであるか、あるいは、
+あなたがそのパッチをオープンソースとして提供する権利を保持している、
+という証明をパッチの説明の末尾に一行記載するというものです。
+ルールはとても単純です。以下の項目を確認して下さい。
+
+        原作者の証明書( DCO ) 1.1
+
+        このプロジェクトに寄与するものとして、以下のことを証明する。
+
+        (a) 本寄与は私が全体又は一部作成したものであり、私がそのファイ
+            ル中に明示されたオープンソースライセンスの下で公開する権利
+            を持っている。もしくは、
+
+        (b) 本寄与は、私が知る限り、適切なオープンソースライセンスでカバ
+            ーされている既存の作品を元にしている。同時に、私はそのライセ
+            ンスの下で、私が全体又は一部作成した修正物を、ファイル中で示
+            される同一のオープンソースライセンスで(異なるライセンスの下で
+            投稿することが許可されている場合を除いて)投稿する権利を持って
+            いる。もしくは、
+
+        (c) 本寄与は(a)、(b)、(c)を証明する第3者から私へ直接提供された
+            ものであり、私はそれに変更を加えていない。
+
+        (d) 私はこのプロジェクトと本寄与が公のものであることに理解及び同意す
+            る。同時に、関与した記録(投稿の際の全ての個人情報と sign-off を
+            含む)が無期限に保全されることと、当該プロジェクト又は関連する
+            オープンソースライセンスに沿った形で再配布されることに理解及び
+            同意する。
+
+もしこれに同意できるなら、以下のような1行を追加してください。
+
+        Signed-off-by: Random J Developer <random@developer.example.org>
+
+実名を使ってください。(残念ですが、偽名や匿名による寄与はできません。)
+
+人によっては sign-off の近くに追加のタグを付加しています。それらは今のところ
+無視されますが、あなたはそのタグを社内の手続きに利用したり、sign-off に特別
+な情報を示したりすることができます。
+
+13) いつ Acked-by: を使うのか
+
+「 Signed-off-by: 」タグはその署名者がパッチの開発に関わっていたことやパッチ
+の伝播パスにいたことを示しています。
+
+ある人が直接パッチの準備や作成に関わっていないけれど、その人のパッチに対す
+る承認を記録し、示したいとします。その場合、その人を示すのに Acked-by: が使
+えます。Acked-by: はパッチのチェンジログにも追加されます。
+
+パッチの影響を受けるコードのメンテナがパッチに関わっていなかったり、パッチ
+の伝播パスにいなかった時にも、メンテナは Acked-by: をしばしば利用します。
+
+Acked-by: は Signed-off-by: のように公式なタグではありません。それはメンテナが
+少なくともパッチをレビューし、同意を示しているという記録です。そのような
+ことからパッチの統合者がメンテナの「うん、良いと思うよ」という発言を
+Acked-by: へ置き換えることがあります。
+
+Acked-by: が必ずしもパッチ全体の承認を示しているわけではありません。例えば、
+あるパッチが複数のサブシステムへ影響を与えており、その中の1つのサブシステム
+のメンテナからの Acked-by: を持っているとします。その場合、Acked-by: は通常
+そのメンテナのコードに影響を与える一部分だけに対する承認を示しています。
+この点は、ご自分で判断してください。(その Acked-by: が)疑わしい場合は、
+メーリングリストアーカイブの中の大元の議論を参照すべきです。
+
+14) 標準的なパッチのフォーマット
+
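+The canonical patch subject line is:
+
+    Subject: [PATCH 001/123] subsystem: summary phrase
+
+The canonical patch message body contains the following:
+
+  - A "from" line specifying the patch author.
+
+  - An empty line.
+
+  - The body of the explanation, which will be copied to the
+    permanent changelog to describe this patch.
+
+  - The "Signed-off-by:" lines described above, which will also be
+    copied into the changelog.
+
+  - A marker line containing simply "---".
+
+  - Any additional comments that are not suitable for the changelog.
+
+  - The actual patch (diff output).
+
+The subject line format makes it very easy to sort e-mails
+alphabetically by subject (most e-mail clients support sorting).
+Because the sequence number in the subject is zero-padded, numerical
+and alphabetical sorting give the same result.
+
+The "subsystem" in the e-mail subject should identify which area or
+subsystem of the kernel is being patched.
+
+The "summary phrase" in the e-mail subject should accurately describe
+the patch that the e-mail contains. The "summary phrase" should not be
+a filename. Do not use the same "summary phrase" for every patch in a
+patch series (where a "patch series" is an ordered sequence of
+multiple, related patches).
+
+Bear in mind that the "summary phrase" of your e-mail becomes a
+globally unique identifier for that patch. It propagates all the way
+into the git changelog. Developers may later use the "summary phrase"
+in discussions to refer to the patch. People will want to google for
+the "summary phrase" to read the discussion about that patch.
+
+Two example subjects:
+
+    Subject: [patch 2/5] ext2: improve scalability of bitmap searching
+    Subject: [PATCHv2 001/207] x86: fix eflags tracking
+
+The "from" line must be the very first line of the message body, and
+has the form:
+
+        From: Original Author <author@example.com>
+
+The "from" line specifies who will be credited as the author of the
+patch in the changelog. If the "from" line is missing, the "From:"
+line of the e-mail header is used to determine the patch author in
+the changelog.
+
+The explanation body is committed to the permanent source changelog,
+so it should make sense to a competent reader who has long since
+forgotten the details of the discussion that led to the patch.
+
+The "---" marker line serves the essential purpose of telling patch
+handling tools where the changelog message ends.
+
+One good use for the additional comments after the "---" marker is a
+diffstat, which shows which files have changed and how many lines
+were inserted and deleted per file. A diffstat is especially useful
+for bigger patches. Comments that are relevant only at the moment or
+only to the maintainer do not belong in the permanent changelog, so
+they should also go after the marker line. Because filenames are
+listed relative to the top of the kernel source tree, use the
+diffstat options "-p 1 -w 70" so that they do not take up too much
+horizontal space (the output then fits in 80 columns, including
+indentation).
+
+For more details on the proper patch format, see the references in
+Section 3.
+
+------------------------------------
+SECTION 2 - HINTS, TIPS, AND TRICKS
+------------------------------------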
+
+This section lists many of the common "rules" associated with
+submitting changes to the Linux kernel. There are always exceptions,
+but you must have a really good reason for making one. You could
+probably call this section Linus's Computer Science 101.
+
+1) Read Documentation/CodingStyle
+
+Needless to say, if your code deviates too much from this coding
+style, it may be rejected without review or comment.
+
+The one significant exception is when moving code from one file to
+another. In this case, the patch that moves the code should not modify
+the moved code in any other way. This clearly separates the act of
+moving the code from your changes to it. It greatly aids review of
+what actually changed, and it makes it easier for tools to track the
+history of the code.
+
+Check your patches with the patch style checker
+(scripts/checkpatch.pl) before submitting them. The style checker
+should be viewed as a guide, not as the final word. If your code looks
+better with a violation than with the fix, it is probably best left
+alone.
+
+The style checker reports at three levels:
+ - ERROR: things that are very likely to be wrong
+ - WARNING: things requiring careful review
+ - CHECK: things requiring thought
+
+You should be able to justify every violation that remains in your
+patch.
+
+2) #ifdefs are ugly
+
+Code cluttered with ifdefs is difficult to read and maintain. Do not
+use ifdefs in the code. Instead, put your ifdefs in a header file and
+conditionally define the functions used in the code as "static inline"
+functions or macros. The compiler will then optimize away the places
+that do nothing.
+
+A simple example of poor code:
+
+	dev = alloc_etherdev (sizeof(struct funky_private));
+	if (!dev)
+		return -ENODEV;
+	#ifdef CONFIG_NET_FUNKINESS
+	init_funky_net(dev);
+	#endif
+
+The cleaned-up example:
+
+(in header)
+	#ifndef CONFIG_NET_FUNKINESS
+	static inline void init_funky_net (struct net_device *d) {}
+	#endif
+
+(in the code itself)
+	dev = alloc_etherdev (sizeof(struct funky_private));
+	if (!dev)
+		return -ENODEV;
+	init_funky_net(dev);
+
+3) "static inline" is better than a macro
+
+"static inline" functions are greatly preferred over macros. They
+provide type safety, have no length limitations, and have no
+formatting limitations. Under gcc they are as cheap as macros.
+
+Macros should be used only where a "static inline" is clearly
+suboptimal (a few isolated cases in fast paths) or where a
+"static inline" function cannot be used (such as stringizing or
+concatenating macro arguments).
+
+"static inline" is preferred over "static __inline__",
+"extern inline", and "extern __inline__".
+
+4) Don't over-design
+
+Do not design for nebulous future cases that may or may not turn out
+to be useful. "Make it as simple as you can, and no simpler."
+
+----------------------
+SECTION 3 - REFERENCES
+----------------------
+
+Andrew Morton, "The
perfect patch" (tpp). + <http://www.zip.com.au/~akpm/linux/patches/stuff/tpp.txt> + +Jeff Garzik, "Linux kernel patch submission format". + <http://linux.yyz.us/patch-format.html> + +Greg Kroah-Hartman, "How to piss off a kernel subsystem maintainer". + <http://www.kroah.com/log/2005/03/31/> + <http://www.kroah.com/log/2005/07/08/> + <http://www.kroah.com/log/2005/10/19/> + <http://www.kroah.com/log/2006/01/11/> + +NO!!!! No more huge patch bombs to linux-kernel@vger.kernel.org people! + <http://marc.theaimsgroup.com/?l=linux-kernel&m=112112749912944&w=2> + +Kernel Documentation/CodingStyle: + <http://users.sosdg.org/~qiyong/lxr/source/Documentation/CodingStyle> + +Linus Torvalds's mail on the canonical patch format: + <http://lkml.org/lkml/2005/4/7/183> +-- diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt index 6166e2d..7a77533 100644 --- a/Documentation/kbuild/makefiles.txt +++ b/Documentation/kbuild/makefiles.txt @@ -519,17 +519,17 @@ more details, with real examples. to the user why it stops. cc-cross-prefix - cc-cross-prefix is used to check if there exist a $(CC) in path with + cc-cross-prefix is used to check if there exists a $(CC) in path with one of the listed prefixes. The first prefix where there exist a prefix$(CC) in the PATH is returned - and if no prefix$(CC) is found then nothing is returned. Additional prefixes are separated by a single space in the call of cc-cross-prefix. - This functionality is usefull for architecture Makefile that try - to set CROSS_COMPILE to well know values but may have several + This functionality is useful for architecture Makefiles that try + to set CROSS_COMPILE to well-known values but may have several values to select between. - It is recommended only to try to set CROSS_COMPILE is it is a cross - build (host arch is different from target arch). 
And is CROSS_COMPILE + It is recommended only to try to set CROSS_COMPILE if it is a cross + build (host arch is different from target arch). And if CROSS_COMPILE is already set then leave it with the old value. Example: diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 6accd36..33121d6 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -422,7 +422,8 @@ and is between 256 and 4096 characters. It is defined in the file hpet= [X86-32,HPET] option to control HPET usage Format: { enable (default) | disable | force } disable: disable HPET and use PIT instead - force: allow force enabled of undocumented chips (ICH4, VIA) + force: allow force enabled of undocumented chips (ICH4, + VIA, nVidia) com20020= [HW,NET] ARCnet - COM20020 chipset Format: @@ -585,11 +586,6 @@ and is between 256 and 4096 characters. It is defined in the file eata= [HW,SCSI] - ec_intr= [HW,ACPI] ACPI Embedded Controller interrupt mode - Format: <int> - 0: polling mode - non-0: interrupt mode (default) - edd= [EDD] Format: {"of[f]" | "sk[ipmbr]"} See comment in arch/i386/boot/edd.S @@ -772,6 +768,23 @@ and is between 256 and 4096 characters. It is defined in the file inttest= [IA64] + intel_iommu= [DMAR] Intel IOMMU driver (DMAR) option + off + Disable intel iommu driver. + igfx_off [Default Off] + By default, gfx is mapped as normal device. If a gfx + device has a dedicated DMAR unit, the DMAR unit is + bypassed by not enabling DMAR with this option. In + this case, gfx device will use physical address for + DMA. + forcedac [x86_64] + With this option iommu will not optimize to look + for io virtual address below 32 bit forcing dual + address cycle on pci bus for cards supporting greater + than 32 bit addressing. The default is to look + for translation below 32 bit and if not available + then look in the higher range. 
+ io7= [HW] IO7 for Marvel based alpha systems See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c. @@ -1426,7 +1439,8 @@ and is between 256 and 4096 characters. It is defined in the file Param: "schedule" - profile schedule points. Param: <number> - step/bucket size as a power of 2 for statistical time based profiling. - Param: "sleep" - profile D-state sleeping (millisecs) + Param: "sleep" - profile D-state sleeping (millisecs). + Requires CONFIG_SCHEDSTATS Param: "kvm" - profile VM exits. processor.max_cstate= [HW,ACPI] diff --git a/Documentation/lguest/Makefile b/Documentation/lguest/Makefile index c0b7a45..bac037e 100644 --- a/Documentation/lguest/Makefile +++ b/Documentation/lguest/Makefile @@ -1,28 +1,8 @@ # This creates the demonstration utility "lguest" which runs a Linux guest. - -# For those people that have a separate object dir, look there for .config -KBUILD_OUTPUT := ../.. -ifdef O - ifeq ("$(origin O)", "command line") - KBUILD_OUTPUT := $(O) - endif -endif -# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary. -include $(KBUILD_OUTPUT)/.config -LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x08000000) - -CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -Wl,-T,lguest.lds +CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 -I../../include LDLIBS:=-lz -# Removing this works for some versions of ld.so (eg. Ubuntu Feisty) and -# not others (eg. FC7). -LDFLAGS+=-static -all: lguest.lds lguest -# The linker script on x86 is so complex the only way of creating one -# which will link our binary in the right place is to mangle the -# default one. 
-lguest.lds: - $(LD) --verbose | awk '/^==========/ { PRINT=1; next; } /SIZEOF_HEADERS/ { gsub(/0x[0-9A-F]*/, "$(LGUEST_GUEST_TOP)") } { if (PRINT) print $$0; }' > $@ +all: lguest clean: - rm -f lguest.lds lguest + rm -f lguest diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c index 103e346..f266839 100644 --- a/Documentation/lguest/lguest.c +++ b/Documentation/lguest/lguest.c @@ -1,10 +1,7 @@ /*P:100 This is the Launcher code, a simple program which lays out the * "physical" memory for the new Guest by mapping the kernel image and the * virtual devices, then reads repeatedly from /dev/lguest to run the Guest. - * - * The only trick: the Makefile links it at a high address so it will be clear - * of the guest memory region. It means that each Guest cannot have more than - * about 2.5G of memory on a normally configured Host. :*/ +:*/ #define _LARGEFILE64_SOURCE #define _GNU_SOURCE #include <stdio.h> @@ -15,6 +12,7 @@ #include <stdlib.h> #include <elf.h> #include <sys/mman.h> +#include <sys/param.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/wait.h> @@ -34,19 +32,26 @@ #include <termios.h> #include <getopt.h> #include <zlib.h> -/*L:110 We can ignore the 28 include files we need for this program, but I do +#include <assert.h> +#include <sched.h> +#include "linux/lguest_launcher.h" +#include "linux/virtio_config.h" +#include "linux/virtio_net.h" +#include "linux/virtio_blk.h" +#include "linux/virtio_console.h" +#include "linux/virtio_ring.h" +#include "asm-x86/bootparam.h" +/*L:110 We can ignore the 38 include files we need for this program, but I do * want to draw attention to the use of kernel-style types. * * As Linus said, "C is a Spartan language, and so should your naming be." I - * like these abbreviations and the header we need uses them, so we define them - * here. - */ + * like these abbreviations, so we define them here. 
Note that u64 is always + * unsigned long long, which works on all Linux systems: this means that we can + * use %llu in printf for any u64. */ typedef unsigned long long u64; typedef uint32_t u32; typedef uint16_t u16; typedef uint8_t u8; -#include "../../include/linux/lguest_launcher.h" -#include "../../include/asm-x86/e820_32.h" /*:*/ #define PAGE_PRESENT 0x7 /* Present, RW, Execute */ @@ -55,6 +60,10 @@ typedef uint8_t u8; #ifndef SIOCBRADDIF #define SIOCBRADDIF 0x89a2 /* add interface to bridge */ #endif +/* We can have up to 256 pages for devices. */ +#define DEVICE_PAGES 256 +/* This fits nicely in a single 4096-byte page. */ +#define VIRTQUEUE_NUM 127 /*L:120 verbose is both a global flag and a macro. The C preprocessor allows * this, and although I wouldn't recommend it, it works quite nicely here. */ @@ -65,8 +74,10 @@ static bool verbose; /* The pipe to send commands to the waker process */ static int waker_fd; -/* The top of guest physical memory. */ -static u32 top; +/* The pointer to the start of guest memory. */ +static void *guest_base; +/* The maximum guest physical address allowed, and maximum possible. */ +static unsigned long guest_limit, guest_max; /* This is our list of devices. */ struct device_list @@ -76,8 +87,17 @@ struct device_list fd_set infds; int max_infd; + /* Counter to assign interrupt numbers. */ + unsigned int next_irq; + + /* Counter to print out convenient device numbers. */ + unsigned int device_num; + /* The descriptor page for the devices. */ - struct lguest_device_desc *descs; + u8 *descpage; + + /* The tail of the last descriptor. */ + unsigned int desc_used; /* A single linked list of devices. */ struct device *dev; @@ -85,31 +105,111 @@ struct device_list struct device **lastdev; }; +/* The list of Guest devices, based on command line arguments. */ +static struct device_list devices; + /* The device structure describes a single device. */ struct device { /* The linked-list pointer. 
*/ struct device *next; - /* The descriptor for this device, as mapped into the Guest. */ + + /* This device's descriptor, as mapped into the Guest. */ struct lguest_device_desc *desc; - /* The memory page(s) of this device, if any. Also mapped in Guest. */ - void *mem; + + /* The name of this device, for --verbose. */ + const char *name; /* If handle_input is set, it wants to be called when this file * descriptor is ready. */ int fd; bool (*handle_input)(int fd, struct device *me); - /* If handle_output is set, it wants to be called when the Guest sends - * DMA to this key. */ - unsigned long watch_key; - u32 (*handle_output)(int fd, const struct iovec *iov, - unsigned int num, struct device *me); + /* Any queues attached to this device */ + struct virtqueue *vq; /* Device-specific data. */ void *priv; }; +/* The virtqueue structure describes a queue attached to a device. */ +struct virtqueue +{ + struct virtqueue *next; + + /* Which device owns me. */ + struct device *dev; + + /* The configuration for this queue. */ + struct lguest_vqconfig config; + + /* The actual ring of buffers. */ + struct vring vring; + + /* Last available index we saw. */ + u16 last_avail_idx; + + /* The routine to call when the Guest pings us. */ + void (*handle_output)(int fd, struct virtqueue *me); +}; + +/* Since guest is UP and we don't run at the same time, we don't need barriers. + * But I include them in the code in case others copy it. */ +#define wmb() + +/* Convert an iovec element to the given type. + * + * This is a fairly ugly trick: we need to know the size of the type and + * alignment requirement to check the pointer is kosher. It's also nice to + * have the name of the type in case we report failure. + * + * Typing those three things all the time is cumbersome and error prone, so we + * have a macro which sets them all up and passes to the real function. 
*/ +#define convert(iov, type) \ + ((type *)_convert((iov), sizeof(type), __alignof__(type), #type)) + +static void *_convert(struct iovec *iov, size_t size, size_t align, + const char *name) +{ + if (iov->iov_len != size) + errx(1, "Bad iovec size %zu for %s", iov->iov_len, name); + if ((unsigned long)iov->iov_base % align != 0) + errx(1, "Bad alignment %p for %s", iov->iov_base, name); + return iov->iov_base; +} + +/* The virtio configuration space is defined to be little-endian. x86 is + * little-endian too, but it's nice to be explicit so we have these helpers. */ +#define cpu_to_le16(v16) (v16) +#define cpu_to_le32(v32) (v32) +#define cpu_to_le64(v64) (v64) +#define le16_to_cpu(v16) (v16) +#define le32_to_cpu(v32) (v32) +#define le64_to_cpu(v64) (v64) + +/*L:100 The Launcher code itself takes us out into userspace, that scary place + * where pointers run wild and free! Unfortunately, like most userspace + * programs, it's quite boring (which is why everyone likes to hack on the + * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it + * will get you through this section. Or, maybe not. + * + * The Launcher sets up a big chunk of memory to be the Guest's "physical" + * memory and stores it in "guest_base". In other words, Guest physical == + * Launcher virtual with an offset. + * + * This can be tough to get your head around, but usually it just means that we + * use these trivial conversion functions when the Guest gives us its + * "physical" addresses: */ +static void *from_guest_phys(unsigned long addr) +{ + return guest_base + addr; +} + +static unsigned long to_guest_phys(const void *addr) +{ + return (addr - guest_base); +} + /*L:130 * Loading the Kernel. * @@ -123,43 +223,55 @@ static int open_or_die(const char *name, int flags) return fd; } -/* map_zeroed_pages() takes a (page-aligned) address and a number of pages. 
*/ -static void *map_zeroed_pages(unsigned long addr, unsigned int num) +/* map_zeroed_pages() takes a number of pages. */ +static void *map_zeroed_pages(unsigned int num) { - /* We cache the /dev/zero file-descriptor so we only open it once. */ - static int fd = -1; - - if (fd == -1) - fd = open_or_die("/dev/zero", O_RDONLY); + int fd = open_or_die("/dev/zero", O_RDONLY); + void *addr; /* We use a private mapping (ie. if we write to the page, it will be - * copied), and obviously we insist that it be mapped where we ask. */ - if (mmap((void *)addr, getpagesize() * num, - PROT_READ|PROT_WRITE|PROT_EXEC, MAP_FIXED|MAP_PRIVATE, fd, 0) - != (void *)addr) - err(1, "Mmaping %u pages of /dev/zero @%p", num, (void *)addr); - - /* Returning the address is just a courtesy: can simplify callers. */ - return (void *)addr; + * copied). */ + addr = mmap(NULL, getpagesize() * num, + PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, fd, 0); + if (addr == MAP_FAILED) + err(1, "Mmaping %u pages of /dev/zero", num); + + return addr; } -/* To find out where to start we look for the magic Guest string, which marks - * the code we see in lguest_asm.S. This is a hack which we are currently - * plotting to replace with the normal Linux entry point. */ -static unsigned long entry_point(void *start, void *end, - unsigned long page_offset) +/* Get some more pages for a device. */ +static void *get_pages(unsigned int num) { - void *p; + void *addr = from_guest_phys(guest_limit); - /* The scan gives us the physical starting address. We want the - * virtual address in this case, and fortunately, we already figured - * out the physical-virtual difference and passed it here in - * "page_offset". 
*/ - for (p = start; p < end; p++) - if (memcmp(p, "GenuineLguest", strlen("GenuineLguest")) == 0) - return (long)p + strlen("GenuineLguest") + page_offset; + guest_limit += num * getpagesize(); + if (guest_limit > guest_max) + errx(1, "Not enough memory for devices"); + return addr; +} - err(1, "Is this image a genuine lguest?"); +/* This routine is used to load the kernel or initrd. It tries mmap, but if + * that fails (Plan 9's kernel file isn't nicely aligned on page boundaries), + * it falls back to reading the memory in. */ +static void map_at(int fd, void *addr, unsigned long offset, unsigned long len) +{ + ssize_t r; + + /* We map writable even though some segments are marked read-only. + * The kernel really wants to be writable: it patches its own + * instructions. + * + * MAP_PRIVATE means that the page won't be copied until a write is + * done to it. This allows us to share untouched memory between + * Guests. */ + if (mmap(addr, len, PROT_READ|PROT_WRITE|PROT_EXEC, + MAP_FIXED|MAP_PRIVATE, fd, offset) != MAP_FAILED) + return; + + /* pread does a seek and a read in one shot: saves a few lines. */ + r = pread(fd, addr, len, offset); + if (r != len) + err(1, "Reading offset %lu len %lu gave %zi", offset, len, r); } /* This routine takes an open vmlinux image, which is in ELF, and maps it into @@ -167,19 +279,14 @@ static unsigned long entry_point(void *start, void *end, * by all modern binaries on Linux including the kernel. * * The ELF headers give *two* addresses: a physical address, and a virtual - * address. The Guest kernel expects to be placed in memory at the physical - * address, and the page tables set up so it will correspond to that virtual - * address. We return the difference between the virtual and physical - * addresses in the "page_offset" pointer. + * address. We use the physical address; the Guest will map itself to the + * virtual address. * * We return the starting address. 
*/ -static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, - unsigned long *page_offset) +static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr) { - void *addr; Elf32_Phdr phdr[ehdr->e_phnum]; unsigned int i; - unsigned long start = -1UL, end = 0; /* Sanity checks on the main ELF header: an x86 executable with a * reasonable number of correctly-sized program headers. */ @@ -199,9 +306,6 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr)) err(1, "Reading program headers"); - /* We don't know page_offset yet. */ - *page_offset = 0; - /* Try all the headers: there are usually only three. A read-only one, * a read-write one, and a "note" section which isn't loadable. */ for (i = 0; i < ehdr->e_phnum; i++) { @@ -212,158 +316,53 @@ static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr, verbose("Section %i: size %i addr %p\n", i, phdr[i].p_memsz, (void *)phdr[i].p_paddr); - /* We expect a simple linear address space: every segment must - * have the same difference between virtual (p_vaddr) and - * physical (p_paddr) address. */ - if (!*page_offset) - *page_offset = phdr[i].p_vaddr - phdr[i].p_paddr; - else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr) - errx(1, "Page offset of section %i different", i); - - /* We track the first and last address we mapped, so we can - * tell entry_point() where to scan. */ - if (phdr[i].p_paddr < start) - start = phdr[i].p_paddr; - if (phdr[i].p_paddr + phdr[i].p_filesz > end) - end = phdr[i].p_paddr + phdr[i].p_filesz; - - /* We map this section of the file at its physical address. We - * map it read & write even if the header says this segment is - * read-only. The kernel really wants to be writable: it - * patches its own instructions which would normally be - * read-only. - * - * MAP_PRIVATE means that the page won't be copied until a - * write is done to it. This allows us to share much of the - * kernel memory between Guests. 
*/ - addr = mmap((void *)phdr[i].p_paddr, - phdr[i].p_filesz, - PROT_READ|PROT_WRITE|PROT_EXEC, - MAP_FIXED|MAP_PRIVATE, - elf_fd, phdr[i].p_offset); - if (addr != (void *)phdr[i].p_paddr) - err(1, "Mmaping vmlinux seg %i gave %p not %p", - i, addr, (void *)phdr[i].p_paddr); + /* We map this section of the file at its physical address. */ + map_at(elf_fd, from_guest_phys(phdr[i].p_paddr), + phdr[i].p_offset, phdr[i].p_filesz); } - return entry_point((void *)start, (void *)end, *page_offset); + /* The entry point is given in the ELF header. */ + return ehdr->e_entry; } -/*L:170 Prepare to be SHOCKED and AMAZED. And possibly a trifle nauseated. - * - * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects - * to be. We don't know what that option was, but we can figure it out - * approximately by looking at the addresses in the code. I chose the common - * case of reading a memory location into the %eax register: - * - * movl <some-address>, %eax - * - * This gets encoded as five bytes: "0xA1 <4-byte-address>". For example, - * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax. - * - * In this example can guess that the kernel was compiled with - * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number). If the - * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our - * kernel isn't that bloated yet. - * - * Unfortunately, x86 has variable-length instructions, so finding this - * particular instruction properly involves writing a disassembler. Instead, - * we rely on statistics. We look for "0xA1" and tally the different bytes - * which occur 4 bytes later (the "0xC0" in our example above). When one of - * those bytes appears three times, we can be reasonably confident that it - * forms the start of CONFIG_PAGE_OFFSET. +/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're + * supposed to jump into it and it will unpack itself. 
We used to have to + * perform some hairy magic because the unpacking code scared me. * - * This is amazingly reliable. */ -static unsigned long intuit_page_offset(unsigned char *img, unsigned long len) + * Fortunately, Jeremy Fitzhardinge convinced me it wasn't that hard and wrote + * a small patch to jump over the tricky bits in the Guest, so now we just read + * the funky header so we know where in the file to load, and away we go! */ +static unsigned long load_bzimage(int fd) { - unsigned int i, possibilities[256] = { 0 }; + struct boot_params boot; + int r; + /* Modern bzImages get loaded at 1M. */ + void *p = from_guest_phys(0x100000); - for (i = 0; i + 4 < len; i++) { - /* mov 0xXXXXXXXX,%eax */ - if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3) - return (unsigned long)img[i+4] << 24; - } - errx(1, "could not determine page offset"); -} + /* Go back to the start of the file and read the header. It should be + * a Linux boot header (see Documentation/i386/boot.txt) */ + lseek(fd, 0, SEEK_SET); + read(fd, &boot, sizeof(boot)); -/*L:160 Unfortunately the entire ELF image isn't compressed: the segments - * which need loading are extracted and compressed raw. This denies us the - * information we need to make a fully-general loader. */ -static unsigned long unpack_bzimage(int fd, unsigned long *page_offset) -{ - gzFile f; - int ret, len = 0; - /* A bzImage always gets loaded at physical address 1M. This is - * actually configurable as CONFIG_PHYSICAL_START, but as the comment - * there says, "Don't change this unless you know what you are doing". - * Indeed. */ - void *img = (void *)0x100000; - - /* gzdopen takes our file descriptor (carefully placed at the start of - * the GZIP header we found) and returns a gzFile. */ - f = gzdopen(fd, "rb"); - /* We read it into memory in 64k chunks until we hit the end. 
*/ - while ((ret = gzread(f, img + len, 65536)) > 0) - len += ret; - if (ret < 0) - err(1, "reading image from bzImage"); - - verbose("Unpacked size %i addr %p\n", len, img); - - /* Without the ELF header, we can't tell virtual-physical gap. This is - * CONFIG_PAGE_OFFSET, and people do actually change it. Fortunately, - * I have a clever way of figuring it out from the code itself. */ - *page_offset = intuit_page_offset(img, len); - - return entry_point(img, img + len, *page_offset); -} + /* Inside the setup_hdr, we expect the magic "HdrS" */ + if (memcmp(&boot.hdr.header, "HdrS", 4) != 0) + errx(1, "This doesn't look like a bzImage to me"); -/*L:150 A bzImage, unlike an ELF file, is not meant to be loaded. You're - * supposed to jump into it and it will unpack itself. We can't do that - * because the Guest can't run the unpacking code, and adding features to - * lguest kills puppies, so we don't want to. - * - * The bzImage is formed by putting the decompressing code in front of the - * compressed kernel code. So we can simple scan through it looking for the - * first "gzip" header, and start decompressing from there. */ -static unsigned long load_bzimage(int fd, unsigned long *page_offset) -{ - unsigned char c; - int state = 0; - - /* GZIP header is 0x1F 0x8B <method> <flags>... <compressed-by>. */ - while (read(fd, &c, 1) == 1) { - switch (state) { - case 0: - if (c == 0x1F) - state++; - break; - case 1: - if (c == 0x8B) - state++; - else - state = 0; - break; - case 2 ... 8: - state++; - break; - case 9: - /* Seek back to the start of the gzip header. */ - lseek(fd, -10, SEEK_CUR); - /* One final check: "compressed under UNIX". */ - if (c != 0x03) - state = -1; - else - return unpack_bzimage(fd, page_offset); - } - } - errx(1, "Could not find kernel in bzImage"); + /* Skip over the extra sectors of the header. */ + lseek(fd, (boot.hdr.setup_sects+1) * 512, SEEK_SET); + + /* Now read everything into memory. in nice big chunks. 
*/ + while ((r = read(fd, p, 65536)) > 0) + p += r; + + /* Finally, code32_start tells us where to enter the kernel. */ + return boot.hdr.code32_start; } /*L:140 Loading the kernel is easy when it's a "vmlinux", but most kernels - * come wrapped up in the self-decompressing "bzImage" format. With some funky - * coding, we can load those, too. */ -static unsigned long load_kernel(int fd, unsigned long *page_offset) + * come wrapped up in the self-decompressing "bzImage" format. With a little + * work, we can load those, too. */ +static unsigned long load_kernel(int fd) { Elf32_Ehdr hdr; @@ -373,10 +372,10 @@ static unsigned long load_kernel(int fd, unsigned long *page_offset) /* If it's an ELF file, it starts with "\177ELF" */ if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) == 0) - return map_elf(fd, &hdr, page_offset); + return map_elf(fd, &hdr); /* Otherwise we assume it's a bzImage, and try to unpack it */ - return load_bzimage(fd, page_offset); + return load_bzimage(fd); } /* This is a trivial little helper to align pages. Andi Kleen hated it because @@ -402,59 +401,45 @@ static unsigned long load_initrd(const char *name, unsigned long mem) int ifd; struct stat st; unsigned long len; - void *iaddr; ifd = open_or_die(name, O_RDONLY); /* fstat() is needed to get the file size. */ if (fstat(ifd, &st) < 0) err(1, "fstat() on initrd '%s'", name); - /* The length needs to be rounded up to a page size: mmap needs the - * address to be page aligned. */ + /* We map the initrd at the top of memory, but mmap wants it to be + * page-aligned, so we round the size up for that. */ len = page_align(st.st_size); - /* We map the initrd at the top of memory. 
*/ - iaddr = mmap((void *)mem - len, st.st_size, - PROT_READ|PROT_EXEC|PROT_WRITE, - MAP_FIXED|MAP_PRIVATE, ifd, 0); - if (iaddr != (void *)mem - len) - err(1, "Mmaping initrd '%s' returned %p not %p", - name, iaddr, (void *)mem - len); + map_at(ifd, from_guest_phys(mem - len), 0, st.st_size); /* Once a file is mapped, you can close the file descriptor. It's a * little odd, but quite useful. */ close(ifd); - verbose("mapped initrd %s size=%lu @ %p\n", name, st.st_size, iaddr); + verbose("mapped initrd %s size=%lu @ %p\n", name, len, (void*)mem-len); /* We return the initrd size. */ return len; } -/* Once we know how much memory we have, and the address the Guest kernel - * expects, we can construct simple linear page tables which will get the Guest - * far enough into the boot to create its own. +/* Once we know how much memory we have, we can construct simple linear page + * tables which set virtual == physical which will get the Guest far enough + * into the boot to create its own. * * We lay them out of the way, just below the initrd (which is why we need to * know its size). */ static unsigned long setup_pagetables(unsigned long mem, - unsigned long initrd_size, - unsigned long page_offset) + unsigned long initrd_size) { - u32 *pgdir, *linear; + unsigned long *pgdir, *linear; unsigned int mapped_pages, i, linear_pages; - unsigned int ptes_per_page = getpagesize()/sizeof(u32); + unsigned int ptes_per_page = getpagesize()/sizeof(void *); - /* Ideally we map all physical memory starting at page_offset. - * However, if page_offset is 0xC0000000 we can only map 1G of physical - * (0xC0000000 + 1G overflows). */ - if (mem <= -page_offset) - mapped_pages = mem/getpagesize(); - else - mapped_pages = -page_offset/getpagesize(); + mapped_pages = mem/getpagesize(); /* Each PTE page can map ptes_per_page pages: how many do we need? */ linear_pages = (mapped_pages + ptes_per_page-1)/ptes_per_page; /* We put the toplevel page directory page at the top of memory. 
*/ - pgdir = (void *)mem - initrd_size - getpagesize(); + pgdir = from_guest_phys(mem) - initrd_size - getpagesize(); /* Now we use the next linear_pages pages as pte pages */ linear = (void *)pgdir - linear_pages*getpagesize(); @@ -465,21 +450,21 @@ static unsigned long setup_pagetables(unsigned long mem, for (i = 0; i < mapped_pages; i++) linear[i] = ((i * getpagesize()) | PAGE_PRESENT); - /* The top level points to the linear page table pages above. The - * entry representing page_offset points to the first one, and they - * continue from there. */ + /* The top level points to the linear page table pages above. */ for (i = 0; i < mapped_pages; i += ptes_per_page) { - pgdir[(i + page_offset/getpagesize())/ptes_per_page] - = (((u32)linear + i*sizeof(u32)) | PAGE_PRESENT); + pgdir[i/ptes_per_page] + = ((to_guest_phys(linear) + i*sizeof(void *)) + | PAGE_PRESENT); } - verbose("Linear mapping of %u pages in %u pte pages at %p\n", - mapped_pages, linear_pages, linear); + verbose("Linear mapping of %u pages in %u pte pages at %#lx\n", + mapped_pages, linear_pages, to_guest_phys(linear)); /* We return the top level (guest-physical) address: the kernel needs * to know where it is. */ - return (unsigned long)pgdir; + return to_guest_phys(pgdir); } +/*:*/ /* Simple routine to roll all the commandline arguments together with spaces * between them. */ @@ -496,16 +481,19 @@ static void concat(char *dst, char *args[]) dst[len] = '\0'; } -/* This is where we actually tell the kernel to initialize the Guest. We saw - * the arguments it expects when we looked at initialize() in lguest_user.c: - * the top physical page to allow, the top level pagetable, the entry point and - * the page_offset constant for the Guest. */ -static int tell_kernel(u32 pgdir, u32 start, u32 page_offset) +/*L:185 This is where we actually tell the kernel to initialize the Guest. 
We + * saw the arguments it expects when we looked at initialize() in lguest_user.c: + * the base of Guest "physical" memory, the top physical page to allow, the + * top level pagetable and the entry point for the Guest. */ +static int tell_kernel(unsigned long pgdir, unsigned long start) { - u32 args[] = { LHREQ_INITIALIZE, - top/getpagesize(), pgdir, start, page_offset }; + unsigned long args[] = { LHREQ_INITIALIZE, + (unsigned long)guest_base, + guest_limit / getpagesize(), pgdir, start }; int fd; + verbose("Guest: %p - %p (%#lx)\n", + guest_base, guest_base + guest_limit, guest_limit); fd = open_or_die("/dev/lguest", O_RDWR); if (write(fd, args, sizeof(args)) < 0) err(1, "Writing to /dev/lguest"); @@ -515,62 +503,67 @@ static int tell_kernel(u32 pgdir, u32 start, u32 page_offset) } /*:*/ -static void set_fd(int fd, struct device_list *devices) +static void add_device_fd(int fd) { - FD_SET(fd, &devices->infds); - if (fd > devices->max_infd) - devices->max_infd = fd; + FD_SET(fd, &devices.infds); + if (fd > devices.max_infd) + devices.max_infd = fd; } /*L:200 * The Waker. * - * With a console and network devices, we can have lots of input which we need - * to process. We could try to tell the kernel what file descriptors to watch, - * but handing a file descriptor mask through to the kernel is fairly icky. + * With console, block and network devices, we can have lots of input which we + * need to process. We could try to tell the kernel what file descriptors to + * watch, but handing a file descriptor mask through to the kernel is fairly + * icky. * * Instead, we fork off a process which watches the file descriptors and writes - * the LHREQ_BREAK command to the /dev/lguest filedescriptor to tell the Host - * loop to stop running the Guest. This causes it to return from the + * the LHREQ_BREAK command to the /dev/lguest file descriptor to tell the Host + * to stop running the Guest. 
This causes the Launcher to return from the * /dev/lguest read with -EAGAIN, where it will write to /dev/lguest to reset * the LHREQ_BREAK and wake us up again. * * This, of course, is merely a different *kind* of icky. */ -static void wake_parent(int pipefd, int lguest_fd, struct device_list *devices) +static void wake_parent(int pipefd, int lguest_fd) { /* Add the pipe from the Launcher to the fdset in the device_list, so * we watch it, too. */ - set_fd(pipefd, devices); + add_device_fd(pipefd); for (;;) { - fd_set rfds = devices->infds; - u32 args[] = { LHREQ_BREAK, 1 }; + fd_set rfds = devices.infds; + unsigned long args[] = { LHREQ_BREAK, 1 }; /* Wait until input is ready from one of the devices. */ - select(devices->max_infd+1, &rfds, NULL, NULL, NULL); + select(devices.max_infd+1, &rfds, NULL, NULL, NULL); /* Is it a message from the Launcher? */ if (FD_ISSET(pipefd, &rfds)) { - int ignorefd; + int fd; /* If read() returns 0, it means the Launcher has * exited. We silently follow. */ - if (read(pipefd, &ignorefd, sizeof(ignorefd)) == 0) + if (read(pipefd, &fd, sizeof(fd)) == 0) exit(0); - /* Otherwise it's telling us there's a problem with one - * of the devices, and we should ignore that file - * descriptor from now on. */ - FD_CLR(ignorefd, &devices->infds); + /* Otherwise it's telling us to change what file + * descriptors we're to listen to. Positive means + * listen to a new one, negative means stop + * listening. */ + if (fd >= 0) + FD_SET(fd, &devices.infds); + else + FD_CLR(-fd - 1, &devices.infds); } else /* Send LHREQ_BREAK command. */ write(lguest_fd, args, sizeof(args)); } } /* This routine just sets up a pipe to the Waker process. */ -static int setup_waker(int lguest_fd, struct device_list *device_list) +static int setup_waker(int lguest_fd) { int pipefd[2], child; - /* We create a pipe to talk to the waker, and also so it knows when the + /* We create a pipe to talk to the Waker, and also so it knows when the * Launcher dies (and closes pipe). 
*/ pipe(pipefd); child = fork(); @@ -578,9 +571,10 @@ static int setup_waker(int lguest_fd, struct device_list *device_list) err(1, "forking"); if (child == 0) { - /* Close the "writing" end of our copy of the pipe */ + /* We are the Waker: close the "writing" end of our copy of the + * pipe and start waiting for input. */ close(pipefd[1]); - wake_parent(pipefd[0], lguest_fd, device_list); + wake_parent(pipefd[0], lguest_fd); } /* Close the reading end of our copy of the pipe. */ close(pipefd[0]); @@ -589,12 +583,12 @@ static int setup_waker(int lguest_fd, struct device_list *device_list) return pipefd[1]; } -/*L:210 +/* * Device Handling. * - * When the Guest sends DMA to us, it sends us an array of addresses and sizes. + * When the Guest gives us a buffer, it sends an array of addresses and sizes. * We need to make sure it's not trying to reach into the Launcher itself, so - * we have a convenient routine which check it and exits with an error message + * we have a convenient routine which checks it and exits with an error message * if something funny is going on: */ static void *_check_pointer(unsigned long addr, unsigned int size, @@ -602,87 +596,139 @@ static void *_check_pointer(unsigned long addr, unsigned int size, { /* We have to separately check addr and addr+size, because size could * be huge and addr + size might wrap around. */ - if (addr >= top || addr + size >= top) - errx(1, "%s:%i: Invalid address %li", __FILE__, line, addr); + if (addr >= guest_limit || addr + size >= guest_limit) + errx(1, "%s:%i: Invalid address %#lx", __FILE__, line, addr); /* We return a pointer for the caller's convenience, now we know it's * safe to use. */ - return (void *)addr; + return from_guest_phys(addr); } /* A macro which transparently hands the line number to the real function. */ #define check_pointer(addr,size) _check_pointer(addr, size, __LINE__) -/* The Guest has given us the address of a "struct lguest_dma". 
We check it's - * OK and convert it to an iovec (which is a simple array of ptr/size - * pairs). */ -static u32 *dma2iov(unsigned long dma, struct iovec iov[], unsigned *num) +/* Each buffer in the virtqueues is actually a chain of descriptors. This + * function returns the next descriptor in the chain, or vq->vring.num if we're + * at the end. */ +static unsigned next_desc(struct virtqueue *vq, unsigned int i) { - unsigned int i; - struct lguest_dma *udma; - - /* First we make sure that the array memory itself is valid. */ - udma = check_pointer(dma, sizeof(*udma)); - /* Now we check each element */ - for (i = 0; i < LGUEST_MAX_DMA_SECTIONS; i++) { - /* A zero length ends the array. */ - if (!udma->len[i]) - break; + unsigned int next; - iov[i].iov_base = check_pointer(udma->addr[i], udma->len[i]); - iov[i].iov_len = udma->len[i]; - } - *num = i; + /* If this descriptor says it doesn't chain, we're done. */ + if (!(vq->vring.desc[i].flags & VRING_DESC_F_NEXT)) + return vq->vring.num; - /* We return the pointer to where the caller should write the amount of - * the buffer used. */ - return &udma->used_len; + /* Check they're not leading us off the end of the descriptors. */ + next = vq->vring.desc[i].next; + /* Make sure compiler knows to grab that: we don't want it changing! */ + wmb(); + + if (next >= vq->vring.num) + errx(1, "Desc next is %u", next); + + return next; } -/* This routine gets a DMA buffer from the Guest for a given key, and converts - * it to an iovec array. It returns the interrupt the Guest wants when we're - * finished, and a pointer to the "used_len" field to fill in. */ -static u32 *get_dma_buffer(int fd, void *key, - struct iovec iov[], unsigned int *num, u32 *irq) +/* This looks in the virtqueue for the first available buffer, and converts + * it to an iovec for convenient access. 
Since descriptors consist of some + * number of output then some number of input descriptors, it's actually two + * iovecs, but we pack them into one and note how many of each there were. + * + * This function returns the descriptor number found, or vq->vring.num (which + * is never a valid descriptor number) if none was found. */ +static unsigned get_vq_desc(struct virtqueue *vq, + struct iovec iov[], + unsigned int *out_num, unsigned int *in_num) { - u32 buf[] = { LHREQ_GETDMA, (u32)key }; - unsigned long udma; - u32 *res; - - /* Ask the kernel for a DMA buffer corresponding to this key. */ - udma = write(fd, buf, sizeof(buf)); - /* They haven't registered any, or they're all used? */ - if (udma == (unsigned long)-1) - return NULL; - - /* Convert it into our iovec array */ - res = dma2iov(udma, iov, num); - /* The kernel stashes irq in ->used_len to get it out to us. */ - *irq = *res; - /* Return a pointer to ((struct lguest_dma *)udma)->used_len. */ - return res; + unsigned int i, head; + + /* Check it isn't doing very strange things with descriptor numbers. */ + if ((u16)(vq->vring.avail->idx - vq->last_avail_idx) > vq->vring.num) + errx(1, "Guest moved used index from %u to %u", + vq->last_avail_idx, vq->vring.avail->idx); + + /* If there's nothing new since last we looked, return invalid. */ + if (vq->vring.avail->idx == vq->last_avail_idx) + return vq->vring.num; + + /* Grab the next descriptor number they're advertising, and increment + * the index we've seen. */ + head = vq->vring.avail->ring[vq->last_avail_idx++ % vq->vring.num]; + + /* If their number is silly, that's a fatal mistake. */ + if (head >= vq->vring.num) + errx(1, "Guest says index %u is available", head); + + /* When we start there are none of either input nor output. */ + *out_num = *in_num = 0; + + i = head; + do { + /* Grab the first descriptor, and check it's OK. 
*/ + iov[*out_num + *in_num].iov_len = vq->vring.desc[i].len; + iov[*out_num + *in_num].iov_base + = check_pointer(vq->vring.desc[i].addr, + vq->vring.desc[i].len); + /* If this is an input descriptor, increment that count. */ + if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE) + (*in_num)++; + else { + /* If it's an output descriptor, they're all supposed + * to come before any input descriptors. */ + if (*in_num) + errx(1, "Descriptor has out after in"); + (*out_num)++; + } + + /* If we've got too many, that implies a descriptor loop. */ + if (*out_num + *in_num > vq->vring.num) + errx(1, "Looped descriptor"); + } while ((i = next_desc(vq, i)) != vq->vring.num); + + return head; } -/* This is a convenient routine to send the Guest an interrupt. */ -static void trigger_irq(int fd, u32 irq) +/* After we've used one of their buffers, we tell them about it. We'll then + * want to send them an interrupt, using trigger_irq(). */ +static void add_used(struct virtqueue *vq, unsigned int head, int len) { - u32 buf[] = { LHREQ_IRQ, irq }; + struct vring_used_elem *used; + + /* The virtqueue contains a ring of used buffers. Get a pointer to the + * next entry in that used ring. */ + used = &vq->vring.used->ring[vq->vring.used->idx % vq->vring.num]; + used->id = head; + used->len = len; + /* Make sure buffer is written before we update index. */ + wmb(); + vq->vring.used->idx++; +} + +/* This actually sends the interrupt for this virtqueue */ +static void trigger_irq(int fd, struct virtqueue *vq) +{ + unsigned long buf[] = { LHREQ_IRQ, vq->config.irq }; + + /* If they don't want an interrupt, don't send one. */ + if (vq->vring.avail->flags & VRING_AVAIL_F_NO_INTERRUPT) + return; + + /* Send the Guest an interrupt to tell them we used something up. */ + if (write(fd, buf, sizeof(buf)) != 0) + err(1, "Triggering irq %i", vq->config.irq); } -/* This simply sets up an iovec array where we can put data to be discarded. 
- * This happens when the Guest doesn't want or can't handle the input: we have - * to get rid of it somewhere, and if we bury it in the ceiling space it will - * start to smell after a week. */ -static void discard_iovec(struct iovec *iov, unsigned int *num) +/* And here's the combo meal deal. Supersize me! */ +static void add_used_and_trigger(int fd, struct virtqueue *vq, + unsigned int head, int len) { - static char discard_buf[1024]; - *num = 1; - iov->iov_base = discard_buf; - iov->iov_len = sizeof(discard_buf); + add_used(vq, head, len); + trigger_irq(fd, vq); } -/* Here is the input terminal setting we save, and the routine to restore them - * on exit so the user can see what they type next. */ +/* + * The Console + * + * Here are the input terminal settings we save, and the routine to restore them + * on exit so the user gets their terminal back. */ static struct termios orig_term; static void restore_term(void) { @@ -701,38 +747,39 @@ struct console_abort /* This is the routine which handles console input (ie. stdin). */ static bool handle_console_input(int fd, struct device *dev) { - u32 irq = 0, *lenp; int len; - unsigned int num; - struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; + unsigned int head, in_num, out_num; + struct iovec iov[dev->vq->vring.num]; struct console_abort *abort = dev->priv; - /* First we get the console buffer from the Guest. The key is dev->mem - * which was set to 0 in setup_console(). */ - lenp = get_dma_buffer(fd, dev->mem, iov, &num, &irq); - if (!lenp) { - /* If it's not ready for input, warn and set up to discard. */ - warn("console: no dma buffer!"); - discard_iovec(iov, &num); - } + /* First we need a console buffer from the Guest's input virtqueue. */ + head = get_vq_desc(dev->vq, iov, &out_num, &in_num); + + /* If they're not ready for input, stop listening to this file + * descriptor. We'll start again once they add an input buffer. 
*/ + if (head == dev->vq->vring.num) + return false; + + if (out_num) + errx(1, "Output buffers in console in queue?"); /* This is why we convert to iovecs: the readv() call uses them, and so * it reads straight into the Guest's buffer. */ - len = readv(dev->fd, iov, num); + len = readv(dev->fd, iov, in_num); if (len <= 0) { /* This implies that the console is closed, is /dev/null, or - * something went terribly wrong. We still go through the rest - * of the logic, though, especially the exit handling below. */ + * something went terribly wrong. */ warnx("Failed to get console input, ignoring console."); - len = 0; + /* Put the input terminal back. */ + restore_term(); + /* Remove callback from input vq, so it doesn't restart us. */ + dev->vq->handle_output = NULL; + /* Stop listening to this fd: don't call us again. */ + return false; } - /* If we read the data into the Guest, fill in the length and send the - * interrupt. */ - if (lenp) { - *lenp = len; - trigger_irq(fd, irq); - } + /* Tell the Guest about the new input. */ + add_used_and_trigger(fd, dev->vq, head, len); /* Three ^C within one second? Exit. * @@ -746,7 +793,7 @@ static bool handle_console_input(int fd, struct device *dev) struct timeval now; gettimeofday(&now, NULL); if (now.tv_sec <= abort->start.tv_sec+1) { - u32 args[] = { LHREQ_BREAK, 0 }; + unsigned long args[] = { LHREQ_BREAK, 0 }; /* Close the fd so Waker will know it has to * exit. */ close(waker_fd); @@ -761,214 +808,168 @@ static bool handle_console_input(int fd, struct device *dev) /* Any other key resets the abort counter. */ abort->count = 0; - /* Now, if we didn't read anything, put the input terminal back and - * return failure (meaning, don't call us again). */ - if (!len) { - restore_term(); - return false; - } /* Everything went OK! */ return true; } -/* Handling console output is much simpler than input. 
*/ -static u32 handle_console_output(int fd, const struct iovec *iov, - unsigned num, struct device*dev) +/* Handling output for console is simple: we just get all the output buffers + * and write them to stdout. */ +static void handle_console_output(int fd, struct virtqueue *vq) { - /* Whatever the Guest sends, write it to standard output. Return the - * number of bytes written. */ - return writev(STDOUT_FILENO, iov, num); -} - -/* Guest->Host network output is also pretty easy. */ -static u32 handle_tun_output(int fd, const struct iovec *iov, - unsigned num, struct device *dev) -{ - /* We put a flag in the "priv" pointer of the network device, and set - * it as soon as we see output. We'll see why in handle_tun_input() */ - *(bool *)dev->priv = true; - /* Whatever packet the Guest sent us, write it out to the tun - * device. */ - return writev(dev->fd, iov, num); + unsigned int head, out, in; + int len; + struct iovec iov[vq->vring.num]; + + /* Keep getting output buffers from the Guest until we run out. */ + while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) { + if (in) + errx(1, "Input buffers in output queue?"); + len = writev(STDOUT_FILENO, iov, out); + add_used_and_trigger(fd, vq, head, len); + } } -/* This matches the peer_key() in lguest_net.c. The key for any given slot - * is the address of the network device's page plus 4 * the slot number. */ -static unsigned long peer_offset(unsigned int peernum) +/* + * The Network + * + * Handling output for network is also simple: we get all the output buffers + * and write them (ignoring the first element) to this device's file descriptor + * (the tun device). */ +static void handle_net_output(int fd, struct virtqueue *vq) { - return 4 * peernum; + unsigned int head, out, in; + int len; + struct iovec iov[vq->vring.num]; + + /* Keep getting output buffers from the Guest until we run out. 
*/ + while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) { + if (in) + errx(1, "Input buffers in output queue?"); + /* Check header, but otherwise ignore it (we told the Guest we + * supported no features, so it shouldn't have anything + * interesting). */ + (void)convert(&iov[0], struct virtio_net_hdr); + len = writev(vq->dev->fd, iov+1, out-1); + add_used_and_trigger(fd, vq, head, len); + } } -/* This is where we handle a packet coming in from the tun device */ +/* This is where we handle a packet coming in from the tun device to our + * Guest. */ static bool handle_tun_input(int fd, struct device *dev) { - u32 irq = 0, *lenp; + unsigned int head, in_num, out_num; int len; - unsigned num; - struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; + struct iovec iov[dev->vq->vring.num]; + struct virtio_net_hdr *hdr; - /* First we get a buffer the Guest has bound to its key. */ - lenp = get_dma_buffer(fd, dev->mem+peer_offset(NET_PEERNUM), iov, &num, - &irq); - if (!lenp) { + /* First we need a network buffer from the Guest's recv virtqueue. */ + head = get_vq_desc(dev->vq, iov, &out_num, &in_num); + if (head == dev->vq->vring.num) { /* Now, it's expected that if we try to send a packet too - * early, the Guest won't be ready yet. This is why we set a - * flag when the Guest sends its first packet. If it's sent a - * packet we assume it should be ready to receive them. - * - * Actually, this is what the status bits in the descriptor are - * for: we should *use* them. FIXME! */ - if (*(bool *)dev->priv) + * early, the Guest won't be ready yet. Wait until the device + * status says it's ready. */ + /* FIXME: Actually want DRIVER_ACTIVE here. */ + if (dev->desc->status & VIRTIO_CONFIG_S_DRIVER_OK) warn("network: no dma buffer!"); - discard_iovec(iov, &num); - } + /* We'll turn this back on if input buffers are registered. 
*/ + return false; + } else if (out_num) + errx(1, "Output buffers in network recv queue?"); + + /* First element is the header: we set it to 0 (no features). */ + hdr = convert(&iov[0], struct virtio_net_hdr); + hdr->flags = 0; + hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE; /* Read the packet from the device directly into the Guest's buffer. */ - len = readv(dev->fd, iov, num); + len = readv(dev->fd, iov+1, in_num-1); if (len <= 0) err(1, "reading network"); - /* Write the used_len, and trigger the interrupt for the Guest */ - if (lenp) { - *lenp = len; - trigger_irq(fd, irq); - } + /* Tell the Guest about the new packet. */ + add_used_and_trigger(fd, dev->vq, head, sizeof(*hdr) + len); + verbose("tun input packet len %i [%02x %02x] (%s)\n", len, - ((u8 *)iov[0].iov_base)[0], ((u8 *)iov[0].iov_base)[1], - lenp ? "sent" : "discarded"); + ((u8 *)iov[1].iov_base)[0], ((u8 *)iov[1].iov_base)[1], + head != dev->vq->vring.num ? "sent" : "discarded"); + /* All good. */ return true; } -/* The last device handling routine is block output: the Guest has sent a DMA - * to the block device. It will have placed the command it wants in the - * "struct lguest_block_page". */ -static u32 handle_block_output(int fd, const struct iovec *iov, - unsigned num, struct device *dev) +/*L:215 This is the callback attached to the network and console input + * virtqueues: it ensures we try again, in case we stopped console or net + * delivery because Guest didn't have any buffers. */ +static void enable_fd(int fd, struct virtqueue *vq) { - struct lguest_block_page *p = dev->mem; - u32 irq, *lenp; - unsigned int len, reply_num; - struct iovec reply[LGUEST_MAX_DMA_SECTIONS]; - off64_t device_len, off = (off64_t)p->sector * 512; - - /* First we extract the device length from the dev->priv pointer. */ - device_len = *(off64_t *)dev->priv; - - /* We first check that the read or write is within the length of the - * block file. 
*/ - if (off >= device_len) - err(1, "Bad offset %llu vs %llu", off, device_len); - /* Move to the right location in the block file. This shouldn't fail, - * but best to check. */ - if (lseek64(dev->fd, off, SEEK_SET) != off) - err(1, "Bad seek to sector %i", p->sector); - - verbose("Block: %s at offset %llu\n", p->type ? "WRITE" : "READ", off); - - /* They were supposed to bind a reply buffer at key equal to the start - * of the block device memory. We need this to tell them when the - * request is finished. */ - lenp = get_dma_buffer(fd, dev->mem, reply, &reply_num, &irq); - if (!lenp) - err(1, "Block request didn't give us a dma buffer"); - - if (p->type) { - /* A write request. The DMA they sent contained the data, so - * write it out. */ - len = writev(dev->fd, iov, num); - /* Grr... Now we know how long the "struct lguest_dma" they - * sent was, we make sure they didn't try to write over the end - * of the block file (possibly extending it). */ - if (off + len > device_len) { - /* Trim it back to the correct length */ - ftruncate64(dev->fd, device_len); - /* Die, bad Guest, die. */ - errx(1, "Write past end %llu+%u", off, len); - } - /* The reply length is 0: we just send back an empty DMA to - * interrupt them and tell them the write is finished. */ - *lenp = 0; - } else { - /* A read request. They sent an empty DMA to start the - * request, and we put the read contents into the reply - * buffer. */ - len = readv(dev->fd, reply, reply_num); - *lenp = len; - } - - /* The result is 1 (done), 2 if there was an error (short read or - * write). */ - p->result = 1 + (p->bytes != len); - /* Now tell them we've used their reply buffer. */ - trigger_irq(fd, irq); - - /* We're supposed to return the number of bytes of the output buffer we - * used. But the block device uses the "result" field instead, so we - * don't bother. 
*/ - return 0; + add_device_fd(vq->dev->fd); + /* Tell waker to listen to it again */ + write(waker_fd, &vq->dev->fd, sizeof(vq->dev->fd)); } -/* This is the generic routine we call when the Guest sends some DMA out. */ -static void handle_output(int fd, unsigned long dma, unsigned long key, - struct device_list *devices) +/* This is the generic routine we call when the Guest uses LHCALL_NOTIFY. */ +static void handle_output(int fd, unsigned long addr) { struct device *i; - u32 *lenp; - struct iovec iov[LGUEST_MAX_DMA_SECTIONS]; - unsigned num = 0; - - /* Convert the "struct lguest_dma" they're sending to a "struct - * iovec". */ - lenp = dma2iov(dma, iov, &num); - - /* Check each device: if they expect output to this key, tell them to - * handle it. */ - for (i = devices->dev; i; i = i->next) { - if (i->handle_output && key == i->watch_key) { - /* We write the result straight into the used_len field - * for them. */ - *lenp = i->handle_output(fd, iov, num, i); - return; + struct virtqueue *vq; + + /* Check each virtqueue. */ + for (i = devices.dev; i; i = i->next) { + for (vq = i->vq; vq; vq = vq->next) { + if (vq->config.pfn == addr/getpagesize() + && vq->handle_output) { + verbose("Output to %s\n", vq->dev->name); + vq->handle_output(fd, vq); + return; + } } } - /* This can happen: the kernel sends any SEND_DMA which doesn't match - * another Guest to us. It could be that another Guest just left a - * network, for example. But it's unusual. */ - warnx("Pending dma %p, key %p", (void *)dma, (void *)key); + /* Early console write is done using notify on a nul-terminated string + * in Guest memory. */ + if (addr >= guest_limit) + errx(1, "Bad NOTIFY %#lx", addr); + + write(STDOUT_FILENO, from_guest_phys(addr), + strnlen(from_guest_phys(addr), guest_limit - addr)); } -/* This is called when the waker wakes us up: check for incoming file +/* This is called when the Waker wakes us up: check for incoming file * descriptors. 
*/ -static void handle_input(int fd, struct device_list *devices) +static void handle_input(int fd) { /* select() wants a zeroed timeval to mean "don't wait". */ struct timeval poll = { .tv_sec = 0, .tv_usec = 0 }; for (;;) { struct device *i; - fd_set fds = devices->infds; + fd_set fds = devices.infds; /* If nothing is ready, we're done. */ - if (select(devices->max_infd+1, &fds, NULL, NULL, &poll) == 0) + if (select(devices.max_infd+1, &fds, NULL, NULL, &poll) == 0) break; /* Otherwise, call the device(s) which have readable * file descriptors and a method of handling them. */ - for (i = devices->dev; i; i = i->next) { + for (i = devices.dev; i; i = i->next) { if (i->handle_input && FD_ISSET(i->fd, &fds)) { + int dev_fd; + if (i->handle_input(fd, i)) + continue; + /* If handle_input() returns false, it means we - * should no longer service it. - * handle_console_input() does this. */ - if (!i->handle_input(fd, i)) { - /* Clear it from the set of input file - * descriptors kept at the head of the - * device list. */ - FD_CLR(i->fd, &devices->infds); - /* Tell waker to ignore it too... */ - write(waker_fd, &i->fd, sizeof(i->fd)); - } + * should no longer service it. Networking and + * console do this when there's no input + * buffers to deliver into. Console also uses + * it when it discovers that stdin is + * closed. */ + FD_CLR(i->fd, &devices.infds); + /* Tell waker to ignore it too, by sending a + * negative fd number (-1, since 0 is a valid + * FD number). */ + dev_fd = -i->fd - 1; + write(waker_fd, &dev_fd, sizeof(dev_fd)); } } } @@ -982,71 +983,121 @@ static void handle_input(int fd, struct device_list *devices) * routines to allocate them. * * This routine allocates a new "struct lguest_device_desc" from descriptor - * table in the devices array just above the Guest's normal memory. */ -static struct lguest_device_desc * -new_dev_desc(struct lguest_device_desc *descs, - u16 type, u16 features, u16 num_pages) + * table just above the Guest's normal memory. 
It returns a pointer to that + * descriptor. */ +static struct lguest_device_desc *new_dev_desc(u16 type) { - unsigned int i; + struct lguest_device_desc *d; - for (i = 0; i < LGUEST_MAX_DEVICES; i++) { - if (!descs[i].type) { - descs[i].type = type; - descs[i].features = features; - descs[i].num_pages = num_pages; - /* If they said the device needs memory, we allocate - * that now, bumping up the top of Guest memory. */ - if (num_pages) { - map_zeroed_pages(top, num_pages); - descs[i].pfn = top/getpagesize(); - top += num_pages*getpagesize(); - } - return &descs[i]; - } - } - errx(1, "too many devices"); + /* We only have one page for all the descriptors. */ + if (devices.desc_used + sizeof(*d) > getpagesize()) + errx(1, "Too many devices"); + + /* We don't need to set config_len or status: page is 0 already. */ + d = (void *)devices.descpage + devices.desc_used; + d->type = type; + devices.desc_used += sizeof(*d); + + return d; +} + +/* Each device descriptor is followed by some configuration information. + * Each configuration field looks like: u8 type, u8 len, [... len bytes...]. + * + * This routine adds a new field to an existing device's descriptor. It only + * works for the last device, but that's OK because that's how we use it. */ +static void add_desc_field(struct device *dev, u8 type, u8 len, const void *c) +{ + /* This is the last descriptor, right? */ + assert(devices.descpage + devices.desc_used + == (u8 *)(dev->desc + 1) + dev->desc->config_len); + + /* We only have one page of device descriptions. */ + if (devices.desc_used + 2 + len > getpagesize()) + errx(1, "Too many devices"); + + /* Copy in the new config header: type then length. */ + devices.descpage[devices.desc_used++] = type; + devices.descpage[devices.desc_used++] = len; + memcpy(devices.descpage + devices.desc_used, c, len); + devices.desc_used += len; + + /* Update the device descriptor length: two byte head then data. 
*/ + dev->desc->config_len += 2 + len; } -/* This monster routine does all the creation and setup of a new device, - * including caling new_dev_desc() to allocate the descriptor and device - * memory. */ -static struct device *new_device(struct device_list *devices, - u16 type, u16 num_pages, u16 features, - int fd, - bool (*handle_input)(int, struct device *), - unsigned long watch_off, - u32 (*handle_output)(int, - const struct iovec *, - unsigned, - struct device *)) +/* This routine adds a virtqueue to a device. We specify how many descriptors + * the virtqueue is to have. */ +static void add_virtqueue(struct device *dev, unsigned int num_descs, + void (*handle_output)(int fd, struct virtqueue *me)) +{ + unsigned int pages; + struct virtqueue **i, *vq = malloc(sizeof(*vq)); + void *p; + + /* First we need some pages for this virtqueue. */ + pages = (vring_size(num_descs) + getpagesize() - 1) / getpagesize(); + p = get_pages(pages); + + /* Initialize the configuration. */ + vq->config.num = num_descs; + vq->config.irq = devices.next_irq++; + vq->config.pfn = to_guest_phys(p) / getpagesize(); + + /* Initialize the vring. */ + vring_init(&vq->vring, num_descs, p); + + /* Add the configuration information to this device's descriptor. */ + add_desc_field(dev, VIRTIO_CONFIG_F_VIRTQUEUE, + sizeof(vq->config), &vq->config); + + /* Add to tail of list, so dev->vq is first vq, dev->vq->next is + * second. */ + for (i = &dev->vq; *i; i = &(*i)->next); + *i = vq; + + /* Link virtqueue back to device. */ + vq->dev = dev; + + /* Set the routine to call when the Guest does something to this + * virtqueue. */ + vq->handle_output = handle_output; + + /* Set the "Don't Notify Me" flag if we don't have a handler */ + if (!handle_output) + vq->vring.used->flags = VRING_USED_F_NO_NOTIFY; +} + +/* This routine does all the creation and setup of a new device, including + * calling new_dev_desc() to allocate the descriptor and device memory. 
*/ +static struct device *new_device(const char *name, u16 type, int fd, + bool (*handle_input)(int, struct device *)) { struct device *dev = malloc(sizeof(*dev)); /* Append to device list. Prepending to a single-linked list is * easier, but the user expects the devices to be arranged on the bus * in command-line order. The first network device on the command line - * is eth0, the first block device /dev/lgba, etc. */ - *devices->lastdev = dev; + * is eth0, the first block device /dev/vda, etc. */ + *devices.lastdev = dev; dev->next = NULL; - devices->lastdev = &dev->next; + devices.lastdev = &dev->next; /* Now we populate the fields one at a time. */ dev->fd = fd; /* If we have an input handler for this file descriptor, then we add it * to the device_list's fdset and maxfd. */ if (handle_input) - set_fd(dev->fd, devices); - dev->desc = new_dev_desc(devices->descs, type, features, num_pages); - dev->mem = (void *)(dev->desc->pfn * getpagesize()); + add_device_fd(dev->fd); + dev->desc = new_dev_desc(type); dev->handle_input = handle_input; - dev->watch_key = (unsigned long)dev->mem + watch_off; - dev->handle_output = handle_output; + dev->name = name; return dev; } /* Our first setup routine is the console. It's a fairly simple device, but * UNIX tty handling makes it uglier than it could be. */ -static void setup_console(struct device_list *devices) +static void setup_console(void) { struct device *dev; @@ -1062,127 +1113,38 @@ static void setup_console(struct device_list *devices) atexit(restore_term); } - /* We don't currently require any memory for the console, so we ask for - * 0 pages. */ - dev = new_device(devices, LGUEST_DEVICE_T_CONSOLE, 0, 0, - STDIN_FILENO, handle_console_input, - LGUEST_CONSOLE_DMA_KEY, handle_console_output); + dev = new_device("console", VIRTIO_ID_CONSOLE, + STDIN_FILENO, handle_console_input); /* We store the console state in dev->priv, and initialize it. 
*/ dev->priv = malloc(sizeof(struct console_abort)); ((struct console_abort *)dev->priv)->count = 0; - verbose("device %p: console\n", - (void *)(dev->desc->pfn * getpagesize())); -} -/* Setting up a block file is also fairly straightforward. */ -static void setup_block_file(const char *filename, struct device_list *devices) -{ - int fd; - struct device *dev; - off64_t *device_len; - struct lguest_block_page *p; - - /* We open with O_LARGEFILE because otherwise we get stuck at 2G. We - * open with O_DIRECT because otherwise our benchmarks go much too - * fast. */ - fd = open_or_die(filename, O_RDWR|O_LARGEFILE|O_DIRECT); - - /* We want one page, and have no input handler (the block file never - * has anything interesting to say to us). Our timing will be quite - * random, so it should be a reasonable randomness source. */ - dev = new_device(devices, LGUEST_DEVICE_T_BLOCK, 1, - LGUEST_DEVICE_F_RANDOMNESS, - fd, NULL, 0, handle_block_output); - - /* We store the device size in the private area */ - device_len = dev->priv = malloc(sizeof(*device_len)); - /* This is the safe way of establishing the size of our device: it - * might be a normal file or an actual block device like /dev/hdb. */ - *device_len = lseek64(fd, 0, SEEK_END); - - /* The device memory is a "struct lguest_block_page". It's zeroed - * already, we just need to put in the device size. Block devices - * think in sectors (ie. 512 byte chunks), so we translate here. */ - p = dev->mem; - p->num_sectors = *device_len/512; - verbose("device %p: block %i sectors\n", - (void *)(dev->desc->pfn * getpagesize()), p->num_sectors); + /* The console needs two virtqueues: the input then the output. When + * they put something in the input queue, we make sure we're listening to + * stdin. When they put something in the output queue, we write it to + * stdout.
*/ + add_virtqueue(dev, VIRTQUEUE_NUM, enable_fd); + add_virtqueue(dev, VIRTQUEUE_NUM, handle_console_output); + + verbose("device %u: console\n", devices.device_num++); } +/*:*/ -/* - * Network Devices. +/*M:010 Inter-guest networking is an interesting area. Simplest is to have a + * --sharenet=<name> option which opens or creates a named pipe. This can be + * used to send packets to another guest in a 1:1 manner. * - * Setting up network devices is quite a pain, because we have three types. - * First, we have the inter-Guest network. This is a file which is mapped into - * the address space of the Guests who are on the network. Because it is a - * shared mapping, the same page underlies all the devices, and they can send - * DMA to each other. + * More sophisticated is to use one of the tools developed for projects like UML + * to do networking. * - * Remember from our network driver, the Guest is told what slot in the page it - * is to use. We use exclusive fnctl locks to reserve a slot. If another - * Guest is using a slot, the lock will fail and we try another. Because fnctl - * locks are cleaned up automatically when we die, this cleverly means that our - * reservation on the slot will vanish if we crash. */ -static unsigned int find_slot(int netfd, const char *filename) -{ - struct flock fl; - - fl.l_type = F_WRLCK; - fl.l_whence = SEEK_SET; - fl.l_len = 1; - /* Try a 1 byte lock in each possible position number */ - for (fl.l_start = 0; - fl.l_start < getpagesize()/sizeof(struct lguest_net); - fl.l_start++) { - /* If we succeed, return the slot number. */ - if (fcntl(netfd, F_SETLK, &fl) == 0) - return fl.l_start; - } - errx(1, "No free slots in network file %s", filename); -} - -/* This function sets up the network file */ -static void setup_net_file(const char *filename, - struct device_list *devices) -{ - int netfd; - struct device *dev; - - /* We don't use open_or_die() here: for friendliness we create the file - * if it doesn't already exist.
*/ - netfd = open(filename, O_RDWR, 0); - if (netfd < 0) { - if (errno == ENOENT) { - netfd = open(filename, O_RDWR|O_CREAT, 0600); - if (netfd >= 0) { - /* If we succeeded, initialize the file with a - * blank page. */ - char page[getpagesize()]; - memset(page, 0, sizeof(page)); - write(netfd, page, sizeof(page)); - } - } - if (netfd < 0) - err(1, "cannot open net file '%s'", filename); - } - - /* We need 1 page, and the features indicate the slot to use and that - * no checksum is needed. We never touch this device again; it's - * between the Guests on the network, so we don't register input or - * output handlers. */ - dev = new_device(devices, LGUEST_DEVICE_T_NET, 1, - find_slot(netfd, filename)|LGUEST_NET_F_NOCSUM, - -1, NULL, 0, NULL); - - /* Map the shared file. */ - if (mmap(dev->mem, getpagesize(), PROT_READ|PROT_WRITE, - MAP_FIXED|MAP_SHARED, netfd, 0) != dev->mem) - err(1, "could not mmap '%s'", filename); - verbose("device %p: shared net %s, peer %i\n", - (void *)(dev->desc->pfn * getpagesize()), filename, - dev->desc->features & ~LGUEST_NET_F_NOCSUM); -} -/*:*/ + * Faster is to do virtio bonding in kernel. Doing this 1:1 would be + * completely generic ("here's my vring, attach to your vring") and would work + * for any traffic. Of course, namespace and permissions issues need to be + * dealt with. A more sophisticated "multi-channel" virtio_net.c could hide + * multiple inter-guest channels behind one interface, although it would + * require some manner of hotplugging new virtio channels. + * + * Finally, we could implement a virtio network switch in the kernel. :*/ static u32 str2ip(const char *ipaddr) { @@ -1217,7 +1179,7 @@ static void add_to_bridge(int fd, const char *if_name, const char *br_name) /* This sets up the Host end of the network device with an IP address, brings * it up so packets will flow, the copies the MAC address into the hwaddr - * pointer (in practice, the Host's slot in the network device's memory). */ + * pointer. 
*/ static void configure_device(int fd, const char *devname, u32 ipaddr, unsigned char hwaddr[6]) { @@ -1243,18 +1205,18 @@ static void configure_device(int fd, const char *devname, u32 ipaddr, memcpy(hwaddr, ifr.ifr_hwaddr.sa_data, 6); } -/*L:195 The other kind of network is a Host<->Guest network. This can either - * use briding or routing, but the principle is the same: it uses the "tun" - * device to inject packets into the Host as if they came in from a normal - * network card. We just shunt packets between the Guest and the tun - * device. */ -static void setup_tun_net(const char *arg, struct device_list *devices) +/*L:195 Our network is a Host<->Guest network. This can either use bridging or + * routing, but the principle is the same: it uses the "tun" device to inject + * packets into the Host as if they came in from a normal network card. We + * just shunt packets between the Guest and the tun device. */ +static void setup_tun_net(const char *arg) { struct device *dev; struct ifreq ifr; int netfd, ipfd; u32 ip; const char *br_name = NULL; + u8 hwaddr[6]; /* We open the /dev/net/tun device and tell it we want a tap device. A * tap device is like a tun device, only somehow different. To tell @@ -1270,21 +1232,13 @@ static void setup_tun_net(const char *arg, struct device_list *devices) * device: trust us! */ ioctl(netfd, TUNSETNOCSUM, 1); - /* We create the net device with 1 page, using the features field of - * the descriptor to tell the Guest it is in slot 1 (NET_PEERNUM), and - * that the device has fairly random timing. We do *not* specify - * LGUEST_NET_F_NOCSUM: these packets can reach the real world. - * - * We will put our MAC address is slot 0 for the Guest to see, so - * it will send packets to us using the key "peer_offset(0)": */ - dev = new_device(devices, LGUEST_DEVICE_T_NET, 1, - NET_PEERNUM|LGUEST_DEVICE_F_RANDOMNESS, netfd, - handle_tun_input, peer_offset(0), handle_tun_output); + /* First we create a new network device. 
*/ + dev = new_device("net", VIRTIO_ID_NET, netfd, handle_tun_input); - /* We keep a flag which says whether we've seen packets come out from - * this network device. */ - dev->priv = malloc(sizeof(bool)); - *(bool *)dev->priv = false; + /* Network devices need a receive and a send queue, just like + * console. */ + add_virtqueue(dev, VIRTQUEUE_NUM, enable_fd); + add_virtqueue(dev, VIRTQUEUE_NUM, handle_net_output); /* We need a socket to perform the magic network ioctls to bring up the * tap interface, connect to the bridge etc. Any socket will do! */ @@ -1300,72 +1254,293 @@ static void setup_tun_net(const char *arg, struct device_list *devices) } else /* It is an IP address to set up the device with */ ip = str2ip(arg); - /* We are peer 0, ie. first slot, so we hand dev->mem to this routine - * to write the MAC address at the start of the device memory. */ - configure_device(ipfd, ifr.ifr_name, ip, dev->mem); + /* Set up the tun device, and get the mac address for the interface. */ + configure_device(ipfd, ifr.ifr_name, ip, hwaddr); - /* Set "promisc" bit: we want every single packet if we're going to - * bridge to other machines (and otherwise it doesn't matter). */ - *((u8 *)dev->mem) |= 0x1; + /* Tell Guest what MAC address to use. */ + add_desc_field(dev, VIRTIO_CONFIG_NET_MAC_F, sizeof(hwaddr), hwaddr); + /* We don't need the socket any more; setup is done. */ close(ipfd); - verbose("device %p: tun net %u.%u.%u.%u\n", - (void *)(dev->desc->pfn * getpagesize()), - (u8)(ip>>24), (u8)(ip>>16), (u8)(ip>>8), (u8)ip); + verbose("device %u: tun net %u.%u.%u.%u\n", + devices.device_num++, + (u8)(ip>>24),(u8)(ip>>16),(u8)(ip>>8),(u8)ip); if (br_name) verbose("attached to bridge: %s\n", br_name); } + +/* Our block (disk) device should be really simple: the Guest asks for a block + * number and we read or write that position in the file.
Unfortunately, that + * was amazingly slow: the Guest waits until the read is finished before + * running anything else, even if it could have been doing useful work. + * + * We could use async I/O, except it's reputed to suck so hard that characters + * actually go missing from your code when you try to use it. + * + * So we farm the I/O out to thread, and communicate with it via a pipe. */ + +/* This hangs off device->priv. */ +struct vblk_info +{ + /* The size of the file. */ + off64_t len; + + /* The file descriptor for the file. */ + int fd; + + /* IO thread listens on this file descriptor [0]. */ + int workpipe[2]; + + /* IO thread writes to this file descriptor to mark it done, then + * Launcher triggers interrupt to Guest. */ + int done_fd; +}; +/*:*/ + +/*L:210 + * The Disk + * + * Remember that the block device is handled by a separate I/O thread. We head + * straight into the core of that thread here: + */ +static bool service_io(struct device *dev) +{ + struct vblk_info *vblk = dev->priv; + unsigned int head, out_num, in_num, wlen; + int ret; + struct virtio_blk_inhdr *in; + struct virtio_blk_outhdr *out; + struct iovec iov[dev->vq->vring.num]; + off64_t off; + + /* See if there's a request waiting. If not, nothing to do. */ + head = get_vq_desc(dev->vq, iov, &out_num, &in_num); + if (head == dev->vq->vring.num) + return false; + + /* Every block request should contain at least one output buffer + * (detailing the location on disk and the type of request) and one + * input buffer (to hold the result). */ + if (out_num == 0 || in_num == 0) + errx(1, "Bad virtblk cmd %u out=%u in=%u", + head, out_num, in_num); + + out = convert(&iov[0], struct virtio_blk_outhdr); + in = convert(&iov[out_num+in_num-1], struct virtio_blk_inhdr); + off = out->sector * 512; + + /* The block device implements "barriers", where the Guest indicates + * that it wants all previous writes to occur before this write. 
We + * don't have a way of asking our kernel to do a barrier, so we just + * synchronize all the data in the file. Pretty poor, no? */ + if (out->type & VIRTIO_BLK_T_BARRIER) + fdatasync(vblk->fd); + + /* In general the virtio block driver is allowed to try SCSI commands. + * It'd be nice if we supported eject, for example, but we don't. */ + if (out->type & VIRTIO_BLK_T_SCSI_CMD) { + fprintf(stderr, "Scsi commands unsupported\n"); + in->status = VIRTIO_BLK_S_UNSUPP; + wlen = sizeof(in); + } else if (out->type & VIRTIO_BLK_T_OUT) { + /* Write */ + + /* Move to the right location in the block file. This can fail + * if they try to write past end. */ + if (lseek64(vblk->fd, off, SEEK_SET) != off) + err(1, "Bad seek to sector %llu", out->sector); + + ret = writev(vblk->fd, iov+1, out_num-1); + verbose("WRITE to sector %llu: %i\n", out->sector, ret); + + /* Grr... Now we know how long the descriptor they sent was, we + * make sure they didn't try to write over the end of the block + * file (possibly extending it). */ + if (ret > 0 && off + ret > vblk->len) { + /* Trim it back to the correct length */ + ftruncate64(vblk->fd, vblk->len); + /* Die, bad Guest, die. */ + errx(1, "Write past end %llu+%u", off, ret); + } + wlen = sizeof(in); + in->status = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR); + } else { + /* Read */ + + /* Move to the right location in the block file. This can fail + * if they try to read past end. */ + if (lseek64(vblk->fd, off, SEEK_SET) != off) + err(1, "Bad seek to sector %llu", out->sector); + + ret = readv(vblk->fd, iov+1, in_num-1); + verbose("READ from sector %llu: %i\n", out->sector, ret); + if (ret >= 0) { + wlen = sizeof(in) + ret; + in->status = VIRTIO_BLK_S_OK; + } else { + wlen = sizeof(in); + in->status = VIRTIO_BLK_S_IOERR; + } + } + + /* We can't trigger an IRQ, because we're not the Launcher. It does + * that when we tell it we're done. 
*/ + add_used(dev->vq, head, wlen); + return true; +} + +/* This is the thread which actually services the I/O. */ +static int io_thread(void *_dev) +{ + struct device *dev = _dev; + struct vblk_info *vblk = dev->priv; + char c; + + /* Close other side of workpipe so we get 0 read when main dies. */ + close(vblk->workpipe[1]); + /* Close the other side of the done_fd pipe. */ + close(dev->fd); + + /* When this read fails, it means Launcher died, so we follow. */ + while (read(vblk->workpipe[0], &c, 1) == 1) { + /* We acknowledge each request immediately to reduce latency, + * rather than waiting until we've done them all. I haven't + * measured to see if it makes any difference. */ + while (service_io(dev)) + write(vblk->done_fd, &c, 1); + } + return 0; +} + +/* Now we've seen the I/O thread, we return to the Launcher to see what happens + * when the thread tells us it's completed some I/O. */ +static bool handle_io_finish(int fd, struct device *dev) +{ + char c; + + /* If the I/O thread died, presumably it printed the error, so we + * simply exit. */ + if (read(dev->fd, &c, 1) != 1) + exit(1); + + /* It did some work, so trigger the irq. */ + trigger_irq(fd, dev->vq); + return true; +} + +/* When the Guest submits some I/O, we just need to wake the I/O thread. */ +static void handle_virtblk_output(int fd, struct virtqueue *vq) +{ + struct vblk_info *vblk = vq->dev->priv; + char c = 0; + + /* Wake up I/O thread and tell it to go to work! */ + if (write(vblk->workpipe[1], &c, 1) != 1) + /* Presumably it indicated why it died. */ + exit(1); +} + +/*L:198 This actually sets up a virtual block device. */ +static void setup_block_file(const char *filename) +{ + int p[2]; + struct device *dev; + struct vblk_info *vblk; + void *stack; + u64 cap; + unsigned int val; + + /* This is the pipe the I/O thread will use to tell us I/O is done. */ + pipe(p); + + /* The device responds to return from I/O thread. 
*/ + dev = new_device("block", VIRTIO_ID_BLOCK, p[0], handle_io_finish); + + /* The device has one virtqueue, where the Guest places requests. */ + add_virtqueue(dev, VIRTQUEUE_NUM, handle_virtblk_output); + + /* Allocate the room for our own bookkeeping */ + vblk = dev->priv = malloc(sizeof(*vblk)); + + /* First we open the file and store the length. */ + vblk->fd = open_or_die(filename, O_RDWR|O_LARGEFILE); + vblk->len = lseek64(vblk->fd, 0, SEEK_END); + + /* Tell Guest how many sectors this device has. */ + cap = cpu_to_le64(vblk->len / 512); + add_desc_field(dev, VIRTIO_CONFIG_BLK_F_CAPACITY, sizeof(cap), &cap); + + /* Tell Guest not to put in too many descriptors at once: two are used + * for the in and out elements. */ + val = cpu_to_le32(VIRTQUEUE_NUM - 2); + add_desc_field(dev, VIRTIO_CONFIG_BLK_F_SEG_MAX, sizeof(val), &val); + + /* The I/O thread writes to this end of the pipe when done. */ + vblk->done_fd = p[1]; + + /* This is the second pipe, which is how we tell the I/O thread about + * more work. */ + pipe(vblk->workpipe); + + /* Create stack for thread and run it */ + stack = malloc(32768); + if (clone(io_thread, stack + 32768, CLONE_VM, dev) == -1) + err(1, "Creating clone"); + + /* We don't need to keep the I/O thread's end of the pipes open. */ + close(vblk->done_fd); + close(vblk->workpipe[0]); + + verbose("device %u: virtblock %llu sectors\n", + devices.device_num, cap); +} /* That's the end of device setup. */ /*L:220 Finally we reach the core of the Launcher, which runs the Guest, serves * its input and output, and finally, lays it to rest. */ -static void __attribute__((noreturn)) -run_guest(int lguest_fd, struct device_list *device_list) +static void __attribute__((noreturn)) run_guest(int lguest_fd) { for (;;) { - u32 args[] = { LHREQ_BREAK, 0 }; - unsigned long arr[2]; + unsigned long args[] = { LHREQ_BREAK, 0 }; + unsigned long notify_addr; int readval; /* We read from the /dev/lguest device to run the Guest. 
*/ - readval = read(lguest_fd, arr, sizeof(arr)); + readval = read(lguest_fd, &notify_addr, sizeof(notify_addr)); - /* The read can only really return sizeof(arr) (the Guest did a - * SEND_DMA to us), or an error. */ - - /* For a successful read, arr[0] is the address of the "struct - * lguest_dma", and arr[1] is the key the Guest sent to. */ - if (readval == sizeof(arr)) { - handle_output(lguest_fd, arr[0], arr[1], device_list); + /* One unsigned long means the Guest did HCALL_NOTIFY */ + if (readval == sizeof(notify_addr)) { + verbose("Notify on address %#lx\n", notify_addr); + handle_output(lguest_fd, notify_addr); continue; /* ENOENT means the Guest died. Reading tells us why. */ } else if (errno == ENOENT) { char reason[1024] = { 0 }; read(lguest_fd, reason, sizeof(reason)-1); errx(1, "%s", reason); - /* EAGAIN means the waker wanted us to look at some input. + /* EAGAIN means the Waker wanted us to look at some input. * Anything else means a bug or incompatible change. */ } else if (errno != EAGAIN) err(1, "Running guest failed"); - /* Service input, then unset the BREAK which releases - * the Waker. */ - handle_input(lguest_fd, device_list); + /* Service input, then unset the BREAK to release the Waker. */ + handle_input(lguest_fd); if (write(lguest_fd, args, sizeof(args)) < 0) err(1, "Resetting break"); } } /* - * This is the end of the Launcher. + * This is the end of the Launcher. The good news: we are over halfway + * through! The bad news: the most fiendish part of the code still lies ahead + * of us. * - * But wait! We've seen I/O from the Launcher, and we've seen I/O from the - * Drivers. If we were to see the Host kernel I/O code, our understanding - * would be complete... :*/ + * Are you ready? Take a deep breath and join me in the core of the Host, in + * "make Host".
+ :*/ static struct option opts[] = { { "verbose", 0, NULL, 'v' }, - { "sharenet", 1, NULL, 's' }, { "tunnet", 1, NULL, 't' }, { "block", 1, NULL, 'b' }, { "initrd", 1, NULL, 'i' }, @@ -1374,37 +1549,21 @@ static struct option opts[] = { static void usage(void) { errx(1, "Usage: lguest [--verbose] " - "[--sharenet=<filename>|--tunnet=(<ipaddr>|bridge:<bridgename>)\n" + "[--tunnet=(<ipaddr>|bridge:<bridgename>)\n" "|--block=<filename>|--initrd=<filename>]...\n" "<mem-in-mb> vmlinux [args...]"); } -/*L:100 The Launcher code itself takes us out into userspace, that scary place - * where pointers run wild and free! Unfortunately, like most userspace - * programs, it's quite boring (which is why everyone like to hack on the - * kernel!). Perhaps if you make up an Lguest Drinking Game at this point, it - * will get you through this section. Or, maybe not. - * - * The Launcher binary sits up high, usually starting at address 0xB8000000. - * Everything below this is the "physical" memory for the Guest. For example, - * if the Guest were to write a "1" at physical address 0, we would see a "1" - * in the Launcher at "(int *)0". Guest physical == Launcher virtual. - * - * This can be tough to get your head around, but usually it just means that we - * don't need to do any conversion when the Guest gives us it's "physical" - * addresses. - */ +/*L:105 The main routine is where the real work begins: */ int main(int argc, char *argv[]) { - /* Memory, top-level pagetable, code startpoint, PAGE_OFFSET and size - * of the (optional) initrd. */ - unsigned long mem = 0, pgdir, start, page_offset, initrd_size = 0; - /* A temporary and the /dev/lguest file descriptor. */ + /* Memory, top-level pagetable, code startpoint and size of the + * (optional) initrd. */ + unsigned long mem = 0, pgdir, start, initrd_size = 0; + /* Two temporaries and the /dev/lguest file descriptor. */ int i, c, lguest_fd; - /* The list of Guest devices, based on command line arguments. 
*/ - struct device_list device_list; - /* The boot information for the Guest: at guest-physical address 0. */ - void *boot = (void *)0; + /* The boot information for the Guest. */ + struct boot_params *boot; /* If they specify an initrd file to load. */ const char *initrd_name = NULL; @@ -1412,11 +1571,12 @@ int main(int argc, char *argv[]) * device receive input from a file descriptor, we keep an fdset * (infds) and the maximum fd number (max_infd) with the head of the * list. We also keep a pointer to the last device, for easy appending - * to the list. */ - device_list.max_infd = -1; - device_list.dev = NULL; - device_list.lastdev = &device_list.dev; - FD_ZERO(&device_list.infds); + * to the list. Finally, we keep the next interrupt number to hand out + * (1: remember that 0 is used by the timer). */ + FD_ZERO(&devices.infds); + devices.max_infd = -1; + devices.lastdev = &devices.dev; + devices.next_irq = 1; /* We need to know how much memory so we can set up the device * descriptor and memory pages for the devices as we parse the command @@ -1424,9 +1584,16 @@ int main(int argc, char *argv[]) * of memory now. */ for (i = 1; i < argc; i++) { if (argv[i][0] != '-') { - mem = top = atoi(argv[i]) * 1024 * 1024; - device_list.descs = map_zeroed_pages(top, 1); - top += getpagesize(); + mem = atoi(argv[i]) * 1024 * 1024; + /* We start by mapping anonymous pages over all of + * guest-physical memory range. This fills it with 0, + * and ensures that the Guest won't be killed when it + * tries to access it. 
*/ + guest_base = map_zeroed_pages(mem / getpagesize() + + DEVICE_PAGES); + guest_limit = mem; + guest_max = mem + DEVICE_PAGES*getpagesize(); + devices.descpage = get_pages(1); break; } } @@ -1437,14 +1604,11 @@ int main(int argc, char *argv[]) case 'v': verbose = true; break; - case 's': - setup_net_file(optarg, &device_list); - break; case 't': - setup_tun_net(optarg, &device_list); + setup_tun_net(optarg); break; case 'b': - setup_block_file(optarg, &device_list); + setup_block_file(optarg); break; case 'i': initrd_name = optarg; @@ -1459,56 +1623,61 @@ int main(int argc, char *argv[]) if (optind + 2 > argc) usage(); - /* We always have a console device */ - setup_console(&device_list); + verbose("Guest base is at %p\n", guest_base); - /* We start by mapping anonymous pages over all of guest-physical - * memory range. This fills it with 0, and ensures that the Guest - * won't be killed when it tries to access it. */ - map_zeroed_pages(0, mem / getpagesize()); + /* We always have a console device */ + setup_console(); /* Now we load the kernel */ - start = load_kernel(open_or_die(argv[optind+1], O_RDONLY), - &page_offset); + start = load_kernel(open_or_die(argv[optind+1], O_RDONLY)); + + /* Boot information is stashed at physical address 0 */ + boot = from_guest_phys(0); /* Map the initrd image if requested (at top of physical memory) */ if (initrd_name) { initrd_size = load_initrd(initrd_name, mem); /* These are the location in the Linux boot header where the * start and size of the initrd are expected to be found. */ - *(unsigned long *)(boot+0x218) = mem - initrd_size; - *(unsigned long *)(boot+0x21c) = initrd_size; + boot->hdr.ramdisk_image = mem - initrd_size; + boot->hdr.ramdisk_size = initrd_size; /* The bootloader type 0xFF means "unknown"; that's OK. */ - *(unsigned char *)(boot+0x210) = 0xFF; + boot->hdr.type_of_loader = 0xFF; } /* Set up the initial linear pagetables, starting below the initrd. 
*/ - pgdir = setup_pagetables(mem, initrd_size, page_offset); + pgdir = setup_pagetables(mem, initrd_size); /* The Linux boot header contains an "E820" memory map: ours is a * simple, single region. */ - *(char*)(boot+E820NR) = 1; - *((struct e820entry *)(boot+E820MAP)) - = ((struct e820entry) { 0, mem, E820_RAM }); + boot->e820_entries = 1; + boot->e820_map[0] = ((struct e820entry) { 0, mem, E820_RAM }); /* The boot header contains a command line pointer: we put the command - * line after the boot header (at address 4096) */ - *(void **)(boot + 0x228) = boot + 4096; - concat(boot + 4096, argv+optind+2); + * line after the boot header. */ + boot->hdr.cmd_line_ptr = to_guest_phys(boot + 1); + /* We use a simple helper to copy the arguments separated by spaces. */ + concat((char *)(boot + 1), argv+optind+2); + + /* Boot protocol version: 2.07 supports the fields for lguest. */ + boot->hdr.version = 0x207; + + /* The hardware_subarch value of "1" tells the Guest it's an lguest. */ + boot->hdr.hardware_subarch = 1; - /* The guest type value of "1" tells the Guest it's under lguest. */ - *(int *)(boot + 0x23c) = 1; + /* Tell the entry path not to try to reload segment registers. */ + boot->hdr.loadflags |= KEEP_SEGMENTS; /* We tell the kernel to initialize the Guest: this returns the open * /dev/lguest file descriptor. */ - lguest_fd = tell_kernel(pgdir, start, page_offset); + lguest_fd = tell_kernel(pgdir, start); /* We fork off a child process, which wakes the Launcher whenever one * of the input file descriptors needs attention. Otherwise we would * run the Guest until it tries to output something. */ - waker_fd = setup_waker(lguest_fd, &device_list); + waker_fd = setup_waker(lguest_fd); /* Finally, run the Guest. This doesn't return. 
*/ - run_guest(lguest_fd, &device_list); + run_guest(lguest_fd); } /*:*/ diff --git a/Documentation/lguest/lguest.txt b/Documentation/lguest/lguest.txt index 821617b..7885ab2 100644 --- a/Documentation/lguest/lguest.txt +++ b/Documentation/lguest/lguest.txt @@ -6,7 +6,7 @@ Lguest is designed to be a minimal hypervisor for the Linux kernel, for Linux developers and users to experiment with virtualization with the minimum of complexity. Nonetheless, it should have sufficient features to make it useful for specific tasks, and, of course, you are -encouraged to fork and enhance it. +encouraged to fork and enhance it (see drivers/lguest/README). Features: @@ -23,19 +23,30 @@ Developer features: Running Lguest: -- Lguest runs the same kernel as guest and host. You can configure - them differently, but usually it's easiest not to. +- The easiest way to run lguest is to use same kernel as guest and host. + You can configure them differently, but usually it's easiest not to. You will need to configure your kernel with the following options: - CONFIG_HIGHMEM64G=n ("High Memory Support" "64GB")[1] - CONFIG_TUN=y/m ("Universal TUN/TAP device driver support") - CONFIG_EXPERIMENTAL=y ("Prompt for development and/or incomplete code/drivers") - CONFIG_PARAVIRT=y ("Paravirtualization support (EXPERIMENTAL)") - CONFIG_LGUEST=y/m ("Linux hypervisor example code") - - and I recommend: - CONFIG_HZ=100 ("Timer frequency")[2] + "General setup": + "Prompt for development and/or incomplete code/drivers" = Y + (CONFIG_EXPERIMENTAL=y) + + "Processor type and features": + "Paravirtualized guest support" = Y + "Lguest guest support" = Y + "High Memory Support" = off/4GB + "Alignment value to which kernel should be aligned" = 0x100000 + (CONFIG_PARAVIRT=y, CONFIG_LGUEST_GUEST=y, CONFIG_HIGHMEM64G=n and + CONFIG_PHYSICAL_ALIGN=0x100000) + + "Device Drivers": + "Network device support" + "Universal TUN/TAP device driver support" = M/Y + (CONFIG_TUN=m) + "Virtualization" + "Linux hypervisor 
example code" = M/Y + (CONFIG_LGUEST=m) - A tool called "lguest" is available in this directory: type "make" to build it. If you didn't build your kernel in-tree, use "make @@ -51,14 +62,17 @@ Running Lguest: dd if=/dev/zero of=rootfile bs=1M count=2048 qemu -cdrom image.iso -hda rootfile -net user -net nic -boot d + Make sure that you install a getty on /dev/hvc0 if you want to log in on the + console! + - "modprobe lg" if you built it as a module. - Run an lguest as root: - Documentation/lguest/lguest 64m vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/lgba + Documentation/lguest/lguest 64 vmlinux --tunnet=192.168.19.1 --block=rootfile root=/dev/vda Explanation: - 64m: the amount of memory to use. + 64: the amount of memory to use, in MB. vmlinux: the kernel image found in the top of your build directory. You can also use a standard bzImage. @@ -66,10 +80,10 @@ Running Lguest: --tunnet=192.168.19.1: configures a "tap" device for networking with this IP address. - --block=rootfile: a file or block device which becomes /dev/lgba + --block=rootfile: a file or block device which becomes /dev/vda inside the guest. - root=/dev/lgba: this (and anything else on the command line) are + root=/dev/vda: this (and anything else on the command line) are kernel boot parameters. - Configuring networking. I usually have the host masquerade, using @@ -99,31 +113,7 @@ Running Lguest: "--sharenet=<filename>": any two guests using the same file are on the same network. This file is created if it does not exist. -Lguest I/O model: - -Lguest uses a simplified DMA model plus shared memory for I/O. Guests -can communicate with each other if they share underlying memory -(usually by the lguest program mmaping the same file), but they can -use any non-shared memory to communicate with the lguest process. - -Guests can register DMA buffers at any key (must be a valid physical -address) using the LHCALL_BIND_DMA(key, dmabufs, num<<8|irq) -hypercall. 
"dmabufs" is the physical address of an array of "num" -"struct lguest_dma": each contains a used_len, and an array of -physical addresses and lengths. When a transfer occurs, the -"used_len" field of one of the buffers which has used_len 0 will be -set to the length transferred and the irq will fire. +There is a helpful mailing list at http://ozlabs.org/mailman/listinfo/lguest -Using an irq value of 0 unbinds the dma buffers. - -To send DMA, the LHCALL_SEND_DMA(key, dma_physaddr) hypercall is used, -and the bytes used is written to the used_len field. This can be 0 if -noone else has bound a DMA buffer to that key or some other error. -DMA buffers bound by the same guest are ignored. - -Cheers! +Good luck! Rusty Russell rusty@rustcorp.com.au. - -[1] These are on various places on the TODO list, waiting for you to - get annoyed enough at the limitation to fix it. -[2] Lguest is not yet tickless when idle. See [1]. diff --git a/Documentation/local_ops.txt b/Documentation/local_ops.txt index 4269a11..1a45f11 100644 --- a/Documentation/local_ops.txt +++ b/Documentation/local_ops.txt @@ -68,6 +68,29 @@ typedef struct { atomic_long_t a; } local_t; variable can be read when reading some _other_ cpu's variables. +* Rules to follow when using local atomic operations + +- Variables touched by local ops must be per cpu variables. +- _Only_ the CPU owner of these variables must write to them. +- This CPU can use local ops from any context (process, irq, softirq, nmi, ...) + to update its local_t variables. +- Preemption (or interrupts) must be disabled when using local ops in + process context to make sure the process won't be migrated to a + different CPU between getting the per-cpu variable and doing the + actual local op. +- When using local ops in interrupt context, no special care must be + taken on a mainline kernel, since they will run on the local CPU with + preemption already disabled. 
I suggest, however, to explicitly + disable preemption anyway to make sure it will still work correctly on + -rt kernels. +- Reading the local cpu variable will provide the current copy of the + variable. +- Reads of these variables can be done from any CPU, because updates to + "long", aligned, variables are always atomic. Since no memory + synchronization is done by the writer CPU, an outdated copy of the + variable can be read when reading some _other_ cpu's variables. + + * How to use local atomic operations #include <linux/percpu.h> diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 5fbcc22..168117b 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -2,7 +2,8 @@ Memory Hotplug ============== -Last Updated: Jul 28 2007 +Created: Jul 28 2007 +Add description of notifier of memory hotplug Oct 11 2007 This document is about memory hotplug including how-to-use and current status. Because Memory Hotplug is still under development, contents of this text will @@ -24,7 +25,8 @@ be changed often. 6.1 Memory offline and ZONE_MOVABLE 6.2. How to offline memory 7. Physical memory remove -8. Future Work List +8. Memory hotplug event notifier +9. Future Work List Note(1): x86_64's has special implementation for memory hotplug. This text does not describe it. @@ -307,8 +309,58 @@ Need more implementation yet.... - Notification completion of remove works by OS to firmware. - Guard from remove if not yet. +-------------------------------- +8. Memory hotplug event notifier +-------------------------------- +Memory hotplug has an event notifier. There are 6 types of notification. + +MEMORY_GOING_ONLINE + Generated before new memory becomes available in order to be able to + prepare subsystems to handle memory. The page allocator is still unable + to allocate from the new memory. + +MEMORY_CANCEL_ONLINE + Generated if MEMORY_GOING_ONLINE fails.
+ +MEMORY_ONLINE + Generated when memory has been successfully brought online. The callback may + allocate pages from the new memory. + +MEMORY_GOING_OFFLINE + Generated to begin the process of offlining memory. Allocations are no + longer possible from the memory but some of the memory to be offlined + is still in use. The callback can be used to free memory known to a + subsystem from the indicated memory section. + +MEMORY_CANCEL_OFFLINE + Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from + the section that we attempted to offline. + +MEMORY_OFFLINE + Generated after offlining memory is complete. + +A callback routine can be registered by + hotplug_memory_notifier(callback_func, priority) + +The second argument of the callback function (action) is one of the event types +above. The third argument is a pointer to a struct memory_notify. + +struct memory_notify { + unsigned long start_pfn; + unsigned long nr_pages; + int status_change_nid; +} + +start_pfn is the start pfn of the online/offline memory. +nr_pages is the number of pages of the online/offline memory. +status_change_nid is set to the node id when N_HIGH_MEMORY of the nodemask is +(will be) set/cleared. This means a new (memoryless) node gets new memory by +onlining, or a node loses all its memory. If this is -1, then the nodemask +status is not changed. +If status_change_nid >= 0, the callback should create/discard structures for the +node if necessary. + -------------- -8. Future Work +9. Future Work -------------- - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like sysctl or new control file. diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 153d84d..563e442 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -4,8 +4,6 @@ - information on the 3Com EtherLink Plus (3c505) driver.
6pack.txt - info on the 6pack protocol, an alternative to KISS for AX.25 -Configurable - - info on some of the configurable network parameters DLINK.txt - info on the D-Link DE-600/DE-620 parallel port pocket adapters PLIP.txt @@ -26,8 +24,6 @@ baycom.txt - info on the driver for Baycom style amateur radio modems bridge.txt - where to get user space programs for ethernet bridging with Linux. -comx.txt - - info on drivers for COMX line of synchronous serial adapters. cops.txt - info on the COPS LocalTalk Linux driver cs89x0.txt @@ -78,22 +74,14 @@ ltpc.txt - the Apple or Farallon LocalTalk PC card driver multicast.txt - Behaviour of cards under Multicast -ncsa-telnet - - notes on how NCSA telnet (DOS) breaks with MTU discovery enabled. -net-modules.txt - - info and "insmod" parameters for all network driver modules. netdevices.txt - info on network device driver functions exported to the kernel. olympic.txt - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. policy-routing.txt - IP policy-based routing -pt.txt - - the Gracilis Packetwin AX.25 device driver ray_cs.txt - Raylink Wireless LAN card driver info. -routing.txt - - the new routing mechanism shaper.txt - info on the module that can shape/limit transmitted traffic. sk98lin.txt diff --git a/Documentation/networking/Configurable b/Documentation/networking/Configurable deleted file mode 100644 index 69c0dd4..0000000 --- a/Documentation/networking/Configurable +++ /dev/null @@ -1,34 +0,0 @@ - -There are a few network parameters that can be tuned to better match -the kernel to your system hardware and intended usage. The defaults -are usually a good choice for 99% of the people 99% of the time, but -you should be aware they do exist and can be changed. - -The current list of parameters can be found in the files: - - linux/net/TUNABLE - Documentation/networking/ip-sysctl.txt - -Some of these are accessible via the sysctl interface, and many more are -scheduled to be added in this way. 
For example, some parameters related -to Address Resolution Protocol (ARP) are very easily viewed and altered. - - # cat /proc/sys/net/ipv4/arp_timeout - 6000 - # echo 7000 > /proc/sys/net/ipv4/arp_timeout - # cat /proc/sys/net/ipv4/arp_timeout - 7000 - -Others are already accessible via the related user space programs. -For example, MAX_WINDOW has a default of 32 k which is a good choice for -modern hardware, but if you have a slow (8 bit) Ethernet card and/or a slow -machine, then this will be far too big for the card to keep up with fast -machines transmitting on the same net, resulting in overruns and receive errors. -A value of about 4 k would be more appropriate, which can be set via: - - # route add -net 192.168.3.0 window 4096 - -The remainder of these can only be presently changed by altering a #define -in the related header file. This means an edit and recompile cycle. - - Paul Gortmaker 06/96 diff --git a/Documentation/networking/comx.txt b/Documentation/networking/comx.txt deleted file mode 100644 index d1526eb..0000000 --- a/Documentation/networking/comx.txt +++ /dev/null @@ -1,248 +0,0 @@ - - COMX drivers for the 2.2 kernel - -Originally written by: Tivadar Szemethy, <tiv@itc.hu> -Currently maintained by: Gergely Madarasz <gorgo@itc.hu> - -Last change: 21/06/1999. - -INTRODUCTION - -This document describes the software drivers and their use for the -COMX line of synchronous serial adapters for Linux version 2.2.0 and -above. -The cards are produced and sold by ITC-Pro Ltd. Budapest, Hungary -For further info contact <info@itc.hu> -or http://www.itc.hu (mostly in Hungarian). 
-The firmware files and software are available from ftp://ftp.itc.hu - -Currently, the drivers support the following cards and protocols: - -COMX (2x64 kbps intelligent board) -CMX (1x256 + 1x128 kbps intelligent board) -HiCOMX (2x2Mbps intelligent board) -LoCOMX (1x512 kbps passive board) -MixCOM (1x512 or 2x512kbps passive board with a hardware watchdog an - optional BRI interface and optional flashROM (1-32M)) -SliceCOM (1x2Mbps channelized E1 board) -PciCOM (X21) - -At the moment of writing this document, the (Cisco)-HDLC, LAPB, SyncPPP and -Frame Relay (DTE, rfc1294 IP encapsulation with partially implemented Q933a -LMI) protocols are available as link-level protocol. -X.25 support is being worked on. - -USAGE - -Load the comx.o module and the hardware-specific and protocol-specific -modules you'll need into the running kernel using the insmod utility. -This creates the /proc/comx directory. -See the example scripts in the 'etc' directory. - -/proc INTERFACE INTRO - -The COMX driver set has a new type of user interface based on the /proc -filesystem which eliminates the need for external user-land software doing -IOCTL calls. -Each network interface or device (i.e. those ones you configure with 'ifconfig' -and 'route' etc.) has a corresponding directory under /proc/comx. You can -dynamically create a new interface by saying 'mkdir /proc/comx/comx0' (or you -can name it whatever you want up to 8 characters long, comx[n] is just a -convention). -Generally the files contained in these directories are text files, which can -be viewed by 'cat filename' and you can write a string to such a file by -saying 'echo _string_ >filename'. This is very similar to the sysctl interface. -Don't use a text editor to edit these files, always use 'echo' (or 'cat' -where appropriate). -When you've created the comx[n] directory, two files are created automagically -in it: 'boardtype' and 'protocol'. 
You have to fill in these files correctly -for your board and protocol you intend to use (see the board and protocol -descriptions in this file below or the example scripts in the 'etc' directory). -After filling in these files, other files will appear in the directory for -setting the various hardware- and protocol-related informations (for example -irq and io addresses, keepalive values etc.) These files are set to default -values upon creation, so you don't necessarily have to change all of them. - -When you're ready with filling in the files in the comx[n] directory, you can -configure the corresponding network interface with the standard network -configuration utilities. If you're unable to bring the interfaces up, look up -the various kernel log files on your system, and consult the messages for -a probable reason. - -EXAMPLE - -To create the interface 'comx0' which is the first channel of a COMX card: - -insmod comx -# insmod comx-hw-comx ; insmod comx-proto-ppp (these are usually -autoloaded if you use the kernel module loader) - -mkdir /proc/comx/comx0 -echo comx >/proc/comx/comx0/boardtype -echo 0x360 >/proc/comx/comx0/io <- jumper-selectable I/O port -echo 0x0a >/proc/comx/comx0/irq <- jumper-selectable IRQ line -echo 0xd000 >/proc/comx/comx0/memaddr <- software-configurable memory - address. COMX uses 64 KB, and this - can be: 0xa000, 0xb000, 0xc000, - 0xd000, 0xe000. Avoid conflicts - with other hardware. 
-cat </etc/siol1.rom >/proc/comx/comx0/firmware <- the firmware for the card -echo HDLC >/proc/comx/comx0/protocol <- the data-link protocol -echo 10 >/proc/comx/comx0/keepalive <- the keepalive for the protocol -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 <- - finally configure it with ifconfig -Check its status: -cat /proc/comx/comx0/status - -If you want to use the second channel of this board: - -mkdir /proc/comx/comx1 -echo comx >/proc/comx/comx1/boardtype -echo 0x360 >/proc/comx/comx1/io -echo 10 >/proc/comx/comx1/irq -echo 0xd000 >/proc/comx/comx1/memaddr -echo 1 >/proc/comx/comx1/channel <- channels are numbered - as 0 (default) and 1 - -Now, check if the driver recognized that you're going to use the other -channel of the same adapter: - -cat /proc/comx/comx0/twin -comx1 -cat /proc/comx/comx1/twin -comx0 - -You don't have to load the firmware twice, if you use both channels of -an adapter, just write it into the channel 0's /proc firmware file. - -Default values: io 0x360 for COMX, 0x320 (HICOMX), irq 10, memaddr 0xd0000 - -THE LOCOMX HARDWARE DRIVER - -The LoCOMX driver doesn't require firmware, and it doesn't use memory either, -but it uses DMA channels 1 and 3. You can set the clock rate (if enabled by -jumpers on the board) by writing the kbps value into the file named 'clock'. -Set it to 'external' (it is the default) if you have external clock source. - -(Note: currently the LoCOMX driver does not support the internal clock) - -THE COMX, CMX AND HICOMX DRIVERS - -On the HICOMX, COMX and CMX, you have to load the firmware (it is different for -the three cards!). All these adapters can share the same memory -address (we usually use 0xd0000). On the CMX you can set the internal -clock rate (if enabled by jumpers on the small adapter boards) by writing -the kbps value into the 'clock' file. You have to do this before initializing -the card. If you use both HICOMX and CMX/COMX cards, initialize the HICOMX -first. 
The I/O address of the HICOMX board is not configurable by any -method available to the user: it is hardwired to 0x320, and if you have to -change it, consult ITC-Pro Ltd. - -THE MIXCOM DRIVER - -The MixCOM board doesn't require firmware, the driver communicates with -it through I/O ports. You can have three of these cards in one machine. - -THE SLICECOM DRIVER - -The SliceCOM board doesn't require firmware. You can have 4 of these cards -in one machine. The driver doesn't (yet) support shared interrupts, so -you will need a separate IRQ line for every board. -Read Documentation/networking/slicecom.txt for help on configuring -this adapter. - -THE HDLC/PPP LINE PROTOCOL DRIVER - -The HDLC/SyncPPP line protocol driver uses the kernel's built-in syncppp -driver (syncppp.o). You don't have to manually select syncppp.o when building -the kernel, the dependencies compile it in automatically. - - - - -EXAMPLE -(setting up hw parameters, see above) - -# using HDLC: -echo hdlc >/proc/comx/comx0/protocol -echo 10 >/proc/comx/comx0/keepalive <- not necessary, 10 is the default -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - -(setting up hw parameters, see above) - -# using PPP: -echo ppp >/proc/comx/comx0/protocol -ifconfig comx0 up -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - - -THE LAPB LINE PROTOCOL DRIVER - -For this, you'll need to configure LAPB support (See 'LAPB Data Link Driver' in -'Network options' section) into your kernel (thanks to Jonathan Naylor for his -excellent implementation). -comx-proto-lapb.o provides the following files in the appropriate directory -(the default values in parens): t1 (5), t2 (1), n2 (20), mode (DTE, STD) and -window (7). Agree with the administrator of your peer router on these -settings (most people use defaults, but you have to know if you are DTE or -DCE). 
- -EXAMPLE - -(setting up hw parameters, see above) -echo lapb >/proc/comx/comx0/protocol -echo dce >/proc/comx/comx0/mode <- DCE interface in this example -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - - -THE FRAME RELAY PROTOCOL DRIVER - -You DON'T need any other frame relay related modules from the kernel to use -COMX-Frame Relay. This protocol is a bit more complicated than the others, -because it allows to use 'subinterfaces' or DLCIs within one physical device. -First you have to create the 'master' device (the actual physical interface) -as you would do for other protocols. Specify 'frad' as protocol type. -Now you can bring this interface up by saying 'ifconfig comx0 up' (or whatever -you've named the interface). Do not assign any IP address to this interface -and do not set any routes through it. -Then, set up your DLCIs the following way: create a comx interface for each -DLCI you intend to use (with mkdir), and write 'dlci' to the 'boardtype' file, -and 'ietf-ip' to the 'protocol' file. Currently, the only supported -encapsulation type is this (also called as RFC1294/1490 IP encapsulation). -Write the DLCI number to the 'dlci' file, and write the name of the physical -COMX device to the file called 'master'. -Now you can assign an IP address to this interface and set routes using it. -See the example file for further info and example config script. -Notes: this driver implements a DTE interface with partially implemented -Q933a LMI. -You can find an extensively commented example in the 'etc' directory. - -FURTHER /proc FILES - -boardtype: -Type of the hardware. Valid values are: - 'comx', 'hicomx', 'locomx', 'cmx', 'slicecom'. - -protocol: -Data-link protocol on this channel. Can be: HDLC, LAPB, PPP, FRAD - -status: -You can read the channel's actual status from the 'status' file, for example -'cat /proc/comx/comx3/status'. - -lineup_delay: -Interpreted in seconds (default is 1). 
Used to avoid line jitter: the system -will consider the line status 'UP' only if it is up for at least this number -of seconds. - -debug: -You can set various debug options through this file. Valid options are: -'comx_events', 'comx_tx', 'comx_rx', 'hw_events', 'hw_tx', 'hw_rx'. -You can enable a debug options by writing its name prepended by a '+' into -the debug file, for example 'echo +comx_rx >comx0/debug'. -Disabling an option happens similarly, use the '-' prefix -(e.g. 'echo -hw_rx >debug'). -Debug results can be read from the debug file, for example: -tail -f /proc/comx/comx2/debug - - diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 747a5d1..6f7872b 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -184,14 +184,14 @@ tcp_frto - INTEGER F-RTO is an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference - rather than intermediate router congestion. FRTO is sender-side + rather than intermediate router congestion. F-RTO is sender-side only modification. Therefore it does not require any support from the peer, but in a typical case, however, where wireless link is the local access link and most of the data flows downlink, the - faraway servers should have FRTO enabled to take advantage of it. + faraway servers should have F-RTO enabled to take advantage of it. If set to 1, basic version is enabled. 2 enables SACK enhanced F-RTO if flow uses SACK. The basic version can be used also when - SACK is in use though scenario(s) with it exists where FRTO + SACK is in use though scenario(s) with it exists where F-RTO interacts badly with the packet counting of the SACK enabled TCP flow. 
diff --git a/Documentation/networking/ncsa-telnet b/Documentation/networking/ncsa-telnet deleted file mode 100644 index d77d28b..0000000 --- a/Documentation/networking/ncsa-telnet +++ /dev/null @@ -1,16 +0,0 @@ -NCSA telnet doesn't work with path MTU discovery enabled. This is due to a -bug in NCSA that also stops it working with other modern networking code -such as Solaris. - -The following information is courtesy of -Marek <marekm@i17linuxb.ists.pwr.wroc.pl> - -There is a fixed version somewhere on ftp.upe.ac.za (sorry, I don't -remember the exact pathname, and this site is very slow from here). -It may or may not be faster for you to get it from -ftp://ftp.ists.pwr.wroc.pl/pub/msdos/telnet/ncsa_upe/tel23074.zip -(source is in v230704s.zip). I have tested it with 1.3.79 (with -path mtu discovery enabled - ncsa 2.3.08 didn't work) and it seems -to work. I don't know if anyone is working on this code - this -version is over a year old. Too bad - it's faster and often more -stable than these windoze telnets, and runs on almost anything... diff --git a/Documentation/networking/net-modules.txt b/Documentation/networking/net-modules.txt deleted file mode 100644 index 98c4392..0000000 --- a/Documentation/networking/net-modules.txt +++ /dev/null @@ -1,315 +0,0 @@ -Wed 2-Aug-95 <matti.aarnio@utu.fi> - - Linux network driver modules - - Do not mistake this for "README.modules" at the top-level - directory! That document tells about modules in general, while - this one tells only about network device driver modules. - - This is a potpourri of INSMOD-time(*) configuration options - (if such exists) and their default values of various modules - in the Linux network drivers collection. - - Some modules have also hidden (= non-documented) tunable values. - The choice of not documenting them is based on general belief, that - the less the user needs to know, the better. (There are things that - driver developers can use, others should not confuse themselves.) 
- - In many cases it is highly preferred that insmod:ing is done - ONLY with defining an explicit address for the card, AND BY - NOT USING AUTO-PROBING! - - Now most cards have some explicitly defined base address that they - are compiled with (to avoid auto-probing, among other things). - If that compiled value does not match your actual configuration, - do use the "io=0xXXX" -parameter for the insmod, and give there - a value matching your environment. - - If you are adventurous, you can ask the driver to autoprobe - by using the "io=0" parameter, however it is a potentially dangerous - thing to do in a live system. (If you don't know where the - card is located, you can try autoprobing, and after possible - crash recovery, insmod with proper IO-address..) - - -------------------------- - (*) "INSMOD-time" means when you load module with - /sbin/insmod you can feed it optional parameters. - See "man insmod". - -------------------------- - - - 8390 based Network Modules (Paul Gortmaker, Nov 12, 1995) - -------------------------- - -(Includes: smc-ultra, ne, wd, 3c503, hp, hp-plus, e2100 and ac3200) - -The 8390 series of network drivers now support multiple card systems without -reloading the same module multiple times (memory efficient!) This is done by -specifying multiple comma separated values, such as: - - insmod 3c503.o io=0x280,0x300,0x330,0x350 xcvr=0,1,0,1 - -The above would have the one module controlling four 3c503 cards, with card 2 -and 4 using external transceivers. The "insmod" manual describes the usage -of comma separated value lists. - -It is *STRONGLY RECOMMENDED* that you supply "io=" instead of autoprobing. -If an "io=" argument is not supplied, then the ISA drivers will complain -about autoprobing being not recommended, and begrudgingly autoprobe for -a *SINGLE CARD ONLY* -- if you want to use multiple cards you *have* to -supply an "io=0xNNN,0xQQQ,..." argument. - -The ne module is an exception to the above. 
A NE2000 is essentially an -8390 chip, some bus glue and some RAM. Because of this, the ne probe is -more invasive than the rest, and so at boot we make sure the ne probe is -done last of all the 8390 cards (so that it won't trip over other 8390 based -cards) With modules we can't ensure that all other non-ne 8390 cards have -already been found. Because of this, the ne module REQUIRES an "io=0xNNN" -argument passed in via insmod. It will refuse to autoprobe. - -It is also worth noting that auto-IRQ probably isn't as reliable during -the flurry of interrupt activity on a running machine. Cards such as the -ne2000 that can't get the IRQ setting from an EEPROM or configuration -register are probably best supplied with an "irq=M" argument as well. - - ----------------------------------------------------------------------- -Card/Module List - Configurable Parameters and Default Values ----------------------------------------------------------------------- - -3c501.c: - io = 0x280 IO base address - irq = 5 IRQ - (Probes ports: 0x280, 0x300) - -3c503.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver using autoIRQ) - xcvr = 0 (Use xcvr=1 to select external transceiver.) - (Probes ports: 0x300, 0x310, 0x330, 0x350, 0x250, 0x280, 0x2A0, 0x2E0) - -3c505.c: - io = 0 - irq = 0 - dma = 6 (not autoprobed) - (Probes ports: 0x300, 0x280, 0x310) - -3c507.c: - io = 0x300 - irq = 0 - (Probes ports: 0x300, 0x320, 0x340, 0x280) - -3c509.c: - io = 0 - irq = 0 - ( Module load-time probing Works reliably only on EISA, ISA ID-PROBE - IS NOT RELIABLE! Compile this driver statically into kernel for - now, if you need it auto-probing on an ISA-bus machine. ) - -8390.c: - (No public options, several other modules need this one) - -a2065.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. 
(m68k/Amiga) - -ac3200.c: - io = 0 (Checks 0x1000 to 0x8fff in 0x1000 intervals) - irq = 0 (Read from config register) - (EISA probing..) - -apricot.c: - io = 0x300 (Can't be altered!) - irq = 10 - -arcnet.c: - io = 0 - irqnum = 0 - shmem = 0 - num = 0 - DO SET THESE MANUALLY AT INSMOD! - (When probing, looks at the following possible addresses: - Suggested ones: - 0x300, 0x2E0, 0x2F0, 0x2D0 - Other ones: - 0x200, 0x210, 0x220, 0x230, 0x240, 0x250, 0x260, 0x270, - 0x280, 0x290, 0x2A0, 0x2B0, 0x2C0, - 0x310, 0x320, 0x330, 0x340, 0x350, 0x360, 0x370, - 0x380, 0x390, 0x3A0, 0x3E0, 0x3F0 ) - -ariadne.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. (m68k/Amiga) - -at1700.c: - io = 0x260 - irq = 0 - (Probes ports: 0x260, 0x280, 0x2A0, 0x240, 0x340, 0x320, 0x380, 0x300) - -atarilance.c: - Supports full autoprobing. (m68k/Atari) - -atp.c: *Not modularized* - (Probes ports: 0x378, 0x278, 0x3BC; - fixed IRQs: 5 and 7 ) - -cops.c: - io = 0x240 - irq = 5 - nodeid = 0 (AutoSelect = 0, NodeID 1-254 is hand selected.) - (Probes ports: 0x240, 0x340, 0x200, 0x210, 0x220, 0x230, 0x260, - 0x2A0, 0x300, 0x310, 0x320, 0x330, 0x350, 0x360) - -de4x5.c: - io = 0x000b - irq = 10 - is_not_dec = 0 -- For non-DEC card using DEC 21040/21041/21140 chip, set this to 1 - (EISA, and PCI probing) - -de600.c: - de600_debug = 0 - (On port 0x378, irq 7 -- lpt1; compile time configurable) - -de620.c: - bnc = 0, utp = 0 <-- Force media by setting either. - io = 0x378 (also compile-time configurable) - irq = 7 - -depca.c: - io = 0x200 - irq = 7 - (Probes ports: ISA: 0x300, 0x200; - EISA: 0x0c00 ) - -dummy.c: - No options - -e2100.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver) - mem = 0 (Override default shared memory start of 0xd0000) - xcvr = 0 (Use xcvr=1 to select external transceiver.) 
- (Probes ports: 0x300, 0x280, 0x380, 0x220) - -eepro.c: - io = 0x200 - irq = 0 - (Probes ports: 0x200, 0x240, 0x280, 0x2C0, 0x300, 0x320, 0x340, 0x360) - -eexpress.c: - io = 0x300 - irq = 0 (IRQ value read from EEPROM) - (Probes ports: 0x300, 0x270, 0x320, 0x340) - -eql.c: - (No parameters) - -ewrk3.c: - io = 0x300 - irq = 5 - (With module no autoprobing! - On EISA-bus does EISA probing. - Static linkage probes ports on ISA bus: - 0x100, 0x120, 0x140, 0x160, 0x180, 0x1A0, 0x1C0, - 0x200, 0x220, 0x240, 0x260, 0x280, 0x2A0, 0x2C0, 0x2E0, - 0x300, 0x340, 0x360, 0x380, 0x3A0, 0x3C0) - -hp-plus.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ read from configuration register) - (Probes ports: 0x200, 0x240, 0x280, 0x2C0, 0x300, 0x320, 0x340) - -hp.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver using autoIRQ) - (Probes ports: 0x300, 0x320, 0x340, 0x280, 0x2C0, 0x200, 0x240) - -hp100.c: - hp100_port = 0 (IO-base address) - (Does EISA-probing, if on EISA-slot; - On ISA-bus probes all ports from 0x100 thru to 0x3E0 - in increments of 0x020) - -hydra.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. (m68k/Amiga) - -ibmtr.c: - io = 0xa20, 0xa24 (autoprobed by default) - irq = 0 (driver cannot select irq - read from hardware) - mem = 0 (shared memory base set at 0xd0000 and not yet - able to override thru mem= parameter.) 
- -lance.c: *Not modularized* - (PCI, and ISA probing; "CONFIG_PCI" needed for PCI support) - (Probes ISA ports: 0x300, 0x320, 0x340, 0x360) - -loopback.c: *Static kernel component* - -ne.c: - io = 0 (Explicitly *requires* an "io=0xNNN" value) - irq = 0 (Tries to determine configured IRQ via autoIRQ) - (Probes ports: 0x300, 0x280, 0x320, 0x340, 0x360) - -net_init.c: *Static kernel component* - -ni52.c: *Not modularized* - (Probes ports: 0x300, 0x280, 0x360, 0x320, 0x340 - mems: 0xD0000, 0xD2000, 0xC8000, 0xCA000, - 0xD4000, 0xD6000, 0xD8000 ) - -ni65.c: *Not modularized* **16MB MEMORY BARRIER BUG** - (Probes ports: 0x300, 0x320, 0x340, 0x360) - -pi2.c: *Not modularized* (well, NON-STANDARD modularization!) - Only one card supported at this time. - (Probes ports: 0x380, 0x300, 0x320, 0x340, 0x360, 0x3A0) - -plip.c: - io = 0 - irq = 0 (by default, uses IRQ 5 for port at 0x3bc, IRQ 7 - for port at 0x378, and IRQ 2 for port at 0x278) - (Probes ports: 0x278, 0x378, 0x3bc) - -ppp.c: - No options (ppp-2.2+ has some, this is based on non-dynamic - version from ppp-2.1.2d) - -seeq8005.c: *Not modularized* - (Probes ports: 0x300, 0x320, 0x340, 0x360) - -skeleton.c: *Skeleton* - -slhc.c: - No configuration parameters - -slip.c: - slip_maxdev = 256 (default value from SL_NRUNIT on slip.h) - - -smc-ultra.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ val. read from EEPROM) - (Probes ports: 0x200, 0x220, 0x240, 0x280, 0x300, 0x340, 0x380) - -tulip.c: *Partial modularization* - (init-time memory allocation makes problems..) - -tunnel.c: - No insmod parameters - -wavelan.c: - io = 0x390 (Settable, but change not recommended) - irq = 0 (Not honoured, if changed..) - -wd.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ val. read from EEPROM, ancient cards use autoIRQ) - mem = 0 (Force shared-memory on address 0xC8000, or whatever..) - mem_end = 0 (Force non-std. mem. size via supplying mem_end val.) - (eg. 
for 32k WD8003EBT, use mem=0xd0000 mem_end=0xd8000)
-        (Probes ports: 0x300, 0x280, 0x380, 0x240)
-
-znet.c: *Not modularized*
-        (Only one device on Zenith Z-Note (notebook?) systems,
-         configuration information from (EE)PROM)
diff --git a/Documentation/networking/pt.txt b/Documentation/networking/pt.txt
deleted file mode 100644
index 72e888c..0000000
--- a/Documentation/networking/pt.txt
+++ /dev/null
@@ -1,58 +0,0 @@
-This is the README for the Gracilis Packetwin device driver, version 0.5
-ALPHA for Linux 1.3.43.
-
-These files will allow you to talk to the PackeTwin (now know as PT) and
-connect through it just like a pair of TNCs. To do this you will also
-require the AX.25 code in the kernel enabled.
-
-There are four files in this archive; this readme, a patch file, a .c file
-and finally a .h file. The two program files need to be put into the
-drivers/net directory in the Linux source tree, for me this is the
-directory /usr/src/linux/drivers/net. The patch file needs to be patched in
-at the top of the Linux source tree (/usr/src/linux in my case).
-
-You will most probably have to edit the pt.c file to suit your own setup,
-this should just involve changing some of the defines at the top of the file.
-Please note that if you run an external modem you must specify a speed of 0.
-
-The program is currently setup to run a 4800 baud external modem on port A
-and a Kantronics DE-9600 daughter board on port B so if you have this (or
-something similar) then you're right.
-
-To compile in the driver, put the files in the correct place and patch in
-the diff. You will have to re-configure the kernel again before you
-recompile it.
-
-The driver is not real good at the moment for finding the card. You can
-'help' it by changing the order of the potential addresses in the structure
-found in the pt_init() function so the address of where the card is is put
-first.
-
-After compiling, you have to get them going, they are pretty well like any
-other net device and just need ifconfig to get them going.
-As an example, here is my /etc/rc.net
---------------------------
-
-#
-# Configure the PackeTwin, port A.
-/sbin/ifconfig pt0a 44.136.8.87 hw ax25 vk2xlz mtu 512
-/sbin/ifconfig pt0a 44.136.8.87 broadcast 44.136.8.255 netmask 255.255.255.0
-/sbin/route add -net 44.136.8.0 netmask 255.255.255.0 dev pt0a
-/sbin/route add -net 44.0.0.0 netmask 255.0.0.0 gw 44.136.8.68 dev pt0a
-/sbin/route add -net 138.25.16.0 netmask 255.255.240.0 dev pt0a
-/sbin/route add -host 44.136.8.255 dev pt0a
-#
-# Configure the PackeTwin, port B.
-/sbin/ifconfig pt0b 44.136.8.87 hw ax25 vk2xlz-1 mtu 512
-/sbin/ifconfig pt0b 44.136.8.87 broadcast 44.255.255.255 netmask 255.0.0.0
-/sbin/route add -host 44.136.8.216 dev pt0b
-/sbin/route add -host 44.136.8.95 dev pt0b
-/sbin/route add -host 44.255.255.255 dev pt0b
-
-This version of the driver comes under the GNU GPL. If you have one of my
-previous (non-GPL) versions of the driver, please update to this one.
-
-I hope that this all works well for you. I would be pleased to hear how
-many people use the driver and if it does its job.
-
-        - Craig vk2xlz <csmall@small.dropbear.id.au>
diff --git a/Documentation/networking/routing.txt b/Documentation/networking/routing.txt
deleted file mode 100644
index a26838b..0000000
--- a/Documentation/networking/routing.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-The directory ftp.inr.ac.ru:/ip-routing contains:
-
-- iproute.c - "professional" routing table maintenance utility.
-
-- rdisc.tar.gz - rdisc daemon, ported from Sun.
-        STRONGLY RECOMMENDED FOR ALL HOSTS.
-
-- routing.tgz - original Mike McLagan's route by source patch.
-        Currently it is obsolete.
-
-- gated.dif-ss<NEWEST>.gz - gated-R3_6Alpha_2 fixes.
-        Look at README.gated
-
-- mrouted-3.8.dif.gz - mrouted-3.8 fixes.
-
-- rtmon.c - trivial debugging utility: reads and stores netlink.
-
-
-NEWS for user.
- -- Policy based routing. Routing decisions are made on the basis - not only of destination address, but also source address, - TOS and incoming interface. -- Complete set of IP level control messages. - Now Linux is the only OS in the world complying to RFC requirements. - Great win 8) -- New interface addressing paradigm. - Assignment of address ranges to interface, - multiple prefixes etc. etc. - Do not bother, it is compatible with the old one. Moreover: -- You don't need to do "route add aaa.bbb.ccc... eth0" anymore, - it is done automatically. -- "Abstract" UNIX sockets and security enhancements. - This is necessary to use TIRPC and TLI emulation library. - -NEWS for hacker. - -- New destination cache. Flexible, robust and just beautiful. -- Network stack is reordered, simplified, optimized, a lot of bugs fixed. - (well, and new bugs were introduced, but I haven't seen them yet 8)) - It is difficult to describe all the changes, look into source. - -If you see this file, then this patch works 8) - -Alexey Kuznetsov. -kuznet@ms2.inr.ac.ru diff --git a/Documentation/networking/slicecom.hun b/Documentation/networking/slicecom.hun deleted file mode 100644 index bed2f04..0000000 --- a/Documentation/networking/slicecom.hun +++ /dev/null @@ -1,371 +0,0 @@ - -SliceCOM adapter felhasznaloi dokumentacioja - 0.51 verziohoz - -Bartók István <bartoki@itc.hu> -Utolso modositas: Wed Aug 29 17:26:58 CEST 2001 - ------------------------------------------------------------------ - -Hasznalata: - -Forditas: - -Code maturity level options - [*] Prompt for development and/or incomplete code/drivers - -Network device support - Wan interfaces - <M> MultiGate (COMX) synchronous - <M> Support for MUNICH based boards: SliceCOM, PCICOM (NEW) - <M> Support for HDLC and syncPPP... 
- - -A modulok betoltese: - -modprobe comx - -modprobe comx-proto-ppp # a Cisco-HDLC es a SyncPPP protokollt is - # ez a modul adja - -modprobe comx-hw-munich # a modul betoltodeskor azonnal jelent a - # syslogba a detektalt kartyakrol - - -Konfiguralas: - -# Ezen az interfeszen Cisco-HDLC vonali protokoll fog futni -# Az interfeszhez rendelt idoszeletek: 1,2 (128 kbit/sec-es vonal) -# (a G.703 keretben az elso adatot vivo idoszelet az 1-es) -# -mkdir /proc/comx/comx0.1/ -echo slicecom >/proc/comx/comx0.1/boardtype -echo hdlc >/proc/comx/comx0.1/protocol -echo 1 2 >/proc/comx/comx0.1/timeslots - - -# Ezen az interfeszen SyncPPP vonali protokoll fog futni -# Az interfeszhez rendelt idoszelet: 3 (64 kbit/sec-es vonal) -# -mkdir /proc/comx/comx0.2/ -echo slicecom >/proc/comx/comx0.2/boardtype -echo ppp >/proc/comx/comx0.2/protocol -echo 3 >/proc/comx/comx0.2/timeslots - -... - -ifconfig comx0.1 up -ifconfig comx0.2 up - ------------------------------------------------------------------ - -A COMX driverek default 20 csomagnyi transmit queue-t rendelnek a halozati -interfeszekhez. WAN halozatokban ennel hosszabbat is szokas hasznalni -(20 es 100 kozott), hogy a vonal kihasznaltsaga nagy terheles eseten jobb -legyen (bar ezzel megno a varhato kesleltetes a csomagok sorban allasa miatt): - -# ifconfig comx0 txqueuelen 50 - -Ezt a beallitasi lehetoseget csak az ujabb disztribuciok ifconfig parancsa -tamogatja (amik mar a 2.2 kernelekhez keszultek, mint a RedHat 6.1 vagy a -Debian 2.2). - -A 2.1-es Debian disztribuciohoz a http://www.debian.org/~rcw/2.2/netbase/ -cimrol toltheto le ujabb netbase csomag, ami mar ilyet tamogato ifconfig -parancsot tartalmaz. 
Bovebben a 2.2 kernel hasznalatarol Debian 2.1 alatt: -http://www.debian.org/releases/stable/running-kernel-2.2 - ------------------------------------------------------------------ - -A kartya LED-jeinek jelentese: - -piros - eg, ha Remote Alarm-ot kuld a tuloldal -zold - eg, ha a vett jelben megtalalja a keretszinkront - -Reszletesebben: - -piros: zold: jelentes: - -- - nincs keretszinkron (nincs jel, vagy rossz a jel) -- eg "minden rendben" -eg eg a vetel OK, de a tuloldal Remote Alarm-ot kuld -eg - ez nincs ertelmezve, egyelore funkcio nelkul - ------------------------------------------------------------------ - -Reszletesebb leiras a hardver beallitasi lehetosegeirol: - -Az altalanos,- es a protokoll-retegek beallitasi lehetosegeirol a 'comx.txt' -fajlban leirtak SliceCOM kartyanal is ervenyesek, itt csak a hardver-specifikus -beallitasi lehetosegek vannak osszefoglalva: - -Konfiguralasi interfesz a /proc/comx/ alatt: - -Minden timeslot-csoportnak kulon comx* interfeszt kell letrehozni mkdir-rel: -comx0, comx1, .. stb. Itt beallithato, hogy az adott interfesz hanyadik kartya -melyik timeslotja(i)bol alljon ossze. A Cisco-fele serial3:1 elnevezesek -(serial3:1 = a 3. kartyaban az 1-es idoszelet-csoport) Linuxon aliasing-ot -jelentenenek, ezert mi nem tudunk ilyen elnevezest hasznalni. - -Tobb kartya eseten a comx0.1, comx0.2, ... vagy slice0.1, slice0.2 nevek -hasznalhatoak. - -Tobb SliceCOM kartya is lehet egy gepben, de sajat interrupt kell mindegyiknek, -nem tud meg megosztott interruptot kezelni. - -Az egesz kartyat erinto beallitasok: - -Az ioport es irq beallitas nincs: amit a PCI BIOS kioszt a rendszernek, -azt hasznalja a driver. 
- - -comx0/boardnum - hanyadik SliceCOM kartya a gepben (a 'termeszetes' PCI - sorrendben ertve: ahogyan a /proc/pci-ban vagy az 'lspci' - kimeneteben megjelenik, altalaban az alaplapi PCI meghajto - aramkorokhoz kozelebb eso kartyak a kisebb sorszamuak) - - Default: 0 (0-tol kezdodik a szamolas) - - -Bar a kovetkezoket csak egy-egy interfeszen allitjuk at, megis az egesz kartya -mukodeset egyszerre allitjak. A megkotes hogy csak UP-ban levo interfeszen -hasznalhatoak, azert van, mert kulonben nem vart eredmenyekre vezetne egy ilyen -paranccsorozat: - - echo 0 >boardnum - echo internal >clock_source - echo 1 >boardnum - -- Ez a 0-s board clock_source-at allitana at. - -Ezek a beallitasok megmaradnak az osszes interfesz torlesekor, de torlodnek -a driver modul ki/betoltesekor. - - -comx0/clock_source - A Tx orajelforrasa, a Cisco-val hasonlatosra keszult. - Hasznalata: - - papaya:# echo line >/proc/comx/comx0/clock_source - papaya:# echo internal >/proc/comx/comx0/clock_source - - line - A Tx orajelet a vett adatfolyambol dekodolja, igyekszik - igazodni hozza. Ha nem lat orajelet az inputon, akkor - atall a sajat orajelgeneratorara. - internal - A Tx orajelet a sajat orajelgeneratora szolgaltatja. - - Default: line - - Normal osszeallitas eseten a tavkozlesi szolgaltato eszkoze - (pl. HDSL modem) adja az orajelet, ezert ez a default. - - -comx0/framing - A CRC4 ki/be kapcsolasa - - A CRC4: 16 PCM keretet (A PCM keret az, amibe a 32 darab 64 - kilobites csatorna van bemultiplexalva. Nem osszetevesztendo a HDLC - kerettel.) 2x8 -as csoportokra osztanak, es azokhoz 4-4 bites CRC-t - szamolnak. Elsosorban a vonal minosegenek a monitorozasara szolgal. - - papaya:~# echo crc4 >/proc/comx/comx0/framing - papaya:~# echo no-crc4 >/proc/comx/comx0/framing - - Default a 'crc4', a MATAV vonalak altalaban igy futnak. De ha nem - egyforma is a beallitas a vonal ket vegen, attol a forgalom altalaban - at tud menni. 
- - -comx0/linecode - A vonali kodolas beallitasa - - papaya:~# echo hdb3 >/proc/comx/comx0/linecode - papaya:~# echo ami >/proc/comx/comx0/linecode - - Default a 'hdb3', a MATAV vonalak igy futnak. - - (az AMI kodolas igen ritka E1-es vonalaknal). Ha ez a beallitas nem - egyezik a vonal ket vegen, akkor elofordulhat hogy a keretszinkron - osszejon, de CRC4-hibak es a vonalakon atvitt adatokban is hibak - keletkeznek (amit a HDLC/SyncPPP szinten CRC-hibaval jelez) - - -comx0/reg - a kartya aramkoreinek, a MUNICH (reg) es a FALC (lbireg) -comx0/lbireg regisztereinek kozvetlen elerese. Hasznalata: - - echo >reg 0x04 0x0 - a 4-es regiszterbe 0-t ir - echo >reg 0x104 - printk()-val kiirja a 4-es regiszter - tartalmat a syslogba. - - WARNING: ezek csak a fejleszteshez keszultek, sok galibat - lehet veluk okozni! - - -comx0/loopback - A kartya G.703 jelenek a visszahurkolasara is van lehetoseg: - - papaya:# echo none >/proc/comx/comx0/loopback - papaya:# echo local >/proc/comx/comx0/loopback - papaya:# echo remote >/proc/comx/comx0/loopback - - none - nincs visszahurkolas, normal mukodes - local - a kartya a sajat maga altal adott jelet kapja vissza - remote - a kartya a kivulrol vett jelet adja kifele - - Default: none - ------------------------------------------------------------------ - -Az interfeszhez (Cisco terminologiaban 'channel-group') kapcsolodo beallitasok: - -comx0/timeslots - mely timeslotok (idoszeletek) tartoznak az adott interfeszhez. - - papaya:~# cat /proc/comx/comx0/timeslots - 1 3 4 5 6 - papaya:~# - - Egy timeslot megkeresese (hanyas interfeszbe tartozik nalunk): - - papaya:~# grep ' 4' /proc/comx/comx*/timeslots - /proc/comx/comx0/timeslots:1 3 4 5 6 - papaya:~# - - Beallitasa: - papaya:~# echo '1 5 2 6 7 8' >/proc/comx/comx0/timeslots - - A timeslotok sorrendje nem szamit, '1 3 2' ugyanaz mint az '1 2 3'. 
- - Beallitashoz az adott interfesznek DOWN-ban kell lennie - (ifconfig comx0 down), de ugyanannak a kartyanak a tobbi interfesze - uzemelhet kozben. - - Beallitaskor leellenorzi, hogy az uj timeslotok nem utkoznek-e egy - masik interfesz timeslotjaival. Ha utkoznek, akkor nem allitja at. - - Mindig 10-es szamrendszerben tortenik a timeslotok ertelmezese, nehogy - a 08, 09 alaku felirast rosszul ertelmezze. - ------------------------------------------------------------------ - -Az interfeszek es a kartya allapotanak lekerdezese: - -- A ' '-szel kezdodo sorok az eredeti kimenetet, a //-rel kezdodo sorok a -magyarazatot jelzik. - - papaya:~$ cat /proc/comx/comx1/status - Interface administrative status is UP, modem status is UP, protocol is UP - Modem status changes: 0, Transmitter status is IDLE, tbusy: 0 - Interface load (input): 978376 / 947808 / 951024 bits/s (5s/5m/15m) - (output): 978376 / 947848 / 951024 bits/s (5s/5m/15m) - Debug flags: none - RX errors: len: 22, overrun: 1, crc: 0, aborts: 0 - buffer overrun: 0, pbuffer overrun: 0 - TX errors: underrun: 0 - Line keepalive (value: 10) status UP [0] - -// Itt kezdodik a hardver-specifikus resz: - Controller status: - No alarms - -// Alarm: hibajelzes: -// -// No alarms - minden rendben -// -// LOS - Loss Of Signal - nem erzekel jelet a bemeneten. -// AIS - Alarm Indication Signal - csak egymas utani 1-esek jonnek -// a bemeneten, a tuloldal igy is jelezheti hogy meghibasodott vagy -// nincs inicializalva. -// AUXP - Auxiliary Pattern Indication - 01010101.. sorozat jon a bemeneten. -// LFA - Loss of Frame Alignment - nincs keretszinkron -// RRA - Receive Remote Alarm - a tuloldal el, de hibat jelez. 
-// LMFA - Loss of CRC4 Multiframe Alignment - nincs CRC4-multikeret-szinkron -// NMF - No Multiframe alignment Found after 400 msec - ilyen alarm a no-crc4 -// es crc4 keretezesek eseten nincs, lasd lentebb -// -// Egyeb lehetseges hibajelzesek: -// -// Transmit Line Short - a kartya ugy erzi hogy az adasi kimenete rovidre -// van zarva, ezert kikapcsolta az adast. (nem feltetlenul veszi eszre -// a kulso rovidzarat) - -// A veteli oldal csomagjainak lancolt listai, debug celokra: - - Rx ring: - rafutott: 0 - lastcheck: 50845731, jiffies: 51314281 - base: 017b1858 - rx_desc_ptr: 0 - rx_desc_ptr: 017b1858 - hw_curr_ptr: 017b1858 - 06040000 017b1868 017b1898 c016ff00 - 06040000 017b1878 017b1e9c c016ff00 - 46040000 017b1888 017b24a0 c016ff00 - 06040000 017b1858 017b2aa4 c016ff00 - -// A kartyat hasznalo tobbi interfesz: a 0-s channel-group a comx1 interfesz, -// es az 1,2,...,16 timeslotok tartoznak hozza: - - Interfaces using this board: (channel-group, interface, timeslots) - 0 comx1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 - 1 comx2: 17 - 2 comx3: 18 - 3 comx4: 19 - 4 comx5: 20 - 5 comx6: 21 - 6 comx7: 22 - 7 comx8: 23 - 8 comx9: 24 - 9 comx10: 25 - 10 comx11: 26 - 11 comx12: 27 - 12 comx13: 28 - 13 comx14: 29 - 14 comx15: 30 - 15 comx16: 31 - -// Hany esemenyt kezelt le a driver egy-egy hardver-interrupt kiszolgalasanal: - - Interrupt work histogram: - hist[ 0]: 0 hist[ 1]: 2 hist[ 2]: 18574 hist[ 3]: 79 - hist[ 4]: 14 hist[ 5]: 1 hist[ 6]: 0 hist[ 7]: 1 - hist[ 8]: 0 hist[ 9]: 7 - -// Hany kikuldendo csomag volt mar a Tx-ringben amikor ujabb lett irva bele: - - Tx ring histogram: - hist[ 0]: 2329 hist[ 1]: 0 hist[ 2]: 0 hist[ 3]: 0 - -// Az E1-interfesz hiba-szamlaloi, az rfc2495-nek megfeleloen: -// (kb. 
a Cisco routerek "show controllers e1" formatumaban: http://www.cisco.com/univercd/cc/td/doc/product/software/ios11/rbook/rinterfc.htm#xtocid25669126) - -Data in current interval (91 seconds elapsed): - 9516 Line Code Violations, 65 Path Code Violations, 2 E-Bit Errors - 0 Slip Secs, 2 Fr Loss Secs, 2 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 11 Unavail Secs -Data in Interval 1 (15 minutes): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 4 intervals (1 hour): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 96 intervals (24 hours): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs - ------------------------------------------------------------------ - -Nehany kulonlegesebb beallitasi lehetoseg (idovel beepulhetnek majd a driverbe): -Ezekkel sok galibat lehet okozni, nagyon ovatosan kell oket hasznalni! - - modified CRC-4, for improved interworking of CRC-4 and non-CRC-4 - devices: (lasd page 107 es g706 Annex B) - lbireg[ 0x1b ] |= 0x08 - lbireg[ 0x1c ] |= 0xc0 - - ilyenkor ertelmezett az NMF - 'No Multiframe alignment Found after - 400 msec' alarm. 
- - FALC - a vonali meghajto IC - local loop - a sajat adasomat halljam vissza - remote loop - a kivulrol jovo adast adom vissza - - Egy hibakeresesre hasznalhato dolog: - - 1-es timeslot local loop a FALC-ban: echo >lbireg 0x1d 0x21 - - local loop kikapcsolasa: echo >lbireg 0x1d 0x00 diff --git a/Documentation/networking/slicecom.txt b/Documentation/networking/slicecom.txt deleted file mode 100644 index c82c0cf..0000000 --- a/Documentation/networking/slicecom.txt +++ /dev/null @@ -1,369 +0,0 @@ - -SliceCOM adapter user's documentation - for the 0.51 driver version - -Written by Bartók István <bartoki@itc.hu> - -English translation: Lakatos György <gyuri@itc.hu> -Mon Dec 11 15:28:42 CET 2000 - -Last modified: Wed Aug 29 17:25:37 CEST 2001 - ------------------------------------------------------------------ - -Usage: - -Compiling the kernel: - -Code maturity level options - [*] Prompt for development and/or incomplete code/drivers - -Network device support - Wan interfaces - <M> MultiGate (COMX) synchronous - <M> Support for MUNICH based boards: SliceCOM, PCICOM (NEW) - <M> Support for HDLC and syncPPP... - - -Loading the modules: - -modprobe comx - -modprobe comx-proto-ppp # module for Cisco-HDLC and SyncPPP protocols - -modprobe comx-hw-munich # the module logs information by the kernel - # about the detected boards - - -Configuring the board: - -# This interface will use the Cisco-HDLC line protocol, -# the timeslices assigned are 1,2 (128 KiBit line speed) -# (the first data timeslice in the G.703 frame is no. 1) -# -mkdir /proc/comx/comx0.1/ -echo slicecom >/proc/comx/comx0.1/boardtype -echo hdlc >/proc/comx/comx0.1/protocol -echo 1 2 >/proc/comx/comx0.1/timeslots - - -# This interface uses SyncPPP line protocol, the assigned -# is no. 3 (64 KiBit line speed) -# -mkdir /proc/comx/comx0.2/ -echo slicecom >/proc/comx/comx0.2/boardtype -echo ppp >/proc/comx/comx0.2/protocol -echo 3 >/proc/comx/comx0.2/timeslots - -... 
- -ifconfig comx0.1 up -ifconfig comx0.2 up - ------------------------------------------------------------------ - -The COMX interfaces use a 10 packet transmit queue by default, however WAN -networks sometimes use bigger values (20 to 100), to utilize the line better -by large traffic (though the line delay increases because of more packets -join the queue). - -# ifconfig comx0 txqueuelen 50 - -This option is only supported by the ifconfig command of the later -distributions, which came with 2.2 kernels, such as RedHat 6.1 or Debian 2.2. - -You can download a newer netbase packet from -http://www.debian.org/~rcw/2.2/netbase/ for Debian 2.1, which has a new -ifconfig. You can get further information about using 2.2 kernel with -Debian 2.1 from http://www.debian.org/releases/stable/running-kernel-2.2 - ------------------------------------------------------------------ - -The SliceCom LEDs: - -red - on, if the interface is unconfigured, or it gets Remote Alarm-s -green - on, if the board finds frame-sync in the received signal - -A bit more detailed: - -red: green: meaning: - -- - no frame-sync, no signal received, or signal SNAFU. -- on "Everything is OK" -on on Reception is ok, but the remote end sends Remote Alarm -on - The interface is unconfigured - ------------------------------------------------------------------ - -A more detailed description of the hardware setting options: - -The general and the protocol layer options described in the 'comx.txt' file -apply to the SliceCom as well, I only summarize the SliceCom hardware specific -settings below. - -The '/proc/comx' configuring interface: - -An interface directory should be created for every timeslot group with -'mkdir', e,g: 'comx0', 'comx1' etc. The timeslots can be assigned here to the -specific interface. The Cisco-like naming convention (serial3:1 - first -timeslot group of the 3rd. board) can't be used here, because these mean IP -aliasing in Linux. 
- -You can give any meaningful name to keep the configuration clear; -e.g: 'comx0.1', 'comx0.2', 'comx1.1', comx1.2', if you have two boards -with two interfaces each. - -Settings, which apply to the board: - -Neither 'io' nor 'irq' settings required, the driver uses the resources -given by the PCI BIOS. - -comx0/boardnum - board number of the SliceCom in the PC (using the 'natural' - PCI order) as listed in '/proc/pci' or the output of the - 'lspci' command, generally the slots nearer to the motherboard - PCI driver chips have the lower numbers. - - Default: 0 (the counting starts with 0) - -Though the options below are to be set on a single interface, they apply to the -whole board. The restriction, to use them on 'UP' interfaces, is because the -command sequence below could lead to unpredictable results. - - # echo 0 >boardnum - # echo internal >clock_source - # echo 1 >boardnum - -The sequence would set the clock source of board 0. - -These settings will persist after all the interfaces are cleared, but are -cleared when the driver module is unloaded and loaded again. - -comx0/clock_source - source of the transmit clock - Usage: - - # echo line >/proc/comx/comx0/clock_source - # echo internal >/proc/comx/comx0/clock_source - - line - The Tx clock is being decoded if the input data stream, - if no clock seen on the input, then the board will use it's - own clock generator. - - internal - The Tx clock is supplied by the builtin clock generator. - - Default: line - - Normally, the telecommunication company's end device (the HDSL - modem) provides the Tx clock, that's why 'line' is the default. - -comx0/framing - Switching CRC4 off/on - - CRC4: 16 PCM frames (The 32 64Kibit channels are multiplexed into a - PCM frame, nothing to do with HDLC frames) are divided into 2x8 - groups, each group has a 4 bit CRC. - - # echo crc4 >/proc/comx/comx0/framing - # echo no-crc4 >/proc/comx/comx0/framing - - Default is 'crc4', the Hungarian MATAV lines behave like this. 
- The traffic generally passes if this setting on both ends don't match. - -comx0/linecode - Setting the line coding - - # echo hdb3 >/proc/comx/comx0/linecode - # echo ami >/proc/comx/comx0/linecode - - Default a 'hdb3', MATAV lines use this. - - (AMI coding is rarely used with E1 lines). Frame sync may occur, if - this setting doesn't match the other end's, but CRC4 and data errors - will come, which will result in CRC errors on HDLC/SyncPPP level. - -comx0/reg - direct access to the board's MUNICH (reg) and FALC (lbireg) -comx0/lbireg circuit's registers - - # echo >reg 0x04 0x0 - write 0 to register 4 - # echo >reg 0x104 - write the contents of register 4 with - printk() to syslog - -WARNING! These are only for development purposes, messing with this will - result much trouble! - -comx0/loopback - Places a loop to the board's G.703 signals - - # echo none >/proc/comx/comx0/loopback - # echo local >/proc/comx/comx0/loopback - # echo remote >/proc/comx/comx0/loopback - - none - normal operation, no loop - local - the board receives it's own output - remote - the board sends the received data to the remote side - - Default: none - ------------------------------------------------------------------ - -Interface (channel group in Cisco terms) settings: - -comx0/timeslots - which timeslots belong to the given interface - - Setting: - - # echo '1 5 2 6 7 8' >/proc/comx/comx0/timeslots - - # cat /proc/comx/comx0/timeslots - 1 2 5 6 7 8 - # - - Finding a timeslot: - - # grep ' 4' /proc/comx/comx*/timeslots - /proc/comx/comx0/timeslots:1 3 4 5 6 - # - - The timeslots can be in any order, '1 2 3' is the same as '1 3 2'. - - The interface has to be DOWN during the setting ('ifconfig comx0 - down'), but the other interfaces could operate normally. - - The driver checks if the assigned timeslots are vacant, if not, then - the setting won't be applied. - - The timeslot values are treated as decimal numbers, not to misunderstand - values of 08, 09 form. 
- ------------------------------------------------------------------ - -Checking the interface and board status: - -- Lines beginning with ' ' (space) belong to the original output, the lines -which begin with '//' are the comments. - - papaya:~$ cat /proc/comx/comx1/status - Interface administrative status is UP, modem status is UP, protocol is UP - Modem status changes: 0, Transmitter status is IDLE, tbusy: 0 - Interface load (input): 978376 / 947808 / 951024 bits/s (5s/5m/15m) - (output): 978376 / 947848 / 951024 bits/s (5s/5m/15m) - Debug flags: none - RX errors: len: 22, overrun: 1, crc: 0, aborts: 0 - buffer overrun: 0, pbuffer overrun: 0 - TX errors: underrun: 0 - Line keepalive (value: 10) status UP [0] - -// The hardware specific part starts here: - Controller status: - No alarms - -// Alarm: -// -// No alarms - Everything OK -// -// LOS - Loss Of Signal - No signal sensed on the input -// AIS - Alarm Indication Signal - The remote side sends '11111111'-s, -// it tells, that there's an error condition, or it's not -// initialised. -// AUXP - Auxiliary Pattern Indication - 01010101.. received. -// LFA - Loss of Frame Alignment - no frame sync received. -// RRA - Receive Remote Alarm - the remote end's OK, but signals error cond. -// LMFA - Loss of CRC4 Multiframe Alignment - no CRC4 multiframe sync. -// NMF - No Multiframe alignment Found after 400 msec - no such alarm using -// no-crc4 or crc4 framing, see below. -// -// Other possible error messages: -// -// Transmit Line Short - the board felt, that it's output is short-circuited, -// so it switched the transmission off. (The board can't definitely tell, -// that it's output is short-circuited.) 
- -// Chained list of the received packets, for debug purposes: - - Rx ring: - rafutott: 0 - lastcheck: 50845731, jiffies: 51314281 - base: 017b1858 - rx_desc_ptr: 0 - rx_desc_ptr: 017b1858 - hw_curr_ptr: 017b1858 - 06040000 017b1868 017b1898 c016ff00 - 06040000 017b1878 017b1e9c c016ff00 - 46040000 017b1888 017b24a0 c016ff00 - 06040000 017b1858 017b2aa4 c016ff00 - -// All the interfaces using the board: comx1, using the 1,2,...16 timeslots, -// comx2, using timeslot 17, etc. - - Interfaces using this board: (channel-group, interface, timeslots) - 0 comx1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 - 1 comx2: 17 - 2 comx3: 18 - 3 comx4: 19 - 4 comx5: 20 - 5 comx6: 21 - 6 comx7: 22 - 7 comx8: 23 - 8 comx9: 24 - 9 comx10: 25 - 10 comx11: 26 - 11 comx12: 27 - 12 comx13: 28 - 13 comx14: 29 - 14 comx15: 30 - 15 comx16: 31 - -// The number of events handled by the driver during an interrupt cycle: - - Interrupt work histogram: - hist[ 0]: 0 hist[ 1]: 2 hist[ 2]: 18574 hist[ 3]: 79 - hist[ 4]: 14 hist[ 5]: 1 hist[ 6]: 0 hist[ 7]: 1 - hist[ 8]: 0 hist[ 9]: 7 - -// The number of packets to send in the Tx ring, when a new one arrived: - - Tx ring histogram: - hist[ 0]: 2329 hist[ 1]: 0 hist[ 2]: 0 hist[ 3]: 0 - -// The error counters of the E1 interface, according to the RFC2495, -// (similar to the Cisco "show controllers e1" command's output: -// http://www.cisco.com/univercd/cc/td/doc/product/software/ios11/rbook/rinterfc.htm#xtocid25669126) - -Data in current interval (91 seconds elapsed): - 9516 Line Code Violations, 65 Path Code Violations, 2 E-Bit Errors - 0 Slip Secs, 2 Fr Loss Secs, 2 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 11 Unavail Secs -Data in Interval 1 (15 minutes): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 4 intervals (1 hour): - 0 Line Code 
Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 96 intervals (24 hours): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs - ------------------------------------------------------------------ - -Some unique options, (may get into the driver later): -Treat them very carefully, these can cause much trouble! - - modified CRC-4, for improved interworking of CRC-4 and non-CRC-4 - devices: (see page 107 and g706 Annex B) - lbireg[ 0x1b ] |= 0x08 - lbireg[ 0x1c ] |= 0xc0 - - - The NMF - 'No Multiframe alignment Found after 400 msec' alarm - comes into account. - - FALC - the line driver chip. - local loop - I hear my transmission back. - remote loop - I echo the remote transmission back. - - Something useful for finding errors: - - - local loop for timeslot 1 in the FALC chip: - - # echo >lbireg 0x1d 0x21 - - - Switching the loop off: - - # echo >lbireg 0x1d 0x00 diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt new file mode 100644 index 0000000..01e716d --- /dev/null +++ b/Documentation/networking/tc-actions-env-rules.txt @@ -0,0 +1,29 @@ + +The "enviromental" rules for authors of any new tc actions are: + +1) If you stealeth or borroweth any packet thou shalt be branching +from the righteous path and thou shalt cloneth. + +For example if your action queues a packet to be processed later +or intentionaly branches by redirecting a packet then you need to +clone the packet. +There are certain fields in the skb tc_verd that need to be reset so we +avoid loops etc. A few are generic enough so much so that skb_act_clone() +resets them for you. 
So invoke skb_act_clone() rather than skb_clone()
+
+2) If you munge any packet thou shalt call pskb_expand_head in the case
+someone else is referencing the skb. After that you "own" the skb.
+You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
+this way any action downstream can stomp on the packet.
+
+3) dropping packets you dont own is a nono. You simply return
+TC_ACT_SHOT to the caller and they will drop it.
+
+The "enviromental" rules for callers of actions (qdiscs etc) are:
+
+*) thou art responsible for freeing anything returned as being
+TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
+returned then all is great and you dont need to do anything.
+
+Post on netdev if something is unclear.
+
diff --git a/Documentation/powerpc/booting-without-of.txt b/Documentation/powerpc/booting-without-of.txt
index a96e853..ac1be25 100644
--- a/Documentation/powerpc/booting-without-of.txt
+++ b/Documentation/powerpc/booting-without-of.txt
@@ -52,6 +52,7 @@ Table of Contents
       i) Freescale QUICC Engine module (QE)
       j) CFI or JEDEC memory-mapped NOR flash
       k) Global Utilities Block
+      l) Xilinx IP cores
 
   VII - Specifying interrupt information for devices
     1) interrupts property
@@ -851,12 +852,18 @@ address which can extend beyond that limit.
       /cpus/PowerPC,970FX@0
       /cpus/PowerPC,970FX@1
       (unit addresses do not require leading zeroes)
-    - d-cache-line-size : one cell, L1 data cache line size in bytes
-    - i-cache-line-size : one cell, L1 instruction cache line size in
+    - d-cache-block-size : one cell, L1 data cache block size in bytes (*)
+    - i-cache-block-size : one cell, L1 instruction cache block size in
       bytes
     - d-cache-size : one cell, size of L1 data cache in bytes
     - i-cache-size : one cell, size of L1 instruction cache in bytes
 
+(*) The cache "block" size is the size on which the cache management
+instructions operate. Historically, this document used the cache
+"line" size here which is incorrect.
The kernel will prefer the cache +block size and will fall back to cache line size for backward +compatibility. + Recommended properties: - timebase-frequency : a cell indicating the frequency of the @@ -870,6 +877,10 @@ address which can extend beyond that limit. for the above, the common code doesn't use that property, but you are welcome to re-use the pSeries or Maple one. A future kernel version might provide a common function for this. + - d-cache-line-size : one cell, L1 data cache line size in bytes + if different from the block size + - i-cache-line-size : one cell, L1 instruction cache line size in + bytes if different from the block size You are welcome to add any property you find relevant to your board, like some information about the mechanism used to soft-reset the @@ -2242,6 +2253,266 @@ platforms are moved over to use the flattened-device-tree model. available. For Axon: 0x0000012a + l) Xilinx IP cores + + The Xilinx EDK toolchain ships with a set of IP cores (devices) for use + in Xilinx Spartan and Virtex FPGAs. The devices cover the whole range + of standard device types (network, serial, etc.) and miscellaneous + devices (gpio, LCD, spi, etc.). Also, since these devices are + implemented within the FPGA fabric, every instance of the device can be + synthesised with different options that change the behaviour. + + Each IP-core has a set of parameters which the FPGA designer can use to + control how the core is synthesized. Historically, the EDK tool would + extract the device parameters relevant to device drivers and copy them + into an 'xparameters.h' in the form of #define symbols. This tells the + device drivers how the IP cores are configured, but it requires the kernel + to be recompiled every time the FPGA bitstream is resynthesized. + + The new approach is to export the parameters into the device tree and + generate a new device tree each time the FPGA bitstream changes.
The + parameters which used to be exported as #defines will now become + properties of the device node. In general, device nodes for IP-cores + will take the following form: + + (name)@(base-address) { + compatible = "xlnx,(ip-core-name)-(HW_VER)" + [, (list of compatible devices), ...]; + reg = <(baseaddr) (size)>; + interrupt-parent = <&interrupt-controller-phandle>; + interrupts = < ... >; + xlnx,(parameter1) = "(string-value)"; + xlnx,(parameter2) = <(int-value)>; + }; + + (ip-core-name): the name of the ip block (given after the BEGIN + directive in system.mhs). Should be in lowercase + and all underscores '_' converted to dashes '-'. + (name): is derived from the "PARAMETER INSTANCE" value. + (parameter#): C_* parameters from system.mhs. The C_ prefix is + dropped from the parameter name, the name is converted + to lowercase and all underscore '_' characters are + converted to dashes '-'. + (baseaddr): the C_BASEADDR parameter. + (HW_VER): from the HW_VER parameter. + (size): equals C_HIGHADDR - C_BASEADDR + 1 + + Typically, the compatible list will include the exact IP core version + followed by an older IP core version which implements the same + interface or any other device with the same interface. + + 'reg', 'interrupt-parent' and 'interrupts' are all optional properties. 
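The naming rules above (drop the C_ prefix, lowercase the name, turn '_' into '-', and compute the size as C_HIGHADDR - C_BASEADDR + 1) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: mhs_param_to_prop() and mhs_reg_size() are hypothetical helper names, not functions from the EDK tools or the kernel.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of EDK or the kernel): turn a system.mhs
 * parameter name such as "C_DATA_BITS" into a device tree property name
 * such as "xlnx,data-bits".
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int mhs_param_to_prop(const char *param, char *out, size_t outlen)
{
	static const char prefix[] = "xlnx,";
	size_t n = sizeof(prefix) - 1;
	size_t i;

	/* Drop the leading "C_" if it is present. */
	if (strncmp(param, "C_", 2) == 0)
		param += 2;

	/* Prefix + converted name + terminating NUL must fit. */
	if (n + strlen(param) + 1 > outlen)
		return -1;

	memcpy(out, prefix, n);
	for (i = 0; param[i] != '\0'; i++) {
		char c = param[i];

		/* Underscores become dashes, everything else is lowercased. */
		out[n + i] = (c == '_') ? '-' : (char)tolower((unsigned char)c);
	}
	out[n + i] = '\0';
	return 0;
}

/* Hypothetical helper: size cell of 'reg' is C_HIGHADDR - C_BASEADDR + 1. */
static unsigned long mhs_reg_size(unsigned long baseaddr, unsigned long highaddr)
{
	return highaddr - baseaddr + 1;
}
```

With these helpers, "C_DATA_BITS" maps to "xlnx,data-bits", and a core spanning 0xEC100000 to 0xEC10FFFF gets a size cell of 0x10000.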
+ + For example, the following block from system.mhs: + + BEGIN opb_uartlite + PARAMETER INSTANCE = opb_uartlite_0 + PARAMETER HW_VER = 1.00.b + PARAMETER C_BAUDRATE = 115200 + PARAMETER C_DATA_BITS = 8 + PARAMETER C_ODD_PARITY = 0 + PARAMETER C_USE_PARITY = 0 + PARAMETER C_CLK_FREQ = 50000000 + PARAMETER C_BASEADDR = 0xEC100000 + PARAMETER C_HIGHADDR = 0xEC10FFFF + BUS_INTERFACE SOPB = opb_7 + PORT OPB_Clk = CLK_50MHz + PORT Interrupt = opb_uartlite_0_Interrupt + PORT RX = opb_uartlite_0_RX + PORT TX = opb_uartlite_0_TX + PORT OPB_Rst = sys_bus_reset_0 + END + + becomes the following device tree node: + + opb-uartlite-0@ec100000 { + device_type = "serial"; + compatible = "xlnx,opb-uartlite-1.00.b"; + reg = <ec100000 10000>; + interrupt-parent = <&opb-intc>; + interrupts = <1 0>; // got this from the opb_intc parameters + current-speed = <d#115200>; // standard serial device prop + clock-frequency = <d#50000000>; // standard serial device prop + xlnx,data-bits = <8>; + xlnx,odd-parity = <0>; + xlnx,use-parity = <0>; + }; + + Some IP cores actually implement 2 or more logical devices. In this case, + the device should still describe the whole IP core with a single node + and add a child node for each logical device. The ranges property can + be used to translate from parent IP-core to the registers of each device. + (Note: this makes the assumption that both logical devices have the same + bus binding. If this is not true, then separate nodes should be used for + each logical device). The 'cell-index' property can be used to enumerate + logical devices within an IP core. For example, the following is the + system.mhs entry for the dual ps2 controller found on the ml403 reference + design. 
+ + BEGIN opb_ps2_dual_ref + PARAMETER INSTANCE = opb_ps2_dual_ref_0 + PARAMETER HW_VER = 1.00.a + PARAMETER C_BASEADDR = 0xA9000000 + PARAMETER C_HIGHADDR = 0xA9001FFF + BUS_INTERFACE SOPB = opb_v20_0 + PORT Sys_Intr1 = ps2_1_intr + PORT Sys_Intr2 = ps2_2_intr + PORT Clkin1 = ps2_clk_rx_1 + PORT Clkin2 = ps2_clk_rx_2 + PORT Clkpd1 = ps2_clk_tx_1 + PORT Clkpd2 = ps2_clk_tx_2 + PORT Rx1 = ps2_d_rx_1 + PORT Rx2 = ps2_d_rx_2 + PORT Txpd1 = ps2_d_tx_1 + PORT Txpd2 = ps2_d_tx_2 + END + + It would result in the following device tree nodes: + + opb_ps2_dual_ref_0@a9000000 { + ranges = <0 a9000000 2000>; + // If this device had extra parameters, then they would + // go here. + ps2@0 { + compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; + reg = <0 40>; + interrupt-parent = <&opb-intc>; + interrupts = <3 0>; + cell-index = <0>; + }; + ps2@1000 { + compatible = "xlnx,opb-ps2-dual-ref-1.00.a"; + reg = <1000 40>; + interrupt-parent = <&opb-intc>; + interrupts = <3 0>; + cell-index = <1>; + }; + }; + + Also, the system.mhs file defines bus attachments from the processor + to the devices. The device tree structure should reflect the bus + attachments.
Again an example; this system.mhs fragment: + + BEGIN ppc405_virtex4 + PARAMETER INSTANCE = ppc405_0 + PARAMETER HW_VER = 1.01.a + BUS_INTERFACE DPLB = plb_v34_0 + BUS_INTERFACE IPLB = plb_v34_0 + END + + BEGIN opb_intc + PARAMETER INSTANCE = opb_intc_0 + PARAMETER HW_VER = 1.00.c + PARAMETER C_BASEADDR = 0xD1000FC0 + PARAMETER C_HIGHADDR = 0xD1000FDF + BUS_INTERFACE SOPB = opb_v20_0 + END + + BEGIN opb_uart16550 + PARAMETER INSTANCE = opb_uart16550_0 + PARAMETER HW_VER = 1.00.d + PARAMETER C_BASEADDR = 0xa0000000 + PARAMETER C_HIGHADDR = 0xa0001FFF + BUS_INTERFACE SOPB = opb_v20_0 + END + + BEGIN plb_v34 + PARAMETER INSTANCE = plb_v34_0 + PARAMETER HW_VER = 1.02.a + END + + BEGIN plb_bram_if_cntlr + PARAMETER INSTANCE = plb_bram_if_cntlr_0 + PARAMETER HW_VER = 1.00.b + PARAMETER C_BASEADDR = 0xFFFF0000 + PARAMETER C_HIGHADDR = 0xFFFFFFFF + BUS_INTERFACE SPLB = plb_v34_0 + END + + BEGIN plb2opb_bridge + PARAMETER INSTANCE = plb2opb_bridge_0 + PARAMETER HW_VER = 1.01.a + PARAMETER C_RNG0_BASEADDR = 0x20000000 + PARAMETER C_RNG0_HIGHADDR = 0x3FFFFFFF + PARAMETER C_RNG1_BASEADDR = 0x60000000 + PARAMETER C_RNG1_HIGHADDR = 0x7FFFFFFF + PARAMETER C_RNG2_BASEADDR = 0x80000000 + PARAMETER C_RNG2_HIGHADDR = 0xBFFFFFFF + PARAMETER C_RNG3_BASEADDR = 0xC0000000 + PARAMETER C_RNG3_HIGHADDR = 0xDFFFFFFF + BUS_INTERFACE SPLB = plb_v34_0 + BUS_INTERFACE MOPB = opb_v20_0 + END + + Gives this device tree (some properties removed for clarity): + + plb-v34-0 { + #address-cells = <1>; + #size-cells = <1>; + device_type = "ibm,plb"; + ranges; // 1:1 translation + + plb-bram-if-cntlr-0@ffff0000 { + reg = <ffff0000 10000>; + }; + + opb-v20-0 { + #address-cells = <1>; + #size-cells = <1>; + ranges = <20000000 20000000 20000000 + 60000000 60000000 20000000 + 80000000 80000000 40000000 + c0000000 c0000000 20000000>; + + opb-uart16550-0@a0000000 { + reg = <a0000000 2000>; + }; + + opb-intc-0@d1000fc0 { + reg = <d1000fc0 20>; + }; + }; + }; + + That covers the general approach to binding
Xilinx IP cores into the + device tree. The following are bindings for specific devices: + + i) Xilinx ML300 Framebuffer + + Simple framebuffer device from the ML300 reference design (also on the + ML403 reference design as well as others). + + Optional properties: + - resolution = <xres yres> : pixel resolution of framebuffer. Some + implementations use a different resolution. + Default is <d#640 d#480> + - virt-resolution = <xvirt yvirt> : Size of framebuffer in memory. + Default is <d#1024 d#480>. + - rotate-display (empty) : rotate display 180 degrees. + + ii) Xilinx SystemACE + + The Xilinx SystemACE device is used to program FPGAs from an FPGA + bitstream stored on a CF card. It can also be used as a generic CF + interface device. + + Optional properties: + - 8-bit (empty) : Set this property for SystemACE in 8 bit mode + + iii) Xilinx EMAC and Xilinx TEMAC + + Xilinx Ethernet devices. In addition to general Xilinx properties + listed above, nodes for these devices should include a phy-handle + property, and may include other common network device properties + like local-mac-address. + + iv) Xilinx Uartlite + + Xilinx uartlite devices are simple fixed speed serial ports. + + Required properties: + - current-speed : Baud rate of uartlite + More devices will be defined as this spec matures.
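The 'ranges' translation used in the multi-device and bus examples earlier in this section can be sketched in C as below. This is a minimal sketch, assuming #address-cells = <1> and #size-cells = <1> on both buses; struct range_entry and translate_addr() are hypothetical names for illustration, not kernel APIs.

```c
#include <stddef.h>

/*
 * One (child-base, parent-base, size) triple from a 'ranges' property,
 * assuming one address cell and one size cell on both buses.
 */
struct range_entry {
	unsigned long child_base;
	unsigned long parent_base;
	unsigned long size;
};

/*
 * Hypothetical helper: translate an address on the child bus into an
 * address on the parent bus. Returns 0 on success, or -1 if no ranges
 * entry covers the child address.
 */
static int translate_addr(const struct range_entry *ranges, size_t count,
			  unsigned long child_addr, unsigned long *parent_addr)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (child_addr < ranges[i].child_base)
			continue;

		/* Offset of the address within this window. */
		unsigned long off = child_addr - ranges[i].child_base;

		if (off < ranges[i].size) {
			*parent_addr = ranges[i].parent_base + off;
			return 0;
		}
	}
	return -1;
}
```

For the dual ps2 node, with ranges = <0 a9000000 2000>, the child register block at 1000 translates to a9001000 on the parent bus, while an address of 2000 falls outside the window and is rejected.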
VII - Specifying interrupt information for devices diff --git a/Documentation/powerpc/mpc52xx-device-tree-bindings.txt b/Documentation/powerpc/mpc52xx-device-tree-bindings.txt index 5f7d536..5e03610 100644 --- a/Documentation/powerpc/mpc52xx-device-tree-bindings.txt +++ b/Documentation/powerpc/mpc52xx-device-tree-bindings.txt @@ -185,7 +185,7 @@ bestcomm@<addr> dma-controller mpc5200-bestcomm 5200 pic also requires Recommended soc5200 child nodes; populate as needed for your board name device_type compatible Description ---- ----------- ---------- ----------- -gpt@<addr> gpt mpc5200-gpt General purpose timers +gpt@<addr> gpt fsl,mpc5200-gpt General purpose timers rtc@<addr> rtc mpc5200-rtc Real time clock mscan@<addr> mscan mpc5200-mscan CAN bus controller pci@<addr> pci mpc5200-pci PCI bridge @@ -213,7 +213,7 @@ cell-index int When multiple devices are present, is the 5) General Purpose Timer nodes (child of soc5200 node) On the mpc5200 and 5200b, GPT0 has a watchdog timer function. If the board design supports the internal wdt, then the device node for GPT0 should -include the empty property 'has-wdt'. +include the empty property 'fsl,has-wdt'. 6) PSC nodes (child of soc5200 node) PSC nodes can define the optional 'port-number' property to force assignment diff --git a/Documentation/scsi/link_power_management_policy.txt b/Documentation/scsi/link_power_management_policy.txt new file mode 100644 index 0000000..d18993d --- /dev/null +++ b/Documentation/scsi/link_power_management_policy.txt @@ -0,0 +1,19 @@ +This parameter allows the user to set the link (interface) power management. +There are 3 possible options: + +Value Effect +---------------------------------------------------------------------------- +min_power Tell the controller to try to make the link use the + least possible power when possible. This may + sacrifice some performance due to increased latency + when coming out of lower power states. 
+ +max_performance Generally, this means no power management. Tell + the controller to have performance be a priority + over power management. + +medium_power Tell the controller to enter a lower power state + when possible, but do not enter the lowest power + state, thus improving latency over min_power setting. + + diff --git a/Documentation/scsi/sym53c8xx_2.txt b/Documentation/scsi/sym53c8xx_2.txt index 3d9f06b..49ea5c5 100644 --- a/Documentation/scsi/sym53c8xx_2.txt +++ b/Documentation/scsi/sym53c8xx_2.txt @@ -449,25 +449,14 @@ options as above. cmd_per_lun=#tags (#tags > 1) tagged command queuing enabled #tags will be truncated to the max queued commands configuration parameter. -10.2.2 Detailed control of tagged commands - This option allows you to specify a command queue depth for each device - that supports tagged command queueing. - Example: - tag_ctrl=10/t2t3q16-t5q24/t1u2q32 - will set devices queue depth as follow: - - controller #0 target #2 and target #3 -> 16 commands, - - controller #0 target #5 -> 24 commands, - - controller #1 target #1 logical unit #2 -> 32 commands, - - all other logical units (all targets, all controllers) -> 10 commands. - -10.2.3 Burst max +10.2.2 Burst max burst=0 burst disabled burst=255 get burst length from initial IO register settings. burst=#x burst enabled (1<<#x burst transfers max) #x is an integer value which is log base 2 of the burst transfers max. By default the driver uses the maximum value supported by the chip. -10.2.4 LED support +10.2.3 LED support led=1 enable LED support led=0 disable LED support Do not enable LED support if your scsi board does not use SDMS BIOS. @@ -560,9 +549,9 @@ Previously, the sym2 driver accepted arguments of the form sym53c8xx=tags:4,sync:10,debug:0x200 As a result of the new module parameters, this is no longer available. -Most of the options have remained the same, but tags has split into -cmd_per_lun and tag_ctrl for its two different purposes. 
The sample above -would be specified as: +Most of the options have remained the same, but tags has become +cmd_per_lun to reflect its different purposes. The sample above would +be specified as: modprobe sym53c8xx cmd_per_lun=4 sync=10 debug=0x200 or on the kernel boot line as: diff --git a/Documentation/video4linux/CARDLIST.em28xx b/Documentation/video4linux/CARDLIST.em28xx index a302668..37f0e3c 100644 --- a/Documentation/video4linux/CARDLIST.em28xx +++ b/Documentation/video4linux/CARDLIST.em28xx @@ -8,4 +8,7 @@ 7 -> Leadtek Winfast USB II (em2800) 8 -> Kworld USB2800 (em2800) 9 -> Pinnacle Dazzle DVC 90 (em2820/em2840) [2304:0207] + 10 -> Hauppauge WinTV HVR 900 (em2880) + 11 -> Terratec Hybrid XS (em2880) 12 -> Kworld PVR TV 2800 RF (em2820/em2840) + 13 -> Terratec Prodigy XS (em2880) diff --git a/Documentation/watchdog/src/watchdog-simple.c b/Documentation/watchdog/src/watchdog-simple.c index 47801bc..4cf72f3 100644 --- a/Documentation/watchdog/src/watchdog-simple.c +++ b/Documentation/watchdog/src/watchdog-simple.c @@ -3,15 +3,25 @@ #include <unistd.h> #include <fcntl.h> -int main(int argc, const char *argv[]) { +int main(void) +{ int fd = open("/dev/watchdog", O_WRONLY); + int ret = 0; if (fd == -1) { perror("watchdog"); - exit(1); + exit(EXIT_FAILURE); } while (1) { - write(fd, "\0", 1); - fsync(fd); + ret = write(fd, "\0", 1); + if (ret != 1) { + ret = -1; + break; + } + ret = fsync(fd); + if (ret) + break; sleep(10); } + close(fd); + return ret; }