diff options
author | Timothy Pearson <tpearson@raptorengineering.com> | 2017-08-23 14:45:25 -0500 |
---|---|---|
committer | Timothy Pearson <tpearson@raptorengineering.com> | 2017-08-23 14:45:25 -0500 |
commit | fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204 (patch) | |
tree | 22962a4387943edc841c72a4e636a068c66d58fd /Documentation/filesystems | |
download | ast2050-linux-kernel-fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204.zip ast2050-linux-kernel-fcbb27b0ec6dcbc5a5108cb8fb19eae64593d204.tar.gz |
Initial import of modified Linux 2.6.28 tree
Original upstream URL:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git | branch linux-2.6.28.y
Diffstat (limited to 'Documentation/filesystems')
72 files changed, 19464 insertions, 0 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX new file mode 100644 index 0000000..bc6b437 --- /dev/null +++ b/Documentation/filesystems/00-INDEX @@ -0,0 +1,118 @@ +00-INDEX + - this file (info on some of the filesystems supported by linux). +Exporting + - explanation of how to make filesystems exportable. +Locking + - info on locking rules as they pertain to Linux VFS. +9p.txt + - 9p (v9fs) is an implementation of the Plan 9 remote fs protocol. +adfs.txt + - info and mount options for the Acorn Advanced Disc Filing System. +afs.txt + - info and examples for the distributed AFS (Andrew File System) fs. +affs.txt + - info and mount options for the Amiga Fast File System. +automount-support.txt + - information about filesystem automount support. +befs.txt + - information about the BeOS filesystem for Linux. +bfs.txt + - info for the SCO UnixWare Boot Filesystem (BFS). +cifs.txt + - description of the CIFS filesystem. +coda.txt + - description of the CODA filesystem. +configfs/ + - directory containing configfs documentation and example code. +cramfs.txt + - info on the cram filesystem for small storage (ROMs etc). +dentry-locking.txt + - info on the RCU-based dcache locking model. +directory-locking + - info about the locking scheme used for directory operations. +dlmfs.txt + - info on the userspace interface to the OCFS2 DLM. +dnotify.txt + - info about directory notification in Linux. +ecryptfs.txt + - docs on eCryptfs: stacked cryptographic filesystem for Linux. +ext2.txt + - info, mount options and specifications for the Ext2 filesystem. +ext3.txt + - info, mount options and specifications for the Ext3 filesystem. +ext4.txt + - info, mount options and specifications for the Ext4 filesystem. +files.txt + - info on file management in the Linux kernel. +fuse.txt + - info on the Filesystem in User SpacE including mount options. +gfs2.txt + - info on the Global File System 2. +hfs.txt + - info on the Macintosh HFS Filesystem for Linux. +hfsplus.txt + - info on the Macintosh HFSPlus Filesystem for Linux. +hpfs.txt + - info and mount options for the OS/2 HPFS. +inotify.txt + - info on the powerful yet simple file change notification system. +isofs.txt + - info and mount options for the ISO 9660 (CDROM) filesystem. +jfs.txt + - info and mount options for the JFS filesystem. +locks.txt + - info on file locking implementations, flock() vs. fcntl(), etc. +mandatory-locking.txt + - info on the Linux implementation of Sys V mandatory file locking. +ncpfs.txt + - info on Novell Netware(tm) filesystem using NCP protocol. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. +ntfs.txt + - info and mount options for the NTFS filesystem (Windows NT). +ocfs2.txt + - info and mount options for the OCFS2 clustered filesystem. +porting + - various information on filesystem porting. +proc.txt + - info on Linux's /proc filesystem. +ramfs-rootfs-initramfs.txt + - info on the 'in memory' filesystems ramfs, rootfs and initramfs. +reiser4.txt + - info on the Reiser4 filesystem based on dancing tree algorithms. +relay.txt + - info on relay, for efficient streaming from kernel to user space. +romfs.txt + - description of the ROMFS filesystem. +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. +seq_file.txt + - how to use the seq_file API +sharedsubtree.txt + - a description of shared subtrees for namespaces. +smbfs.txt + - info on using filesystems with the SMB protocol (Win 3.11 and NT). +spufs.txt + - info and mount options for the SPU filesystem used on Cell. +sysfs-pci.txt + - info on accessing PCI device resources through sysfs. +sysfs.txt + - info on sysfs, a ram-based filesystem for exporting kernel objects. +sysv-fs.txt + - info on the SystemV/V7/Xenix/Coherent filesystem. +tmpfs.txt + - info on tmpfs, a filesystem that holds all files in virtual memory. +udf.txt + - info and mount options for the UDF filesystem. +ufs.txt + - info on the ufs filesystem. +unionfs/ + - info on the unionfs filesystem +vfat.txt + - info on using the VFAT filesystem used in Windows NT and Windows 95 +vfs.txt + - overview of the Virtual File System +xfs.txt + - info and mount options for the XFS filesystem. +xip.txt + - info on execute-in-place for file mappings. diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt new file mode 100644 index 0000000..bf80806 --- /dev/null +++ b/Documentation/filesystems/9p.txt @@ -0,0 +1,144 @@ + v9fs: Plan 9 Resource Sharing for Linux + ======================================= + +ABOUT +===== + +v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. + +This software was originally developed by Ron Minnich <rminnich@sandia.gov> +and Maya Gokhale. Additional development by Greg Watson +<gwatson@lanl.gov> and most recently Eric Van Hensbergen +<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox +<rsc@swtch.com>. + +The best detailed explanation of the Linux implementation and applications of +the 9p client is available in the form of a USENIX paper: + http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html + +Other applications are described in the following papers: + * XCPU & Clustering + http://www.xcpu.org/xcpu-talk.pdf + * KVMFS: control file system for KVM + http://www.xcpu.org/kvmfs.pdf + * CellFS: A New ProgrammingModel for the Cell BE + http://www.xcpu.org/cellfs-talk.pdf + * PROSE I/O: Using 9p to enable Application Partitions + http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + +USAGE +===== + +For remote file server: + + mount -t 9p 10.10.1.2 /mnt/9 + +For Plan 9 From User Space applications (http://swtch.com/plan9) + + mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER + +OPTIONS +======= + + trans=name select an alternative transport. Valid options are + currently: + unix - specifying a named pipe mount point + tcp - specifying a normal TCP/IP connection + fd - used passed file descriptors for connection + (see rfdno and wfdno) + virtio - connect to the next virtio channel available + (from lguest or KVM with trans_virtio module) + + uname=name user name to attempt mount as on the remote server. The + server may override or ignore this value. Certain user + names may require authentication. + + aname=name aname specifies the file tree to access when the server is + offering several exported file systems. + + cache=mode specifies a caching policy. By default, no caches are used. + loose = no attempts are made at consistency, + intended for exclusive, read-only mounts + + debug=n specifies debug level. The debug level is a bitmask. + 0x01 = display verbose error messages + 0x02 = developer debug (DEBUG_CURRENT) + 0x04 = display 9p trace + 0x08 = display VFS trace + 0x10 = display Marshalling debug + 0x20 = display RPC debug + 0x40 = display transport debug + 0x80 = display allocation debug + + rfdno=n the file descriptor for reading with trans=fd + + wfdno=n the file descriptor for writing with trans=fd + + maxdata=n the number of bytes to use for 9p packet payload (msize) + + port=n port to connect to on the remote server + + noextend force legacy mode (no 9p2000.u semantics) + + dfltuid attempt to mount as a particular uid + + dfltgid attempt to mount with a particular gid + + afid security channel - used by Plan 9 authentication protocols + + nodevmap do not map special files - represent them as normal files. + This can be used to share devices/named pipes/sockets between + hosts. This functionality will be expanded in later versions. + + access there are three access modes. + user = if a user tries to access a file on v9fs + filesystem for the first time, v9fs sends an + attach command (Tattach) for that user. + This is the default mode. + <uid> = allows only user with uid=<uid> to access + the files on the mounted filesystem + any = v9fs does single attach and performs all + operations as one user + +RESOURCES +========= + +Our current recommendation is to use Inferno (http://www.vitanuova.com/inferno) +as the 9p server. You can start a 9p server under Inferno by issuing the +following command: + ; styxlisten -A tcp!*!564 export '#U*' + +The -A specifies an unauthenticated export. The 564 is the port # (you may +have to choose a higher port number if running as a normal user). The '#U*' +specifies exporting the root of the Linux name space. You may specify a +subset of the namespace by extending the path: '#U*'/tmp would just export +/tmp. For more information, see the Inferno manual pages covering styxlisten +and export. + +A Linux version of the 9p server is now maintained under the npfs project +on sourceforge (http://sourceforge.net/projects/npfs). The currently +maintained version is the single-threaded version of the server (named spfs) +available from the same CVS repository. + +There are user and developer mailing lists available through the v9fs project +on sourceforge (http://sourceforge.net/projects/v9fs). + +News and other information is maintained on SWiK (http://swik.net/v9fs). + +Bug reports may be issued through the kernel.org bugzilla +(http://bugzilla.kernel.org) + +For more information on the Plan 9 Operating System check out +http://plan9.bell-labs.com/plan9 + +For information on Plan 9 from User Space (Plan 9 applications and libraries +ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 + + +STATUS +====== + +The 2.6 kernel support is working on PPC and x86. + +PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org) + diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/Exporting new file mode 100644 index 0000000..87019d2 --- /dev/null +++ b/Documentation/filesystems/Exporting @@ -0,0 +1,147 @@ + +Making Filesystems Exportable +============================= + +Overview +-------- + +All filesystem operations require a dentry (or two) as a starting +point. Local applications have a reference-counted hold on suitable +dentries via open file descriptors or cwd/root. However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry. As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (outside of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentries will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree. This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object. This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1/ The dcache must sometimes contain objects that are not part of the + proper prefix. i.e that are not connected to the root. +2/ The dcache must be prepared for a newly found (via ->lookup) directory + to already have a (non-connected) dentry, and must be able to move + that dentry into place (based on the parent and name in the + ->lookup). This is particularly needed for directories as + it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a/ A dentry flag DCACHE_DISCONNECTED which is set on + any dentry that might not be part of the proper prefix. + This is set when anonymous dentries are created, and cleared when a + dentry is noticed to be a child of a dentry which is in the proper + prefix. + +b/ A per-superblock list "s_anon" of dentries which are the roots of + subtrees that are not in the proper prefix. These dentries, as + well as the proper prefix, need to be released at unmount time. As + these dentries will not be hashed, they are linked together on the + d_hash list_head. + +c/ Helper routines to allocate anonymous dentries, and to help attach + loose directory dentries at lookup time. They are: + d_alloc_anon(inode) will return a dentry for the given inode. + If the inode already has a dentry, one of those is returned. + If it doesn't, a new anonymous (IS_ROOT and + DCACHE_DISCONNECTED) dentry is allocated and attached. + In the case of a directory, care is taken that only one dentry + can ever be attached. + d_splice_alias(inode, dentry) will make sure that there is a + dentry with the same name and parent as the given dentry, and + which refers to the given inode. + If the inode is a directory and already has a dentry, then that + dentry is d_moved over the given dentry. + If the passed dentry gets attached, care is taken that this is + mutually exclusive to a d_alloc_anon operation. + If the passed dentry is used, NULL is returned, else the used + dentry is returned. This corresponds to the calling pattern of + ->lookup. + + +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: + + 1/ provide the filehandle fragment routines described below. + 2/ make sure that d_splice_alias is used rather than d_add + when ->lookup finds an inode for a given parent and name. + Typically the ->lookup routine will end with a: + + return d_splice_alias(inode, dentry); + } + + + + A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block. This field must point to a "struct export_operations" +struct which has the following members: + + encode_fh (optional) + Takes a dentry and creates a filehandle fragment which can later be used + to find or create a dentry for the same object. The default + implementation creates a filehandle fragment that encodes a 32bit inode + and generation number for the inode encoded, and if necessary the + same information for the parent. + + fh_to_dentry (mandatory) + Given a filehandle fragment, this should find the implied object and + create a dentry for it (possibly with d_alloc_anon). + + fh_to_parent (optional but strongly recommended) + Given a filehandle fragment, this should find the parent of the + implied object and create a dentry for it (possibly with d_alloc_anon). + May fail if the filehandle fragment is too small. + + get_parent (optional but strongly recommended) + When given a dentry for a directory, this should return a dentry for + the parent. Quite possibly the parent dentry will have been allocated + by d_alloc_anon. The default get_parent function just returns an error + so any filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." entries + in the dcache which are too messy to work with. + + get_name (optional) + When given a parent dentry and a child dentry, this should find a name + in the directory identified by the parent dentry, which leads to the + object identified by the child dentry. If no get_name function is + supplied, a default implementation is provided which uses vfs_readdir + to find potential names, and matches inode numbers to find the correct + match. + + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it. This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls. Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking new file mode 100644 index 0000000..23d2f44 --- /dev/null +++ b/Documentation/filesystems/Locking @@ -0,0 +1,537 @@ + The text below describes the locking rules for VFS-related methods. +It is (believed to be) up-to-date. *Please*, if you change anything in +prototypes or locking protocols - update this file. And update the relevant +instances in the tree, don't leave that to maintainers of filesystems/devices/ +etc. At the very least, put the list of dubious cases in the end of this file. +Don't turn it into log - maintainers of out-of-the-tree code are supposed to +be able to use diff(1). + Thing currently missing here: socket operations. Alexey? + +--------------------------- dentry_operations -------------------------- +prototypes: + int (*d_revalidate)(struct dentry *, int); + int (*d_hash) (struct dentry *, struct qstr *); + int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + int (*d_delete)(struct dentry *); + void (*d_release)(struct dentry *); + void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); + +locking rules: + none have BKL + dcache_lock rename_lock ->d_lock may block +d_revalidate: no no no yes +d_hash no no no yes +d_compare: no yes no no +d_delete: yes no yes no +d_release: no no no yes +d_iput: no no no yes +d_dname: no no no no + +--------------------------- inode_operations --------------------------- +prototypes: + int (*create) (struct inode *,struct dentry *,int, struct nameidata *); + struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameid +ata *); + int (*link) (struct dentry *,struct inode *,struct dentry *); + int (*unlink) (struct inode *,struct dentry *); + int (*symlink) (struct inode *,struct dentry *,const char *); + int (*mkdir) (struct inode *,struct dentry *,int); + int (*rmdir) (struct inode *,struct dentry *); + int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*rename) (struct inode *, struct dentry *, + struct inode *, struct dentry *); + int (*readlink) (struct dentry *, char __user *,int); + int (*follow_link) (struct dentry *, struct nameidata *); + void (*truncate) (struct inode *); + int (*permission) (struct inode *, int, struct nameidata *); + int (*setattr) (struct dentry *, struct iattr *); + int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); + ssize_t (*listxattr) (struct dentry *, char *, size_t); + int (*removexattr) (struct dentry *, const char *); + +locking rules: + all may block, none have BKL + i_mutex(inode) +lookup: yes +create: yes +link: yes (both) +mknod: yes +symlink: yes +mkdir: yes +unlink: yes (both) +rmdir: yes (both) (see below) +rename: yes (all) (see below) +readlink: no +follow_link: no +truncate: yes (see below) +setattr: yes +permission: no +getattr: no +setxattr: yes +getxattr: no +listxattr: no +removexattr: yes + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on +victim. + cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. + ->truncate() is never called directly - it's a callback, not a +method. It's called by vmtruncate() - library function normally used by +->setattr(). Locking information above applies to that call (i.e. is +inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been +passed). + +See Documentation/filesystems/directory-locking for more detailed discussion +of the locking scheme for directory operations. + +--------------------------- super_operations --------------------------- +prototypes: + struct inode *(*alloc_inode)(struct super_block *sb); + void (*destroy_inode)(struct inode *); + void (*dirty_inode) (struct inode *); + int (*write_inode) (struct inode *, int); + void (*drop_inode) (struct inode *); + void (*delete_inode) (struct inode *); + void (*put_super) (struct super_block *); + void (*write_super) (struct super_block *); + int (*sync_fs)(struct super_block *sb, int wait); + void (*write_super_lockfs) (struct super_block *); + void (*unlockfs) (struct super_block *); + int (*statfs) (struct dentry *, struct kstatfs *); + int (*remount_fs) (struct super_block *, int *, char *); + void (*clear_inode) (struct inode *); + void (*umount_begin) (struct super_block *); + int (*show_options)(struct seq_file *, struct vfsmount *); + ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); + ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + +locking rules: + All may block. + BKL s_lock s_umount +alloc_inode: no no no +destroy_inode: no +dirty_inode: no (must not sleep) +write_inode: no +drop_inode: no !!!inode_lock!!! +delete_inode: no +put_super: yes yes no +write_super: no yes read +sync_fs: no no read +write_super_lockfs: ? +unlockfs: ? +statfs: no no no +remount_fs: yes yes maybe (see below) +clear_inode: no +umount_begin: yes no no +show_options: no (vfsmount->sem) +quota_read: no no no (see below) +quota_write: no no no (see below) + +->remount_fs() will have the s_umount lock if it's already mounted. +When called from get_sb_single, it does NOT have the s_umount lock. +->quota_read() and ->quota_write() functions are both guaranteed to +be the only ones operating on the quota file by the quota code (via +dqio_sem) (unless an admin really wants to screw up something and +writes to quota files with quotas on). For other details about locking +see also dquot_operations section. + +--------------------------- file_system_type --------------------------- +prototypes: + int (*get_sb) (struct file_system_type *, int, + const char *, void *, struct vfsmount *); + void (*kill_sb) (struct super_block *); +locking rules: + may block BKL +get_sb yes no +kill_sb yes no + +->get_sb() returns error or 0 with locked superblock attached to the vfsmount +(exclusive on ->s_umount). +->kill_sb() takes a write-locked superblock, does all shutdown work on it, +unlocks and drops the reference. + +--------------------------- address_space_operations -------------------------- +prototypes: + int (*writepage)(struct page *page, struct writeback_control *wbc); + int (*readpage)(struct file *, struct page *); + int (*sync_page)(struct page *); + int (*writepages)(struct address_space *, struct writeback_control *); + int (*set_page_dirty)(struct page *page); + int (*readpages)(struct file *filp, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages); + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); + sector_t (*bmap)(struct address_space *, sector_t); + int (*invalidatepage) (struct page *, unsigned long); + int (*releasepage) (struct page *, int); + int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs); + int (*launder_page) (struct page *); + +locking rules: + All except set_page_dirty may block + + BKL PageLocked(page) i_sem +writepage: no yes, unlocks (see below) +readpage: no yes, unlocks +sync_page: no maybe +writepages: no +set_page_dirty no no +readpages: no +write_begin: no locks the page yes +write_end: no yes, unlocks yes +perform_write: no n/a yes +bmap: yes +invalidatepage: no yes +releasepage: no yes +direct_IO: no +launder_page: no yes + + ->write_begin(), ->write_end(), ->sync_page() and ->readpage() +may be called from the request handler (/dev/loop). + + ->readpage() unlocks the page, either synchronously or via I/O +completion. + + ->readpages() populates the pagecache with the passed pages and starts +I/O against them. They come unlocked upon I/O completion. + + ->writepage() is used for two purposes: for "memory cleansing" and for +"sync". These are quite different operations and the behaviour may differ +depending upon the mode. + +If writepage is called for sync (wbc->sync_mode != WBC_SYNC_NONE) then +it *must* start I/O against the page, even if that would involve +blocking on in-progress I/O. + +If writepage is called for memory cleansing (sync_mode == +WBC_SYNC_NONE) then its role is to get as much writeout underway as +possible. So writepage should try to avoid blocking against +currently-in-progress I/O. + +If the filesystem is not called for "sync" and it determines that it +would need to block against in-progress I/O to be able to start new I/O +against the page the filesystem should redirty the page with +redirty_page_for_writepage(), then unlock the page and return zero. +This may also be done to avoid internal deadlocks, but rarely. + +If the filesystem is called for sync then it must wait on any +in-progress I/O and then start new I/O. + +The filesystem should unlock the page synchronously, before returning to the +caller, unless ->writepage() returns special WRITEPAGE_ACTIVATE +value. WRITEPAGE_ACTIVATE means that page cannot really be written out +currently, and VM should stop calling ->writepage() on this page for some +time. VM does this by moving page to the head of the active list, hence the +name. + +Unless the filesystem is going to redirty_page_for_writepage(), unlock the page +and return zero, writepage *must* run set_page_writeback() against the page, +followed by unlocking it. Once set_page_writeback() has been run against the +page, write I/O can be submitted and the write I/O completion handler must run +end_page_writeback() once the I/O is complete. If no I/O is submitted, the +filesystem must run end_page_writeback() against the page before returning from +writepage. + +That is: after 2.5.12, pages which are under writeout are *not* locked. Note, +if the filesystem needs the page to be locked during writeout, that is ok, too, +the page is allowed to be unlocked at any point in time between the calls to +set_page_writeback() and end_page_writeback(). + +Note, failure to run either redirty_page_for_writepage() or the combination of +set_page_writeback()/end_page_writeback() on a page submitted to writepage +will leave the page itself marked clean but it will be tagged as dirty in the +radix tree. This incoherency can lead to all sorts of hard-to-debug problems +in the filesystem like having dirty inodes at umount and losing written data. + + ->sync_page() locking rules are not well-defined - usually it is called +with lock on page, but that is not guaranteed. Considering the currently +existing instances of this method ->sync_page() itself doesn't look +well-defined... + + ->writepages() is used for periodic writeback and for syscall-initiated +sync operations. The address_space should start I/O against at least +*nr_to_write pages. *nr_to_write must be decremented for each page which is +written. The address_space implementation may write more (or less) pages +than *nr_to_write asks for, but it should try to be reasonably close. If +nr_to_write is NULL, all dirty pages must be written. + +writepages should _only_ write pages which are present on +mapping->io_pages. + + ->set_page_dirty() is called from various places in the kernel +when the target page is marked as needing writeback. It may be called +under spinlock (it cannot block) and is sometimes called with the page +not locked. + + ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some +filesystems and by the swapper. The latter will eventually go away. All +instances do not actually need the BKL. Please, keep it that way and don't +breed new callers. + + ->invalidatepage() is called when the filesystem must attempt to drop +some or all of the buffers from the page when it is being truncated. It +returns zero on success. If ->invalidatepage is zero, the kernel uses +block_invalidatepage() instead. + + ->releasepage() is called when the kernel is about to try to drop the +buffers from the page in preparation for freeing it. It returns zero to +indicate that the buffers are (or may be) freeable. If ->releasepage is zero, +the kernel assumes that the fs has no private interest in the buffers. + + ->launder_page() may be called prior to releasing a page if +it is still found to be dirty. It returns zero if the page was successfully +cleaned, or an error value if not. Note that in order to prevent the page +getting mapped back in and redirtied, it needs to be kept locked +across the entire operation. + + Note: currently almost all instances of address_space methods are +using BKL for internal serialization and that's one of the worst sources +of contention. Normally they are calling library functions (in fs/buffer.c) +and pass foo_get_block() as a callback (on local block-based filesystems, +indeed). BKL is not needed for library stuff and is usually taken by +foo_get_block(). It's an overkill, since block bitmaps can be protected by +internal fs locking and real critical areas are much smaller than the areas +filesystems protect now. + +----------------------- file_lock_operations ------------------------------ +prototypes: + void (*fl_insert)(struct file_lock *); /* lock insertion callback */ + void (*fl_remove)(struct file_lock *); /* lock removal callback */ + void (*fl_copy_lock)(struct file_lock *, struct file_lock *); + void (*fl_release_private)(struct file_lock *); + + +locking rules: + BKL may block +fl_insert: yes no +fl_remove: yes no +fl_copy_lock: yes no +fl_release_private: yes yes + +----------------------- lock_manager_operations --------------------------- +prototypes: + int (*fl_compare_owner)(struct file_lock *, struct file_lock *); + void (*fl_notify)(struct file_lock *); /* unblock callback */ + void (*fl_copy_lock)(struct file_lock *, struct file_lock *); + void (*fl_release_private)(struct file_lock *); + void (*fl_break)(struct file_lock *); /* break_lease callback */ + +locking rules: + BKL may block +fl_compare_owner: yes no +fl_notify: yes no +fl_copy_lock: yes no +fl_release_private: yes yes +fl_break: yes no + + Currently only NFSD and NLM provide instances of this class. None of the +them block. If you have out-of-tree instances - please, show up. Locking +in that area will change. +--------------------------- buffer_head ----------------------------------- +prototypes: + void (*b_end_io)(struct buffer_head *bh, int uptodate); + +locking rules: + called from interrupts. In other words, extreme care is needed here. +bh is locked, but that's all warranties we have here. Currently only RAID1, +highmem, fs/buffer.c, and fs/ntfs/aops.c are providing these. Block devices +call this method upon the IO completion. + +--------------------------- block_device_operations ----------------------- +prototypes: + int (*open) (struct inode *, struct file *); + int (*release) (struct inode *, struct file *); + int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); + int (*media_changed) (struct gendisk *); + int (*revalidate_disk) (struct gendisk *); + +locking rules: + BKL bd_sem +open: yes yes +release: yes yes +ioctl: yes no +media_changed: no no +revalidate_disk: no no + +The last two are called only from check_disk_change(). + +--------------------------- file_operations ------------------------------- +prototypes: + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + int (*readdir) (struct file *, void *, filldir_t); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + int (*ioctl) (struct inode *, struct file *, unsigned int, + unsigned long); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*flush) (struct file *); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, struct dentry *, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); + int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, + loff_t *); + ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, + loff_t *); + ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, + void __user *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, + loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, + unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*dir_notify)(struct file *, unsigned long); +}; + +locking rules: + All except ->poll() may block. + BKL +llseek: no (see below) +read: no +aio_read: no +write: no +aio_write: no +readdir: no +poll: no +ioctl: yes (see below) +unlocked_ioctl: no (see below) +compat_ioctl: no +mmap: no +open: no +flush: no +release: no +fsync: no (see below) +aio_fsync: no +fasync: no +lock: yes +readv: no +writev: no +sendfile: no +sendpage: no +get_unmapped_area: no +check_flags: no +dir_notify: no + +->llseek() locking has moved from llseek to the individual llseek +implementations. If your fs is not using generic_file_llseek, you +need to acquire and release the appropriate locks in your ->llseek(). +For many filesystems, it is probably safe to acquire the inode +semaphore. Note some filesystems (i.e. remote ones) provide no +protection for i_size so you will need to use the BKL. + +Note: ext2_release() was *the* source of contention on fs-intensive +loads and dropping BKL on ->release() helps to get rid of that (we still +grab BKL for cases when we close a file that had been opened r/w, but that +can and should be done using the internal locking with smaller critical areas). +Current worst offender is ext2_get_block()... + +->fasync() is a mess. This area needs a big cleanup and that will probably +affect locking. + +->readdir() and ->ioctl() on directories must be changed. Ideally we would +move ->readdir() to inode_operations and use a separate method for directory +->ioctl() or kill the latter completely. One of the problems is that for +anything that resembles union-mount we won't have a struct file for all +components. And there are other reasons why the current interface is a mess... + +->ioctl() on regular files is superceded by the ->unlocked_ioctl() that +doesn't take the BKL. + +->read on directories probably must go away - we should just enforce -EISDIR +in sys_read() and friends. + +->fsync() has i_mutex on inode. + +--------------------------- dquot_operations ------------------------------- +prototypes: + int (*initialize) (struct inode *, int); + int (*drop) (struct inode *); + int (*alloc_space) (struct inode *, qsize_t, int); + int (*alloc_inode) (const struct inode *, unsigned long); + int (*free_space) (struct inode *, qsize_t); + int (*free_inode) (const struct inode *, unsigned long); + int (*transfer) (struct inode *, struct iattr *); + int (*write_dquot) (struct dquot *); + int (*acquire_dquot) (struct dquot *); + int (*release_dquot) (struct dquot *); + int (*mark_dirty) (struct dquot *); + int (*write_info) (struct super_block *, int); + +These operations are intended to be more or less wrapping functions that ensure +a proper locking wrt the filesystem and call the generic quota operations. + +What filesystem should expect from the generic quota functions: + + FS recursion Held locks when called +initialize: yes maybe dqonoff_sem +drop: yes - +alloc_space: ->mark_dirty() - +alloc_inode: ->mark_dirty() - +free_space: ->mark_dirty() - +free_inode: ->mark_dirty() - +transfer: yes - +write_dquot: yes dqonoff_sem or dqptr_sem +acquire_dquot: yes dqonoff_sem or dqptr_sem +release_dquot: yes dqonoff_sem or dqptr_sem +mark_dirty: no - +write_info: yes dqonoff_sem + +FS recursion means calling ->quota_read() and ->quota_write() from superblock +operations. + +->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called +only directly by the filesystem and do not call any fs functions only +the ->mark_dirty() operation. + +More details about quota locking can be found in fs/dquot.c. + +--------------------------- vm_operations_struct ----------------------------- +prototypes: + void (*open)(struct vm_area_struct*); + void (*close)(struct vm_area_struct*); + int (*fault)(struct vm_area_struct*, struct vm_fault *); + int (*page_mkwrite)(struct vm_area_struct *, struct page *); + int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); + +locking rules: + BKL mmap_sem PageLocked(page) +open: no yes +close: no yes +fault: no yes +page_mkwrite: no yes no +access: no yes + + ->page_mkwrite() is called when a previously read-only page is +about to become writeable. The file system is responsible for +protecting against truncate races. Once appropriate action has been +taking to lock out truncate, the page range should be verified to be +within i_size. The page mapping should also be checked that it is not +NULL. + + ->access() is called when get_user_pages() fails in +acces_process_vm(), typically used to debug a process through +/proc/pid/mem or ptrace. This function is needed only for +VM_IO | VM_PFNMAP VMAs. + +================================================================================ + Dubious stuff + +(if you break something or notice that it is broken and do not fix it yourself +- at least put it here) + +ipc/shm.c::shm_delete() - may need BKL. +->read() and ->write() in many drivers are (probably) missing BKL. diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt new file mode 100644 index 0000000..9e8811f --- /dev/null +++ b/Documentation/filesystems/adfs.txt @@ -0,0 +1,57 @@ +Mount options for ADFS +---------------------- + + uid=nnn All files in the partition will be owned by + user id nnn. Default 0 (root). + gid=nnn All files in the partition will be in group + nnn. Default 0 (root). + ownmask=nnn The permission mask for ADFS 'owner' permissions + will be nnn. Default 0700. + othmask=nnn The permission mask for ADFS 'other' permissions + will be nnn. Default 0077. + +Mapping of ADFS permissions to Linux permissions +------------------------------------------------ + + ADFS permissions consist of the following: + + Owner read + Owner write + Other read + Other write + + (In older versions, an 'execute' permission did exist, but this + does not hold the same meaning as the Linux 'execute' permission + and is now obsolete). + + The mapping is performed as follows: + + Owner read -> -r--r--r-- + Owner write -> --w--w---w + Owner read and filetype UnixExec -> ---x--x--x + These are then masked by ownmask, eg 700 -> -rwx------ + Possible owner mode permissions -> -rwx------ + + Other read -> -r--r--r-- + Other write -> --w--w--w- + Other read and filetype UnixExec -> ---x--x--x + These are then masked by othmask, eg 077 -> ----rwxrwx + Possible other mode permissions -> ----rwxrwx + + Hence, with the default masks, if a file is owner read/write, and + not a UnixExec filetype, then the permissions will be: + + -rw------- + + However, if the masks were ownmask=0770,othmask=0007, then this would + be modified to: + -rw-rw---- + + There is no restriction on what you can do with these masks. You may + wish that either read bits give read access to the file for all, but + keep the default write protection (ownmask=0755,othmask=0577): + + -rw-r--r-- + + You can therefore tailor the permission translation to whatever you + desire the permissions should be under Linux. diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt new file mode 100644 index 0000000..2d15244 --- /dev/null +++ b/Documentation/filesystems/affs.txt @@ -0,0 +1,219 @@ +Overview of Amiga Filesystems +============================= + +Not all varieties of the Amiga filesystems are supported for reading and +writing. The Amiga currently knows six different filesystems: + +DOS\0 The old or original filesystem, not really suited for + hard disks and normally not used on them, either. + Supported read/write. + +DOS\1 The original Fast File System. Supported read/write. + +DOS\2 The old "international" filesystem. International means that + a bug has been fixed so that accented ("international") letters + in file names are case-insensitive, as they ought to be. + Supported read/write. + +DOS\3 The "international" Fast File System. Supported read/write. + +DOS\4 The original filesystem with directory cache. The directory + cache speeds up directory accesses on floppies considerably, + but slows down file creation/deletion. Doesn't make much + sense on hard disks. Supported read only. + +DOS\5 The Fast File System with directory cache. Supported read only. + +All of the above filesystems allow block sizes from 512 to 32K bytes. +Supported block sizes are: 512, 1024, 2048 and 4096 bytes. Larger blocks +speed up almost everything at the expense of wasted disk space. The speed +gain above 4K seems not really worth the price, so you don't lose too +much here, either. + +The muFS (multi user File System) equivalents of the above file systems +are supported, too. + +Mount options for the AFFS +========================== + +protect If this option is set, the protection bits cannot be altered. + +setuid[=uid] This sets the owner of all files and directories in the file + system to uid or the uid of the current user, respectively. + +setgid[=gid] Same as above, but for gid. + +mode=mode Sets the mode flags to the given (octal) value, regardless + of the original permissions. Directories will get an x + permission if the corresponding r bit is set. + This is useful since most of the plain AmigaOS files + will map to 600. + +reserved=num Sets the number of reserved blocks at the start of the + partition to num. You should never need this option. + Default is 2. + +root=block Sets the block number of the root block. This should never + be necessary. + +bs=blksize Sets the blocksize to blksize. Valid block sizes are 512, + 1024, 2048 and 4096. Like the root option, this should + never be necessary, as the affs can figure it out itself. + +quiet The file system will not return an error for disallowed + mode changes. + +verbose The volume name, file system type and block size will + be written to the syslog when the filesystem is mounted. + +mufs The filesystem is really a muFS, also it doesn't + identify itself as one. This option is necessary if + the filesystem wasn't formatted as muFS, but is used + as one. + +prefix=path Path will be prefixed to every absolute path name of + symbolic links on an AFFS partition. Default = "/". + (See below.) + +volume=name When symbolic links with an absolute path are created + on an AFFS partition, name will be prepended as the + volume name. Default = "" (empty string). + (See below.) + +Handling of the Users/Groups and protection flags +================================================= + +Amiga -> Linux: + +The Amiga protection flags RWEDRWEDHSPARWED are handled as follows: + + - R maps to r for user, group and others. On directories, R implies x. + + - If both W and D are allowed, w will be set. + + - E maps to x. + + - H and P are always retained and ignored under Linux. + + - A is always reset when a file is written to. + +User id and group id will be used unless set[gu]id are given as mount +options. Since most of the Amiga file systems are single user systems +they will be owned by root. The root directory (the mount point) of the +Amiga filesystem will be owned by the user who actually mounts the +filesystem (the root directory doesn't have uid/gid fields). + +Linux -> Amiga: + +The Linux rwxrwxrwx file mode is handled as follows: + + - r permission will set R for user, group and others. + + - w permission will set W and D for user, group and others. + + - x permission of the user will set E for plain files. + + - All other flags (suid, sgid, ...) are ignored and will + not be retained. + +Newly created files and directories will get the user and group ID +of the current user and a mode according to the umask. + +Symbolic links +============== + +Although the Amiga and Linux file systems resemble each other, there +are some, not always subtle, differences. One of them becomes apparent +with symbolic links. While Linux has a file system with exactly one +root directory, the Amiga has a separate root directory for each +file system (for example, partition, floppy disk, ...). With the Amiga, +these entities are called "volumes". They have symbolic names which +can be used to access them. Thus, symbolic links can point to a +different volume. AFFS turns the volume name into a directory name +and prepends the prefix path (see prefix option) to it. + +Example: +You mount all your Amiga partitions under /amiga/<volume> (where +<volume> is the name of the volume), and you give the option +"prefix=/amiga/" when mounting all your AFFS partitions. (They +might be "User", "WB" and "Graphics", the mount points /amiga/User, +/amiga/WB and /amiga/Graphics). A symbolic link referring to +"User:sc/include/dos/dos.h" will be followed to +"/amiga/User/sc/include/dos/dos.h". + +Examples +======== + +Command line: + mount Archive/Amiga/Workbench3.1.adf /mnt -t affs -o loop,verbose + mount /dev/sda3 /Amiga -t affs + +/etc/fstab entry: + /dev/sdb5 /amiga/Workbench affs noauto,user,exec,verbose 0 0 + +IMPORTANT NOTE +============== + +If you boot Windows 95 (don't know about 3.x, 98 and NT) while you +have an Amiga harddisk connected to your PC, it will overwrite +the bytes 0x00dc..0x00df of block 0 with garbage, thus invalidating +the Rigid Disk Block. Sheer luck has it that this is an unused +area of the RDB, so only the checksum doesn't match anymore. +Linux will ignore this garbage and recognize the RDB anyway, but +before you connect that drive to your Amiga again, you must +restore or repair your RDB. So please do make a backup copy of it +before booting Windows! + +If the damage is already done, the following should fix the RDB +(where <disk> is the device name). +DO AT YOUR OWN RISK: + + dd if=/dev/<disk> of=rdb.tmp count=1 + cp rdb.tmp rdb.fixed + dd if=/dev/zero of=rdb.fixed bs=1 seek=220 count=4 + dd if=rdb.fixed of=/dev/<disk> + +Bugs, Restrictions, Caveats +=========================== + +Quite a few things may not work as advertised. Not everything is +tested, though several hundred MB have been read and written using +this fs. For a most up-to-date list of bugs please consult +fs/affs/Changes. + +Filenames are truncated to 30 characters without warning (this +can be changed by setting the compile-time option AFFS_NO_TRUNCATE +in include/linux/amigaffs.h). + +Case is ignored by the affs in filename matching, but Linux shells +do care about the case. Example (with /wb being an affs mounted fs): + rm /wb/WRONGCASE +will remove /mnt/wrongcase, but + rm /wb/WR* +will not since the names are matched by the shell. + +The block allocation is designed for hard disk partitions. If more +than 1 process writes to a (small) diskette, the blocks are allocated +in an ugly way (but the real AFFS doesn't do much better). This +is also true when space gets tight. + +You cannot execute programs on an OFS (Old File System), since the +program files cannot be memory mapped due to the 488 byte blocks. +For the same reason you cannot mount an image on such a filesystem +via the loopback device. + +The bitmap valid flag in the root block may not be accurate when the +system crashes while an affs partition is mounted. There's currently +no way to fix a garbled filesystem without an Amiga (disk validator) +or manually (who would do this?). Maybe later. + +If you mount affs partitions on system startup, you may want to tell +fsck that the fs should not be checked (place a '0' in the sixth field +of /etc/fstab). + +It's not possible to read floppy disks with a normal PC or workstation +due to an incompatibility with the Amiga floppy controller. + +If you are interested in an Amiga Emulator for Linux, look at + +http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt new file mode 100644 index 0000000..12ad6c7 --- /dev/null +++ b/Documentation/filesystems/afs.txt @@ -0,0 +1,249 @@ + ==================== + kAFS: AFS FILESYSTEM + ==================== + +Contents: + + - Overview. + - Usage. + - Mountpoints. + - Proc filesystem. + - The cell database. + - Security. + - Examples. + + +======== +OVERVIEW +======== + +This filesystem provides a fairly simple secure AFS filesystem driver. It is +under development and does not yet provide the full feature set. The features +it does support include: + + (*) Security (currently only AFS kaserver and KerberosIV tickets). + + (*) File reading. + + (*) Automounting. + +It does not yet support the following AFS features: + + (*) Write support. + + (*) Local caching. + + (*) pioctl() system call. + + +=========== +COMPILATION +=========== + +The filesystem should be enabled by turning on the kernel configuration +options: + + CONFIG_AF_RXRPC - The RxRPC protocol transport + CONFIG_RXKAD - The RxRPC Kerberos security handler + CONFIG_AFS - The AFS filesystem + +Additionally, the following can be turned on to aid debugging: + + CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled + CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled + +They permit the debugging messages to be turned on dynamically by manipulating +the masks in the following files: + + /sys/module/af_rxrpc/parameters/debug + /sys/module/afs/parameters/debug + + +===== +USAGE +===== + +When inserting the driver modules the root cell must be specified along with a +list of volume location server IP addresses: + + insmod af_rxrpc.o + insmod rxkad.o + insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + +The first module is the AF_RXRPC network protocol driver. This provides the +RxRPC remote operation protocol and may also be accessed from userspace. See: + + Documentation/networking/rxrpc.txt + +The second module is the kerberos RxRPC security driver, and the third module +is the actual filesystem driver for the AFS filesystem. + +Once the module has been loaded, more modules can be added by the following +procedure: + + echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + +Where the parameters to the "add" command are the name of a cell and a list of +volume location servers within that cell, with the latter separated by colons. + +Filesystems can be mounted anywhere by commands similar to the following: + + mount -t afs "%cambridge.redhat.com:root.afs." /afs + mount -t afs "#cambridge.redhat.com:root.cell." /afs/cambridge + mount -t afs "#root.afs." /afs + mount -t afs "#root.cell." /afs/cambridge + +Where the initial character is either a hash or a percent symbol depending on +whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O +volume, but are willing to use a R/W volume instead (percent). + +The name of the volume can be suffixes with ".backup" or ".readonly" to +specify connection to only volumes of those types. + +The name of the cell is optional, and if not given during a mount, then the +named volume will be looked up in the cell specified during insmod. + +Additional cells can be added through /proc (see later section). + + +=========== +MOUNTPOINTS +=========== + +AFS has a concept of mountpoints. In AFS terms, these are specially formatted +symbolic links (of the same form as the "device name" passed to mount). kAFS +presents these to the user as directories that have a follow-link capability +(ie: symbolic link semantics). If anyone attempts to access them, they will +automatically cause the target volume to be mounted (if possible) on that site. + +Automatically mounted filesystems will be automatically unmounted approximately +twenty minutes after they were last used. Alternatively they can be unmounted +directly with the umount() system call. + +Manually unmounting an AFS volume will cause any idle submounts upon it to be +culled first. If all are culled, then the requested volume will also be +unmounted, otherwise error EBUSY will be returned. + +This can be used by the administrator to attempt to unmount the whole AFS tree +mounted on /afs in one go by doing: + + umount /afs + + +=============== +PROC FILESYSTEM +=============== + +The AFS modules creates a "/proc/fs/afs/" directory and populates it: + + (*) A "cells" file that lists cells currently known to the afs module and + their usage counts: + + [root@andromeda ~]# cat /proc/fs/afs/cells + USE NAME + 3 cambridge.redhat.com + + (*) A directory per cell that contains files that list volume location + servers, volumes, and active servers known within that cell. + + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers + USE ADDR STATE + 4 172.16.18.91 0 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers + ADDRESS + 172.16.18.91 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes + USE STT VLID[0] VLID[1] VLID[2] NAME + 1 Val 20000000 20000001 20000002 root.afs + + +================= +THE CELL DATABASE +================= + +The filesystem maintains an internal database of all the cells it knows and the +IP addresses of the volume location servers for those cells. The cell to which +the system belongs is added to the database when insmod is performed by the +"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on +the kernel command line. + +Further cells can be added by commands similar to the following: + + echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells + echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + +No other cell database operations are available at this time. + + +======== +SECURITY +======== + +Secure operations are initiated by acquiring a key using the klog program. A +very primitive klog program is available at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +This should be compiled by: + + make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" + +And then run as: + + ./klog + +Assuming it's successful, this adds a key of type RxRPC, named for the service +and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or +by cat'ing /proc/keys: + + [root@andromeda ~]# keyctl show + Session Keyring + -3 --alswrv 0 0 keyring: _ses.3268 + 2 --alswrv 0 0 \_ keyring: _uid.0 + 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM + +Currently the username, realm, password and proposed ticket lifetime are +compiled in to the program. + +It is not required to acquire a key before using AFS facilities, but if one is +not acquired then all operations will be governed by the anonymous user parts +of the ACLs. + +If a key is acquired, then all AFS operations, including mounts and automounts, +made by a possessor of that key will be secured with that key. + +If a file is opened with a particular key and then the file descriptor is +passed to a process that doesn't have that key (perhaps over an AF_UNIX +socket), then the operations on the file will be made with key that was used to +open the file. + + +======== +EXAMPLES +======== + +Here's what I use to test this. Some of the names and IP addresses are local +to my internal DNS. My "root.afs" partition has a mount point within it for +some public volumes volumes. + +insmod /tmp/rxrpc.o +insmod /tmp/rxkad.o +insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91 + +mount -t afs \%root.afs. /afs +mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ + +echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ +mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive +mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib +mount -t afs "#grand.central.org:root.doc." /afs/grand.central.org/doc +mount -t afs "#grand.central.org:root.project." /afs/grand.central.org/project +mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service +mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software +mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user + +umount /afs +rmmod kafs +rmmod rxkad +rmmod rxrpc diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt new file mode 100644 index 0000000..c634174 --- /dev/null +++ b/Documentation/filesystems/autofs4-mount-control.txt @@ -0,0 +1,393 @@ + +Miscellaneous Device control operations for the autofs4 kernel module +==================================================================== + +The problem +=========== + +There is a problem with active restarts in autofs (that is to say +restarting autofs when there are busy mounts). + +During normal operation autofs uses a file descriptor opened on the +directory that is being managed in order to be able to issue control +operations. Using a file descriptor gives ioctl operations access to +autofs specific information stored in the super block. The operations +are things such as setting an autofs mount catatonic, setting the +expire timeout and requesting expire checks. As is explained below, +certain types of autofs triggered mounts can end up covering an autofs +mount itself which prevents us being able to use open(2) to obtain a +file descriptor for these operations if we don't already have one open. + +Currently autofs uses "umount -l" (lazy umount) to clear active mounts +at restart. While using lazy umount works for most cases, anything that +needs to walk back up the mount tree to construct a path, such as +getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works +because the point from which the path is constructed has been detached +from the mount tree. + +The actual problem with autofs is that it can't reconnect to existing +mounts. Immediately one thinks of just adding the ability to remount +autofs file systems would solve it, but alas, that can't work. This is +because autofs direct mounts and the implementation of "on demand mount +and expire" of nested mount trees have the file system mounted directly +on top of the mount trigger directory dentry. + +For example, there are two types of automount maps, direct (in the kernel +module source you will see a third type called an offset, which is just +a direct mount in disguise) and indirect. + +Here is a master map with direct and indirect map entries: + +/- /etc/auto.direct +/test /etc/auto.indirect + +and the corresponding map files: + +/etc/auto.direct: + +/automount/dparse/g6 budgie:/autofs/export1 +/automount/dparse/g1 shark:/autofs/export1 +and so on. + +/etc/auto.indirect: + +g1 shark:/autofs/export1 +g6 budgie:/autofs/export1 +and so on. + +For the above indirect map an autofs file system is mounted on /test and +mounts are triggered for each sub-directory key by the inode lookup +operation. So we see a mount of shark:/autofs/export1 on /test/g1, for +example. + +The way that direct mounts are handled is by making an autofs mount on +each full path, such as /automount/dparse/g1, and using it as a mount +trigger. So when we walk on the path we mount shark:/autofs/export1 "on +top of this mount point". Since these are always directories we can +use the follow_link inode operation to trigger the mount. + +But, each entry in direct and indirect maps can have offsets (making +them multi-mount map entries). + +For example, an indirect mount map entry could also be: + +g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 + +and a similarly a direct mount map entry could also be: + +/automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 + +One of the issues with version 4 of autofs was that, when mounting an +entry with a large number of offsets, possibly with nesting, we needed +to mount and umount all of the offsets as a single unit. Not really a +problem, except for people with a large number of offsets in map entries. +This mechanism is used for the well known "hosts" map and we have seen +cases (in 2.4) where the available number of mounts are exhausted or +where the number of privileged ports available is exhausted. + +In version 5 we mount only as we go down the tree of offsets and +similarly for expiring them which resolves the above problem. There is +somewhat more detail to the implementation but it isn't needed for the +sake of the problem explanation. The one important detail is that these +offsets are implemented using the same mechanism as the direct mounts +above and so the mount points can be covered by a mount. + +The current autofs implementation uses an ioctl file descriptor opened +on the mount point for control operations. The references held by the +descriptor are accounted for in checks made to determine if a mount is +in use and is also used to access autofs file system information held +in the mount super block. So the use of a file handle needs to be +retained. + + +The Solution +============ + +To be able to restart autofs leaving existing direct, indirect and +offset mounts in place we need to be able to obtain a file handle +for these potentially covered autofs mount points. Rather than just +implement an isolated operation it was decided to re-implement the +existing ioctl interface and add new operations to provide this +functionality. + +In addition, to be able to reconstruct a mount tree that has busy mounts, +the uid and gid of the last user that triggered the mount needs to be +available because these can be used as macro substitution variables in +autofs maps. They are recorded at mount request time and an operation +has been added to retrieve them. + +Since we're re-implementing the control interface, a couple of other +problems with the existing interface have been addressed. First, when +a mount or expire operation completes a status is returned to the +kernel by either a "send ready" or a "send fail" operation. The +"send fail" operation of the ioctl interface could only ever send +ENOENT so the re-implementation allows user space to send an actual +status. Another expensive operation in user space, for those using +very large maps, is discovering if a mount is present. Usually this +involves scanning /proc/mounts and since it needs to be done quite +often it can introduce significant overhead when there are many entries +in the mount table. An operation to lookup the mount status of a mount +point dentry (covered or not) has also been added. + +Current kernel development policy recommends avoiding the use of the +ioctl mechanism in favor of systems such as Netlink. An implementation +using this system was attempted to evaluate its suitability and it was +found to be inadequate, in this case. The Generic Netlink system was +used for this as raw Netlink would lead to a significant increase in +complexity. There's no question that the Generic Netlink system is an +elegant solution for common case ioctl functions but it's not a complete +replacement probably because it's primary purpose in life is to be a +message bus implementation rather than specifically an ioctl replacement. +While it would be possible to work around this there is one concern +that lead to the decision to not use it. This is that the autofs +expire in the daemon has become far to complex because umount +candidates are enumerated, almost for no other reason than to "count" +the number of times to call the expire ioctl. This involves scanning +the mount table which has proved to be a big overhead for users with +large maps. The best way to improve this is try and get back to the +way the expire was done long ago. That is, when an expire request is +issued for a mount (file handle) we should continually call back to +the daemon until we can't umount any more mounts, then return the +appropriate status to the daemon. At the moment we just expire one +mount at a time. A Generic Netlink implementation would exclude this +possibility for future development due to the requirements of the +message bus architecture. + + +autofs4 Miscellaneous Device mount control interface +==================================================== + +The control interface is opening a device node, typically /dev/autofs. + +All the ioctls use a common structure to pass the needed parameter +information and return operation results: + +struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + __u32 arg1; /* Command parameters */ + __u32 arg2; + + char path[0]; +}; + +The ioctlfd field is a mount point file descriptor of an autofs mount +point. It is returned by the open call and is used by all calls except +the check for whether a given path is a mount point, where it may +optionally be used to check a specific mount corresponding to a given +mount point file descriptor, and when requesting the uid and gid of the +last successful mount on a directory within the autofs file system. + +The fields arg1 and arg2 are used to communicate parameters and results of +calls made as described below. + +The path field is used to pass a path where it is needed and the size field +is used account for the increased structure length when translating the +structure sent from user space. + +This structure can be initialized before setting specific fields by using +the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). + +All of the ioctls perform a copy of this structure from user space to +kernel space and return -EINVAL if the size parameter is smaller than +the structure size itself, -ENOMEM if the kernel memory allocation fails +or -EFAULT if the copy itself fails. Other checks include a version check +of the compiled in user space version against the module version and a +mismatch results in a -EINVAL return. If the size field is greater than +the structure size then a path is assumed to be present and is checked to +ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is +returned. Following these checks, for all ioctl commands except +AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and +AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is +not a valid descriptor or doesn't correspond to an autofs mount point +an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is +returned. + + +The ioctls +========== + +An example of an implementation which uses this interface can be seen +in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the +distribution tar available for download from kernel.org in directory +/pub/linux/daemons/autofs/v5. + +The device node ioctl operations implemented by this interface are: + + +AUTOFS_DEV_IOCTL_VERSION +------------------------ + +Get the major and minor version of the autofs4 device ioctl kernel module +implementation. It requires an initialized struct autofs_dev_ioctl as an +input parameter and sets the version information in the passed in structure. +It returns 0 on success or the error -EINVAL if a version mismatch is +detected. + + +AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD +------------------------------------------------------------------ + +Get the major and minor version of the autofs4 protocol version understood +by loaded module. This call requires an initialized struct autofs_dev_ioctl +with the ioctlfd field set to a valid autofs mount point descriptor +and sets the requested version number in structure field arg1. These +commands return 0 on success or one of the negative error codes if +validation fails. + + +AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT +---------------------------------------------------------- + +Obtain and release a file descriptor for an autofs managed mount point +path. The open call requires an initialized struct autofs_dev_ioctl with +the the path field set and the size field adjusted appropriately as well +as the arg1 field set to the device number of the autofs mount. The +device number can be obtained from the mount options shown in +/proc/mounts. The close call requires an initialized struct +autofs_dev_ioct with the ioctlfd field set to the descriptor obtained +from the open call. The release of the file descriptor can also be done +with close(2) so any open descriptors will also be closed at process exit. +The close call is included in the implemented operations largely for +completeness and to provide for a consistent user space implementation. + + +AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD +-------------------------------------------------------- + +Return mount and expire result status from user space to the kernel. +Both of these calls require an initialized struct autofs_dev_ioctl +with the ioctlfd field set to the descriptor obtained from the open +call and the arg1 field set to the wait queue token number, received +by user space in the foregoing mount or expire request. The arg2 field +is set to the status to be returned. For the ready call this is always +0 and for the fail call it is set to the errno of the operation. + + +AUTOFS_DEV_IOCTL_SETPIPEFD_CMD +------------------------------ + +Set the pipe file descriptor used for kernel communication to the daemon. +Normally this is set at mount time using an option but when reconnecting +to a existing mount we need to use this to tell the autofs mount about +the new kernel pipe descriptor. In order to protect mounts against +incorrectly setting the pipe descriptor we also require that the autofs +mount be catatonic (see next call). + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +the arg1 field set to descriptor of the pipe. On success the call +also sets the process group id used to identify the controlling process +(eg. the owning automount(8) daemon) to the process group of the caller. + + +AUTOFS_DEV_IOCTL_CATATONIC_CMD +------------------------------ + +Make the autofs mount point catatonic. The autofs mount will no longer +issue mount requests, the kernel communication pipe descriptor is released +and any remaining waits in the queue released. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_TIMEOUT_CMD +---------------------------- + +Set the expire timeout for mounts withing an autofs mount point. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_REQUESTER_CMD +------------------------------ + +Return the uid and gid of the last process to successfully trigger a the +mount on the given path dentry. + +The call requires an initialized struct autofs_dev_ioctl with the path +field set to the mount point in question and the size field adjusted +appropriately as well as the arg1 field set to the device number of the +containing autofs mount. Upon return the struct field arg1 contains the +uid and arg2 the gid. + +When reconstructing an autofs mount tree with active mounts we need to +re-connect to mounts that may have used the original process uid and +gid (or string variations of them) for mount lookups within the map entry. +This call provides the ability to obtain this uid and gid so they may be +used by user space for the mount map lookups. + + +AUTOFS_DEV_IOCTL_EXPIRE_CMD +--------------------------- + +Issue an expire request to the kernel for an autofs mount. Typically +this ioctl is called until no further expire candidates are found. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. In +addition an immediate expire, independent of the mount timeout, can be +requested by setting the arg1 field to 1. If no expire candidates can +be found the ioctl returns -1 with errno set to EAGAIN. + +This call causes the kernel module to check the mount corresponding +to the given ioctlfd for mounts that can be expired, issues an expire +request back to the daemon and waits for completion. + +AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD +------------------------------ + +Checks if an autofs mount point is in use. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +it returns the result in the arg1 field, 1 for busy and 0 otherwise. + + +AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD +--------------------------------- + +Check if the given path is a mountpoint. + +The call requires an initialized struct autofs_dev_ioctl. There are two +possible variations. Both use the path field set to the path of the mount +point to check and the size field adjusted appropriately. One uses the +ioctlfd field to identify a specific mount point to check while the other +variation uses the path and optionaly arg1 set to an autofs mount type. +The call returns 1 if this is a mount point and sets arg1 to the device +number of the mount and field arg2 to the relevant super block magic +number (described below) or 0 if it isn't a mountpoint. In both cases +the the device number (as returned by new_encode_dev()) is returned +in field arg1. + +If supplied with a file descriptor we're looking for a specific mount, +not necessarily at the top of the mounted stack. In this case the path +the descriptor corresponds to is considered a mountpoint if it is itself +a mountpoint or contains a mount, such as a multi-mount without a root +mount. In this case we return 1 if the descriptor corresponds to a mount +point and and also returns the super magic of the covering mount if there +is one or 0 if it isn't a mountpoint. + +If a path is supplied (and the ioctlfd field is set to -1) then the path +is looked up and is checked to see if it is the root of a mount. If a +type is also given we are looking for a particular autofs mount and if +a match isn't found a fail is returned. If the the located path is the +root of a mount 1 is returned along with the super magic of the mount +or 0 otherwise. + diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt new file mode 100644 index 0000000..7cac200 --- /dev/null +++ b/Documentation/filesystems/automount-support.txt @@ -0,0 +1,118 @@ +Support is available for filesystems that wish to do automounting support (such +as kAFS which can be found in fs/afs/). This facility includes allowing +in-kernel mounts to be performed and mountpoint degradation to be +requested. The latter can also be requested by userspace. + + +====================== +IN-KERNEL AUTOMOUNTING +====================== + +A filesystem can now mount another filesystem on one of its directories by the +following procedure: + + (1) Give the directory a follow_link() operation. + + When the directory is accessed, the follow_link op will be called, and + it will be provided with the location of the mountpoint in the nameidata + structure (vfsmount and dentry). + + (2) Have the follow_link() op do the following steps: + + (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a + superblock and gain a vfsmount structure representing it. + + (b) Copy the nameidata provided as an argument and substitute the dentry + argument into it the copy. + + (c) Call do_add_mount() to install the new vfsmount into the namespace's + mountpoint tree, thus making it accessible to userspace. Use the + nameidata set up in (b) as the destination. + + If the mountpoint will be automatically expired, then do_add_mount() + should also be given the location of an expiration list (see further + down). + + (d) Release the path in the nameidata argument and substitute in the new + vfsmount and its root dentry. The ref counts on these will need + incrementing. + +Then from userspace, you can just do something like: + + [root@andromeda root]# mount -t afs \#root.afs. /afs + [root@andromeda root]# ls /afs + asd cambridge cambridge.redhat.com grand.central.org + [root@andromeda root]# ls /afs/cambridge + afsdoc + [root@andromeda root]# ls /afs/cambridge/afsdoc/ + ChangeLog html LICENSE pdf RELNOTES-1.2.2 + +And then if you look in the mountpoint catalogue, you'll see something like: + + [root@andromeda root]# cat /proc/mounts + ... + #root.afs. /afs afs rw 0 0 + #root.cell. /afs/cambridge.redhat.com afs rw 0 0 + #afsdoc. /afs/cambridge.redhat.com/afsdoc afs rw 0 0 + + +=========================== +AUTOMATIC MOUNTPOINT EXPIRY +=========================== + +Automatic expiration of mountpoints is easy, provided you've mounted the +mountpoint to be expired in the automounting procedure outlined above. + +To do expiration, you need to follow these steps: + + (3) Create at least one list off which the vfsmounts to be expired can be + hung. Access to this list will be governed by the vfsmount_lock. + + (4) In step (2c) above, the call to do_add_mount() should be provided with a + pointer to this list. It will hang the vfsmount off of it if it succeeds. + + (5) When you want mountpoints to be expired, call mark_mounts_for_expiry() + with a pointer to this list. This will process the list, marking every + vfsmount thereon for potential expiry on the next call. + + If a vfsmount was already flagged for expiry, and if its usage count is 1 + (it's only referenced by its parent vfsmount), then it will be deleted + from the namespace and thrown away (effectively unmounted). + + It may prove simplest to simply call this at regular intervals, using + some sort of timed event to drive it. + +The expiration flag is cleared by calls to mntput. This means that expiration +will only happen on the second expiration request after the last time the +mountpoint was accessed. + +If a mountpoint is moved, it gets removed from the expiration list. If a bind +mount is made on an expirable mount, the new vfsmount will not be on the +expiration list and will not expire. + +If a namespace is copied, all mountpoints contained therein will be copied, +and the copies of those that are on an expiration list will be added to the +same expiration list. + + +======================= +USERSPACE DRIVEN EXPIRY +======================= + +As an alternative, it is possible for userspace to request expiry of any +mountpoint (though some will be rejected - the current process's idea of the +rootfs for example). It does this by passing the MNT_EXPIRE flag to +umount(). This flag is considered incompatible with MNT_FORCE and MNT_DETACH. + +If the mountpoint in question is in referenced by something other than +umount() or its parent mountpoint, an EBUSY error will be returned and the +mountpoint will not be marked for expiration or unmounted. + +If the mountpoint was not already marked for expiry at that time, an EAGAIN +error will be given and it won't be unmounted. + +Otherwise if it was already marked and it wasn't referenced, unmounting will +take place as usual. + +Again, the expiration flag is cleared every time anything other than umount() +looks at a mountpoint. diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt new file mode 100644 index 0000000..67391a1 --- /dev/null +++ b/Documentation/filesystems/befs.txt @@ -0,0 +1,117 @@ +BeOS filesystem for Linux + +Document last updated: Dec 6, 2001 + +WARNING +======= +Make sure you understand that this is alpha software. This means that the +implementation is neither complete nor well-tested. + +I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! + +LICENSE +===== +This software is covered by the GNU General Public License. +See the file COPYING for the complete text of the license. +Or the GNU website: <http://www.gnu.org/licenses/licenses.html> + +AUTHOR +===== +The largest part of the code written by Will Dyson <will_dyson@pobox.com> +He has been working on the code since Aug 13, 2001. See the changelog for +details. + +Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> +His original code can still be found at: +<http://hp.vector.co.jp/authors/VA008030/bfs/> +Does anyone know of a more current email address for Makoto? He doesn't +respond to the address given above... + +Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru> + +WHAT IS THIS DRIVER? +================== +This module implements the native filesystem of BeOS <http://www.be.com/> +for the linux 2.4.1 and later kernels. Currently it is a read-only +implementation. + +Which is it, BFS or BEFS? +================ +Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". +But Unixware Boot Filesystem is called bfs, too. And they are already in +the kernel. Because of this naming conflict, on Linux the BeOS +filesystem is called befs. + +HOW TO INSTALL +============== +step 1. Install the BeFS patch into the source code tree of linux. + +Apply the patchfile to your kernel source tree. +Assuming that your kernel source is in /foo/bar/linux and the patchfile +is called patch-befs-xxx, you would do the following: + + cd /foo/bar/linux + patch -p1 < /path/to/patch-befs-xxx + +if the patching step fails (i.e. there are rejected hunks), you can try to +figure it out yourself (it shouldn't be hard), or mail the maintainer +(Will Dyson <will_dyson@pobox.com>) for help. + +step 2. Configuration & make kernel + +The linux kernel has many compile-time options. Most of them are beyond the +scope of this document. I suggest the Kernel-HOWTO document as a good general +reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html> + +However, to use the BeFS module, you must enable it at configure time. + + cd /foo/bar/linux + make menuconfig (or xconfig) + +The BeFS module is not a standard part of the linux kernel, so you must first +enable support for experimental code under the "Code maturity level" menu. + +Then, under the "Filesystems" menu will be an option called "BeFS +filesystem (experimental)", or something like that. Enable that option +(it is fine to make it a module). + +Save your kernel configuration and then build your kernel. + +step 3. Install + +See the kernel howto <http://www.linux.com/howto/Kernel-HOWTO.html> for +instructions on this critical step. + +USING BFS +========= +To use the BeOS filesystem, use filesystem type 'befs'. + +ex) + mount -t befs /dev/fd0 /beos + +MOUNT OPTIONS +============= +uid=nnn All files in the partition will be owned by user id nnn. +gid=nnn All files in the partition will be in group nnn. +iocharset=xxx Use xxx as the name of the NLS translation table. +debug The driver will output debugging information to the syslog. + +HOW TO GET LASTEST VERSION +========================== + +The latest version is currently available at: +<http://befs-driver.sourceforge.net/> + +ANY KNOWN BUGS? +=========== +As of Jan 20, 2002: + + None + +SPECIAL THANKS +============== +Dominic Giampalo ... Writing "Practical file system design with Be filesystem" +Hiroyuki Yamada ... Testing LinuxPPC. + + + diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt new file mode 100644 index 0000000..78043d5 --- /dev/null +++ b/Documentation/filesystems/bfs.txt @@ -0,0 +1,57 @@ +BFS FILESYSTEM FOR LINUX +======================== + +The BFS filesystem is used by SCO UnixWare OS for the /stand slice, which +usually contains the kernel image and a few other files required for the +boot process. + +In order to access /stand partition under Linux you obviously need to +know the partition number and the kernel must support UnixWare disk slices +(CONFIG_UNIXWARE_DISKLABEL config option). However BFS support does not +depend on having UnixWare disklabel support because one can also mount +BFS filesystem via loopback: + +# losetup /dev/loop0 stand.img +# mount -t bfs /dev/loop0 /mnt/stand + +where stand.img is a file containing the image of BFS filesystem. +When you have finished using it and umounted you need to also deallocate +/dev/loop0 device by: + +# losetup -d /dev/loop0 + +You can simplify mounting by just typing: + +# mount -t bfs -o loop stand.img /mnt/stand + +this will allocate the first available loopback device (and load loop.o +kernel module if necessary) automatically. If the loopback driver is not +loaded automatically, make sure that you have compiled the module and +that modprobe is functioning. Beware that umount will not deallocate +/dev/loopN device if /etc/mtab file on your system is a symbolic link to +/proc/mounts. You will need to do it manually using "-d" switch of +losetup(8). Read losetup(8) manpage for more info. + +To create the BFS image under UnixWare you need to find out first which +slice contains it. The command prtvtoc(1M) is your friend: + +# prtvtoc /dev/rdsk/c0b0t0d0s0 + +(assuming your root disk is on target=0, lun=0, bus=0, controller=0). Then you +look for the slice with tag "STAND", which is usually slice 10. With this +information you can use dd(1) to create the BFS image: + +# umount /stand +# dd if=/dev/rdsk/c0b0t0d0sa of=stand.img bs=512 + +Just in case, you can verify that you have done the right thing by checking +the magic number: + +# od -Ad -tx4 stand.img | more + +The first 4 bytes should be 0x1badface. + +If you have any patches, questions or suggestions regarding this BFS +implementation please contact the author: + +Tigran Aivazian <tigran@aivazian.fsnet.co.uk> diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt new file mode 100644 index 0000000..49cc923 --- /dev/null +++ b/Documentation/filesystems/cifs.txt @@ -0,0 +1,51 @@ + This is the client VFS module for the Common Internet File System + (CIFS) protocol which is the successor to the Server Message Block + (SMB) protocol, the native file sharing mechanism for most early + PC operating systems. CIFS is fully supported by current network + file servers such as Windows 2000, Windows 2003 (including + Windows XP) as well by Samba (which provides excellent CIFS + server support for Linux and many other operating systems), so + this network filesystem client can mount to a wide variety of + servers. The smbfs module should be used instead of this cifs module + for mounting to older SMB servers such as OS/2. The smbfs and cifs + modules can coexist and do not conflict. The CIFS VFS filesystem + module is designed to work well with servers that implement the + newer versions (dialects) of the SMB/CIFS protocol such as Samba, + the program written by Andrew Tridgell that turns any Unix host + into a SMB/CIFS file server. + + The intent of this module is to provide the most advanced network + file system function for CIFS compliant servers, including better + POSIX compliance, secure per-user session establishment, high + performance safe distributed caching (oplock), optional packet + signing, large files, Unicode support and other internationalization + improvements. Since both Samba server and this filesystem client support + the CIFS Unix extensions, the combination can provide a reasonable + alternative to NFSv4 for fileserving in some Linux to Linux environments, + not just in Linux to Windows environments. + + This filesystem has an optional mount utility (mount.cifs) that can + be obtained from the project page and installed in the path in the same + directory with the other mount helpers (such as mount.smbfs). + Mounting using the cifs filesystem without installing the mount helper + requires specifying the server's ip address. + + For Linux 2.4: + mount //anything/here /mnt_target -o + user=username,pass=password,unc=//ip_address_of_server/sharename + + For Linux 2.5: + mount //ip_address_of_server/sharename /mnt_target -o user=username, pass=password + + + For more information on the module see the project page at + + http://us1.samba.org/samba/Linux_CIFS_client.html + + For more information on CIFS see: + + http://www.snia.org/tech_activities/CIFS + + or the Samba site: + + http://www.samba.org diff --git a/Documentation/filesystems/coda.txt b/Documentation/filesystems/coda.txt new file mode 100644 index 0000000..6131135 --- /dev/null +++ b/Documentation/filesystems/coda.txt @@ -0,0 +1,1673 @@ +NOTE: +This is one of the technical documents describing a component of +Coda -- this document describes the client kernel-Venus interface. + +For more information: + http://www.coda.cs.cmu.edu +For user level software needed to run Coda: + ftp://ftp.coda.cs.cmu.edu + +To run Coda you need to get a user level cache manager for the client, +named Venus, as well as tools to manipulate ACLs, to log in, etc. The +client needs to have the Coda filesystem selected in the kernel +configuration. + +The server needs a user level server and at present does not depend on +kernel support. + + + + + + + + The Venus kernel interface + Peter J. Braam + v1.0, Nov 9, 1997 + + This document describes the communication between Venus and kernel + level filesystem code needed for the operation of the Coda file sys- + tem. This document version is meant to describe the current interface + (version 1.0) as well as improvements we envisage. + ______________________________________________________________________ + + Table of Contents + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 1. Introduction + + 2. Servicing Coda filesystem calls + + 3. The message layer + + 3.1 Implementation details + + 4. The interface at the call level + + 4.1 Data structures shared by the kernel and Venus + 4.2 The pioctl interface + 4.3 root + 4.4 lookup + 4.5 getattr + 4.6 setattr + 4.7 access + 4.8 create + 4.9 mkdir + 4.10 link + 4.11 symlink + 4.12 remove + 4.13 rmdir + 4.14 readlink + 4.15 open + 4.16 close + 4.17 ioctl + 4.18 rename + 4.19 readdir + 4.20 vget + 4.21 fsync + 4.22 inactive + 4.23 rdwr + 4.24 odymount + 4.25 ody_lookup + 4.26 ody_expand + 4.27 prefetch + 4.28 signal + + 5. The minicache and downcalls + + 5.1 INVALIDATE + 5.2 FLUSH + 5.3 PURGEUSER + 5.4 ZAPFILE + 5.5 ZAPDIR + 5.6 ZAPVNODE + 5.7 PURGEFID + 5.8 REPLACE + + 6. Initialization and cleanup + + 6.1 Requirements + + + ______________________________________________________________________ + 0wpage + + 11.. IInnttrroodduuccttiioonn + + + + A key component in the Coda Distributed File System is the cache + manager, _V_e_n_u_s. + + + When processes on a Coda enabled system access files in the Coda + filesystem, requests are directed at the filesystem layer in the + operating system. The operating system will communicate with Venus to + service the request for the process. Venus manages a persistent + client cache and makes remote procedure calls to Coda file servers and + related servers (such as authentication servers) to service these + requests it receives from the operating system. When Venus has + serviced a request it replies to the operating system with appropriate + return codes, and other data related to the request. Optionally the + kernel support for Coda may maintain a minicache of recently processed + requests to limit the number of interactions with Venus. Venus + possesses the facility to inform the kernel when elements from its + minicache are no longer valid. + + This document describes precisely this communication between the + kernel and Venus. The definitions of so called upcalls and downcalls + will be given with the format of the data they handle. We shall also + describe the semantic invariants resulting from the calls. + + Historically Coda was implemented in a BSD file system in Mach 2.6. + The interface between the kernel and Venus is very similar to the BSD + VFS interface. Similar functionality is provided, and the format of + the parameters and returned data is very similar to the BSD VFS. This + leads to an almost natural environment for implementing a kernel-level + filesystem driver for Coda in a BSD system. However, other operating + systems such as Linux and Windows 95 and NT have virtual filesystem + with different interfaces. + + To implement Coda on these systems some reverse engineering of the + Venus/Kernel protocol is necessary. Also it came to light that other + systems could profit significantly from certain small optimizations + and modifications to the protocol. To facilitate this work as well as + to make future ports easier, communication between Venus and the + kernel should be documented in great detail. This is the aim of this + document. + + 0wpage + + 22.. SSeerrvviicciinngg CCooddaa ffiilleessyysstteemm ccaallllss + + The service of a request for a Coda file system service originates in + a process PP which accessing a Coda file. It makes a system call which + traps to the OS kernel. Examples of such calls trapping to the kernel + are _r_e_a_d_, _w_r_i_t_e_, _o_p_e_n_, _c_l_o_s_e_, _c_r_e_a_t_e_, _m_k_d_i_r_, _r_m_d_i_r_, _c_h_m_o_d in a Unix + context. Similar calls exist in the Win32 environment, and are named + _C_r_e_a_t_e_F_i_l_e_, . + + Generally the operating system handles the request in a virtual + filesystem (VFS) layer, which is named I/O Manager in NT and IFS + manager in Windows 95. The VFS is responsible for partial processing + of the request and for locating the specific filesystem(s) which will + service parts of the request. Usually the information in the path + assists in locating the correct FS drivers. Sometimes after extensive + pre-processing, the VFS starts invoking exported routines in the FS + driver. This is the point where the FS specific processing of the + request starts, and here the Coda specific kernel code comes into + play. + + The FS layer for Coda must expose and implement several interfaces. + First and foremost the VFS must be able to make all necessary calls to + the Coda FS layer, so the Coda FS driver must expose the VFS interface + as applicable in the operating system. These differ very significantly + among operating systems, but share features such as facilities to + read/write and create and remove objects. The Coda FS layer services + such VFS requests by invoking one or more well defined services + offered by the cache manager Venus. When the replies from Venus have + come back to the FS driver, servicing of the VFS call continues and + finishes with a reply to the kernel's VFS. Finally the VFS layer + returns to the process. + + As a result of this design a basic interface exposed by the FS driver + must allow Venus to manage message traffic. In particular Venus must + be able to retrieve and place messages and to be notified of the + arrival of a new message. The notification must be through a mechanism + which does not block Venus since Venus must attend to other tasks even + when no messages are waiting or being processed. + + + + + + + Interfaces of the Coda FS Driver + + Furthermore the FS layer provides for a special path of communication + between a user process and Venus, called the pioctl interface. The + pioctl interface is used for Coda specific services, such as + requesting detailed information about the persistent cache managed by + Venus. Here the involvement of the kernel is minimal. It identifies + the calling process and passes the information on to Venus. When + Venus replies the response is passed back to the caller in unmodified + form. + + Finally Venus allows the kernel FS driver to cache the results from + certain services. This is done to avoid excessive context switches + and results in an efficient system. However, Venus may acquire + information, for example from the network which implies that cached + information must be flushed or replaced. Venus then makes a downcall + to the Coda FS layer to request flushes or updates in the cache. The + kernel FS driver handles such requests synchronously. + + Among these interfaces the VFS interface and the facility to place, + receive and be notified of messages are platform specific. We will + not go into the calls exported to the VFS layer but we will state the + requirements of the message exchange mechanism. + + 0wpage + + 33.. TThhee mmeessssaaggee llaayyeerr + + + + At the lowest level the communication between Venus and the FS driver + proceeds through messages. The synchronization between processes + requesting Coda file service and Venus relies on blocking and waking + up processes. The Coda FS driver processes VFS- and pioctl-requests + on behalf of a process P, creates messages for Venus, awaits replies + and finally returns to the caller. The implementation of the exchange + of messages is platform specific, but the semantics have (so far) + appeared to be generally applicable. Data buffers are created by the + FS Driver in kernel memory on behalf of P and copied to user memory in + Venus. + + The FS Driver while servicing P makes upcalls to Venus. Such an + upcall is dispatched to Venus by creating a message structure. The + structure contains the identification of P, the message sequence + number, the size of the request and a pointer to the data in kernel + memory for the request. Since the data buffer is re-used to hold the + reply from Venus, there is a field for the size of the reply. A flags + field is used in the message to precisely record the status of the + message. Additional platform dependent structures involve pointers to + determine the position of the message on queues and pointers to + synchronization objects. In the upcall routine the message structure + is filled in, flags are set to 0, and it is placed on the _p_e_n_d_i_n_g + queue. The routine calling upcall is responsible for allocating the + data buffer; its structure will be described in the next section. + + A facility must exist to notify Venus that the message has been + created, and implemented using available synchronization objects in + the OS. This notification is done in the upcall context of the process + P. When the message is on the pending queue, process P cannot proceed + in upcall. The (kernel mode) processing of P in the filesystem + request routine must be suspended until Venus has replied. Therefore + the calling thread in P is blocked in upcall. A pointer in the + message structure will locate the synchronization object on which P is + sleeping. + + Venus detects the notification that a message has arrived, and the FS + driver allow Venus to retrieve the message with a getmsg_from_kernel + call. This action finishes in the kernel by putting the message on the + queue of processing messages and setting flags to READ. Venus is + passed the contents of the data buffer. The getmsg_from_kernel call + now returns and Venus processes the request. + + At some later point the FS driver receives a message from Venus, + namely when Venus calls sendmsg_to_kernel. At this moment the Coda FS + driver looks at the contents of the message and decides if: + + + +o the message is a reply for a suspended thread P. If so it removes + the message from the processing queue and marks the message as + WRITTEN. Finally, the FS driver unblocks P (still in the kernel + mode context of Venus) and the sendmsg_to_kernel call returns to + Venus. The process P will be scheduled at some point and continues + processing its upcall with the data buffer replaced with the reply + from Venus. + + +o The message is a _d_o_w_n_c_a_l_l. A downcall is a request from Venus to + the FS Driver. The FS driver processes the request immediately + (usually a cache eviction or replacement) and when it finishes + sendmsg_to_kernel returns. + + Now P awakes and continues processing upcall. There are some + subtleties to take account of. First P will determine if it was woken + up in upcall by a signal from some other source (for example an + attempt to terminate P) or as is normally the case by Venus in its + sendmsg_to_kernel call. In the normal case, the upcall routine will + deallocate the message structure and return. The FS routine can proceed + with its processing. + + + + + + + + Sleeping and IPC arrangements + + In case P is woken up by a signal and not by Venus, it will first look + at the flags field. If the message is not yet READ, the process P can + handle its signal without notifying Venus. If Venus has READ, and + the request should not be processed, P can send Venus a signal message + to indicate that it should disregard the previous message. Such + signals are put in the queue at the head, and read first by Venus. If + the message is already marked as WRITTEN it is too late to stop the + processing. The VFS routine will now continue. (-- If a VFS request + involves more than one upcall, this can lead to complicated state, an + extra field "handle_signals" could be added in the message structure + to indicate points of no return have been passed.--) + + + + 33..11.. IImmpplleemmeennttaattiioonn ddeettaaiillss + + The Unix implementation of this mechanism has been through the + implementation of a character device associated with Coda. Venus + retrieves messages by doing a read on the device, replies are sent + with a write and notification is through the select system call on the + file descriptor for the device. The process P is kept waiting on an + interruptible wait queue object. + + In Windows NT and the DPMI Windows 95 implementation a DeviceIoControl + call is used. The DeviceIoControl call is designed to copy buffers + from user memory to kernel memory with OPCODES. The sendmsg_to_kernel + is issued as a synchronous call, while the getmsg_from_kernel call is + asynchronous. Windows EventObjects are used for notification of + message arrival. The process P is kept waiting on a KernelEvent + object in NT and a semaphore in Windows 95. + + 0wpage + + 44.. TThhee iinntteerrffaaccee aatt tthhee ccaallll lleevveell + + + This section describes the upcalls a Coda FS driver can make to Venus. + Each of these upcalls make use of two structures: inputArgs and + outputArgs. In pseudo BNF form the structures take the following + form: + + + struct inputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_short pid; /* Common to all */ + u_short pgid; /* Common to all */ + struct CodaCred cred; /* Common to all */ + + <union "in" of call dependent parts of inputArgs> + }; + + struct outputArgs { + u_long opcode; + u_long unique; /* Keep multiple outstanding msgs distinct */ + u_long result; + + <union "out" of call dependent parts of inputArgs> + }; + + + + Before going on let us elucidate the role of the various fields. The + inputArgs start with the opcode which defines the type of service + requested from Venus. There are approximately 30 upcalls at present + which we will discuss. The unique field labels the inputArg with a + unique number which will identify the message uniquely. A process and + process group id are passed. Finally the credentials of the caller + are included. + + Before delving into the specific calls we need to discuss a variety of + data structures shared by the kernel and Venus. + + + + + 44..11.. DDaattaa ssttrruuccttuurreess sshhaarreedd bbyy tthhee kkeerrnneell aanndd VVeennuuss + + + The CodaCred structure defines a variety of user and group ids as + they are set for the calling process. The vuid_t and guid_t are 32 bit + unsigned integers. It also defines group membership in an array. On + Unix the CodaCred has proven sufficient to implement good security + semantics for Coda but the structure may have to undergo modification + for the Windows environment when these mature. + + struct CodaCred { + vuid_t cr_uid, cr_euid, cr_suid, cr_fsuid; /* Real, effective, set, fs uid*/ + vgid_t cr_gid, cr_egid, cr_sgid, cr_fsgid; /* same for groups */ + vgid_t cr_groups[NGROUPS]; /* Group membership for caller */ + }; + + + + NNOOTTEE It is questionable if we need CodaCreds in Venus. Finally Venus + doesn't know about groups, although it does create files with the + default uid/gid. Perhaps the list of group membership is superfluous. + + + The next item is the fundamental identifier used to identify Coda + files, the ViceFid. A fid of a file uniquely defines a file or + directory in the Coda filesystem within a _c_e_l_l. (-- A _c_e_l_l is a + group of Coda servers acting under the aegis of a single system + control machine or SCM. See the Coda Administration manual for a + detailed description of the role of the SCM.--) + + + typedef struct ViceFid { + VolumeId Volume; + VnodeId Vnode; + Unique_t Unique; + } ViceFid; + + + + Each of the constituent fields: VolumeId, VnodeId and Unique_t are + unsigned 32 bit integers. We envisage that a further field will need + to be prefixed to identify the Coda cell; this will probably take the + form of a Ipv6 size IP address naming the Coda cell through DNS. + + The next important structure shared between Venus and the kernel is + the attributes of the file. The following structure is used to + exchange information. It has room for future extensions such as + support for device files (currently not present in Coda). + + + + + + + + + + + + + + + + + + + struct coda_vattr { + enum coda_vtype va_type; /* vnode type (for create) */ + u_short va_mode; /* files access mode and type */ + short va_nlink; /* number of references to file */ + vuid_t va_uid; /* owner user id */ + vgid_t va_gid; /* owner group id */ + long va_fsid; /* file system id (dev for now) */ + long va_fileid; /* file id */ + u_quad_t va_size; /* file size in bytes */ + long va_blocksize; /* blocksize preferred for i/o */ + struct timespec va_atime; /* time of last access */ + struct timespec va_mtime; /* time of last modification */ + struct timespec va_ctime; /* time file changed */ + u_long va_gen; /* generation number of file */ + u_long va_flags; /* flags defined for file */ + dev_t va_rdev; /* device special file represents */ + u_quad_t va_bytes; /* bytes of disk space held by file */ + u_quad_t va_filerev; /* file modification number */ + u_int va_vaflags; /* operations flags, see below */ + long va_spare; /* remain quad aligned */ + }; + + + + + 44..22.. TThhee ppiiooccttll iinntteerrffaaccee + + + Coda specific requests can be made by application through the pioctl + interface. The pioctl is implemented as an ordinary ioctl on a + fictitious file /coda/.CONTROL. The pioctl call opens this file, gets + a file handle and makes the ioctl call. Finally it closes the file. + + The kernel involvement in this is limited to providing the facility to + open and close and pass the ioctl message _a_n_d to verify that a path in + the pioctl data buffers is a file in a Coda filesystem. + + The kernel is handed a data packet of the form: + + struct { + const char *path; + struct ViceIoctl vidata; + int follow; + } data; + + + + where + + + struct ViceIoctl { + caddr_t in, out; /* Data to be transferred in, or out */ + short in_size; /* Size of input buffer <= 2K */ + short out_size; /* Maximum size of output buffer, <= 2K */ + }; + + + + The path must be a Coda file, otherwise the ioctl upcall will not be + made. + + NNOOTTEE The data structures and code are a mess. We need to clean this + up. + + We now proceed to document the individual calls: + + 0wpage + + 44..33.. rroooott + + + AArrgguummeennttss + + iinn empty + + oouutt + + struct cfs_root_out { + ViceFid VFid; + } cfs_root; + + + + DDeessccrriippttiioonn This call is made to Venus during the initialization of + the Coda filesystem. If the result is zero, the cfs_root structure + contains the ViceFid of the root of the Coda filesystem. If a non-zero + result is generated, its value is a platform dependent error code + indicating the difficulty Venus encountered in locating the root of + the Coda filesystem. + + 0wpage + + 44..44.. llooookkuupp + + + SSuummmmaarryy Find the ViceFid and type of an object in a directory if it + exists. + + AArrgguummeennttss + + iinn + + struct cfs_lookup_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_lookup; + + + + oouutt + + struct cfs_lookup_out { + ViceFid VFid; + int vtype; + } cfs_lookup; + + + + DDeessccrriippttiioonn This call is made to determine the ViceFid and filetype of + a directory entry. The directory entry requested carries name name + and Venus will search the directory identified by cfs_lookup_in.VFid. + The result may indicate that the name does not exist, or that + difficulty was encountered in finding it (e.g. due to disconnection). + If the result is zero, the field cfs_lookup_out.VFid contains the + targets ViceFid and cfs_lookup_out.vtype the coda_vtype giving the + type of object the name designates. + + The name of the object is an 8 bit character string of maximum length + CFS_MAXNAMLEN, currently set to 256 (including a 0 terminator.) + + It is extremely important to realize that Venus bitwise ors the field + cfs_lookup.vtype with CFS_NOCACHE to indicate that the object should + not be put in the kernel name cache. + + NNOOTTEE The type of the vtype is currently wrong. It should be + coda_vtype. Linux does not take note of CFS_NOCACHE. It should. + + 0wpage + + 44..55.. ggeettaattttrr + + + SSuummmmaarryy Get the attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_getattr_in { + ViceFid VFid; + struct coda_vattr attr; /* XXXXX */ + } cfs_getattr; + + + + oouutt + + struct cfs_getattr_out { + struct coda_vattr attr; + } cfs_getattr; + + + + DDeessccrriippttiioonn This call returns the attributes of the file identified by + fid. + + EErrrroorrss Errors can occur if the object with fid does not exist, is + unaccessible or if the caller does not have permission to fetch + attributes. + + NNoottee Many kernel FS drivers (Linux, NT and Windows 95) need to acquire + the attributes as well as the Fid for the instantiation of an internal + "inode" or "FileHandle". A significant improvement in performance on + such systems could be made by combining the _l_o_o_k_u_p and _g_e_t_a_t_t_r calls + both at the Venus/kernel interaction level and at the RPC level. + + The vattr structure included in the input arguments is superfluous and + should be removed. + + 0wpage + + 44..66.. sseettaattttrr + + + SSuummmmaarryy Set the attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_setattr_in { + ViceFid VFid; + struct coda_vattr attr; + } cfs_setattr; + + + + + oouutt + empty + + DDeessccrriippttiioonn The structure attr is filled with attributes to be changed + in BSD style. Attributes not to be changed are set to -1, apart from + vtype which is set to VNON. Other are set to the value to be assigned. + The only attributes which the FS driver may request to change are the + mode, owner, groupid, atime, mtime and ctime. The return value + indicates success or failure. + + EErrrroorrss A variety of errors can occur. The object may not exist, may + be inaccessible, or permission may not be granted by Venus. + + 0wpage + + 44..77.. aacccceessss + + + SSuummmmaarryy + + AArrgguummeennttss + + iinn + + struct cfs_access_in { + ViceFid VFid; + int flags; + } cfs_access; + + + + oouutt + empty + + DDeessccrriippttiioonn Verify if access to the object identified by VFid for + operations described by flags is permitted. The result indicates if + access will be granted. It is important to remember that Coda uses + ACLs to enforce protection and that ultimately the servers, not the + clients enforce the security of the system. The result of this call + will depend on whether a _t_o_k_e_n is held by the user. + + EErrrroorrss The object may not exist, or the ACL describing the protection + may not be accessible. + + 0wpage + + 44..88.. ccrreeaattee + + + SSuummmmaarryy Invoked to create a file + + AArrgguummeennttss + + iinn + + struct cfs_create_in { + ViceFid VFid; + struct coda_vattr attr; + int excl; + int mode; + char *name; /* Place holder for data. */ + } cfs_create; + + + + + oouutt + + struct cfs_create_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_create; + + + + DDeessccrriippttiioonn This upcall is invoked to request creation of a file. + The file will be created in the directory identified by VFid, its name + will be name, and the mode will be mode. If excl is set an error will + be returned if the file already exists. If the size field in attr is + set to zero the file will be truncated. The uid and gid of the file + are set by converting the CodaCred to a uid using a macro CRTOUID + (this macro is platform dependent). Upon success the VFid and + attributes of the file are returned. The Coda FS Driver will normally + instantiate a vnode, inode or file handle at kernel level for the new + object. + + + EErrrroorrss A variety of errors can occur. Permissions may be insufficient. + If the object exists and is not a file the error EISDIR is returned + under Unix. + + NNOOTTEE The packing of parameters is very inefficient and appears to + indicate confusion between the system call creat and the VFS operation + create. The VFS operation create is only called to create new objects. + This create call differs from the Unix one in that it is not invoked + to return a file descriptor. The truncate and exclusive options, + together with the mode, could simply be part of the mode as it is + under Unix. There should be no flags argument; this is used in open + (2) to return a file descriptor for READ or WRITE mode. + + The attributes of the directory should be returned too, since the size + and mtime changed. + + 0wpage + + 44..99.. mmkkddiirr + + + SSuummmmaarryy Create a new directory. + + AArrgguummeennttss + + iinn + + struct cfs_mkdir_in { + ViceFid VFid; + struct coda_vattr attr; + char *name; /* Place holder for data. */ + } cfs_mkdir; + + + + oouutt + + struct cfs_mkdir_out { + ViceFid VFid; + struct coda_vattr attr; + } cfs_mkdir; + + + + + DDeessccrriippttiioonn This call is similar to create but creates a directory. + Only the mode field in the input parameters is used for creation. + Upon successful creation, the attr returned contains the attributes of + the new directory. + + EErrrroorrss As for create. + + NNOOTTEE The input parameter should be changed to mode instead of + attributes. + + The attributes of the parent should be returned since the size and + mtime changes. + + 0wpage + + 44..1100.. lliinnkk + + + SSuummmmaarryy Create a link to an existing file. + + AArrgguummeennttss + + iinn + + struct cfs_link_in { + ViceFid sourceFid; /* cnode to link *to* */ + ViceFid destFid; /* Directory in which to place link */ + char *tname; /* Place holder for data. */ + } cfs_link; + + + + oouutt + empty + + DDeessccrriippttiioonn This call creates a link to the sourceFid in the directory + identified by destFid with name tname. The source must reside in the + target's parent, i.e. the source must be have parent destFid, i.e. Coda + does not support cross directory hard links. Only the return value is + relevant. It indicates success or the type of failure. + + EErrrroorrss The usual errors can occur.0wpage + + 44..1111.. ssyymmlliinnkk + + + SSuummmmaarryy create a symbolic link + + AArrgguummeennttss + + iinn + + struct cfs_symlink_in { + ViceFid VFid; /* Directory to put symlink in */ + char *srcname; + struct coda_vattr attr; + char *tname; + } cfs_symlink; + + + + oouutt + none + + DDeessccrriippttiioonn Create a symbolic link. The link is to be placed in the + directory identified by VFid and named tname. It should point to the + pathname srcname. The attributes of the newly created object are to + be set to attr. + + EErrrroorrss + + NNOOTTEE The attributes of the target directory should be returned since + its size changed. + + 0wpage + + 44..1122.. rreemmoovvee + + + SSuummmmaarryy Remove a file + + AArrgguummeennttss + + iinn + + struct cfs_remove_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_remove; + + + + oouutt + none + + DDeessccrriippttiioonn Remove file named cfs_remove_in.name in directory + identified by VFid. + + EErrrroorrss + + NNOOTTEE The attributes of the directory should be returned since its + mtime and size may change. + + 0wpage + + 44..1133.. rrmmddiirr + + + SSuummmmaarryy Remove a directory + + AArrgguummeennttss + + iinn + + struct cfs_rmdir_in { + ViceFid VFid; + char *name; /* Place holder for data. */ + } cfs_rmdir; + + + + oouutt + none + + DDeessccrriippttiioonn Remove the directory with name name from the directory + identified by VFid. + + EErrrroorrss + + NNOOTTEE The attributes of the parent directory should be returned since + its mtime and size may change. + + 0wpage + + 44..1144.. rreeaaddlliinnkk + + + SSuummmmaarryy Read the value of a symbolic link. + + AArrgguummeennttss + + iinn + + struct cfs_readlink_in { + ViceFid VFid; + } cfs_readlink; + + + + oouutt + + struct cfs_readlink_out { + int count; + caddr_t data; /* Place holder for data. */ + } cfs_readlink; + + + + DDeessccrriippttiioonn This routine reads the contents of symbolic link + identified by VFid into the buffer data. The buffer data must be able + to hold any name up to CFS_MAXNAMLEN (PATH or NAM??). + + EErrrroorrss No unusual errors. + + 0wpage + + 44..1155.. ooppeenn + + + SSuummmmaarryy Open a file. + + AArrgguummeennttss + + iinn + + struct cfs_open_in { + ViceFid VFid; + int flags; + } cfs_open; + + + + oouutt + + struct cfs_open_out { + dev_t dev; + ino_t inode; + } cfs_open; + + + + DDeessccrriippttiioonn This request asks Venus to place the file identified by + VFid in its cache and to note that the calling process wishes to open + it with flags as in open(2). The return value to the kernel differs + for Unix and Windows systems. For Unix systems the Coda FS Driver is + informed of the device and inode number of the container file in the + fields dev and inode. For Windows the path of the container file is + returned to the kernel. + EErrrroorrss + + NNOOTTEE Currently the cfs_open_out structure is not properly adapted to + deal with the Windows case. It might be best to implement two + upcalls, one to open aiming at a container file name, the other at a + container file inode. + + 0wpage + + 44..1166.. cclloossee + + + SSuummmmaarryy Close a file, update it on the servers. + + AArrgguummeennttss + + iinn + + struct cfs_close_in { + ViceFid VFid; + int flags; + } cfs_close; + + + + oouutt + none + + DDeessccrriippttiioonn Close the file identified by VFid. + + EErrrroorrss + + NNOOTTEE The flags argument is bogus and not used. However, Venus' code + has room to deal with an execp input field, probably this field should + be used to inform Venus that the file was closed but is still memory + mapped for execution. There are comments about fetching versus not + fetching the data in Venus vproc_vfscalls. This seems silly. If a + file is being closed, the data in the container file is to be the new + data. Here again the execp flag might be in play to create confusion: + currently Venus might think a file can be flushed from the cache when + it is still memory mapped. This needs to be understood. + + 0wpage + + 44..1177.. iiooccttll + + + SSuummmmaarryy Do an ioctl on a file. This includes the pioctl interface. + + AArrgguummeennttss + + iinn + + struct cfs_ioctl_in { + ViceFid VFid; + int cmd; + int len; + int rwflag; + char *data; /* Place holder for data. */ + } cfs_ioctl; + + + + oouutt + + + struct cfs_ioctl_out { + int len; + caddr_t data; /* Place holder for data. */ + } cfs_ioctl; + + + + DDeessccrriippttiioonn Do an ioctl operation on a file. The command, len and + data arguments are filled as usual. flags is not used by Venus. + + EErrrroorrss + + NNOOTTEE Another bogus parameter. flags is not used. What is the + business about PREFETCHING in the Venus code? + + + 0wpage + + 44..1188.. rreennaammee + + + SSuummmmaarryy Rename a fid. + + AArrgguummeennttss + + iinn + + struct cfs_rename_in { + ViceFid sourceFid; + char *srcname; + ViceFid destFid; + char *destname; + } cfs_rename; + + + + oouutt + none + + DDeessccrriippttiioonn Rename the object with name srcname in directory + sourceFid to destname in destFid. It is important that the names + srcname and destname are 0 terminated strings. Strings in Unix + kernels are not always null terminated. + + EErrrroorrss + + 0wpage + + 44..1199.. rreeaaddddiirr + + + SSuummmmaarryy Read directory entries. + + AArrgguummeennttss + + iinn + + struct cfs_readdir_in { + ViceFid VFid; + int count; + int offset; + } cfs_readdir; + + + + + oouutt + + struct cfs_readdir_out { + int size; + caddr_t data; /* Place holder for data. */ + } cfs_readdir; + + + + DDeessccrriippttiioonn Read directory entries from VFid starting at offset and + read at most count bytes. Returns the data in data and returns + the size in size. + + EErrrroorrss + + NNOOTTEE This call is not used. Readdir operations exploit container + files. We will re-evaluate this during the directory revamp which is + about to take place. + + 0wpage + + 44..2200.. vvggeett + + + SSuummmmaarryy instructs Venus to do an FSDB->Get. + + AArrgguummeennttss + + iinn + + struct cfs_vget_in { + ViceFid VFid; + } cfs_vget; + + + + oouutt + + struct cfs_vget_out { + ViceFid VFid; + int vtype; + } cfs_vget; + + + + DDeessccrriippttiioonn This upcall asks Venus to do a get operation on an fsobj + labelled by VFid. + + EErrrroorrss + + NNOOTTEE This operation is not used. However, it is extremely useful + since it can be used to deal with read/write memory mapped files. + These can be "pinned" in the Venus cache using vget and released with + inactive. + + 0wpage + + 44..2211.. ffssyynncc + + + SSuummmmaarryy Tell Venus to update the RVM attributes of a file. + + AArrgguummeennttss + + iinn + + struct cfs_fsync_in { + ViceFid VFid; + } cfs_fsync; + + + + oouutt + none + + DDeessccrriippttiioonn Ask Venus to update RVM attributes of object VFid. This + should be called as part of kernel level fsync type calls. The + result indicates if the syncing was successful. + + EErrrroorrss + + NNOOTTEE Linux does not implement this call. It should. + + 0wpage + + 44..2222.. iinnaaccttiivvee + + + SSuummmmaarryy Tell Venus a vnode is no longer in use. + + AArrgguummeennttss + + iinn + + struct cfs_inactive_in { + ViceFid VFid; + } cfs_inactive; + + + + oouutt + none + + DDeessccrriippttiioonn This operation returns EOPNOTSUPP. + + EErrrroorrss + + NNOOTTEE This should perhaps be removed. + + 0wpage + + 44..2233.. rrddwwrr + + + SSuummmmaarryy Read or write from a file + + AArrgguummeennttss + + iinn + + struct cfs_rdwr_in { + ViceFid VFid; + int rwflag; + int count; + int offset; + int ioflag; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + + oouutt + + struct cfs_rdwr_out { + int rwflag; + int count; + caddr_t data; /* Place holder for data. */ + } cfs_rdwr; + + + + DDeessccrriippttiioonn This upcall asks Venus to read or write from a file. + + EErrrroorrss + + NNOOTTEE It should be removed since it is against the Coda philosophy that + read/write operations never reach Venus. I have been told the + operation does not work. It is not currently used. + + + 0wpage + + 44..2244.. ooddyymmoouunntt + + + SSuummmmaarryy Allows mounting multiple Coda "filesystems" on one Unix mount + point. + + AArrgguummeennttss + + iinn + + struct ody_mount_in { + char *name; /* Place holder for data. */ + } ody_mount; + + + + oouutt + + struct ody_mount_out { + ViceFid VFid; + } ody_mount; + + + + DDeessccrriippttiioonn Asks Venus to return the rootfid of a Coda system named + name. The fid is returned in VFid. + + EErrrroorrss + + NNOOTTEE This call was used by David for dynamic sets. It should be + removed since it causes a jungle of pointers in the VFS mounting area. + It is not used by Coda proper. Call is not implemented by Venus. + + 0wpage + + 44..2255.. ooddyy__llooookkuupp + + + SSuummmmaarryy Looks up something. + + AArrgguummeennttss + + iinn irrelevant + + + oouutt + irrelevant + + DDeessccrriippttiioonn + + EErrrroorrss + + NNOOTTEE Gut it. Call is not implemented by Venus. + + 0wpage + + 44..2266.. ooddyy__eexxppaanndd + + + SSuummmmaarryy expands something in a dynamic set. + + AArrgguummeennttss + + iinn irrelevant + + oouutt + irrelevant + + DDeessccrriippttiioonn + + EErrrroorrss + + NNOOTTEE Gut it. Call is not implemented by Venus. + + 0wpage + + 44..2277.. pprreeffeettcchh + + + SSuummmmaarryy Prefetch a dynamic set. + + AArrgguummeennttss + + iinn Not documented. + + oouutt + Not documented. + + DDeessccrriippttiioonn Venus worker.cc has support for this call, although it is + noted that it doesn't work. Not surprising, since the kernel does not + have support for it. (ODY_PREFETCH is not a defined operation). + + EErrrroorrss + + NNOOTTEE Gut it. It isn't working and isn't used by Coda. + + + 0wpage + + 44..2288.. ssiiggnnaall + + + SSuummmmaarryy Send Venus a signal about an upcall. + + AArrgguummeennttss + + iinn none + + oouutt + not applicable. + + DDeessccrriippttiioonn This is an out-of-band upcall to Venus to inform Venus + that the calling process received a signal after Venus read the + message from the input queue. Venus is supposed to clean up the + operation. + + EErrrroorrss No reply is given. + + NNOOTTEE We need to better understand what Venus needs to clean up and if + it is doing this correctly. Also we need to handle multiple upcall + per system call situations correctly. It would be important to know + what state changes in Venus take place after an upcall for which the + kernel is responsible for notifying Venus to clean up (e.g. open + definitely is such a state change, but many others are maybe not). + + 0wpage + + 55.. TThhee mmiinniiccaacchhee aanndd ddoowwnnccaallllss + + + The Coda FS Driver can cache results of lookup and access upcalls, to + limit the frequency of upcalls. Upcalls carry a price since a process + context switch needs to take place. The counterpart of caching the + information is that Venus will notify the FS Driver that cached + entries must be flushed or renamed. + + The kernel code generally has to maintain a structure which links the + internal file handles (called vnodes in BSD, inodes in Linux and + FileHandles in Windows) with the ViceFid's which Venus maintains. The + reason is that frequent translations back and forth are needed in + order to make upcalls and use the results of upcalls. Such linking + objects are called ccnnooddeess. + + The current minicache implementations have cache entries which record + the following: + + 1. the name of the file + + 2. the cnode of the directory containing the object + + 3. a list of CodaCred's for which the lookup is permitted. + + 4. the cnode of the object + + The lookup call in the Coda FS Driver may request the cnode of the + desired object from the cache, by passing its name, directory and the + CodaCred's of the caller. The cache will return the cnode or indicate + that it cannot be found. The Coda FS Driver must be careful to + invalidate cache entries when it modifies or removes objects. + + When Venus obtains information that indicates that cache entries are + no longer valid, it will make a downcall to the kernel. Downcalls are + intercepted by the Coda FS Driver and lead to cache invalidations of + the kind described below. The Coda FS Driver does not return an error + unless the downcall data could not be read into kernel memory. + + + 55..11.. IINNVVAALLIIDDAATTEE + + + No information is available on this call. + + + 55..22.. FFLLUUSSHH + + + + AArrgguummeennttss None + + SSuummmmaarryy Flush the name cache entirely. + + DDeessccrriippttiioonn Venus issues this call upon startup and when it dies. This + is to prevent stale cache information being held. Some operating + systems allow the kernel name cache to be switched off dynamically. + When this is done, this downcall is made. + + + 55..33.. PPUURRGGEEUUSSEERR + + + AArrgguummeennttss + + struct cfs_purgeuser_out {/* CFS_PURGEUSER is a venus->kernel call */ + struct CodaCred cred; + } cfs_purgeuser; + + + + DDeessccrriippttiioonn Remove all entries in the cache carrying the Cred. This + call is issued when tokens for a user expire or are flushed. + + + 55..44.. ZZAAPPFFIILLEE + + + AArrgguummeennttss + + struct cfs_zapfile_out { /* CFS_ZAPFILE is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapfile; + + + + DDeessccrriippttiioonn Remove all entries which have the (dir vnode, name) pair. + This is issued as a result of an invalidation of cached attributes of + a vnode. + + NNOOTTEE Call is not named correctly in NetBSD and Mach. The minicache + zapfile routine takes different arguments. Linux does not implement + the invalidation of attributes correctly. + + + + 55..55.. ZZAAPPDDIIRR + + + AArrgguummeennttss + + struct cfs_zapdir_out { /* CFS_ZAPDIR is a venus->kernel call */ + ViceFid CodaFid; + } cfs_zapdir; + + + + DDeessccrriippttiioonn Remove all entries in the cache lying in a directory + CodaFid, and all children of this directory. This call is issued when + Venus receives a callback on the directory. + + + 55..66.. ZZAAPPVVNNOODDEE + + + + AArrgguummeennttss + + struct cfs_zapvnode_out { /* CFS_ZAPVNODE is a venus->kernel call */ + struct CodaCred cred; + ViceFid VFid; + } cfs_zapvnode; + + + + DDeessccrriippttiioonn Remove all entries in the cache carrying the cred and VFid + as in the arguments. This downcall is probably never issued. + + + 55..77.. PPUURRGGEEFFIIDD + + + SSuummmmaarryy + + AArrgguummeennttss + + struct cfs_purgefid_out { /* CFS_PURGEFID is a venus->kernel call */ + ViceFid CodaFid; + } cfs_purgefid; + + + + DDeessccrriippttiioonn Flush the attribute for the file. If it is a dir (odd + vnode), purge its children from the namecache and remove the file from the + namecache. + + + + 55..88.. RREEPPLLAACCEE + + + SSuummmmaarryy Replace the Fid's for a collection of names. + + AArrgguummeennttss + + struct cfs_replace_out { /* cfs_replace is a venus->kernel call */ + ViceFid NewFid; + ViceFid OldFid; + } cfs_replace; + + + + DDeessccrriippttiioonn This routine replaces a ViceFid in the name cache with + another. It is added to allow Venus during reintegration to replace + locally allocated temp fids while disconnected with global fids even + when the reference counts on those fids are not zero. + + 0wpage + + 66.. IInniittiiaalliizzaattiioonn aanndd cclleeaannuupp + + + This section gives brief hints as to desirable features for the Coda + FS Driver at startup and upon shutdown or Venus failures. Before + entering the discussion it is useful to repeat that the Coda FS Driver + maintains the following data: + + + 1. message queues + + 2. cnodes + + 3. name cache entries + + The name cache entries are entirely private to the driver, so they + can easily be manipulated. The message queues will generally have + clear points of initialization and destruction. The cnodes are + much more delicate. User processes hold reference counts in Coda + filesystems and it can be difficult to clean up the cnodes. + + It can expect requests through: + + 1. the message subsystem + + 2. the VFS layer + + 3. pioctl interface + + Currently the _p_i_o_c_t_l passes through the VFS for Coda so we can + treat these similarly. + + + 66..11.. RReeqquuiirreemmeennttss + + + The following requirements should be accommodated: + + 1. The message queues should have open and close routines. On Unix + the opening of the character devices are such routines. + + +o Before opening, no messages can be placed. + + +o Opening will remove any old messages still pending. + + +o Close will notify any sleeping processes that their upcall cannot + be completed. + + +o Close will free all memory allocated by the message queues. + + + 2. At open the namecache shall be initialized to empty state. + + 3. Before the message queues are open, all VFS operations will fail. + Fortunately this can be achieved by making sure than mounting the + Coda filesystem cannot succeed before opening. + + 4. After closing of the queues, no VFS operations can succeed. Here + one needs to be careful, since a few operations (lookup, + read/write, readdir) can proceed without upcalls. These must be + explicitly blocked. + + 5. Upon closing the namecache shall be flushed and disabled. + + 6. All memory held by cnodes can be freed without relying on upcalls. + + 7. Unmounting the file system can be done without relying on upcalls. + + 8. Mounting the Coda filesystem should fail gracefully if Venus cannot + get the rootfid or the attributes of the rootfid. The latter is + best implemented by Venus fetching these objects before attempting + to mount. + + NNOOTTEE NetBSD in particular but also Linux have not implemented the + above requirements fully. For smooth operation this needs to be + corrected. + + + diff --git a/Documentation/filesystems/configfs/Makefile b/Documentation/filesystems/configfs/Makefile new file mode 100644 index 0000000..be7ec5e --- /dev/null +++ b/Documentation/filesystems/configfs/Makefile @@ -0,0 +1,3 @@ +ifneq ($(CONFIG_CONFIGFS_FS),) +obj-m += configfs_example_explicit.o configfs_example_macros.o +endif diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt new file mode 100644 index 0000000..fabcb0e --- /dev/null +++ b/Documentation/filesystems/configfs/configfs.txt @@ -0,0 +1,484 @@ + +configfs - Userspace-driven kernel object configuration. + +Joel Becker <joel.becker@oracle.com> + +Updated: 31 March 2005 + +Copyright (c) 2005 Oracle Corporation, + Joel Becker <joel.becker@oracle.com> + + +[What is configfs?] + +configfs is a ram-based filesystem that provides the converse of +sysfs's functionality. Where sysfs is a filesystem-based view of +kernel objects, configfs is a filesystem-based manager of kernel +objects, or config_items. + +With sysfs, an object is created in kernel (for example, when a device +is discovered) and it is registered with sysfs. Its attributes then +appear in sysfs, allowing userspace to read the attributes via +readdir(3)/read(2). It may allow some attributes to be modified via +write(2). The important point is that the object is created and +destroyed in kernel, the kernel controls the lifecycle of the sysfs +representation, and sysfs is merely a window on all this. + +A configfs config_item is created via an explicit userspace operation: +mkdir(2). It is destroyed via rmdir(2). The attributes appear at +mkdir(2) time, and can be read or modified via read(2) and write(2). +As with sysfs, readdir(3) queries the list of items and/or attributes. +symlink(2) can be used to group items together. Unlike sysfs, the +lifetime of the representation is completely driven by userspace. The +kernel modules backing the items must respond to this. + +Both sysfs and configfs can and should exist together on the same +system. One is not a replacement for the other. + +[Using configfs] + +configfs can be compiled as a module or into the kernel. You can access +it by doing + + mount -t configfs none /config + +The configfs tree will be empty unless client modules are also loaded. +These are modules that register their item types with configfs as +subsystems. Once a client subsystem is loaded, it will appear as a +subdirectory (or more than one) under /config. Like sysfs, the +configfs tree is always there, whether mounted on /config or not. + +An item is created via mkdir(2). The item's attributes will also +appear at this time. readdir(3) can determine what the attributes are, +read(2) can query their default values, and write(2) can store new +values. Like sysfs, attributes should be ASCII text files, preferably +with only one value per file. The same efficiency caveats from sysfs +apply. Don't mix more than one attribute in one attribute file. + +Like sysfs, configfs expects write(2) to store the entire buffer at +once. When writing to configfs attributes, userspace processes should +first read the entire file, modify the portions they wish to change, and +then write the entire buffer back. Attribute files have a maximum size +of one page (PAGE_SIZE, 4096 on i386). + +When an item needs to be destroyed, remove it with rmdir(2). An +item cannot be destroyed if any other item has a link to it (via +symlink(2)). Links can be removed via unlink(2). + +[Configuring FakeNBD: an Example] + +Imagine there's a Network Block Device (NBD) driver that allows you to +access remote block devices. Call it FakeNBD. FakeNBD uses configfs +for its configuration. Obviously, there will be a nice program that +sysadmins use to configure FakeNBD, but somehow that program has to tell +the driver about it. Here's where configfs comes in. + +When the FakeNBD driver is loaded, it registers itself with configfs. +readdir(3) sees this just fine: + + # ls /config + fakenbd + +A fakenbd connection can be created with mkdir(2). The name is +arbitrary, but likely the tool will make some use of the name. Perhaps +it is a uuid or a disk name: + + # mkdir /config/fakenbd/disk1 + # ls /config/fakenbd/disk1 + target device rw + +The target attribute contains the IP address of the server FakeNBD will +connect to. The device attribute is the device on the server. +Predictably, the rw attribute determines whether the connection is +read-only or read-write. + + # echo 10.0.0.1 > /config/fakenbd/disk1/target + # echo /dev/sda1 > /config/fakenbd/disk1/device + # echo 1 > /config/fakenbd/disk1/rw + +That's it. That's all there is. Now the device is configured, via the +shell no less. + +[Coding With configfs] + +Every object in configfs is a config_item. A config_item reflects an +object in the subsystem. It has attributes that match values on that +object. configfs handles the filesystem representation of that object +and its attributes, allowing the subsystem to ignore all but the +basic show/store interaction. + +Items are created and destroyed inside a config_group. A group is a +collection of items that share the same attributes and operations. +Items are created by mkdir(2) and removed by rmdir(2), but configfs +handles that. The group has a set of operations to perform these tasks + +A subsystem is the top level of a client module. During initialization, +the client module registers the subsystem with configfs, the subsystem +appears as a directory at the top of the configfs filesystem. A +subsystem is also a config_group, and can do everything a config_group +can. + +[struct config_item] + + struct config_item { + char *ci_name; + char ci_namebuf[UOBJ_NAME_LEN]; + struct kref ci_kref; + struct list_head ci_entry; + struct config_item *ci_parent; + struct config_group *ci_group; + struct config_item_type *ci_type; + struct dentry *ci_dentry; + }; + + void config_item_init(struct config_item *); + void config_item_init_type_name(struct config_item *, + const char *name, + struct config_item_type *type); + struct config_item *config_item_get(struct config_item *); + void config_item_put(struct config_item *); + +Generally, struct config_item is embedded in a container structure, a +structure that actually represents what the subsystem is doing. The +config_item portion of that structure is how the object interacts with +configfs. + +Whether statically defined in a source file or created by a parent +config_group, a config_item must have one of the _init() functions +called on it. This initializes the reference count and sets up the +appropriate fields. + +All users of a config_item should have a reference on it via +config_item_get(), and drop the reference when they are done via +config_item_put(). + +By itself, a config_item cannot do much more than appear in configfs. +Usually a subsystem wants the item to display and/or store attributes, +among other things. For that, it needs a type. + +[struct config_item_type] + + struct configfs_item_operations { + void (*release)(struct config_item *); + ssize_t (*show_attribute)(struct config_item *, + struct configfs_attribute *, + char *); + ssize_t (*store_attribute)(struct config_item *, + struct configfs_attribute *, + const char *, size_t); + int (*allow_link)(struct config_item *src, + struct config_item *target); + int (*drop_link)(struct config_item *src, + struct config_item *target); + }; + + struct config_item_type { + struct module *ct_owner; + struct configfs_item_operations *ct_item_ops; + struct configfs_group_operations *ct_group_ops; + struct configfs_attribute **ct_attrs; + }; + +The most basic function of a config_item_type is to define what +operations can be performed on a config_item. All items that have been +allocated dynamically will need to provide the ct_item_ops->release() +method. This method is called when the config_item's reference count +reaches zero. Items that wish to display an attribute need to provide +the ct_item_ops->show_attribute() method. Similarly, storing a new +attribute value uses the store_attribute() method. + +[struct configfs_attribute] + + struct configfs_attribute { + char *ca_name; + struct module *ca_owner; + mode_t ca_mode; + }; + +When a config_item wants an attribute to appear as a file in the item's +configfs directory, it must define a configfs_attribute describing it. +It then adds the attribute to the NULL-terminated array +config_item_type->ct_attrs. When the item appears in configfs, the +attribute file will appear with the configfs_attribute->ca_name +filename. configfs_attribute->ca_mode specifies the file permissions. + +If an attribute is readable and the config_item provides a +ct_item_ops->show_attribute() method, that method will be called +whenever userspace asks for a read(2) on the attribute. The converse +will happen for write(2). + +[struct config_group] + +A config_item cannot live in a vacuum. The only way one can be created +is via mkdir(2) on a config_group. This will trigger creation of a +child item. + + struct config_group { + struct config_item cg_item; + struct list_head cg_children; + struct configfs_subsystem *cg_subsys; + struct config_group **default_groups; + }; + + void config_group_init(struct config_group *group); + void config_group_init_type_name(struct config_group *group, + const char *name, + struct config_item_type *type); + + +The config_group structure contains a config_item. Properly configuring +that item means that a group can behave as an item in its own right. +However, it can do more: it can create child items or groups. This is +accomplished via the group operations specified on the group's +config_item_type. + + struct configfs_group_operations { + struct config_item *(*make_item)(struct config_group *group, + const char *name); + struct config_group *(*make_group)(struct config_group *group, + const char *name); + int (*commit_item)(struct config_item *item); + void (*disconnect_notify)(struct config_group *group, + struct config_item *item); + void (*drop_item)(struct config_group *group, + struct config_item *item); + }; + +A group creates child items by providing the +ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new +config_item (or more likely, its container structure), initializes it, +and returns it to configfs. Configfs will then populate the filesystem +tree to reflect the new item. + +If the subsystem wants the child to be a group itself, the subsystem +provides ct_group_ops->make_group(). Everything else behaves the same, +using the group _init() functions on the group. + +Finally, when userspace calls rmdir(2) on the item or group, +ct_group_ops->drop_item() is called. As a config_group is also a +config_item, it is not necessary for a separate drop_group() method. +The subsystem must config_item_put() the reference that was initialized +upon item allocation. If a subsystem has no work to do, it may omit +the ct_group_ops->drop_item() method, and configfs will call +config_item_put() on the item on behalf of the subsystem. + +IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) +is called, configfs WILL remove the item from the filesystem tree +(assuming that it has no children to keep it busy). The subsystem is +responsible for responding to this. If the subsystem has references to +the item in other threads, the memory is safe. It may take some time +for the item to actually disappear from the subsystem's usage. But it +is gone from configfs. + +When drop_item() is called, the item's linkage has already been torn +down. It no longer has a reference on its parent and has no place in +the item hierarchy. If a client needs to do some cleanup before this +teardown happens, the subsystem can implement the +ct_group_ops->disconnect_notify() method. The method is called after +configfs has removed the item from the filesystem view but before the +item is removed from its parent group. Like drop_item(), +disconnect_notify() is void and cannot fail. Client subsystems should +not drop any references here, as they still must do it in drop_item(). + +A config_group cannot be removed while it still has child items. This +is implemented in the configfs rmdir(2) code. ->drop_item() will not be +called, as the item has not been dropped. rmdir(2) will fail, as the +directory is not empty. + +[struct configfs_subsystem] + +A subsystem must register itself, usually at module_init time. This +tells configfs to make the subsystem appear in the file tree. + + struct configfs_subsystem { + struct config_group su_group; + struct mutex su_mutex; + }; + + int configfs_register_subsystem(struct configfs_subsystem *subsys); + void configfs_unregister_subsystem(struct configfs_subsystem *subsys); + + A subsystem consists of a toplevel config_group and a mutex. +The group is where child config_items are created. For a subsystem, +this group is usually defined statically. Before calling +configfs_register_subsystem(), the subsystem must have initialized the +group via the usual group _init() functions, and it must also have +initialized the mutex. + When the register call returns, the subsystem is live, and it +will be visible via configfs. At that point, mkdir(2) can be called and +the subsystem must be ready for it. + +[An Example] + +The best example of these basic concepts is the simple_children +subsystem/group and the simple_child item in configfs_example_explicit.c +and configfs_example_macros.c. It shows a trivial object displaying and +storing an attribute, and a simple group creating and destroying these +children. + +The only difference between configfs_example_explicit.c and +configfs_example_macros.c is how the attributes of the childless item +are defined. The childless item has extended attributes, each with +their own show()/store() operation. This follows a convention commonly +used in sysfs. configfs_example_explicit.c creates these attributes +by explicitly defining the structures involved. Conversely +configfs_example_macros.c uses some convenience macros from configfs.h +to define the attributes. These macros are similar to their sysfs +counterparts. + +[Hierarchy Navigation and the Subsystem Mutex] + +There is an extra bonus that configfs provides. The config_groups and +config_items are arranged in a hierarchy due to the fact that they +appear in a filesystem. A subsystem is NEVER to touch the filesystem +parts, but the subsystem might be interested in this hierarchy. For +this reason, the hierarchy is mirrored via the config_group->cg_children +and config_item->ci_parent structure members. + +A subsystem can navigate the cg_children list and the ci_parent pointer +to see the tree created by the subsystem. This can race with configfs' +management of the hierarchy, so configfs uses the subsystem mutex to +protect modifications. Whenever a subsystem wants to navigate the +hierarchy, it must do so under the protection of the subsystem +mutex. + +A subsystem will be prevented from acquiring the mutex while a newly +allocated item has not been linked into this hierarchy. Similarly, it +will not be able to acquire the mutex while a dropping item has not +yet been unlinked. This means that an item's ci_parent pointer will +never be NULL while the item is in configfs, and that an item will only +be in its parent's cg_children list for the same duration. This allows +a subsystem to trust ci_parent and cg_children while they hold the +mutex. + +[Item Aggregation Via symlink(2)] + +configfs provides a simple group via the group->item parent/child +relationship. Often, however, a larger environment requires aggregation +outside of the parent/child connection. This is implemented via +symlink(2). + +A config_item may provide the ct_item_ops->allow_link() and +ct_item_ops->drop_link() methods. If the ->allow_link() method exists, +symlink(2) may be called with the config_item as the source of the link. +These links are only allowed between configfs config_items. Any +symlink(2) attempt outside the configfs filesystem will be denied. + +When symlink(2) is called, the source config_item's ->allow_link() +method is called with itself and a target item. If the source item +allows linking to target item, it returns 0. A source item may wish to +reject a link if it only wants links to a certain type of object (say, +in its own subsystem). + +When unlink(2) is called on the symbolic link, the source item is +notified via the ->drop_link() method. Like the ->drop_item() method, +this is a void function and cannot return failure. The subsystem is +responsible for responding to the change. + +A config_item cannot be removed while it links to any other item, nor +can it be removed while an item links to it. Dangling symlinks are not +allowed in configfs. + +[Automatically Created Subgroups] + +A new config_group may want to have two types of child config_items. +While this could be codified by magic names in ->make_item(), it is much +more explicit to have a method whereby userspace sees this divergence. + +Rather than have a group where some items behave differently than +others, configfs provides a method whereby one or many subgroups are +automatically created inside the parent at its creation. Thus, +mkdir("parent") results in "parent", "parent/subgroup1", up through +"parent/subgroupN". Items of type 1 can now be created in +"parent/subgroup1", and items of type N can be created in +"parent/subgroupN". + +These automatic subgroups, or default groups, do not preclude other +children of the parent group. If ct_group_ops->make_group() exists, +other child groups can be created on the parent group directly. + +A configfs subsystem specifies default groups by filling in the +NULL-terminated array default_groups on the config_group structure. +Each group in that array is populated in the configfs tree at the same +time as the parent group. Similarly, they are removed at the same time +as the parent. No extra notification is provided. When a ->drop_item() +method call notifies the subsystem the parent group is going away, it +also means every default group child associated with that parent group. + +As a consequence of this, default_groups cannot be removed directly via +rmdir(2). They also are not considered when rmdir(2) on the parent +group is checking for children. + +[Dependant Subsystems] + +Sometimes other drivers depend on particular configfs items. For +example, ocfs2 mounts depend on a heartbeat region item. If that +region item is removed with rmdir(2), the ocfs2 mount must BUG or go +readonly. Not happy. + +configfs provides two additional API calls: configfs_depend_item() and +configfs_undepend_item(). A client driver can call +configfs_depend_item() on an existing item to tell configfs that it is +depended on. configfs will then return -EBUSY from rmdir(2) for that +item. When the item is no longer depended on, the client driver calls +configfs_undepend_item() on it. + +These API cannot be called underneath any configfs callbacks, as +they will conflict. They can block and allocate. A client driver +probably shouldn't calling them of its own gumption. Rather it should +be providing an API that external subsystems call. + +How does this work? Imagine the ocfs2 mount process. When it mounts, +it asks for a heartbeat region item. This is done via a call into the +heartbeat code. Inside the heartbeat code, the region item is looked +up. Here, the heartbeat code calls configfs_depend_item(). If it +succeeds, then heartbeat knows the region is safe to give to ocfs2. +If it fails, it was being torn down anyway, and heartbeat can gracefully +pass up an error. + +[Committable Items] + +NOTE: Committable items are currently unimplemented. + +Some config_items cannot have a valid initial state. That is, no +default values can be specified for the item's attributes such that the +item can do its work. Userspace must configure one or more attributes, +after which the subsystem can start whatever entity this item +represents. + +Consider the FakeNBD device from above. Without a target address *and* +a target device, the subsystem has no idea what block device to import. +The simple example assumes that the subsystem merely waits until all the +appropriate attributes are configured, and then connects. This will, +indeed, work, but now every attribute store must check if the attributes +are initialized. Every attribute store must fire off the connection if +that condition is met. + +Far better would be an explicit action notifying the subsystem that the +config_item is ready to go. More importantly, an explicit action allows +the subsystem to provide feedback as to whether the attributes are +initialized in a way that makes sense. configfs provides this as +committable items. + +configfs still uses only normal filesystem operations. An item is +committed via rename(2). The item is moved from a directory where it +can be modified to a directory where it cannot. + +Any group that provides the ct_group_ops->commit_item() method has +committable items. When this group appears in configfs, mkdir(2) will +not work directly in the group. Instead, the group will have two +subdirectories: "live" and "pending". The "live" directory does not +support mkdir(2) or rmdir(2) either. It only allows rename(2). The +"pending" directory does allow mkdir(2) and rmdir(2). An item is +created in the "pending" directory. Its attributes can be modified at +will. Userspace commits the item by renaming it into the "live" +directory. At this point, the subsystem receives the ->commit_item() +callback. If all required attributes are filled to satisfaction, the +method returns zero and the item is moved to the "live" directory. + +As rmdir(2) does not work in the "live" directory, an item must be +shutdown, or "uncommitted". Again, this is done via rename(2), this +time from the "live" directory back to the "pending" one. The subsystem +is notified by the ct_group_ops->uncommit_object() method. + + diff --git a/Documentation/filesystems/configfs/configfs_example_explicit.c b/Documentation/filesystems/configfs/configfs_example_explicit.c new file mode 100644 index 0000000..d428cc9 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example_explicit.c @@ -0,0 +1,485 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example_explicit.c - This file is a demonstration module + * containing a number of configfs subsystems. It explicitly defines + * each structure without using the helper macros defined in + * configfs.h. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <linux/configfs.h> + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +struct childless_attribute { + struct configfs_attribute attr; + ssize_t (*show)(struct childless *, char *); + ssize_t (*store)(struct childless *, const char *, size_t); +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +static struct childless_attribute childless_attr_showme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO }, + .show = childless_showme_read, +}; +static struct childless_attribute childless_attr_storeme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR }, + .show = childless_storeme_read, + .store = childless_storeme_write, +}; +static struct childless_attribute childless_attr_description = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO }, + .show = childless_description_read, +}; + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +static ssize_t childless_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = 0; + + if (childless_attr->show) + ret = childless_attr->show(childless, page); + return ret; +} + +static ssize_t childless_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = -EINVAL; + + if (childless_attr->store) + ret = childless_attr->store(childless, page, count); + return ret; +} + +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return ERR_PTR(-ENOMEM); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + +static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kzalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return ERR_PTR(-ENOMEM); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. + */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + mutex_init(&subsys->su_mutex); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (; i >= 0; i--) { + configfs_unregister_subsystem(example_subsys[i]); + } + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) { + configfs_unregister_subsystem(example_subsys[i]); + } +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff --git a/Documentation/filesystems/configfs/configfs_example_macros.c b/Documentation/filesystems/configfs/configfs_example_macros.c new file mode 100644 index 0000000..d8e30a0 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example_macros.c @@ -0,0 +1,448 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example_macros.c - This file is a demonstration module + * containing a number of configfs subsystems. It uses the helper + * macros defined by configfs.h + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <linux/configfs.h> + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +CONFIGFS_ATTR_STRUCT(childless); +#define CHILDLESS_ATTR(_name, _mode, _show, _store) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR(_name, _mode, _show, _store) +#define CHILDLESS_ATTR_RO(_name, _show) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR_RO(_name, _show); + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +CHILDLESS_ATTR_RO(showme, childless_showme_read); +CHILDLESS_ATTR(storeme, S_IRUGO | S_IWUSR, childless_storeme_read, + childless_storeme_write); +CHILDLESS_ATTR_RO(description, childless_description_read); + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +CONFIGFS_ATTR_OPS(childless); +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return ERR_PTR(-ENOMEM); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + +static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kzalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return ERR_PTR(-ENOMEM); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. + */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + mutex_init(&subsys->su_mutex); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (; i >= 0; i--) { + configfs_unregister_subsystem(example_subsys[i]); + } + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) { + configfs_unregister_subsystem(example_subsys[i]); + } +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff --git a/Documentation/filesystems/cramfs.txt b/Documentation/filesystems/cramfs.txt new file mode 100644 index 0000000..31f53f0 --- /dev/null +++ b/Documentation/filesystems/cramfs.txt @@ -0,0 +1,76 @@ + + Cramfs - cram a filesystem onto a small ROM + +cramfs is designed to be simple and small, and to compress things well. + +It uses the zlib routines to compress a file one page at a time, and +allows random page access. The meta-data is not compressed, but is +expressed in a very terse representation to make it use much less +diskspace than traditional filesystems. + +You can't write to a cramfs filesystem (making it compressible and +compact also makes it _very_ hard to update on-the-fly), so you have to +create the disk image with the "mkcramfs" utility. + + +Usage Notes +----------- + +File sizes are limited to less than 16MB. + +Maximum filesystem size is a little over 256MB. (The last file on the +filesystem is allowed to extend past 256MB.) + +Only the low 8 bits of gid are stored. The current version of +mkcramfs simply truncates to 8 bits, which is a potential security +issue. + +Hard links are supported, but hard linked files +will still have a link count of 1 in the cramfs image. + +Cramfs directories have no `.' or `..' entries. Directories (like +every other file on cramfs) always have a link count of 1. (There's +no need to use -noleaf in `find', btw.) + +No timestamps are stored in a cramfs, so these default to the epoch +(1970 GMT). Recently-accessed files may have updated timestamps, but +the update lasts only as long as the inode is cached in memory, after +which the timestamp reverts to 1970, i.e. moves backwards in time. + +Currently, cramfs must be written and read with architectures of the +same endianness, and can be read only by kernels with PAGE_CACHE_SIZE +== 4096. At least the latter of these is a bug, but it hasn't been +decided what the best fix is. For the moment if you have larger pages +you can just change the #define in mkcramfs.c, so long as you don't +mind the filesystem becoming unreadable to future kernels. + + +For /usr/share/magic +-------------------- + +0 ulelong 0x28cd3d45 Linux cramfs offset 0 +>4 ulelong x size %d +>8 ulelong x flags 0x%x +>12 ulelong x future 0x%x +>16 string >\0 signature "%.16s" +>32 ulelong x fsid.crc 0x%x +>36 ulelong x fsid.edition %d +>40 ulelong x fsid.blocks %d +>44 ulelong x fsid.files %d +>48 string >\0 name "%.16s" +512 ulelong 0x28cd3d45 Linux cramfs offset 512 +>516 ulelong x size %d +>520 ulelong x flags 0x%x +>524 ulelong x future 0x%x +>528 string >\0 signature "%.16s" +>544 ulelong x fsid.crc 0x%x +>548 ulelong x fsid.edition %d +>552 ulelong x fsid.blocks %d +>556 ulelong x fsid.files %d +>560 string >\0 name "%.16s" + + +Hacker Notes +------------ + +See fs/cramfs/README for filesystem layout and implementation notes. diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt new file mode 100644 index 0000000..4c0c575 --- /dev/null +++ b/Documentation/filesystems/dentry-locking.txt @@ -0,0 +1,173 @@ +RCU-based dcache locking model +============================== + +On many workloads, the most common operation on dcache is to look up a +dentry, given a parent dentry and the name of the child. Typically, +for every open(), stat() etc., the dentry corresponding to the +pathname will be looked up by walking the tree starting with the first +component of the pathname and using that dentry along with the next +component to look up the next level and so on. Since it is a frequent +operation for workloads like multiuser environments and web servers, +it is important to optimize this path. + +Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in +every component during path look-up. Since 2.5.10 onwards, fast-walk +algorithm changed this by holding the dcache_lock at the beginning and +walking as many cached path component dentries as possible. This +significantly decreases the number of acquisition of +dcache_lock. However it also increases the lock hold time +significantly and affects performance in large SMP machines. Since +2.5.62 kernel, dcache has been using a new locking model that uses RCU +to make dcache look-up lock-free. + +The current dcache locking model is not very different from the +existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock +protected the hash chain, d_child, d_alias, d_lru lists as well as +d_inode and several other things like mount look-up. RCU-based changes +affect only the way the hash chain is protected. For everything else +the dcache_lock must be taken for both traversing as well as +updating. The hash chain updates too take the dcache_lock. The +significant change is the way d_lookup traverses the hash chain, it +doesn't acquire the dcache_lock for this and rely on RCU to ensure +that the dentry has not been *freed*. + + +Dcache locking details +====================== + +For many multi-user workloads, open() and stat() on files are very +frequently occurring operations. Both involve walking of path names to +find the dentry corresponding to the concerned file. In 2.4 kernel, +dcache_lock was held during look-up of each path component. Contention +and cache-line bouncing of this global lock caused significant +scalability problems. With the introduction of RCU in Linux kernel, +this was worked around by making the look-up of path components during +path walking lock-free. + + +Safe lock-free look-up of dcache hash table +=========================================== + +Dcache is a complex data structure with the hash table entries also +linked together in other lists. In 2.4 kernel, dcache_lock protected +all the lists. We applied RCU only on hash chain walking. The rest of +the lists are still protected by dcache_lock. Some of the important +changes are : + +1. The deletion from hash chain is done using hlist_del_rcu() macro + which doesn't initialize next pointer of the deleted dentry and + this allows us to walk safely lock-free while a deletion is + happening. + +2. Insertion of a dentry into the hash table is done using + hlist_add_head_rcu() which take care of ordering the writes - the + writes to the dentry must be visible before the dentry is + inserted. This works in conjunction with hlist_for_each_rcu() while + walking the hash chain. The only requirement is that all + initialization to the dentry must be done before + hlist_add_head_rcu() since we don't have dcache_lock protection + while traversing the hash chain. This isn't different from the + existing code. + +3. The dentry looked up without holding dcache_lock by cannot be + returned for walking if it is unhashed. It then may have a NULL + d_inode or other bogosity since RCU doesn't protect the other + fields in the dentry. We therefore use a flag DCACHE_UNHASHED to + indicate unhashed dentries and use this in conjunction with a + per-dentry lock (d_lock). Once looked up without the dcache_lock, + we acquire the per-dentry lock (d_lock) and check if the dentry is + unhashed. If so, the look-up is failed. If not, the reference count + of the dentry is increased and the dentry is returned. + +4. Once a dentry is looked up, it must be ensured during the path walk + for that component it doesn't go away. In pre-2.5.10 code, this was + done holding a reference to the dentry. dcache_rcu does the same. + In some sense, dcache_rcu path walking looks like the pre-2.5.10 + version. + +5. All dentry hash chain updates must take the dcache_lock as well as + the per-dentry lock in that order. dput() does this to ensure that + a dentry that has just been looked up in another CPU doesn't get + deleted before dget() can be done on it. + +6. There are several ways to do reference counting of RCU protected + objects. One such example is in ipv4 route cache where deferred + freeing (using call_rcu()) is done as soon as the reference count + goes to zero. This cannot be done in the case of dentries because + tearing down of dentries require blocking (dentry_iput()) which + isn't supported from RCU callbacks. Instead, tearing down of + dentries happen synchronously in dput(), but actual freeing happens + later when RCU grace period is over. This allows safe lock-free + walking of the hash chains, but a matched dentry may have been + partially torn down. The checking of DCACHE_UNHASHED flag with + d_lock held detects such dentries and prevents them from being + returned from look-up. + + +Maintaining POSIX rename semantics +================================== + +Since look-up of dentries is lock-free, it can race against a +concurrent rename operation. For example, during rename of file A to +B, look-up of either A or B must succeed. So, if look-up of B happens +after A has been removed from the hash chain but not added to the new +hash chain, it may fail. Also, a comparison while the name is being +written concurrently by a rename may result in false positive matches +violating rename semantics. Issues related to race with rename are +handled as described below : + +1. Look-up can be done in two ways - d_lookup() which is safe from + simultaneous renames and __d_lookup() which is not. If + __d_lookup() fails, it must be followed up by a d_lookup() to + correctly determine whether a dentry is in the hash table or + not. d_lookup() protects look-ups using a sequence lock + (rename_lock). + +2. The name associated with a dentry (d_name) may be changed if a + rename is allowed to happen simultaneously. To avoid memcmp() in + __d_lookup() go out of bounds due to a rename and false positive + comparison, the name comparison is done while holding the + per-dentry lock. This prevents concurrent renames during this + operation. + +3. Hash table walking during look-up may move to a different bucket as + the current dentry is moved to a different bucket due to rename. + But we use hlists in dcache hash table and they are + null-terminated. So, even if a dentry moves to a different bucket, + hash chain walk will terminate. [with a list_head list, it may not + since termination is when the list_head in the original bucket is + reached]. Since we redo the d_parent check and compare name while + holding d_lock, lock-free look-up will not race against d_move(). + +4. There can be a theoretical race when a dentry keeps coming back to + original bucket due to double moves. Due to this look-up may + consider that it has never moved and can end up in a infinite loop. + But this is not any worse that theoretical livelocks we already + have in the kernel. + + +Important guidelines for filesystem developers related to dcache_rcu +==================================================================== + +1. Existing dcache interfaces (pre-2.5.62) exported to filesystem + don't change. Only dcache internal implementation changes. However + filesystems *must not* delete from the dentry hash chains directly + using the list macros like allowed earlier. They must use dcache + APIs like d_drop() or __d_drop() depending on the situation. + +2. d_flags is now protected by a per-dentry lock (d_lock). All access + to d_flags must be protected by it. + +3. For a hashed dentry, checking of d_count needs to be protected by + d_lock. + + +Papers and other documentation on dcache locking +================================================ + +1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). + +2. http://lse.sourceforge.net/locking/dcache/dcache.html + + + diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking new file mode 100644 index 0000000..ff7b611 --- /dev/null +++ b/Documentation/filesystems/directory-locking @@ -0,0 +1,114 @@ + Locking scheme used for directory operations is based on two +kinds of locks - per-inode (->i_mutex) and per-filesystem +(->s_vfs_rename_mutex). + + For our purposes all operations fall in 5 classes: + +1) read access. Locking rules: caller locks directory we are accessing. + +2) object creation. Locking rules: same as above. + +3) object removal. Locking rules: caller locks parent, finds victim, +locks victim and calls the method. + +4) rename() that is _not_ cross-directory. Locking rules: caller locks +the parent, finds source and target, if target already exists - locks it +and then calls the method. + +5) link creation. Locking rules: + * lock parent + * check that source is not a directory + * lock source + * call the method. + +6) cross-directory rename. The trickiest in the whole bunch. Locking +rules: + * lock the filesystem + * lock parents in "ancestors first" order. + * find source and target. + * if old parent is equal to or is a descendent of target + fail with -ENOTEMPTY + * if new parent is equal to or is a descendent of source + fail with -ELOOP + * if target exists - lock it. + * call the method. + + +The rules above obviously guarantee that all directories that are going to be +read, modified or removed by method will be locked by caller. + + +If no directory is its own ancestor, the scheme above is deadlock-free. +Proof: + + First of all, at any moment we have a partial ordering of the +objects - A < B iff A is an ancestor of B. + + That ordering can change. However, the following is true: + +(1) if object removal or non-cross-directory rename holds lock on A and + attempts to acquire lock on B, A will remain the parent of B until we + acquire the lock on B. (Proof: only cross-directory rename can change + the parent of object and it would have to lock the parent). + +(2) if cross-directory rename holds the lock on filesystem, order will not + change until rename acquires all locks. (Proof: other cross-directory + renames will be blocked on filesystem lock and we don't start changing + the order until we had acquired all locks). + +(3) any operation holds at most one lock on non-directory object and + that lock is acquired after all other locks. (Proof: see descriptions + of operations). + + Now consider the minimal deadlock. Each process is blocked on +attempt to acquire some lock and already holds at least one lock. Let's +consider the set of contended locks. First of all, filesystem lock is +not contended, since any process blocked on it is not holding any locks. +Thus all processes are blocked on ->i_mutex. + + Non-directory objects are not contended due to (3). Thus link +creation can't be a part of deadlock - it can't be blocked on source +and it means that it doesn't hold any locks. + + Any contended object is either held by cross-directory rename or +has a child that is also contended. Indeed, suppose that it is held by +operation other than cross-directory rename. Then the lock this operation +is blocked on belongs to child of that object due to (1). + + It means that one of the operations is cross-directory rename. +Otherwise the set of contended objects would be infinite - each of them +would have a contended child and we had assumed that no object is its +own descendent. Moreover, there is exactly one cross-directory rename +(see above). + + Consider the object blocking the cross-directory rename. One +of its descendents is locked by cross-directory rename (otherwise we +would again have an infinite set of contended objects). But that +means that cross-directory rename is taking locks out of order. Due +to (2) the order hadn't changed since we had acquired filesystem lock. +But locking rules for cross-directory rename guarantee that we do not +try to acquire lock on descendent before the lock on ancestor. +Contradiction. I.e. deadlock is impossible. Q.E.D. + + + These operations are guaranteed to avoid loop creation. Indeed, +the only operation that could introduce loops is cross-directory rename. +Since the only new (parent, child) pair added by rename() is (new parent, +source), such loop would have to contain these objects and the rest of it +would have to exist before rename(). I.e. at the moment of loop creation +rename() responsible for that would be holding filesystem lock and new parent +would have to be equal to or a descendent of source. But that means that +new parent had been equal to or a descendent of source since the moment when +we had acquired filesystem lock and rename() would fail with -ELOOP in that +case. + + While this locking scheme works for arbitrary DAGs, it relies on +ability to check that directory is a descendent of another object. Current +implementation assumes that directory graph is a tree. This assumption is +also preserved by all operations (cross-directory rename on a tree that would +not introduce a cycle will leave it a tree and link() fails for directories). + + Notice that "directory" in the above == "anything that might have +children", so if we are going to introduce hybrid objects we will need +either to make sure that link(2) doesn't work for them or to make changes +in is_subdir() that would make it work even in presence of such beasts. diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt new file mode 100644 index 0000000..c50bbb2 --- /dev/null +++ b/Documentation/filesystems/dlmfs.txt @@ -0,0 +1,130 @@ +dlmfs +================== +A minimal DLM userspace interface implemented via a virtual file +system. + +dlmfs is built with OCFS2 as it requires most of its infrastructure. + +Project web page: http://oss.oracle.com/projects/ocfs2 +Tools web page: http://oss.oracle.com/projects/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +CREDITS +======= + +Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds +and Transmeta Corp. + +Mark Fasheh <mark.fasheh@oracle.com> + +Caveats +======= +- Right now it only works with the OCFS2 DLM, though support for other + DLM implementations should not be a major issue. + +Mount options +============= +None + +Usage +===== + +If you're just interested in OCFS2, then please see ocfs2.txt. The +rest of this document will be geared towards those who want to use +dlmfs for easy to setup and easy to use clustered locking in +userspace. + +Setup +===== + +dlmfs requires that the OCFS2 cluster infrastructure be in +place. Please download ocfs2-tools from the above url and configure a +cluster. + +You'll want to start heartbeating on a volume which all the nodes in +your lockspace can access. The easiest way to do this is via +ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires +that an OCFS2 file system be in place so that it can automatically +find it's heartbeat area, though it will eventually support heartbeat +against raw disks. + +Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed +with ocfs2-tools. + +Once you're heartbeating, DLM lock 'domains' can be easily created / +destroyed and locks within them accessed. + +Locking +======= + +Users may access dlmfs via standard file system calls, or they can use +'libo2dlm' (distributed with ocfs2-tools) which abstracts the file +system calls and presents a more traditional locking api. + +dlmfs handles lock caching automatically for the user, so a lock +request for an already acquired lock will not generate another DLM +call. Userspace programs are assumed to handle their own local +locking. + +Two levels of locks are supported - Shared Read, and Exclusive. +Also supported is a Trylock operation. + +For information on the libo2dlm interface, please see o2dlm.h, +distributed with ocfs2-tools. + +Lock value blocks can be read and written to a resource via read(2) +and write(2) against the fd obtained via your open(2) call. The +maximum currently supported LVB length is 64 bytes (though that is an +OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share +small amounts of data amongst their nodes. + +mkdir(2) signals dlmfs to join a domain (which will have the same name +as the resulting directory) + +rmdir(2) signals dlmfs to leave the domain + +Locks for a given domain are represented by regular inodes inside the +domain directory. Locking against them is done via the open(2) system +call. + +The open(2) call will not return until your lock has been granted or +an error has occurred, unless it has been instructed to do a trylock +operation. If the lock succeeds, you'll get an fd. + +open(2) with O_CREAT to ensure the resource inode is created - dlmfs does +not automatically create inodes for existing lock resources. + +Open Flag Lock Request Type +--------- ----------------- +O_RDONLY Shared Read +O_RDWR Exclusive + +Open Flag Resulting Locking Behavior +--------- -------------------------- +O_NONBLOCK Trylock operation + +You must provide exactly one of O_RDONLY or O_RDWR. + +If O_NONBLOCK is also provided and the trylock operation was valid but +could not lock the resource then open(2) will return ETXTBUSY. + +close(2) drops the lock associated with your fd. + +Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is +supported locally as well. This means you can use them to restrict +access to the resources via dlmfs on your local node only. + +The resource LVB may be read from the fd in either Shared Read or +Exclusive modes via the read(2) system call. It can be written via +write(2) only when open in Exclusive mode. + +Once written, an LVB will be visible to other nodes who obtain Read +Only or higher level locks on the resource. + +See Also +======== +http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf + +For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt new file mode 100644 index 0000000..9f5d338 --- /dev/null +++ b/Documentation/filesystems/dnotify.txt @@ -0,0 +1,99 @@ + Linux Directory Notification + ============================ + + Stephen Rothwell <sfr@canb.auug.org.au> + +The intention of directory notification is to allow user applications +to be notified when a directory, or any of the files in it, are changed. +The basic mechanism involves the application registering for notification +on a directory using a fcntl(2) call and the notifications themselves +being delivered using signals. + +The application decides which "events" it wants to be notified about. +The currently defined events are: + + DN_ACCESS A file in the directory was accessed (read) + DN_MODIFY A file in the directory was modified (write,truncate) + DN_CREATE A file was created in the directory + DN_DELETE A file was unlinked from directory + DN_RENAME A file in the directory was renamed + DN_ATTRIB A file in the directory had its attributes + changed (chmod,chown) + +Usually, the application must reregister after each notification, but +if DN_MULTISHOT is or'ed with the event mask, then the registration will +remain until explicitly removed (by registering for no events). + +By default, SIGIO will be delivered to the process and no other useful +information. However, if the F_SETSIG fcntl(2) call is used to let the +kernel know which signal to deliver, a siginfo structure will be passed to +the signal handler and the si_fd member of that structure will contain the +file descriptor associated with the directory in which the event occurred. + +Preferably the application will choose one of the real time signals +(SIGRTMIN + <n>) so that the notifications may be queued. This is +especially important if DN_MULTISHOT is specified. Note that SIGRTMIN +is often blocked, so it is better to use (at least) SIGRTMIN + 1. + +Implementation expectations (features and bugs :-)) +--------------------------- + +The notification should work for any local access to files even if the +actual file system is on a remote server. This implies that remote +access to files served by local user mode servers should be notified. +Also, remote accesses to files served by a local kernel NFS server should +be notified. + +In order to make the impact on the file system code as small as possible, +the problem of hard links to files has been ignored. So if a file (x) +exists in two directories (a and b) then a change to the file using the +name "a/x" should be notified to a program expecting notifications on +directory "a", but will not be notified to one expecting notifications on +directory "b". + +Also, files that are unlinked, will still cause notifications in the +last directory that they were linked to. + +Configuration +------------- + +Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When +disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. + +Example +------- + + #define _GNU_SOURCE /* needed to get the defines */ + #include <fcntl.h> /* in glibc 2.2 this has the needed + values defined */ + #include <signal.h> + #include <stdio.h> + #include <unistd.h> + + static volatile int event_fd; + + static void handler(int sig, siginfo_t *si, void *data) + { + event_fd = si->si_fd; + } + + int main(void) + { + struct sigaction act; + int fd; + + act.sa_sigaction = handler; + sigemptyset(&act.sa_mask); + act.sa_flags = SA_SIGINFO; + sigaction(SIGRTMIN + 1, &act, NULL); + + fd = open(".", O_RDONLY); + fcntl(fd, F_SETSIG, SIGRTMIN + 1); + fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); + /* we will now be notified if any of the files + in "." is modified or new files are created */ + while (1) { + pause(); + printf("Got event on fd=%d\n", event_fd); + } + } diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.txt new file mode 100644 index 0000000..01d8a08 --- /dev/null +++ b/Documentation/filesystems/ecryptfs.txt @@ -0,0 +1,77 @@ +eCryptfs: A stacked cryptographic filesystem for Linux + +eCryptfs is free software. Please see the file COPYING for details. +For documentation, please see the files in the doc/ subdirectory. For +building and installation instructions please see the INSTALL file. + +Maintainer: Phillip Hellewell +Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com> +Developers: Michael C. Thompson + Kent Yoder +Web Site: http://ecryptfs.sf.net + +This software is currently undergoing development. Make sure to +maintain a backup copy of any data you write into eCryptfs. + +eCryptfs requires the userspace tools downloadable from the +SourceForge site: + +http://sourceforge.net/projects/ecryptfs/ + +Userspace requirements include: + - David Howells' userspace keyring headers and libraries (version + 1.0 or higher), obtainable from + http://people.redhat.com/~dhowells/keyutils/ + - Libgcrypt + + +NOTES + +In the beta/experimental releases of eCryptfs, when you upgrade +eCryptfs, you should copy the files to an unencrypted location and +then copy the files back into the new eCryptfs mount to migrate the +files. + + +MOUNT-WIDE PASSPHRASE + +Create a new directory into which eCryptfs will write its encrypted +files (i.e., /root/crypt). Then, create the mount point directory +(i.e., /mnt/crypt). Now it's time to mount eCryptfs: + +mount -t ecryptfs /root/crypt /mnt/crypt + +You should be prompted for a passphrase and a salt (the salt may be +blank). + +Try writing a new file: + +echo "Hello, World" > /mnt/crypt/hello.txt + +The operation will complete. Notice that there is a new file in +/root/crypt that is at least 12288 bytes in size (depending on your +host page size). This is the encrypted underlying file for what you +just wrote. To test reading, from start to finish, you need to clear +the user session keyring: + +keyctl clear @u + +Then umount /mnt/crypt and mount again per the instructions given +above. + +cat /mnt/crypt/hello.txt + + +NOTES + +eCryptfs version 0.1 should only be mounted on (1) empty directories +or (2) directories containing files only created by eCryptfs. If you +mount a directory that has pre-existing files not created by eCryptfs, +then behavior is undefined. Do not run eCryptfs in higher verbosity +levels unless you are doing so for the sole purpose of debugging or +development, since secret values will be written out to the system log +in that case. + + +Mike Halcrow +mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt new file mode 100644 index 0000000..4333e83 --- /dev/null +++ b/Documentation/filesystems/ext2.txt @@ -0,0 +1,382 @@ + +The Second Extended Filesystem +============================== + +ext2 was originally released in January 1993. Written by R\'emy Card, +Theodore Ts'o and Stephen Tweedie, it was a major rewrite of the +Extended Filesystem. It is currently still (April 2001) the predominant +filesystem in use by Linux. There are also implementations available +for NetBSD, FreeBSD, the GNU HURD, Windows 95/98/NT, OS/2 and RISC OS. + +Options +======= + +Most defaults are determined by the filesystem superblock, and can be +set using tune2fs(8). Kernel-determined defaults are indicated by (*). + +bsddf (*) Makes `df' act like BSD. +minixdf Makes `df' act like Minix. + +check=none, nocheck (*) Don't do extra checking of bitmaps on mount + (check=normal and check=strict options removed) + +debug Extra debugging information is sent to the + kernel syslog. Useful for developers. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +grpid, bsdgroups Give objects the same group ID as their parent. +nogrpid, sysvgroups New objects have the group ID of their creator. + +nouid32 Use 16-bit UIDs and GIDs. + +oldalloc Enable the old block allocator. Orlov should + have better performance, we'd like to get some + feedback if it's the contrary for you. +orlov (*) Use the Orlov block allocator. + (See http://lwn.net/Articles/14633/ and + http://lwn.net/Articles/14446/.) + +resuid=n The user ID which may use the reserved blocks. +resgid=n The group ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +user_xattr Enable "user." POSIX Extended Attributes + (requires CONFIG_EXT2_FS_XATTR). + See also http://acl.bestbits.at +nouser_xattr Don't support "user." extended attributes. + +acl Enable POSIX Access Control Lists support + (requires CONFIG_EXT2_FS_POSIX_ACL). + See also http://acl.bestbits.at +noacl Don't support POSIX ACLs. + +nobh Do not attach buffer_heads to file pagecache. + +xip Use execute in place (no caching) if possible + +grpquota,noquota,quota,usrquota Quota options are silently ignored by ext2. + + +Specification +============= + +ext2 shares many properties with traditional Unix filesystems. It has +the concepts of blocks, inodes and directories. It has space in the +specification for Access Control Lists (ACLs), fragments, undeletion and +compression though these are not yet implemented (some are available as +separate patches). There is also a versioning mechanism to allow new +features (such as journalling) to be added in a maximally compatible +manner. + +Blocks +------ + +The space in the device or file is split up into blocks. These are +a fixed size, of 1024, 2048 or 4096 bytes (8192 bytes on Alpha systems), +which is decided when the filesystem is created. Smaller blocks mean +less wasted space per file, but require slightly more accounting overhead, +and also impose other limits on the size of files and the filesystem. + +Block Groups +------------ + +Blocks are clustered into block groups in order to reduce fragmentation +and minimise the amount of head seeking when reading a large amount +of consecutive data. Information about each block group is kept in a +descriptor table stored in the block(s) immediately after the superblock. +Two blocks near the start of each group are reserved for the block usage +bitmap and the inode usage bitmap which show which blocks and inodes +are in use. Since each bitmap is limited to a single block, this means +that the maximum size of a block group is 8 times the size of a block. + +The block(s) following the bitmaps in each block group are designated +as the inode table for that block group and the remainder are the data +blocks. The block allocation algorithm attempts to allocate data blocks +in the same block group as the inode which contains them. + +The Superblock +-------------- + +The superblock contains all the information about the configuration of +the filing system. The primary copy of the superblock is stored at an +offset of 1024 bytes from the start of the device, and it is essential +to mounting the filesystem. Since it is so important, backup copies of +the superblock are stored in block groups throughout the filesystem. +The first version of ext2 (revision 0) stores a copy at the start of +every block group, along with backups of the group descriptor block(s). +Because this can consume a considerable amount of space for large +filesystems, later revisions can optionally reduce the number of backup +copies by only putting backups in specific groups (this is the sparse +superblock feature). The groups chosen are 0, 1 and powers of 3, 5 and 7. + +The information in the superblock contains fields such as the total +number of inodes and blocks in the filesystem and how many are free, +how many inodes and blocks are in each block group, when the filesystem +was mounted (and if it was cleanly unmounted), when it was modified, +what version of the filesystem it is (see the Revisions section below) +and which OS created it. + +If the filesystem is revision 1 or higher, then there are extra fields, +such as a volume name, a unique identification number, the inode size, +and space for optional filesystem features to store configuration info. + +All fields in the superblock (as in all other ext2 structures) are stored +on the disc in little endian format, so a filesystem is portable between +machines without having to know what machine it was created on. + +Inodes +------ + +The inode (index node) is a fundamental concept in the ext2 filesystem. +Each object in the filesystem is represented by an inode. The inode +structure contains pointers to the filesystem blocks which contain the +data held in the object and all of the metadata about an object except +its name. The metadata about an object includes the permissions, owner, +group, flags, size, number of blocks used, access time, change time, +modification time, deletion time, number of links, fragments, version +(for NFS) and extended attributes (EAs) and/or Access Control Lists (ACLs). + +There are some reserved fields which are currently unused in the inode +structure and several which are overloaded. One field is reserved for the +directory ACL if the inode is a directory and alternately for the top 32 +bits of the file size if the inode is a regular file (allowing file sizes +larger than 2GB). The translator field is unused under Linux, but is used +by the HURD to reference the inode of a program which will be used to +interpret this object. Most of the remaining reserved fields have been +used up for both Linux and the HURD for larger owner and group fields, +The HURD also has a larger mode field so it uses another of the remaining +fields to store the extra more bits. + +There are pointers to the first 12 blocks which contain the file's data +in the inode. There is a pointer to an indirect block (which contains +pointers to the next set of blocks), a pointer to a doubly-indirect +block (which contains pointers to indirect blocks) and a pointer to a +trebly-indirect block (which contains pointers to doubly-indirect blocks). + +The flags field contains some ext2-specific flags which aren't catered +for by the standard chmod flags. These flags can be listed with lsattr +and changed with the chattr command, and allow specific filesystem +behaviour on a per-file basis. There are flags for secure deletion, +undeletable, compression, synchronous updates, immutability, append-only, +dumpable, no-atime, indexed directories, and data-journaling. Not all +of these are supported yet. + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate +each name with an inode number. Later revisions of the filesystem also +encode the type of the object (file, directory, symlink, device, fifo, +socket) to avoid the need to check the inode itself for this information +(support for taking advantage of this feature does not yet exist in +Glibc 2.2). + +The inode allocation code tries to assign inodes which are in the same +block group as the directory in which they are first created. + +The current implementation of ext2 uses a singly-linked list to store +the filenames in the directory; a pending enhancement uses hashing of the +filenames to allow lookup without the need to scan the entire directory. + +The current implementation never removes empty directory blocks once they +have been allocated to hold more files. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They deserve +special mention because the data for them is stored within the inode +itself if the symlink is less than 60 bytes long. It uses the fields +which would normally be used to store the pointers to data blocks. +This is a worthwhile optimisation as it we avoid allocating a full +block for the symlink, and most symlinks are less than 60 characters long. + +Character and block special devices never have data blocks assigned to +them. Instead, their device number is stored in the inode, again reusing +the fields which would be used to point to the data blocks. + +Reserved Space +-------------- + +In ext2, there is a mechanism for reserving a certain number of blocks +for a particular user (normally the super-user). This is intended to +allow for the system to continue functioning even if non-privileged users +fill up all the space available to them (this is independent of filesystem +quotas). It also keeps the filesystem from filling up entirely which +helps combat fragmentation. + +Filesystem check +---------------- + +At boot time, most systems run a consistency check (e2fsck) on their +filesystems. The superblock of the ext2 filesystem contains several +fields which indicate whether fsck should actually run (since checking +the filesystem at boot can take a long time if it is large). fsck will +run if the filesystem was not cleanly unmounted, if the maximum mount +count has been exceeded or if the maximum time between checks has been +exceeded. + +Feature Compatibility +--------------------- + +The compatibility feature mechanism used in ext2 is sophisticated. +It safely allows features to be added to the filesystem, without +unnecessarily sacrificing compatibility with older versions of the +filesystem code. The feature compatibility mechanism is not supported by +the original revision 0 (EXT2_GOOD_OLD_REV) of ext2, but was introduced in +revision 1. There are three 32-bit fields, one for compatible features +(COMPAT), one for read-only compatible (RO_COMPAT) features and one for +incompatible (INCOMPAT) features. + +These feature flags have specific meanings for the kernel as follows: + +A COMPAT flag indicates that a feature is present in the filesystem, +but the on-disk format is 100% compatible with older on-disk formats, so +a kernel which didn't know anything about this feature could read/write +the filesystem without any chance of corrupting the filesystem (or even +making it inconsistent). This is essentially just a flag which says +"this filesystem has a (hidden) feature" that the kernel or e2fsck may +want to be aware of (more on e2fsck and feature flags later). The ext3 +HAS_JOURNAL feature is a COMPAT flag because the ext3 journal is simply +a regular file with data blocks in it so the kernel does not need to +take any special notice of it if it doesn't understand ext3 journaling. + +An RO_COMPAT flag indicates that the on-disk format is 100% compatible +with older on-disk formats for reading (i.e. the feature does not change +the visible on-disk format). However, an old kernel writing to such a +filesystem would/could corrupt the filesystem, so this is prevented. The +most common such feature, SPARSE_SUPER, is an RO_COMPAT feature because +sparse groups allow file data blocks where superblock/group descriptor +backups used to live, and ext2_free_blocks() refuses to free these blocks, +which would leading to inconsistent bitmaps. An old kernel would also +get an error if it tried to free a series of blocks which crossed a group +boundary, but this is a legitimate layout in a SPARSE_SUPER filesystem. + +An INCOMPAT flag indicates the on-disk format has changed in some +way that makes it unreadable by older kernels, or would otherwise +cause a problem if an old kernel tried to mount it. FILETYPE is an +INCOMPAT flag because older kernels would think a filename was longer +than 256 characters, which would lead to corrupt directory listings. +The COMPRESSION flag is an obvious INCOMPAT flag - if the kernel +doesn't understand compression, you would just get garbage back from +read() instead of it automatically decompressing your data. The ext3 +RECOVER flag is needed to prevent a kernel which does not understand the +ext3 journal from mounting the filesystem without replaying the journal. + +For e2fsck, it needs to be more strict with the handling of these +flags than the kernel. If it doesn't understand ANY of the COMPAT, +RO_COMPAT, or INCOMPAT flags it will refuse to check the filesystem, +because it has no way of verifying whether a given feature is valid +or not. Allowing e2fsck to succeed on a filesystem with an unknown +feature is a false sense of security for the user. Refusing to check +a filesystem with unknown features is a good incentive for the user to +update to the latest e2fsck. This also means that anyone adding feature +flags to ext2 also needs to update e2fsck to verify these features. + +Metadata +-------- + +It is frequently claimed that the ext2 implementation of writing +asynchronous metadata is faster than the ffs synchronous metadata +scheme but less reliable. Both methods are equally resolvable by their +respective fsck programs. + +If you're exceptionally paranoid, there are 3 ways of making metadata +writes synchronous on ext2: + +per-file if you have the program source: use the O_SYNC flag to open() +per-file if you don't have the source: use "chattr +S" on the file +per-filesystem: add the "sync" option to mount (or in /etc/fstab) + +the first and last are not ext2 specific but do force the metadata to +be written synchronously. See also Journaling below. + +Limitations +----------- + +There are various limits imposed by the on-disk layout of ext2. Other +limits are imposed by the current implementation of the kernel code. +Many of the limits are determined at the time the filesystem is first +created, and depend upon the block size chosen. The ratio of inodes to +data blocks is fixed at filesystem creation time, so the only way to +increase the number of inodes is to increase the size of the filesystem. +No tools currently exist which can change the ratio of inodes to blocks. + +Most of these limits could be overcome with slight changes in the on-disk +format and using a compatibility flag to signal the format change (at +the expense of some compatibility). + +Filesystem block size: 1kB 2kB 4kB 8kB + +File size limit: 16GB 256GB 2048GB 2048GB +Filesystem size limit: 2047GB 8192GB 16384GB 32768GB + +There is a 2.4 kernel limit of 2048GB for a single block device, so no +filesystem larger than that can be created at this time. There is also +an upper limit on the block size imposed by the page size of the kernel, +so 8kB blocks are only allowed on Alpha systems (and other architectures +which support larger pages). + +There is an upper limit of 32768 subdirectories in a single directory. + +There is a "soft" upper limit of about 10-15k files in a single directory +with the current linear linked-list directory implementation. This limit +stems from performance problems when creating and deleting (and also +finding) files in such large directories. Using a hashed directory index +(under development) allows 100k-1M+ files in a single directory without +performance problems (although RAM size becomes an issue at this point). + +The (meaningless) absolute upper limit of files in a single directory +(imposed by the file size, the realistic limit is obviously much less) +is over 130 trillion files. It would be higher except there are not +enough 4-character names to make up unique directory entries, so they +have to be 8 character filenames, even then we are fairly close to +running out of unique filenames. + +Journaling +---------- + +A journaling extension to the ext2 code has been developed by Stephen +Tweedie. It avoids the risks of metadata corruption and the need to +wait for e2fsck to complete after a crash, without requiring a change +to the on-disk ext2 layout. In a nutshell, the journal is a regular +file which stores whole metadata (and optionally data) blocks that have +been modified, prior to writing them into the filesystem. This means +it is possible to add a journal to an existing ext2 filesystem without +the need for data conversion. + +When changes to the filesystem (e.g. a file is renamed) they are stored in +a transaction in the journal and can either be complete or incomplete at +the time of a crash. If a transaction is complete at the time of a crash +(or in the normal case where the system does not crash), then any blocks +in that transaction are guaranteed to represent a valid filesystem state, +and are copied into the filesystem. If a transaction is incomplete at +the time of the crash, then there is no guarantee of consistency for +the blocks in that transaction so they are discarded (which means any +filesystem changes they represent are also lost). +Check Documentation/filesystems/ext3.txt if you want to read more about +ext3 and journaling. + +References +========== + +The kernel source file:/usr/src/linux/fs/ext2/ +e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ +Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html +Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ +Filesystem Resizing http://ext2resize.sourceforge.net/ +Compression (*) http://e2compr.sourceforge.net/ + +Implementations for: +Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm +Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2 +DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ +RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ + +(*) no longer actively developed/supported (as of Apr 2001) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt new file mode 100644 index 0000000..9dd2a3b --- /dev/null +++ b/Documentation/filesystems/ext3.txt @@ -0,0 +1,202 @@ + +Ext3 Filesystem +=============== + +Ext3 was originally released in September 1999. Written by Stephen Tweedie +for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, +Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. + +Ext3 is the ext2 filesystem enhanced with journalling capabilities. + +Options +======= + +When mounting an ext3 filesystem, the following option are accepted: +(*) == default + +journal=update Update the ext3 file system's journal to the current + format. + +journal=inum When a journal already exists, this option is ignored. + Otherwise, it specifies the number of the inode which + will represent the ext3 file system's journal file. + +journal_dev=devnum When the external journal device's major/minor numbers + have changed, this option allows the user to specify + the new journal location. The journal device is + identified through its new major/minor numbers encoded + in devnum. + +noload Don't load the journal on mounting. + +data=journal All data are committed into the journal prior to being + written into the main file system. + +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. + +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. + +commit=nrsec (*) Ext3 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. + +barrier=1 This enables/disables barriers. barrier=0 disables + it, barrier=1 enables it. + +orlov (*) This enables the new Orlov block allocator. It is + enabled by default. + +oldalloc This disables the Orlov block allocator and enables + the old block allocator. Orlov should have better + performance - we'd like to get some feedback if it's + the contrary for you. + +user_xattr Enables Extended User Attributes. Additionally, you + need to have extended attribute support enabled in the + kernel configuration (CONFIG_EXT3_FS_XATTR). See the + attr(5) manual page and http://acl.bestbits.at/ to + learn more about extended attributes. + +nouser_xattr Disables Extended User Attributes. + +acl Enables POSIX Access Control Lists support. + Additionally, you need to have ACL support enabled in + the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL). + See the acl(5) manual page and http://acl.bestbits.at/ + for more information. + +noacl This option disables POSIX Access Control List + support. + +reservation + +noreservation + +bsddf (*) Make 'df' act like BSD. +minixdf Make 'df' act like Minix. + +check=none Don't do extra checking of bitmaps on mount. +nocheck + +debug Extra debugging information is sent to syslog. + +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=continue Keep going on a filesystem error. +errors=panic Panic and halt the machine if an error occurs. + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + +grpid Give objects the same group ID as their creator. +bsdgroups + +nogrpid (*) New objects have the group ID of their creator. +sysvgroups + +resgid=n The group ID which may use the reserved blocks. + +resuid=n The user ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +quota +noquota +grpquota +usrquota + +bh (*) ext3 associates buffer heads to data pages to +nobh (a) cache disk block mapping information + (b) link pages into transaction to provide + ordering guarantees. + "bh" option forces use of buffer heads. + "nobh" option tries to avoid associating buffer + heads (supported only for "writeback" mode). + + +Specification +============= +Ext3 shares all disk implementation with the ext2 filesystem, and adds +transactions capabilities to ext2. Journaling is done by the Journaling Block +Device layer. + +Journaling Block Device layer +----------------------------- +The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed +to add journaling capabilities to a block device. The ext3 filesystem code +will inform the JBD of modifications it is performing (called a transaction). +The journal supports the transactions start and stop, and in case of a crash, +the journal can replay the transactions to quickly put the partition back into +a consistent state. + +Handles represent a single atomic update to a filesystem. JBD can handle an +external journal on a block device. + +Data Mode +--------- +There are 3 different data modes: + +* writeback mode +In data=writeback mode, ext3 does not journal data at all. This mode provides +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default +mode - metadata journaling. A crash+recovery can cause incorrect data to +appear in files which were written shortly before the crash. This mode will +typically provide the best ext3 performance. + +* ordered mode +In data=ordered mode, ext3 only officially journals metadata, but it logically +groups metadata and data blocks into a single unit called a transaction. When +it's time to write the new metadata out to disk, the associated data blocks +are written first. In general, this mode performs slightly slower than +writeback but significantly faster than journal mode. + +* journal mode +data=journal mode provides full data and metadata journaling. All new data is +written to the journal first, and then to its final location. +In the event of a crash, the journal can be replayed, bringing both data and +metadata into a consistent state. This mode is the slowest except when data +needs to be read from and written to disk at the same time where it +outperforms all other modes. + +Compatibility +------------- + +Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`. +Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as +Ext2. + + +External Tools +============== +See manual pages to learn more. + +tune2fs: create a ext3 journal on a ext2 partition with the -j flag. +mke2fs: create a ext3 partition with the -j flag. +debugfs: ext2 and ext3 file system debugger. +ext2online: online (mounted) ext2 and ext3 filesystem resizer + + +References +========== + +kernel source: <file:fs/ext3/> + <file:fs/jbd/> + +programs: http://e2fsprogs.sourceforge.net/ + http://ext2resize.sourceforge.net + +useful links: http://www-106.ibm.com/developerworks/linux/library/l-fs7/ + http://www-106.ibm.com/developerworks/linux/library/l-fs8/ diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt new file mode 100644 index 0000000..174eaff --- /dev/null +++ b/Documentation/filesystems/ext4.txt @@ -0,0 +1,302 @@ + +Ext4 Filesystem +=============== + +Ext4 is an an advanced level of the ext3 filesystem which incorporates +scalability and reliability enhancements for supporting large filesystems +(64 bit) in keeping with increasing disk capacities and state-of-the-art +feature requirements. + +Mailing list: linux-ext4@vger.kernel.org +Web site: http://ext4.wiki.kernel.org + + +1. Quick usage instructions: +=========================== + +Note: More extensive information for getting started with ext4 can be + found at the ext4 wiki site at the URL: + http://ext4.wiki.kernel.org/index.php/Ext4_Howto + + - Compile and install the latest version of e2fsprogs (as of this + writing version 1.41.3) from: + + http://sourceforge.net/project/showfiles.php?group_id=2406 + + or + + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ + + or grab the latest git repository from: + + git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git + + - Note that it is highly important to install the mke2fs.conf file + that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If + you have edited the /etc/mke2fs.conf file installed on your system, + you will need to merge your changes with the version from e2fsprogs + 1.41.x. + + - Create a new filesystem using the ext4 filesystem type: + + # mke2fs -t ext4 /dev/hda1 + + Or to configure an existing ext3 filesystem to support extents: + + # tune2fs -O extents /dev/hda1 + + If the filesystem was created with 128 byte inodes, it can be + converted to use 256 byte for greater efficiency via: + + # tune2fs -I 256 /dev/hda1 + + (Note: we currently do not have tools to convert an ext4 + filesystem back to ext3; so please do not do try this on production + filesystems.) + + - Mounting: + + # mount -t ext4 /dev/hda1 /wherever + + - When comparing performance with other filesystems, remember that + ext3/4 by default offers higher data integrity guarantees than most. + So when comparing with a metadata-only journalling filesystem, such + as ext3, use `mount -o data=writeback'. And you might as well use + `mount -o nobh' too along with it. Making the journal larger than + the mke2fs default often helps performance with metadata-intensive + workloads. + +2. Features +=========== + +2.1 Currently available + +* ability to use filesystems > 16TB (e2fsprogs support not available yet) +* extent format reduces metadata overhead (RAM, IO for access, transactions) +* extent format more robust in face of on-disk corruption due to magics, +* internal redunancy in tree +* improved file allocation (multi-block alloc) +* fix 32000 subdirectory limit +* nsec timestamps for mtime, atime, ctime, create time +* inode version field on disk (NFSv4, Lustre) +* reduced e2fsck time via uninit_bg feature +* journal checksumming for robustness, performance +* persistent file preallocation (e.g for streaming media, databases) +* ability to pack bitmaps and inode tables into larger virtual groups via the + flex_bg feature +* large file support +* Inode allocation using large virtual block groups via flex_bg +* delayed allocation +* large block (up to pagesize) support +* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force + the ordering) + +2.2 Candidate features for future inclusion + +* Online defrag (patches available but not well tested) +* reduced mke2fs time via lazy itable initialization in conjuction with + the uninit_bg feature (capability to do this is available in e2fsprogs + but a kernel thread to do lazy zeroing of unused inode table blocks + after filesystem is first mounted is required for safety) + +There are several others under discussion, whether they all make it in is +partly a function of how much time everyone has to work on them. Features like +metadata checksumming have been discussed and planned for a bit but no patches +exist yet so I'm not sure they're in the near-term roadmap. + +The big performance win will come with mballoc, delalloc and flex_bg +grouping of bitmaps and inode tables. Some test results available here: + + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html + +3. Options +========== + +When mounting an ext4 filesystem, the following option are accepted: +(*) == default + +extents (*) ext4 will use extents to address file data. The + file system will no longer be mountable by ext3. + +noextents ext4 will not use extents for newly created files + +journal_checksum Enable checksumming of the journal transactions. + This will allow the recovery code in e2fsck and the + kernel to detect corruption in the kernel. It is a + compatible change and will be ignored by older kernels. + +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. + +journal=update Update the ext4 file system's journal to the current + format. + +journal=inum When a journal already exists, this option is ignored. + Otherwise, it specifies the number of the inode which + will represent the ext4 file system's journal file. + +journal_dev=devnum When the external journal device's major/minor numbers + have changed, this option allows the user to specify + the new journal location. The journal device is + identified through its new major/minor numbers encoded + in devnum. + +noload Don't load the journal on mounting. + +data=journal All data are committed into the journal prior to being + written into the main file system. + +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. + +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. + +commit=nrsec (*) Ext4 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. + +barrier=<0|1(*)> This enables/disables the use of write barriers in + the jbd code. barrier=0 disables, barrier=1 enables. + This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. + +inode_readahead=n This tuning parameter controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache. The default value is 32 blocks. + +orlov (*) This enables the new Orlov block allocator. It is + enabled by default. + +oldalloc This disables the Orlov block allocator and enables + the old block allocator. Orlov should have better + performance - we'd like to get some feedback if it's + the contrary for you. + +user_xattr Enables Extended User Attributes. Additionally, you + need to have extended attribute support enabled in the + kernel configuration (CONFIG_EXT4_FS_XATTR). See the + attr(5) manual page and http://acl.bestbits.at/ to + learn more about extended attributes. + +nouser_xattr Disables Extended User Attributes. + +acl Enables POSIX Access Control Lists support. + Additionally, you need to have ACL support enabled in + the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL). + See the acl(5) manual page and http://acl.bestbits.at/ + for more information. + +noacl This option disables POSIX Access Control List + support. + +reservation + +noreservation + +bsddf (*) Make 'df' act like BSD. +minixdf Make 'df' act like Minix. + +debug Extra debugging information is sent to syslog. + +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=continue Keep going on a filesystem error. +errors=panic Panic and halt the machine if an error occurs. + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + +grpid Give objects the same group ID as their creator. +bsdgroups + +nogrpid (*) New objects have the group ID of their creator. +sysvgroups + +resgid=n The group ID which may use the reserved blocks. + +resuid=n The user ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +quota +noquota +grpquota +usrquota + +bh (*) ext4 associates buffer heads to data pages to +nobh (a) cache disk block mapping information + (b) link pages into transaction to provide + ordering guarantees. + "bh" option forces use of buffer heads. + "nobh" option tries to avoid associating buffer + heads (supported only for "writeback" mode). + +stripe=n Number of filesystem blocks that mballoc will try + to use for allocation size and alignment. For RAID5/6 + systems this should be the number of data + disks * RAID chunk size in file system blocks. +delalloc (*) Deferring block allocation until write-out time. +nodelalloc Disable delayed allocation. Blocks are allocation + when data is copied from user to page cache. + +Data Mode +========= +There are 3 different data modes: + +* writeback mode +In data=writeback mode, ext4 does not journal data at all. This mode provides +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default +mode - metadata journaling. A crash+recovery can cause incorrect data to +appear in files which were written shortly before the crash. This mode will +typically provide the best ext4 performance. + +* ordered mode +In data=ordered mode, ext4 only officially journals metadata, but it logically +groups metadata information related to data changes with the data blocks into a +single unit called a transaction. When it's time to write the new metadata +out to disk, the associated data blocks are written first. In general, +this mode performs slightly slower than writeback but significantly faster than journal mode. + +* journal mode +data=journal mode provides full data and metadata journaling. All new data is +written to the journal first, and then to its final location. +In the event of a crash, the journal can be replayed, bringing both data and +metadata into a consistent state. This mode is the slowest except when data +needs to be read from and written to disk at the same time where it +outperforms all others modes. Curently ext4 does not have delayed +allocation support if this data journalling mode is selected. + +References +========== + +kernel source: <file:fs/ext4/> + <file:fs/jbd2/> + +programs: http://e2fsprogs.sourceforge.net/ + +useful links: http://fedoraproject.org/wiki/ext3-devel + http://www.bullopensource.org/ext4/ + http://ext4.wiki.kernel.org/index.php/Main_Page + http://fedoraproject.org/wiki/Features/Ext4 diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 0000000..1e3defc --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,228 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to it's operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +The following flags can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of it's data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or it's logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, it's +physical location has not been allocated yet. + +* FIEMAP_EXTENT_ENCODED +This extent does not consist of plain filesystem blocks but is +encoded (e.g. encrypted or compressed). Reading the data in this +extent via I/O to the block device will have undefined results. + +Note that it is *always* undefined to try to update the data +in-place by writing to the indicated location without the +assistance of the filesystem, or to access the data using the +information returned by the FIEMAP interface while the filesystem +is mounted. In other words, user applications may only read the +extent data via I/O to the block device while the filesystem is +unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is +clear; user applications must not try reading or writing to the +filesystem via the block device under any other circumstances. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_ENCODED +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but it's data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining it's set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as recieved from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt new file mode 100644 index 0000000..bb0142f --- /dev/null +++ b/Documentation/filesystems/files.txt @@ -0,0 +1,123 @@ +File management in the Linux kernel +----------------------------------- + +This document describes how locking for files (struct file) +and file descriptor table (struct files) works. + +Up until 2.6.12, the file descriptor table has been protected +with a lock (files->file_lock) and reference count (files->count). +->file_lock protected accesses to all the file related fields +of the table. ->count was used for sharing the file descriptor +table between tasks cloned with CLONE_FILES flag. Typically +this would be the case for posix threads. As with the common +refcounting model in the kernel, the last task doing +a put_files_struct() frees the file descriptor (fd) table. +The files (struct file) themselves are protected using +reference count (->f_count). + +In the new lock-free model of file descriptor management, +the reference counting is similar, but the locking is +based on RCU. The file descriptor table contains multiple +elements - the fd sets (open_fds and close_on_exec, the +array of file pointers, the sizes of the sets and the array +etc.). In order for the updates to appear atomic to +a lock-free reader, all the elements of the file descriptor +table are in a separate structure - struct fdtable. +files_struct contains a pointer to struct fdtable through +which the actual fd table is accessed. Initially the +fdtable is embedded in files_struct itself. On a subsequent +expansion of fdtable, a new fdtable structure is allocated +and files->fdtab points to the new structure. The fdtable +structure is freed with RCU and lock-free readers either +see the old fdtable or the new fdtable making the update +appear atomic. Here are the locking rules for +the fdtable structure - + +1. All references to the fdtable must be done through + the files_fdtable() macro : + + struct fdtable *fdt; + + rcu_read_lock(); + + fdt = files_fdtable(files); + .... + if (n <= fdt->max_fds) + .... + ... + rcu_read_unlock(); + + files_fdtable() uses rcu_dereference() macro which takes care of + the memory barrier requirements for lock-free dereference. + The fdtable pointer must be read within the read-side + critical section. + +2. Reading of the fdtable as described above must be protected + by rcu_read_lock()/rcu_read_unlock(). + +3. For any update to the fd table, files->file_lock must + be held. + +4. To look up the file structure given an fd, a reader + must use either fcheck() or fcheck_files() APIs. These + take care of barrier requirements due to lock-free lookup. + An example : + + struct file *file; + + rcu_read_lock(); + file = fcheck(fd); + if (file) { + ... + } + .... + rcu_read_unlock(); + +5. Handling of the file structures is special. Since the look-up + of the fd (fget()/fget_light()) are lock-free, it is possible + that look-up may race with the last put() operation on the + file structure. This is avoided using atomic_inc_not_zero() + on ->f_count : + + rcu_read_lock(); + file = fcheck_files(files, fd); + if (file) { + if (atomic_inc_not_zero(&file->f_count)) + *fput_needed = 1; + else + /* Didn't get the reference, someone's freed */ + file = NULL; + } + rcu_read_unlock(); + .... + return file; + + atomic_inc_not_zero() detects if refcounts is already zero or + goes to zero during increment. If it does, we fail + fget()/fget_light(). + +6. Since both fdtable and file structures can be looked up + lock-free, they must be installed using rcu_assign_pointer() + API. If they are looked up lock-free, rcu_dereference() + must be used. However it is advisable to use files_fdtable() + and fcheck()/fcheck_files() which take care of these issues. + +7. While updating, the fdtable pointer must be looked up while + holding files->file_lock. If ->file_lock is dropped, then + another thread expand the files thereby creating a new + fdtable and making the earlier fdtable pointer stale. + For example : + + spin_lock(&files->file_lock); + fd = locate_fd(files, file, start); + if (fd >= 0) { + /* locate_fd() may have expanded fdtable, load the ptr */ + fdt = files_fdtable(files); + FD_SET(fd, fdt->open_fds); + FD_CLR(fd, fdt->close_on_exec); + spin_unlock(&files->file_lock); + ..... + + Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), + the fdtable pointer (fdt) must be loaded after locate_fd(). + diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt new file mode 100644 index 0000000..397a41a --- /dev/null +++ b/Documentation/filesystems/fuse.txt @@ -0,0 +1,423 @@ +Definitions +~~~~~~~~~~~ + +Userspace filesystem: + + A filesystem in which data and metadata are provided by an ordinary + userspace process. The filesystem can be accessed normally through + the kernel interface. + +Filesystem daemon: + + The process(es) providing the data and metadata of the filesystem. + +Non-privileged mount (or user mount): + + A userspace filesystem mounted by a non-privileged (non-root) user. + The filesystem daemon is running with the privileges of the mounting + user. NOTE: this is not the same as mounts allowed with the "user" + option in /etc/fstab, which is not discussed here. + +Filesystem connection: + + A connection between the filesystem daemon and the kernel. The + connection exists until either the daemon dies, or the filesystem is + umounted. Note that detaching (or lazy umounting) the filesystem + does _not_ break the connection, in this case it will exist until + the last reference to the filesystem is released. + +Mount owner: + + The user who does the mounting. + +User: + + The user who is performing filesystem operations. + +What is FUSE? +~~~~~~~~~~~~~ + +FUSE is a userspace filesystem framework. It consists of a kernel +module (fuse.ko), a userspace library (libfuse.*) and a mount utility +(fusermount). + +One of the most important features of FUSE is allowing secure, +non-privileged mounts. This opens up new possibilities for the use of +filesystems. A good example is sshfs: a secure network filesystem +using the sftp protocol. + +The userspace library and utilities are available from the FUSE +homepage: + + http://fuse.sourceforge.net/ + +Filesystem type +~~~~~~~~~~~~~~~ + +The filesystem type given to mount(2) can be one of the following: + +'fuse' + + This is the usual way to mount a FUSE filesystem. The first + argument of the mount system call may contain an arbitrary string, + which is not interpreted by the kernel. + +'fuseblk' + + The filesystem is block device based. The first argument of the + mount system call is interpreted as the name of the device. + +Mount options +~~~~~~~~~~~~~ + +'fd=N' + + The file descriptor to use for communication between the userspace + filesystem and the kernel. The file descriptor must have been + obtained by opening the FUSE device ('/dev/fuse'). + +'rootmode=M' + + The file mode of the filesystem's root in octal representation. + +'user_id=N' + + The numeric user id of the mount owner. + +'group_id=N' + + The numeric group id of the mount owner. + +'default_permissions' + + By default FUSE doesn't check file access permissions, the + filesystem is free to implement it's access policy or leave it to + the underlying file access mechanism (e.g. in case of network + filesystems). This option enables permission checking, restricting + access based on file mode. It is usually useful together with the + 'allow_other' mount option. + +'allow_other' + + This option overrides the security measure restricting file access + to the user mounting the filesystem. This option is by default only + allowed to root, but this restriction can be removed with a + (userspace) configuration option. + +'max_read=N' + + With this option the maximum size of read operations can be set. + The default is infinite. Note that the size of read requests is + limited anyway to 32 pages (which is 128kbyte on i386). + +'blksize=N' + + Set the block size for the filesystem. The default is 512. This + option is only valid for 'fuseblk' type mounts. + +Control filesystem +~~~~~~~~~~~~~~~~~~ + +There's a control filesystem for FUSE, which can be mounted by: + + mount -t fusectl none /sys/fs/fuse/connections + +Mounting it under the '/sys/fs/fuse/connections' directory makes it +backwards compatible with earlier versions. + +Under the fuse control filesystem each connection has a directory +named by a unique number. + +For each connection the following files exist within this directory: + + 'waiting' + + The number of requests which are waiting to be transferred to + userspace or being processed by the filesystem daemon. If there is + no filesystem activity and 'waiting' is non-zero, then the + filesystem is hung or deadlocked. + + 'abort' + + Writing anything into this file will abort the filesystem + connection. This means that all waiting requests will be aborted an + error returned for all aborted and new requests. + +Only the owner of the mount may read or write these files. + +Interrupting filesystem operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a process issuing a FUSE filesystem request is interrupted, the +following will happen: + + 1) If the request is not yet sent to userspace AND the signal is + fatal (SIGKILL or unhandled fatal signal), then the request is + dequeued and returns immediately. + + 2) If the request is not yet sent to userspace AND the signal is not + fatal, then an 'interrupted' flag is set for the request. When + the request has been successfully transferred to userspace and + this flag is set, an INTERRUPT request is queued. + + 3) If the request is already sent to userspace, then an INTERRUPT + request is queued. + +INTERRUPT requests take precedence over other requests, so the +userspace filesystem will receive queued INTERRUPTs before any others. + +The userspace filesystem may ignore the INTERRUPT requests entirely, +or may honor them by sending a reply to the _original_ request, with +the error set to EINTR. + +It is also possible that there's a race between processing the +original request and it's INTERRUPT request. There are two possibilities: + + 1) The INTERRUPT request is processed before the original request is + processed + + 2) The INTERRUPT request is processed after the original request has + been answered + +If the filesystem cannot find the original request, it should wait for +some timeout and/or a number of new requests to arrive, after which it +should reply to the INTERRUPT request with an EAGAIN error. In case +1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT +reply will be ignored. + +Aborting a filesystem connection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to get into certain situations where the filesystem is +not responding. Reasons for this may be: + + a) Broken userspace filesystem implementation + + b) Network connection down + + c) Accidental deadlock + + d) Malicious deadlock + +(For more on c) and d) see later sections) + +In either of these cases it may be useful to abort the connection to +the filesystem. There are several ways to do this: + + - Kill the filesystem daemon. Works in case of a) and b) + + - Kill the filesystem daemon and all users of the filesystem. Works + in all cases except some malicious deadlocks + + - Use forced umount (umount -f). Works in all cases but only if + filesystem is still attached (it hasn't been lazy unmounted) + + - Abort filesystem through the FUSE control filesystem. Most + powerful method, always works. + +How do non-privileged mounts work? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Since the mount() system call is a privileged operation, a helper +program (fusermount) is needed, which is installed setuid root. + +The implication of providing non-privileged mounts is that the mount +owner must not be able to use this capability to compromise the +system. Obvious requirements arising from this are: + + A) mount owner should not be able to get elevated privileges with the + help of the mounted filesystem + + B) mount owner should not get illegitimate access to information from + other users' and the super user's processes + + C) mount owner should not be able to induce undesired behavior in + other users' or the super user's processes + +How are requirements fulfilled? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + + A) The mount owner could gain elevated privileges by either: + + 1) creating a filesystem containing a device file, then opening + this device + + 2) creating a filesystem containing a suid or sgid application, + then executing this application + + The solution is not to allow opening device files and ignore + setuid and setgid bits when executing programs. To ensure this + fusermount always adds "nosuid" and "nodev" to the mount options + for non-privileged mounts. + + B) If another user is accessing files or directories in the + filesystem, the filesystem daemon serving requests can record the + exact sequence and timing of operations performed. This + information is otherwise inaccessible to the mount owner, so this + counts as an information leak. + + The solution to this problem will be presented in point 2) of C). + + C) There are several ways in which the mount owner can induce + undesired behavior in other users' processes, such as: + + 1) mounting a filesystem over a file or directory which the mount + owner could otherwise not be able to modify (or could only + make limited modifications). + + This is solved in fusermount, by checking the access + permissions on the mountpoint and only allowing the mount if + the mount owner can do unlimited modification (has write + access to the mountpoint, and mountpoint is not a "sticky" + directory) + + 2) Even if 1) is solved the mount owner can change the behavior + of other users' processes. + + i) It can slow down or indefinitely delay the execution of a + filesystem operation creating a DoS against the user or the + whole system. For example a suid application locking a + system file, and then accessing a file on the mount owner's + filesystem could be stopped, and thus causing the system + file to be locked forever. + + ii) It can present files or directories of unlimited length, or + directory structures of unlimited depth, possibly causing a + system process to eat up diskspace, memory or other + resources, again causing DoS. + + The solution to this as well as B) is not to allow processes + to access the filesystem, which could otherwise not be + monitored or manipulated by the mount owner. Since if the + mount owner can ptrace a process, it can do all of the above + without using a FUSE mount, the same criteria as used in + ptrace can be used to check if a process is allowed to access + the filesystem or not. + + Note that the ptrace check is not strictly necessary to + prevent B/2/i, it is enough to check if mount owner has enough + privilege to send signal to the process accessing the + filesystem, since SIGSTOP can be used to get a similar effect. + +I think these limitations are unacceptable? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a sysadmin trusts the users enough, or can ensure through other +measures, that system processes will never enter non-privileged +mounts, it can relax the last limitation with a "user_allow_other" +config option. If this config option is set, the mounting user can +add the "allow_other" mount option which disables the check for other +users' processes. + +Kernel - userspace interface +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following diagram shows how a filesystem operation (in this +example unlink) is performed in FUSE. + +NOTE: everything in this description is greatly simplified + + | "rm /mnt/fuse/file" | FUSE filesystem daemon + | | + | | >sys_read() + | | >fuse_dev_read() + | | >request_wait() + | | [sleep on fc->waitq] + | | + | >sys_unlink() | + | >fuse_unlink() | + | [get request from | + | fc->unused_list] | + | >request_send() | + | [queue req on fc->pending] | + | [wake up fc->waitq] | [woken up] + | >request_wait_answer() | + | [sleep on req->waitq] | + | | <request_wait() + | | [remove req from fc->pending] + | | [copy req to read buffer] + | | [add req to fc->processing] + | | <fuse_dev_read() + | | <sys_read() + | | + | | [perform unlink] + | | + | | >sys_write() + | | >fuse_dev_write() + | | [look up req in fc->processing] + | | [remove from fc->processing] + | | [copy write buffer to req] + | [woken up] | [wake up req->waitq] + | | <fuse_dev_write() + | | <sys_write() + | <request_wait_answer() | + | <request_send() | + | [add request to | + | fc->unused_list] | + | <fuse_unlink() | + | <sys_unlink() | + +There are a couple of ways in which to deadlock a FUSE filesystem. +Since we are talking about unprivileged userspace programs, +something must be done about these. + +Scenario 1 - Simple deadlock +----------------------------- + + | "rm /mnt/fuse/file" | FUSE filesystem daemon + | | + | >sys_unlink("/mnt/fuse/file") | + | [acquire inode semaphore | + | for "file"] | + | >fuse_unlink() | + | [sleep on req->waitq] | + | | <sys_read() + | | >sys_unlink("/mnt/fuse/file") + | | [acquire inode semaphore + | | for "file"] + | | *DEADLOCK* + +The solution for this is to allow the filesystem to be aborted. + +Scenario 2 - Tricky deadlock +---------------------------- + +This one needs a carefully crafted filesystem. It's a variation on +the above, only the call back to the filesystem is not explicit, +but is caused by a pagefault. + + | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2 + | | + | [fd = open("/mnt/fuse/file")] | [request served normally] + | [mmap fd to 'addr'] | + | [close fd] | [FLUSH triggers 'magic' flag] + | [read a byte from addr] | + | >do_page_fault() | + | [find or create page] | + | [lock page] | + | >fuse_readpage() | + | [queue READ request] | + | [sleep on req->waitq] | + | | [read request to buffer] + | | [create reply header before addr] + | | >sys_write(addr - headerlength) + | | >fuse_dev_write() + | | [look up req in fc->processing] + | | [remove from fc->processing] + | | [copy write buffer to req] + | | >do_page_fault() + | | [find or create page] + | | [lock page] + | | * DEADLOCK * + +Solution is basically the same as above. + +An additional problem is that while the write buffer is being copied +to the request, the request must not be interrupted/aborted. This is +because the destination address of the copy may not be valid after the +request has returned. + +This is solved with doing the copy atomically, and allowing abort +while the page(s) belonging to the write buffer are faulted with +get_user_pages(). The 'req->locked' flag indicates when the copy is +taking place, and abort is delayed until this flag is unset. diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt new file mode 100644 index 0000000..4dae9a3 --- /dev/null +++ b/Documentation/filesystems/gfs2-glocks.txt @@ -0,0 +1,114 @@ + Glock internal locking rules + ------------------------------ + +This documents the basic principles of the glock state machine +internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h) +has two main (internal) locks: + + 1. A spinlock (gl_spin) which protects the internal state such + as gl_state, gl_target and the list of holders (gl_holders) + 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other + threads from making calls to the DLM, etc. at the same time. If a + thread takes this lock, it must then call run_queue (usually via the + workqueue) when it releases it in order to ensure any pending tasks + are completed. + +The gl_holders list contains all the queued lock requests (not +just the holders) associated with the glock. If there are any +held locks, then they will be contiguous entries at the head +of the list. Locks are granted in strictly the order that they +are queued, except for those marked LM_FLAG_PRIORITY which are +used only during recovery, and even then only for journal locks. + +There are three lock states that users of the glock layer can request, +namely shared (SH), deferred (DF) and exclusive (EX). Those translate +to the following DLM lock modes: + +Glock mode | DLM lock mode +------------------------------ + UN | IV/NL Unlocked (no DLM lock associated with glock) or NL + SH | PR (Protected read) + DF | CW (Concurrent write) + EX | EX (Exclusive) + +Thus DF is basically a shared mode which is incompatible with the "normal" +shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O +operations. The glocks are basically a lock plus some routines which deal +with cache management. The following rules apply for the cache: + +Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata +-------------------------------------------------------------------------- + UN | No | No | No | No + SH | Yes | Yes | No | No + DF | No | Yes | No | No + EX | Yes | Yes | Yes | Yes + +These rules are implemented using the various glock operations which +are defined for each type of glock. Not all types of glocks use +all the modes. Only inode glocks use the DF mode for example. + +Table of glock operations and per type constants: + +Field | Purpose +---------------------------------------------------------------------------- +go_xmote_th | Called before remote state change (e.g. to sync dirty data) +go_xmote_bh | Called after remote state change (e.g. to refill cache) +go_inval | Called if remote state change requires invalidating the cache +go_demote_ok | Returns boolean value of whether its ok to demote a glock + | (e.g. checks timeout, and that there is no cached data) +go_lock | Called for the first local holder of a lock +go_unlock | Called on the final local unlock of a lock +go_dump | Called to print content of object for debugfs file, or on + | error to dump glock to the log. +go_type; | The type of the glock, LM_TYPE_..... +go_min_hold_time | The minimum hold time + +The minimum hold time for each lock is the time after a remote lock +grant for which we ignore remote demote requests. This is in order to +prevent a situation where locks are being bounced around the cluster +from node to node with none of the nodes making any progress. This +tends to show up most with shared mmaped files which are being written +to by multiple nodes. By delaying the demotion in response to a +remote callback, that gives the userspace program time to make +some progress before the pages are unmapped. + +There is a plan to try and remove the go_lock and go_unlock callbacks +if possible, in order to try and speed up the fast path though the locking. +Also, eventually we hope to make the glock "EX" mode locally shared +such that any local locking will be done with the i_mutex as required +rather than via the glock. + +Locking rules for glock operations: + +Operation | GLF_LOCK bit lock held | gl_spin spinlock held +----------------------------------------------------------------- +go_xmote_th | Yes | No +go_xmote_bh | Yes | No +go_inval | Yes | No +go_demote_ok | Sometimes | Yes +go_lock | Yes | No +go_unlock | Yes | No +go_dump | Sometimes | Yes + +N.B. Operations must not drop either the bit lock or the spinlock +if its held on entry. go_dump and do_demote_ok must never block. +Note that go_dump will only be called if the glock's state +indicates that it is caching uptodate data. + +Glock locking order within GFS2: + + 1. i_mutex (if required) + 2. Rename glock (for rename only) + 3. Inode glock(s) + (Parents before children, inodes at "same level" with same parent in + lock number order) + 4. Rgrp glock(s) (for (de)allocation operations) + 5. Transaction glock (via gfs2_trans_begin) for non-read operations + 6. Page lock (always last, very important!) + +There are two glocks per inode. One deals with access to the inode +itself (locking order as above), and the other, known as the iopen +glock is used in conjunction with the i_nlink field in the inode to +determine the lifetime of the inode in question. Locking of inodes +is on a per-inode basis. Locking of rgrps is on a per rgrp basis. + diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt new file mode 100644 index 0000000..593004b --- /dev/null +++ b/Documentation/filesystems/gfs2.txt @@ -0,0 +1,43 @@ +Global File System +------------------ + +http://sources.redhat.com/cluster/ + +GFS is a cluster file system. It allows a cluster of computers to +simultaneously use a block device that is shared between them (with FC, +iSCSI, NBD, etc). GFS reads and writes to the block device like a local +file system, but also uses a lock module to allow the computers coordinate +their I/O so file system consistency is maintained. One of the nifty +features of GFS is perfect consistency -- changes made to the file system +on one machine show up immediately on all other machines in the cluster. + +GFS uses interchangable inter-node locking mechanisms. Different lock +modules can plug into GFS and each file system selects the appropriate +lock module at mount time. Lock modules include: + + lock_nolock -- allows gfs to be used as a local file system + + lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking + The dlm is found at linux/fs/dlm/ + +In addition to interfacing with an external locking manager, a gfs lock +module is responsible for interacting with external cluster management +systems. Lock_dlm depends on user space cluster management systems found +at the URL above. + +To use gfs as a local file system, no external clustering systems are +needed, simply: + + $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device + $ mount -t gfs2 /dev/block_device /dir + +GFS2 is not on-disk compatible with previous versions of GFS. + +The following man pages can be found at the URL above: + gfs2_fsck to repair a filesystem + gfs2_grow to expand a filesystem online + gfs2_jadd to add journals to a filesystem online + gfs2_tool to manipulate, examine and tune a filesystem + gfs2_quota to examine and change quota values in a filesystem + mount.gfs2 to help mount(8) mount a filesystem + mkfs.gfs2 to make a filesystem diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt new file mode 100644 index 0000000..bd0fa77 --- /dev/null +++ b/Documentation/filesystems/hfs.txt @@ -0,0 +1,83 @@ + +Macintosh HFS Filesystem for Linux +================================== + +HFS stands for ``Hierarchical File System'' and is the filesystem used +by the Mac Plus and all later Macintosh models. Earlier Macintosh +models used MFS (``Macintosh File System''), which is not supported, +MacOS 8.1 and newer support a filesystem called HFS+ that's similar to +HFS but is extended in various areas. Use the hfsplus filesystem driver +to access such filesystems from Linux. + + +Mount options +============= + +When mounting an HFS filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystems. + Default: user/group id of the mounting process. + + dir_umask=n, file_umask=n, umask=n + Specifies the umask used for all files , all directories or all + files and directories. Defaults to the umask of the mounting process. + + session=n + Select the CDROM session to mount as HFS filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. Does only makes + sense for CDROMS because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + quiet + Ignore invalid mount options instead of complaining. + + +Writing to HFS Filesystems +========================== + +HFS is not a UNIX filesystem, thus it does not have the usual features you'd +expect: + + o You can't modify the set-uid, set-gid, sticky or executable bits or the uid + and gid of files. + o You can't create hard- or symlinks, device files, sockets or FIFOs. + +HFS does on the other have the concepts of multiple forks per file. These +non-standard forks are represented as hidden additional files in the normal +filesystems namespace which is kind of a cludge and makes the semantics for +the a little strange: + + o You can't create, delete or rename resource forks of files or the + Finder's metadata. + o They are however created (with default values), deleted and renamed + along with the corresponding data fork or directory. + o Copying files to a different filesystem will loose those attributes + that are essential for MacOS to work. + + +Creating HFS filesystems +=================================== + +The hfsutils package from Robert Leslie contains a program called +hformat that can be used to create HFS filesystem. See +<http://www.mars.org/home/rob/proj/hfs/> for details. + + +Credits +======= + +The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU) +and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis +Technologies. +Roman rewrote large parts of the code and brought in btree routines derived +from Brad Boyer's hfsplus driver (also maintained by Roman now). diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt new file mode 100644 index 0000000..af1628a --- /dev/null +++ b/Documentation/filesystems/hfsplus.txt @@ -0,0 +1,59 @@ + +Macintosh HFSPlus Filesystem for Linux +====================================== + +HFSPlus is a filesystem first introduced in MacOS 8.1. +HFSPlus has several extensions to HFS, including 32-bit allocation +blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. + + +Mount options +============= + +When mounting an HFSPlus filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystem + that have uninitialized permissions structures. + Default: user/group id of the mounting process. + + umask=n + Specifies the umask (in octal) used for files and directories + that have uninitialized permissions structures. + Default: umask of the mounting process. + + session=n + Select the CDROM session to mount as HFSPlus filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + decompose + Decompose file name characters. + + nodecompose + Do not decompose file name characters. + + force + Used to force write access to volumes that are marked as journalled + or locked. Use at your own risk. + + nls=cccc + Encoding to use when presenting file names. + + +References +========== + +kernel source: <file:fs/hfsplus> + +Apple Technote 1150 http://developer.apple.com/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt new file mode 100644 index 0000000..fa45c3b --- /dev/null +++ b/Documentation/filesystems/hpfs.txt @@ -0,0 +1,296 @@ +Read/Write HPFS 2.09 +1998-2004, Mikulas Patocka + +email: mikulas@artax.karlin.mff.cuni.cz +homepage: http://artax.karlin.mff.cuni.cz/~mikulas/vyplody/hpfs/index-e.cgi + +CREDITS: +Chris Smith, 1993, original read-only HPFS, some code and hpfs structures file + is taken from it +Jacques Gelinas, MSDos mmap, Inspired by fs/nfs/mmap.c (Jon Tombs 15 Aug 1993) +Werner Almesberger, 1992, 1993, MSDos option parser & CR/LF conversion + +Mount options + +uid=xxx,gid=xxx,umask=xxx (default uid=gid=0 umask=default_system_umask) + Set owner/group/mode for files that do not have it specified in extended + attributes. Mode is inverted umask - for example umask 027 gives owner + all permission, group read permission and anybody else no access. Note + that for files mode is anded with 0666. If you want files to have 'x' + rights, you must use extended attributes. +case=lower,asis (default asis) + File name lowercasing in readdir. +conv=binary,text,auto (default binary) + CR/LF -> LF conversion, if auto, decision is made according to extension + - there is a list of text extensions (I thing it's better to not convert + text file than to damage binary file). If you want to change that list, + change it in the source. Original readonly HPFS contained some strange + heuristic algorithm that I removed. I thing it's danger to let the + computer decide whether file is text or binary. For example, DJGPP + binaries contain small text message at the beginning and they could be + misidentified and damaged under some circumstances. +check=none,normal,strict (default normal) + Check level. Selecting none will cause only little speedup and big + danger. I tried to write it so that it won't crash if check=normal on + corrupted filesystems. check=strict means many superfluous checks - + used for debugging (for example it checks if file is allocated in + bitmaps when accessing it). +errors=continue,remount-ro,panic (default remount-ro) + Behaviour when filesystem errors found. +chkdsk=no,errors,always (default errors) + When to mark filesystem dirty so that OS/2 checks it. +eas=no,ro,rw (default rw) + What to do with extended attributes. 'no' - ignore them and use always + values specified in uid/gid/mode options. 'ro' - read extended + attributes but do not create them. 'rw' - create extended attributes + when you use chmod/chown/chgrp/mknod/ln -s on the filesystem. +timeshift=(-)nnn (default 0) + Shifts the time by nnn seconds. For example, if you see under linux + one hour more, than under os/2, use timeshift=-3600. + + +File names + +As in OS/2, filenames are case insensitive. However, shell thinks that names +are case sensitive, so for example when you create a file FOO, you can use +'cat FOO', 'cat Foo', 'cat foo' or 'cat F*' but not 'cat f*'. Note, that you +also won't be able to compile linux kernel (and maybe other things) on HPFS +because kernel creates different files with names like bootsect.S and +bootsect.s. When searching for file thats name has characters >= 128, codepages +are used - see below. +OS/2 ignores dots and spaces at the end of file name, so this driver does as +well. If you create 'a. ...', the file 'a' will be created, but you can still +access it under names 'a.', 'a..', 'a . . . ' etc. + + +Extended attributes + +On HPFS partitions, OS/2 can associate to each file a special information called +extended attributes. Extended attributes are pairs of (key,value) where key is +an ascii string identifying that attribute and value is any string of bytes of +variable length. OS/2 stores window and icon positions and file types there. So +why not use it for unix-specific info like file owner or access rights? This +driver can do it. If you chown/chgrp/chmod on a hpfs partition, extended +attributes with keys "UID", "GID" or "MODE" and 2-byte values are created. Only +that extended attributes those value differs from defaults specified in mount +options are created. Once created, the extended attributes are never deleted, +they're just changed. It means that when your default uid=0 and you type +something like 'chown luser file; chown root file' the file will contain +extended attribute UID=0. And when you umount the fs and mount it again with +uid=luser_uid, the file will be still owned by root! If you chmod file to 444, +extended attribute "MODE" will not be set, this special case is done by setting +read-only flag. When you mknod a block or char device, besides "MODE", the +special 4-byte extended attribute "DEV" will be created containing the device +number. Currently this driver cannot resize extended attributes - it means +that if somebody (I don't know who?) has set "UID", "GID", "MODE" or "DEV" +attributes with different sizes, they won't be rewritten and changing these +values doesn't work. + + +Symlinks + +You can do symlinks on HPFS partition, symlinks are achieved by setting extended +attribute named "SYMLINK" with symlink value. Like on ext2, you can chown and +chgrp symlinks but I don't know what is it good for. chmoding symlink results +in chmoding file where symlink points. These symlinks are just for Linux use and +incompatible with OS/2. OS/2 PmShell symlinks are not supported because they are +stored in very crazy way. They tried to do it so that link changes when file is +moved ... sometimes it works. But the link is partly stored in directory +extended attributes and partly in OS2SYS.INI. I don't want (and don't know how) +to analyze or change OS2SYS.INI. + + +Codepages + +HPFS can contain several uppercasing tables for several codepages and each +file has a pointer to codepage it's name is in. However OS/2 was created in +America where people don't care much about codepages and so multiple codepages +support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. +Once I booted English OS/2 working in cp 850 and I created a file on my 852 +partition. It marked file name codepage as 850 - good. But when I again booted +Czech OS/2, the file was completely inaccessible under any name. It seems that +OS/2 uppercases the search pattern with its system code page (852) and file +name it's comparing to with its code page (850). These could never match. Is it +really what IBM developers wanted? But problems continued. When I created in +Czech OS/2 another file in that directory, that file was inaccessible too. OS/2 +probably uses different uppercasing method when searching where to place a file +(note, that files in HPFS directory must be sorted) and when searching for +a file. Finally when I opened this directory in PmShell, PmShell crashed (the +funny thing was that, when rebooted, PmShell tried to reopen this directory +again :-). chkdsk happily ignores these errors and only low-level disk +modification saved me. Never mix different language versions of OS/2 on one +system although HPFS was designed to allow that. +OK, I could implement complex codepage support to this driver but I think it +would cause more problems than benefit with such buggy implementation in OS/2. +So this driver simply uses first codepage it finds for uppercasing and +lowercasing no matter what's file codepage index. Usually all file names are in +this codepage - if you don't try to do what I described above :-) + + +Known bugs + +HPFS386 on OS/2 server is not supported. HPFS386 installed on normal OS/2 client +should work. If you have OS/2 server, use only read-only mode. I don't know how +to handle some HPFS386 structures like access control list or extended perm +list, I don't know how to delete them when file is deleted and how to not +overwrite them with extended attributes. Send me some info on these structures +and I'll make it. However, this driver should detect presence of HPFS386 +structures, remount read-only and not destroy them (I hope). + +When there's not enough space for extended attributes, they will be truncated +and no error is returned. + +OS/2 can't access files if the path is longer than about 256 chars but this +driver allows you to do it. chkdsk ignores such errors. + +Sometimes you won't be able to delete some files on a very full filesystem +(returning error ENOSPC). That's because file in non-leaf node in directory tree +(one directory, if it's large, has dirents in tree on HPFS) must be replaced +with another node when deleted. And that new file might have larger name than +the old one so the new name doesn't fit in directory node (dnode). And that +would result in directory tree splitting, that takes disk space. Workaround is +to delete other files that are leaf (probability that the file is non-leaf is +about 1/50) or to truncate file first to make some space. +You encounter this problem only if you have many directories so that +preallocated directory band is full i.e. + number_of_directories / size_of_filesystem_in_mb > 4. + +You can't delete open directories. + +You can't rename over directories (what is it good for?). + +Renaming files so that only case changes doesn't work. This driver supports it +but vfs doesn't. Something like 'mv file FILE' won't work. + +All atimes and directory mtimes are not updated. That's because of performance +reasons. If you extremely wish to update them, let me know, I'll write it (but +it will be slow). + +When the system is out of memory and swap, it may slightly corrupt filesystem +(lost files, unbalanced directories). (I guess all filesystem may do it). + +When compiled, you get warning: function declaration isn't a prototype. Does +anybody know what does it mean? + + +What does "unbalanced tree" message mean? + +Old versions of this driver created sometimes unbalanced dnode trees. OS/2 +chkdsk doesn't scream if the tree is unbalanced (and sometimes creates +unbalanced trees too :-) but both HPFS and HPFS386 contain bug that it rarely +crashes when the tree is not balanced. This driver handles unbalanced trees +correctly and writes warning if it finds them. If you see this message, this is +probably because of directories created with old version of this driver. +Workaround is to move all files from that directory to another and then back +again. Do it in Linux, not OS/2! If you see this message in directory that is +whole created by this driver, it is BUG - let me know about it. + + +Bugs in OS/2 + +When you have two (or more) lost directories pointing each to other, chkdsk +locks up when repairing filesystem. + +Sometimes (I think it's random) when you create a file with one-char name under +OS/2, OS/2 marks it as 'long'. chkdsk then removes this flag saying "Minor fs +error corrected". + +File names like "a .b" are marked as 'long' by OS/2 but chkdsk "corrects" it and +marks them as short (and writes "minor fs error corrected"). This bug is not in +HPFS386. + +Codepage bugs described above. + +If you don't install fixpacks, there are many, many more... + + +History + +0.90 First public release +0.91 Fixed bug that caused shooting to memory when write_inode was called on + open inode (rarely happened) +0.92 Fixed a little memory leak in freeing directory inodes +0.93 Fixed bug that locked up the machine when there were too many filenames + with first 15 characters same + Fixed write_file to zero file when writing behind file end +0.94 Fixed a little memory leak when trying to delete busy file or directory +0.95 Fixed a bug that i_hpfs_parent_dir was not updated when moving files +1.90 First version for 2.1.1xx kernels +1.91 Fixed a bug that chk_sectors failed when sectors were at the end of disk + Fixed a race-condition when write_inode is called while deleting file + Fixed a bug that could possibly happen (with very low probability) when + using 0xff in filenames + Rewritten locking to avoid race-conditions + Mount option 'eas' now works + Fsync no longer returns error + Files beginning with '.' are marked hidden + Remount support added + Alloc is not so slow when filesystem becomes full + Atimes are no more updated because it slows down operation + Code cleanup (removed all commented debug prints) +1.92 Corrected a bug when sync was called just before closing file +1.93 Modified, so that it works with kernels >= 2.1.131, I don't know if it + works with previous versions + Fixed a possible problem with disks > 64G (but I don't have one, so I can't + test it) + Fixed a file overflow at 2G + Added new option 'timeshift' + Changed behaviour on HPFS386: It is now possible to operate on HPFS386 in + read-only mode + Fixed a bug that slowed down alloc and prevented allocating 100% space + (this bug was not destructive) +1.94 Added workaround for one bug in Linux + Fixed one buffer leak + Fixed some incompatibilities with large extended attributes (but it's still + not 100% ok, I have no info on it and OS/2 doesn't want to create them) + Rewritten allocation + Fixed a bug with i_blocks (du sometimes didn't display correct values) + Directories have no longer archive attribute set (some programs don't like + it) + Fixed a bug that it set badly one flag in large anode tree (it was not + destructive) +1.95 Fixed one buffer leak, that could happen on corrupted filesystem + Fixed one bug in allocation in 1.94 +1.96 Added workaround for one bug in OS/2 (HPFS locked up, HPFS386 reported + error sometimes when opening directories in PMSHELL) + Fixed a possible bitmap race + Fixed possible problem on large disks + You can now delete open files + Fixed a nondestructive race in rename +1.97 Support for HPFS v3 (on large partitions) + Fixed a bug that it didn't allow creation of files > 128M (it should be 2G) +1.97.1 Changed names of global symbols + Fixed a bug when chmoding or chowning root directory +1.98 Fixed a deadlock when using old_readdir + Better directory handling; workaround for "unbalanced tree" bug in OS/2 +1.99 Corrected a possible problem when there's not enough space while deleting + file + Now it tries to truncate the file if there's not enough space when deleting + Removed a lot of redundant code +2.00 Fixed a bug in rename (it was there since 1.96) + Better anti-fragmentation strategy +2.01 Fixed problem with directory listing over NFS + Directory lseek now checks for proper parameters + Fixed race-condition in buffer code - it is in all filesystems in Linux; + when reading device (cat /dev/hda) while creating files on it, files + could be damaged +2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond + end of partition +2.03 Char, block devices and pipes are correctly created + Fixed non-crashing race in unlink (Alexander Viro) + Now it works with Japanese version of OS/2 +2.04 Fixed error when ftruncate used to extend file +2.05 Fixed crash when got mount parameters without = + Fixed crash when allocation of anode failed due to full disk + Fixed some crashes when block io or inode allocation failed +2.06 Fixed some crash on corrupted disk structures + Better allocation strategy + Reschedule points added so that it doesn't lock CPU long time + It should work in read-only mode on Warp Server +2.07 More fixes for Warp Server. Now it really works +2.08 Creating new files is not so slow on large disks + An attempt to sync deleted file does not generate filesystem error +2.09 Fixed error on extremely fragmented files + + + vim: set textwidth=80: diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt new file mode 100644 index 0000000..59a919f --- /dev/null +++ b/Documentation/filesystems/inotify.txt @@ -0,0 +1,269 @@ + inotify + a powerful yet simple file change notification system + + + +Document started 15 Mar 2005 by Robert Love <rml@novell.com> + + +(i) User Interface + +Inotify is controlled by a set of three system calls and normal file I/O on a +returned file descriptor. + +First step in using inotify is to initialise an inotify instance: + + int fd = inotify_init (); + +Each instance is associated with a unique, ordered queue. + +Change events are managed by "watches". A watch is an (object,mask) pair where +the object is a file or directory and the mask is a bit mask of one or more +inotify events that the application wishes to receive. See <linux/inotify.h> +for valid events. A watch is referenced by a watch descriptor, or wd. + +Watches are added via a path to the file. + +Watches on a directory will return events on any files inside of the directory. + +Adding a watch is simple: + + int wd = inotify_add_watch (fd, path, mask); + +Where "fd" is the return value from inotify_init(), path is the path to the +object to watch, and mask is the watch mask (see <linux/inotify.h>). + +You can update an existing watch in the same manner, by passing in a new mask. + +An existing watch is removed via + + int ret = inotify_rm_watch (fd, wd); + +Events are provided in the form of an inotify_event structure that is read(2) +from a given inotify instance. The filename is of dynamic length and follows +the struct. It is of size len. The filename is padded with null bytes to +ensure proper alignment. This padding is reflected in len. + +You can slurp multiple events by passing a large buffer, for example + + size_t len = read (fd, buf, BUF_LEN); + +Where "buf" is a pointer to an array of "inotify_event" structures at least +BUF_LEN bytes in size. The above example will return as many events as are +available and fit in BUF_LEN. + +Each inotify instance fd is also select()- and poll()-able. + +You can find the size of the current event queue via the standard FIONREAD +ioctl on the fd returned by inotify_init(). + +All watches are destroyed and cleaned up on close. + + +(ii) + +Prototypes: + + int inotify_init (void); + int inotify_add_watch (int fd, const char *path, __u32 mask); + int inotify_rm_watch (int fd, __u32 mask); + + +(iii) Kernel Interface + +Inotify's kernel API consists a set of functions for managing watches and an +event callback. + +To use the kernel API, you must first initialize an inotify instance with a set +of inotify_operations. You are given an opaque inotify_handle, which you use +for any further calls to inotify. + + struct inotify_handle *ih = inotify_init(my_event_handler); + +You must provide a function for processing events and a function for destroying +the inotify watch. + + void handle_event(struct inotify_watch *watch, u32 wd, u32 mask, + u32 cookie, const char *name, struct inode *inode) + + watch - the pointer to the inotify_watch that triggered this call + wd - the watch descriptor + mask - describes the event that occurred + cookie - an identifier for synchronizing events + name - the dentry name for affected files in a directory-based event + inode - the affected inode in a directory-based event + + void destroy_watch(struct inotify_watch *watch) + +You may add watches by providing a pre-allocated and initialized inotify_watch +structure and specifying the inode to watch along with an inotify event mask. +You must pin the inode during the call. You will likely wish to embed the +inotify_watch structure in a structure of your own which contains other +information about the watch. Once you add an inotify watch, it is immediately +subject to removal depending on filesystem events. You must grab a reference if +you depend on the watch hanging around after the call. + + inotify_init_watch(&my_watch->iwatch); + inotify_get_watch(&my_watch->iwatch); // optional + s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask); + inotify_put_watch(&my_watch->iwatch); // optional + +You may use the watch descriptor (wd) or the address of the inotify_watch for +other inotify operations. You must not directly read or manipulate data in the +inotify_watch. Additionally, you must not call inotify_add_watch() more than +once for a given inotify_watch structure, unless you have first called either +inotify_rm_watch() or inotify_rm_wd(). + +To determine if you have already registered a watch for a given inode, you may +call inotify_find_watch(), which gives you both the wd and the watch pointer for +the inotify_watch, or an error if the watch does not exist. + + wd = inotify_find_watch(ih, inode, &watchp); + +You may use container_of() on the watch pointer to access your own data +associated with a given watch. When an existing watch is found, +inotify_find_watch() bumps the refcount before releasing its locks. You must +put that reference with: + + put_inotify_watch(watchp); + +Call inotify_find_update_watch() to update the event mask for an existing watch. +inotify_find_update_watch() returns the wd of the updated watch, or an error if +the watch does not exist. + + wd = inotify_find_update_watch(ih, inode, mask); + +An existing watch may be removed by calling either inotify_rm_watch() or +inotify_rm_wd(). + + int ret = inotify_rm_watch(ih, &my_watch->iwatch); + int ret = inotify_rm_wd(ih, wd); + +A watch may be removed while executing your event handler with the following: + + inotify_remove_watch_locked(ih, iwatch); + +Call inotify_destroy() to remove all watches from your inotify instance and +release it. If there are no outstanding references, inotify_destroy() will call +your destroy_watch op for each watch. + + inotify_destroy(ih); + +When inotify removes a watch, it sends an IN_IGNORED event to your callback. +You may use this event as an indication to free the watch memory. Note that +inotify may remove a watch due to filesystem events, as well as by your request. +If you use IN_ONESHOT, inotify will remove the watch after the first event, at +which point you may call the final inotify_put_watch. + +(iv) Kernel Interface Prototypes + + struct inotify_handle *inotify_init(struct inotify_operations *ops); + + inotify_init_watch(struct inotify_watch *watch); + + s32 inotify_add_watch(struct inotify_handle *ih, + struct inotify_watch *watch, + struct inode *inode, u32 mask); + + s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode, + struct inotify_watch **watchp); + + s32 inotify_find_update_watch(struct inotify_handle *ih, + struct inode *inode, u32 mask); + + int inotify_rm_wd(struct inotify_handle *ih, u32 wd); + + int inotify_rm_watch(struct inotify_handle *ih, + struct inotify_watch *watch); + + void inotify_remove_watch_locked(struct inotify_handle *ih, + struct inotify_watch *watch); + + void inotify_destroy(struct inotify_handle *ih); + + void get_inotify_watch(struct inotify_watch *watch); + void put_inotify_watch(struct inotify_watch *watch); + + +(v) Internal Kernel Implementation + +Each inotify instance is represented by an inotify_handle structure. +Inotify's userspace consumers also have an inotify_device which is +associated with the inotify_handle, and on which events are queued. + +Each watch is associated with an inotify_watch structure. Watches are chained +off of each associated inotify_handle and each associated inode. + +See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules. + + +(vi) Rationale + +Q: What is the design decision behind not tying the watch to the open fd of + the watched object? + +A: Watches are associated with an open inotify device, not an open file. + This solves the primary problem with dnotify: keeping the file open pins + the file and thus, worse, pins the mount. Dnotify is therefore infeasible + for use on a desktop system with removable media as the media cannot be + unmounted. Watching a file should not require that it be open. + +Q: What is the design decision behind using an-fd-per-instance as opposed to + an fd-per-watch? + +A: An fd-per-watch quickly consumes more file descriptors than are allowed, + more fd's than are feasible to manage, and more fd's than are optimally + select()-able. Yes, root can bump the per-process fd limit and yes, users + can use epoll, but requiring both is a silly and extraneous requirement. + A watch consumes less memory than an open file, separating the number + spaces is thus sensible. The current design is what user-space developers + want: Users initialize inotify, once, and add n watches, requiring but one + fd and no twiddling with fd limits. Initializing an inotify instance two + thousand times is silly. If we can implement user-space's preferences + cleanly--and we can, the idr layer makes stuff like this trivial--then we + should. + + There are other good arguments. With a single fd, there is a single + item to block on, which is mapped to a single queue of events. The single + fd returns all watch events and also any potential out-of-band data. If + every fd was a separate watch, + + - There would be no way to get event ordering. Events on file foo and + file bar would pop poll() on both fd's, but there would be no way to tell + which happened first. A single queue trivially gives you ordering. Such + ordering is crucial to existing applications such as Beagle. Imagine + "mv a b ; mv b a" events without ordering. + + - We'd have to maintain n fd's and n internal queues with state, + versus just one. It is a lot messier in the kernel. A single, linear + queue is the data structure that makes sense. + + - User-space developers prefer the current API. The Beagle guys, for + example, love it. Trust me, I asked. It is not a surprise: Who'd want + to manage and block on 1000 fd's via select? + + - No way to get out of band data. + + - 1024 is still too low. ;-) + + When you talk about designing a file change notification system that + scales to 1000s of directories, juggling 1000s of fd's just does not seem + the right interface. It is too heavy. + + Additionally, it _is_ possible to more than one instance and + juggle more than one queue and thus more than one associated fd. There + need not be a one-fd-per-process mapping; it is one-fd-per-queue and a + process can easily want more than one queue. + +Q: Why the system call approach? + +A: The poor user-space interface is the second biggest problem with dnotify. + Signals are a terrible, terrible interface for file notification. Or for + anything, for that matter. The ideal solution, from all perspectives, is a + file descriptor-based one that allows basic file I/O and poll/select. + Obtaining the fd and managing the watches could have been done either via a + device file or a family of new system calls. We decided to implement a + family of system calls because that is the preferred approach for new kernel + interfaces. The only real difference was whether we wanted to use open(2) + and ioctl(2) or a couple of new system calls. System calls beat ioctls. + diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt new file mode 100644 index 0000000..6973b98 --- /dev/null +++ b/Documentation/filesystems/isofs.txt @@ -0,0 +1,43 @@ +Mount options that are the same as for msdos and vfat partitions. + + gid=nnn All files in the partition will be in group nnn. + uid=nnn All files in the partition will be owned by user id nnn. + umask=nnn The permission mask (see umask(1)) for the partition. + +Mount options that are the same as vfat partitions. These are only useful +when using discs encoded using Microsoft's Joliet extensions. + iocharset=name Character set to use for converting from Unicode to + ASCII. Joliet filenames are stored in Unicode format, but + Unix for the most part doesn't know how to deal with Unicode. + There is also an option of doing UTF-8 translations with the + utf8 option. + utf8 Encode Unicode names in UTF-8 format. Default is no. + +Mount options unique to the isofs filesystem. + block=512 Set the block size for the disk to 512 bytes + block=1024 Set the block size for the disk to 1024 bytes + block=2048 Set the block size for the disk to 2048 bytes + check=relaxed Matches filenames with different cases + check=strict Matches only filenames with the exact same case + cruft Try to handle badly formatted CDs. + map=off Do not map non-Rock Ridge filenames to lower case + map=normal Map non-Rock Ridge filenames to lower case + map=acorn As map=normal but also apply Acorn extensions if present + mode=xxx Sets the permissions on files to xxx + dmode=xxx Sets the permissions on directories to xxx + nojoliet Ignore Joliet extensions if they are present. + norock Ignore Rock Ridge extensions if they are present. + hide Completely strip hidden files from the file system. + showassoc Show files marked with the 'associated' bit + unhide Deprecated; showing hidden files is now default; + If given, it is a synonym for 'showassoc' which will + recreate previous unhide behavior + session=x Select number of session on multisession CD + sbsector=xxx Session begins from sector xxx + +Recommended documents about ISO 9660 standard are located at: +http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm +ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf +Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically +identical with ISO 9660.", so it is a valid and gratis substitute of the +official ISO specification. diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt new file mode 100644 index 0000000..26ebde7 --- /dev/null +++ b/Documentation/filesystems/jfs.txt @@ -0,0 +1,41 @@ +IBM's Journaled File System (JFS) for Linux + +JFS Homepage: http://jfs.sourceforge.net/ + +The following mount options are supported: + +iocharset=name Character set to use for converting from Unicode to + ASCII. The default is to do no conversion. Use + iocharset=utf8 for UTF-8 translations. This requires + CONFIG_NLS_UTF8 to be set in the kernel .config file. + iocharset=none specifies the default behavior explicitly. + +resize=value Resize the volume to <value> blocks. JFS only supports + growing a volume, not shrinking it. This option is only + valid during a remount, when the volume is mounted + read-write. The resize keyword with no value will grow + the volume to the full size of the partition. + +nointegrity Do not write to the journal. The primary use of this option + is to allow for higher performance when restoring a volume + from backup media. The integrity of the volume is not + guaranteed if the system abnormally abends. + +integrity Default. Commit metadata changes to the journal. Use this + option to remount a volume where the nointegrity option was + previously specified in order to restore normal behavior. + +errors=continue Keep going on a filesystem error. +errors=remount-ro Default. Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. + +uid=value Override on-disk uid with specified value +gid=value Override on-disk gid with specified value +umask=value Override on-disk umask with specified octal value. For + directories, the execute bit will be set if the corresponding + read bit is set. + +Please send bugs, comments, cards and letters to shaggy@linux.vnet.ibm.com. + +The JFS mailing list can be subscribed to by using the link labeled +"Mail list Subscribe" at our web page http://jfs.sourceforge.net/ diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.txt new file mode 100644 index 0000000..fab857a --- /dev/null +++ b/Documentation/filesystems/locks.txt @@ -0,0 +1,67 @@ + File Locking Release Notes + + Andy Walker <andy@lysaker.kvaerner.no> + + 12 May 1997 + + +1. What's New? +-------------- + +1.1 Broken Flock Emulation +-------------------------- + +The old flock(2) emulation in the kernel was swapped for proper BSD +compatible flock(2) support in the 1.3.x series of kernels. With the +release of the 2.1.x kernel series, support for the old emulation has +been totally removed, so that we don't need to carry this baggage +forever. + +This should not cause problems for anybody, since everybody using a +2.1.x kernel should have updated their C library to a suitable version +anyway (see the file "Documentation/Changes".) + +1.2 Allow Mixed Locks Again +--------------------------- + +1.2.1 Typical Problems - Sendmail +--------------------------------- +Because sendmail was unable to use the old flock() emulation, many sendmail +installations use fcntl() instead of flock(). This is true of Slackware 3.0 +for example. This gave rise to some other subtle problems if sendmail was +configured to rebuild the alias file. Sendmail tried to lock the aliases.dir +file with fcntl() at the same time as the GDBM routines tried to lock this +file with flock(). With pre 1.3.96 kernels this could result in deadlocks that, +over time, or under a very heavy mail load, would eventually cause the kernel +to lock solid with deadlocked processes. + + +1.2.2 The Solution +------------------ +The solution I have chosen, after much experimentation and discussion, +is to make flock() and fcntl() locks oblivious to each other. Both can +exists, and neither will have any effect on the other. + +I wanted the two lock styles to be cooperative, but there were so many +race and deadlock conditions that the current solution was the only +practical one. It puts us in the same position as, for example, SunOS +4.1.x and several other commercial Unices. The only OS's that support +cooperative flock()/fcntl() are those that emulate flock() using +fcntl(), with all the problems that implies. + + +1.3 Mandatory Locking As A Mount Option +--------------------------------------- + +Mandatory locking, as described in 'Documentation/filesystems/mandatory.txt' +was prior to this release a general configuration option that was valid for +all mounted filesystems. This had a number of inherent dangers, not the +least of which was the ability to freeze an NFS server by asking it to read +a file for which a mandatory lock existed. + +From this release of the kernel, mandatory locking can be turned on and off +on a per-filesystem basis, using the mount options 'mand' and 'nomand'. +The default is to disallow mandatory locking. The intention is that +mandatory locking only be enabled on a local filesystem as the specific need +arises. + diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.txt new file mode 100644 index 0000000..0979d1d --- /dev/null +++ b/Documentation/filesystems/mandatory-locking.txt @@ -0,0 +1,171 @@ + Mandatory File Locking For The Linux Operating System + + Andy Walker <andy@lysaker.kvaerner.no> + + 15 April 1996 + (Updated September 2007) + +0. Why you should avoid mandatory locking +----------------------------------------- + +The Linux implementation is prey to a number of difficult-to-fix race +conditions which in practice make it not dependable: + + - The write system call checks for a mandatory lock only once + at its start. It is therefore possible for a lock request to + be granted after this check but before the data is modified. + A process may then see file data change even while a mandatory + lock was held. + - Similarly, an exclusive lock may be granted on a file after + the kernel has decided to proceed with a read, but before the + read has actually completed, and the reading process may see + the file data in a state which should not have been visible + to it. + - Similar races make the claimed mutual exclusion between lock + and mmap similarly unreliable. + +1. What is mandatory locking? +------------------------------ + +Mandatory locking is kernel enforced file locking, as opposed to the more usual +cooperative file locking used to guarantee sequential access to files among +processes. File locks are applied using the flock() and fcntl() system calls +(and the lockf() library routine which is a wrapper around fcntl().) It is +normally a process' responsibility to check for locks on a file it wishes to +update, before applying its own lock, updating the file and unlocking it again. +The most commonly used example of this (and in the case of sendmail, the most +troublesome) is access to a user's mailbox. The mail user agent and the mail +transfer agent must guard against updating the mailbox at the same time, and +prevent reading the mailbox while it is being updated. + +In a perfect world all processes would use and honour a cooperative, or +"advisory" locking scheme. However, the world isn't perfect, and there's +a lot of poorly written code out there. + +In trying to address this problem, the designers of System V UNIX came up +with a "mandatory" locking scheme, whereby the operating system kernel would +block attempts by a process to write to a file that another process holds a +"read" -or- "shared" lock on, and block attempts to both read and write to a +file that a process holds a "write " -or- "exclusive" lock on. + +The System V mandatory locking scheme was intended to have as little impact as +possible on existing user code. The scheme is based on marking individual files +as candidates for mandatory locking, and using the existing fcntl()/lockf() +interface for applying locks just as if they were normal, advisory locks. + +Note 1: In saying "file" in the paragraphs above I am actually not telling +the whole truth. System V locking is based on fcntl(). The granularity of +fcntl() is such that it allows the locking of byte ranges in files, in addition +to entire files, so the mandatory locking rules also have byte level +granularity. + +Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite +borrowing the fcntl() locking scheme from System V. The mandatory locking +scheme is defined by the System V Interface Definition (SVID) Version 3. + +2. Marking a file for mandatory locking +--------------------------------------- + +A file is marked as a candidate for mandatory locking by setting the group-id +bit in its file mode but removing the group-execute bit. This is an otherwise +meaningless combination, and was chosen by the System V implementors so as not +to break existing user programs. + +Note that the group-id bit is usually automatically cleared by the kernel when +a setgid file is written to. This is a security measure. The kernel has been +modified to recognize the special case of a mandatory lock candidate and to +refrain from clearing this bit. Similarly the kernel has been modified not +to run mandatory lock candidates with setgid privileges. + +3. Available implementations +---------------------------- + +I have considered the implementations of mandatory locking available with +SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. + +Generally I have tried to make the most sense out of the behaviour exhibited +by these three reference systems. There are many anomalies. + +All the reference systems reject all calls to open() for a file on which +another process has outstanding mandatory locks. This is in direct +contravention of SVID 3, which states that only calls to open() with the +O_TRUNC flag set should be rejected. The Linux implementation follows the SVID +definition, which is the "Right Thing", since only calls with O_TRUNC can +modify the contents of the file. + +HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not +just mandatory locks. That would appear to contravene POSIX.1. + +mmap() is another interesting case. All the operating systems mentioned +prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX +also disallows advisory locks for such a file. SVID actually specifies the +paranoid HP-UX behaviour. + +In my opinion only MAP_SHARED mappings should be immune from locking, and then +only from mandatory locks - that is what is currently implemented. + +SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for +mandatory locks, so reads and writes to locked files always block when they +should return EAGAIN. + +I'm afraid that this is such an esoteric area that the semantics described +below are just as valid as any others, so long as the main points seem to +agree. + +4. Semantics +------------ + +1. Mandatory locks can only be applied via the fcntl()/lockf() locking + interface - in other words the System V/POSIX interface. BSD style + locks using flock() never result in a mandatory lock. + +2. If a process has locked a region of a file with a mandatory read lock, then + other processes are permitted to read from that region. If any of these + processes attempts to write to the region it will block until the lock is + released, unless the process has opened the file with the O_NONBLOCK + flag in which case the system call will return immediately with the error + status EAGAIN. + +3. If a process has locked a region of a file with a mandatory write lock, all + attempts to read or write to that region block until the lock is released, + unless a process has opened the file with the O_NONBLOCK flag in which case + the system call will return immediately with the error status EAGAIN. + +4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has + any mandatory locks owned by other processes will be rejected with the + error status EAGAIN. + +5. Attempts to apply a mandatory lock to a file that is memory mapped and + shared (via mmap() with MAP_SHARED) will be rejected with the error status + EAGAIN. + +6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) + that has any mandatory locks in effect will be rejected with the error status + EAGAIN. + +5. Which system calls are affected? +----------------------------------- + +Those which modify a file's contents, not just the inode. That gives read(), +write(), readv(), writev(), open(), creat(), mmap(), truncate() and +ftruncate(). truncate() and ftruncate() are considered to be "write" actions +for the purposes of mandatory locking. + +The affected region is usually defined as stretching from the current position +for the total number of bytes read or written. For the truncate calls it is +defined as the bytes of a file removed or added (we must also consider bytes +added, as a lock can specify just "the whole file", rather than a specific +range of bytes.) + +Note 3: I may have overlooked some system calls that need mandatory lock +checking in my eagerness to get this code out the door. Please let me know, or +better still fix the system calls yourself and submit a patch to me or Linus. + +6. Warning! +----------- + +Not even root can override a mandatory lock, so runaway processes can wreak +havoc if they lock crucial files. The way around it is to change the file +permissions (remove the setgid bit) before trying to read or write to it. +Of course, that might be a bit tricky if the system is hung :-( + diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt new file mode 100644 index 0000000..f12c30c --- /dev/null +++ b/Documentation/filesystems/ncpfs.txt @@ -0,0 +1,12 @@ +The ncpfs filesystem understands the NCP protocol, designed by the +Novell Corporation for their NetWare(tm) product. NCP is functionally +similar to the NFS used in the TCP/IP community. +To mount a NetWare filesystem, you need a special mount program, which +can be found in the ncpfs package. The home site for ncpfs is +ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors +will have it as well. + +Related products are linware and mars_nwe, which will give Linux partial +NetWare server functionality. Linware's home site is +klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on +ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs-rdma.txt new file mode 100644 index 0000000..44bd766 --- /dev/null +++ b/Documentation/filesystems/nfs-rdma.txt @@ -0,0 +1,271 @@ +################################################################################ +# # +# NFS/RDMA README # +# # +################################################################################ + + Author: NetApp and Open Grid Computing + Date: May 29, 2008 + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + + This document describes how to install and setup the Linux NFS/RDMA client + and server software. + + The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server + was first included in the following release, Linux 2.6.25. + + In our testing, we have obtained excellent performance results (full 10Gbit + wire bandwidth at minimal client CPU) under many workloads. The code passes + the full Connectathon test suite and operates over both Infiniband and iWARP + RDMA adapters. + +Getting Help +~~~~~~~~~~~~ + + If you get stuck, you can ask questions on the + + nfs-rdma-devel@lists.sourceforge.net + + mailing list. + +Installation +~~~~~~~~~~~~ + + These instructions are a step by step guide to building a machine for + use with NFS/RDMA. + + - Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards, the + Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + + - Install a Linux distribution and tools + + The first kernel release to contain both the NFS/RDMA client and server was + Linux 2.6.25 Therefore, a distribution compatible with this and subsequent + Linux kernel release should be installed. + + The procedures described in this document have been tested with + distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). + + - Install nfs-utils-1.1.2 or greater on the client + + An NFS/RDMA mount point can be obtained by using the mount.nfs command in + nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils + version with support for NFS/RDMA mounts, but for various reasons we + recommend using nfs-utils-1.1.2 or greater). To see which version of + mount.nfs you are using, type: + + $ /sbin/mount.nfs -V + + If the version is less than 1.1.2 or the command does not exist, + you should install the latest version of nfs-utils. + + Download the latest package from: + + http://www.kernel.org/pub/linux/utils/nfs + + Uncompress the package and follow the installation instructions. + + If you will not need the idmapper and gssd executables (you do not need + these to create an NFS/RDMA enabled mount command), the installation + process can be simplified by disabling these features when running + configure: + + $ ./configure --disable-gss --disable-nfsv4 + + To build nfs-utils you will need the tcp_wrappers package installed. For + more information on this see the package's README and INSTALL files. + + After building the nfs-utils package, there will be a mount.nfs binary in + the utils/mount directory. This binary can be used to initiate NFS v2, v3, + or v4 mounts. To initiate a v4 mount, the binary must be called + mount.nfs4. The standard technique is to create a symlink called + mount.nfs4 to mount.nfs. + + This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + + $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + + In this location, mount.nfs will be invoked automatically for NFS mounts + by the system mount commmand. + + NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed + on the NFS client machine. You do not need this specific version of + nfs-utils on the server. Furthermore, only the mount.nfs command from + nfs-utils-1.1.2 is needed on the client. + + - Install a Linux kernel with NFS/RDMA + + The NFS/RDMA client and server are both included in the mainline Linux + kernel version 2.6.25 and later. This and other versions of the 2.6 Linux + kernel can be found at: + + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + + Download the sources and place them in an appropriate location. + + - Configure the RDMA stack + + Make sure your kernel configuration has RDMA support enabled. Under + Device Drivers -> InfiniBand support, update the kernel configuration + to enable InfiniBand support [NOTE: the option name is misleading. Enabling + InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + + Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or + iWARP adapter support (amso, cxgb3, etc.). + + If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + + - Configure the NFS client and server + + Your kernel configuration must also have NFS file system support and/or + NFS server support enabled. These and other NFS related configuration + options can be found under File Systems -> Network File Systems. + + - Build, install, reboot + + The NFS/RDMA code will be enabled automatically if NFS and RDMA + are turned on. The NFS/RDMA client and server are configured via the hidden + SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The + value of SUNRPC_XPRT_RDMA will be: + + - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client + and server will not be built + - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, + in this case the NFS/RDMA client and server will be built as modules + - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client + and server will be built into the kernel + + Therefore, if you have followed the steps above and turned no NFS and RDMA, + the NFS/RDMA client and server will be built. + + Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + + Before configuring the NFS/RDMA software, it is a good idea to test + your new kernel to ensure that the kernel is working correctly. + In particular, it is a good idea to verify that the RDMA stack + is functioning as expected and standard NFS over TCP/IP and/or UDP/IP + is working properly. + + - Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + host1$ ifconfig ib0 a.b.c.x + host2$ ifconfig ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + + - Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + + We recommend that you use two machines, one to act as the client and + one to act as the server. + + One time configuration: + + - On the server system, configure the /etc/exports file and + start the NFS/RDMA server. + + Exports entries with the following formats have been tested: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the cleint's iWARP address(es) for an RNIC. + + NOTE: The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. + + Each time a machine boots: + + - Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ifconfig ib0 a.b.c.d + + NOTE: use unique addresses for the client and server + + - Start the NFS server + + If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA transport module: + + $ modprobe svcrdma + + Regardless of how the server was built (module or built-in), start the + server: + + $ /etc/init.d/nfs start + + or + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + $ echo rdma 2050 > /proc/fs/nfsd/portlist + + - On the client system + + If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA client module: + + $ modprobe xprtrdma.ko + + Regardless of how the client was built (module or built-in), use this + command to mount the NFS/RDMA server: + + $ mount -o rdma,port=2050 <IPoIB-server-name-or-address>:/<export> /mnt + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfsroot.txt b/Documentation/filesystems/nfsroot.txt new file mode 100644 index 0000000..68baddf --- /dev/null +++ b/Documentation/filesystems/nfsroot.txt @@ -0,0 +1,270 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> +Updated 2006 by Horms <horms@verge.net.au> + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities + ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line + ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] + + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. + + <server-ip> Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + <root-dir> Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + <nfs-options> Standard NFS options. All options are separated by commas. + The following defaults are used: + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf> + + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The <autoconf> parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + <client-ip> IP address of the client. + + Default: Determined using autoconfiguration. + + <server-ip> IP address of the NFS server. If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + <gw-ip> IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + <netmask> Netmask for local network interface. If unspecified + the netmask is derived from the client IP address assuming + classful addressing. + + Default: Determined using autoconfiguration. + + <hostname> Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. + + Default: Client IP address is used in ASCII notation. + + <device> Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + <autoconf> Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + Default: any + + + + +3.) Boot Loader + ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1) Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use zimage + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch/<ARCH>/boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +3.2) Using LILO + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +3.3) Using GRUB + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel <kernel> <parameters> + +3.4) Using loadlin + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +3.5) Using a boot ROM + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +3.6) Using pxelinux + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel <relative-path-below /tftpboot>". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits + ------- + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann <gero@gkminix.han.de>. + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt new file mode 100644 index 0000000..ac2a261 --- /dev/null +++ b/Documentation/filesystems/ntfs.txt @@ -0,0 +1,716 @@ +The Linux NTFS filesystem driver +================================ + + +Table of contents +================= + +- Overview +- Web site +- Features +- Supported mount options +- Known bugs and (mis-)features +- Using NTFS volume and stripe sets + - The Device-Mapper driver + - The Software RAID / MD driver + - Limitations when using the MD driver +- ChangeLog + + +Overview +======== + +Linux-NTFS comes with a number of user-space programs known as ntfsprogs. +These include mkntfs, a full-featured ntfs filesystem format utility, +ntfsundelete used for recovering files that were unintentionally deleted +from an NTFS volume and ntfsresize which is used to resize an NTFS partition. +See the web site for more information. + +To mount an NTFS 1.2/3.x (Windows NT4/2000/XP/2003) volume, use the file +system type 'ntfs'. The driver currently supports read-only mode (with no +fault-tolerance, encryption or journalling) and very limited, but safe, write +support. + +For fault tolerance and raid support (i.e. volume and stripe sets), you can +use the kernel's Software RAID / MD driver. See section "Using Software RAID +with NTFS" for details. + + +Web site +======== + +There is plenty of additional information on the linux-ntfs web site +at http://www.linux-ntfs.org/ + +The web site has a lot of additional information, such as a comprehensive +FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS +userspace utilities, etc. + + +Features +======== + +- This is a complete rewrite of the NTFS driver that used to be in the 2.4 and + earlier kernels. This new driver implements NTFS read support and is + functionally equivalent to the old ntfs driver and it also implements limited + write support. The biggest limitation at present is that files/directories + cannot be created or deleted. See below for the list of write features that + are so far supported. Another limitation is that writing to compressed files + is not implemented at all. Also, neither read nor write access to encrypted + files is so far implemented. +- The new driver has full support for sparse files on NTFS 3.x volumes which + the old driver isn't happy with. +- The new driver supports execution of binaries due to mmap() now being + supported. +- The new driver supports loopback mounting of files on NTFS which is used by + some Linux distributions to enable the user to run Linux from an NTFS + partition by creating a large file while in Windows and then loopback + mounting the file while in Linux and creating a Linux filesystem on it that + is used to install Linux on it. +- A comparison of the two drivers using: + time find . -type f -exec md5sum "{}" \; + run three times in sequence with each driver (after a reboot) on a 1.4GiB + NTFS partition, showed the new driver to be 20% faster in total time elapsed + (from 9:43 minutes on average down to 7:53). The time spent in user space + was unchanged but the time spent in the kernel was decreased by a factor of + 2.5 (from 85 CPU seconds down to 33). +- The driver does not support short file names in general. For backwards + compatibility, we implement access to files using their short file names if + they exist. The driver will not create short file names however, and a + rename will discard any existing short file name. +- The new driver supports exporting of mounted NTFS volumes via NFS. +- The new driver supports async io (aio). +- The new driver supports fsync(2), fdatasync(2), and msync(2). +- The new driver supports readv(2) and writev(2). +- The new driver supports access time updates (including mtime and ctime). +- The new driver supports truncate(2) and open(2) with O_TRUNC. But at present + only very limited support for highly fragmented files, i.e. ones which have + their data attribute split across multiple extents, is included. Another + limitation is that at present truncate(2) will never create sparse files, + since to mark a file sparse we need to modify the directory entry for the + file and we do not implement directory modifications yet. +- The new driver supports write(2) which can both overwrite existing data and + extend the file size so that you can write beyond the existing data. Also, + writing into sparse regions is supported and the holes are filled in with + clusters. But at present only limited support for highly fragmented files, + i.e. ones which have their data attribute split across multiple extents, is + included. Another limitation is that write(2) will never create sparse + files, since to mark a file sparse we need to modify the directory entry for + the file and we do not implement directory modifications yet. + +Supported mount options +======================= + +In addition to the generic mount options described by the manual page for the +mount command (man 8 mount, also see man 5 fstab), the NTFS driver supports the +following mount options: + +iocharset=name Deprecated option. Still supported but please use + nls=name in the future. See description for nls=name. + +nls=name Character set to use when returning file names. + Unlike VFAT, NTFS suppresses names that contain + unconvertible characters. Note that most character + sets contain insufficient characters to represent all + possible Unicode characters that can exist on NTFS. + To be sure you are not missing any files, you are + advised to use nls=utf8 which is capable of + representing all Unicode characters. + +utf8=<bool> Option no longer supported. Currently mapped to + nls=utf8 but please use nls=utf8 in the future and + make sure utf8 is compiled either as module or into + the kernel. See description for nls=name. + +uid= +gid= +umask= Provide default owner, group, and access mode mask. + These options work as documented in mount(8). By + default, the files/directories are owned by root and + he/she has read and write permissions, as well as + browse permission for directories. No one else has any + access permissions. I.e. the mode on all files is by + default rw------- and for directories rwx------, a + consequence of the default fmask=0177 and dmask=0077. + Using a umask of zero will grant all permissions to + everyone, i.e. all files and directories will have mode + rwxrwxrwx. + +fmask= +dmask= Instead of specifying umask which applies both to + files and directories, fmask applies only to files and + dmask only to directories. + +sloppy=<BOOL> If sloppy is specified, ignore unknown mount options. + Otherwise the default behaviour is to abort mount if + any unknown options are found. + +show_sys_files=<BOOL> If show_sys_files is specified, show the system files + in directory listings. Otherwise the default behaviour + is to hide the system files. + Note that even when show_sys_files is specified, "$MFT" + will not be visible due to bugs/mis-features in glibc. + Further, note that irrespective of show_sys_files, all + files are accessible by name, i.e. you can always do + "ls -l \$UpCase" for example to specifically show the + system file containing the Unicode upcase table. + +case_sensitive=<BOOL> If case_sensitive is specified, treat all file names as + case sensitive and create file names in the POSIX + namespace. Otherwise the default behaviour is to treat + file names as case insensitive and to create file names + in the WIN32/LONG name space. Note, the Linux NTFS + driver will never create short file names and will + remove them on rename/delete of the corresponding long + file name. + Note that files remain accessible via their short file + name, if it exists. If case_sensitive, you will need + to provide the correct case of the short file name. + +disable_sparse=<BOOL> If disable_sparse is specified, creation of sparse + regions, i.e. holes, inside files is disabled for the + volume (for the duration of this mount only). By + default, creation of sparse regions is enabled, which + is consistent with the behaviour of traditional Unix + filesystems. + +errors=opt What to do when critical filesystem errors are found. + Following values can be used for "opt": + continue: DEFAULT, try to clean-up as much as + possible, e.g. marking a corrupt inode as + bad so it is no longer accessed, and then + continue. + recover: At present only supported is recovery of + the boot sector from the backup copy. + If read-only mount, the recovery is done + in memory only and not written to disk. + Note that the options are additive, i.e. specifying: + errors=continue,errors=recover + means the driver will attempt to recover and if that + fails it will clean-up as much as possible and + continue. + +mft_zone_multiplier= Set the MFT zone multiplier for the volume (this + setting is not persistent across mounts and can be + changed from mount to mount but cannot be changed on + remount). Values of 1 to 4 are allowed, 1 being the + default. The MFT zone multiplier determines how much + space is reserved for the MFT on the volume. If all + other space is used up, then the MFT zone will be + shrunk dynamically, so this has no impact on the + amount of free space. However, it can have an impact + on performance by affecting fragmentation of the MFT. + In general use the default. If you have a lot of small + files then use a higher value. The values have the + following meaning: + Value MFT zone size (% of volume size) + 1 12.5% + 2 25% + 3 37.5% + 4 50% + Note this option is irrelevant for read-only mounts. + + +Known bugs and (mis-)features +============================= + +- The link count on each directory inode entry is set to 1, due to Linux not + supporting directory hard links. This may well confuse some user space + applications, since the directory names will have the same inode numbers. + This also speeds up ntfs_read_inode() immensely. And we haven't found any + problems with this approach so far. If you find a problem with this, please + let us know. + + +Please send bug reports/comments/feedback/abuse to the Linux-NTFS development +list at sourceforge: linux-ntfs-dev@lists.sourceforge.net + + +Using NTFS volume and stripe sets +================================= + +For support of volume and stripe sets, you can either use the kernel's +Device-Mapper driver or the kernel's Software RAID / MD driver. The former is +the recommended one to use for linear raid. But the latter is required for +raid level 5. For striping and mirroring, either driver should work fine. + + +The Device-Mapper driver +------------------------ + +You will need to create a table of the components of the volume/stripe set and +how they fit together and load this into the kernel using the dmsetup utility +(see man 8 dmsetup). + +Linear volume sets, i.e. linear raid, has been tested and works fine. Even +though untested, there is no reason why stripe sets, i.e. raid level 0, and +mirrors, i.e. raid level 1 should not work, too. Stripes with parity, i.e. +raid level 5, unfortunately cannot work yet because the current version of the +Device-Mapper driver does not support raid level 5. You may be able to use the +Software RAID / MD driver for raid level 5, see the next section for details. + +To create the table describing your volume you will need to know each of its +components and their sizes in sectors, i.e. multiples of 512-byte blocks. + +For NT4 fault tolerant volumes you can obtain the sizes using fdisk. So for +example if one of your partitions is /dev/hda2 you would do: + +$ fdisk -ul /dev/hda + +Disk /dev/hda: 81.9 GB, 81964302336 bytes +255 heads, 63 sectors/track, 9964 cylinders, total 160086528 sectors +Units = sectors of 1 * 512 = 512 bytes + + Device Boot Start End Blocks Id System + /dev/hda1 * 63 4209029 2104483+ 83 Linux + /dev/hda2 4209030 37768814 16779892+ 86 NTFS + /dev/hda3 37768815 46170809 4200997+ 83 Linux + +And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = +33559785 sectors. + +For Win2k and later dynamic disks, you can for example use the ldminfo utility +which is part of the Linux LDM tools (the latest version at the time of +writing is linux-ldm-0.0.8.tar.bz2). You can download it from: + http://www.linux-ntfs.org/ +Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go +into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You +will find the precompiled (i386) ldminfo utility there. NOTE: You will not be +able to compile this yourself easily so use the binary version! + +Then you would use ldminfo in dump mode to obtain the necessary information: + +$ ./ldminfo --dump /dev/hda + +This would dump the LDM database found on /dev/hda which describes all of your +dynamic disks and all the volumes on them. At the bottom you will see the +VOLUME DEFINITIONS section which is all you really need. You may need to look +further above to determine which of the disks in the volume definitions is +which device in Linux. Hint: Run ldminfo on each of your dynamic disks and +look at the Disk Id close to the top of the output for each (the PRIVATE HEADER +section). You can then find these Disk Ids in the VBLK DATABASE section in the +<Disk> components where you will get the LDM Name for the disk that is found in +the VOLUME DEFINITIONS section. + +Note you will also need to enable the LDM driver in the Linux kernel. If your +distribution did not enable it, you will need to recompile the kernel with it +enabled. This will create the LDM partitions on each device at boot time. You +would then use those devices (for /dev/hda they would be /dev/hda1, 2, 3, etc) +in the Device-Mapper table. + +You can also bypass using the LDM driver by using the main device (e.g. +/dev/hda) and then using the offsets of the LDM partitions into this device as +the "Start sector of device" when creating the table. Once again ldminfo would +give you the correct information to do this. + +Assuming you know all your devices and their sizes things are easy. + +For a linear raid the table would look like this (note all values are in +512-byte sectors): + +--- cut here --- +# Offset into Size of this Raid type Device Start sector +# volume device of device +0 1028161 linear /dev/hda1 0 +1028161 3903762 linear /dev/hdb2 0 +4931923 2103211 linear /dev/hdc1 0 +--- cut here --- + +For a striped volume, i.e. raid level 0, you will need to know the chunk size +you used when creating the volume. Windows uses 64kiB as the default, so it +will probably be this unless you changes the defaults when creating the array. + +For a raid level 0 the table would look like this (note all values are in +512-byte sectors): + +--- cut here --- +# Offset Size Raid Number Chunk 1st Start 2nd Start +# into of the type of size Device in Device in +# volume volume stripes device device +0 2056320 striped 2 128 /dev/hda1 0 /dev/hdb1 0 +--- cut here --- + +If there are more than two devices, just add each of them to the end of the +line. + +Finally, for a mirrored volume, i.e. raid level 1, the table would look like +this (note all values are in 512-byte sectors): + +--- cut here --- +# Ofs Size Raid Log Number Region Should Number Source Start Target Start +# in of the type type of log size sync? of Device in Device in +# vol volume params mirrors Device Device +0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 +--- cut here --- + +If you are mirroring to multiple devices you can specify further targets at the +end of the line. + +Note the "Should sync?" parameter "nosync" means that the two mirrors are +already in sync which will be the case on a clean shutdown of Windows. If the +mirrors are not clean, you can specify the "sync" option instead of "nosync" +and the Device-Mapper driver will then copy the entirety of the "Source Device" +to the "Target Device" or if you specified multipled target devices to all of +them. + +Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), +and hand it over to dmsetup to work with, like so: + +$ dmsetup create myvolume1 /etc/ntfsvolume1 + +You can obviously replace "myvolume1" with whatever name you like. + +If it all worked, you will now have the device /dev/device-mapper/myvolume1 +which you can then just use as an argument to the mount command as usual to +mount the ntfs volume. For example: + +$ mount -t ntfs -o ro /dev/device-mapper/myvolume1 /mnt/myvol1 + +(You need to create the directory /mnt/myvol1 first and of course you can use +anything you like instead of /mnt/myvol1 as long as it is an existing +directory.) + +It is advisable to do the mount read-only to see if the volume has been setup +correctly to avoid the possibility of causing damage to the data on the ntfs +volume. + + +The Software RAID / MD driver +----------------------------- + +An alternative to using the Device-Mapper driver is to use the kernel's +Software RAID / MD driver. For which you need to set up your /etc/raidtab +appropriately (see man 5 raidtab). + +Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level +0, have been tested and work fine (though see section "Limitations when using +the MD driver with NTFS volumes" especially if you want to use linear raid). +Even though untested, there is no reason why mirrors, i.e. raid level 1, and +stripes with parity, i.e. raid level 5, should not work, too. + +You have to use the "persistent-superblock 0" option for each raid-disk in the +NTFS volume/stripe you are configuring in /etc/raidtab as the persistent +superblock used by the MD driver would damage the NTFS volume. + +Windows by default uses a stripe chunk size of 64k, so you probably want the +"chunk-size 64k" option for each raid-disk, too. + +For example, if you have a stripe set consisting of two partitions /dev/hda5 +and /dev/hdb1 your /etc/raidtab would look like this: + +raiddev /dev/md0 + raid-level 0 + nr-raid-disks 2 + nr-spare-disks 0 + persistent-superblock 0 + chunk-size 64k + device /dev/hda5 + raid-disk 0 + device /dev/hdb1 + raid-disk 1 + +For linear raid, just change the raid-level above to "raid-level linear", for +mirrors, change it to "raid-level 1", and for stripe sets with parity, change +it to "raid-level 5". + +Note for stripe sets with parity you will also need to tell the MD driver +which parity algorithm to use by specifying the option "parity-algorithm +which", where you need to replace "which" with the name of the algorithm to +use (see man 5 raidtab for available algorithms) and you will have to try the +different available algorithms until you find one that works. Make sure you +are working read-only when playing with this as you may damage your data +otherwise. If you find which algorithm works please let us know (email the +linux-ntfs developers list linux-ntfs-dev@lists.sourceforge.net or drop in on +IRC in channel #ntfs on the irc.freenode.net network) so we can update this +documentation. + +Once the raidtab is setup, run for example raid0run -a to start all devices or +raid0run /dev/md0 to start a particular md device, in this case /dev/md0. + +Then just use the mount command as usual to mount the ntfs volume using for +example: mount -t ntfs -o ro /dev/md0 /mnt/myntfsvolume + +It is advisable to do the mount read-only to see if the md volume has been +setup correctly to avoid the possibility of causing damage to the data on the +ntfs volume. + + +Limitations when using the Software RAID / MD driver +----------------------------------------------------- + +Using the md driver will not work properly if any of your NTFS partitions have +an odd number of sectors. This is especially important for linear raid as all +data after the first partition with an odd number of sectors will be offset by +one or more sectors so if you mount such a partition with write support you +will cause massive damage to the data on the volume which will only become +apparent when you try to use the volume again under Windows. + +So when using linear raid, make sure that all your partitions have an even +number of sectors BEFORE attempting to use it. You have been warned! + +Even better is to simply use the Device-Mapper for linear raid and then you do +not have this problem with odd numbers of sectors. + + +ChangeLog +========= + +Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. + +2.1.29: + - Fix a deadlock when mounting read-write. +2.1.28: + - Fix a deadlock. +2.1.27: + - Implement page migration support so the kernel can move memory used + by NTFS files and directories around for management purposes. + - Add support for writing to sparse files created with Windows XP SP2. + - Many minor improvements and bug fixes. +2.1.26: + - Implement support for sector sizes above 512 bytes (up to the maximum + supported by NTFS which is 4096 bytes). + - Enhance support for NTFS volumes which were supported by Windows but + not by Linux due to invalid attribute list attribute flags. + - A few minor updates and bug fixes. +2.1.25: + - Write support is now extended with write(2) being able to both + overwrite existing file data and to extend files. Also, if a write + to a sparse region occurs, write(2) will fill in the hole. Note, + mmap(2) based writes still do not support writing into holes or + writing beyond the initialized size. + - Write support has a new feature and that is that truncate(2) and + open(2) with O_TRUNC are now implemented thus files can be both made + smaller and larger. + - Note: Both write(2) and truncate(2)/open(2) with O_TRUNC still have + limitations in that they + - only provide limited support for highly fragmented files. + - only work on regular, i.e. uncompressed and unencrypted files. + - never create sparse files although this will change once directory + operations are implemented. + - Lots of bug fixes and enhancements across the board. +2.1.24: + - Support journals ($LogFile) which have been modified by chkdsk. This + means users can boot into Windows after we marked the volume dirty. + The Windows boot will run chkdsk and then reboot. The user can then + immediately boot into Linux rather than having to do a full Windows + boot first before rebooting into Linux and we will recognize such a + journal and empty it as it is clean by definition. + - Support journals ($LogFile) with only one restart page as well as + journals with two different restart pages. We sanity check both and + either use the only sane one or the more recent one of the two in the + case that both are valid. + - Lots of bug fixes and enhancements across the board. +2.1.23: + - Stamp the user space journal, aka transaction log, aka $UsnJrnl, if + it is present and active thus telling Windows and applications using + the transaction log that changes can have happened on the volume + which are not recorded in $UsnJrnl. + - Detect the case when Windows has been hibernated (suspended to disk) + and if this is the case do not allow (re)mounting read-write to + prevent data corruption when you boot back into the suspended + Windows session. + - Implement extension of resident files using the normal file write + code paths, i.e. most very small files can be extended to be a little + bit bigger but not by much. + - Add new mount option "disable_sparse". (See list of mount options + above for details.) + - Improve handling of ntfs volumes with errors and strange boot sectors + in particular. + - Fix various bugs including a nasty deadlock that appeared in recent + kernels (around 2.6.11-2.6.12 timeframe). +2.1.22: + - Improve handling of ntfs volumes with errors. + - Fix various bugs and race conditions. +2.1.21: + - Fix several race conditions and various other bugs. + - Many internal cleanups, code reorganization, optimizations, and mft + and index record writing code rewritten to fit in with the changes. + - Update Documentation/filesystems/ntfs.txt with instructions on how to + use the Device-Mapper driver with NTFS ftdisk/LDM raid. +2.1.20: + - Fix two stupid bugs introduced in 2.1.18 release. +2.1.19: + - Minor bugfix in handling of the default upcase table. + - Many internal cleanups and improvements. Many thanks to Linus + Torvalds and Al Viro for the help and advice with the sparse + annotations and cleanups. +2.1.18: + - Fix scheduling latencies at mount time. (Ingo Molnar) + - Fix endianness bug in a little traversed portion of the attribute + lookup code. +2.1.17: + - Fix bugs in mount time error code paths. +2.1.16: + - Implement access time updates (including mtime and ctime). + - Implement fsync(2), fdatasync(2), and msync(2) system calls. + - Enable the readv(2) and writev(2) system calls. + - Enable access via the asynchronous io (aio) API by adding support for + the aio_read(3) and aio_write(3) functions. +2.1.15: + - Invalidate quotas when (re)mounting read-write. + NOTE: This now only leave user space journalling on the side. (See + note for version 2.1.13, below.) +2.1.14: + - Fix an NFSd caused deadlock reported by several users. +2.1.13: + - Implement writing of inodes (access time updates are not implemented + yet so mounting with -o noatime,nodiratime is enforced). + - Enable writing out of resident files so you can now overwrite any + uncompressed, unencrypted, nonsparse file as long as you do not + change the file size. + - Add housekeeping of ntfs system files so that ntfsfix no longer needs + to be run after writing to an NTFS volume. + NOTE: This still leaves quota tracking and user space journalling on + the side but they should not cause data corruption. In the worst + case the charged quotas will be out of date ($Quota) and some + userspace applications might get confused due to the out of date + userspace journal ($UsnJrnl). +2.1.12: + - Fix the second fix to the decompression engine from the 2.1.9 release + and some further internals cleanups. +2.1.11: + - Driver internal cleanups. +2.1.10: + - Force read-only (re)mounting of volumes with unsupported volume + flags and various cleanups. +2.1.9: + - Fix two bugs in handling of corner cases in the decompression engine. +2.1.8: + - Read the $MFT mirror and compare it to the $MFT and if the two do not + match, force a read-only mount and do not allow read-write remounts. + - Read and parse the $LogFile journal and if it indicates that the + volume was not shutdown cleanly, force a read-only mount and do not + allow read-write remounts. If the $LogFile indicates a clean + shutdown and a read-write (re)mount is requested, empty $LogFile to + ensure that Windows cannot cause data corruption by replaying a stale + journal after Linux has written to the volume. + - Improve time handling so that the NTFS time is fully preserved when + converted to kernel time and only up to 99 nano-seconds are lost when + kernel time is converted to NTFS time. +2.1.7: + - Enable NFS exporting of mounted NTFS volumes. +2.1.6: + - Fix minor bug in handling of compressed directories that fixes the + erroneous "du" and "stat" output people reported. +2.1.5: + - Minor bug fix in attribute list attribute handling that fixes the + I/O errors on "ls" of certain fragmented files found by at least two + people running Windows XP. +2.1.4: + - Minor update allowing compilation with all gcc versions (well, the + ones the kernel can be compiled with anyway). +2.1.3: + - Major bug fixes for reading files and volumes in corner cases which + were being hit by Windows 2k/XP users. +2.1.2: + - Major bug fixes alleviating the hangs in statfs experienced by some + users. +2.1.1: + - Update handling of compressed files so people no longer get the + frequently reported warning messages about initialized_size != + data_size. +2.1.0: + - Add configuration option for developmental write support. + - Initial implementation of file overwriting. (Writes to resident files + are not written out to disk yet, so avoid writing to files smaller + than about 1kiB.) + - Intercept/abort changes in file size as they are not implemented yet. +2.0.25: + - Minor bugfixes in error code paths and small cleanups. +2.0.24: + - Small internal cleanups. + - Support for sendfile system call. (Christoph Hellwig) +2.0.23: + - Massive internal locking changes to mft record locking. Fixes + various race conditions and deadlocks. + - Fix ntfs over loopback for compressed files by adding an + optimization barrier. (gcc was screwing up otherwise ?) + Thanks go to Christoph Hellwig for pointing these two out: + - Remove now unused function fs/ntfs/malloc.h::vmalloc_nofs(). + - Fix ntfs_free() for ia64 and parisc. +2.0.22: + - Small internal cleanups. +2.0.21: + These only affect 32-bit architectures: + - Check for, and refuse to mount too large volumes (maximum is 2TiB). + - Check for, and refuse to open too large files and directories + (maximum is 16TiB). +2.0.20: + - Support non-resident directory index bitmaps. This means we now cope + with huge directories without problems. + - Fix a page leak that manifested itself in some cases when reading + directory contents. + - Internal cleanups. +2.0.19: + - Fix race condition and improvements in block i/o interface. + - Optimization when reading compressed files. +2.0.18: + - Fix race condition in reading of compressed files. +2.0.17: + - Cleanups and optimizations. +2.0.16: + - Fix stupid bug introduced in 2.0.15 in new attribute inode API. + - Big internal cleanup replacing the mftbmp access hacks by using the + new attribute inode API instead. +2.0.15: + - Bug fix in parsing of remount options. + - Internal changes implementing attribute (fake) inodes allowing all + attribute i/o to go via the page cache and to use all the normal + vfs/mm functionality. +2.0.14: + - Internal changes improving run list merging code and minor locking + change to not rely on BKL in ntfs_statfs(). +2.0.13: + - Internal changes towards using iget5_locked() in preparation for + fake inodes and small cleanups to ntfs_volume structure. +2.0.12: + - Internal cleanups in address space operations made possible by the + changes introduced in the previous release. +2.0.11: + - Internal updates and cleanups introducing the first step towards + fake inode based attribute i/o. +2.0.10: + - Microsoft says that the maximum number of inodes is 2^32 - 1. Update + the driver accordingly to only use 32-bits to store inode numbers on + 32-bit architectures. This improves the speed of the driver a little. +2.0.9: + - Change decompression engine to use a single buffer. This should not + affect performance except perhaps on the most heavy i/o on SMP + systems when accessing multiple compressed files from multiple + devices simultaneously. + - Minor updates and cleanups. +2.0.8: + - Remove now obsolete show_inodes and posix mount option(s). + - Restore show_sys_files mount option. + - Add new mount option case_sensitive, to determine if the driver + treats file names as case sensitive or not. + - Mostly drop support for short file names (for backwards compatibility + we only support accessing files via their short file name if one + exists). + - Fix dcache aliasing issues wrt short/long file names. + - Cleanups and minor fixes. +2.0.7: + - Just cleanups. +2.0.6: + - Major bugfix to make compatible with other kernel changes. This fixes + the hangs/oopses on umount. + - Locking cleanup in directory operations (remove BKL usage). +2.0.5: + - Major buffer overflow bug fix. + - Minor cleanups and updates for kernel 2.5.12. +2.0.4: + - Cleanups and updates for kernel 2.5.11. +2.0.3: + - Small bug fixes, cleanups, and performance improvements. +2.0.2: + - Use default fmask of 0177 so that files are no executable by default. + If you want owner executable files, just use fmask=0077. + - Update for kernel 2.5.9 but preserve backwards compatibility with + kernel 2.5.7. + - Minor bug fixes, cleanups, and updates. +2.0.1: + - Minor updates, primarily set the executable bit by default on files + so they can be executed. +2.0.0: + - Started ChangeLog. + diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt new file mode 100644 index 0000000..67310fb --- /dev/null +++ b/Documentation/filesystems/ocfs2.txt @@ -0,0 +1,81 @@ +OCFS2 filesystem +================== +OCFS2 is a general purpose extent based shared disk cluster file +system with many similarities to ext3. It supports 64 bit inode +numbers, and has automatically extending metadata groups which may +also make it attractive for non-clustered use. + +You'll want to install the ocfs2-tools package in order to at least +get "mount.ocfs2" and "ocfs2_hb_ctl". + +Project web page: http://oss.oracle.com/projects/ocfs2 +Tools web page: http://oss.oracle.com/projects/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +CREDITS: +Lots of code taken from ext3 and other projects. + +Authors in alphabetical order: +Joel Becker <joel.becker@oracle.com> +Zach Brown <zach.brown@oracle.com> +Mark Fasheh <mark.fasheh@oracle.com> +Kurt Hackel <kurt.hackel@oracle.com> +Sunil Mushran <sunil.mushran@oracle.com> +Manish Singh <manish.singh@oracle.com> + +Caveats +======= +Features which OCFS2 does not support yet: + - quotas + - Directory change notification (F_NOTIFY) + - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) + - POSIX ACLs + +Mount options +============= + +OCFS2 supports the following mount options: +(*) == default + +barrier=1 This enables/disables barriers. barrier=0 disables it, + barrier=1 enables it. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +intr (*) Allow signals to interrupt cluster operations. +nointr Do not allow signals to interrupt cluster + operations. +atime_quantum=60(*) OCFS2 will not update atime unless this number + of seconds has passed since the last update. + Set to zero to always update atime. +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. +preferred_slot=0(*) During mount, try to use this filesystem slot first. If + it is in use by another node, the first empty one found + will be chosen. Invalid values will be ignored. +commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. +localalloc=8(*) Allows custom localalloc size in MB. If the value is too + large, the fs will silently revert it to the default. + Localalloc is not enabled for local mounts. +localflocks This disables cluster aware flock. +inode64 Indicates that Ocfs2 is allowed to create inodes at + any location in the filesystem, including those which + will result in inode numbers occupying more than 32 + bits of significance. +user_xattr (*) Enables Extended User Attributes. +nouser_xattr Disables Extended User Attributes. diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt new file mode 100644 index 0000000..1d0d41f --- /dev/null +++ b/Documentation/filesystems/omfs.txt @@ -0,0 +1,106 @@ +Optimized MPEG Filesystem (OMFS) + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. + +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + uid=n - make all files owned by specified user + gid=n - make all files owned by specified group + umask=xxx - set permission umask to xxx + fmask=xxx - set umask to xxx for files + dmask=xxx - set umask to xxx for directories + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. + +Sysblock header information: + +struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; +}; + +Files and directories are both represented by omfs_inode: + +struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ +}; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START: + +struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ +}; + +struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ +}; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. + +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting new file mode 100644 index 0000000..92b888d --- /dev/null +++ b/Documentation/filesystems/porting @@ -0,0 +1,275 @@ +Changes since 2.5.0: + +--- +[recommended] + +New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), + sb_set_blocksize() and sb_min_blocksize(). + +Use them. + +(sb_find_get_block() replaces 2.4's get_hash_table()) + +--- +[recommended] + +New methods: ->alloc_inode() and ->destroy_inode(). + +Remove inode->u.foo_inode_i +Declare + struct foo_inode_info { + /* fs-private stuff */ + struct inode vfs_inode; + }; + static inline struct foo_inode_info *FOO_I(struct inode *inode) + { + return list_entry(inode, struct foo_inode_info, vfs_inode); + } + +Use FOO_I(inode) instead of &inode->u.foo_inode_i; + +Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate +foo_inode_info and return the address of ->vfs_inode, the latter should free +FOO_I(inode) (see in-tree filesystems for examples). + +Make them ->alloc_inode and ->destroy_inode in your super_operations. + +Keep in mind that now you need explicit initialization of private data +typically between calling iget_locked() and unlocking the inode. + +At some point that will become mandatory. + +--- +[mandatory] + +Change of file_system_type method (->read_super to ->get_sb) + +->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV. + +Turn your foo_read_super() into a function that would return 0 in case of +success and negative number in case of error (-EINVAL unless you have more +informative error value to report). Call it foo_fill_super(). Now declare + +int foo_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data, struct vfsmount *mnt) +{ + return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, + mnt); +} + +(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of +filesystem). + +Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as +foo_get_sb. + +--- +[mandatory] + +Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. +Most likely there is no need to change anything, but if you relied on +global exclusion between renames for some internal purpose - you need to +change your internal locking. Otherwise exclusion warranties remain the +same (i.e. parents and victim are locked, etc.). + +--- +[informational] + +Now we have the exclusion between ->lookup() and directory removal (by +->rmdir() and ->rename()). If you used to need that exclusion and do +it by internal locking (most of filesystems couldn't care less) - you +can relax your locking. + +--- +[mandatory] + +->lookup(), ->truncate(), ->create(), ->unlink(), ->mknod(), ->mkdir(), +->rmdir(), ->link(), ->lseek(), ->symlink(), ->rename() +and ->readdir() are called without BKL now. Grab it on entry, drop upon return +- that will guarantee the same locking you used to have. If your method or its +parts do not need BKL - better yet, now you can shift lock_kernel() and +unlock_kernel() so that they would protect exactly what needs to be +protected. + +--- +[mandatory] + +BKL is also moved from around sb operations. ->write_super() Is now called +without BKL held. BKL should have been shifted into individual fs sb_op +functions. If you don't need it, remove it. + +--- +[informational] + +check for ->link() target not being a directory is done by callers. Feel +free to drop it... + +--- +[informational] + +->link() callers hold ->i_mutex on the object we are linking to. Some of your +problems might be over... + +--- +[mandatory] + +new file_system_type method - kill_sb(superblock). If you are converting +an existing filesystem, set it according to ->fs_flags: + FS_REQUIRES_DEV - kill_block_super + FS_LITTER - kill_litter_super + neither - kill_anon_super +FS_LITTER is gone - just remove it from fs_flags. + +--- +[mandatory] + + FS_SINGLE is gone (actually, that had happened back when ->get_sb() +went in - and hadn't been documented ;-/). Just remove it from fs_flags +(and see ->get_sb() entry for other actions). + +--- +[mandatory] + +->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so +watch for ->i_mutex-grabbing code that might be used by your ->setattr(). +Callers of notify_change() need ->i_mutex now. + +--- +[recommended] + +New super_block field "struct export_operations *s_export_op" for +explicit support for exporting, e.g. via NFS. The structure is fully +documented at its declaration in include/linux/fs.h, and in +Documentation/filesystems/Exporting. + +Briefly it allows for the definition of decode_fh and encode_fh operations +to encode and decode filehandles, and allows the filesystem to use +a standard helper function for decode_fh, and provide file-system specific +support for this helper, particularly get_parent. + +It is planned that this will be required for exporting once the code +settles down a bit. + +[mandatory] + +s_export_op is now required for exporting a filesystem. +isofs, ext2, ext3, resierfs, fat +can be used as examples of very different filesystems. + +--- +[mandatory] + +iget4() and the read_inode2 callback have been superseded by iget5_locked() +which has the following prototype, + + struct inode *iget5_locked(struct super_block *sb, unsigned long ino, + int (*test)(struct inode *, void *), + int (*set)(struct inode *, void *), + void *data); + +'test' is an additional function that can be used when the inode +number is not sufficient to identify the actual file object. 'set' +should be a non-blocking function that initializes those parts of a +newly created inode to allow the test function to succeed. 'data' is +passed as an opaque value to both test and set functions. + +When the inode has been created by iget5_locked(), it will be returned with the +I_NEW flag set and will still be locked. The filesystem then needs to finalize +the initialization. Once the inode is initialized it must be unlocked by +calling unlock_new_inode(). + +The filesystem is responsible for setting (and possibly testing) i_ino +when appropriate. There is also a simpler iget_locked function that +just takes the superblock and inode number as arguments and does the +test and set for you. + +e.g. + inode = iget_locked(sb, ino); + if (inode->i_state & I_NEW) { + err = read_inode_from_disk(inode); + if (err < 0) { + iget_failed(inode); + return err; + } + unlock_new_inode(inode); + } + +Note that if the process of setting up a new inode fails, then iget_failed() +should be called on the inode to render it dead, and an appropriate error +should be passed back to the caller. + +--- +[recommended] + +->getattr() finally getting used. See instances in nfs, minix, etc. + +--- +[mandatory] + +->revalidate() is gone. If your filesystem had it - provide ->getattr() +and let it call whatever you had as ->revlidate() + (for symlinks that +had ->revalidate()) add calls in ->follow_link()/->readlink(). + +--- +[mandatory] + +->d_parent changes are not protected by BKL anymore. Read access is safe +if at least one of the following is true: + * filesystem has no cross-directory rename() + * dcache_lock is held + * we know that parent had been locked (e.g. we are looking at +->d_parent of ->lookup() argument). + * we are called from ->rename(). + * the child's ->d_lock is held +Audit your code and add locking if needed. Notice that any place that is +not protected by the conditions above is risky even in the old tree - you +had been relying on BKL and that's prone to screwups. Old tree had quite +a few holes of that kind - unprotected access to ->d_parent leading to +anything from oops to silent memory corruption. + +--- +[mandatory] + + FS_NOMOUNT is gone. If you use it - just set MS_NOUSER in flags +(see rootfs for one kind of solution and bdev/socket/pipe for another). + +--- +[recommended] + + Use bdev_read_only(bdev) instead of is_read_only(kdev). The latter +is still alive, but only because of the mess in drivers/s390/block/dasd.c. +As soon as it gets fixed is_read_only() will die. + +--- +[mandatory] + +->permission() is called without BKL now. Grab it on entry, drop upon +return - that will guarantee the same locking you used to have. If +your method or its parts do not need BKL - better yet, now you can +shift lock_kernel() and unlock_kernel() so that they would protect +exactly what needs to be protected. + +--- +[mandatory] + +->statfs() is now called without BKL held. BKL should have been +shifted into individual fs sb_op functions where it's not clear that +it's safe to remove it. If you don't need it, remove it. + +--- +[mandatory] + + is_read_only() is gone; use bdev_read_only() instead. + +--- +[mandatory] + + destroy_buffers() is gone; use invalidate_bdev(). + +--- +[mandatory] + + fsync_dev() is gone; use fsync_bdev(). NOTE: lvm breakage is +deliberate; as soon as struct block_device * is propagated in a reasonable +way by that code fixing will become trivial; until then nothing can be +done. diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt new file mode 100644 index 0000000..bb1b0dd --- /dev/null +++ b/Documentation/filesystems/proc.txt @@ -0,0 +1,2513 @@ +------------------------------------------------------------------------------ + T H E /proc F I L E S Y S T E M +------------------------------------------------------------------------------ +/proc/sys Terrehon Bowden <terrehon@pacbell.net> October 7 1999 + Bodo Bauer <bb@ricochet.net> + +2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 +------------------------------------------------------------------------------ +Version 1.3 Kernel version 2.2.12 + Kernel version 2.4.0-test11-pre4 +------------------------------------------------------------------------------ + +Table of Contents +----------------- + + 0 Preface + 0.1 Introduction/Credits + 0.2 Legal Stuff + + 1 Collecting System Information + 1.1 Process-Specific Subdirectories + 1.2 Kernel data + 1.3 IDE devices in /proc/ide + 1.4 Networking info in /proc/net + 1.5 SCSI info + 1.6 Parallel port info in /proc/parport + 1.7 TTY info in /proc/tty + 1.8 Miscellaneous kernel statistics in /proc/stat + + 2 Modifying System Parameters + 2.1 /proc/sys/fs - File system data + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats + 2.3 /proc/sys/kernel - general kernel parameters + 2.4 /proc/sys/vm - The virtual memory subsystem + 2.5 /proc/sys/dev - Device specific parameters + 2.6 /proc/sys/sunrpc - Remote procedure calls + 2.7 /proc/sys/net - Networking stuff + 2.8 /proc/sys/net/ipv4 - IPV4 settings + 2.9 Appletalk + 2.10 IPX + 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem + 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score + 2.13 /proc/<pid>/oom_score - Display current oom-killer score + 2.14 /proc/<pid>/io - Display the IO accounting fields + 2.15 /proc/<pid>/coredump_filter - Core dump filtering settings + 2.16 /proc/<pid>/mountinfo - Information about mounts + 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface + +------------------------------------------------------------------------------ +Preface +------------------------------------------------------------------------------ + +0.1 Introduction/Credits +------------------------ + +This documentation is part of a soon (or so we hope) to be released book on +the SuSE Linux distribution. As there is no complete documentation for the +/proc file system and we've used many freely available sources to write these +chapters, it seems only fair to give the work back to the Linux community. +This work is based on the 2.2.* kernel version and the upcoming 2.4.*. I'm +afraid it's still far from complete, but we hope it will be useful. As far as +we know, it is the first 'all-in-one' document about the /proc file system. It +is focused on the Intel x86 hardware, so if you are looking for PPC, ARM, +SPARC, AXP, etc., features, you probably won't find what you are looking for. +It also only covers IPv4 networking, not IPv6 nor other protocols - sorry. But +additions and patches are welcome and will be added to this document if you +mail them to Bodo. + +We'd like to thank Alan Cox, Rik van Riel, and Alexey Kuznetsov and a lot of +other people for help compiling this documentation. We'd also like to extend a +special thank you to Andi Kleen for documentation, which we relied on heavily +to create this document, as well as the additional information he provided. +Thanks to everybody else who contributed source or docs to the Linux kernel +and helped create a great piece of software... :) + +If you have any comments, corrections or additions, please don't hesitate to +contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this +document. + +The latest version of this document is available online at +http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version. + +If the above direction does not works for you, ypu could try the kernel +mailing list at linux-kernel@vger.kernel.org and/or try to reach me at +comandante@zaralinux.com. + +0.2 Legal Stuff +--------------- + +We don't guarantee the correctness of this document, and if you come to us +complaining about how you screwed up your system because of incorrect +documentation, we won't feel responsible... + +------------------------------------------------------------------------------ +CHAPTER 1: COLLECTING SYSTEM INFORMATION +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +In This Chapter +------------------------------------------------------------------------------ +* Investigating the properties of the pseudo file system /proc and its + ability to provide information on the running Linux system +* Examining /proc's structure +* Uncovering various information about the kernel and the processes running + on the system +------------------------------------------------------------------------------ + + +The proc file system acts as an interface to internal data structures in the +kernel. It can be used to obtain information about the system and to change +certain kernel parameters at runtime (sysctl). + +First, we'll take a look at the read-only parts of /proc. In Chapter 2, we +show you how you can use /proc/sys to change settings. + +1.1 Process-Specific Subdirectories +----------------------------------- + +The directory /proc contains (among other things) one subdirectory for each +process running on the system, which is named after the process ID (PID). + +The link self points to the process reading the file system. Each process +subdirectory has the entries listed in Table 1-1. + + +Table 1-1: Process specific entries in /proc +.............................................................................. + File Content + clear_refs Clears page referenced bits shown in smaps output + cmdline Command line arguments + cpu Current and last cpu in which it was executed (2.4)(smp) + cwd Link to the current working directory + environ Values of environment variables + exe Link to the executable of this process + fd Directory, which contains all file descriptors + maps Memory maps to executables and library files (2.4) + mem Memory held by this process + root Link to the root directory of this process + stat Process status + statm Process memory status information + status Process status in human readable form + wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan + smaps Extension based on maps, the rss size for each mapped file +.............................................................................. + +For example, to get the status information of a process, all you have to do is +read the file /proc/PID/status: + + >cat /proc/self/status + Name: cat + State: R (running) + Pid: 5452 + PPid: 743 + TracerPid: 0 (2.4) + Uid: 501 501 501 501 + Gid: 100 100 100 100 + Groups: 100 14 16 + VmSize: 1112 kB + VmLck: 0 kB + VmRSS: 348 kB + VmData: 24 kB + VmStk: 12 kB + VmExe: 8 kB + VmLib: 1044 kB + SigPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + + +This shows you nearly the same information you would get if you viewed it with +the ps command. In fact, ps uses the proc file system to obtain its +information. The statm file contains more detailed information about the +process memory usage. Its seven fields are explained in Table 1-2. The stat +file contains details information about the process itself. Its fields are +explained in Table 1-3. + + +Table 1-2: Contents of the statm files (as of 2.6.8-rc3) +.............................................................................. + Field Content + size total program size (pages) (same as VmSize in status) + resident size of memory portions (pages) (same as VmRSS in status) + shared number of pages that are shared (i.e. backed by a file) + trs number of pages that are 'code' (not including libs; broken, + includes data segment) + lrs number of pages of library (always 0 on 2.6) + drs number of pages of data/stack (including libs; broken, + includes library text) + dt number of dirty pages (always 0 on 2.6) +.............................................................................. + + +Table 1-3: Contents of the stat files (as of 2.6.22-rc3) +.............................................................................. + Field Content + pid process id + tcomm filename of the executable + state state (R is running, S is sleeping, D is sleeping in an + uninterruptible wait, Z is zombie, T is traced or stopped) + ppid process id of the parent process + pgrp pgrp of the process + sid session id + tty_nr tty the process uses + tty_pgrp pgrp of the tty + flags task flags + min_flt number of minor faults + cmin_flt number of minor faults with child's + maj_flt number of major faults + cmaj_flt number of major faults with child's + utime user mode jiffies + stime kernel mode jiffies + cutime user mode jiffies with child's + cstime kernel mode jiffies with child's + priority priority level + nice nice level + num_threads number of threads + it_real_value (obsolete, always 0) + start_time time the process started after system boot + vsize virtual memory size + rss resident set memory size + rsslim current limit in bytes on the rss + start_code address above which program text can run + end_code address below which program text can run + start_stack address of the start of the stack + esp current value of ESP + eip current value of EIP + pending bitmap of pending signals (obsolete) + blocked bitmap of blocked signals (obsolete) + sigign bitmap of ignored signals (obsolete) + sigcatch bitmap of catched signals (obsolete) + wchan address where process went to sleep + 0 (place holder) + 0 (place holder) + exit_signal signal to send to parent thread on exit + task_cpu which CPU the task is scheduled on + rt_priority realtime priority + policy scheduling policy (man sched_setscheduler) + blkio_ticks time spent waiting for block IO +.............................................................................. + + +1.2 Kernel data +--------------- + +Similar to the process entries, the kernel data files give information about +the running kernel. The files used to obtain this information are contained in +/proc and are listed in Table 1-4. Not all of these will be present in your +system. It depends on the kernel configuration and the loaded modules, which +files are there, and which are missing. + +Table 1-4: Kernel info in /proc +.............................................................................. + File Content + apm Advanced power management info + buddyinfo Kernel memory allocator information (see text) (2.5) + bus Directory containing bus specific information + cmdline Kernel command line + cpuinfo Info about the CPU + devices Available devices (block and character) + dma Used DMS channels + filesystems Supported filesystems + driver Various drivers grouped here, currently rtc (2.4) + execdomains Execdomains, related to security (2.4) + fb Frame Buffer devices (2.4) + fs File system parameters, currently nfs/exports (2.4) + ide Directory containing info about the IDE subsystem + interrupts Interrupt usage + iomem Memory map (2.4) + ioports I/O port usage + irq Masks for irq to cpu affinity (2.4)(smp?) + isapnp ISA PnP (Plug&Play) Info (2.4) + kcore Kernel core image (can be ELF or A.OUT(deprecated in 2.4)) + kmsg Kernel messages + ksyms Kernel symbol table + loadavg Load average of last 1, 5 & 15 minutes + locks Kernel locks + meminfo Memory info + misc Miscellaneous + modules List of loaded modules + mounts Mounted filesystems + net Networking info (see text) + partitions Table of partitions known to the system + pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, + decoupled by lspci (2.4) + rtc Real time clock + scsi SCSI info (see text) + slabinfo Slab pool info + stat Overall statistics + swaps Swap space utilization + sys See chapter 2 + sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) + tty Info of tty drivers + uptime System uptime + version Kernel version + video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas +.............................................................................. + +You can, for example, check which interrupts are currently in use and what +they are used for by looking in the file /proc/interrupts: + + > cat /proc/interrupts + CPU0 + 0: 8728810 XT-PIC timer + 1: 895 XT-PIC keyboard + 2: 0 XT-PIC cascade + 3: 531695 XT-PIC aha152x + 4: 2014133 XT-PIC serial + 5: 44401 XT-PIC pcnet_cs + 8: 2 XT-PIC rtc + 11: 8 XT-PIC i82365 + 12: 182918 XT-PIC PS/2 Mouse + 13: 1 XT-PIC fpu + 14: 1232265 XT-PIC ide0 + 15: 7 XT-PIC ide1 + NMI: 0 + +In 2.4.* a couple of lines where added to this file LOC & ERR (this time is the +output of a SMP machine): + + > cat /proc/interrupts + + CPU0 CPU1 + 0: 1243498 1214548 IO-APIC-edge timer + 1: 8949 8958 IO-APIC-edge keyboard + 2: 0 0 XT-PIC cascade + 5: 11286 10161 IO-APIC-edge soundblaster + 8: 1 0 IO-APIC-edge rtc + 9: 27422 27407 IO-APIC-edge 3c503 + 12: 113645 113873 IO-APIC-edge PS/2 Mouse + 13: 0 0 XT-PIC fpu + 14: 22491 24012 IO-APIC-edge ide0 + 15: 2183 2415 IO-APIC-edge ide1 + 17: 30564 30414 IO-APIC-level eth0 + 18: 177 164 IO-APIC-level bttv + NMI: 2457961 2457959 + LOC: 2457882 2457881 + ERR: 2155 + +NMI is incremented in this case because every timer interrupt generates a NMI +(Non Maskable Interrupt) which is used by the NMI Watchdog to detect lockups. + +LOC is the local interrupt counter of the internal APIC of every CPU. + +ERR is incremented in the case of errors in the IO-APIC bus (the bus that +connects the CPUs in a SMP system. This means that an error has been detected, +the IO-APIC automatically retry the transmission, so it should not be a big +problem, but you should read the SMP-FAQ. + +In 2.6.2* /proc/interrupts was expanded again. This time the goal was for +/proc/interrupts to display every IRQ vector in use by the system, not +just those considered 'most important'. The new vectors are: + + THR -- interrupt raised when a machine check threshold counter + (typically counting ECC corrected errors of memory or cache) exceeds + a configurable threshold. Only available on some systems. + + TRM -- a thermal event interrupt occurs when a temperature threshold + has been exceeded for the CPU. This interrupt may also be generated + when the temperature drops back to normal. + + SPU -- a spurious interrupt is some interrupt that was raised then lowered + by some IO device before it could be fully processed by the APIC. Hence + the APIC sees the interrupt but does not know what device it came from. + For this case the APIC will generate the interrupt with a IRQ vector + of 0xff. This might also be generated by chipset bugs. + + RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are + sent from one CPU to another per the needs of the OS. Typically, + their statistics are used by kernel developers and interested users to + determine the occurance of interrupt of the given type. + +The above IRQ vectors are displayed only when relevent. For example, +the threshold vector does not exist on x86_64 platforms. Others are +suppressed when the system is a uniprocessor. As of this writing, only +i386 and x86_64 platforms support the new IRQ vector displays. + +Of some interest is the introduction of the /proc/irq directory to 2.4. +It could be used to set IRQ to CPU affinity, this means that you can "hook" an +IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the +irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and +prof_cpu_mask. + +For example + > ls /proc/irq/ + 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask + 1 11 13 15 17 19 3 5 7 9 default_smp_affinity + > ls /proc/irq/0/ + smp_affinity + +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ, you can set it by doing: + + > echo 1 > /proc/irq/10/smp_affinity + +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and fourth CPU can handle the IRQ. + +The contents of each smp_affinity file is the same by default: + + > cat /proc/irq/0/smp_affinity + ffffffff + +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. + +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus). + +The way IRQs are routed is handled by the IO-APIC, and it's Round Robin +between all the CPUs which are allowed to handle it. As usual the kernel has +more info than you and does a better job than you, so the defaults are the +best choice for almost everyone. + +There are three more important subdirectories in /proc: net, scsi, and sys. +The general rule is that the contents, or even the existence of these +directories, depend on your kernel configuration. If SCSI is not enabled, the +directory scsi may not exist. The same is true with the net, which is there +only when networking support is present in the running kernel. + +The slabinfo file gives information about memory usage at the slab level. +Linux uses slab pools for memory management above page level in version 2.2. +Commonly used objects have their own slab pool (such as network buffers, +directory cache, and so on). + +.............................................................................. + +> cat /proc/buddyinfo + +Node 0, zone DMA 0 4 5 4 4 3 ... +Node 0, zone Normal 1 0 0 1 101 8 ... +Node 0, zone HighMem 2 0 0 1 1 0 ... + +Memory fragmentation is a problem under some workloads, and buddyinfo is a +useful tool for helping diagnose these problems. Buddyinfo will give you a +clue as to how big an area you can safely allocate, or why a previous +allocation failed. + +Each column represents the number of pages of a certain order which are +available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in +ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE +available in ZONE_NORMAL, etc... + +.............................................................................. + +meminfo: + +Provides information about distribution and utilization of memory. This +varies by architecture and compile options. The following is from a +16GB PIII, which has highmem enabled. You may not have all of these fields. + +> cat /proc/meminfo + + +MemTotal: 16344972 kB +MemFree: 13634064 kB +Buffers: 3656 kB +Cached: 1195708 kB +SwapCached: 0 kB +Active: 891636 kB +Inactive: 1077224 kB +HighTotal: 15597528 kB +HighFree: 13629632 kB +LowTotal: 747444 kB +LowFree: 4432 kB +SwapTotal: 0 kB +SwapFree: 0 kB +Dirty: 968 kB +Writeback: 0 kB +AnonPages: 861800 kB +Mapped: 280372 kB +Slab: 284364 kB +SReclaimable: 159856 kB +SUnreclaim: 124508 kB +PageTables: 24448 kB +NFS_Unstable: 0 kB +Bounce: 0 kB +WritebackTmp: 0 kB +CommitLimit: 7669796 kB +Committed_AS: 100056 kB +VmallocTotal: 112216 kB +VmallocUsed: 428 kB +VmallocChunk: 111088 kB + + MemTotal: Total usable ram (i.e. physical ram minus a few reserved + bits and the kernel binary code) + MemFree: The sum of LowFree+HighFree + Buffers: Relatively temporary storage for raw disk blocks + shouldn't get tremendously large (20MB or so) + Cached: in-memory cache for files read from the disk (the + pagecache). Doesn't include SwapCached + SwapCached: Memory that once was swapped out, is swapped back in but + still also is in the swapfile (if memory is needed it + doesn't need to be swapped out AGAIN because it is already + in the swapfile. This saves I/O) + Active: Memory that has been used more recently and usually not + reclaimed unless absolutely necessary. + Inactive: Memory which has been less recently used. It is more + eligible to be reclaimed for other purposes + HighTotal: + HighFree: Highmem is all memory above ~860MB of physical memory + Highmem areas are for use by userspace programs, or + for the pagecache. The kernel must use tricks to access + this memory, making it slower to access than lowmem. + LowTotal: + LowFree: Lowmem is memory which can be used for everything that + highmem can be used for, but it is also available for the + kernel's use for its own data structures. Among many + other things, it is where everything from the Slab is + allocated. Bad things happen when you're out of lowmem. + SwapTotal: total amount of swap space available + SwapFree: Memory which has been evicted from RAM, and is temporarily + on the disk + Dirty: Memory which is waiting to get written back to the disk + Writeback: Memory which is actively being written back to the disk + AnonPages: Non-file backed pages mapped into userspace page tables + Mapped: files which have been mmaped, such as libraries + Slab: in-kernel data structures cache +SReclaimable: Part of Slab, that might be reclaimed, such as caches + SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure + PageTables: amount of memory dedicated to the lowest level of page + tables. +NFS_Unstable: NFS pages sent to the server, but not yet committed to stable + storage + Bounce: Memory used for block device "bounce buffers" +WritebackTmp: Memory used by FUSE for temporary writeback buffers + CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), + this is the total amount of memory currently available to + be allocated on the system. This limit is only adhered to + if strict overcommit accounting is enabled (mode 2 in + 'vm.overcommit_memory'). + The CommitLimit is calculated with the following formula: + CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap + For example, on a system with 1G of physical RAM and 7G + of swap with a `vm.overcommit_ratio` of 30 it would + yield a CommitLimit of 7.3G. + For more details, see the memory overcommit documentation + in vm/overcommit-accounting. +Committed_AS: The amount of memory presently allocated on the system. + The committed memory is a sum of all of the memory which + has been allocated by processes, even if it has not been + "used" by them as of yet. A process which malloc()'s 1G + of memory, but only touches 300M of it will only show up + as using 300M of memory even if it has the address space + allocated for the entire 1G. This 1G is memory which has + been "committed" to by the VM and can be used at any time + by the allocating application. With strict overcommit + enabled on the system (mode 2 in 'vm.overcommit_memory'), + allocations which would exceed the CommitLimit (detailed + above) will not be permitted. This is useful if one needs + to guarantee that processes will not fail due to lack of + memory once that memory has been successfully allocated. +VmallocTotal: total size of vmalloc memory area + VmallocUsed: amount of vmalloc area which is used +VmallocChunk: largest contigious block of vmalloc area which is free + +.............................................................................. + +vmallocinfo: + +Provides information about vmalloced/vmaped areas. One line per area, +containing the virtual address range of the area, size in bytes, +caller information of the creator, and optional information depending +on the kind of area : + + pages=nr number of pages + phys=addr if a physical address was specified + ioremap I/O mapping (ioremap() and friends) + vmalloc vmalloc() area + vmap vmap()ed pages + user VM_USERMAP area + vpages buffer for pages pointers was vmalloced (huge area) + N<node>=nr (Only on NUMA kernels) + Number of pages allocated on memory node <node> + +> cat /proc/vmallocinfo +0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 +0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 +0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... + phys=7fee8000 ioremap +0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap +0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 +0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... + /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 +0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 +0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 +0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 +0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 +0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 +0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + +1.3 IDE devices in /proc/ide +---------------------------- + +The subdirectory /proc/ide contains information about all IDE devices of which +the kernel is aware. There is one subdirectory for each IDE controller, the +file drivers and a link for each IDE device, pointing to the device directory +in the controller specific subtree. + +The file drivers contains general information about the drivers used for the +IDE devices: + + > cat /proc/ide/drivers + ide-cdrom version 4.53 + ide-disk version 1.08 + +More detailed information can be found in the controller specific +subdirectories. These are named ide0, ide1 and so on. Each of these +directories contains the files shown in table 1-5. + + +Table 1-5: IDE controller info in /proc/ide/ide? +.............................................................................. + File Content + channel IDE channel (0 or 1) + config Configuration (only for PCI/IDE bridge) + mate Mate name + model Type/Chipset of IDE controller +.............................................................................. + +Each device connected to a controller has a separate subdirectory in the +controllers directory. The files listed in table 1-6 are contained in these +directories. + + +Table 1-6: IDE device information +.............................................................................. + File Content + cache The cache + capacity Capacity of the medium (in 512Byte blocks) + driver driver and version + geometry physical and logical geometry + identify device identify block + media media type + model device identifier + settings device setup + smart_thresholds IDE disk management thresholds + smart_values IDE disk management values +.............................................................................. + +The most interesting file is settings. This file contains a nice overview of +the drive parameters: + + # cat /proc/ide/ide0/hda/settings + name value min max mode + ---- ----- --- --- ---- + bios_cyl 526 0 65535 rw + bios_head 255 0 255 rw + bios_sect 63 0 63 rw + breada_readahead 4 0 127 rw + bswap 0 0 1 r + file_readahead 72 0 2097151 rw + io_32bit 0 0 3 rw + keepsettings 0 0 1 rw + max_kb_per_request 122 1 127 rw + multcount 0 0 8 rw + nice1 1 0 1 rw + nowerr 0 0 1 rw + pio_mode write-only 0 255 w + slow 0 0 1 rw + unmaskirq 0 0 1 rw + using_dma 0 0 1 rw + + +1.4 Networking info in /proc/net +-------------------------------- + +The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the +additional values you get for IP version 6 if you configure the kernel to +support this. Table 1-7 lists the files and their meaning. + + +Table 1-6: IPv6 info in /proc/net +.............................................................................. + File Content + udp6 UDP sockets (IPv6) + tcp6 TCP sockets (IPv6) + raw6 Raw device statistics (IPv6) + igmp6 IP multicast addresses, which this host joined (IPv6) + if_inet6 List of IPv6 interface addresses + ipv6_route Kernel routing table for IPv6 + rt6_stats Global IPv6 routing tables statistics + sockstat6 Socket statistics (IPv6) + snmp6 Snmp data (IPv6) +.............................................................................. + + +Table 1-7: Network info in /proc/net +.............................................................................. + File Content + arp Kernel ARP table + dev network devices with statistics + dev_mcast the Layer2 multicast groups a device is listening too + (interface index, label, number of references, number of bound + addresses). + dev_stat network device status + ip_fwchains Firewall chain linkage + ip_fwnames Firewall chain names + ip_masq Directory containing the masquerading tables + ip_masquerade Major masquerading table + netstat Network statistics + raw raw device statistics + route Kernel routing table + rpc Directory containing rpc info + rt_cache Routing cache + snmp SNMP data + sockstat Socket statistics + tcp TCP sockets + tr_rif Token ring RIF routing table + udp UDP sockets + unix UNIX domain sockets + wireless Wireless interface data (Wavelan etc) + igmp IP multicast addresses, which this host joined + psched Global packet scheduler parameters. + netlink List of PF_NETLINK sockets + ip_mr_vifs List of multicast virtual interfaces + ip_mr_cache List of multicast routing cache +.............................................................................. + +You can use this information to see which network devices are available in +your system and how much traffic was routed over those devices: + + > cat /proc/net/dev + Inter-|Receive |[... + face |bytes packets errs drop fifo frame compressed multicast|[... + lo: 908188 5596 0 0 0 0 0 0 [... + ppp0:15475140 20721 410 0 0 410 0 0 [... + eth0: 614530 7085 0 0 0 0 0 1 [... + + ...] Transmit + ...] bytes packets errs drop fifo colls carrier compressed + ...] 908188 5596 0 0 0 0 0 0 + ...] 1375103 17405 0 0 0 0 0 0 + ...] 1703981 5535 0 0 0 3 0 0 + +In addition, each Channel Bond interface has it's own directory. For +example, the bond0 device will have a directory called /proc/net/bond0/. +It will contain information that is specific to that bond, such as the +current slaves of the bond, the link status of the slaves, and how +many times the slaves link has failed. + +1.5 SCSI info +------------- + +If you have a SCSI host adapter in your system, you'll find a subdirectory +named after the driver for this adapter in /proc/scsi. You'll also see a list +of all recognized SCSI devices in /proc/scsi: + + >cat /proc/scsi/scsi + Attached devices: + Host: scsi0 Channel: 00 Id: 00 Lun: 00 + Vendor: IBM Model: DGHS09U Rev: 03E0 + Type: Direct-Access ANSI SCSI revision: 03 + Host: scsi0 Channel: 00 Id: 06 Lun: 00 + Vendor: PIONEER Model: CD-ROM DR-U06S Rev: 1.04 + Type: CD-ROM ANSI SCSI revision: 02 + + +The directory named after the driver has one file for each adapter found in +the system. These files contain information about the controller, including +the used IRQ and the IO address range. The amount of information shown is +dependent on the adapter you use. The example shows the output for an Adaptec +AHA-2940 SCSI adapter: + + > cat /proc/scsi/aic7xxx/0 + + Adaptec AIC7xxx driver version: 5.1.19/3.2.4 + Compile Options: + TCQ Enabled By Default : Disabled + AIC7XXX_PROC_STATS : Disabled + AIC7XXX_RESET_DELAY : 5 + Adapter Configuration: + SCSI Adapter: Adaptec AHA-294X Ultra SCSI host adapter + Ultra Wide Controller + PCI MMAPed I/O Base: 0xeb001000 + Adapter SEEPROM Config: SEEPROM found and used. + Adaptec SCSI BIOS: Enabled + IRQ: 10 + SCBs: Active 0, Max Active 2, + Allocated 15, HW 16, Page 255 + Interrupts: 160328 + BIOS Control Word: 0x18b6 + Adapter Control Word: 0x005b + Extended Translation: Enabled + Disconnect Enable Flags: 0xffff + Ultra Enable Flags: 0x0001 + Tag Queue Enable Flags: 0x0000 + Ordered Queue Tag Flags: 0x0000 + Default Tag Queue Depth: 8 + Tagged Queue By Device array for aic7xxx host instance 0: + {255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255} + Actual queue depth per device for aic7xxx host instance 0: + {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1} + Statistics: + (scsi0:0:0:0) + Device using Wide/Sync transfers at 40.0 MByte/sec, offset 8 + Transinfo settings: current(12/8/1/0), goal(12/8/1/0), user(12/15/1/0) + Total transfers 160151 (74577 reads and 85574 writes) + (scsi0:0:6:0) + Device using Narrow/Sync transfers at 5.0 MByte/sec, offset 15 + Transinfo settings: current(50/15/0/0), goal(50/15/0/0), user(50/15/0/0) + Total transfers 0 (0 reads and 0 writes) + + +1.6 Parallel port info in /proc/parport +--------------------------------------- + +The directory /proc/parport contains information about the parallel ports of +your system. It has one subdirectory for each port, named after the port +number (0,1,2,...). + +These directories contain the four files shown in Table 1-8. + + +Table 1-8: Files in /proc/parport +.............................................................................. + File Content + autoprobe Any IEEE-1284 device ID information that has been acquired. + devices list of the device drivers using that port. A + will appear by the + name of the device currently using the port (it might not appear + against any). + hardware Parallel port's base address, IRQ line and DMA channel. + irq IRQ that parport is using for that port. This is in a separate + file to allow you to alter it by writing a new value in (IRQ + number or none). +.............................................................................. + +1.7 TTY info in /proc/tty +------------------------- + +Information about the available and actually used tty's can be found in the +directory /proc/tty.You'll find entries for drivers and line disciplines in +this directory, as shown in Table 1-9. + + +Table 1-9: Files in /proc/tty +.............................................................................. + File Content + drivers list of drivers and their usage + ldiscs registered line disciplines + driver/serial usage statistic and status of single tty lines +.............................................................................. + +To see which tty's are currently in use, you can simply look into the file +/proc/tty/drivers: + + > cat /proc/tty/drivers + pty_slave /dev/pts 136 0-255 pty:slave + pty_master /dev/ptm 128 0-255 pty:master + pty_slave /dev/ttyp 3 0-255 pty:slave + pty_master /dev/pty 2 0-255 pty:master + serial /dev/cua 5 64-67 serial:callout + serial /dev/ttyS 4 64-67 serial + /dev/tty0 /dev/tty0 4 0 system:vtmaster + /dev/ptmx /dev/ptmx 5 2 system + /dev/console /dev/console 5 1 system:console + /dev/tty /dev/tty 5 0 system:/dev/tty + unknown /dev/tty 4 1-63 console + + +1.8 Miscellaneous kernel statistics in /proc/stat +------------------------------------------------- + +Various pieces of information about kernel activity are available in the +/proc/stat file. All of the numbers reported in this file are aggregates +since the system first booted. For a quick look, simply cat the file: + + > cat /proc/stat + cpu 2255 34 2290 22625563 6290 127 456 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 + cpu1 1123 0 849 11313845 2614 0 18 0 + intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] + ctxt 1990473 + btime 1062191376 + processes 2915 + procs_running 1 + procs_blocked 0 + +The very first "cpu" line aggregates the numbers in all of the other "cpuN" +lines. These numbers identify the amount of time the CPU has spent performing +different kinds of work. Time units are in USER_HZ (typically hundredths of a +second). The meanings of the columns are as follows, from left to right: + +- user: normal processes executing in user mode +- nice: niced processes executing in user mode +- system: processes executing in kernel mode +- idle: twiddling thumbs +- iowait: waiting for I/O to complete +- irq: servicing interrupts +- softirq: servicing softirqs +- steal: involuntary wait + +The "intr" line gives counts of interrupts serviced since boot time, for each +of the possible system interrupts. The first column is the total of all +interrupts serviced; each subsequent column is the total for that particular +interrupt. + +The "ctxt" line gives the total number of context switches across all CPUs. + +The "btime" line gives the time at which the system booted, in seconds since +the Unix epoch. + +The "processes" line gives the number of processes and threads created, which +includes (but is not limited to) those created by calls to the fork() and +clone() system calls. + +The "procs_running" line gives the number of processes currently running on +CPUs. + +The "procs_blocked" line gives the number of processes currently blocked, +waiting for I/O to complete. + + +1.9 Ext4 file system parameters +------------------------------ + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in Table 1-10, below. + +Table 1-10: Files in /proc/fs/ext4/<devname> +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks + mb_history multiblock allocation history + stats controls whether the multiblock allocator should start + collecting statistics, which are shown during the unmount + group_prealloc the multiblock allocator will round up allocation + requests to a multiple of this tuning parameter if the + stripe size is not set in the ext4 superblock + max_to_scan The maximum number of extents the multiblock allocator + will search to find the best extent + min_to_scan The minimum number of extents the multiblock allocator + will search to find the best extent + order2_req Tuning parameter which controls the minimum size for + requests (as a power of 2) where the buddy cache is + used + stream_req Files which have fewer blocks than this tunable + parameter will have their blocks allocated out of a + block group specific preallocation pool, so that small + files are packed closely together. Each large file + will have its blocks allocated out of its own unique + preallocation pool. +inode_readahead Tuning parameter which controls the maximum number of + inode table blocks that ext4's inode table readahead + algorithm will pre-read into the buffer cache +.............................................................................. + + +------------------------------------------------------------------------------ +Summary +------------------------------------------------------------------------------ +The /proc file system serves information about the running system. It not only +allows access to process data but also allows you to request the kernel status +by reading files in the hierarchy. + +The directory structure of /proc reflects the types of information and makes +it easy, if not obvious, where to look for specific data. +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +CHAPTER 2: MODIFYING SYSTEM PARAMETERS +------------------------------------------------------------------------------ + +------------------------------------------------------------------------------ +In This Chapter +------------------------------------------------------------------------------ +* Modifying kernel parameters by writing into files found in /proc/sys +* Exploring the files which modify certain parameters +* Review of the /proc/sys file tree +------------------------------------------------------------------------------ + + +A very interesting part of /proc is the directory /proc/sys. This is not only +a source of information, it also allows you to change parameters within the +kernel. Be very careful when attempting this. You can optimize your system, +but you can also cause it to crash. Never alter kernel parameters on a +production system. Set up a development machine and test to make sure that +everything works the way you want it to. You may have no alternative but to +reboot the machine once an error has been made. + +To change a value, simply echo the new value into the file. An example is +given below in the section on the file system data. You need to be root to do +this. You can create your own boot script to perform this every time your +system boots. + +The files in /proc/sys can be used to fine tune and monitor miscellaneous and +general things in the operation of the Linux kernel. Since some of the files +can inadvertently disrupt your system, it is advisable to read both +documentation and source before actually making adjustments. In any case, be +very careful when writing to any of these files. The entries in /proc may +change slightly between the 2.1.* and the 2.2 kernel, so if there is any doubt +review the kernel documentation in the directory /usr/src/linux/Documentation. +This chapter is heavily based on the documentation included in the pre 2.2 +kernels, and became part of it in version 2.2.1 of the Linux kernel. + +2.1 /proc/sys/fs - File system data +----------------------------------- + +This subdirectory contains specific file system, file handle, inode, dentry +and quota information. + +Currently, these files are in /proc/sys/fs: + +dentry-state +------------ + +Status of the directory cache. Since directory entries are dynamically +allocated and deallocated, this file indicates the current status. It holds +six values, in which the last two are not used and are always zero. The others +are listed in table 2-1. + + +Table 2-1: Status files of the directory cache +.............................................................................. + File Content + nr_dentry Almost always zero + nr_unused Number of unused cache entries + age_limit + in seconds after the entry may be reclaimed, when memory is short + want_pages internally +.............................................................................. + +dquot-nr and dquot-max +---------------------- + +The file dquot-max shows the maximum number of cached disk quota entries. + +The file dquot-nr shows the number of allocated disk quota entries and the +number of free disk quota entries. + +If the number of available cached disk quotas is very low and you have a large +number of simultaneous system users, you might want to raise the limit. + +file-nr and file-max +-------------------- + +The kernel allocates file handles dynamically, but doesn't free them again at +this time. + +The value in file-max denotes the maximum number of file handles that the +Linux kernel will allocate. When you get a lot of error messages about running +out of file handles, you might want to raise this limit. The default value is +10% of RAM in kilobytes. To change it, just write the new number into the +file: + + # cat /proc/sys/fs/file-max + 4096 + # echo 8192 > /proc/sys/fs/file-max + # cat /proc/sys/fs/file-max + 8192 + + +This method of revision is useful for all customizable parameters of the +kernel - simply echo the new value to the corresponding file. + +Historically, the three values in file-nr denoted the number of allocated file +handles, the number of allocated but unused file handles, and the maximum +number of file handles. Linux 2.6 always reports 0 as the number of free file +handles -- this is not an error, it just means that the number of allocated +file handles exactly matches the number of used file handles. + +Attempts to allocate more file descriptors than file-max are reported with +printk, look for "VFS: file-max limit <number> reached". + +inode-state and inode-nr +------------------------ + +The file inode-nr contains the first two items from inode-state, so we'll skip +to that file... + +inode-state contains two actual numbers and five dummy values. The numbers +are nr_inodes and nr_free_inodes (in order of appearance). + +nr_inodes +~~~~~~~~~ + +Denotes the number of inodes the system has allocated. This number will +grow and shrink dynamically. + +nr_open +------- + +Denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + +nr_free_inodes +-------------- + +Represents the number of free inodes. Ie. The number of inuse inodes is +(nr_inodes - nr_free_inodes). + +aio-nr and aio-max-nr +--------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + +2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats +----------------------------------------------------------- + +Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This +handles the kernel support for miscellaneous binary formats. + +Binfmt_misc provides the ability to register additional binary formats to the +Kernel without compiling an additional module/kernel. Therefore, binfmt_misc +needs to know magic numbers at the beginning or the filename extension of the +binary. + +It works by maintaining a linked list of structs that contain a description of +a binary format, including a magic with size (or the filename extension), +offset and mask, and the interpreter name. On request it invokes the given +interpreter with the original program as argument, as binfmt_java and +binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default +binary-formats, you have to register an additional binary-format. + +There are two general files in binfmt_misc and one file per registered format. +The two general files are register and status. + +Registering a new binary format +------------------------------- + +To register a new binary format you have to issue the command + + echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register + + + +with appropriate name (the name for the /proc-dir entry), offset (defaults to +0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and +last but not least, the interpreter that is to be invoked (for example and +testing /bin/echo). Type can be M for usual magic matching or E for filename +extension matching (give extension in place of magic). + +Check or reset the status of the binary format handler +------------------------------------------------------ + +If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the +current status (enabled/disabled) of binfmt_misc. Change the status by echoing +0 (disables) or 1 (enables) or -1 (caution: this clears all previously +registered binary formats) to status. For example echo 0 > status to disable +binfmt_misc (temporarily). + +Status of a single handler +-------------------------- + +Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files +perform the same function as status, but their scope is limited to the actual +binary format. By cating this file, you also receive all related information +about the interpreter/magic of the binfmt. + +Example usage of binfmt_misc (emulate binfmt_java) +-------------------------------------------------- + + cd /proc/sys/fs/binfmt_misc + echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register + echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register + echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register + echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register + + +These four lines add support for Java executables and Java applets (like +binfmt_java, additionally recognizing the .html extension with no need to put +<!--applet> to every applet file). You have to install the JDK and the +shell-script /usr/local/java/bin/javawrapper too. It works around the +brokenness of the Java filename handling. To add a Java binary, just create a +link to the class-file somewhere in the path. + +2.3 /proc/sys/kernel - general kernel parameters +------------------------------------------------ + +This directory reflects general kernel behaviors. As I've said before, the +contents depend on your configuration. Here you'll find the most important +files, along with descriptions of what they mean and how to use them. + +acct +---- + +The file contains three values; highwater, lowwater, and frequency. + +It exists only when BSD-style process accounting is enabled. These values +control its behavior. If the free space on the file system where the log lives +goes below lowwater percentage, accounting suspends. If it goes above +highwater percentage, accounting resumes. Frequency determines how often you +check the amount of free space (value is in seconds). Default settings are: 4, +2, and 30. That is, suspend accounting if there is less than 2 percent free; +resume it if we have a value of 3 or more percent; consider information about +the amount of free space valid for 30 seconds + +ctrl-alt-del +------------ + +When the value in this file is 0, ctrl-alt-del is trapped and sent to the init +program to handle a graceful restart. However, when the value is greater that +zero, Linux's reaction to this key combination will be an immediate reboot, +without syncing its dirty buffers. + +[NOTE] + When a program (like dosemu) has the keyboard in raw mode, the + ctrl-alt-del is intercepted by the program before it ever reaches the + kernel tty layer, and it is up to the program to decide what to do with + it. + +domainname and hostname +----------------------- + +These files can be controlled to set the NIS domainname and hostname of your +box. For the classic darkstar.frop.org a simple: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "frop.org" > /proc/sys/kernel/domainname + + +would suffice to set your hostname and NIS domainname. + +osrelease, ostype and version +----------------------------- + +The names make it pretty obvious what these fields contain: + + > cat /proc/sys/kernel/osrelease + 2.2.12 + + > cat /proc/sys/kernel/ostype + Linux + + > cat /proc/sys/kernel/version + #4 Fri Oct 1 12:41:14 PDT 1999 + + +The files osrelease and ostype should be clear enough. Version needs a little +more clarification. The #4 means that this is the 4th kernel built from this +source base and the date after it indicates the time the kernel was built. The +only way to tune these values is to rebuild the kernel. + +panic +----- + +The value in this file represents the number of seconds the kernel waits +before rebooting on a panic. When you use the software watchdog, the +recommended setting is 60. If set to 0, the auto reboot after a kernel panic +is disabled, which is the default setting. + +printk +------ + +The four values in printk denote +* console_loglevel, +* default_message_loglevel, +* minimum_console_loglevel and +* default_console_loglevel +respectively. + +These values influence printk() behavior when printing or logging error +messages, which come from inside the kernel. See syslog(2) for more +information on the different log levels. + +console_loglevel +---------------- + +Messages with a higher priority than this will be printed to the console. + +default_message_level +--------------------- + +Messages without an explicit priority will be printed with this priority. + +minimum_console_loglevel +------------------------ + +Minimum (highest) value to which the console_loglevel can be set. + +default_console_loglevel +------------------------ + +Default value for console_loglevel. + +sg-big-buff +----------- + +This file shows the size of the generic SCSI (sg) buffer. At this point, you +can't tune it yet, but you can change it at compile time by editing +include/scsi/sg.h and changing the value of SG_BIG_BUFF. + +If you use a scanner with SANE (Scanner Access Now Easy) you might want to set +this to a higher value. Refer to the SANE documentation on this issue. + +modprobe +-------- + +The location where the modprobe binary is located. The kernel uses this +program to load modules on demand. + +unknown_nmi_panic +----------------- + +The value in this file affects behavior of handling NMI. When the value is +non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel +debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for example. +If a system hangs up, try pressing the NMI switch. + +panic_on_unrecovered_nmi +------------------------ + +The default Linux behaviour on an NMI of either memory or unknown is to continue +operation. For many environments such as scientific computing it is preferable +that the box is taken out and the error dealt with than an uncorrected +parity/ECC error get propogated. + +A small number of systems do generate NMI's for bizarre random reasons such as +power management so the default is off. That sysctl works like the existing +panic controls already in that directory. + +nmi_watchdog +------------ + +Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero +the NMI watchdog is enabled and will continuously test all online cpus to +determine whether or not they are still functioning properly. + +Because the NMI watchdog shares registers with oprofile, by disabling the NMI +watchdog, oprofile may have more registers to utilize. + +msgmni +------ + +Maximum number of message queue ids on the system. +This value scales to the amount of lowmem. It is automatically recomputed +upon memory add/remove or ipc namespace creation/removal. +When a value is written into this file, msgmni's value becomes fixed, i.e. it +is not recomputed anymore when one of the above events occurs. +Use auto_msgmni to change this behavior. + +auto_msgmni +----------- + +Enables/Disables automatic recomputing of msgmni upon memory add/remove or +upon ipc namespace creation/removal (see the msgmni description above). +Echoing "1" into this file enables msgmni automatic recomputing. +Echoing "0" turns it off. +auto_msgmni default value is 1. + + +2.4 /proc/sys/vm - The virtual memory subsystem +----------------------------------------------- + +The files in this directory can be used to tune the operation of the virtual +memory (VM) subsystem of the Linux kernel. + +vfs_cache_pressure +------------------ + +Controls the tendency of the kernel to reclaim the memory which is used for +caching of directory and inode objects. + +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +dirty_background_ratio +---------------------- + +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which the pdflush background writeback daemon will start writing out +dirty data. + +dirty_ratio +----------------- + +Contains, as a percentage of the dirtyable system memory (free pages + mapped +pages + file cache, not including locked pages and HugePages), the number of +pages at which a process which is generating disk writes will itself start +writing out dirty data. + +dirty_writeback_centisecs +------------------------- + +The pdflush writeback daemons will periodically wake up and write `old' data +out to disk. This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + +dirty_expire_centisecs +---------------------- + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the pdflush daemons. It is expressed in 100'ths of a second. +Data which has been dirty in-memory for longer than this interval will be +written out next time a pdflush daemon wakes up. + +highmem_is_dirtyable +-------------------- + +Only present if CONFIG_HIGHMEM is set. + +This defaults to 0 (false), meaning that the ratios set above are calculated +as a percentage of lowmem only. This protects against excessive scanning +in page reclaim, swapping and general VM distress. + +Setting this to 1 can be useful on 32 bit machines where you want to make +random changes within an MMAPed file that is larger than your available +lowmem without causing large quantities of random IO. Is is safe if the +behavior of all programs running on the machine is known and memory will +not be otherwise stressed. + +legacy_va_layout +---------------- + +If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + +lowmem_reserve_ratio +--------------------- + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which _could_ use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file. +- +% cat /proc/sys/vm/lowmem_reserve_ratio +256 256 32 +- +Note: # of this elements is one fewer than number of zones. Because the highest + zone's value is not necessary for following calculation. + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this. + +- +Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : +- +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +pages_high is used for watermark, the kernel judges this zone should not be +used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression. + +(i < j): + zone[i]->protection[j] + = (total sums of present_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; +(i = j): + (should not be protected. = 0; +(i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others). +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total present +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). + +page-cluster +------------ + +page-cluster controls the number of pages which are written to swap in +a single attempt. The swap I/O size. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +overcommit_memory +----------------- + +Controls overcommit of system memory, possibly allowing processes +to allocate (but not use) more memory than is actually available. + + +0 - Heuristic overcommit handling. Obvious overcommits of + address space are refused. Used for a typical system. It + ensures a seriously wild allocation fails while allowing + overcommit to reduce swap usage. root is allowed to + allocate slightly more memory in this mode. This is the + default. + +1 - Always overcommit. Appropriate for some scientific + applications. + +2 - Don't overcommit. The total address space commit + for the system is not permitted to exceed swap plus a + configurable percentage (default is 50) of physical RAM. + Depending on the percentage you use, in most situations + this means a process will not be killed while attempting + to use already-allocated memory but will receive errors + on memory allocation as appropriate. + +overcommit_ratio +---------------- + +Percentage of physical memory size to include in overcommit calculations +(see above.) + +Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) + + swapspace = total size of all swap areas + physmem = size of physical memory in system + +nr_hugepages and hugetlb_shm_group +---------------------------------- + +nr_hugepages configures number of hugetlb page reserved for the system. + +hugetlb_shm_group contains group id that is allowed to create SysV shared +memory segment using hugetlb page. + +hugepages_treat_as_movable +-------------------------- + +This parameter is only useful when kernelcore= is specified at boot time to +create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages +are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero +value written to hugepages_treat_as_movable allows huge pages to be allocated +from ZONE_MOVABLE. + +Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge +pages pool can easily grow or shrink within. Assuming that applications are +not running that mlock() a lot of memory, it is likely the huge pages pool +can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value +into nr_hugepages and triggering page reclaim. + +laptop_mode +----------- + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. + +block_dump +---------- + +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. + +swap_token_timeout +------------------ + +This file contains valid hold time of swap out protection token. The Linux +VM has token based thrashing control mechanism and uses the token to prevent +unnecessary page faults in thrashing situation. The unit of the value is +second. The value would be useful to tune thrashing behavior. + +drop_caches +----------- + +Writing to this will cause the kernel to drop clean caches, dentries and +inodes from memory, causing that memory to become free. + +To free pagecache: + echo 1 > /proc/sys/vm/drop_caches +To free dentries and inodes: + echo 2 > /proc/sys/vm/drop_caches +To free pagecache, dentries and inodes: + echo 3 > /proc/sys/vm/drop_caches + +As this is a non-destructive operation and dirty objects are not freeable, the +user should run `sync' first. + + +2.5 /proc/sys/dev - Device specific parameters +---------------------------------------------- + +Currently there is only support for CDROM drives, and for those, there is only +one read-only file containing information about the CD-ROM drives attached to +the system: + + >cat /proc/sys/dev/cdrom/info + CD-ROM information, Id: cdrom.c 2.55 1999/04/25 + + drive name: sr0 hdb + drive speed: 32 40 + drive # of slots: 1 0 + Can close tray: 1 1 + Can open tray: 1 1 + Can lock tray: 1 1 + Can change speed: 1 1 + Can select disk: 0 1 + Can read multisession: 1 1 + Can read MCN: 1 1 + Reports media changed: 1 1 + Can play audio: 1 1 + + +You see two drives, sr0 and hdb, along with a list of their features. + +2.6 /proc/sys/sunrpc - Remote procedure calls +--------------------------------------------- + +This directory contains four files, which enable or disable debugging for the +RPC functions NFS, NFS-daemon, RPC and NLM. The default values are 0. They can +be set to one to turn debugging on. (The default value is 0 for each) + +2.7 /proc/sys/net - Networking stuff +------------------------------------ + +The interface to the networking parts of the kernel is located in +/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only +some of them, depending on your kernel's configuration. + + +Table 2-3: Subdirectories in /proc/sys/net +.............................................................................. + Directory Content Directory Content + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 +.............................................................................. + +We will concentrate on IP networking here. Since AX15, X.25, and DEC Net are +only minor players in the Linux world, we'll skip them in this chapter. You'll +find some short info on Appletalk and IPX further on in this chapter. Review +the online documentation and the kernel source to get a detailed view of the +parameters for those protocols. In this section we'll discuss the +subdirectories printed in bold letters in the table above. As default values +are suitable for most needs, there is no need to change these values. + +/proc/sys/net/core - Network core options +----------------------------------------- + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor, results in +fewer messages that will be written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This controls console messages from the networking stack that can occur because +of problems on the network like duplicate address or bad checksums. Normally, +this should be enabled, but if the problem persists the messages can be +disabled. + + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than kernel can process them. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +/proc/sys/net/unix - Parameters for Unix domain sockets +------------------------------------------------------- + +There are only two files in this subdirectory. They control the delays for +deleting and destroying socket descriptors. + +2.8 /proc/sys/net/ipv4 - IPV4 settings +-------------------------------------- + +IP version 4 is still the most used protocol in Unix networking. It will be +replaced by IP version 6 in the next couple of years, but for the moment it's +the de facto standard for the internet and is used in most networking +environments around the world. Because of the importance of this protocol, +we'll have a deeper look into the subtree controlling the behavior of the IPv4 +subsystem of the Linux kernel. + +Let's start with the entries in /proc/sys/net/ipv4. + +ICMP settings +------------- + +icmp_echo_ignore_all and icmp_echo_ignore_broadcasts +---------------------------------------------------- + +Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or +just those to broadcast and multicast addresses. + +Please note that if you accept ICMP echo requests with a broadcast/multi\-cast +destination address your network may be used as an exploder for denial of +service packet flooding attacks to other hosts. + +icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexeed_rate +--------------------------------------------------------------------------------------- + +Sets limits for sending ICMP packets to specific targets. A value of zero +disables all limiting. Any positive value sets the maximum package rate in +hundredth of a second (on Intel systems). + +IP settings +----------- + +ip_autoconfig +------------- + +This file contains the number one if the host received its IP configuration by +RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. + +ip_default_ttl +-------------- + +TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of +hops a packet may travel. + +ip_dynaddr +---------- + +Enable dynamic socket address rewriting on interface address change. This is +useful for dialup interface with changing IP addresses. + +ip_forward +---------- + +Enable or disable forwarding of IP packages between interfaces. Changing this +value resets all other parameters to their default values. They differ if the +kernel is configured as host or router. + +ip_local_port_range +------------------- + +Range of ports used by TCP and UDP to choose the local port. Contains two +numbers, the first number is the lowest port, the second number the highest +local port. Default is 1024-4999. Should be changed to 32768-61000 for +high-usage systems. + +ip_no_pmtu_disc +--------------- + +Global switch to turn path MTU discovery off. It can also be set on a per +socket basis by the applications or on a per route basis. + +ip_masq_debug +------------- + +Enable/disable debugging of IP masquerading. + +IP fragmentation settings +------------------------- + +ipfrag_high_trash and ipfrag_low_trash +-------------------------------------- + +Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes +of memory is allocated for this purpose, the fragment handler will toss +packets until ipfrag_low_thresh is reached. + +ipfrag_time +----------- + +Time in seconds to keep an IP fragment in memory. + +TCP settings +------------ + +tcp_ecn +------- + +This file controls the use of the ECN bit in the IPv4 headers. This is a new +feature about Explicit Congestion Notification, but some routers and firewalls +block traffic that has this bit set, so it could be necessary to echo 0 to +/proc/sys/net/ipv4/tcp_ecn if you want to talk to these sites. For more info +you could read RFC2481. + +tcp_retrans_collapse +-------------------- + +Bug-to-bug compatibility with some broken printers. On retransmit, try to send +larger packets to work around bugs in certain TCP stacks. Can be turned off by +setting it to zero. + +tcp_keepalive_probes +-------------------- + +Number of keep alive probes TCP sends out, until it decides that the +connection is broken. + +tcp_keepalive_time +------------------ + +How often TCP sends out keep alive messages, when keep alive is enabled. The +default is 2 hours. + +tcp_syn_retries +--------------- + +Number of times initial SYNs for a TCP connection attempt will be +retransmitted. Should not be higher than 255. This is only the timeout for +outgoing connections, for incoming connections the number of retransmits is +defined by tcp_retries1. + +tcp_sack +-------- + +Enable select acknowledgments after RFC2018. + +tcp_timestamps +-------------- + +Enable timestamps as defined in RFC1323. + +tcp_stdurg +---------- + +Enable the strict RFC793 interpretation of the TCP urgent pointer field. The +default is to use the BSD compatible interpretation of the urgent pointer +pointing to the first byte after the urgent data. The RFC793 interpretation is +to have it point to the last byte of urgent data. Enabling this option may +lead to interoperability problems. Disabled by default. + +tcp_syncookies +-------------- + +Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out +syncookies when the syn backlog queue of a socket overflows. This is to ward +off the common 'syn flood attack'. Disabled by default. + +Note that the concept of a socket backlog is abandoned. This means the peer +may not receive reliable error messages from an over loaded server with +syncookies enabled. + +tcp_window_scaling +------------------ + +Enable window scaling as defined in RFC1323. + +tcp_fin_timeout +--------------- + +The length of time in seconds it takes to receive a final FIN before the +socket is always closed. This is strictly a violation of the TCP +specification, but required to prevent denial-of-service attacks. + +tcp_max_ka_probes +----------------- + +Indicates how many keep alive probes are sent per slow timer run. Should not +be set too high to prevent bursts. + +tcp_max_syn_backlog +------------------- + +Length of the per socket backlog queue. Since Linux 2.2 the backlog specified +in listen(2) only specifies the length of the backlog queue of already +established sockets. When more connection requests arrive Linux starts to drop +packets. When syncookies are enabled the packets are still answered and the +maximum queue is effectively ignored. + +tcp_retries1 +------------ + +Defines how often an answer to a TCP connection request is retransmitted +before giving up. + +tcp_retries2 +------------ + +Defines how often a TCP packet is retransmitted before giving up. + +Interface specific settings +--------------------------- + +In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each +interface the system knows about and one directory calls all. Changes in the +all subdirectory affect all interfaces, whereas changes in the other +subdirectories affect only one interface. All directories have the same +entries: + +accept_redirects +---------------- + +This switch decides if the kernel accepts ICMP redirect messages or not. The +default is 'yes' if the kernel is configured for a regular host and 'no' for a +router configuration. + +accept_source_route +------------------- + +Should source routed packages be accepted or declined. The default is +dependent on the kernel configuration. It's 'yes' for routers and 'no' for +hosts. + +bootp_relay +~~~~~~~~~~~ + +Accept packets with source address 0.b.c.d with destinations not to this host +as local ones. It is supposed that a BOOTP relay daemon will catch and forward +such packets. + +The default is 0, since this feature is not implemented yet (kernel version +2.2.12). + +forwarding +---------- + +Enable or disable IP forwarding on this interface. + +log_martians +------------ + +Log packets with source addresses with no known route to kernel log. + +mc_forwarding +------------- + +Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a +multicast routing daemon is required. + +proxy_arp +--------- + +Does (1) or does not (0) perform proxy ARP. + +rp_filter +--------- + +Integer value determines if a source validation should be made. 1 means yes, 0 +means no. Disabled by default, but local/broadcast address spoofing is always +on. + +If you set this to 1 on a router that is the only connection for a network to +the net, it will prevent spoofing attacks against your internal networks +(external addresses can still be spoofed), without the need for additional +firewall rules. + +secure_redirects +---------------- + +Accept ICMP redirect messages only for gateways, listed in default gateway +list. Enabled by default. + +shared_media +------------ + +If it is not set the kernel does not assume that different subnets on this +device can communicate directly. Default setting is 'yes'. + +send_redirects +-------------- + +Determines whether to send ICMP redirects to other hosts. + +Routing settings +---------------- + +The directory /proc/sys/net/ipv4/route contains several file to control +routing issues. + +error_burst and error_cost +-------------------------- + +These parameters are used to limit how many ICMP destination unreachable to +send from the host in question. ICMP destination unreachable messages are +sent when we cannot reach the next hop while trying to transmit a packet. +It will also print some error messages to kernel logs if someone is ignoring +our ICMP redirects. The higher the error_cost factor is, the fewer +destination unreachable and error messages will be let through. Error_burst +controls when destination unreachable messages and error messages will be +dropped. The default settings limit warning messages to five every second. + +flush +----- + +Writing to this file results in a flush of the routing cache. + +gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh +--------------------------------------------------------------------- + +Values to control the frequency and behavior of the garbage collection +algorithm for the routing cache. gc_min_interval is deprecated and replaced +by gc_min_interval_ms. + + +max_size +-------- + +Maximum size of the routing cache. Old entries will be purged once the cache +reached has this size. + +redirect_load, redirect_number +------------------------------ + +Factors which determine if more ICPM redirects should be sent to a specific +host. No redirects will be sent once the load limit or the maximum number of +redirects has been reached. + +redirect_silence +---------------- + +Timeout for redirects. After this period redirects will be sent again, even if +this has been stopped, because the load or number limit has been reached. + +Network Neighbor handling +------------------------- + +Settings about how to handle connections with direct neighbors (nodes attached +to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. + +As we saw it in the conf directory, there is a default subdirectory which +holds the default values, and one directory for each interface. The contents +of the directories are identical, with the single exception that the default +settings contain additional options to set garbage collection parameters. + +In the interface directories you'll find the following entries: + +base_reachable_time, base_reachable_time_ms +------------------------------------------- + +A base value used for computing the random reachable time value as specified +in RFC2461. + +Expression of base_reachable_time, which is deprecated, is in seconds. +Expression of base_reachable_time_ms is in milliseconds. + +retrans_time, retrans_time_ms +----------------------------- + +The time between retransmitted Neighbor Solicitation messages. +Used for address resolution and to determine if a neighbor is +unreachable. + +Expression of retrans_time, which is deprecated, is in 1/100 seconds (for +IPv4) or in jiffies (for IPv6). +Expression of retrans_time_ms is in milliseconds. + +unres_qlen +---------- + +Maximum queue length for a pending arp request - the number of packets which +are accepted from other layers while the ARP address is still resolved. + +anycast_delay +------------- + +Maximum for random delay of answers to neighbor solicitation messages in +jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support +yet). + +ucast_solicit +------------- + +Maximum number of retries for unicast solicitation. + +mcast_solicit +------------- + +Maximum number of retries for multicast solicitation. + +delay_first_probe_time +---------------------- + +Delay for the first time probe if the neighbor is reachable. (see +gc_stale_time) + +locktime +-------- + +An ARP/neighbor entry is only replaced with a new one if the old is at least +locktime old. This prevents ARP cache thrashing. + +proxy_delay +----------- + +Maximum time (real time is random [0..proxytime]) before answering to an ARP +request for which we have an proxy ARP entry. In some cases, this is used to +prevent network flooding. + +proxy_qlen +---------- + +Maximum queue length of the delayed proxy arp timer. (see proxy_delay). + +app_solicit +---------- + +Determines the number of requests to send to the user level ARP daemon. Use 0 +to turn off. + +gc_stale_time +------------- + +Determines how often to check for stale ARP entries. After an ARP entry is +stale it will be resolved again (which is useful when an IP address migrates +to another machine). When ucast_solicit is greater than 0 it first tries to +send an ARP packet directly to the known host When that fails and +mcast_solicit is greater than 0, an ARP request is broadcasted. + +2.9 Appletalk +------------- + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. + +/proc/net/atalk_iface lists all the interfaces configured for appletalk.It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + +2.10 IPX +-------- + +The IPX protocol has no tunable values in proc/sys/net. + +The IPX protocol does, however, provide proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port). In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. + +2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem +---------------------------------------------------------- + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). + +2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score +------------------------------------------------------ + +This file can be used to adjust the score used to select which processes +should be killed in an out-of-memory situation. Giving it a high score will +increase the likelihood of this process being killed by the oom-killer. Valid +values are in the range -16 to +15, plus the special value -17, which disables +oom-killing altogether for this process. + +2.13 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +------------------------------------------------------------------------------ +This file can be used to check the current score used by the oom-killer is for +any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which +process should be killed in an out-of-memory situation. + +------------------------------------------------------------------------------ +Summary +------------------------------------------------------------------------------ +Certain aspects of kernel behavior can be modified at runtime, without the +need to recompile the kernel, or even to reboot the system. The files in the +/proc/sys tree can not only be read, but also modified. You can use the echo +command to write value into these files, thereby changing the default settings +of the kernel. +------------------------------------------------------------------------------ + +2.14 /proc/<pid>/io - Display the IO accounting fields +------------------------------------------------------- + +This file contains IO statistics for each running process + +Example +------- + +test:/tmp # dd if=/dev/zero of=/tmp/test.dat & +[1] 3828 + +test:/tmp # cat /proc/3828/io +rchar: 323934931 +wchar: 323929600 +syscr: 632687 +syscw: 632675 +read_bytes: 0 +write_bytes: 323932160 +cancelled_write_bytes: 0 + + +Description +----------- + +rchar +----- + +I/O counter: chars read +The number of bytes which this task has caused to be read from storage. This +is simply the sum of bytes which this process passed to read() and pread(). +It includes things like tty IO and it is unaffected by whether or not actual +physical disk IO was required (the read might have been satisfied from +pagecache) + + +wchar +----- + +I/O counter: chars written +The number of bytes which this task has caused, or shall cause to be written +to disk. Similar caveats apply here as with rchar. + + +syscr +----- + +I/O counter: read syscalls +Attempt to count the number of read I/O operations, i.e. syscalls like read() +and pread(). + + +syscw +----- + +I/O counter: write syscalls +Attempt to count the number of write I/O operations, i.e. syscalls like +write() and pwrite(). + + +read_bytes +---------- + +I/O counter: bytes read +Attempt to count the number of bytes which this process really did cause to +be fetched from the storage layer. Done at the submit_bio() level, so it is +accurate for block-backed filesystems. <please add status regarding NFS and +CIFS at a later time> + + +write_bytes +----------- + +I/O counter: bytes written +Attempt to count the number of bytes which this process caused to be sent to +the storage layer. This is done at page-dirtying time. + + +cancelled_write_bytes +--------------------- + +The big inaccuracy here is truncate. If a process writes 1MB to a file and +then deletes the file, it will in fact perform no writeout. But it will have +been accounted as having caused 1MB of write. +In other words: The number of bytes which this process caused to not happen, +by truncating pagecache. A task can cause "negative" IO too. If this task +truncates some dirty pagecache, some IO which another task has been accounted +for (in it's write_bytes) will not be happening. We _could_ just subtract that +from the truncating task's write_bytes, but there is information loss in doing +that. + + +Note +---- + +At its current implementation state, this is a bit racy on 32-bit machines: if +process A reads process B's /proc/pid/io while process B is updating one of +those 64-bit counters, process A could see an intermediate result. + + +More information about this can be found within the taskstats documentation in +Documentation/accounting. + +2.15 /proc/<pid>/coredump_filter - Core dump filtering settings +--------------------------------------------------------------- +When a process is dumped, all anonymous memory is written to a core file as +long as the size of the core file isn't limited. But sometimes we don't want +to dump some memory segments, for example, huge shared memory. Conversely, +sometimes we want to save file-backed memory segments into a core file, not +only the individual files. + +/proc/<pid>/coredump_filter allows you to customize which memory segments +will be dumped when the <pid> process is dumped. coredump_filter is a bitmask +of memory types. If a bit of the bitmask is set, memory segments of the +corresponding memory type are dumped, otherwise they are not dumped. + +The following 7 memory types are supported: + - (bit 0) anonymous private memory + - (bit 1) anonymous shared memory + - (bit 2) file-backed private memory + - (bit 3) file-backed shared memory + - (bit 4) ELF header pages in file-backed private memory areas (it is + effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory + + Note that MMIO pages such as frame buffer are never dumped and vDSO pages + are always dumped regardless of the bitmask status. + + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. + +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. + +If you don't want to dump all shared memory segments attached to pid 1234, +write 0x21 to the process's proc file. + + $ echo 0x21 > /proc/1234/coredump_filter + +When a new process is created, the process inherits the bitmask status from its +parent. It is useful to set up coredump_filter before the program runs. +For example: + + $ echo 0x7 > /proc/self/coredump_filter + $ ./some_program + +2.16 /proc/<pid>/mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form: + +36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue +(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + +(1) mount ID: unique identifier of the mount (may be reused after umount) +(2) parent ID: ID of parent (or of self for the top of the mount tree) +(3) major:minor: value of st_dev for files on filesystem +(4) root: root of the mount within the filesystem +(5) mount point: mount point relative to the process's root +(6) mount options: per mount options +(7) optional fields: zero or more fields of the form "tag[:value]" +(8) separator: marks the end of the optional fields +(9) filesystem type: name of filesystem of the form "type[.subtype]" +(10) mount source: filesystem specific information or "none" +(11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X (*) +unbindable mount is unbindable + +(*) X is the closest dominant peer group under the process's root. If +X is the immediate master of the mount, or if there's no dominant peer +group under the same root, then only the "master:X" field is present +and not the "propagate_from:X" field. + +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt + +2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface +-------------------------------------------------------- + +This directory contains configuration options for the epoll(7) interface. + +max_user_instances +------------------ + +This is the maximum number of epoll file descriptors that a single user can +have open at a given time. The default value is 128, and should be enough +for normal users. + +max_user_watches +---------------- + +Every epoll file descriptor can store a number of files to be monitored +for event readiness. Each one of these monitored files constitutes a "watch". +This configuration option sets the maximum number of "watches" that are +allowed for each user. +Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes +on a 64bit one. +The current default value for max_user_watches is the 1/32 of the available +low memory, divided for the "watch" cost in bytes. + + +------------------------------------------------------------------------------ + diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.txt new file mode 100644 index 0000000..5e8de25 --- /dev/null +++ b/Documentation/filesystems/quota.txt @@ -0,0 +1,65 @@ + +Quota subsystem +=============== + +Quota subsystem allows system administrator to set limits on used space and +number of used inodes (inode is a filesystem structure which is associated with +each file or directory) for users and/or groups. For both used space and number +of used inodes there are actually two limits. The first one is called softlimit +and the second one hardlimit. An user can never exceed a hardlimit for any +resource (unless he has CAP_SYS_RESOURCE capability). User is allowed to exceed +softlimit but only for limited period of time. This period is called "grace +period" or "grace time". When grace time is over, user is not able to allocate +more space/inodes until he frees enough of them to get below softlimit. + +Quota limits (and amount of grace time) are set independently for each +filesystem. + +For more details about quota design, see the documentation in quota-tools package +(http://sourceforge.net/projects/linuxquota). + +Quota netlink interface +======================= +When user exceeds a softlimit, runs out of grace time or reaches hardlimit, +quota subsystem traditionally printed a message to the controlling terminal of +the process which caused the excess. This method has the disadvantage that +when user is using a graphical desktop he usually cannot see the message. +Thus quota netlink interface has been designed to pass information about +the above events to userspace. There they can be captured by an application +and processed accordingly. + +The interface uses generic netlink framework (see +http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more +details about this layer). The name of the quota generic netlink interface +is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. + Currently, the interface supports only one message type QUOTA_NL_C_WARNING. +This command is used to send a notification about any of the above mentioned +events. Each message has six attributes. These are (type of the argument is +in parentheses): + QUOTA_NL_A_QTYPE (u32) + - type of quota being exceeded (one of USRQUOTA, GRPQUOTA) + QUOTA_NL_A_EXCESS_ID (u64) + - UID/GID (depends on quota type) of user / group whose limit + is being exceeded. + QUOTA_NL_A_CAUSED_ID (u64) + - UID of a user who caused the event + QUOTA_NL_A_WARNING (u32) + - what kind of limit is exceeded: + QUOTA_NL_IHARDWARN - inode hardlimit + QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer + than given grace period + QUOTA_NL_ISOFTWARN - inode softlimit + QUOTA_NL_BHARDWARN - space (block) hardlimit + QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded + longer than given grace period. + QUOTA_NL_BSOFTWARN - space (block) softlimit + - four warnings are also defined for the event when user stops + exceeding some limit: + QUOTA_NL_IHARDBELOW - inode hardlimit + QUOTA_NL_ISOFTBELOW - inode softlimit + QUOTA_NL_BHARDBELOW - space (block) hardlimit + QUOTA_NL_BSOFTBELOW - space (block) softlimit + QUOTA_NL_A_DEV_MAJOR (u32) + - major number of a device with the affected filesystem + QUOTA_NL_A_DEV_MINOR (u32) + - minor number of a device with the affected filesystem diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt new file mode 100644 index 0000000..a8273d5 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -0,0 +1,355 @@ +ramfs, rootfs and initramfs +October 17, 2005 +Rob Landley <rob@landley.net> +============================= + +What is ramfs? +-------------- + +Ramfs is a very simple filesystem that exports Linux's disk caching +mechanisms (the page cache and dentry cache) as a dynamically resizable +RAM-based filesystem. + +Normally all files are cached in memory by Linux. Pages of data read from +backing store (usually the block device the filesystem is mounted on) are kept +around in case it's needed again, but marked as clean (freeable) in case the +Virtual Memory system needs the memory for something else. Similarly, data +written to files is marked clean as soon as it has been written to backing +store, but kept around for caching purposes until the VM reallocates the +memory. A similar mechanism (the dentry cache) greatly speeds up access to +directories. + +With ramfs, there is no backing store. Files written into ramfs allocate +dentries and page cache as usual, but there's nowhere to write them to. +This means the pages are never marked clean, so they can't be freed by the +VM when it's looking to recycle memory. + +The amount of code required to implement ramfs is tiny, because all the +work is done by the existing Linux caching infrastructure. Basically, +you're mounting the disk cache as a filesystem. Because of this, ramfs is not +an optional component removable via menuconfig, since there would be negligible +space savings. + +ramfs and ramdisk: +------------------ + +The older "ram disk" mechanism created a synthetic block device out of +an area of RAM and used it as backing store for a filesystem. This block +device was of fixed size, so the filesystem mounted on it was of fixed +size. Using a ram disk also required unnecessarily copying memory from the +fake block device into the page cache (and copying changes back out), as well +as creating and destroying dentries. Plus it needed a filesystem driver +(such as ext2) to format and interpret this data. + +Compared to ramfs, this wastes memory (and memory bus bandwidth), creates +unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks +to avoid this copying by playing with the page tables, but they're unpleasantly +complicated and turn out to be about as expensive as the copying anyway.) +More to the point, all the work ramfs is doing has to happen _anyway_, +since all file access goes through the page and dentry caches. The RAM +disk is simply unnecessary; ramfs is internally much simpler. + +Another reason ramdisks are semi-obsolete is that the introduction of +loopback devices offered a more flexible and convenient way to create +synthetic block devices, now from files instead of from chunks of memory. +See losetup (8) for details. + +ramfs and tmpfs: +---------------- + +One downside of ramfs is you can keep writing data into it until you fill +up all memory, and the VM can't free it because the VM thinks that files +should get written to backing store (rather than swap space), but ramfs hasn't +got any backing store. Because of this, only root (or a trusted user) should +be allowed write access to a ramfs mount. + +A ramfs derivative called tmpfs was created to add size limits, and the ability +to write the data to swap space. Normal users can be allowed write access to +tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. + +What is rootfs? +--------------- + +Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is +always present in 2.6 systems. You can't unmount rootfs for approximately the +same reason you can't kill the init process; rather than having special code +to check for and handle an empty list, it's smaller and simpler for the kernel +to just make sure certain lists can't become empty. + +Most systems just mount another filesystem over rootfs and ignore it. The +amount of space an empty instance of ramfs takes up is tiny. + +What is initramfs? +------------------ + +All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is +extracted into rootfs when the kernel boots up. After extracting, the kernel +checks to see if rootfs contains a file "init", and if so it executes it as PID +1. If found, this init process is responsible for bringing the system the +rest of the way up, including locating and mounting the real root device (if +any). If rootfs does not contain an init program after the embedded cpio +archive is extracted into it, the kernel will fall through to the older code +to locate and mount a root partition, then exec some variant of /sbin/init +out of that. + +All this differs from the old initrd in several ways: + + - The old initrd was always a separate file, while the initramfs archive is + linked into the linux kernel image. (The directory linux-*/usr is devoted + to generating this archive during the build.) + + - The old initrd file was a gzipped filesystem image (in some file format, + such as ext2, that needed a driver built into the kernel), while the new + initramfs archive is a gzipped cpio archive (like tar only simpler, + see cpio(1) and Documentation/early-userspace/buffer-format.txt). The + kernel's cpio extraction code is not only extremely small, it's also + __init text and data that can be discarded during the boot process. + + - The program run by the old initrd (which was called /initrd, not /init) did + some setup and then returned to the kernel, while the init program from + initramfs is not expected to return to the kernel. (If /init needs to hand + off control it can overmount / with a new root device and exec another init + program. See the switch_root utility, below.) + + - When switching another root device, initrd would pivot_root and then + umount the ramdisk. But initramfs is rootfs: you can neither pivot_root + rootfs, nor unmount it. Instead delete everything out of rootfs to + free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs + with the new root (cd /newmount; mount --move . /; chroot .), attach + stdin/stdout/stderr to the new /dev/console, and exec the new init. + + Since this is a remarkably persnickety process (and involves deleting + commands before you can run them), the klibc package introduced a helper + program (utils/run_init.c) to do all this for you. Most other packages + (such as busybox) have named this command "switch_root". + +Populating initramfs: +--------------------- + +The 2.6 kernel build process always creates a gzipped cpio format initramfs +archive and links it into the resulting kernel binary. By default, this +archive is empty (consuming 134 bytes on x86). + +The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, +and living in usr/Kconfig) can be used to specify a source for the +initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio +archive, a directory containing files to be archived, or a text file +specification such as the following example: + + dir /dev 755 0 0 + nod /dev/console 644 0 0 c 5 1 + nod /dev/loop0 644 0 0 b 7 0 + dir /bin 755 1000 1000 + slink /bin/sh busybox 777 0 0 + file /bin/busybox initramfs/busybox 755 0 0 + dir /proc 755 0 0 + dir /sys 755 0 0 + dir /mnt 755 0 0 + file /init initramfs/init.sh 755 0 0 + +Run "usr/gen_init_cpio" (after the kernel build) to get a usage message +documenting the above file format. + +One advantage of the configuration file is that root access is not required to +set permissions or create device nodes in the new archive. (Note that those +two example "file" entries expect to find files named "init.sh" and "busybox" in +a directory called "initramfs", under the linux-2.6.* directory. See +Documentation/early-userspace/README for more details.) + +The kernel does not depend on external cpio tools. If you specify a +directory instead of a configuration file, the kernel's build infrastructure +creates a configuration file from that directory (usr/Makefile calls +scripts/gen_initramfs_list.sh), and proceeds to package up that directory +using the config file (by feeding it to usr/gen_init_cpio, which is created +from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is +entirely self-contained, and the kernel's boot-time extractor is also +(obviously) self-contained. + +The one thing you might need external cpio utilities installed for is creating +or extracting your own preprepared cpio files to feed to the kernel build +(instead of a config file or directory). + +The following command line can extract a cpio image (either by the above script +or by the kernel build) back into its component files: + + cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames + +The following shell script can create a prebuilt cpio archive you can +use in place of the above config file: + + #!/bin/sh + + # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation. + # Licensed under GPL version 2 + + if [ $# -ne 2 ] + then + echo "usage: mkinitramfs directory imagename.cpio.gz" + exit 1 + fi + + if [ -d "$1" ] + then + echo "creating $2 from $1" + (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" + else + echo "First argument must be a directory" + exit 1 + fi + +Note: The cpio man page contains some bad advice that will break your initramfs +archive if you follow it. It says "A typical way to generate the list +of filenames is with the find command; you should give find the -depth option +to minimize problems with permissions on directories that are unwritable or not +searchable." Don't do this when creating initramfs.cpio.gz images, it won't +work. The Linux kernel cpio extractor won't create files in a directory that +doesn't exist, so the directory entries must go before the files that go in +those directories. The above script gets them in the right order. + +External initramfs images: +-------------------------- + +If the kernel has initrd support enabled, an external cpio.gz archive can also +be passed into a 2.6 kernel in place of an initrd. In this case, the kernel +will autodetect the type (initramfs, not initrd) and extract the external cpio +archive into rootfs before trying to run /init. + +This has the memory efficiency advantages of initramfs (no ramdisk block +device) but the separate packaging of initrd (which is nice if you have +non-GPL code you'd like to run from initramfs, without conflating it with +the GPL licensed Linux kernel binary). + +It can also be used to supplement the kernel's built-in initramfs image. The +files in the external archive will overwrite any conflicting files in +the built-in initramfs archive. Some distributors also prefer to customize +a single kernel image with task-specific initramfs images, without recompiling. + +Contents of initramfs: +---------------------- + +An initramfs archive is a complete self-contained root filesystem for Linux. +If you don't already understand what shared libraries, devices, and paths +you need to get a minimal root filesystem up and running, here are some +references: +http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +http://www.linuxfromscratch.org/lfs/view/stable/ + +The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is +designed to be a tiny C library to statically link early userspace +code against, along with some related utilities. It is BSD licensed. + +I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) +myself. These are LGPL and GPL, respectively. (A self-contained initramfs +package is planned for the busybox 1.3 release.) + +In theory you could use glibc, but that's not well suited for small embedded +uses like this. (A "hello world" program statically linked against glibc is +over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do +name lookups, even when otherwise statically linked.) + +A good first step is to get initramfs to run a statically linked "hello world" +program as init, and test it under an emulator like qemu (www.qemu.org) or +User Mode Linux, like so: + + cat > hello.c << EOF + #include <stdio.h> + #include <unistd.h> + + int main(int argc, char *argv[]) + { + printf("Hello world!\n"); + sleep(999999999); + } + EOF + gcc -static hello.c -o init + echo init | cpio -o -H newc | gzip > test.cpio.gz + # Testing external initramfs using the initrd loading mechanism. + qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero + +When debugging a normal root filesystem, it's nice to be able to boot with +"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's +just as useful. + +Why cpio rather than tar? +------------------------- + +This decision was made back in December, 2001. The discussion started here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html + +And spawned a second thread (specifically on tar vs cpio), starting here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html + +The quick and dirty summary version (which is no substitute for reading +the above threads) is: + +1) cpio is a standard. It's decades old (from the AT&T days), and already + widely used on Linux (inside RPM, Red Hat's device driver disks). Here's + a Linux Journal article about it from 1996: + + http://www.linuxjournal.com/article/1213 + + It's not as popular as tar because the traditional cpio command line tools + require _truly_hideous_ command line arguments. But that says nothing + either way about the archive format, and there are alternative tools, + such as: + + http://freshmeat.net/projects/afio/ + +2) The cpio archive format chosen by the kernel is simpler and cleaner (and + thus easier to create and parse) than any of the (literally dozens of) + various tar archive formats. The complete initramfs archive format is + explained in buffer-format.txt, created in usr/gen_init_cpio.c, and + extracted in init/initramfs.c. All three together come to less than 26k + total of human-readable text. + +3) The GNU project standardizing on tar is approximately as relevant as + Windows standardizing on zip. Linux is not part of either, and is free + to make its own technical decisions. + +4) Since this is a kernel internal format, it could easily have been + something brand new. The kernel provides its own tools to create and + extract this format anyway. Using an existing standard was preferable, + but not essential. + +5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be + supported on the kernel side"): + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html + + explained his reasoning: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html + + and, most importantly, designed and implemented the initramfs code. + +Future directions: +------------------ + +Today (2.6.16), initramfs is always compiled in, but not always used. The +kernel falls back to legacy boot code that is reached only if initramfs does +not contain an /init program. The fallback is legacy code, there to ensure a +smooth transition and allowing early boot functionality to gradually move to +"early userspace" (I.E. initramfs). + +The move to early userspace is necessary because finding and mounting the real +root device is complex. Root partitions can span multiple devices (raid or +separate journal). They can be out on the network (requiring dhcp, setting a +specific MAC address, logging into a server, etc). They can live on removable +media, with dynamically allocated major/minor numbers and persistent naming +issues requiring a full udev implementation to sort out. They can be +compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, +and so on. + +This kind of complexity (which inevitably includes policy) is rightly handled +in userspace. Both klibc and busybox/uClibc are working on simple initramfs +packages to drop into a kernel build. + +The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. +The kernel's current early boot code (partition detection, etc) will probably +be migrated into a default initramfs, automatically created and used by the +kernel build. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt new file mode 100644 index 0000000..510b722 --- /dev/null +++ b/Documentation/filesystems/relay.txt @@ -0,0 +1,494 @@ +relay interface (formerly relayfs) +================================== + +The relay interface provides a means for kernel applications to +efficiently log and transfer large quantities of data from the kernel +to userspace via user-defined 'relay channels'. + +A 'relay channel' is a kernel->user data relay mechanism implemented +as a set of per-cpu kernel buffers ('channel buffers'), each +represented as a regular file ('relay file') in user space. Kernel +clients write into the channel buffers using efficient write +functions; these automatically log into the current cpu's channel +buffer. User space applications mmap() or read() from the relay files +and retrieve the data as it becomes available. The relay files +themselves are files created in a host filesystem, e.g. debugfs, and +are associated with the channel buffers using the API described below. + +The format of the data logged into the channel buffers is completely +up to the kernel client; the relay interface does however provide +hooks which allow kernel clients to impose some structure on the +buffer data. The relay interface doesn't implement any form of data +filtering - this also is left to the kernel client. The purpose is to +keep things as simple as possible. + +This document provides an overview of the relay interface API. The +details of the function parameters are documented along with the +functions in the relay interface code - please see that for details. + +Semantics +========= + +Each relay channel has one buffer per CPU, each buffer has one or more +sub-buffers. Messages are written to the first sub-buffer until it is +too full to contain a new message, in which case it it is written to +the next (if available). Messages are never split across sub-buffers. +At this point, userspace can be notified so it empties the first +sub-buffer, while the kernel continues writing to the next. + +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused space occurring because a complete +message couldn't fit into a sub-buffer. Userspace can use this +knowledge to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +A relay channel can operate in a mode where it will overwrite data not +yet collected by userspace, and not wait for it to be consumed. + +The relay channel itself does not provide for communication of such +data between userspace and kernel, allowing the kernel side to remain +simple and not impose a single interface on userspace. It does +provide a set of examples and a separate helper though, described +below. + +The read() interface both removes padding and internally consumes the +read sub-buffers; thus in cases where read(2) is being used to drain +the channel buffers, special-purpose communication between kernel and +user isn't necessary for basic operation. + +One of the major goals of the relay interface is to provide a low +overhead mechanism for conveying kernel data to userspace. While the +read() interface is easy to use, it's not as efficient as the mmap() +approach; the example code attempts to make the tradeoff between the +two approaches as small as possible. + +klog and relay-apps example code +================================ + +The relay interface itself is ready to use, but to make things easier, +a couple simple utility functions and a set of examples are provided. + +The relay-apps example tarball, available on the relay sourceforge +site, contains a set of self-contained examples, each consisting of a +pair of .c files containing boilerplate code for each of the user and +kernel sides of a relay application. When combined these two sets of +boilerplate code provide glue to easily stream data to disk, without +having to bother with mundane housekeeping chores. + +The 'klog debugging functions' patch (klog.patch in the relay-apps +tarball) provides a couple of high-level logging functions to the +kernel which allow writing formatted text or raw data to a channel, +regardless of whether a channel to write into exists or not, or even +whether the relay interface is compiled into the kernel or not. These +functions allow you to put unconditional 'trace' statements anywhere +in the kernel or kernel modules; only when there is a 'klog handler' +registered will data actually be logged (see the klog and kleak +examples for details). + +It is of course possible to use the relay interface from scratch, +i.e. without using any of the relay-apps example code or klog, but +you'll have to implement communication between userspace and kernel, +allowing both to convey the state of buffers (full, empty, amount of +padding). The read() interface both removes padding and internally +consumes the read sub-buffers; thus in cases where read(2) is being +used to drain the channel buffers, special-purpose communication +between kernel and user isn't necessary for basic operation. Things +such as buffer-full conditions would still need to be communicated via +some channel though. + +klog and the relay-apps examples can be found in the relay-apps +tarball on http://relayfs.sourceforge.net + +The relay interface user space API +================================== + +The relay interface implements basic file operations for user space +access to relay channel buffer data. Here are the file operations +that are available and some comments regarding their behavior: + +open() enables user to open an _existing_ channel buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you + must map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. The bytes read are + 'consumed' by the reader, i.e. they won't be available + again to subsequent reads. If the channel is being used + in no-overwrite mode (the default), it can be read at any + time even if there's an active kernel writer. If the + channel is being used in overwrite mode and there are + active channel writers, results may be unpredictable - + users should make sure that all logging to the channel has + ended before using read() with overwrite mode. Sub-buffer + padding is automatically removed and will not be seen by + the reader. + +sendfile() transfer data from a channel buffer to an output file + descriptor. Sub-buffer padding is automatically removed + and will not be seen by the reader. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0, i.e. when no process or kernel client has the + buffer open, the channel buffer is freed. + +In order for a user application to make use of relay files, the +host filesystem must be mounted. For example, + + mount -t debugfs debugfs /sys/kernel/debug + +NOTE: the host filesystem doesn't need to be mounted for kernel + clients to create or use channels - it only needs to be + mounted when user space applications need access to the buffer + data. + + +The relay interface kernel API +============================== + +Here's a summary of the API the relay interface provides to in-kernel clients: + +TBD(curr. line MT:/API/) + channel management functions: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks, private_data) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + + channel management typically called on instigation of userspace: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + create_buf_file(filename, parent, mode, buf, is_global) + remove_buf_file(dentry) + + helper functions: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the host filesystem, which can be and mmapped or +read from in user space. The files are named basename0...basenameN-1 +where N is the number of online cpus, and by default will be created +in the root of the filesystem (if the parent param is NULL). If you +want a directory structure to contain your relay files, you should +create it using the host filesystem's directory creation function, +e.g. debugfs_create_dir(), and pass the parent directory to +relay_open(). Users are responsible for cleaning up any directory +structure they create, when the channel is closed - again the host +filesystem's directory removal functions should be used for that, +e.g. debugfs_remove(). + +In order for a channel to be created and the host filesystem's files +associated with its channel buffers, the user must provide definitions +for two callback functions, create_buf_file() and remove_buf_file(). +create_buf_file() is called once for each per-cpu buffer from +relay_open() and allows the user to create the file which will be used +to represent the corresponding channel buffer. The callback should +return the dentry of the file created to represent the channel buffer. +remove_buf_file() must also be defined; it's responsible for deleting +the file(s) created in create_buf_file() and is called during +relay_close(). + +Here are some typical definitions for these callbacks, in this case +using debugfs: + +/* + * create_buf_file() callback. Creates relay file in debugfs. + */ +static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + int mode, + struct rchan_buf *buf, + int *is_global) +{ + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); +} + +/* + * remove_buf_file() callback. Removes relay file from debugfs. + */ +static int remove_buf_file_handler(struct dentry *dentry) +{ + debugfs_remove(dentry); + + return 0; +} + +/* + * relay interface callbacks + */ +static struct rchan_callbacks relay_callbacks = +{ + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, +}; + +And an example relay_open() invocation using them: + + chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); + +If the create_buf_file() callback fails, or isn't defined, channel +creation and thus relay_open() will fail. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +The create_buf_file() implementation can also be defined in such a way +as to allow the creation of a single 'global' buffer instead of the +default per-cpu set. This can be useful for applications interested +mainly in seeing the relative ordering of system-wide events without +the need to bother with saving explicit timestamps for the purpose of +merging/sorting per-cpu files in a postprocessing step. + +To have relay_open() create a global buffer, the create_buf_file() +implementation should set the value of the is_global outparam to a +non-zero value in addition to creating the file that will be used to +represent the single buffer. In the case of a global buffer, +create_buf_file() and remove_buf_file() will be called only once. The +normal channel-writing functions, e.g. relay_write(), can still be +used - writes from any cpu will transparently end up in the global +buffer - but since it is a global buffer, callers should make sure +they use the proper locking for such a buffer, either by wrapping +writes in a spinlock, or by copying a write function from relay.h and +creating a local version that internally does the proper locking. + +The private_data passed into relay_open() allows clients to associate +user-defined data with a channel, and is immediately available +(including in create_buf_file()) via chan->private_data or +buf->chan->private_data. + +Buffer-only channels +-------------------- + +These channels have no files associated and can be created with +relay_open(NULL, NULL, ...). Such channels are useful in scenarios such +as when doing early tracing in the kernel, before the VFS is up. In these +cases, one may open a buffer-only channel and then call +relay_late_setup_files() when the kernel is ready to handle files, +to expose the buffered data to the userspace. + +Channel 'modes' +--------------- + +relay channels can be used in either of two modes - 'overwrite' or +'no-overwrite'. The mode is entirely determined by the implementation +of the subbuf_start() callback, as described below. The default if no +subbuf_start() callback is defined is 'no-overwrite' mode. If the +default mode suits your needs, and you plan to use the read() +interface to retrieve channel data, you can ignore the details of this +section, as it pertains mainly to mmap() implementations. + +In 'overwrite' mode, also known as 'flight recorder' mode, writes +continuously cycle around the buffer and will never fail, but will +unconditionally overwrite old data regardless of whether it's actually +been consumed. In no-overwrite mode, writes will fail, i.e. data will +be lost, if the number of unconsumed sub-buffers equals the total +number of sub-buffers in the channel. It should be clear that if +there is no consumer or if the consumer can't consume sub-buffers fast +enough, data will be lost in either case; the only difference is +whether data is lost from the beginning or the end of a buffer. + +As explained above, a relay channel is made of up one or more +per-cpu channel buffers, each implemented as a circular buffer +subdivided into one or more sub-buffers. Messages are written into +the current sub-buffer of the channel's current per-cpu buffer via the +write functions described below. Whenever a message can't fit into +the current sub-buffer, because there's no room left for it, the +client is notified via the subbuf_start() callback that a switch to a +new sub-buffer is about to occur. The client uses this callback to 1) +initialize the next sub-buffer if appropriate 2) finalize the previous +sub-buffer if appropriate and 3) return a boolean value indicating +whether or not to actually move on to the next sub-buffer. + +To implement 'no-overwrite' mode, the userspace client would provide +an implementation of the subbuf_start() callback something like the +following: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + if (relay_buf_full(buf)) + return 0; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +If the current buffer is full, i.e. all sub-buffers remain unconsumed, +the callback returns 0 to indicate that the buffer switch should not +occur yet, i.e. until the consumer has had a chance to read the +current set of ready sub-buffers. For the relay_buf_full() function +to make sense, the consumer is responsible for notifying the relay +interface when sub-buffers have been consumed via +relay_subbufs_consumed(). Any subsequent attempts to write into the +buffer will again invoke the subbuf_start() callback with the same +parameters; only when the consumer has consumed one or more of the +ready sub-buffers will relay_buf_full() return 0, in which case the +buffer switch can continue. + +The implementation of the subbuf_start() callback for 'overwrite' mode +would be very similar: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +In this case, the relay_buf_full() check is meaningless and the +callback always returns 1, causing the buffer switch to occur +unconditionally. It's also meaningless for the client to use the +relay_subbufs_consumed() function in this mode, as it's never +consulted. + +The default subbuf_start() implementation, used if the client doesn't +define any callbacks, or doesn't define the subbuf_start() callback, +implements the simplest possible 'no-overwrite' mode, i.e. it does +nothing but return 0. + +Header information can be reserved at the beginning of each sub-buffer +by calling the subbuf_start_reserve() helper function from within the +subbuf_start() callback. This reserved area can be used to store +whatever information the client wants. In the example above, room is +reserved in each sub-buffer to store the padding count for that +sub-buffer. This is filled in for the previous sub-buffer in the +subbuf_start() implementation; the padding value for the previous +sub-buffer is passed into the subbuf_start() callback along with a +pointer to the previous sub-buffer, since the padding value isn't +known until a sub-buffer is filled. The subbuf_start() callback is +also called for the first sub-buffer when the channel is opened, to +give the client a chance to reserve space in it. In this case the +previous sub-buffer pointer passed into the callback will be NULL, so +the client should check the value of the prev_subbuf pointer before +writing into the previous sub-buffer. + +Writing to a channel +-------------------- + +Kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so, i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relay interface homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for the relay interface came about as a result of +discussions on tracing involving the following: + +Michel Dagenais <michel.dagenais@polymtl.ca> +Richard Moore <richardj_moore@uk.ibm.com> +Bob Wisniewski <bob@watson.ibm.com> +Karim Yaghmour <karim@opersys.com> +Tom Zanussi <zanussi@us.ibm.com> + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt new file mode 100644 index 0000000..2d2a7b2 --- /dev/null +++ b/Documentation/filesystems/romfs.txt @@ -0,0 +1,187 @@ +ROMFS - ROM FILE SYSTEM + +This is a quite dumb, read only filesystem, mainly for initial RAM +disks of installation disks. It has grown up by the need of having +modules linked at boot time. Using this filesystem, you get a very +similar feature, and even the possibility of a small kernel, with a +file system which doesn't take up useful memory from the router +functions in the basement of your office. + +For comparison, both the older minix and xiafs (the latter is now +defunct) filesystems, compiled as module need more than 20000 bytes, +while romfs is less than a page, about 4000 bytes (assuming i586 +code). Under the same conditions, the msdos filesystem would need +about 30K (and does not support device nodes or symlinks), while the +nfs module with nfsroot is about 57K. Furthermore, as a bit unfair +comparison, an actual rescue disk used up 3202 blocks with ext2, while +with romfs, it needed 3079 blocks. + +To create such a file system, you'll need a user program named +genromfs. It is available via anonymous ftp on sunsite.unc.edu and +its mirrors, in the /pub/Linux/system/recovery/ directory. + +As the name suggests, romfs could be also used (space-efficiently) on +various read-only media, like (E)EPROM disks if someone will have the +motivation.. :) + +However, the main purpose of romfs is to have a very small kernel, +which has only this filesystem linked in, and then can load any module +later, with the current module utilities. It can also be used to run +some program to decide if you need SCSI devices, and even IDE or +floppy drives can be loaded later if you use the "initrd"--initial +RAM disk--feature of the kernel. This would not be really news +flash, but with romfs, you can even spare off your ext2 or minix or +maybe even affs filesystem until you really know that you need it. + +For example, a distribution boot disk can contain only the cd disk +drivers (and possibly the SCSI drivers), and the ISO 9660 filesystem +module. The kernel can be small enough, since it doesn't have other +filesystems, like the quite large ext2fs module, which can then be +loaded off the CD at a later stage of the installation. Another use +would be for a recovery disk, when you are reinstalling a workstation +from the network, and you will have all the tools/modules available +from a nearby server, so you don't want to carry two disks for this +purpose, just because it won't fit into ext2. + +romfs operates on block devices as you can expect, and the underlying +structure is very simple. Every accessible structure begins on 16 +byte boundaries for fast access. The minimum space a file will take +is 32 bytes (this is an empty file, with a less than 16 character +name). The maximum overhead for any non-empty file is the header, and +the 16 byte padding for the name and the contents, also 16+14+15 = 45 +bytes. This is quite rare however, since most file names are longer +than 3 bytes, and shorter than 15 bytes. + +The layout of the filesystem is the following: + +offset content + + +---+---+---+---+ + 0 | - | r | o | m | \ + +---+---+---+---+ The ASCII representation of those bytes + 4 | 1 | f | s | - | / (i.e. "-rom1fs-") + +---+---+---+---+ + 8 | full size | The number of accessible bytes in this fs. + +---+---+---+---+ + 12 | checksum | The checksum of the FIRST 512 BYTES. + +---+---+---+---+ + 16 | volume name | The zero terminated name of the volume, + : : padded to 16 byte boundary. + +---+---+---+---+ + xx | file | + : headers : + +Every multi byte value (32 bit words, I'll use the longwords term from +now on) must be in big endian order. + +The first eight bytes identify the filesystem, even for the casual +inspector. After that, in the 3rd longword, it contains the number of +bytes accessible from the start of this filesystem. The 4th longword +is the checksum of the first 512 bytes (or the number of bytes +accessible, whichever is smaller). The applied algorithm is the same +as in the AFFS filesystem, namely a simple sum of the longwords +(assuming bigendian quantities again). For details, please consult +the source. This algorithm was chosen because although it's not quite +reliable, it does not require any tables, and it is very simple. + +The following bytes are now part of the file system; each file header +must begin on a 16 byte boundary. + +offset content + + +---+---+---+---+ + 0 | next filehdr|X| The offset of the next file header + +---+---+---+---+ (zero if no more files) + 4 | spec.info | Info for directories/hard links/devices + +---+---+---+---+ + 8 | size | The size of this file in bytes + +---+---+---+---+ + 12 | checksum | Covering the meta data, including the file + +---+---+---+---+ name, and padding + 16 | file name | The zero terminated name of the file, + : : padded to 16 byte boundary + +---+---+---+---+ + xx | file data | + : : + +Since the file headers begin always at a 16 byte boundary, the lowest +4 bits would be always zero in the next filehdr pointer. These four +bits are used for the mode information. Bits 0..2 specify the type of +the file; while bit 4 shows if the file is executable or not. The +permissions are assumed to be world readable, if this bit is not set, +and world executable if it is; except the character and block devices, +they are never accessible for other than owner. The owner of every +file is user and group 0, this should never be a problem for the +intended use. The mapping of the 8 possible values to file types is +the following: + + mapping spec.info means + 0 hard link link destination [file header] + 1 directory first file's header + 2 regular file unused, must be zero [MBZ] + 3 symbolic link unused, MBZ (file data is the link content) + 4 block device 16/16 bits major/minor number + 5 char device - " - + 6 socket unused, MBZ + 7 fifo unused, MBZ + +Note that hard links are specifically marked in this filesystem, but +they will behave as you can expect (i.e. share the inode number). +Note also that it is your responsibility to not create hard link +loops, and creating all the . and .. links for directories. This is +normally done correctly by the genromfs program. Please refrain from +using the executable bits for special purposes on the socket and fifo +special files, they may have other uses in the future. Additionally, +please remember that only regular files, and symlinks are supposed to +have a nonzero size field; they contain the number of bytes available +directly after the (padded) file name. + +Another thing to note is that romfs works on file headers and data +aligned to 16 byte boundaries, but most hardware devices and the block +device drivers are unable to cope with smaller than block-sized data. +To overcome this limitation, the whole size of the file system must be +padded to an 1024 byte boundary. + +If you have any problems or suggestions concerning this file system, +please contact me. However, think twice before wanting me to add +features and code, because the primary and most important advantage of +this file system is the small code. On the other hand, don't be +alarmed, I'm not getting that much romfs related mail. Now I can +understand why Avery wrote poems in the ARCnet docs to get some more +feedback. :) + +romfs has also a mailing list, and to date, it hasn't received any +traffic, so you are welcome to join it to discuss your ideas. :) + +It's run by ezmlm, so you can subscribe to it by sending a message +to romfs-subscribe@shadow.banki.hu, the content is irrelevant. + +Pending issues: + +- Permissions and owner information are pretty essential features of a +Un*x like system, but romfs does not provide the full possibilities. +I have never found this limiting, but others might. + +- The file system is read only, so it can be very small, but in case +one would want to write _anything_ to a file system, he still needs +a writable file system, thus negating the size advantages. Possible +solutions: implement write access as a compile-time option, or a new, +similarly small writable filesystem for RAM disks. + +- Since the files are only required to have alignment on a 16 byte +boundary, it is currently possibly suboptimal to read or execute files +from the filesystem. It might be resolved by reordering file data to +have most of it (i.e. except the start and the end) laying at "natural" +boundaries, thus it would be possible to directly map a big portion of +the file contents to the mm subsystem. + +- Compression might be an useful feature, but memory is quite a +limiting factor in my eyes. + +- Where it is used? + +- Does it work on other architectures than intel and motorola? + + +Have fun, +Janos Farkas <chexum@shadow.banki.hu> diff --git a/Documentation/filesystems/rpc-cache.txt b/Documentation/filesystems/rpc-cache.txt new file mode 100644 index 0000000..8a382be --- /dev/null +++ b/Documentation/filesystems/rpc-cache.txt @@ -0,0 +1,202 @@ + This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store. This is in the form of a + structure definition that must contain a + struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + The operations requires are: + struct cache_head *alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + void cache_put(struct kref *) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + int match(struct cache_head *orig, struct cache_head *new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + void init(struct cache_head *orig, struct cache_head *new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + void update(struct cache_head *orig, struct cache_head *new) + Set the 'content' fileds in 'new' from 'orig'. + int cache_show(struct seq_file *m, struct cache_detail *cd, + struct cache_head *h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + int cache_request(struct cache_detail *cd, struct cache_head *h, + char **bpp, int *blen) + Format a request to be send to user-space for an item + to be instantiated. *bpp is a buffer of size *blen. + bpp should be moved forward over the encoded message, + and *blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + int cache_parse(struct cache_detail *cd, char *buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup, and update the item + with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req *". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: + - a key + - an expiry time + - a content. +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has had not +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail. This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use it's own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +2/ otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt new file mode 100644 index 0000000..b843743 --- /dev/null +++ b/Documentation/filesystems/seq_file.txt @@ -0,0 +1,294 @@ +The seq_file interface + + Copyright 2003 Jonathan Corbet <corbet@lwn.net> + This file is originally from the LWN.net Driver Porting series at + http://lwn.net/Articles/driver-porting/ + + +There are numerous ways for a device driver (or other kernel component) to +provide information to the user or system administrator. One useful +technique is the creation of virtual files, in debugfs, /proc or elsewhere. +Virtual files can provide human-readable output that is easy to get at +without any special utility programs; they can also make life easier for +script writers. It is not surprising that the use of virtual files has +grown over the years. + +Creating those files correctly has always been a bit of a challenge, +however. It is not that hard to make a virtual file which returns a +string. But life gets trickier if the output is long - anything greater +than an application is likely to read in a single operation. Handling +multiple reads (and seeks) requires careful attention to the reader's +position within the virtual file - that position is, likely as not, in the +middle of a line of output. The kernel has traditionally had a number of +implementations that got this wrong. + +The 2.6 kernel contains a set of functions (implemented by Alexander Viro) +which are designed to make it easy for virtual file creators to get it +right. + +The seq_file interface is available via <linux/seq_file.h>. There are +three aspects to seq_file: + + * An iterator interface which lets a virtual file implementation + step through the objects it is presenting. + + * Some utility functions for formatting objects for output without + needing to worry about things like output buffers. + + * A set of canned file_operations which implement most operations on + the virtual file. + +We'll look at the seq_file interface via an extremely simple example: a +loadable module which creates a file called /proc/sequence. The file, when +read, simply produces a set of increasing integer values, one per line. The +sequence will continue until the user loses patience and finds something +better to do. The file is seekable, in that one can do something like the +following: + + dd if=/proc/sequence of=out1 count=1 + dd if=/proc/sequence skip=1 out=out2 count=1 + +Then concatenate the output files out1 and out2 and get the right +result. Yes, it is a thoroughly useless module, but the point is to show +how the mechanism works without getting lost in other details. (Those +wanting to see the full source for this module can find it at +http://lwn.net/Articles/22359/). + + +The iterator interface + +Modules implementing a virtual file with seq_file must implement a simple +iterator object that allows stepping through the data of interest. +Iterators must be able to move to a specific position - like the file they +implement - but the interpretation of that position is up to the iterator +itself. A seq_file implementation that is formatting firewall rules, for +example, could interpret position N as the Nth rule in the chain. +Positioning can thus be done in whatever way makes the most sense for the +generator of the data, which need not be aware of how a position translates +to an offset in the virtual file. The one obvious exception is that a +position of zero should indicate the beginning of the file. + +The /proc/sequence iterator just uses the count of the next number it +will output as its position. + +Four functions must be implemented to make the iterator work. The first, +called start() takes a position as an argument and returns an iterator +which will start reading at that position. For our simple sequence example, +the start() function looks like: + + static void *ct_seq_start(struct seq_file *s, loff_t *pos) + { + loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); + if (! spos) + return NULL; + *spos = *pos; + return spos; + } + +The entire data structure for this iterator is a single loff_t value +holding the current position. There is no upper bound for the sequence +iterator, but that will not be the case for most other seq_file +implementations; in most cases the start() function should check for a +"past end of file" condition and return NULL if need be. + +For more complicated applications, the private field of the seq_file +structure can be used. There is also a special value which can be returned +by the start() function called SEQ_START_TOKEN; it can be used if you wish +to instruct your show() function (described below) to print a header at the +top of the output. SEQ_START_TOKEN should only be used if the offset is +zero, however. + +The next function to implement is called, amazingly, next(); its job is to +move the iterator forward to the next position in the sequence. The +example module can simply increment the position by one; more useful +modules will do what is needed to step through some data structure. The +next() function returns a new iterator, or NULL if the sequence is +complete. Here's the example version: + + static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + loff_t *spos = v; + *pos = ++*spos; + return spos; + } + +The stop() function is called when iteration is complete; its job, of +course, is to clean up. If dynamic memory is allocated for the iterator, +stop() is the place to free it. + + static void ct_seq_stop(struct seq_file *s, void *v) + { + kfree(v); + } + +Finally, the show() function should format the object currently pointed to +by the iterator for output. The example module's show() function is: + + static int ct_seq_show(struct seq_file *s, void *v) + { + loff_t *spos = v; + seq_printf(s, "%lld\n", (long long)*spos); + return 0; + } + +If all is well, the show() function should return zero. A negative error +code in the usual manner indicates that something went wrong; it will be +passed back to user space. This function can also return SEQ_SKIP, which +causes the current item to be skipped; if the show() function has already +generated output before returning SEQ_SKIP, that output will be dropped. + +We will look at seq_printf() in a moment. But first, the definition of the +seq_file iterator is finished by creating a seq_operations structure with +the four functions we have just defined: + + static const struct seq_operations ct_seq_ops = { + .start = ct_seq_start, + .next = ct_seq_next, + .stop = ct_seq_stop, + .show = ct_seq_show + }; + +This structure will be needed to tie our iterator to the /proc file in +a little bit. + +It's worth noting that the iterator value returned by start() and +manipulated by the other functions is considered to be completely opaque by +the seq_file code. It can thus be anything that is useful in stepping +through the data to be output. Counters can be useful, but it could also be +a direct pointer into an array or linked list. Anything goes, as long as +the programmer is aware that things can happen between calls to the +iterator function. However, the seq_file code (by design) will not sleep +between the calls to start() and stop(), so holding a lock during that time +is a reasonable thing to do. The seq_file code will also avoid taking any +other locks while the iterator is active. + + +Formatted output + +The seq_file code manages positioning within the output created by the +iterator and getting it into the user's buffer. But, for that to work, that +output must be passed to the seq_file code. Some utility functions have +been defined which make this task easy. + +Most code will simply use seq_printf(), which works pretty much like +printk(), but which requires the seq_file pointer as an argument. It is +common to ignore the return value from seq_printf(), but a function +producing complicated output may want to check that value and quit if +something non-zero is returned; an error return means that the seq_file +buffer has been filled and further output will be discarded. + +For straight character output, the following functions may be used: + + int seq_putc(struct seq_file *m, char c); + int seq_puts(struct seq_file *m, const char *s); + int seq_escape(struct seq_file *m, const char *s, const char *esc); + +The first two output a single character and a string, just like one would +expect. seq_escape() is like seq_puts(), except that any character in s +which is in the string esc will be represented in octal form in the output. + +There is also a pair of functions for printing filenames: + + int seq_path(struct seq_file *m, struct path *path, char *esc); + int seq_path_root(struct seq_file *m, struct path *path, + struct path *root, char *esc) + +Here, path indicates the file of interest, and esc is a set of characters +which should be escaped in the output. A call to seq_path() will output +the path relative to the current process's filesystem root. If a different +root is desired, it can be used with seq_path_root(). Note that, if it +turns out that path cannot be reached from root, the value of root will be +changed in seq_file_root() to a root which *does* work. + + +Making it all work + +So far, we have a nice set of functions which can produce output within the +seq_file system, but we have not yet turned them into a file that a user +can see. Creating a file within the kernel requires, of course, the +creation of a set of file_operations which implement the operations on that +file. The seq_file interface provides a set of canned operations which do +most of the work. The virtual file author still must implement the open() +method, however, to hook everything up. The open function is often a single +line, as in the example module: + + static int ct_open(struct inode *inode, struct file *file) + { + return seq_open(file, &ct_seq_ops); + } + +Here, the call to seq_open() takes the seq_operations structure we created +before, and gets set up to iterate through the virtual file. + +On a successful open, seq_open() stores the struct seq_file pointer in +file->private_data. If you have an application where the same iterator can +be used for more than one file, you can store an arbitrary pointer in the +private field of the seq_file structure; that value can then be retrieved +by the iterator functions. + +The other operations of interest - read(), llseek(), and release() - are +all implemented by the seq_file code itself. So a virtual file's +file_operations structure will look like: + + static const struct file_operations ct_file_ops = { + .owner = THIS_MODULE, + .open = ct_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + +There is also a seq_release_private() which passes the contents of the +seq_file private field to kfree() before releasing the structure. + +The final step is the creation of the /proc file itself. In the example +code, that is done in the initialization code in the usual way: + + static int ct_init(void) + { + struct proc_dir_entry *entry; + + entry = create_proc_entry("sequence", 0, NULL); + if (entry) + entry->proc_fops = &ct_file_ops; + return 0; + } + + module_init(ct_init); + +And that is pretty much it. + + +seq_list + +If your file will be iterating through a linked list, you may find these +routines useful: + + struct list_head *seq_list_start(struct list_head *head, + loff_t pos); + struct list_head *seq_list_start_head(struct list_head *head, + loff_t pos); + struct list_head *seq_list_next(void *v, struct list_head *head, + loff_t *ppos); + +These helpers will interpret pos as a position within the list and iterate +accordingly. Your start() and next() functions need only invoke the +seq_list_* helpers with a pointer to the appropriate list_head structure. + + +The extra-simple version + +For extremely simple virtual files, there is an even easier interface. A +module can define only the show() function, which should create all the +output that the virtual file will contain. The file's open() method then +calls: + + int single_open(struct file *file, + int (*show)(struct seq_file *m, void *p), + void *data); + +When output time comes, the show() function will be called once. The data +value given to single_open() can be found in the private field of the +seq_file structure. When using single_open(), the programmer should use +single_release() instead of seq_release() in the file_operations structure +to avoid a memory leak. diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt new file mode 100644 index 0000000..7365400 --- /dev/null +++ b/Documentation/filesystems/sharedsubtree.txt @@ -0,0 +1,1061 @@ +Shared Subtrees +--------------- + +Contents: + 1) Overview + 2) Features + 3) smount command + 4) Use-case + 5) Detailed semantics + 6) Quiz + 7) FAQ + 8) Implementation + + +1) Overview +----------- + +Consider the following situation: + +A process wants to clone its own namespace, but still wants to access the CD +that got mounted recently. Shared subtree semantics provide the necessary +mechanism to accomplish the above. + +It provides the necessary building blocks for features like per-user-namespace +and versioned filesystem. + +2) Features +----------- + +Shared subtree provides four different flavors of mounts; struct vfsmount to be +precise + + a. shared mount + b. slave mount + c. private mount + d. unbindable mount + + +2a) A shared mount can be replicated to as many mountpoints and all the +replicas continue to be exactly same. + + Here is an example: + + Lets say /mnt has a mount that is shared. + mount --make-shared /mnt + + note: mount command does not yet support the --make-shared flag. + I have included a small C program which does the same by executing + 'smount /mnt shared' + + #mount --bind /mnt /tmp + The above command replicates the mount at /mnt to the mountpoint /tmp + and the contents of both the mounts remain identical. + + #ls /mnt + a b c + + #ls /tmp + a b c + + Now lets say we mount a device at /tmp/a + #mount /dev/sd0 /tmp/a + + #ls /tmp/a + t1 t2 t2 + + #ls /mnt/a + t1 t2 t2 + + Note that the mount has propagated to the mount at /mnt as well. + + And the same is true even when /dev/sd0 is mounted on /mnt/a. The + contents will be visible under /tmp/a too. + + +2b) A slave mount is like a shared mount except that mount and umount events + only propagate towards it. + + All slave mounts have a master mount which is a shared. + + Here is an example: + + Lets say /mnt has a mount which is shared. + #mount --make-shared /mnt + + Lets bind mount /mnt to /tmp + #mount --bind /mnt /tmp + + the new mount at /tmp becomes a shared mount and it is a replica of + the mount at /mnt. + + Now lets make the mount at /tmp; a slave of /mnt + #mount --make-slave /tmp + [or smount /tmp slave] + + lets mount /dev/sd0 on /mnt/a + #mount /dev/sd0 /mnt/a + + #ls /mnt/a + t1 t2 t3 + + #ls /tmp/a + t1 t2 t3 + + Note the mount event has propagated to the mount at /tmp + + However lets see what happens if we mount something on the mount at /tmp + + #mount /dev/sd1 /tmp/b + + #ls /tmp/b + s1 s2 s3 + + #ls /mnt/b + + Note how the mount event has not propagated to the mount at + /mnt + + +2c) A private mount does not forward or receive propagation. + + This is the mount we are familiar with. Its the default type. + + +2d) A unbindable mount is a unbindable private mount + + lets say we have a mount at /mnt and we make is unbindable + + #mount --make-unbindable /mnt + [ smount /mnt unbindable ] + + Lets try to bind mount this mount somewhere else. + # mount --bind /mnt /tmp + mount: wrong fs type, bad option, bad superblock on /mnt, + or too many mounted file systems + + Binding a unbindable mount is a invalid operation. + + +3) smount command + + Currently the mount command is not aware of shared subtree features. + Work is in progress to add the support in mount ( util-linux package ). + Till then use the following program. + + ------------------------------------------------------------------------ + // + //this code was developed my Miklos Szeredi <miklos@szeredi.hu> + //and modified by Ram Pai <linuxram@us.ibm.com> + // sample usage: + // smount /tmp shared + // + #include <stdio.h> + #include <stdlib.h> + #include <unistd.h> + #include <string.h> + #include <sys/mount.h> + #include <sys/fsuid.h> + + #ifndef MS_REC + #define MS_REC 0x4000 /* 16384: Recursive loopback */ + #endif + + #ifndef MS_SHARED + #define MS_SHARED 1<<20 /* Shared */ + #endif + + #ifndef MS_PRIVATE + #define MS_PRIVATE 1<<18 /* Private */ + #endif + + #ifndef MS_SLAVE + #define MS_SLAVE 1<<19 /* Slave */ + #endif + + #ifndef MS_UNBINDABLE + #define MS_UNBINDABLE 1<<17 /* Unbindable */ + #endif + + int main(int argc, char *argv[]) + { + int type; + if(argc != 3) { + fprintf(stderr, "usage: %s dir " + "<rshared|rslave|rprivate|runbindable|shared|slave" + "|private|unbindable>\n" , argv[0]); + return 1; + } + + fprintf(stdout, "%s %s %s\n", argv[0], argv[1], argv[2]); + + if (strcmp(argv[2],"rshared")==0) + type=(MS_SHARED|MS_REC); + else if (strcmp(argv[2],"rslave")==0) + type=(MS_SLAVE|MS_REC); + else if (strcmp(argv[2],"rprivate")==0) + type=(MS_PRIVATE|MS_REC); + else if (strcmp(argv[2],"runbindable")==0) + type=(MS_UNBINDABLE|MS_REC); + else if (strcmp(argv[2],"shared")==0) + type=MS_SHARED; + else if (strcmp(argv[2],"slave")==0) + type=MS_SLAVE; + else if (strcmp(argv[2],"private")==0) + type=MS_PRIVATE; + else if (strcmp(argv[2],"unbindable")==0) + type=MS_UNBINDABLE; + else { + fprintf(stderr, "invalid operation: %s\n", argv[2]); + return 1; + } + setfsuid(getuid()); + + if(mount("", argv[1], "dontcare", type, "") == -1) { + perror("mount"); + return 1; + } + return 0; + } + ----------------------------------------------------------------------- + + Copy the above code snippet into smount.c + gcc -o smount smount.c + + + (i) To mark all the mounts under /mnt as shared execute the following + command: + + smount /mnt rshared + the corresponding syntax planned for mount command is + mount --make-rshared /mnt + + just to mark a mount /mnt as shared, execute the following + command: + smount /mnt shared + the corresponding syntax planned for mount command is + mount --make-shared /mnt + + (ii) To mark all the shared mounts under /mnt as slave execute the + following + + command: + smount /mnt rslave + the corresponding syntax planned for mount command is + mount --make-rslave /mnt + + just to mark a mount /mnt as slave, execute the following + command: + smount /mnt slave + the corresponding syntax planned for mount command is + mount --make-slave /mnt + + (iii) To mark all the mounts under /mnt as private execute the + following command: + + smount /mnt rprivate + the corresponding syntax planned for mount command is + mount --make-rprivate /mnt + + just to mark a mount /mnt as private, execute the following + command: + smount /mnt private + the corresponding syntax planned for mount command is + mount --make-private /mnt + + NOTE: by default all the mounts are created as private. But if + you want to change some shared/slave/unbindable mount as + private at a later point in time, this command can help. + + (iv) To mark all the mounts under /mnt as unbindable execute the + following + + command: + smount /mnt runbindable + the corresponding syntax planned for mount command is + mount --make-runbindable /mnt + + just to mark a mount /mnt as unbindable, execute the following + command: + smount /mnt unbindable + the corresponding syntax planned for mount command is + mount --make-unbindable /mnt + + +4) Use cases +------------ + + A) A process wants to clone its own namespace, but still wants to + access the CD that got mounted recently. + + Solution: + + The system administrator can make the mount at /cdrom shared + mount --bind /cdrom /cdrom + mount --make-shared /cdrom + + Now any process that clones off a new namespace will have a + mount at /cdrom which is a replica of the same mount in the + parent namespace. + + So when a CD is inserted and mounted at /cdrom that mount gets + propagated to the other mount at /cdrom in all the other clone + namespaces. + + B) A process wants its mounts invisible to any other process, but + still be able to see the other system mounts. + + Solution: + + To begin with, the administrator can mark the entire mount tree + as shareable. + + mount --make-rshared / + + A new process can clone off a new namespace. And mark some part + of its namespace as slave + + mount --make-rslave /myprivatetree + + Hence forth any mounts within the /myprivatetree done by the + process will not show up in any other namespace. However mounts + done in the parent namespace under /myprivatetree still shows + up in the process's namespace. + + + Apart from the above semantics this feature provides the + building blocks to solve the following problems: + + C) Per-user namespace + + The above semantics allows a way to share mounts across + namespaces. But namespaces are associated with processes. If + namespaces are made first class objects with user API to + associate/disassociate a namespace with userid, then each user + could have his/her own namespace and tailor it to his/her + requirements. Offcourse its needs support from PAM. + + D) Versioned files + + If the entire mount tree is visible at multiple locations, then + a underlying versioning file system can return different + version of the file depending on the path used to access that + file. + + An example is: + + mount --make-shared / + mount --rbind / /view/v1 + mount --rbind / /view/v2 + mount --rbind / /view/v3 + mount --rbind / /view/v4 + + and if /usr has a versioning filesystem mounted, than that + mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and + /view/v4/usr too + + A user can request v3 version of the file /usr/fs/namespace.c + by accessing /view/v3/usr/fs/namespace.c . The underlying + versioning filesystem can then decipher that v3 version of the + filesystem is being requested and return the corresponding + inode. + +5) Detailed semantics: +------------------- + The section below explains the detailed semantics of + bind, rbind, move, mount, umount and clone-namespace operations. + + Note: the word 'vfsmount' and the noun 'mount' have been used + to mean the same thing, throughout this document. + +5a) Mount states + + A given mount can be in one of the following states + 1) shared + 2) slave + 3) shared and slave + 4) private + 5) unbindable + + A 'propagation event' is defined as event generated on a vfsmount + that leads to mount or unmount actions in other vfsmounts. + + A 'peer group' is defined as a group of vfsmounts that propagate + events to each other. + + (1) Shared mounts + + A 'shared mount' is defined as a vfsmount that belongs to a + 'peer group'. + + For example: + mount --make-shared /mnt + mount --bin /mnt /tmp + + The mount at /mnt and that at /tmp are both shared and belong + to the same peer group. Anything mounted or unmounted under + /mnt or /tmp reflect in all the other mounts of its peer + group. + + + (2) Slave mounts + + A 'slave mount' is defined as a vfsmount that receives + propagation events and does not forward propagation events. + + A slave mount as the name implies has a master mount from which + mount/unmount events are received. Events do not propagate from + the slave mount to the master. Only a shared mount can be made + a slave by executing the following command + + mount --make-slave mount + + A shared mount that is made as a slave is no more shared unless + modified to become shared. + + (3) Shared and Slave + + A vfsmount can be both shared as well as slave. This state + indicates that the mount is a slave of some vfsmount, and + has its own peer group too. This vfsmount receives propagation + events from its master vfsmount, and also forwards propagation + events to its 'peer group' and to its slave vfsmounts. + + Strictly speaking, the vfsmount is shared having its own + peer group, and this peer-group is a slave of some other + peer group. + + Only a slave vfsmount can be made as 'shared and slave' by + either executing the following command + mount --make-shared mount + or by moving the slave vfsmount under a shared vfsmount. + + (4) Private mount + + A 'private mount' is defined as vfsmount that does not + receive or forward any propagation events. + + (5) Unbindable mount + + A 'unbindable mount' is defined as vfsmount that does not + receive or forward any propagation events and cannot + be bind mounted. + + + State diagram: + The state diagram below explains the state transition of a mount, + in response to various commands. + ------------------------------------------------------------------------ + | |make-shared | make-slave | make-private |make-unbindab| + --------------|------------|--------------|--------------|-------------| + |shared |shared |*slave/private| private | unbindable | + | | | | | | + |-------------|------------|--------------|--------------|-------------| + |slave |shared | **slave | private | unbindable | + | |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |shared |shared | slave | private | unbindable | + |and slave |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |private |shared | **private | private | unbindable | + |-------------|------------|--------------|--------------|-------------| + |unbindable |shared |**unbindable | private | unbindable | + ------------------------------------------------------------------------ + + * if the shared mount is the only mount in its peer group, making it + slave, makes it private automatically. Note that there is no master to + which it can be slaved to. + + ** slaving a non-shared mount has no effect on the mount. + + Apart from the commands listed below, the 'move' operation also changes + the state of a mount depending on type of the destination mount. Its + explained in section 5d. + +5b) Bind semantics + + Consider the following command + + mount --bind A/a B/b + + where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B' + is the destination mount and 'b' is the dentry in the destination mount. + + The outcome depends on the type of mount of 'A' and 'B'. The table + below contains quick reference. + --------------------------------------------------------------------------- + | BIND MOUNT OPERATION | + |************************************************************************** + |source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************** + | shared | shared | shared | shared & slave | invalid | + | | | | | | + |non-shared| shared | private | slave | invalid | + *************************************************************************** + + Details: + + 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' + which is clone of 'A', is created. Its root dentry is 'a' . 'C' is + mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... + are created and mounted at the dentry 'b' on all mounts where 'B' + propagates to. A new propagation tree containing 'C1',..,'Cn' is + created. This propagation tree is identical to the propagation tree of + 'B'. And finally the peer-group of 'C' is merged with the peer group + of 'A'. + + 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' + which is clone of 'A', is created. Its root dentry is 'a'. 'C' is + mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... + are created and mounted at the dentry 'b' on all mounts where 'B' + propagates to. A new propagation tree is set containing all new mounts + 'C', 'C1', .., 'Cn' with exactly the same configuration as the + propagation tree for 'B'. + + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new + mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . + 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', + 'C3' ... are created and mounted at the dentry 'b' on all mounts where + 'B' propagates to. A new propagation tree containing the new mounts + 'C','C1',.. 'Cn' is created. This propagation tree is identical to the + propagation tree for 'B'. And finally the mount 'C' and its peer group + is made the slave of mount 'Z'. In other words, mount 'C' is in the + state 'slave and shared'. + + 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a + invalid operation. + + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + unbindable) mount. A new mount 'C' which is clone of 'A', is created. + Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. + + 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' + which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is + mounted on mount 'B' at dentry 'b'. 'C' is made a member of the + peer-group of 'A'. + + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A + new mount 'C' which is a clone of 'A' is created. Its root dentry is + 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a + slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of + 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But + mount/unmount on 'A' do not propagate anywhere else. Similarly + mount/unmount on 'C' do not propagate anywhere else. + + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a + invalid operation. A unbindable mount cannot be bind mounted. + +5c) Rbind semantics + + rbind is same as bind. Bind replicates the specified mount. Rbind + replicates all the mounts in the tree belonging to the specified mount. + Rbind mount is bind mount applied to all the mounts in the tree. + + If the source tree that is rbind has some unbindable mounts, + then the subtree under the unbindable mount is pruned in the new + location. + + eg: lets say we have the following mount tree. + + A + / \ + B C + / \ / \ + D E F G + + Lets say all the mount except the mount C in the tree are + of a type other than unbindable. + + If this tree is rbound to say Z + + We will have the following tree at the new location. + + Z + | + A' + / + B' Note how the tree under C is pruned + / \ in the new location. + D' E' + + + +5d) Move semantics + + Consider the following command + + mount --move A B/b + + where 'A' is the source mount, 'B' is the destination mount and 'b' is + the dentry in the destination mount. + + The outcome depends on the type of the mount of 'A' and 'B'. The table + below is a quick reference. + --------------------------------------------------------------------------- + | MOVE MOUNT OPERATION | + |************************************************************************** + | source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************** + | shared | shared | shared |shared and slave| invalid | + | | | | | | + |non-shared| shared | private | slave | unbindable | + *************************************************************************** + NOTE: moving a mount residing under a shared mount is invalid. + + Details follow: + + 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is + mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' + are created and mounted at dentry 'b' on all mounts that receive + propagation from mount 'B'. A new propagation tree is created in the + exact same configuration as that of 'B'. This new propagation tree + contains all the new mounts 'A1', 'A2'... 'An'. And this new + propagation tree is appended to the already existing propagation tree + of 'A'. + + 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is + mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' + are created and mounted at dentry 'b' on all mounts that receive + propagation from mount 'B'. The mount 'A' becomes a shared mount and a + propagation tree is created which is identical to that of + 'B'. This new propagation tree contains all the new mounts 'A1', + 'A2'... 'An'. + + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The + mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', + 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that + receive propagation from mount 'B'. A new propagation tree is created + in the exact same configuration as that of 'B'. This new propagation + tree contains all the new mounts 'A1', 'A2'... 'An'. And this new + propagation tree is appended to the already existing propagation tree of + 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also + becomes 'shared'. + + 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation + is invalid. Because mounting anything on the shared mount 'B' can + create new mounts that get mounted on the mounts that receive + propagation from 'B'. And since the mount 'A' is unbindable, cloning + it to mount at other mountpoints is not possible. + + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. + + 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' + is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a + shared mount. + + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. + The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' + continues to be a slave mount of mount 'Z'. + + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount + 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a + unbindable mount. + +5e) Mount semantics + + Consider the following command + + mount device B/b + + 'B' is the destination mount and 'b' is the dentry in the destination + mount. + + The above operation is the same as bind operation with the exception + that the source mount is always a private mount. + + +5f) Unmount semantics + + Consider the following command + + umount A + + where 'A' is a mount mounted on mount 'B' at dentry 'b'. + + If mount 'B' is shared, then all most-recently-mounted mounts at dentry + 'b' on mounts that receive propagation from mount 'B' and does not have + sub-mounts within them are unmounted. + + Example: Lets say 'B1', 'B2', 'B3' are shared mounts that propagate to + each other. + + lets say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount + 'B1', 'B2' and 'B3' respectively. + + lets say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on + mount 'B1', 'B2' and 'B3' respectively. + + if 'C1' is unmounted, all the mounts that are most-recently-mounted on + 'B1' and on the mounts that 'B1' propagates-to are unmounted. + + 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount + on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. + + So all 'C1', 'C2' and 'C3' should be unmounted. + + If any of 'C2' or 'C3' has some child mounts, then that mount is not + unmounted, but all other mounts are unmounted. However if 'C1' is told + to be unmounted and 'C1' has some sub-mounts, the umount operation is + failed entirely. + +5g) Clone Namespace + + A cloned namespace contains all the mounts as that of the parent + namespace. + + Lets say 'A' and 'B' are the corresponding mounts in the parent and the + child namespace. + + If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to + each other. + + If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of + 'Z'. + + If 'A' is a private mount, then 'B' is a private mount too. + + If 'A' is unbindable mount, then 'B' is a unbindable mount too. + + +6) Quiz + + A. What is the result of the following command sequence? + + mount --bind /mnt /mnt + mount --make-shared /mnt + mount --bind /mnt /tmp + mount --move /tmp /mnt/1 + + what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? + Should they all be identical? or should /mnt and /mnt/1 be + identical only? + + + B. What is the result of the following command sequence? + + mount --make-rshared / + mkdir -p /v/1 + mount --rbind / /v/1 + + what should be the content of /v/1/v/1 be? + + + C. What is the result of the following command sequence? + + mount --bind /mnt /mnt + mount --make-shared /mnt + mkdir -p /mnt/1/2/3 /mnt/1/test + mount --bind /mnt/1 /tmp + mount --make-slave /mnt + mount --make-shared /mnt + mount --bind /mnt/1/2 /tmp1 + mount --make-slave /mnt + + At this point we have the first mount at /tmp and + its root dentry is 1. Lets call this mount 'A' + And then we have a second mount at /tmp1 with root + dentry 2. Lets call this mount 'B' + Next we have a third mount at /mnt with root dentry + mnt. Lets call this mount 'C' + + 'B' is the slave of 'A' and 'C' is a slave of 'B' + A -> B -> C + + at this point if we execute the following command + + mount --bind /bin /tmp/test + + The mount is attempted on 'A' + + will the mount propagate to 'B' and 'C' ? + + what would be the contents of + /mnt/1/test be? + +7) FAQ + + Q1. Why is bind mount needed? How is it different from symbolic links? + symbolic links can get stale if the destination mount gets + unmounted or moved. Bind mounts continue to exist even if the + other mount is unmounted or moved. + + Q2. Why can't the shared subtree be implemented using exportfs? + + exportfs is a heavyweight way of accomplishing part of what + shared subtree can do. I cannot imagine a way to implement the + semantics of slave mount using exportfs? + + Q3 Why is unbindable mount needed? + + Lets say we want to replicate the mount tree at multiple + locations within the same subtree. + + if one rbind mounts a tree within the same subtree 'n' times + the number of mounts created is an exponential function of 'n'. + Having unbindable mount can help prune the unneeded bind + mounts. Here is a example. + + step 1: + lets say the root tree has just two directories with + one vfsmount. + root + / \ + tmp usr + + And we want to replicate the tree at multiple + mountpoints under /root/tmp + + step2: + mount --make-shared /root + + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this: + + root + / \ + tmp usr + / + m1 + / \ + tmp usr + / + m1 + + it has two vfsmounts + + step3: + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr + / \ / + m1 m2 m1 + / \ / \ + tmp usr tmp usr + / / \ + m1 m1 m2 + / \ + tmp usr + / \ + m1 m2 + + it has 6 vfsmounts + + step 4: + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 + + I wont' draw the tree..but it has 24 vfsmounts + + + at step i the number of vfsmounts is V[i] = i*V[i-1]. + This is an exponential function. And this tree has way more + mounts than what we really needed in the first place. + + One could use a series of umount at each step to prune + out the unneeded mounts. But there is a better solution. + Unclonable mounts come in handy here. + + step 1: + lets say the root tree has just two directories with + one vfsmount. + root + / \ + tmp usr + + How do we set up the same tree at multiple locations under + /root/tmp + + step2: + mount --bind /root/tmp /root/tmp + + mount --make-rshared /root + mount --make-unbindable /root/tmp + + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this: + + root + / \ + tmp usr + / + m1 + / \ + tmp usr + + step3: + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr + + step4: + + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ \ + m1 m2 m3 + / \ / \ / \ + tmp usr tmp usr tmp usr + +8) Implementation + +8A) Datastructure + + 4 new fields are introduced to struct vfsmount + ->mnt_share + ->mnt_slave_list + ->mnt_slave + ->mnt_master + + ->mnt_share links together all the mount to/from which this vfsmount + send/receives propagation events. + + ->mnt_slave_list links all the mounts to which this vfsmount propagates + to. + + ->mnt_slave links together all the slaves that its master vfsmount + propagates to. + + ->mnt_master points to the master vfsmount from which this vfsmount + receives propagation. + + ->mnt_flags takes two more flags to indicate the propagation status of + the vfsmount. MNT_SHARE indicates that the vfsmount is a shared + vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be + replicated. + + All the shared vfsmounts in a peer group form a cyclic list through + ->mnt_share. + + All vfsmounts with the same ->mnt_master form on a cyclic list anchored + in ->mnt_master->mnt_slave_list and going through ->mnt_slave. + + ->mnt_master can point to arbitrary (and possibly different) members + of master peer group. To find all immediate slaves of a peer group + you need to go through _all_ ->mnt_slave_list of its members. + Conceptually it's just a single set - distribution among the + individual lists does not affect propagation or the way propagation + tree is modified by operations. + + A example propagation tree looks as shown in the figure below. + [ NOTE: Though it looks like a forest, if we consider all the shared + mounts as a conceptual entity called 'pnode', it becomes a tree] + + + A <--> B <--> C <---> D + /|\ /| |\ + / F G J K H I + / + E<-->K + /|\ + M L N + + In the above figure A,B,C and D all are shared and propagate to each + other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave + mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. + 'E' is also shared with 'K' and they propagate to each other. And + 'K' has 3 slaves 'M', 'L' and 'N' + + A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' + + A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' + + E's ->mnt_share links with ->mnt_share of K + 'E', 'K', 'F', 'G' have their ->mnt_master point to struct + vfsmount of 'A' + 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' + K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' + + C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' + J and K's ->mnt_master points to struct vfsmount of C + and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' + 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. + + + NOTE: The propagation tree is orthogonal to the mount tree. + + +8B Algorithm: + + The crux of the implementation resides in rbind/move operation. + + The overall algorithm breaks the operation into 3 phases: (look at + attach_recursive_mnt() and propagate_mnt()) + + 1. prepare phase. + 2. commit phases. + 3. abort phases. + + Prepare phase: + + for each mount in the source tree: + a) Create the necessary number of mount trees to + be attached to each of the mounts that receive + propagation from the destination mount. + b) Do not attach any of the trees to its destination. + However note down its ->mnt_parent and ->mnt_mountpoint + c) Link all the new mounts to form a propagation tree that + is identical to the propagation tree of the destination + mount. + + If this phase is successful, there should be 'n' new + propagation trees; where 'n' is the number of mounts in the + source tree. Go to the commit phase + + Also there should be 'm' new mount trees, where 'm' is + the number of mounts to which the destination mount + propagates to. + + if any memory allocations fail, go to the abort phase. + + Commit phase + attach each of the mount trees to their corresponding + destination mounts. + + Abort phase + delete all the newly created trees. + + NOTE: all the propagation related functionality resides in the file + pnode.c + + +------------------------------------------------------------------------ + +version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) +version 0.2 (Incorporated comments from Al Viro) diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt new file mode 100644 index 0000000..f673ef0 --- /dev/null +++ b/Documentation/filesystems/smbfs.txt @@ -0,0 +1,8 @@ +Smbfs is a filesystem that implements the SMB protocol, which is the +protocol used by Windows for Workgroups, Windows 95 and Windows NT. +Smbfs was inspired by Samba, the program written by Andrew Tridgell +that turns any Unix host into a file server for DOS or Windows clients. + +Smbfs is a SMB client, but uses parts of samba for it's operation. For +more info on samba, including documentation, please go to +http://www.samba.org/ and then on to your nearest mirror. diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt new file mode 100644 index 0000000..1343d11 --- /dev/null +++ b/Documentation/filesystems/spufs.txt @@ -0,0 +1,521 @@ +SPUFS(2) Linux Programmer's Manual SPUFS(2) + + + +NAME + spufs - the SPU file system + + +DESCRIPTION + The SPU file system is used on PowerPC machines that implement the Cell + Broadband Engine Architecture in order to access Synergistic Processor + Units (SPUs). + + The file system provides a name space similar to posix shared memory or + message queues. Users that have write permissions on the file system + can use spu_create(2) to establish SPU contexts in the spufs root. + + Every SPU context is represented by a directory containing a predefined + set of files. These files can be used for manipulating the state of the + logical SPU. Users can change permissions on those files, but not actu- + ally add or remove files. + + +MOUNT OPTIONS + uid=<uid> + set the user owning the mount point, the default is 0 (root). + + gid=<gid> + set the group owning the mount point, the default is 0 (root). + + +FILES + The files in spufs mostly follow the standard behavior for regular sys- + tem calls like read(2) or write(2), but often support only a subset of + the operations supported on regular file systems. This list details the + supported operations and the deviations from the behaviour in the + respective man pages. + + All files that support the read(2) operation also support readv(2) and + all files that support the write(2) operation also support writev(2). + All files support the access(2) and stat(2) family of operations, but + only the st_mode, st_nlink, st_uid and st_gid fields of struct stat + contain reliable information. + + All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- + tions, but will not be able to grant permissions that contradict the + possible operations, e.g. read access on the wbox file. + + The current set of files is: + + + /mem + the contents of the local storage memory of the SPU. This can be + accessed like a regular shared memory file and contains both code and + data in the address space of the SPU. The possible operations on an + open mem file are: + + read(2), pread(2), write(2), pwrite(2), lseek(2) + These operate as documented, with the exception that seek(2), + write(2) and pwrite(2) are not supported beyond the end of the + file. The file size is the size of the local storage of the SPU, + which normally is 256 kilobytes. + + mmap(2) + Mapping mem into the process address space gives access to the + SPU local storage within the process address space. Only + MAP_SHARED mappings are allowed. + + + /mbox + The first SPU to CPU communication mailbox. This file is read-only and + can be read in units of 32 bits. The file can only be used in non- + blocking mode and it even poll() will not block on it. The possible + operations on an open mbox file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. If there is no data available in the mail + box, the return value is set to -1 and errno becomes EAGAIN. + When data has been read successfully, four bytes are placed in + the data buffer and the value four is returned. + + + /ibox + The second SPU to CPU communication mailbox. This file is similar to + the first mailbox file, but can be read in blocking I/O mode, and the + poll family of system calls can be used to wait for it. The possible + operations on an open ibox file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. If there is no data available in the mail + box and the file descriptor has been opened with O_NONBLOCK, the + return value is set to -1 and errno becomes EAGAIN. + + If there is no data available in the mail box and the file + descriptor has been opened without O_NONBLOCK, the call will + block until the SPU writes to its interrupt mailbox channel. + When data has been read successfully, four bytes are placed in + the data buffer and the value four is returned. + + poll(2) + Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever + data is available for reading. + + + /wbox + The CPU to SPU communation mailbox. It is write-only and can be written + in units of 32 bits. If the mailbox is full, write() will block and + poll can be used to wait for it becoming empty again. The possible + operations on an open wbox file are: write(2) If a count smaller than + four is requested, write returns -1 and sets errno to EINVAL. If there + is no space available in the mail box and the file descriptor has been + opened with O_NONBLOCK, the return value is set to -1 and errno becomes + EAGAIN. + + If there is no space available in the mail box and the file descriptor + has been opened without O_NONBLOCK, the call will block until the SPU + reads from its PPE mailbox channel. When data has been read success- + fully, four bytes are placed in the data buffer and the value four is + returned. + + poll(2) + Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever + space is available for writing. + + + /mbox_stat + /ibox_stat + /wbox_stat + Read-only files that contain the length of the current queue, i.e. how + many words can be read from mbox or ibox or how many words can be + written to wbox without blocking. The files can be read only in 4-byte + units and return a big-endian binary integer number. The possible + operations on an open *box_stat file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the number of elements that can be + read from (for mbox_stat and ibox_stat) or written to (for + wbox_stat) the respective mail box without blocking or resulting + in EAGAIN. + + + /npc + /decr + /decr_status + /spu_tag_mask + /event_mask + /srr0 + Internal registers of the SPU. The representation is an ASCII string + with the numeric value of the next instruction to be executed. These + can be used in read/write mode for debugging, but normal operation of + programs should not rely on them because access to any of them except + npc requires an SPU context save and is therefore very inefficient. + + The contents of these files are: + + npc Next Program Counter + + decr SPU Decrementer + + decr_status Decrementer Status + + spu_tag_mask MFC tag mask for SPU DMA + + event_mask Event mask for SPU interrupts + + srr0 Interrupt Return address register + + + The possible operations on an open npc, decr, decr_status, + spu_tag_mask, event_mask or srr0 file are: + + read(2) + When the count supplied to the read call is shorter than the + required length for the pointer value plus a newline character, + subsequent reads from the same file descriptor will result in + completing the string, regardless of changes to the register by + a running SPU task. When a complete string has been read, all + subsequent read operations will return zero bytes and a new file + descriptor needs to be opened to read the value again. + + write(2) + A write operation on the file results in setting the register to + the value given in the string. The string is parsed from the + beginning to the first non-numeric character or the end of the + buffer. Subsequent writes to the same file descriptor overwrite + the previous setting. + + + /fpcr + This file gives access to the Floating Point Status and Control Regis- + ter as a four byte long file. The operations on the fpcr file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the current value of the fpcr regis- + ter. + + write(2) + If a count smaller than four is requested, write returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is copied + from the data buffer, updating the value of the fpcr register. + + + /signal1 + /signal2 + The two signal notification channels of an SPU. These are read-write + files that operate on a 32 bit word. Writing to one of these files + triggers an interrupt on the SPU. The value written to the signal + files can be read from the SPU through a channel read or from host user + space through the file. After the value has been read by the SPU, it + is reset to zero. The possible operations on an open signal1 or sig- + nal2 file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the current value of the specified + signal notification register. + + write(2) + If a count smaller than four is requested, write returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is copied + from the data buffer, updating the value of the specified signal + notification register. The signal notification register will + either be replaced with the input data or will be updated to the + bitwise OR or the old value and the input data, depending on the + contents of the signal1_type, or signal2_type respectively, + file. + + + /signal1_type + /signal2_type + These two files change the behavior of the signal1 and signal2 notifi- + cation files. The contain a numerical ASCII string which is read as + either "1" or "0". In mode 0 (overwrite), the hardware replaces the + contents of the signal channel with the data that is written to it. in + mode 1 (logical OR), the hardware accumulates the bits that are subse- + quently written to it. The possible operations on an open signal1_type + or signal2_type file are: + + read(2) + When the count supplied to the read call is shorter than the + required length for the digit plus a newline character, subse- + quent reads from the same file descriptor will result in com- + pleting the string. When a complete string has been read, all + subsequent read operations will return zero bytes and a new file + descriptor needs to be opened to read the value again. + + write(2) + A write operation on the file results in setting the register to + the value given in the string. The string is parsed from the + beginning to the first non-numeric character or the end of the + buffer. Subsequent writes to the same file descriptor overwrite + the previous setting. + + +EXAMPLES + /etc/fstab entry + none /spu spufs gid=spu 0 0 + + +AUTHORS + Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, + Ulrich Weigand <Ulrich.Weigand@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) + + + +Linux 2005-09-28 SPUFS(2) + +------------------------------------------------------------------------------ + +SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) + + + +NAME + spu_run - execute an spu context + + +SYNOPSIS + #include <sys/spu.h> + + int spu_run(int fd, unsigned int *npc, unsigned int *event); + +DESCRIPTION + The spu_run system call is used on PowerPC machines that implement the + Cell Broadband Engine Architecture in order to access Synergistic Pro- + cessor Units (SPUs). It uses the fd that was returned from spu_cre- + ate(2) to address a specific SPU context. When the context gets sched- + uled to a physical SPU, it starts execution at the instruction pointer + passed in npc. + + Execution of SPU code happens synchronously, meaning that spu_run does + not return while the SPU is still running. If there is a need to exe- + cute SPU code in parallel with other code on either the main CPU or + other SPUs, you need to create a new thread of execution first, e.g. + using the pthread_create(3) call. + + When spu_run returns, the current value of the SPU instruction pointer + is written back to npc, so you can call spu_run again without updating + the pointers. + + event can be a NULL pointer or point to an extended status code that + gets filled when spu_run returns. It can be one of the following con- + stants: + + SPE_EVENT_DMA_ALIGNMENT + A DMA alignment error + + SPE_EVENT_SPE_DATA_SEGMENT + A DMA segmentation error + + SPE_EVENT_SPE_DATA_STORAGE + A DMA storage error + + If NULL is passed as the event argument, these errors will result in a + signal delivered to the calling process. + +RETURN VALUE + spu_run returns the value of the spu_status register or -1 to indicate + an error and set errno to one of the error codes listed below. The + spu_status register value contains a bit mask of status codes and + optionally a 14 bit code returned from the stop-and-signal instruction + on the SPU. The bit masks for the status codes are: + + 0x02 SPU was stopped by stop-and-signal. + + 0x04 SPU was stopped by halt. + + 0x08 SPU is waiting for a channel. + + 0x10 SPU is in single-step mode. + + 0x20 SPU has tried to execute an invalid instruction. + + 0x40 SPU has tried to access an invalid channel. + + 0x3fff0000 + The bits masked with this value contain the code returned from + stop-and-signal. + + There are always one or more of the lower eight bits set or an error + code is returned from spu_run. + +ERRORS + EAGAIN or EWOULDBLOCK + fd is in non-blocking mode and spu_run would block. + + EBADF fd is not a valid file descriptor. + + EFAULT npc is not a valid pointer or status is neither NULL nor a valid + pointer. + + EINTR A signal occurred while spu_run was in progress. The npc value + has been updated to the new program counter value if necessary. + + EINVAL fd is not a file descriptor returned from spu_create(2). + + ENOMEM Insufficient memory was available to handle a page fault result- + ing from an MFC direct memory access. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + +NOTES + spu_run is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +CONFORMING TO + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +BUGS + The code does not yet fully implement all features lined out here. + + +AUTHOR + Arnd Bergmann <arndb@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_create(2), spufs(7) + + + +Linux 2005-09-28 SPU_RUN(2) + +------------------------------------------------------------------------------ + +SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) + + + +NAME + spu_create - create a new spu context + + +SYNOPSIS + #include <sys/types.h> + #include <sys/spu.h> + + int spu_create(const char *pathname, int flags, mode_t mode); + +DESCRIPTION + The spu_create system call is used on PowerPC machines that implement + the Cell Broadband Engine Architecture in order to access Synergistic + Processor Units (SPUs). It creates a new logical context for an SPU in + pathname and returns a handle to associated with it. pathname must + point to a non-existing directory in the mount point of the SPU file + system (spufs). When spu_create is successful, a directory gets cre- + ated on pathname and it is populated with files. + + The returned file handle can only be passed to spu_run(2) or closed, + other operations are not defined on it. When it is closed, all associ- + ated directory entries in spufs are removed. When the last file handle + pointing either inside of the context directory or to this file + descriptor is closed, the logical SPU context is destroyed. + + The parameter flags can be zero or any bitwise or'd combination of the + following constants: + + SPU_RAWIO + Allow mapping of some of the hardware registers of the SPU into + user space. This flag requires the CAP_SYS_RAWIO capability, see + capabilities(7). + + The mode parameter specifies the permissions used for creating the new + directory in spufs. mode is modified with the user's umask(2) value + and then used for both the directory and the files contained in it. The + file permissions mask out some more bits of mode because they typically + support only read or write access. See stat(2) for a full list of the + possible mode values. + + +RETURN VALUE + spu_create returns a new file descriptor. It may return -1 to indicate + an error condition and set errno to one of the error codes listed + below. + + +ERRORS + EACCESS + The current user does not have write access on the spufs mount + point. + + EEXIST An SPU context already exists at the given path name. + + EFAULT pathname is not a valid string pointer in the current address + space. + + EINVAL pathname is not a directory in the spufs mount point. + + ELOOP Too many symlinks were found while resolving pathname. + + EMFILE The process has reached its maximum open file limit. + + ENAMETOOLONG + pathname was too long. + + ENFILE The system has reached the global open file limit. + + ENOENT Part of pathname could not be resolved. + + ENOMEM The kernel could not allocate all resources required. + + ENOSPC There are not enough SPU resources available to create a new + context or the user specific limit for the number of SPU con- + texts has been reached. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + ENOTDIR + A part of pathname is not a directory. + + + +NOTES + spu_create is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +FILES + pathname must point to a location beneath the mount point of spufs. By + convention, it gets mounted in /spu. + + +CONFORMING TO + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +BUGS + The code does not yet fully implement all features lined out here. + + +AUTHOR + Arnd Bergmann <arndb@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_run(2), spufs(7) + + + +Linux 2005-09-28 SPU_CREATE(2) diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt new file mode 100644 index 0000000..de99749 --- /dev/null +++ b/Documentation/filesystems/sysfs-pci.txt @@ -0,0 +1,107 @@ +Accessing PCI device resources through sysfs +-------------------------------------------- + +sysfs, usually mounted at /sys, provides access to PCI resources on platforms +that support it. For example, a given bus might look like this: + + /sys/devices/pci0000:17 + |-- 0000:17:00.0 + | |-- class + | |-- config + | |-- device + | |-- enable + | |-- irq + | |-- local_cpus + | |-- resource + | |-- resource0 + | |-- resource1 + | |-- resource2 + | |-- rom + | |-- subsystem_device + | |-- subsystem_vendor + | `-- vendor + `-- ... + +The topmost element describes the PCI domain and bus number. In this case, +the domain number is 0000 and the bus number is 17 (both values are in hex). +This bus contains a single function device in slot 0. The domain and bus +numbers are reproduced for convenience. Under the device directory are several +files, each with their own function. + + file function + ---- -------- + class PCI class (ascii, ro) + config PCI config space (binary, rw) + device PCI device (ascii, ro) + enable Whether the device is enabled (ascii, rw) + irq IRQ number (ascii, ro) + local_cpus nearby CPU mask (cpumask, ro) + resource PCI resource host addresses (ascii, ro) + resource0..N PCI resource N, if present (binary, mmap) + resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) + rom PCI ROM resource, if present (binary, ro) + subsystem_device PCI subsystem device (ascii, ro) + subsystem_vendor PCI subsystem vendor (ascii, ro) + vendor PCI vendor (ascii, ro) + + ro - read only file + rw - file is readable and writable + mmap - file is mmapable + ascii - file contains ascii text + binary - file contains binary data + cpumask - file contains a cpumask type + +The read only files are informational, writes to them will be ignored, with +the exception of the 'rom' file. Writable files can be used to perform +actions on the device (e.g. changing config space, detaching a device). +mmapable files are available via an mmap of the file at offset 0 and can be +used to do actual device programming from userspace. Note that some platforms +don't support mmapping of certain resources, so be sure to check the return +value from any attempted mmap. + +The 'enable' file provides a counter that indicates how many times the device +has been enabled. If the 'enable' file currently returns '4', and a '1' is +echoed into it, it will then return '5'. Echoing a '0' into it will decrease +the count. Even when it returns to 0, though, some of the initialisation +may not be reversed. + +The 'rom' file is special in that it provides read-only access to the device's +ROM file, if available. It's disabled by default, however, so applications +should write the string "1" to the file to enable it before attempting a read +call, and disable it following the access by writing "0" to the file. Note +that the device must be enabled for a rom read to return data succesfully. +In the event a driver is not bound to the device, it can be enabled using the +'enable' file, documented above. + +Accessing legacy resources through sysfs +---------------------------------------- + +Legacy I/O port and ISA memory resources are also provided in sysfs if the +underlying platform supports them. They're located in the PCI class hierarchy, +e.g. + + /sys/class/pci_bus/0000:17/ + |-- bridge -> ../../../devices/pci0000:17 + |-- cpuaffinity + |-- legacy_io + `-- legacy_mem + +The legacy_io file is a read/write file that can be used by applications to +do legacy port I/O. The application should open the file, seek to the desired +port (e.g. 0x3e8) and do a read or a write of 1, 2 or 4 bytes. The legacy_mem +file should be mmapped with an offset corresponding to the memory offset +desired, e.g. 0xa0000 for the VGA frame buffer. The application can then +simply dereference the returned pointer (after checking for errors of course) +to access legacy memory space. + +Supporting PCI access on new platforms +-------------------------------------- + +In order to support PCI resource mapping as described above, Linux platform +code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. +Platforms are free to only support subsets of the mmap functionality, but +useful return codes should be provided. + +Legacy resources are protected by the HAVE_PCI_LEGACY define. Platforms +wishing to support legacy functionality should define it and provide +pci_legacy_read, pci_legacy_write and pci_mmap_legacy_page_range functions. diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt new file mode 100644 index 0000000..9e9c348 --- /dev/null +++ b/Documentation/filesystems/sysfs.txt @@ -0,0 +1,357 @@ + +sysfs - _The_ filesystem for exporting kernel objects. + +Patrick Mochel <mochel@osdl.org> + +10 January 2003 + + +What it is: +~~~~~~~~~~~ + +sysfs is a ram-based filesystem initially based on ramfs. It provides +a means to export kernel data structures, their attributes, and the +linkages between them to userspace. + +sysfs is tied inherently to the kobject infrastructure. Please read +Documentation/kobject.txt for more information concerning the kobject +interface. + + +Using sysfs +~~~~~~~~~~~ + +sysfs is always compiled in. You can access it by doing: + + mount -t sysfs sysfs /sys + + +Directory Creation +~~~~~~~~~~~~~~~~~~ + +For every kobject that is registered with the system, a directory is +created for it in sysfs. That directory is created as a subdirectory +of the kobject's parent, expressing internal object hierarchies to +userspace. Top-level directories in sysfs represent the common +ancestors of object hierarchies; i.e. the subsystems the objects +belong to. + +Sysfs internally stores the kobject that owns the directory in the +->d_fsdata pointer of the directory's dentry. This allows sysfs to do +reference counting directly on the kobject when the file is opened and +closed. + + +Attributes +~~~~~~~~~~ + +Attributes can be exported for kobjects in the form of regular files in +the filesystem. Sysfs forwards file I/O operations to methods defined +for the attributes, providing a means to read and write kernel +attributes. + +Attributes should be ASCII text files, preferably with only one value +per file. It is noted that it may not be efficient to contain only one +value per file, so it is socially acceptable to express an array of +values of the same type. + +Mixing types, expressing multiple lines of data, and doing fancy +formatting of data is heavily frowned upon. Doing these things may get +you publically humiliated and your code rewritten without notice. + + +An attribute definition is simply: + +struct attribute { + char * name; + mode_t mode; +}; + + +int sysfs_create_file(struct kobject * kobj, struct attribute * attr); +void sysfs_remove_file(struct kobject * kobj, struct attribute * attr); + + +A bare attribute contains no means to read or write the value of the +attribute. Subsystems are encouraged to define their own attribute +structure and wrapper functions for adding and removing attributes for +a specific object type. + +For example, the driver model defines struct device_attribute like: + +struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); +}; + +int device_create_file(struct device *, struct device_attribute *); +void device_remove_file(struct device *, struct device_attribute *); + +It also defines this helper for defining device attributes: + +#define DEVICE_ATTR(_name, _mode, _show, _store) \ +struct device_attribute dev_attr_##_name = { \ + .attr = {.name = __stringify(_name) , .mode = _mode }, \ + .show = _show, \ + .store = _store, \ +}; + +For example, declaring + +static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); + +is equivalent to doing: + +static struct device_attribute dev_attr_foo = { + .attr = { + .name = "foo", + .mode = S_IWUSR | S_IRUGO, + }, + .show = show_foo, + .store = store_foo, +}; + + +Subsystem-Specific Callbacks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When a subsystem defines a new attribute type, it must implement a +set of sysfs operations for forwarding read and write calls to the +show and store methods of the attribute owners. + +struct sysfs_ops { + ssize_t (*show)(struct kobject *, struct attribute *, char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *); +}; + +[ Subsystems should have already defined a struct kobj_type as a +descriptor for this type, which is where the sysfs_ops pointer is +stored. See the kobject documentation for more information. ] + +When a file is read or written, sysfs calls the appropriate method +for the type. The method then translates the generic struct kobject +and struct attribute pointers to the appropriate pointer types, and +calls the associated methods. + + +To illustrate: + +#define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) +#define to_dev(d) container_of(d, struct device, kobj) + +static ssize_t +dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) +{ + struct device_attribute * dev_attr = to_dev_attr(attr); + struct device * dev = to_dev(kobj); + ssize_t ret = 0; + + if (dev_attr->show) + ret = dev_attr->show(dev, buf); + return ret; +} + + + +Reading/Writing Attribute Data +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To read or write attributes, show() or store() methods must be +specified when declaring the attribute. The method types should be as +simple as those defined for device attributes: + + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); + +IOW, they should take only an object and a buffer as parameters. + + +sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the +method. Sysfs will call the method exactly once for each read or +write. This forces the following behavior on the method +implementations: + +- On read(2), the show() method should fill the entire buffer. + Recall that an attribute should only be exporting one value, or an + array of similar values, so this shouldn't be that expensive. + + This allows userspace to do partial reads and forward seeks + arbitrarily over the entire file at will. If userspace seeks back to + zero or does a pread(2) with an offset of '0' the show() method will + be called again, rearmed, to fill the buffer. + +- On write(2), sysfs expects the entire buffer to be passed during the + first write. Sysfs then passes the entire buffer to the store() + method. + + When writing sysfs files, userspace processes should first read the + entire file, modify the values it wishes to change, then write the + entire buffer back. + + Attribute method implementations should operate on an identical + buffer when reading and writing values. + +Other notes: + +- Writing causes the show() method to be rearmed regardless of current + file position. + +- The buffer will always be PAGE_SIZE bytes in length. On i386, this + is 4096. + +- show() methods should return the number of bytes printed into the + buffer. This is the return value of snprintf(). + +- show() should always use snprintf(). + +- store() should return the number of bytes used from the buffer. This + can be done using strlen(). + +- show() or store() can always return errors. If a bad value comes + through, be sure to return an error. + +- The object passed to the methods will be pinned in memory via sysfs + referencing counting its embedded object. However, the physical + entity (e.g. device) the object represents may not be present. Be + sure to have a way to check this, if necessary. + + +A very simple (and naive) implementation of a device attribute is: + +static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf) +{ + return snprintf(buf, PAGE_SIZE, "%s\n", dev->name); +} + +static ssize_t store_name(struct device * dev, const char * buf) +{ + sscanf(buf, "%20s", dev->name); + return strnlen(buf, PAGE_SIZE); +} + +static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); + + +(Note that the real implementation doesn't allow userspace to set the +name for a device.) + + +Top Level Directory Layout +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The sysfs directory arrangement exposes the relationship of kernel +data structures. + +The top level sysfs directory looks like: + +block/ +bus/ +class/ +dev/ +devices/ +firmware/ +net/ +fs/ + +devices/ contains a filesystem representation of the device tree. It maps +directly to the internal kernel device tree, which is a hierarchy of +struct device. + +bus/ contains flat directory layout of the various bus types in the +kernel. Each bus's directory contains two subdirectories: + + devices/ + drivers/ + +devices/ contains symlinks for each device discovered in the system +that point to the device's directory under root/. + +drivers/ contains a directory for each device driver that is loaded +for devices on that particular bus (this assumes that drivers do not +span multiple bus types). + +fs/ contains a directory for some filesystems. Currently each +filesystem wanting to export attributes must create its own hierarchy +below fs/ (see ./fuse.txt for an example). + +dev/ contains two directories char/ and block/. Inside these two +directories there are symlinks named <major>:<minor>. These symlinks +point to the sysfs directory for the given device. /sys/dev provides a +quick way to lookup the sysfs interface for a device from the result of +a stat(2) operation. + +More information can driver-model specific features can be found in +Documentation/driver-model/. + + +TODO: Finish this section. + + +Current Interfaces +~~~~~~~~~~~~~~~~~~ + +The following interface layers currently exist in sysfs: + + +- devices (include/linux/device.h) +---------------------------------- +Structure: + +struct device_attribute { + struct attribute attr; + ssize_t (*show)(struct device * dev, char * buf); + ssize_t (*store)(struct device * dev, const char * buf); +}; + +Declaring: + +DEVICE_ATTR(_name, _str, _mode, _show, _store); + +Creation/Removal: + +int device_create_file(struct device *device, struct device_attribute * attr); +void device_remove_file(struct device * dev, struct device_attribute * attr); + + +- bus drivers (include/linux/device.h) +-------------------------------------- +Structure: + +struct bus_attribute { + struct attribute attr; + ssize_t (*show)(struct bus_type *, char * buf); + ssize_t (*store)(struct bus_type *, const char * buf); +}; + +Declaring: + +BUS_ATTR(_name, _mode, _show, _store) + +Creation/Removal: + +int bus_create_file(struct bus_type *, struct bus_attribute *); +void bus_remove_file(struct bus_type *, struct bus_attribute *); + + +- device drivers (include/linux/device.h) +----------------------------------------- + +Structure: + +struct driver_attribute { + struct attribute attr; + ssize_t (*show)(struct device_driver *, char * buf); + ssize_t (*store)(struct device_driver *, const char * buf); +}; + +Declaring: + +DRIVER_ATTR(_name, _mode, _show, _store) + +Creation/Removal: + +int driver_create_file(struct device_driver *, struct driver_attribute *); +void driver_remove_file(struct device_driver *, struct driver_attribute *); + + diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt new file mode 100644 index 0000000..253b50d --- /dev/null +++ b/Documentation/filesystems/sysv-fs.txt @@ -0,0 +1,197 @@ +It implements all of + - Xenix FS, + - SystemV/386 FS, + - Coherent FS. + +To install: +* Answer the 'System V and Coherent filesystem support' question with 'y' + when configuring the kernel. +* To mount a disk or a partition, use + mount [-r] -t sysv device mountpoint + The file system type names + -t sysv + -t xenix + -t coherent + may be used interchangeably, but the last two will eventually disappear. + +Bugs in the present implementation: +- Coherent FS: + - The "free list interleave" n:m is currently ignored. + - Only file systems with no filesystem name and no pack name are recognized. + (See Coherent "man mkfs" for a description of these features.) +- SystemV Release 2 FS: + The superblock is only searched in the blocks 9, 15, 18, which + corresponds to the beginning of track 1 on floppy disks. No support + for this FS on hard disk yet. + + +These filesystems are rather similar. Here is a comparison with Minix FS: + +* Linux fdisk reports on partitions + - Minix FS 0x81 Linux/Minix + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS 0x08 AIX bootable + +* Size of a block or zone (data allocation unit on disk) + - Minix FS 1024 + - Xenix FS 1024 (also 512 ??) + - SystemV FS 1024 (also 512 and 2048) + - Coherent FS 512 + +* General layout: all have one boot block, one super block and + separate areas for inodes and for directories/data. + On SystemV Release 2 FS (e.g. Microport) the first track is reserved and + all the block numbers (including the super block) are offset by one track. + +* Byte ordering of "short" (16 bit entities) on disk: + - Minix FS little endian 0 1 + - Xenix FS little endian 0 1 + - SystemV FS little endian 0 1 + - Coherent FS little endian 0 1 + Of course, this affects only the file system, not the data of files on it! + +* Byte ordering of "long" (32 bit entities) on disk: + - Minix FS little endian 0 1 2 3 + - Xenix FS little endian 0 1 2 3 + - SystemV FS little endian 0 1 2 3 + - Coherent FS PDP-11 2 3 0 1 + Of course, this affects only the file system, not the data of files on it! + +* Inode on disk: "short", 0 means non-existent, the root dir ino is: + - Minix FS 1 + - Xenix FS, SystemV FS, Coherent FS 2 + +* Maximum number of hard links to a file: + - Minix FS 250 + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS >=10000 + +* Free inode management: + - Minix FS a bitmap + - Xenix FS, SystemV FS, Coherent FS + There is a cache of a certain number of free inodes in the super-block. + When it is exhausted, new free inodes are found using a linear search. + +* Free block management: + - Minix FS a bitmap + - Xenix FS, SystemV FS, Coherent FS + Free blocks are organized in a "free list". Maybe a misleading term, + since it is not true that every free block contains a pointer to + the next free block. Rather, the free blocks are organized in chunks + of limited size, and every now and then a free block contains pointers + to the free blocks pertaining to the next chunk; the first of these + contains pointers and so on. The list terminates with a "block number" + 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. + +* Super-block location: + - Minix FS block 1 = bytes 1024..2047 + - Xenix FS block 1 = bytes 1024..2047 + - SystemV FS bytes 512..1023 + - Coherent FS block 1 = bytes 512..1023 + +* Super-block layout: + - Minix FS + unsigned short s_ninodes; + unsigned short s_nzones; + unsigned short s_imap_blocks; + unsigned short s_zmap_blocks; + unsigned short s_firstdatazone; + unsigned short s_log_zone_size; + unsigned long s_max_size; + unsigned short s_magic; + - Xenix FS, SystemV FS, Coherent FS + unsigned short s_firstdatazone; + unsigned long s_nzones; + unsigned short s_fzone_count; + unsigned long s_fzones[NICFREE]; + unsigned short s_finode_count; + unsigned short s_finodes[NICINOD]; + char s_flock; + char s_ilock; + char s_modified; + char s_rdonly; + unsigned long s_time; + short s_dinfo[4]; -- SystemV FS only + unsigned long s_free_zones; + unsigned short s_free_inodes; + short s_dinfo[4]; -- Xenix FS only + unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only + char s_fname[6]; + char s_fpack[6]; + then they differ considerably: + Xenix FS + char s_clean; + char s_fill[371]; + long s_magic; + long s_type; + SystemV FS + long s_fill[12 or 14]; + long s_state; + long s_magic; + long s_type; + Coherent FS + unsigned long s_unique; + Note that Coherent FS has no magic. + +* Inode layout: + - Minix FS + unsigned short i_mode; + unsigned short i_uid; + unsigned long i_size; + unsigned long i_time; + unsigned char i_gid; + unsigned char i_nlinks; + unsigned short i_zone[7+1+1]; + - Xenix FS, SystemV FS, Coherent FS + unsigned short i_mode; + unsigned short i_nlink; + unsigned short i_uid; + unsigned short i_gid; + unsigned long i_size; + unsigned char i_zone[3*(10+1+1+1)]; + unsigned long i_atime; + unsigned long i_mtime; + unsigned long i_ctime; + +* Regular file data blocks are organized as + - Minix FS + 7 direct blocks + 1 indirect block (pointers to blocks) + 1 double-indirect block (pointer to pointers to blocks) + - Xenix FS, SystemV FS, Coherent FS + 10 direct blocks + 1 indirect block (pointers to blocks) + 1 double-indirect block (pointer to pointers to blocks) + 1 triple-indirect block (pointer to pointers to pointers to blocks) + +* Inode size, inodes per block + - Minix FS 32 32 + - Xenix FS 64 16 + - SystemV FS 64 16 + - Coherent FS 64 8 + +* Directory entry on disk + - Minix FS + unsigned short inode; + char name[14/30]; + - Xenix FS, SystemV FS, Coherent FS + unsigned short inode; + char name[14]; + +* Dir entry size, dir entries per block + - Minix FS 16/32 64/32 + - Xenix FS 16 64 + - SystemV FS 16 64 + - Coherent FS 16 32 + +* How to implement symbolic links such that the host fsck doesn't scream: + - Minix FS normal + - Xenix FS kludge: as regular files with chmod 1000 + - SystemV FS ?? + - Coherent FS kludge: as regular files with chmod 1000 + + +Notation: We often speak of a "block" but mean a zone (the allocation unit) +and not the disk driver's notion of "block". diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt new file mode 100644 index 0000000..222437e --- /dev/null +++ b/Documentation/filesystems/tmpfs.txt @@ -0,0 +1,136 @@ +Tmpfs is a file system which keeps all files in virtual memory. + + +Everything in tmpfs is temporary in the sense that no files will be +created on your hard drive. If you unmount a tmpfs instance, +everything stored therein is lost. + +tmpfs puts everything into the kernel internal caches and grows and +shrinks to accommodate the files it contains and is able to swap +unneeded pages out to swap space. It has maximum size limits which can +be adjusted on the fly via 'mount -o remount ...' + +If you compare it to ramfs (which was the template to create tmpfs) +you gain swapping and limit checking. Another similar thing is the RAM +disk (/dev/ram*), which simulates a fixed size hard disk in physical +RAM, where you have to create an ordinary filesystem on top. Ramdisks +cannot swap and you do not have the possibility to resize them. + +Since tmpfs lives completely in the page cache and on swap, all tmpfs +pages currently in memory will show up as cached. It will not show up +as shared or something like that. Further on you can check the actual +RAM+swap use of a tmpfs instance with df(1) and du(1). + + +tmpfs has the following uses: + +1) There is always a kernel internal mount which you will not see at + all. This is used for shared anonymous mappings and SYSV shared + memory. + + This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not + set, the user visible part of tmpfs is not build. But the internal + mechanisms are always present. + +2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for + POSIX shared memory (shm_open, shm_unlink). Adding the following + line to /etc/fstab should take care of this: + + tmpfs /dev/shm tmpfs defaults 0 0 + + Remember to create the directory that you intend to mount tmpfs on + if necessary. + + This mount is _not_ needed for SYSV shared memory. The internal + mount is used for that. (In the 2.3 kernel versions it was + necessary to mount the predecessor of tmpfs (shm fs) to use SYSV + shared memory) + +3) Some people (including me) find it very convenient to mount it + e.g. on /tmp and /var/tmp and have a big swap partition. And now + loop mounts of tmpfs files do work, so mkinitrd shipped by most + distributions should succeed with a tmpfs /tmp. + +4) And probably a lot more I do not know about :-) + + +tmpfs has three mount options for sizing: + +size: The limit of allocated bytes for this tmpfs instance. The + default is half of your physical RAM without swap. If you + oversize your tmpfs instances the machine will deadlock + since the OOM handler will not be able to free that memory. +nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE. +nr_inodes: The maximum number of inodes for this instance. The default + is half of the number of your physical RAM pages, or (on a + machine with highmem) the number of lowmem RAM pages, + whichever is the lower. + +These parameters accept a suffix k, m or g for kilo, mega and giga and +can be changed on remount. The size parameter also accepts a suffix % +to limit this tmpfs instance to that percentage of your physical RAM: +the default, when neither size nor nr_blocks is specified, is size=50% + +If nr_blocks=0 (or size=0), blocks will not be limited in that instance; +if nr_inodes=0, inodes will not be limited. It is generally unwise to +mount with such options, since it allows any user with write access to +use up all the memory on the machine; but enhances the scalability of +that instance in a system with many cpus making intensive use of it. + + +tmpfs has a mount option to set the NUMA memory allocation policy for +all files in that instance (if CONFIG_NUMA is enabled) - which can be +adjusted on the fly via 'mount -o remount ...' + +mpol=default prefers to allocate memory from the local node +mpol=prefer:Node prefers to allocate memory from the given Node +mpol=bind:NodeList allocates memory only from nodes in NodeList +mpol=interleave prefers to allocate from each node in turn +mpol=interleave:NodeList allocates from each node of NodeList in turn + +NodeList format is a comma-separated list of decimal numbers and ranges, +a range being two hyphen-separated decimal numbers, the smallest and +largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 + +NUMA memory allocation policies have optional flags that can be used in +conjunction with their modes. These optional flags can be specified +when tmpfs is mounted by appending them to the mode before the NodeList. +See Documentation/vm/numa_memory_policy.txt for a list of all available +memory allocation policy mode flags. + + =static is equivalent to MPOL_F_STATIC_NODES + =relative is equivalent to MPOL_F_RELATIVE_NODES + +For example, mpol=bind=static:NodeList, is the equivalent of an +allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. + +Note that trying to mount a tmpfs with an mpol option will fail if the +running kernel does not support NUMA; and will fail if its nodelist +specifies a node which is not online. If your system relies on that +tmpfs being mounted, but from time to time runs a kernel built without +NUMA capability (perhaps a safe recovery kernel), or with fewer nodes +online, then it is advisable to omit the mpol option from automatic +mount options. It can be added later, when the tmpfs is already mounted +on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. + + +To specify the initial root directory you can use the following mount +options: + +mode: The permissions as an octal number +uid: The user id +gid: The group id + +These options do not have any effect on remount. You can change these +parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. + + +So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' +will give you tmpfs instance on /mytmpfs which can allocate 10GB +RAM/SWAP in 10240 inodes and it is only accessible by root. + + +Author: + Christoph Rohland <cr@sap.com>, 1.12.01 +Updated: + Hugh Dickins <hugh@veritas.com>, 4 June 2007 diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt new file mode 100644 index 0000000..dd84ea3 --- /dev/null +++ b/Documentation/filesystems/ubifs.txt @@ -0,0 +1,173 @@ +Introduction +============= + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of + rather large size, typically about 128KiB. Block devices consist of + small blocks, typically 512 bytes. +2 MTD devices support 3 main operations - read from some offset within an + eraseblock, write to some offset within an eraseblock, and erase a whole + eraseblock. Block devices support 2 main operations - read a whole + block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to + re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - + typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC + NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should + deal with this. Blocks on hard drives typically do not become bad, + because hardware has mechanisms to substitute bad blocks, at least in + modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on + top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, + which requires full media scan. UBIFS maintains the FS indexing + information on the flash media and does not require full media scan, + so it mounts many times faster than JFFS2. +* JFFS2 is a write-through file-system, while UBIFS supports write-back, + which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like fsck.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. + +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +norm_unmount (*) commit on unmount; the journal is committed + when the file-system is unmounted so that the + next mount does not have to replay the journal + and it becomes very fast; +fast_unmount do not commit on unmount; this option makes + unmount faster, but the next mount slower + because of the need to replay the journal. +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc (*) do not skip checking CRCs on data nodes + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs: +$ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name): +$ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + + +Module Parameters for Debugging +=============================== + +When UBIFS has been compiled with debugging enabled, there are 3 module +parameters that are available to control aspects of testing and debugging. +The parameters are unsigned integers where each bit controls an option. +The parameters are: + +debug_msgs Selects which debug messages to display, as follows: + + Message Type Flag value + + General messages 1 + Journal messages 2 + Mount messages 4 + Commit messages 8 + LEB search messages 16 + Budgeting messages 32 + Garbage collection messages 64 + Tree Node Cache (TNC) messages 128 + LEB properties (lprops) messages 256 + Input/output messages 512 + Log messages 1024 + Scan messages 2048 + Recovery messages 4096 + +debug_chks Selects extra checks that UBIFS can do while running: + + Check Flag value + + General checks 1 + Check Tree Node Cache (TNC) 2 + Check indexing tree size 4 + Check orphan area 8 + Check old indexing tree 16 + Check LEB properties (lprops) 32 + Check leaf nodes and inodes 64 + +debug_tsts Selects a mode of testing, as follows: + + Test mode Flag value + + Force in-the-gaps method 2 + Failure mode for recovery testing 4 + +For example, set debug_msgs to 5 to display General messages and Mount +messages. + + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: +http://www.linux-mtd.infradead.org/doc/ubifs.html +http://www.linux-mtd.infradead.org/faq/ubifs.html diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt new file mode 100644 index 0000000..fde829a --- /dev/null +++ b/Documentation/filesystems/udf.txt @@ -0,0 +1,80 @@ +* +* Documentation/filesystems/udf.txt +* +UDF Filesystem version 0.9.8.1 + +If you encounter problems with reading UDF discs using this driver, +please report them to linux_udf@hpesjro.fc.hp.com, which is the +developer's list. + +Write support requires a block driver which supports writing. Currently +dvd+rw drives and media support true random sector writes, and so a udf +filesystem on such devices can be directly mounted read/write. CD-RW +media however, does not support this. Instead the media can be formatted +for packet mode using the utility cdrwtool, then the pktcdvd driver can +be bound to the underlying cd device to provide the required buffering +and read-modify-write cycles to allow the filesystem random sector writes +while providing the hardware with only full packet writes. While not +required for dvd+rw media, use of the pktcdvd driver often enhances +performance due to very poor read-modify-write support supplied internally +by drive firmware. + +------------------------------------------------------------------------------- +The following mount options are supported: + + gid= Set the default group. + umask= Set the default umask. + uid= Set the default user. + bs= Set the block size. + unhide Show otherwise hidden files. + undelete Show deleted files in lists. + adinicb Embed data in the inode (default) + noadinicb Don't embed data in the inode + shortad Use short ad's + longad Use long ad's (default) + nostrict Unset strict conformance + iocharset= Set the NLS character set + +The uid= and gid= options need a bit more explaining. They will accept a +decimal numeric value which will be used as the default ID for that mount. +They will also accept the string "ignore" and "forget". For files on the disk +that are owned by nobody ( -1 ), they will instead look as if they are owned +by the default ID. The ignore option causes the default ID to override all +IDs on the disk, not just -1. The forget option causes all IDs to be written +to disk as -1, so when the media is later remounted, they will appear to be +owned by whatever default ID it is mounted with at that time. + +For typical desktop use of removable media, you should set the ID to that +of the interactively logged on user, and also specify both the forget and +ignore options. This way the interactive user will always see the files +on the disk as belonging to him. + +The remaining are for debugging and disaster recovery: + + novrs Skip volume sequence recognition + +The following expect a offset from 0. + + session= Set the CDROM session (default= last session) + anchor= Override standard anchor location. (default= 256) + volume= Override the VolumeDesc location. (unused) + partition= Override the PartitionDesc location. (unused) + lastblock= Set the last block of the filesystem/ + +The following expect a offset from the partition root. + + fileset= Override the fileset block location. (unused) + rootdir= Override the root directory location. (unused) + WARNING: overriding the rootdir to a non-directory may + yield highly unpredictable results. +------------------------------------------------------------------------------- + + +For the latest version and toolset see: + http://linux-udf.sourceforge.net/ + +Documentation on UDF and ECMA 167 is available FREE from: + http://www.osta.org/ + http://www.ecma-international.org/ + +Ben Fennema <bfennema@falcon.csc.calpoly.edu> diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt new file mode 100644 index 0000000..7a602ad --- /dev/null +++ b/Documentation/filesystems/ufs.txt @@ -0,0 +1,60 @@ +USING UFS +========= + +mount -t ufs -o ufstype=type_of_ufs device dir + + +UFS OPTIONS +=========== + +ufstype=type_of_ufs + UFS is a file system widely used in different operating systems. + The problem are differences among implementations. Features of + some implementations are undocumented, so its hard to recognize + type of ufs automatically. That's why user must specify type of + ufs manually by mount option ufstype. Possible values are: + + old old format of ufs + default value, supported as read-only + + 44bsd used in FreeBSD, NetBSD, OpenBSD + supported as read-write + + ufs2 used in FreeBSD 5.x + supported as read-write + + 5xbsd synonym for ufs2 + + sun used in SunOS (Solaris) + supported as read-write + + sunx86 used in SunOS for Intel (Solarisx86) + supported as read-write + + hp used in HP-UX + supported as read-only + + nextstep + used in NextStep + supported as read-only + + nextstep-cd + used for NextStep CDROMs (block_size == 2048) + supported as read-only + + openstep + used in OpenStep + supported as read-only + + +POSSIBLE PROBLEMS +================= + +See next section, if you have any. + + +BUG REPORTS +=========== + +Any ufs bug report you can send to daniel.pirkl@email.cz or +to dushistov@mail.ru (do not send partition tables bug reports). diff --git a/Documentation/filesystems/unionfs/00-INDEX b/Documentation/filesystems/unionfs/00-INDEX new file mode 100644 index 0000000..96fdf67 --- /dev/null +++ b/Documentation/filesystems/unionfs/00-INDEX @@ -0,0 +1,10 @@ +00-INDEX + - this file. +concepts.txt + - A brief introduction of concepts. +issues.txt + - A summary of known issues with unionfs. +rename.txt + - Information regarding rename operations. +usage.txt + - Usage information and examples. diff --git a/Documentation/filesystems/unionfs/concepts.txt b/Documentation/filesystems/unionfs/concepts.txt new file mode 100644 index 0000000..b853788 --- /dev/null +++ b/Documentation/filesystems/unionfs/concepts.txt @@ -0,0 +1,287 @@ +Unionfs 2.x CONCEPTS: +===================== + +This file describes the concepts needed by a namespace unification file +system. + + +Branch Priority: +================ + +Each branch is assigned a unique priority - starting from 0 (highest +priority). No two branches can have the same priority. + + +Branch Mode: +============ + +Each branch is assigned a mode - read-write or read-only. This allows +directories on media mounted read-write to be used in a read-only manner. + + +Whiteouts: +========== + +A whiteout removes a file name from the namespace. Whiteouts are needed when +one attempts to remove a file on a read-only branch. + +Suppose we have a two-branch union, where branch 0 is read-write and branch +1 is read-only. And a file 'foo' on branch 1: + +./b0/ +./b1/ +./b1/foo + +The unified view would simply be: + +./union/ +./union/foo + +Since 'foo' is stored on a read-only branch, it cannot be removed. A +whiteout is used to remove the name 'foo' from the unified namespace. Again, +since branch 1 is read-only, the whiteout cannot be created there. So, we +try on a higher priority (lower numerically) branch and create the whiteout +there. + +./b0/ +./b0/.wh.foo +./b1/ +./b1/foo + +Later, when Unionfs traverses branches (due to lookup or readdir), it +eliminate 'foo' from the namespace (as well as the whiteout itself.) + + +Opaque Directories: +=================== + +Assume we have a unionfs mount comprising of two branches. Branch 0 is +empty; branch 1 has the directory /a and file /a/f. Let's say we mount a +union of branch 0 as read-write and branch 1 as read-only. Now, let's say +we try to perform the following operation in the union: + + rm -fr a + +Because branch 1 is not writable, we cannot physically remove the file /a/f +or the directory /a. So instead, we will create a whiteout in branch 0 +named /.wh.a, masking out the name "a" from branch 1. Next, let's say we +try to create a directory named "a" as follows: + + mkdir a + +Because we have a whiteout for "a" already, Unionfs behaves as if "a" +doesn't exist, and thus will delete the whiteout and replace it with an +actual directory named "a". + +The problem now is that if you try to "ls" in the union, Unionfs will +perform is normal directory name unification, for *all* directories named +"a" in all branches. This will cause the file /a/f from branch 1 to +re-appear in the union's namespace, which violates Unix semantics. + +To avoid this problem, we have a different form of whiteouts for +directories, called "opaque directories" (same as BSD Union Mount does). +Whenever we replace a whiteout with a directory, that directory is marked as +opaque. In Unionfs 2.x, it means that we create a file named +/a/.wh.__dir_opaque in branch 0, after having created directory /a there. +When unionfs notices that a directory is opaque, it stops all namespace +operations (including merging readdir contents) at that opaque directory. +This prevents re-exposing names from masked out directories. + + +Duplicate Elimination: +====================== + +It is possible for files on different branches to have the same name. +Unionfs then has to select which instance of the file to show to the user. +Given the fact that each branch has a priority associated with it, the +simplest solution is to take the instance from the highest priority +(numerically lowest value) and "hide" the others. + + +Unlinking: +========= + +Unlink operation on non-directory instances is optimized to remove the +maximum possible objects in case multiple underlying branches have the same +file name. The unlink operation will first try to delete file instances +from highest priority branch and then move further to delete from remaining +branches in order of their decreasing priority. Consider a case (F..D..F), +where F is a file and D is a directory of the same name; here, some +intermediate branch could have an empty directory instance with the same +name, so this operation also tries to delete this directory instance and +proceed further to delete from next possible lower priority branch. The +unionfs unlink operation will smoothly delete the files with same name from +all possible underlying branches. In case if some error occurs, it creates +whiteout in highest priority branch that will hide file instance in rest of +the branches. An error could occur either if an unlink operations in any of +the underlying branch failed or if a branch has no write permission. + +This unlinking policy is known as "delete all" and it has the benefit of +overall reducing the number of inodes used by duplicate files, and further +reducing the total number of inodes consumed by whiteouts. The cost is of +extra processing, but testing shows this extra processing is well worth the +savings. + + +Copyup: +======= + +When a change is made to the contents of a file's data or meta-data, they +have to be stored somewhere. The best way is to create a copy of the +original file on a branch that is writable, and then redirect the write +though to this copy. The copy must be made on a higher priority branch so +that lookup and readdir return this newer "version" of the file rather than +the original (see duplicate elimination). + +An entire unionfs mount can be read-only or read-write. If it's read-only, +then none of the branches will be written to, even if some of the branches +are physically writeable. If the unionfs mount is read-write, then the +leftmost (highest priority) branch must be writeable (for copyup to take +place); the remaining branches can be any mix of read-write and read-only. + +In a writeable mount, unionfs will create new files/dir in the leftmost +branch. If one tries to modify a file in a read-only branch/media, unionfs +will copyup the file to the leftmost branch and modify it there. If you try +to modify a file from a writeable branch which is not the leftmost branch, +then unionfs will modify it in that branch; this is useful if you, say, +unify differnet packages (e.g., apache, sendmail, ftpd, etc.) and you want +changes to specific package files to remain logically in the directory where +they came from. + +Cache Coherency: +================ + +Unionfs users often want to be able to modify files and directories directly +on the lower branches, and have those changes be visible at the Unionfs +level. This means that data (e.g., pages) and meta-data (dentries, inodes, +open files, etc.) have to be synchronized between the upper and lower +layers. In other words, the newest changes from a layer below have to be +propagated to the Unionfs layer above. If the two layers are not in sync, a +cache incoherency ensues, which could lead to application failures and even +oopses. The Linux kernel, however, has a rather limited set of mechanisms +to ensure this inter-layer cache coherency---so Unionfs has to do most of +the hard work on its own. + +Maintaining Invariants: + +The way Unionfs ensures cache coherency is as follows. At each entry point +to a Unionfs file system method, we call a utility function to validate the +primary objects of this method. Generally, we call unionfs_file_revalidate +on open files, and __unionfs_d_revalidate_chain on dentries (which also +validates inodes). These utility functions check to see whether the upper +Unionfs object is in sync with any of the lower objects that it represents. +The checks we perform include whether the Unionfs superblock has a newer +generation number, or if any of the lower objects mtime's or ctime's are +newer. (Note: generation numbers change when branch-management commands are +issued, so in a way, maintaining cache coherency is also very important for +branch-management.) If indeed we determine that any Unionfs object is no +longer in sync with its lower counterparts, then we rebuild that object +similarly to how we do so for branch-management. + +While rebuilding Unionfs's objects, we also purge any page mappings and +truncate inode pages (see fs/unionfs/dentry.c:purge_inode_data). This is to +ensure that Unionfs will re-get the newer data from the lower branches. We +perform this purging only if the Unionfs operation in question is a reading +operation; if Unionfs is performing a data writing operation (e.g., ->write, +->commit_write, etc.) then we do NOT flush the lower mappings/pages: this is +because (1) a self-deadlock could occur and (2) the upper Unionfs pages are +considered more authoritative anyway, as they are newer and will overwrite +any lower pages. + +Unionfs maintains the following important invariant regarding mtime's, +ctime's, and atime's: the upper inode object's times are the max() of all of +the lower ones. For non-directory objects, there's only one object below, +so the mapping is simple; for directory objects, there could me multiple +lower objects and we have to sync up with the newest one of all the lower +ones. This invariant is important to maintain, especially for directories +(besides, we need this to be POSIX compliant). A union could comprise +multiple writable branches, each of which could change. If we don't reflect +the newest possible mtime/ctime, some applications could fail. For example, +NFSv2/v3 exports check for newer directory mtimes on the server to determine +if the client-side attribute cache should be purged. + +To maintain these important invariants, of course, Unionfs carefully +synchronizes upper and lower times in various places. For example, if we +copy-up a file to a top-level branch, the parent directory where the file +was copied up to will now have a new mtime: so after a successful copy-up, +we sync up with the new top-level branch's parent directory mtime. + +Implementation: + +This cache-coherency implementation is efficient because it defers any +synchronizing between the upper and lower layers until absolutely needed. +Consider the example a common situation where users perform a lot of lower +changes, such as untarring a whole package. While these take place, +typically the user doesn't access the files via Unionfs; only after the +lower changes are done, does the user try to access the lower files. With +our cache-coherency implementation, the entirety of the changes to the lower +branches will not result in a single CPU cycle spent at the Unionfs level +until the user invokes a system call that goes through Unionfs. + +We have considered two alternate cache-coherency designs. (1) Using the +dentry/inode notify functionality to register interest in finding out about +any lower changes. This is a somewhat limited and also a heavy-handed +approach which could result in many notifications to the Unionfs layer upon +each small change at the lower layer (imagine a file being modified multiple +times in rapid succession). (2) Rewriting the VFS to support explicit +callbacks from lower objects to upper objects. We began exploring such an +implementation, but found it to be very complicated--it would have resulted +in massive VFS/MM changes which are unlikely to be accepted by the LKML +community. We therefore believe that our current cache-coherency design and +implementation represent the best approach at this time. + +Limitations: + +Our implementation works in that as long as a user process will have caused +Unionfs to be called, directly or indirectly, even to just do +->d_revalidate; then we will have purged the current Unionfs data and the +process will see the new data. For example, a process that continually +re-reads the same file's data will see the NEW data as soon as the lower +file had changed, upon the next read(2) syscall (even if the file is still +open!) However, this doesn't work when the process re-reads the open file's +data via mmap(2) (unless the user unmaps/closes the file and remaps/reopens +it). Once we respond to ->readpage(s), then the kernel maps the page into +the process's address space and there doesn't appear to be a way to force +the kernel to invalidate those pages/mappings, and force the process to +re-issue ->readpage. If there's a way to invalidate active mappings and +force a ->readpage, let us know please (invalidate_inode_pages2 doesn't do +the trick). + +Our current Unionfs code has to perform many file-revalidation calls. It +would be really nice if the VFS would export an optional file system hook +->file_revalidate (similarly to dentry->d_revalidate) that will be called +before each VFS op that has a "struct file" in it. + +Certain file systems have micro-second granularity (or better) for inode +times, and asynchronous actions could cause those times to change with some +small delay. In such cases, Unionfs may see a changed inode time that only +differs by a tiny fraction of a second: such a change may be a false +positive indication that the lower object has changed, whereas if unionfs +waits a little longer, that false indication will not be seen. (These false +positives are harmless, because they would at most cause unionfs to +re-validate an object that may need no revalidation, and print a debugging +message that clutters the console/logs.) Therefore, to minimize the chances +of these situations, we delay the detection of changed times by a small +factor of a few seconds, called UNIONFS_MIN_CC_TIME (which defaults to 3 +seconds, as does NFS). This means that we will detect the change, only a +couple of seconds later, if indeed the time change persists in the lower +file object. This delayed detection has an added performance benefit: we +reduce the number of times that unionfs has to revalidate objects, in case +there's a lot of concurrent activity on both the upper and lower objects, +for the same file(s). Lastly, this delayed time attribute detection is +similar to how NFS clients operate (e.g., acregmin). + +Finally, there is no way currently in Linux to prevent lower directories +from being moved around (i.e., topology changes); there's no way to prevent +modifications to directory sub-trees of whole file systems which are mounted +read-write. It is therefore possible for in-flight operations in unionfs to +take place, while a lower directory is being moved around. Therefore, if +you try to, say, create a new file in a directory through unionfs, while the +directory is being moved around directly, then the new file may get created +in the new location where that directory was moved to. This is a somewhat +similar behaviour in NFS: an NFS client could be creating a new file while +th NFS server is moving th directory around; the file will get successfully +created in the new location. (The one exception in unionfs is that if the +branch is marked read-only by unionfs, then a copyup will take place.) + +For more information, see <http://unionfs.filesystems.org/>. diff --git a/Documentation/filesystems/unionfs/issues.txt b/Documentation/filesystems/unionfs/issues.txt new file mode 100644 index 0000000..f4b7e7e --- /dev/null +++ b/Documentation/filesystems/unionfs/issues.txt @@ -0,0 +1,28 @@ +KNOWN Unionfs 2.x ISSUES: +========================= + +1. Unionfs should not use lookup_one_len() on the underlying f/s as it + confuses NFSv4. Currently, unionfs_lookup() passes lookup intents to the + lower file-system, this eliminates part of the problem. The remaining + calls to lookup_one_len may need to be changed to pass an intent. We are + currently introducing VFS changes to fs/namei.c's do_path_lookup() to + allow proper file lookup and opening in stackable file systems. + +2. Lockdep (a debugging feature) isn't aware of stacking, and so it + incorrectly complains about locking problems. The problem boils down to + this: Lockdep considers all objects of a certain type to be in the same + class, for example, all inodes. Lockdep doesn't like to see a lock held + on two inodes within the same task, and warns that it could lead to a + deadlock. However, stackable file systems do precisely that: they lock + an upper object, and then a lower object, in a strict order to avoid + locking problems; in addition, Unionfs, as a fan-out file system, may + have to lock several lower inodes. We are currently looking into Lockdep + to see how to make it aware of stackable file systems. For now, we + temporarily disable lockdep when calling vfs methods on lower objects, + but only for those places where lockdep complained. While this solution + may seem unclean, it is not without precedent: other places in the kernel + also do similar temporary disabling, of course after carefully having + checked that it is the right thing to do. Anyway, you get any warnings + from Lockdep, please report them to the Unionfs maintainers. + +For more information, see <http://unionfs.filesystems.org/>. diff --git a/Documentation/filesystems/unionfs/rename.txt b/Documentation/filesystems/unionfs/rename.txt new file mode 100644 index 0000000..e20bb82 --- /dev/null +++ b/Documentation/filesystems/unionfs/rename.txt @@ -0,0 +1,31 @@ +Rename is a complex beast. The following table shows which rename(2) operations +should succeed and which should fail. + +o: success +E: error (either unionfs or vfs) +X: EXDEV + +none = file does not exist +file = file is a file +dir = file is a empty directory +child= file is a non-empty directory +wh = file is a directory containing only whiteouts; this makes it logically + empty + + none file dir child wh +file o o E E E +dir o E o E o +child X E X E X +wh o E o E o + + +Renaming directories: +===================== + +Whenever a empty (either physically or logically) directory is being renamed, +the following sequence of events should take place: + +1) Remove whiteouts from both source and destination directory +2) Rename source to destination +3) Make destination opaque to prevent anything under it from showing up + diff --git a/Documentation/filesystems/unionfs/usage.txt b/Documentation/filesystems/unionfs/usage.txt new file mode 100644 index 0000000..1adde69 --- /dev/null +++ b/Documentation/filesystems/unionfs/usage.txt @@ -0,0 +1,134 @@ +Unionfs is a stackable unification file system, which can appear to merge +the contents of several directories (branches), while keeping their physical +content separate. Unionfs is useful for unified source tree management, +merged contents of split CD-ROM, merged separate software package +directories, data grids, and more. Unionfs allows any mix of read-only and +read-write branches, as well as insertion and deletion of branches anywhere +in the fan-out. To maintain Unix semantics, Unionfs handles elimination of +duplicates, partial-error conditions, and more. + +GENERAL SYNTAX +============== + +# mount -t unionfs -o <OPTIONS>,<BRANCH-OPTIONS> none MOUNTPOINT + +OPTIONS can be any legal combination of: + +- ro # mount file system read-only +- rw # mount file system read-write +- remount # remount the file system (see Branch Management below) +- incgen # increment generation no. (see Cache Consistency below) + +BRANCH-OPTIONS can be either (1) a list of branches given to the "dirs=" +option, or (2) a list of individual branch manipulation commands, combined +with the "remount" option, and is further described in the "Branch +Management" section below. + +The syntax for the "dirs=" mount option is: + + dirs=branch[=ro|=rw][:...] + +The "dirs=" option takes a colon-delimited list of directories to compose +the union, with an optional branch mode for each of those directories. +Directories that come earlier (specified first, on the left) in the list +have a higher precedence than those which come later. Additionally, +read-only or read-write permissions of the branch can be specified by +appending =ro or =rw (default) to each directory. See the Copyup section in +concepts.txt, for a description of Unionfs's behavior when mixing read-only +and read-write branches and mounts. + +Syntax: + + dirs=/branch1[=ro|=rw]:/branch2[=ro|=rw]:...:/branchN[=ro|=rw] + +Example: + + dirs=/writable_branch=rw:/read-only_branch=ro + + +BRANCH MANAGEMENT +================= + +Once you mount your union for the first time, using the "dirs=" option, you +can then change the union's overall mode or reconfigure the branches, using +the remount option, as follows. + +To downgrade a union from read-write to read-only: + +# mount -t unionfs -o remount,ro none MOUNTPOINT + +To upgrade a union from read-only to read-write: + +# mount -t unionfs -o remount,rw none MOUNTPOINT + +To delete a branch /foo, regardless where it is in the current union: + +# mount -t unionfs -o remount,del=/foo none MOUNTPOINT + +To insert (add) a branch /foo before /bar: + +# mount -t unionfs -o remount,add=/bar:/foo none MOUNTPOINT + +To insert (add) a branch /foo (with the "rw" mode flag) before /bar: + +# mount -t unionfs -o remount,add=/bar:/foo=rw none MOUNTPOINT + +To insert (add) a branch /foo (in "rw" mode) at the very beginning (i.e., a +new highest-priority branch), you can use the above syntax, or use a short +hand version as follows: + +# mount -t unionfs -o remount,add=/foo none MOUNTPOINT + +To append a branch to the very end (new lowest-priority branch): + +# mount -t unionfs -o remount,add=:/foo none MOUNTPOINT + +To append a branch to the very end (new lowest-priority branch), in +read-only mode: + +# mount -t unionfs -o remount,add=:/foo=ro none MOUNTPOINT + +Finally, to change the mode of one existing branch, say /foo, from read-only +to read-write, and change /bar from read-write to read-only: + +# mount -t unionfs -o remount,mode=/foo=rw,mode=/bar=ro none MOUNTPOINT + +Note: in Unionfs 2.x, you cannot set the leftmost branch to readonly because +then Unionfs won't have any writable place for copyups to take place. +Moreover, the VFS can get confused when it tries to modify something in a +file system mounted read-write, but isn't permitted to write to it. +Instead, you should set the whole union as readonly, as described above. +If, however, you must set the leftmost branch as readonly, perhaps so you +can get a snapshot of it at a point in time, then you should insert a new +writable top-level branch, and mark the one you want as readonly. This can +be accomplished as follows, assuming that /foo is your current leftmost +branch: + +# mount -t tmpfs -o size=NNN /new +# mount -t unionfs -o remount,add=/new,mode=/foo=ro none MOUNTPOINT +<do what you want safely in /foo> +# mount -t unionfs -o remount,del=/new,mode=/foo=rw none MOUNTPOINT +<check if there's anything in /new you want to preserve> +# umount /new + +CACHE CONSISTENCY +================= + +If you modify any file on any of the lower branches directly, while there is +a Unionfs 2.x mounted above any of those branches, you should tell Unionfs +to purge its caches and re-get the objects. To do that, you have to +increment the generation number of the superblock using the following +command: + +# mount -t unionfs -o remount,incgen none MOUNTPOINT + +Note that the older way of incrementing the generation number using an +ioctl, is no longer supported in Unionfs 2.0 and newer. Ioctls in general +are not encouraged. Plus, an ioctl is per-file concept, whereas the +generation number is a per-file-system concept. Worse, such an ioctl +requires an open file, which then has to be invalidated by the very nature +of the generation number increase (read: the old generation increase ioctl +was pretty racy). + + +For more information, see <http://unionfs.filesystems.org/>. diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt new file mode 100644 index 0000000..3a5ddc9 --- /dev/null +++ b/Documentation/filesystems/vfat.txt @@ -0,0 +1,289 @@ +USING VFAT +---------------------------------------------------------------------- +To use the vfat filesystem, use the filesystem type 'vfat'. i.e. + mount -t vfat /dev/fd0 /mnt + +No special partition formatter is required. mkdosfs will work fine +if you want to format from within Linux. + +VFAT MOUNT OPTIONS +---------------------------------------------------------------------- +uid=### -- Set the owner of all files on this filesystem. + The default is the uid of current process. + +gid=### -- Set the group of all files on this filesystem. + The default is the gid of current process. + +umask=### -- The permission mask (for files and directories, see umask(1)). + The default is the umask of current process. + +dmask=### -- The permission mask for the directory. + The default is the umask of current process. + +fmask=### -- The permission mask for files. + The default is the umask of current process. + +allow_utime=### -- This option controls the permission check of mtime/atime. + + 20 - If current process is in group of file's group ID, + you can change timestamp. + 2 - Other users can change timestamp. + + The default is set from `dmask' option. (If the directory is + writable, utime(2) is also allowed. I.e. ~dmask & 022) + + Normally utime(2) checks current process is owner of + the file, or it has CAP_FOWNER capability. But FAT + filesystem doesn't have uid/gid on disk, so normal + check is too unflexible. With this option you can + relax it. + +codepage=### -- Sets the codepage number for converting to shortname + characters on FAT filesystem. + By default, FAT_DEFAULT_CODEPAGE setting is used. + +iocharset=<name> -- Character set to use for converting between the + encoding is used for user visible filename and 16 bit + Unicode characters. Long filenames are stored on disk + in Unicode format, but Unix for the most part doesn't + know how to deal with Unicode. + By default, FAT_DEFAULT_IOCHARSET setting is used. + + There is also an option of doing UTF-8 translations + with the utf8 option. + + NOTE: "iocharset=utf8" is not recommended. If unsure, + you should consider the following option instead. + +utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that + is used by the console. It can be enabled for the + filesystem with this option. If 'uni_xlate' gets set, + UTF-8 gets disabled. + +uni_xlate=<bool> -- Translate unhandled Unicode characters to special + escaped sequences. This would let you backup and + restore filenames that are created with any Unicode + characters. Until Linux supports Unicode for real, + this gives you an alternative. Without this option, + a '?' is used when no translation is possible. The + escape character is ':' because it is otherwise + illegal on the vfat filesystem. The escape sequence + that gets used is ':' and the four digits of hexadecimal + unicode. + +nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will + end in '~1' or tilde followed by some number. If this + option is set, then if the filename is + "longfilename.txt" and "longfile.txt" does not + currently exist in the directory, 'longfile.txt' will + be the short alias instead of 'longfi~1.txt'. + +usefree -- Use the "free clusters" value stored on FSINFO. It'll + be used to determine number of free clusters without + scanning disk. But it's not used by default, because + recent Windows don't update it correctly in some + case. If you are sure the "free clusters" on FSINFO is + correct, by this option you can avoid scanning disk. + +quiet -- Stops printing certain warning messages. + +check=s|r|n -- Case sensitivity checking setting. + s: strict, case sensitive + r: relaxed, case insensitive + n: normal, default setting, currently case insensitive + +nocase -- This was deprecated for vfat. Use shortname=win95 instead. + +shortname=lower|win95|winnt|mixed + -- Shortname display/create setting. + lower: convert to lowercase for display, + emulate the Windows 95 rule for create. + win95: emulate the Windows 95 rule for display/create. + winnt: emulate the Windows NT rule for display/create. + mixed: emulate the Windows NT rule for display, + emulate the Windows 95 rule for create. + Default setting is `lower'. + +tz=UTC -- Interpret timestamps as UTC rather than local time. + This option disables the conversion of timestamps + between local time (as used by Windows on FAT) and UTC + (which Linux uses internally). This is particularly + useful when mounting devices (like digital cameras) + that are set to UTC in order to avoid the pitfalls of + local time. + +showexec -- If set, the execute permission bits of the file will be + allowed only if the extension part of the name is .EXE, + .COM, or .BAT. Not set by default. + +debug -- Can be set, but unused by the current implementation. + +sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as + IMMUTABLE flag on Linux. Not set by default. + +flush -- If set, the filesystem will try to flush to disk more + early than normal. Not set by default. + +rodir -- FAT has the ATTR_RO (read-only) attribute. But on Windows, + the ATTR_RO of the directory will be just ignored actually, + and is used by only applications as flag. E.g. it's setted + for the customized folder. + + If you want to use ATTR_RO as read-only flag even for + the directory, set this option. + +<bool>: 0,1,yes,no,true,false + +TODO +---------------------------------------------------------------------- +* Need to get rid of the raw scanning stuff. Instead, always use + a get next directory entry approach. The only thing left that uses + raw scanning is the directory renaming code. + + +POSSIBLE PROBLEMS +---------------------------------------------------------------------- +* vfat_valid_longname does not properly checked reserved names. +* When a volume name is the same as a directory name in the root + directory of the filesystem, the directory name sometimes shows + up as an empty file. +* autoconv option does not work correctly. + +BUG REPORTS +---------------------------------------------------------------------- +If you have trouble with the VFAT filesystem, mail bug reports to +chaffee@bmrc.cs.berkeley.edu. Please specify the filename +and the operation that gave you trouble. + +TEST SUITE +---------------------------------------------------------------------- +If you plan to make any modifications to the vfat filesystem, please +get the test suite that comes with the vfat distribution at + + http://bmrc.berkeley.edu/people/chaffee/vfat.html + +This tests quite a few parts of the vfat filesystem and additional +tests for new features or untested features would be appreciated. + +NOTES ON THE STRUCTURE OF THE VFAT FILESYSTEM +---------------------------------------------------------------------- +(This documentation was provided by Galen C. Hunt <gchunt@cs.rochester.edu> + and lightly annotated by Gordon Chaffee). + +This document presents a very rough, technical overview of my +knowledge of the extended FAT file system used in Windows NT 3.5 and +Windows 95. I don't guarantee that any of the following is correct, +but it appears to be so. + +The extended FAT file system is almost identical to the FAT +file system used in DOS versions up to and including 6.223410239847 +:-). The significant change has been the addition of long file names. +These names support up to 255 characters including spaces and lower +case characters as opposed to the traditional 8.3 short names. + +Here is the description of the traditional FAT entry in the current +Windows 95 filesystem: + + struct directory { // Short 8.3 names + unsigned char name[8]; // file name + unsigned char ext[3]; // file extension + unsigned char attr; // attribute byte + unsigned char lcase; // Case for base and extension + unsigned char ctime_ms; // Creation time, milliseconds + unsigned char ctime[2]; // Creation time + unsigned char cdate[2]; // Creation date + unsigned char adate[2]; // Last access date + unsigned char reserved[2]; // reserved values (ignored) + unsigned char time[2]; // time stamp + unsigned char date[2]; // date stamp + unsigned char start[2]; // starting cluster number + unsigned char size[4]; // size of the file + }; + +The lcase field specifies if the base and/or the extension of an 8.3 +name should be capitalized. This field does not seem to be used by +Windows 95 but it is used by Windows NT. The case of filenames is not +completely compatible from Windows NT to Windows 95. It is not completely +compatible in the reverse direction, however. Filenames that fit in +the 8.3 namespace and are written on Windows NT to be lowercase will +show up as uppercase on Windows 95. + +Note that the "start" and "size" values are actually little +endian integer values. The descriptions of the fields in this +structure are public knowledge and can be found elsewhere. + +With the extended FAT system, Microsoft has inserted extra +directory entries for any files with extended names. (Any name which +legally fits within the old 8.3 encoding scheme does not have extra +entries.) I call these extra entries slots. Basically, a slot is a +specially formatted directory entry which holds up to 13 characters of +a file's extended name. Think of slots as additional labeling for the +directory entry of the file to which they correspond. Microsoft +prefers to refer to the 8.3 entry for a file as its alias and the +extended slot directory entries as the file name. + +The C structure for a slot directory entry follows: + + struct slot { // Up to 13 characters of a long name + unsigned char id; // sequence number for slot + unsigned char name0_4[10]; // first 5 characters in name + unsigned char attr; // attribute byte + unsigned char reserved; // always 0 + unsigned char alias_checksum; // checksum for 8.3 alias + unsigned char name5_10[12]; // 6 more characters in name + unsigned char start[2]; // starting cluster number + unsigned char name11_12[4]; // last 2 characters in name + }; + +If the layout of the slots looks a little odd, it's only +because of Microsoft's efforts to maintain compatibility with old +software. The slots must be disguised to prevent old software from +panicking. To this end, a number of measures are taken: + + 1) The attribute byte for a slot directory entry is always set + to 0x0f. This corresponds to an old directory entry with + attributes of "hidden", "system", "read-only", and "volume + label". Most old software will ignore any directory + entries with the "volume label" bit set. Real volume label + entries don't have the other three bits set. + + 2) The starting cluster is always set to 0, an impossible + value for a DOS file. + +Because the extended FAT system is backward compatible, it is +possible for old software to modify directory entries. Measures must +be taken to ensure the validity of slots. An extended FAT system can +verify that a slot does in fact belong to an 8.3 directory entry by +the following: + + 1) Positioning. Slots for a file always immediately proceed + their corresponding 8.3 directory entry. In addition, each + slot has an id which marks its order in the extended file + name. Here is a very abbreviated view of an 8.3 directory + entry and its corresponding long name slots for the file + "My Big File.Extension which is long": + + <proceeding files...> + <slot #3, id = 0x43, characters = "h is long"> + <slot #2, id = 0x02, characters = "xtension whic"> + <slot #1, id = 0x01, characters = "My Big File.E"> + <directory entry, name = "MYBIGFIL.EXT"> + + Note that the slots are stored from last to first. Slots + are numbered from 1 to N. The Nth slot is or'ed with 0x40 + to mark it as the last one. + + 2) Checksum. Each slot has an "alias_checksum" value. The + checksum is calculated from the 8.3 name using the + following algorithm: + + for (sum = i = 0; i < 11; i++) { + sum = (((sum&1)<<7)|((sum&0xfe)>>1)) + name[i] + } + + 3) If there is free space in the final slot, a Unicode NULL (0x0000) + is stored after the final character. After that, all unused + characters in the final slot are set to Unicode 0xFFFF. + +Finally, note that the extended name is stored in Unicode. Each Unicode +character takes two bytes. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt new file mode 100644 index 0000000..5579bda --- /dev/null +++ b/Documentation/filesystems/vfs.txt @@ -0,0 +1,1000 @@ + + Overview of the Linux Virtual File System + + Original author: Richard Gooch <rgooch@atnf.csiro.au> + + Last updated on June 24, 2007. + + Copyright (C) 1999 Richard Gooch + Copyright (C) 2005 Pekka Enberg + + This file is released under the GPLv2. + + +Introduction +============ + +The Virtual File System (also known as the Virtual Filesystem Switch) +is the software layer in the kernel that provides the filesystem +interface to userspace programs. It also provides an abstraction +within the kernel which allows different filesystem implementations to +coexist. + +VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so +on are called from a process context. Filesystem locking is described +in the document Documentation/filesystems/Locking. + + +Directory Entry Cache (dcache) +------------------------------ + +The VFS implements the open(2), stat(2), chmod(2), and similar system +calls. The pathname argument that is passed to them is used by the VFS +to search through the directory entry cache (also known as the dentry +cache or dcache). This provides a very fast look-up mechanism to +translate a pathname (filename) into a specific dentry. Dentries live +in RAM and are never saved to disc: they exist only for performance. + +The dentry cache is meant to be a view into your entire filespace. As +most computers cannot fit all dentries in the RAM at the same time, +some bits of the cache are missing. In order to resolve your pathname +into a dentry, the VFS may have to resort to creating dentries along +the way, and then loading the inode. This is done by looking up the +inode. + + +The Inode Object +---------------- + +An individual dentry usually has a pointer to an inode. Inodes are +filesystem objects such as regular files, directories, FIFOs and other +beasts. They live either on the disc (for block device filesystems) +or in the memory (for pseudo filesystems). Inodes that live on the +disc are copied into the memory when required and changes to the inode +are written back to disc. A single inode can be pointed to by multiple +dentries (hard links, for example, do this). + +To look up an inode requires that the VFS calls the lookup() method of +the parent directory inode. This method is installed by the specific +filesystem implementation that the inode lives in. Once the VFS has +the required dentry (and hence the inode), we can do all those boring +things like open(2) the file, or stat(2) it to peek at the inode +data. The stat(2) operation is fairly simple: once the VFS has the +dentry, it peeks at the inode data and passes some of it back to +userspace. + + +The File Object +--------------- + +Opening a file requires another operation: allocation of a file +structure (this is the kernel-side implementation of file +descriptors). The freshly allocated file structure is initialized with +a pointer to the dentry and a set of file operation member functions. +These are taken from the inode data. The open() file method is then +called so the specific filesystem implementation can do it's work. You +can see that this is another switch performed by the VFS. The file +structure is placed into the file descriptor table for the process. + +Reading, writing and closing files (and other assorted VFS operations) +is done by using the userspace file descriptor to grab the appropriate +file structure, and then calling the required file structure method to +do whatever is required. For as long as the file is open, it keeps the +dentry in use, which in turn means that the VFS inode is still in use. + + +Registering and Mounting a Filesystem +===================================== + +To register and unregister a filesystem, use the following API +functions: + + #include <linux/fs.h> + + extern int register_filesystem(struct file_system_type *); + extern int unregister_filesystem(struct file_system_type *); + +The passed struct file_system_type describes your filesystem. When a +request is made to mount a device onto a directory in your filespace, +the VFS will call the appropriate get_sb() method for the specific +filesystem. The dentry for the mount point will then be updated to +point to the root inode for the new filesystem. + +You can see all filesystems that are registered to the kernel in the +file /proc/filesystems. + + +struct file_system_type +----------------------- + +This describes the filesystem. As of kernel 2.6.22, the following +members are defined: + +struct file_system_type { + const char *name; + int fs_flags; + int (*get_sb) (struct file_system_type *, int, + const char *, void *, struct vfsmount *); + void (*kill_sb) (struct super_block *); + struct module *owner; + struct file_system_type * next; + struct list_head fs_supers; + struct lock_class_key s_lock_key; + struct lock_class_key s_umount_key; +}; + + name: the name of the filesystem type, such as "ext2", "iso9660", + "msdos" and so on + + fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) + + get_sb: the method to call when a new instance of this + filesystem should be mounted + + kill_sb: the method to call when an instance of this filesystem + should be unmounted + + owner: for internal VFS use: you should initialize this to THIS_MODULE in + most cases. + + next: for internal VFS use: you should initialize this to NULL + + s_lock_key, s_umount_key: lockdep-specific + +The get_sb() method has the following arguments: + + struct file_system_type *fs_type: describes the filesystem, partly initialized + by the specific filesystem code + + int flags: mount flags + + const char *dev_name: the device name we are mounting. + + void *data: arbitrary mount options, usually comes as an ASCII + string (see "Mount Options" section) + + struct vfsmount *mnt: a vfs-internal representation of a mount point + +The get_sb() method must determine if the block device specified +in the dev_name and fs_type contains a filesystem of the type the method +supports. If it succeeds in opening the named block device, it initializes a +struct super_block descriptor for the filesystem contained by the block device. +On failure it returns an error. + +The most interesting member of the superblock structure that the +get_sb() method fills in is the "s_op" field. This is a pointer to +a "struct super_operations" which describes the next level of the +filesystem implementation. + +Usually, a filesystem uses one of the generic get_sb() implementations +and provides a fill_super() method instead. The generic methods are: + + get_sb_bdev: mount a filesystem residing on a block device + + get_sb_nodev: mount a filesystem that is not backed by a device + + get_sb_single: mount a filesystem which shares the instance between + all mounts + +A fill_super() method implementation has the following arguments: + + struct super_block *sb: the superblock structure. The method fill_super() + must initialize this properly. + + void *data: arbitrary mount options, usually comes as an ASCII + string (see "Mount Options" section) + + int silent: whether or not to be silent on error + + +The Superblock Object +===================== + +A superblock object represents a mounted filesystem. + + +struct super_operations +----------------------- + +This describes how the VFS can manipulate the superblock of your +filesystem. As of kernel 2.6.22, the following members are defined: + +struct super_operations { + struct inode *(*alloc_inode)(struct super_block *sb); + void (*destroy_inode)(struct inode *); + + void (*dirty_inode) (struct inode *); + int (*write_inode) (struct inode *, int); + void (*drop_inode) (struct inode *); + void (*delete_inode) (struct inode *); + void (*put_super) (struct super_block *); + void (*write_super) (struct super_block *); + int (*sync_fs)(struct super_block *sb, int wait); + void (*write_super_lockfs) (struct super_block *); + void (*unlockfs) (struct super_block *); + int (*statfs) (struct dentry *, struct kstatfs *); + int (*remount_fs) (struct super_block *, int *, char *); + void (*clear_inode) (struct inode *); + void (*umount_begin) (struct super_block *); + + int (*show_options)(struct seq_file *, struct vfsmount *); + + ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); + ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); +}; + +All methods are called without any locks being held, unless otherwise +noted. This means that most methods can block safely. All methods are +only called from a process context (i.e. not from an interrupt handler +or bottom half). + + alloc_inode: this method is called by inode_alloc() to allocate memory + for struct inode and initialize it. If this function is not + defined, a simple 'struct inode' is allocated. Normally + alloc_inode will be used to allocate a larger structure which + contains a 'struct inode' embedded within it. + + destroy_inode: this method is called by destroy_inode() to release + resources allocated for struct inode. It is only required if + ->alloc_inode was defined and simply undoes anything done by + ->alloc_inode. + + dirty_inode: this method is called by the VFS to mark an inode dirty. + + write_inode: this method is called when the VFS needs to write an + inode to disc. The second parameter indicates whether the write + should be synchronous or not, not all filesystems check this flag. + + drop_inode: called when the last access to the inode is dropped, + with the inode_lock spinlock held. + + This method should be either NULL (normal UNIX filesystem + semantics) or "generic_delete_inode" (for filesystems that do not + want to cache inodes - causing "delete_inode" to always be + called regardless of the value of i_nlink) + + The "generic_delete_inode()" behavior is equivalent to the + old practice of using "force_delete" in the put_inode() case, + but does not have the races that the "force_delete()" approach + had. + + delete_inode: called when the VFS wants to delete an inode + + put_super: called when the VFS wishes to free the superblock + (i.e. unmount). This is called with the superblock lock held + + write_super: called when the VFS superblock needs to be written to + disc. This method is optional + + sync_fs: called when VFS is writing out all dirty data associated with + a superblock. The second parameter indicates whether the method + should wait until the write out has been completed. Optional. + + write_super_lockfs: called when VFS is locking a filesystem and + forcing it into a consistent state. This method is currently + used by the Logical Volume Manager (LVM). + + unlockfs: called when VFS is unlocking a filesystem and making it writable + again. + + statfs: called when the VFS needs to get filesystem statistics. This + is called with the kernel lock held + + remount_fs: called when the filesystem is remounted. This is called + with the kernel lock held + + clear_inode: called then the VFS clears the inode. Optional + + umount_begin: called when the VFS is unmounting a filesystem. + + show_options: called by the VFS to show mount options for + /proc/<pid>/mounts. (see "Mount Options" section) + + quota_read: called by the VFS to read from filesystem quota file. + + quota_write: called by the VFS to write to filesystem quota file. + +Whoever sets up the inode is responsible for filling in the "i_op" field. This +is a pointer to a "struct inode_operations" which describes the methods that +can be performed on individual inodes. + + +The Inode Object +================ + +An inode object represents an object within the filesystem. + + +struct inode_operations +----------------------- + +This describes how the VFS can manipulate an inode in your +filesystem. As of kernel 2.6.22, the following members are defined: + +struct inode_operations { + int (*create) (struct inode *,struct dentry *,int, struct nameidata *); + struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); + int (*link) (struct dentry *,struct inode *,struct dentry *); + int (*unlink) (struct inode *,struct dentry *); + int (*symlink) (struct inode *,struct dentry *,const char *); + int (*mkdir) (struct inode *,struct dentry *,int); + int (*rmdir) (struct inode *,struct dentry *); + int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*rename) (struct inode *, struct dentry *, + struct inode *, struct dentry *); + int (*readlink) (struct dentry *, char __user *,int); + void * (*follow_link) (struct dentry *, struct nameidata *); + void (*put_link) (struct dentry *, struct nameidata *, void *); + void (*truncate) (struct inode *); + int (*permission) (struct inode *, int, struct nameidata *); + int (*setattr) (struct dentry *, struct iattr *); + int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); + ssize_t (*listxattr) (struct dentry *, char *, size_t); + int (*removexattr) (struct dentry *, const char *); + void (*truncate_range)(struct inode *, loff_t, loff_t); +}; + +Again, all methods are called without any locks being held, unless +otherwise noted. + + create: called by the open(2) and creat(2) system calls. Only + required if you want to support regular files. The dentry you + get should not have an inode (i.e. it should be a negative + dentry). Here you will probably call d_instantiate() with the + dentry and the newly created inode + + lookup: called when the VFS needs to look up an inode in a parent + directory. The name to look for is found in the dentry. This + method must call d_add() to insert the found inode into the + dentry. The "i_count" field in the inode structure should be + incremented. If the named inode does not exist a NULL inode + should be inserted into the dentry (this is called a negative + dentry). Returning an error code from this routine must only + be done on a real error, otherwise creating inodes with system + calls like create(2), mknod(2), mkdir(2) and so on will fail. + If you wish to overload the dentry methods then you should + initialise the "d_dop" field in the dentry; this is a pointer + to a struct "dentry_operations". + This method is called with the directory inode semaphore held + + link: called by the link(2) system call. Only required if you want + to support hard links. You will probably need to call + d_instantiate() just as you would in the create() method + + unlink: called by the unlink(2) system call. Only required if you + want to support deleting inodes + + symlink: called by the symlink(2) system call. Only required if you + want to support symlinks. You will probably need to call + d_instantiate() just as you would in the create() method + + mkdir: called by the mkdir(2) system call. Only required if you want + to support creating subdirectories. You will probably need to + call d_instantiate() just as you would in the create() method + + rmdir: called by the rmdir(2) system call. Only required if you want + to support deleting subdirectories + + mknod: called by the mknod(2) system call to create a device (char, + block) inode or a named pipe (FIFO) or socket. Only required + if you want to support creating these types of inodes. You + will probably need to call d_instantiate() just as you would + in the create() method + + rename: called by the rename(2) system call to rename the object to + have the parent and name given by the second inode and dentry. + + readlink: called by the readlink(2) system call. Only required if + you want to support reading symbolic links + + follow_link: called by the VFS to follow a symbolic link to the + inode it points to. Only required if you want to support + symbolic links. This method returns a void pointer cookie + that is passed to put_link(). + + put_link: called by the VFS to release resources allocated by + follow_link(). The cookie returned by follow_link() is passed + to this method as the last parameter. It is used by + filesystems such as NFS where page cache is not stable + (i.e. page that was installed when the symbolic link walk + started might not be in the page cache at the end of the + walk). + + truncate: called by the VFS to change the size of a file. The + i_size field of the inode is set to the desired size by the + VFS before this method is called. This method is called by + the truncate(2) system call and related functionality. + + permission: called by the VFS to check for access rights on a POSIX-like + filesystem. + + setattr: called by the VFS to set attributes for a file. This method + is called by chmod(2) and related system calls. + + getattr: called by the VFS to get attributes of a file. This method + is called by stat(2) and related system calls. + + setxattr: called by the VFS to set an extended attribute for a file. + Extended attribute is a name:value pair associated with an + inode. This method is called by setxattr(2) system call. + + getxattr: called by the VFS to retrieve the value of an extended + attribute name. This method is called by getxattr(2) function + call. + + listxattr: called by the VFS to list all extended attributes for a + given file. This method is called by listxattr(2) system call. + + removexattr: called by the VFS to remove an extended attribute from + a file. This method is called by removexattr(2) system call. + + truncate_range: a method provided by the underlying filesystem to truncate a + range of blocks , i.e. punch a hole somewhere in a file. + + +The Address Space Object +======================== + +The address space object is used to group and manage pages in the page +cache. It can be used to keep track of the pages in a file (or +anything else) and also track the mapping of sections of the file into +process address spaces. + +There are a number of distinct yet related services that an +address-space can provide. These include communicating memory +pressure, page lookup by address, and keeping track of pages tagged as +Dirty or Writeback. + +The first can be used independently to the others. The VM can try to +either write dirty pages in order to clean them, or release clean +pages in order to reuse them. To do this it can call the ->writepage +method on dirty pages, and ->releasepage on clean pages with +PagePrivate set. Clean pages without PagePrivate and with no external +references will be released without notice being given to the +address_space. + +To achieve this functionality, pages need to be placed on an LRU with +lru_cache_add and mark_page_active needs to be called whenever the +page is used. + +Pages are normally kept in a radix tree index by ->index. This tree +maintains information about the PG_Dirty and PG_Writeback status of +each page, so that pages with either of these flags can be found +quickly. + +The Dirty tag is primarily used by mpage_writepages - the default +->writepages method. It uses the tag to find dirty pages to call +->writepage on. If mpage_writepages is not used (i.e. the address +provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is +almost unused. write_inode_now and sync_inode do use it (through +__sync_single_inode) to check if ->writepages has been successful in +writing out the whole address_space. + +The Writeback tag is used by filemap*wait* and sync_page* functions, +via wait_on_page_writeback_range, to wait for all writeback to +complete. While waiting ->sync_page (if defined) will be called on +each page that is found to require writeback. + +An address_space handler may attach extra information to a page, +typically using the 'private' field in the 'struct page'. If such +information is attached, the PG_Private flag should be set. This will +cause various VM routines to make extra calls into the address_space +handler to deal with that data. + +An address space acts as an intermediate between storage and +application. Data is read into the address space a whole page at a +time, and provided to the application either by copying of the page, +or by memory-mapping the page. +Data is written into the address space by the application, and then +written-back to storage typically in whole pages, however the +address_space has finer control of write sizes. + +The read process essentially only requires 'readpage'. The write +process is more complicated and uses write_begin/write_end or +set_page_dirty to write data into the address_space, and writepage, +sync_page, and writepages to writeback data to storage. + +Adding and removing pages to/from an address_space is protected by the +inode's i_mutex. + +When data is written to a page, the PG_Dirty flag should be set. It +typically remains set until writepage asks for it to be written. This +should clear PG_Dirty and set PG_Writeback. It can be actually +written at any point after PG_Dirty is clear. Once it is known to be +safe, PG_Writeback is cleared. + +Writeback makes use of a writeback_control structure... + +struct address_space_operations +------------------------------- + +This describes how the VFS can manipulate mapping of a file to page cache in +your filesystem. As of kernel 2.6.22, the following members are defined: + +struct address_space_operations { + int (*writepage)(struct page *page, struct writeback_control *wbc); + int (*readpage)(struct file *, struct page *); + int (*sync_page)(struct page *); + int (*writepages)(struct address_space *, struct writeback_control *); + int (*set_page_dirty)(struct page *page); + int (*readpages)(struct file *filp, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages); + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); + sector_t (*bmap)(struct address_space *, sector_t); + int (*invalidatepage) (struct page *, unsigned long); + int (*releasepage) (struct page *, int); + ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, + loff_t offset, unsigned long nr_segs); + struct page* (*get_xip_page)(struct address_space *, sector_t, + int); + /* migrate the contents of a page to the specified target */ + int (*migratepage) (struct page *, struct page *); + int (*launder_page) (struct page *); +}; + + writepage: called by the VM to write a dirty page to backing store. + This may happen for data integrity reasons (i.e. 'sync'), or + to free up memory (flush). The difference can be seen in + wbc->sync_mode. + The PG_Dirty flag has been cleared and PageLocked is true. + writepage should start writeout, should set PG_Writeback, + and should make sure the page is unlocked, either synchronously + or asynchronously when the write operation completes. + + If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to + try too hard if there are problems, and may choose to write out + other pages from the mapping if that is easier (e.g. due to + internal dependencies). If it chooses not to start writeout, it + should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep + calling ->writepage on that page. + + See the file "Locking" for more details. + + readpage: called by the VM to read a page from backing store. + The page will be Locked when readpage is called, and should be + unlocked and marked uptodate once the read completes. + If ->readpage discovers that it needs to unlock the page for + some reason, it can do so, and then return AOP_TRUNCATED_PAGE. + In this case, the page will be relocated, relocked and if + that all succeeds, ->readpage will be called again. + + sync_page: called by the VM to notify the backing store to perform all + queued I/O operations for a page. I/O operations for other pages + associated with this address_space object may also be performed. + + This function is optional and is called only for pages with + PG_Writeback set while waiting for the writeback to complete. + + writepages: called by the VM to write out pages associated with the + address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then + the writeback_control will specify a range of pages that must be + written out. If it is WBC_SYNC_NONE, then a nr_to_write is given + and that many pages should be written if possible. + If no ->writepages is given, then mpage_writepages is used + instead. This will choose pages from the address space that are + tagged as DIRTY and will pass them to ->writepage. + + set_page_dirty: called by the VM to set a page dirty. + This is particularly needed if an address space attaches + private data to a page, and that data needs to be updated when + a page is dirtied. This is called, for example, when a memory + mapped page gets modified. + If defined, it should set the PageDirty flag, and the + PAGECACHE_TAG_DIRTY tag in the radix tree. + + readpages: called by the VM to read pages associated with the address_space + object. This is essentially just a vector version of + readpage. Instead of just one page, several pages are + requested. + readpages is only used for read-ahead, so read errors are + ignored. If anything goes wrong, feel free to give up. + + write_begin: + Called by the generic buffered write code to ask the filesystem to + prepare to write len bytes at the given offset in the file. The + address_space should check that the write will be able to complete, + by allocating space if necessary and doing any other internal + housekeeping. If the write will update parts of any basic-blocks on + storage, then those blocks should be pre-read (if they haven't been + read already) so that the updated blocks can be written out properly. + + The filesystem must return the locked pagecache page for the specified + offset, in *pagep, for the caller to write into. + + It must be able to cope with short writes (where the length passed to + write_begin is greater than the number of bytes copied into the page). + + flags is a field for AOP_FLAG_xxx flags, described in + include/linux/fs.h. + + A void * may be returned in fsdata, which then gets passed into + write_end. + + Returns 0 on success; < 0 on failure (which is the error code), in + which case write_end is not called. + + write_end: After a successful write_begin, and data copy, write_end must + be called. len is the original len passed to write_begin, and copied + is the amount that was able to be copied (copied == len is always true + if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag). + + The filesystem must take care of unlocking the page and releasing it + refcount, and updating i_size. + + Returns < 0 on failure, otherwise the number of bytes (<= 'copied') + that were able to be copied into pagecache. + + bmap: called by the VFS to map a logical block offset within object to + physical block number. This method is used by the FIBMAP + ioctl and for working with swap-files. To be able to swap to + a file, the file must have a stable mapping to a block + device. The swap system does not go through the filesystem + but instead uses bmap to find out where the blocks in the file + are and uses those addresses directly. + + + invalidatepage: If a page has PagePrivate set, then invalidatepage + will be called when part or all of the page is to be removed + from the address space. This generally corresponds to either a + truncation or a complete invalidation of the address space + (in the latter case 'offset' will always be 0). + Any private data associated with the page should be updated + to reflect this truncation. If offset is 0, then + the private data should be released, because the page + must be able to be completely discarded. This may be done by + calling the ->releasepage function, but in this case the + release MUST succeed. + + releasepage: releasepage is called on PagePrivate pages to indicate + that the page should be freed if possible. ->releasepage + should remove any private data from the page and clear the + PagePrivate flag. It may also remove the page from the + address_space. If this fails for some reason, it may indicate + failure with a 0 return value. + This is used in two distinct though related cases. The first + is when the VM finds a clean page with no active users and + wants to make it a free page. If ->releasepage succeeds, the + page will be removed from the address_space and become free. + + The second case is when a request has been made to invalidate + some or all pages in an address_space. This can happen + through the fadvice(POSIX_FADV_DONTNEED) system call or by the + filesystem explicitly requesting it as nfs and 9fs do (when + they believe the cache may be out of date with storage) by + calling invalidate_inode_pages2(). + If the filesystem makes such a call, and needs to be certain + that all pages are invalidated, then its releasepage will + need to ensure this. Possibly it can clear the PageUptodate + bit if it cannot free private data yet. + + direct_IO: called by the generic read/write routines to perform + direct_IO - that is IO requests which bypass the page cache + and transfer data directly between the storage and the + application's address space. + + get_xip_page: called by the VM to translate a block number to a page. + The page is valid until the corresponding filesystem is unmounted. + Filesystems that want to use execute-in-place (XIP) need to implement + it. An example implementation can be found in fs/ext2/xip.c. + + migrate_page: This is used to compact the physical memory usage. + If the VM wants to relocate a page (maybe off a memory card + that is signalling imminent failure) it will pass a new page + and an old page to this function. migrate_page should + transfer any private data across and update any references + that it has to the page. + + launder_page: Called before freeing a page - it writes back the dirty page. To + prevent redirtying the page, it is kept locked during the whole + operation. + +The File Object +=============== + +A file object represents a file opened by a process. + + +struct file_operations +---------------------- + +This describes how the VFS can manipulate an open file. As of kernel +2.6.22, the following members are defined: + +struct file_operations { + struct module *owner; + loff_t (*llseek) (struct file *, loff_t, int); + ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); + ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + int (*readdir) (struct file *, void *, filldir_t); + unsigned int (*poll) (struct file *, struct poll_table_struct *); + int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); + long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); + long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + int (*mmap) (struct file *, struct vm_area_struct *); + int (*open) (struct inode *, struct file *); + int (*flush) (struct file *); + int (*release) (struct inode *, struct file *); + int (*fsync) (struct file *, struct dentry *, int datasync); + int (*aio_fsync) (struct kiocb *, int datasync); + int (*fasync) (int, struct file *, int); + int (*lock) (struct file *, int, struct file_lock *); + ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); + ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); + ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); + ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); + unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); + int (*check_flags)(int); + int (*dir_notify)(struct file *filp, unsigned long arg); + int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); + ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); +}; + +Again, all methods are called without any locks being held, unless +otherwise noted. + + llseek: called when the VFS needs to move the file position index + + read: called by read(2) and related system calls + + aio_read: called by io_submit(2) and other asynchronous I/O operations + + write: called by write(2) and related system calls + + aio_write: called by io_submit(2) and other asynchronous I/O operations + + readdir: called when the VFS needs to read the directory contents + + poll: called by the VFS when a process wants to check if there is + activity on this file and (optionally) go to sleep until there + is activity. Called by the select(2) and poll(2) system calls + + ioctl: called by the ioctl(2) system call + + unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not + require the BKL should use this method instead of the ioctl() above. + + compat_ioctl: called by the ioctl(2) system call when 32 bit system calls + are used on 64 bit kernels. + + mmap: called by the mmap(2) system call + + open: called by the VFS when an inode should be opened. When the VFS + opens a file, it creates a new "struct file". It then calls the + open method for the newly allocated file structure. You might + think that the open method really belongs in + "struct inode_operations", and you may be right. I think it's + done the way it is because it makes filesystems simpler to + implement. The open() method is a good place to initialize the + "private_data" member in the file structure if you want to point + to a device structure + + flush: called by the close(2) system call to flush a file + + release: called when the last reference to an open file is closed + + fsync: called by the fsync(2) system call + + fasync: called by the fcntl(2) system call when asynchronous + (non-blocking) mode is enabled for a file + + lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW + commands + + readv: called by the readv(2) system call + + writev: called by the writev(2) system call + + sendfile: called by the sendfile(2) system call + + get_unmapped_area: called by the mmap(2) system call + + check_flags: called by the fcntl(2) system call for F_SETFL command + + dir_notify: called by the fcntl(2) system call for F_NOTIFY command + + flock: called by the flock(2) system call + + splice_write: called by the VFS to splice data from a pipe to a file. This + method is used by the splice(2) system call + + splice_read: called by the VFS to splice data from file to a pipe. This + method is used by the splice(2) system call + +Note that the file operations are implemented by the specific +filesystem in which the inode resides. When opening a device node +(character or block special) most filesystems will call special +support routines in the VFS which will locate the required device +driver information. These support routines replace the filesystem file +operations with those for the device driver, and then proceed to call +the new open() method for the file. This is how opening a device file +in the filesystem eventually ends up calling the device driver open() +method. + + +Directory Entry Cache (dcache) +============================== + + +struct dentry_operations +------------------------ + +This describes how a filesystem can overload the standard dentry +operations. Dentries and the dcache are the domain of the VFS and the +individual filesystem implementations. Device drivers have no business +here. These methods may be set to NULL, as they are either optional or +the VFS uses a default. As of kernel 2.6.22, the following members are +defined: + +struct dentry_operations { + int (*d_revalidate)(struct dentry *, struct nameidata *); + int (*d_hash) (struct dentry *, struct qstr *); + int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + int (*d_delete)(struct dentry *); + void (*d_release)(struct dentry *); + void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); +}; + + d_revalidate: called when the VFS needs to revalidate a dentry. This + is called whenever a name look-up finds a dentry in the + dcache. Most filesystems leave this as NULL, because all their + dentries in the dcache are valid + + d_hash: called when the VFS adds a dentry to the hash table + + d_compare: called when a dentry should be compared with another + + d_delete: called when the last reference to a dentry is + deleted. This means no-one is using the dentry, however it is + still valid and in the dcache + + d_release: called when a dentry is really deallocated + + d_iput: called when a dentry loses its inode (just prior to its + being deallocated). The default when this is NULL is that the + VFS calls iput(). If you define this method, you must call + iput() yourself + + d_dname: called when the pathname of a dentry should be generated. + Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + it's done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. CAUTION : d_path() logic is quite + tricky. The correct way to return for example "Hello" is to put it + at the end of the buffer, and returns a pointer to the first char. + dynamic_dname() helper function is provided to take care of this. + +Example : + +static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen) +{ + return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]", + dentry->d_inode->i_ino); +} + +Each dentry has a pointer to its parent dentry, as well as a hash list +of child dentries. Child dentries are basically like files in a +directory. + + +Directory Entry Cache API +-------------------------- + +There are a number of functions defined which permit a filesystem to +manipulate dentries: + + dget: open a new handle for an existing dentry (this just increments + the usage count) + + dput: close a handle for a dentry (decrements the usage count). If + the usage count drops to 0, the "d_delete" method is called + and the dentry is placed on the unused list if the dentry is + still in its parents hash list. Putting the dentry on the + unused list just means that if the system needs some RAM, it + goes through the unused list of dentries and deallocates them. + If the dentry has already been unhashed and the usage count + drops to 0, in this case the dentry is deallocated after the + "d_delete" method is called + + d_drop: this unhashes a dentry from its parents hash list. A + subsequent call to dput() will deallocate the dentry if its + usage count drops to 0 + + d_delete: delete a dentry. If there are no other open references to + the dentry then the dentry is turned into a negative dentry + (the d_iput() method is called). If there are other + references, then d_drop() is called instead + + d_add: add a dentry to its parents hash list and then calls + d_instantiate() + + d_instantiate: add a dentry to the alias hash list for the inode and + updates the "d_inode" member. The "i_count" member in the + inode structure should be set/incremented. If the inode + pointer is NULL, the dentry is called a "negative + dentry". This function is commonly called when an inode is + created for an existing negative dentry + + d_lookup: look up a dentry given its parent and path name component + It looks up the child of that given name from the dcache + hash table. If it is found, the reference count is incremented + and the dentry is returned. The caller must use d_put() + to free the dentry when it finishes using it. + +For further information on dentry locking, please refer to the document +Documentation/filesystems/dentry-locking.txt. + +Mount Options +============= + +Parsing options +--------------- + +On mount and remount the filesystem is passed a string containing a +comma separated list of mount options. The options can have either of +these forms: + + option + option=value + +The <linux/parser.h> header defines an API that helps parse these +options. There are plenty of examples on how to use it in existing +filesystems. + +Showing options +--------------- + +If a filesystem accepts mount options, it must define show_options() +to show all the currently active options. The rules are: + + - options MUST be shown which are not default or their values differ + from the default + + - options MAY be shown which are enabled by default or have their + default value + +Options used only internally between a mount helper and the kernel +(such as file descriptors), or which only have an effect during the +mounting (such as ones controlling the creation of a journal) are exempt +from the above rules. + +The underlying reason for the above rules is to make sure, that a +mount can be accurately replicated (e.g. umounting and mounting again) +based on the information found in /proc/mounts. + +A simple method of saving options at mount/remount time and showing +them is provided with the save_mount_options() and +generic_show_options() helper functions. Please note, that using +these may have drawbacks. For more info see header comments for these +functions in fs/namespace.c. + +Resources +========= + +(Note some of these resources are not up-to-date with the latest kernel + version.) + +Creating Linux virtual filesystems. 2002 + <http://lwn.net/Articles/13325/> + +The Linux Virtual File-system Layer by Neil Brown. 1999 + <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> + +A tour of the Linux VFS by Michael K. Johnson. 1996 + <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> + +A small trail through the Linux kernel by Andries Brouwer. 2001 + <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt new file mode 100644 index 0000000..0a1668b --- /dev/null +++ b/Documentation/filesystems/xfs.txt @@ -0,0 +1,261 @@ + +The SGI XFS Filesystem +====================== + +XFS is a high performance journaling filesystem which originated +on the SGI IRIX platform. It is completely multi-threaded, can +support large files and large filesystems, extended attributes, +variable block sizes, is extent based, and makes extensive use of +Btrees (directories, extents, free space) to aid both performance +and scalability. + +Refer to the documentation at http://oss.sgi.com/projects/xfs/ +for further details. This implementation is on-disk compatible +with the IRIX version of XFS. + + +Mount Options +============= + +When mounting an XFS filesystem, the following options are accepted. + + allocsize=size + Sets the buffered I/O end-of-file preallocation size when + doing delayed allocation writeout (default size is 64KiB). + Valid values for this option are page size (typically 4KiB) + through to 1GiB, inclusive, in power-of-2 increments. + + attr2/noattr2 + The options enable/disable (default is disabled for backward + compatibility on-disk) an "opportunistic" improvement to be + made in the way inline extended attributes are stored on-disk. + When the new form is used for the first time (by setting or + removing extended attributes) the on-disk superblock feature + bit field will be updated to reflect this format being in use. + + barrier + Enables the use of block layer write barriers for writes into + the journal and unwritten extent conversion. This allows for + drive level write caching to be enabled, for devices that + support write barriers. + + dmapi + Enable the DMAPI (Data Management API) event callouts. + Use with the "mtpt" option. + + grpid/bsdgroups and nogrpid/sysvgroups + These options define what group ID a newly created file gets. + When grpid is set, it takes the group ID of the directory in + which it is created; otherwise (the default) it takes the fsgid + of the current process, unless the directory has the setgid bit + set, in which case it takes the gid from the parent directory, + and also gets the setgid bit set if it is a directory itself. + + ihashsize=value + In memory inode hashes have been removed, so this option has + no function as of August 2007. Option is deprecated. + + ikeep/noikeep + When ikeep is specified, XFS does not delete empty inode clusters + and keeps them around on disk. ikeep is the traditional XFS + behaviour. When noikeep is specified, empty inode clusters + are returned to the free space pool. The default is noikeep for + non-DMAPI mounts, while ikeep is the default when DMAPI is in use. + + inode64 + Indicates that XFS is allowed to create inodes at any location + in the filesystem, including those which will result in inode + numbers occupying more than 32 bits of significance. This is + provided for backwards compatibility, but causes problems for + backup applications that cannot handle large inode numbers. + + largeio/nolargeio + If "nolargeio" is specified, the optimal I/O reported in + st_blksize by stat(2) will be as small as possible to allow user + applications to avoid inefficient read/modify/write I/O. + If "largeio" specified, a filesystem that has a "swidth" specified + will return the "swidth" value (in bytes) in st_blksize. If the + filesystem does not have a "swidth" specified but does specify + an "allocsize" then "allocsize" (in bytes) will be returned + instead. + If neither of these two options are specified, then filesystem + will behave as if "nolargeio" was specified. + + logbufs=value + Set the number of in-memory log buffers. Valid numbers range + from 2-8 inclusive. + The default value is 8 buffers for filesystems with a + blocksize of 64KiB, 4 buffers for filesystems with a blocksize + of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB + and 2 buffers for all other configurations. Increasing the + number of buffers may increase performance on some workloads + at the cost of the memory used for the additional log buffers + and their associated control structures. + + logbsize=value + Set the size of each in-memory log buffer. + Size may be specified in bytes, or in kilobytes with a "k" suffix. + Valid sizes for version 1 and version 2 logs are 16384 (16k) and + 32768 (32k). Valid sizes for version 2 logs also include + 65536 (64k), 131072 (128k) and 262144 (256k). + The default value for machines with more than 32MiB of memory + is 32768, machines with less memory use 16384 by default. + + logdev=device and rtdev=device + Use an external log (metadata journal) and/or real-time device. + An XFS filesystem has up to three parts: a data section, a log + section, and a real-time section. The real-time section is + optional, and the log section can be separate from the data + section or contained within it. + + mtpt=mountpoint + Use with the "dmapi" option. The value specified here will be + included in the DMAPI mount event, and should be the path of + the actual mountpoint that is used. + + noalign + Data allocations will not be aligned at stripe unit boundaries. + + noatime + Access timestamps are not updated when a file is read. + + norecovery + The filesystem will be mounted without running log recovery. + If the filesystem was not cleanly unmounted, it is likely to + be inconsistent when mounted in "norecovery" mode. + Some files or directories may not be accessible because of this. + Filesystems mounted "norecovery" must be mounted read-only or + the mount will fail. + + nouuid + Don't check for double mounted file systems using the file system uuid. + This is useful to mount LVM snapshot volumes. + + osyncisosync + Make O_SYNC writes implement true O_SYNC. WITHOUT this option, + Linux XFS behaves as if an "osyncisdsync" option is used, + which will make writes to files opened with the O_SYNC flag set + behave as if the O_DSYNC flag had been used instead. + This can result in better performance without compromising + data safety. + However if this option is not in effect, timestamp updates from + O_SYNC writes can be lost if the system crashes. + If timestamp updates are critical, use the osyncisosync option. + + uquota/usrquota/uqnoenforce/quota + User disk quota accounting enabled, and limits (optionally) + enforced. Refer to xfs_quota(8) for further details. + + gquota/grpquota/gqnoenforce + Group disk quota accounting enabled and limits (optionally) + enforced. Refer to xfs_quota(8) for further details. + + pquota/prjquota/pqnoenforce + Project disk quota accounting enabled and limits (optionally) + enforced. Refer to xfs_quota(8) for further details. + + sunit=value and swidth=value + Used to specify the stripe unit and width for a RAID device or + a stripe volume. "value" must be specified in 512-byte block + units. + If this option is not specified and the filesystem was made on + a stripe volume or the stripe width or unit were specified for + the RAID device at mkfs time, then the mount system call will + restore the value from the superblock. For filesystems that + are made directly on RAID devices, these options can be used + to override the information in the superblock if the underlying + disk layout changes after the filesystem has been created. + The "swidth" option is required if the "sunit" option has been + specified, and must be a multiple of the "sunit" value. + + swalloc + Data allocations will be rounded up to stripe width boundaries + when the current end of file is being extended and the file + size is larger than the stripe width size. + + +sysctls +======= + +The following sysctls are available for the XFS filesystem: + + fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1) + Setting this to "1" clears accumulated XFS statistics + in /proc/fs/xfs/stat. It then immediately resets to "0". + + fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) + The interval at which the xfssyncd thread flushes metadata + out to disk. This thread will flush log activity out, and + do some processing on unlinked inodes. + + fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) + The interval at which xfsbufd scans the dirty metadata buffers list. + + fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) + The age at which xfsbufd flushes dirty metadata buffers to disk. + + fs.xfs.error_level (Min: 0 Default: 3 Max: 11) + A volume knob for error reporting when internal errors occur. + This will generate detailed messages & backtraces for filesystem + shutdowns, for example. Current threshold values are: + + XFS_ERRLEVEL_OFF: 0 + XFS_ERRLEVEL_LOW: 1 + XFS_ERRLEVEL_HIGH: 5 + + fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127) + Causes certain error conditions to call BUG(). Value is a bitmask; + AND together the tags which represent errors which should cause panics: + + XFS_NO_PTAG 0 + XFS_PTAG_IFLUSH 0x00000001 + XFS_PTAG_LOGRES 0x00000002 + XFS_PTAG_AILDELETE 0x00000004 + XFS_PTAG_ERROR_REPORT 0x00000008 + XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 + XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 + XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 + + This option is intended for debugging only. + + fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1) + Controls whether symlinks are created with mode 0777 (default) + or whether their mode is affected by the umask (irix mode). + + fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1) + Controls files created in SGID directories. + If the group ID of the new file does not match the effective group + ID or one of the supplementary group IDs of the parent dir, the + ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl + is set. + + fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) + Controls whether unprivileged users can use chown to "give away" + a file to another user. + + fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "sync" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodump" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "noatime" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nosymlinks" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) + In "inode32" allocation mode, this option determines how many + files the allocator attempts to allocate in the same allocation + group before moving to the next allocation group. The intent + is to control the rate at which the allocator moves between + allocation groups when allocating extents for new files. diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt new file mode 100644 index 0000000..0466ee5 --- /dev/null +++ b/Documentation/filesystems/xip.txt @@ -0,0 +1,68 @@ +Execute-in-place for file mappings +---------------------------------- + +Motivation +---------- +File mappings are performed by mapping page cache pages to userspace. In +addition, read&write type file operations also transfer data from/to the page +cache. + +For memory backed storage devices that use the block device interface, the page +cache pages are in fact copies of the original storage. Various approaches +exist to work around the need for an extra copy. The ramdisk driver for example +does read the data into the page cache, keeps a reference, and discards the +original data behind later on. + +Execute-in-place solves this issue the other way around: instead of keeping +data in the page cache, the need to have a page cache copy is eliminated +completely. With execute-in-place, read&write type operations are performed +directly from/to the memory backed storage device. For file mappings, the +storage device itself is mapped directly into userspace. + +This implementation was initially written for shared memory segments between +different virtual machines on s390 hardware to allow multiple machines to +share the same binaries and libraries. + +Implementation +-------------- +Execute-in-place is implemented in three steps: block device operation, +address space operation, and file operations. + +A block device operation named direct_access is used to retrieve a +reference (pointer) to a block on-disk. The reference is supposed to be +cpu-addressable, physical address and remain valid until the release operation +is performed. A struct block_device reference is used to address the device, +and a sector_t argument is used to identify the individual block. As an +alternative, memory technology devices can be used for this. + +The block device operation is optional, these block devices support it as of +today: +- dcssblk: s390 dcss block device driver + +An address space operation named get_xip_mem is used to retrieve references +to a page frame number and a kernel address. To obtain these values a reference +to an address_space is provided. This function assigns values to the kmem and +pfn parameters. The third argument indicates whether the function should allocate +blocks if needed. + +This address space operation is mutually exclusive with readpage&writepage that +do page cache read/write operations. +The following filesystems support it as of today: +- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt + +A set of file operations that do utilize get_xip_page can be found in +mm/filemap_xip.c . The following file operation implementations are provided: +- aio_read/aio_write +- readv/writev +- sendfile + +The generic file operations do_sync_read/do_sync_write can be used to implement +classic synchronous IO calls. + +Shortcomings +------------ +This implementation is limited to storage devices that are cpu addressable at +all times (no highmem or such). It works well on rom/ram, but enhancements are +needed to make it work with flash in read+write mode. +Putting the Linux kernel and/or its modules on a xip filesystem does not mean +they are not copied. |