op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	mm: remove remaining __FUNCTION__ occurrences	Harvey Harrison	2008-04-30	3	-6/+6
\| \| \| \| \| \| \| \|	__FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	brd: modify ramdisk device to be able to manage partitions	Laurent Vivier	2008-04-30	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds partition management for Block RAM Device (BRD). This patch is done to keep in sync BRD and loop device drivers. This patch adds a parameter to the module, max_part, to specify the maximum number of partitions per RAM device. Example: # modprobe brd max_part=63 # ls -l /dev/ram* brw-rw---- 1 root disk 1, 0 2008-04-03 13:39 /dev/ram0 brw-rw---- 1 root disk 1, 64 2008-04-03 13:39 /dev/ram1 brw-rw---- 1 root disk 1, 640 2008-04-03 13:39 /dev/ram10 brw-rw---- 1 root disk 1, 704 2008-04-03 13:39 /dev/ram11 brw-rw---- 1 root disk 1, 768 2008-04-03 13:39 /dev/ram12 brw-rw---- 1 root disk 1, 832 2008-04-03 13:39 /dev/ram13 brw-rw---- 1 root disk 1, 896 2008-04-03 13:39 /dev/ram14 brw-rw---- 1 root disk 1, 960 2008-04-03 13:39 /dev/ram15 brw-rw---- 1 root disk 1, 128 2008-04-03 13:39 /dev/ram2 brw-rw---- 1 root disk 1, 192 2008-04-03 13:39 /dev/ram3 brw-rw---- 1 root disk 1, 256 2008-04-03 13:39 /dev/ram4 brw-rw---- 1 root disk 1, 320 2008-04-03 13:39 /dev/ram5 brw-rw---- 1 root disk 1, 384 2008-04-03 13:39 /dev/ram6 brw-rw---- 1 root disk 1, 448 2008-04-03 13:39 /dev/ram7 brw-rw---- 1 root disk 1, 512 2008-04-03 13:39 /dev/ram8 brw-rw---- 1 root disk 1, 576 2008-04-03 13:39 /dev/ram9 # fdisk /dev/ram0 Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): o Building a new DOS disklabel. Changes will remain in memory only, until you decide to write them. After that, of course, the previous content won't be recoverable. Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite) Command (m for help): n Command action e extended p primary partition (1-4) p Partition number (1-4): 1 First cylinder (1-2, default 1): 1 Last cylinder or +size or +sizeM or +sizeK (1-2, default 2): 2 Command (m for help): w The partition table has been altered! Calling ioctl() to re-read partition table. Syncing disks. # ls -l /dev/ram0* brw-rw---- 1 root disk 1, 0 2008-04-03 13:40 /dev/ram0 brw-rw---- 1 root disk 1, 1 2008-04-03 13:40 /dev/ram0p1 # mkfs /dev/ram0p1 mke2fs 1.40-WIP (14-Nov-2006) Filesystem label= OS type: Linux Block size=1024 (log=0) Fragment size=1024 (log=0) 4016 inodes, 16032 blocks 801 blocks (5.00%) reserved for the super user First data block=1 Maximum filesystem blocks=16515072 2 block groups 8192 blocks per group, 8192 fragments per group 2008 inodes per group Superblock backups stored on blocks: 8193 Writing inode tables: done Writing superblocks and filesystem accounting information: done This filesystem will be automatically checked every 26 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override. # mount /dev/ram0p1 /mnt df /mnt Filesystem 1K-blocks Used Available Use% Mounted on /dev/ram0p1 15521 138 14582 1% /mnt # ls -l /mnt total 12 drwx------ 2 root root 12288 2008-04-03 13:41 lost+found # umount /mnt # rmmod brd Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	add hrtimer specific debugobjects code	Thomas Gleixner	2008-04-30	3	-23/+186
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	hrtimers have now dynamic users in the network code. Put them under debugobjects surveillance as well. Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	debugobjects: add timer specific object debugging code	Thomas Gleixner	2008-04-30	6	-13/+187
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add calls to the generic object debugging infrastructure and provide fixup functions which allow to keep the system alive when recoverable problems have been detected by the object debugging core code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	debugobjects: add documentation	Thomas Gleixner	2008-04-30	2	-1/+392
\| \| \| \| \| \| \| \| \| \| \| \|	Add a DocBook for debugobjects. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	infrastructure to debug (dynamic) objects	Thomas Gleixner	2008-04-30	10	-4/+1030
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can see an ever repeating problem pattern with objects of any kind in the kernel: 1) freeing of active objects 2) reinitialization of active objects Both problems can be hard to debug because the crash happens at a point where we have no chance to decode the root cause anymore. One problem spot are kernel timers, where the detection of the problem often happens in interrupt context and usually causes the machine to panic. While working on a timer related bug report I had to hack specialized code into the timer subsystem to get a reasonable hint for the root cause. This debug hack was fine for temporary use, but far from a mergeable solution due to the intrusiveness into the timer code. The code further lacked the ability to detect and report the root cause instantly and keep the system operational. Keeping the system operational is important to get hold of the debug information without special debugging aids like serial consoles and special knowledge of the bug reporter. The problems described above are not restricted to timers, but timers tend to expose it usually in a full system crash. Other objects are less explosive, but the symptoms caused by such mistakes can be even harder to debug. Instead of creating specialized debugging code for the timer subsystem a generic infrastructure is created which allows developers to verify their code and provides an easy to enable debug facility for users in case of trouble. The debugobjects core code keeps track of operations on static and dynamic objects by inserting them into a hashed list and sanity checking them on object operations and provides additional checks whenever kernel memory is freed. The tracked object operations are: - initializing an object - adding an object to a subsystem list - deleting an object from a subsystem list Each operation is sanity checked before the operation is executed and the subsystem specific code can provide a fixup function which allows to prevent the damage of the operation. When the sanity check triggers a warning message and a stack trace is printed. The list of operations can be extended if the need arises. For now it's limited to the requirements of the first user (timers). The core code enqueues the objects into hash buckets. The hash index is generated from the address of the object to simplify the lookup for the check on kfree/vfree. Each bucket has it's own spinlock to avoid contention on a global lock. The debug code can be compiled in without being active. The runtime overhead is minimal and could be optimized by asm alternatives. A kernel command line option enables the debugging code. Thanks to Ingo Molnar for review, suggestions and cleanup patches. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Greg KH <greg@kroah.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	slab: add a flag to prevent debug_free checks on a kmem_cache	Thomas Gleixner	2008-04-30	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	This is a preperatory patch for the debugobjects infrastructure. The flag prevents debug_free checks on kmem_caches. This is necessary to avoid resursive calls into a debug mechanism which uses a kmem_cache itself. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	drivers: replace remaining __FUNCTION__ occurrences	Harvey Harrison	2008-04-30	9	-18/+18
\| \| \| \| \| \| \| \| \|	__FUNCTION__ is gcc-specific, use __func__ Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Add macros similar to min/max/min_t/max_t	Harvey Harrison	2008-04-30	3	-25/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also, change the variable names used in the min/max macros to avoid shadowed variable warnings when min/max min_t/max_t are nested. Small formatting changes to make all the macros have a similar form. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix v4l build] Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Jeff Garzik <jeff@garzik.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Michael Buesch <mb@bu3sch.de> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	alloc_uid: cleanup	Andrew Morton	2008-04-30	1	-16/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use kmem_cache_zalloc(), remove large amounts of initialisation code and ifdeffery. Note: this assumes that memset(*atomic_t, 0) correctly initialises the atomic_t. This is true for all present archtiectures and if it becomes false for a future architecture then we'll need to make large changes all over the place anyway. Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfsplus: fix warning with 64k PAGE_SIZE	Andrew Morton	2008-04-30	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \|	fs/hfsplus/btree.c: In function 'hfsplus_bmap_alloc': fs/hfsplus/btree.c:239: warning: comparison is always false due to limited range of data type But this might hide a real bug? Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfs: fix warning with 64k PAGE_SIZE	Andrew Morton	2008-04-30	1	-3/+7
\| \| \| \| \| \| \| \| \| \| \|	fs/hfs/btree.c: In function 'hfs_bmap_alloc': fs/hfs/btree.c:263: warning: comparison is always false due to limited range of data type The patch makes the warning go away, but the code might actually be buggy? Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	printk: don't read beyond string arguments' terminating zero	Markus Armbruster	2008-04-30	1	-1/+1
\| \| \| \| \| \| \| \| \|	Fix update_console_cmdline() not to to read beyond the terminating zero of its name argument. Signed-off-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Basic braille screen reader support	Samuel Thibault	2008-04-30	15	-30/+538
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a minimalistic braille screen reader support. This is meant to be used by blind people e.g. on boot failures or when / cannot be mounted etc and thus the userland screen readers can not work. [akpm@linux-foundation.org: fix exports] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Cc: Jiri Kosina <jikos@jikos.cz> Cc: Dmitry Torokhov <dtor@mail.ru> Acked-by: Alan Cox <alan@redhat.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	asm-*/futex.h should include linux/uaccess.h	Jeff Dike	2008-04-30	8	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Lots of asm-*/futex.h call pagefault_enable and pagefault_disable, which are declared in linux/uaccess.h, without including linux/uaccess.h. They all include asm/uaccess.h, so this patch replaces asm/uaccess.h with linux/uaccess.h. Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	sysv: [bl]e*_add_cpu conversion	Marcin Slusarz	2008-04-30	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	replace all: big/little_endian_variable = cpu_to_[bl]eX([bl]eX_to_cpu(big/little_endian_variable) + expression_in_cpu_byteorder); with: [bl]eX_add_cpu(&big/little_endian_variable, expression_in_cpu_byteorder); generated with semantic patch Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	quota: le*_add_cpu conversion	Marcin Slusarz	2008-04-30	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	replace all: little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) + expression_in_cpu_byteorder); with: leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder); generated with semantic patch Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hfs/hfsplus: be*_add_cpu conversion	Marcin Slusarz	2008-04-30	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	replace all: big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) + expression_in_cpu_byteorder); with: beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder); generated with semantic patch Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	affs: be*_add_cpu conversion	Marcin Slusarz	2008-04-30	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	replace all: big_endian_variable = cpu_to_beX(beX_to_cpu(big_endian_variable) + expression_in_cpu_byteorder); with: beX_add_cpu(&big_endian_variable, expression_in_cpu_byteorder); generated with semantic patch Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	reiserfs: use open_bdev_excl	Christoph Hellwig	2008-04-30	2	-28/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the proper helper to open a blockdevice by name for filesystem use, this makes sure it's properly claimed (also added for open-by-number) and gets rid of the struct file abuse. Tested by mounting a reiserfs filesystem with external journal. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jeff Mahoney <jeffm@suse.com> Acked-by: Edward Shishkin <edward.shishkin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: fix sparse warnings	Miklos Szeredi	2008-04-30	1	-1/+3
\| \| \| \| \| \| \| \| \| \|	fs/fuse/dev.c:306:2: warning: context imbalance in 'wait_answer_interruptible' - unexpected unlock fs/fuse/dev.c:361:2: warning: context imbalance in 'request_wait_answer' - unexpected unlock fs/fuse/dev.c:1002:4: warning: context imbalance in 'end_io_requests' - unexpected unlock Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: fix race in llseek	Miklos Szeredi	2008-04-30	1	-2/+27
\| \| \| \| \| \| \| \| \| \| \| \|	Fuse doesn't use i_mutex to protect setting i_size, and so generic_file_llseek() can be racy: it doesn't use i_size_read(). So do a fuse specific llseek method, which does use i_size_read(). [akpm@linux-foundation.org: make `retval' loff_t] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: fix node ID type	Miklos Szeredi	2008-04-30	2	-6/+6
\| \| \| \| \| \| \| \| \|	Node ID is 64bit but it is passed as unsigned long to some functions. This breakage wasn't noticed, because libfuse uses unsigned long too. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: fix max i/o size calculation	Miklos Szeredi	2008-04-30	2	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \|	Fix a bug that Werner Baumann reported: fuse can send a bigger write request than the maximum specified. This only affected direct_io operation. In addition set a sane minimum for the max_read and max_write tunables, so I/O always makes some progress. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: update file size on short read	Miklos Szeredi	2008-04-30	3	-8/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the READ request returned a short count, then either - cached size is incorrect - filesystem is buggy, as short reads are only allowed on EOF So assume that the size is wrong and refresh it, so that cached read() doesn't zero fill the missing chunk. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: implement perform_write	Nick Piggin	2008-04-30	1	-1/+193
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce fuse_perform_write. With fusexmp (a passthrough filesystem), large (1MB) writes into a backing tmpfs filesystem are sped up by almost 4 times (256MB/s vs 71MB/s). [mszeredi@suse.cz]: - split into smaller functions - testing - duplicate generic_file_aio_write(), so that there's no need to add a new ->perform_write() a_op. Comment from hch. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: clean up setting i_size in write	Miklos Szeredi	2008-04-30	1	-13/+15
\| \| \| \| \| \| \| \| \|	Extract common code for setting i_size in write functions into a common helper. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fuse: support writable mmap	Miklos Szeredi	2008-04-30	5	-29/+481
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Quoting Linus (3 years ago, FUSE inclusion discussions): "User-space filesystems are hard to get right. I'd claim that they are almost impossible, unless you limit them somehow (shared writable mappings are the nastiest part - if you don't have those, you can reasonably limit your problems by limiting the number of dirty pages you accept through normal "write()" calls)." Instead of attempting the impossible, I've just waited for the dirty page accounting infrastructure to materialize (thanks to Peter Zijlstra and others). This nicely solved the biggest problem: limiting the number of pages used for write caching. Some small details remained, however, which this largish patch attempts to address. It provides a page writeback implementation for fuse, which is completely safe against VM related deadlocks. Performance may not be very good for certain usage patterns, but generally it should be acceptable. It has been tested extensively with fsx-linux and bash-shared-mapping. Fuse page writeback design -------------------------- fuse_writepage() allocates a new temporary page with GFP_NOFS\|__GFP_HIGHMEM. It copies the contents of the original page, and queues a WRITE request to the userspace filesystem using this temp page. The writeback is finished instantly from the MM's point of view: the page is removed from the radix trees, and the PageDirty and PageWriteback flags are cleared. For the duration of the actual write, the NR_WRITEBACK_TEMP counter is incremented. The per-bdi writeback count is not decremented until the actual write completes. On dirtying the page, fuse waits for a previous write to finish before proceeding. This makes sure, there can only be one temporary page used at a time for one cached page. This approach is wasteful in both memory and CPU bandwidth, so why is this complication needed? The basic problem is that there can be no guarantee about the time in which the userspace filesystem will complete a write. It may be buggy or even malicious, and fail to complete WRITE requests. We don't want unrelated parts of the system to grind to a halt in such cases. Also a filesystem may need additional resources (particularly memory) to complete a WRITE request. There's a great danger of a deadlock if that allocation may wait for the writepage to finish. Currently there are several cases where the kernel can block on page writeback: - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER - page migration - throttle_vm_writeout (through NR_WRITEBACK) - sync(2) Of course in some cases (fsync, msync) we explicitly want to allow blocking. So for these cases new code has to be added to fuse, since the VM is not tracking writeback pages for us any more. As an extra safetly measure, the maximum dirty ratio allocated to a single fuse filesystem is set to 1% by default. This way one (or several) buggy or malicious fuse filesystems cannot slow down the rest of the system by hogging dirty memory. With appropriate privileges, this limit can be raised through '/sys/class/bdi/<bdi>/max_ratio'. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: document missing fields for /proc/meminfo	Miklos Szeredi	2008-04-30	1	-4/+17
\| \| \| \| \| \| \| \| \|	A few fields in /proc/meminfo were not documented. Fix. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: Add NR_WRITEBACK_TEMP counter	Miklos Szeredi	2008-04-30	5	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fuse will use temporary buffers to write back dirty data from memory mappings (normal writes are done synchronously). This is needed, because there cannot be any guarantee about the time in which a write will complete. By using temporary buffers, from the MM's point if view the page is written back immediately. If the writeout was due to memory pressure, this effectively migrates data from a full zone to a less full zone. This patch adds a new counter (NR_WRITEBACK_TEMP) for the number of pages used as temporary buffers. [Lee.Schermerhorn@hp.com: add vmstat_text for NR_WRITEBACK_TEMP] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: export bdi_writeout_inc()	Miklos Szeredi	2008-04-30	2	-0/+12
\| \| \| \| \| \| \| \| \|	Fuse needs this for writable mmap support. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: add separate writeback accounting capability	Miklos Szeredi	2008-04-30	10	-30/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a new BDI capability flag: BDI_CAP_NO_ACCT_WB. If this flag is set, then don't update the per-bdi writeback stats from test_set_page_writeback() and test_clear_page_writeback(). Misc cleanups: - convert bdi_cap_writeback_dirty() and friends to static inline functions - create a flag that includes all three dirty/writeback related flags, since almst all users will want to have them toghether Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: move statistics to debugfs	Miklos Szeredi	2008-04-30	3	-48/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move BDI statistics to debugfs: /sys/kernel/debug/bdi/<bdi>/stats Use postcore_initcall() to initialize the sysfs class and debugfs, because debugfs is initialized in core_initcall(). Update descriptions in ABI documentation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: allow setting a maximum for the bdi dirty limit	Peter Zijlstra	2008-04-30	6	-13/+111
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add "max_ratio" to /sys/class/bdi. This indicates the maximum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in max_ratio_store(). - export bdi_set_max_ratio() to modules - limit bdi_dirty with bdi->max_ratio - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: allow setting a minimum for the bdi dirty limit	Peter Zijlstra	2008-04-30	4	-1/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Under normal circumstances each device is given a part of the total write-back cache that relates to its current avg writeout speed in relation to the other devices. min_ratio - allows one to assign a minimum portion of the write-back cache to a particular device. This is useful in situations where you might want to provide a minimum QoS. (One request for this feature came from flash based storage people who wanted to avoid writing out at all costs - they of course needed some pdflush hacks as well) max_ratio - allows one to assign a maximum portion of the dirty limit to a particular device. This is useful in situations where you want to avoid one device taking all or most of the write-back cache. Eg. an NFS mount that is prone to get stuck, or a FUSE mount which you don't trust to play fair. Add "min_ratio" to /sys/class/bdi. This indicates the minimum percentage of the global dirty threshold allocated to this bdi. [mszeredi@suse.cz] - fix parsing in min_ratio_store() - document new sysfs attribute Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: expose the BDI object in sysfs for FUSE	Miklos Szeredi	2008-04-30	3	-18/+18
\| \| \| \| \| \| \| \| \| \| \| \| \|	Register FUSE's backing_dev_info under sysfs with the name "fuse-MAJOR:MINOR" Make the fuse control filesystem use s_dev instead of a fuse specific ID. This makes it easier to match directories under /sys/fs/fuse/connections/ with directories under /sys/class/bdi, and with actual mounts. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: expose the BDI object in sysfs for NFS	Miklos Szeredi	2008-04-30	1	-0/+26
\| \| \| \| \| \| \| \| \| \| \|	Register NFS' backing_dev_info under sysfs with the name "nfs-MAJOR:MINOR" Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mm: bdi: export BDI attributes in sysfs	Peter Zijlstra	2008-04-30	8	-2/+194
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object. This allows us to see and set the various BDI specific variables. In particular this properly exposes the read-ahead window for all relevant users and /sys/block/<block>/queue/read_ahead_kb should be deprecated. With patient help from Kay Sievers and Greg KH [mszeredi@suse.cz] - split off NFS and FUSE changes into separate patches - document new sysfs attributes under Documentation/ABI - do bdi_class_init as a core_initcall, otherwise the "default" BDI won't be initialized - remove bdi_init_fmt macro, it's not used very much [akpm@linux-foundation.org: fix ia64 warning] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Greg KH <greg@kroah.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pidns: make pid->level and pid_ns->level unsigned	Pavel Emelyanov	2008-04-30	3	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These values represent the nesting level of a namespace and pids living in it, and it's always non-negative. Turning this from int to unsigned int saves some space in pid.c (11 bytes on x86 and 64 on ia64) by letting the compiler optimize the pid_nr_ns a bit. E.g. on ia64 this removes the sign extension calls, which compiler adds to optimize access to pid->nubers[ns->level]. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	make marker_debug static	Adrian Bunk	2008-04-30	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	With the needlessly global marker_debug being static gcc can optimize the unused code away. Signed-off-by: Adrian Bunk <bunk@kernel.org> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	mxser: convert large macros to functions	Christoph Hellwig	2008-04-30	2	-152/+214
\| \| \| \| \| \| \|	Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: sys_getpgid: fix unsafe *pid usage, s/tasklist/rcu/	Oleg Nesterov	2008-04-30	1	-15/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. sys_getpgid() needs rcu_read_lock() to derive the pgrp _nr, even if the task is current, otherwise we can race with another thread which does sys_setpgid(). 2. Use rcu_read_lock() instead of tasklist_lock when pid != 0, make sure that we don't use the NULL pid if the task exits right after successful find_task_by_vpid(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: sys_getsid: fix unsafe *pid usage, fix possible 0 instead of -ESRCH	Oleg Nesterov	2008-04-30	1	-13/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. sys_getsid() needs rcu_read_lock() to derive the session _nr, even if the task is current, otherwise we can race with another thread which does sys_setsid(). 2. The task can exit between find_task_by_vpid() and task_session_vnr(), in that unlikely case sys_getsid() returns 0 instead of -ESRCH. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: __set_special_pids: use change_pid() helper	Oleg Nesterov	2008-04-30	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use change_pid() instead of detach_pid() + attach_pid() in __set_special_pids(). This way task_session() is not NULL in between. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: sys_setpgid: use change_pid() helper	Oleg Nesterov	2008-04-30	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Use change_pid() instead of detach_pid() + attach_pid() in sys_setpgid(). This way task_pgrp() is not NULL in between. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: introduce change_pid() helper	Oleg Nesterov	2008-04-30	2	-7/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on Eric W. Biederman's idea. Without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with setprgp()/setsid() which does detach_pid() + attach_pid(). This can happen even if task == current. Intoduce the new helper, change_pid(), which should be used instead. This way the caller always sees the special pid != NULL, either old or new. Also change the prototype of attach_pid(), it always returns 0 and nobody check the returned value. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	pids: de_thread: don't clear session/pgrp pids for the old leader	Oleg Nesterov	2008-04-30	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Based on Eric W. Biederman's idea. Unless task == current, without tasklist_lock held task_session()/task_pgrp() can return NULL if the caller races with de_thread() which switches the group leader. Change transfer_pid() to not clear old->pids[type].pid for the old leader. This means that its .pid can point to "nowhere", but this is already true for sub-threads, and the old leader is not group_leader() any longer. IOW, with or without this change we can't trust task's special pids unless it is the group leader. With this change the following code rcu_read_lock(); task = find_task_by_xxx(); do_something(task_pgrp(task), task_session(task)); rcu_read_unlock(); can't race with exec and hit the NULL pid. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Deprecate find_task_by_pid()	Pavel Emelyanov	2008-04-30	5	-9/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There are some places that are known to operate on tasks' global pids only: * the rest_init() call (called on boot) * the kgdb's getthread * the create_kthread() (since the kthread is run in init ns) So use the find_task_by_pid_ns(..., &init_pid_ns) there and schedule the find_task_by_pid for removal. [sukadev@us.ibm.com: Fix warning in kernel/pid.c] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Use find_task_by_vpid in taskstats	Pavel Emelyanov	2008-04-30	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The pid to lookup a task by is passed inside taskstats code via genetlink message. Since netlink packets are now processed in the context of the sending task, this is correct to lookup the task with find_task_by_vpid() here. Besides, I fix the call to fill_pid() from taskstats_exit(), since the tsk->pid is not required in fill_pid() in this case, and the pid field on task_struct is going to be deprecated as well. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Cc: Jonathan Lim <jlim@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	free_pidmap: turn it into free_pidmap(struct upid *)	Oleg Nesterov	2008-04-30	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The callers of free_pidmap() pass 2 members of "struct upid", we can just pass "struct upid *" instead. Shaves off 10 bytes from pid.o. Also, simplify the alloc_pid's "out_free:" error path a little bit. This way it looks more clear which subset of pid->numbers[] we are freeing. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc :Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>