path: root/fs
Commit message  [Author, Age, Files, Lines]
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client  [Linus Torvalds, 2011-05-25, 6 files, -41/+65]
|\
| |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits)
| |     ceph: fix cap flush race reentrancy
| |     libceph: subscribe to osdmap when cluster is full
| |     libceph: handle new osdmap down/state change encoding
| |     rbd: handle online resize of underlying rbd image
| |     ceph: avoid inode lookup on nfs fh reconnect
| |     ceph: use LOOKUPINO to make unconnected nfs fh more reliable
| |     rbd: use snprintf for disk->disk_name
| |     rbd: cleanup: make kfree match kmalloc
| |     rbd: warn on update_snaps failure on notify
| |     ceph: check return value for start_request in writepages
| |     ceph: remove useless check
| |     libceph: add missing breaks in addr_set_port
| |     libceph: fix TAG_WAIT case
| |     ceph: fix broken comparison in readdir loop
| |     libceph: fix osdmap timestamp assignment
| |     ceph: fix rare potential cap leak
| |     libceph: use snprintf for unknown addrs
| |     libceph: use snprintf for formatting object name
| |     ceph: use snprintf for dirstat content
| |     libceph: fix uninitialized value when no get_authorizer method is set
| |     ...
| * ceph: fix cap flush race reentrancy  [Sage Weil, 2011-05-24, 3 files, -29/+31]
| |   In e9964c10 we changed cap flushing to do a delicate dance because some inodes on the cap_dirty list could be in a migrating state (got EXPORT but not IMPORT) in which we couldn't actually flush and move from dirty->flushing, breaking the while (!empty) { process first } loop structure. It worked for a single sync thread, but was not reentrant and triggered infinite loops when multiple syncers came along.
| |
| |   Instead, move inodes with dirty caps to a separate cap_dirty_migrating list when in the limbo export-but-no-import state, allowing us to go back to the simple loop structure (which was reentrant). This is cleaner and more robust.
| |
| |   Audited the cap_dirty users and this looks fine: list_empty(&ci->i_dirty_item) is still a reliable indicator of whether we have dirty caps (which list we're on is irrelevant) and list_del_init() calls still do the right thing.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
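The simplified loop structure described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `struct demo_inode`, `begin_migration()`, and `flush_dirty()` are hypothetical names, the list helpers merely imitate the kernel's `list_head` API, and all locking is omitted. It is not the ceph implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

/* Unlink an entry and leave it pointing at itself (i.e. "on no list"). */
static void list_del_init(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    list_init(e);
}

static void list_add_tail(struct list_head *e, struct list_head *h)
{
    e->prev = h->prev;
    e->next = h;
    h->prev->next = e;
    h->prev = e;
}

/* Hypothetical stand-in for an inode; dirty_item must stay the first member. */
struct demo_inode {
    struct list_head dirty_item;
};

/* Export seen, import not yet seen: park a dirty inode on the migrating list. */
static void begin_migration(struct demo_inode *in, struct list_head *migrating)
{
    if (!list_empty(&in->dirty_item)) {
        list_del_init(&in->dirty_item);
        list_add_tail(&in->dirty_item, migrating);
    }
}

/* The simple loop: every pass removes the first entry, so it always terminates. */
static int flush_dirty(struct list_head *dirty)
{
    int flushed = 0;

    while (!list_empty(dirty)) {
        /* Valid cast because dirty_item is the first member of demo_inode. */
        struct demo_inode *in = (struct demo_inode *)dirty->next;

        list_del_init(&in->dirty_item);   /* dirty -> flushing */
        flushed++;
    }
    return flushed;
}
```

Because migrating inodes never sit on the dirty list in this sketch, the loop can always process the first entry, and `list_empty(&in->dirty_item)` still answers "is this inode dirty?" whichever list it is on.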
| * ceph: avoid inode lookup on nfs fh reconnect  [Sage Weil, 2011-05-24, 1 file, -2/+6]
| |   If we get the inode from the MDS, we have a reference in req; don't do a fresh lookup.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: use LOOKUPINO to make unconnected nfs fh more reliable  [Sage Weil, 2011-05-24, 1 file, -2/+17]
| |   If we are unable to locate an inode by ino, ask the MDS using the new LOOKUPINO command.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: check return value for start_request in writepages  [Sage Weil, 2011-05-19, 1 file, -1/+2]
| |   Since we pass the nofail arg, we should never get an error; BUG if we do. (And fix the function to not return an error if __map_request fails.)
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: remove useless check  [Sage Weil, 2011-05-19, 1 file, -2/+0]
| |   rc is only ever 0 or negative in this method.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix broken comparison in readdir loop  [Sage Weil, 2011-05-19, 1 file, -1/+1]
| |   Both off and fi->offset are unsigned, so the difference is always >= 0. Compare them directly instead of the sign of the difference.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix rare potential cap leak  [Sage Weil, 2011-05-19, 1 file, -1/+2]
| |   If we grab new_cap, retake the lock, and find we already have a cap now for the given mds, release new_cap.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: use snprintf for dirstat content  [Sage Weil, 2011-05-19, 1 file, -2/+3]
| |   We allocate a buffer for rstats if the dirstat option is enabled. Use snprintf.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
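As a rough illustration of why a bounded formatter is preferred for a fixed-size buffer, here is a small C sketch. The function name `format_dirstat()` and the output fields are made up for the example and are not taken from the ceph source; only the standard `snprintf()` behaviour is relied on.

```c
#include <stdio.h>

/*
 * Format two counters into buf without ever writing more than len bytes.
 * snprintf() returns the length the full string would have had, so a
 * return value >= len means the output was truncated.
 * Returns 1 if everything fit, 0 on truncation or error.
 */
static int format_dirstat(char *buf, size_t len,
                          unsigned long files, unsigned long subdirs)
{
    int n = snprintf(buf, len, "files: %lu\nsubdirs: %lu\n", files, subdirs);

    return n >= 0 && (size_t)n < len;
}
```

With an unbounded `sprintf()` the same call would overrun a buffer that is too small; here the result is merely cut short and remains NUL-terminated.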
| * libceph: remove unused variable  [Sage Weil, 2011-05-19, 1 file, -2/+0]
| |   Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: take reference on mds request r_unsafe_dir  [Sage Weil, 2011-05-19, 1 file, -0/+4]
| |   We put ourselves on an inode list for the parent directory of metadata operations so that an fsync on the directory will wait for metadata updates to commit to disk. We weren't holding a reference to that directory, however, and under certain workloads (fsstress in this case) the directory can go away.
| |
| |   Signed-off-by: Sage Weil <sage@newdream.net>
* | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs  [Linus Torvalds, 2011-05-25, 2 files, -8/+8]
|\ \
| | |   * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
| | |     9p: update Documentation pointers
| | |     net/9p: enable 9p to work in non-default network namespace
| | |     net/9p: p9_idpool_get return -1 on error
| | |     fs/9p: Don't clunk dentry fid when we fail to get a writeback inode
| | |     9p: Small cleanup in <net/9p/9p.h>
| | |     9p: remove experimental tag from tested configurations
| | |     9p: typo fixes and minor cleanups
| | |     net/9p: Change linuxdoc names to match functions.
| * | fs/9p: Don't clunk dentry fid when we fail to get a writeback inode  [Aneesh Kumar K.V, 2011-05-25, 1 file, -5/+6]
| | |   The dentry fid gets clunked via the dput.
| | |
| | |   Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
| | |   Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
| * | 9p: remove experimental tag from tested configurations  [Eric Van Hensbergen, 2011-05-25, 1 file, -3/+2]
| | |   The 9p client is currently undergoing regular regression and stress testing as a by-product of the virtfs work. I think it's finally time to take off the experimental tags from the well-tested code paths.
| | |
| | |   Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* | | Merge branch 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block  [Linus Torvalds, 2011-05-25, 1 file, -18/+15]
|\ \ \
| | | |   * 'for-2.6.40/splice' of git://git.kernel.dk/linux-2.6-block:
| | | |     splice: add wakeup_pipe_readers()
| * | | splice: add wakeup_pipe_readers()  [Namhyung Kim, 2011-05-23, 1 file, -18/+15]
| | |/
| |/|
| | |   Add and use wakeup_pipe_readers() to consolidate duplicated code.
| | |
| | |   Signed-off-by: Namhyung Kim <namhyung@gmail.com>
| | |   Cc: Jens Axboe <axboe@kernel.dk>
| | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | | Merge branch 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block  [Linus Torvalds, 2011-05-25, 2 files, -10/+15]
|\ \ \
| | | |   * 'for-2.6.40/core' of git://git.kernel.dk/linux-2.6-block: (40 commits)
| | | |     cfq-iosched: free cic_index if cfqd allocation fails
| | | |     cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()
| | | |     cfq-iosched: reduce bit operations in cfq_choose_req()
| | | |     cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()
| | | |     blk-cgroup: Initialize ioc->cgroup_changed at ioc creation time
| | | |     block: move bd_set_size() above rescan_partitions() in __blkdev_get()
| | | |     block: call elv_bio_merged() when merged
| | | |     cfq-iosched: Make IO merge related stats per cpu
| | | |     cfq-iosched: Fix a memory leak of per cpu stats for root group
| | | |     backing-dev: Kill set but not used var in bdi_debug_stats_show()
| | | |     block: get rid of on-stack plugging debug checks
| | | |     blk-throttle: Make no throttling rule group processing lockless
| | | |     blk-cgroup: Make cgroup stat reset path blkg->lock free for dispatch stats
| | | |     blk-cgroup: Make 64bit per cpu stats safe on 32bit arch
| | | |     blk-throttle: Make dispatch stats per cpu
| | | |     blk-throttle: Free up a group only after one rcu grace period
| | | |     blk-throttle: Use helper function to add root throtl group to lists
| | | |     blk-throttle: Introduce a helper function to fill in device details
| | | |     blk-throttle: Dynamically allocate root group
| | | |     blk-cgroup: Allow sleeping while dynamically allocating a group
| | | |     ...
| * | | block: move bd_set_size() above rescan_partitions() in __blkdev_get()  [Tejun Heo, 2011-05-23, 1 file, -8/+9]
| | | |   02e352287a4 (block: rescan partitions on invalidated devices on -ENOMEDIA too) relocated partition rescan above explicit bd_set_size() to simplify condition check. As rescan_partitions() does its own bdev size setting, this doesn't break anything; however, rescan_partitions() prints out the following messages when adjusting bdev size, which can be confusing.
| | | |
| | | |     sda: detected capacity change from 0 to 146815737856
| | | |     sdb: detected capacity change from 0 to 146815737856
| | | |
| | | |   This patch restores the original order and removes the warning messages.
| | | |
| | | |   stable: Please apply together with 02e352287a4 (block: rescan partitions on invalidated devices on -ENOMEDIA too).
| | | |
| | | |   Signed-off-by: Tejun Heo <tj@kernel.org>
| | | |   Reported-by: Tony Luck <tony.luck@gmail.com>
| | | |   Tested-by: Tony Luck <tony.luck@gmail.com>
| | | |   Cc: stable@kernel.org
| | | |   Stable note: 2.6.39 only.
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | Merge commit 'v2.6.39' into for-2.6.40/core  [Jens Axboe, 2011-05-20, 92 files, -1383/+1660]
| |\ \ \
| | |/ /
| | | |   Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've had churn in the core area that makes it difficult to handle patches for e.g. cfq or blk-throttle. Instead of requiring that they be based in older versions with bugs that have been fixed later in the rc cycle, merge in 2.6.39 final.
| | | |
| | | |   Also fixes up conflicts in the below files.
| | | |
| | | |   Conflicts:
| | | |     drivers/block/paride/pcd.c
| | | |     drivers/cdrom/viocd.c
| | | |     drivers/ide/ide-cd.c
| | | |
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | fs: fixup warning part_discard_alignment_show()  [Jens Axboe, 2011-05-09, 1 file, -2/+1]
| | | |   Stephen reports:
| | | |
| | | |   -----
| | | |   After merging the block tree, today's linux-next build (x86_64 allmodconfig) produced this warning:
| | | |
| | | |     fs/partitions/check.c: In function 'part_discard_alignment_show':
| | | |     fs/partitions/check.c:263: warning: format '%u' expects type 'unsigned int', but argument 3 has type 'long long unsigned int'
| | | |
| | | |   Introduced by commit ("block: Remove extra discard_alignment from hd_struct")
| | | |   -----
| | | |
| | | |   Fix it up by just removing the cast, we return an int already.
| | | |
| | | |   Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | block: Remove extra discard_alignment from hd_struct.  [Tao Ma, 2011-05-06, 1 file, -3/+6]
| | | |   Currently, hd_struct.discard_alignment is only used when we show /sys/block/sdx/sdx/discard_alignment. So remove it and calculate it when it is asked to be shown.
| | | |
| | | |   Signed-off-by: Tao Ma <boyu.mt@taobao.com>
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | block: don't block events on excl write for non-optical devices  [Tejun Heo, 2011-04-21, 1 file, -7/+10]
| | | |   Disk event code automatically blocks events on excl write. This is primarily to avoid issuing polling commands while burning is in progress. This behavior doesn't fit other types of devices with removable media where polling commands don't have adverse side effects and door locking usually doesn't exist.
| | | |
| | | |   This patch introduces a new genhd flag which controls the auto-blocking behavior and uses it to enable auto-blocking only on optical devices.
| | | |
| | | |   Note for stable: 2.6.38 and later only
| | | |
| | | |   Cc: stable@kernel.org
| | | |   Signed-off-by: Tejun Heo <tj@kernel.org>
| | | |   Reported-by: Kay Sievers <kay.sievers@vrfy.org>
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * | | block: rescan partitions on invalidated devices on -ENOMEDIA too  [Tejun Heo, 2011-04-21, 1 file, -9/+18]
| | | |   __blkdev_get() doesn't rescan partitions if disk->fops->open() fails, which leads to ghost partition devices lingering after medium removal is known to both the kernel and userland. The behavior also creates a subtle inconsistency where O_NONBLOCK open, which doesn't fail even if there's no medium, clears the ghost partitions, which is exploited to work around the problem from userland.
| | | |
| | | |   Fix it by updating __blkdev_get() to issue partition rescan after -ENOMEDIA too.
| | | |
| | | |   This was reported in the following bz.
| | | |
| | | |     https://bugzilla.kernel.org/show_bug.cgi?id=13029
| | | |
| | | |   Note for stable: 2.6.38 and later only
| | | |
| | | |   Cc: stable@kernel.org
| | | |   Signed-off-by: Tejun Heo <tj@kernel.org>
| | | |   Reported-by: David Zeuthen <zeuthen@gmail.com>
| | | |   Reported-by: Martin Pitt <martin.pitt@ubuntu.com>
| | | |   Reported-by: Kay Sievers <kay.sievers@vrfy.org>
| | | |   Tested-by: Kay Sievers <kay.sievers@vrfy.org>
| | | |   Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
| | | |   Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* | | | fs/ncpfs/inode.c: suppress used-uninitialised warning  [Andrew Morton, 2011-05-25, 1 file, -3/+1]
| | | |   We get this spurious warning:
| | | |
| | | |     fs/ncpfs/inode.c: In function 'ncp_fill_super':
| | | |     fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[1u]' may be used uninitialized in this function
| | | |     fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[2u]' may be used uninitialized in this function
| | | |     fs/ncpfs/inode.c:451: warning: 'data.mounted_vol[3u]' may be used uninitialized in this function
| | | |     ...
| | | |
| | | |   It's notabug, but we can easily fix it with a memset().
| | | |
| | | |   Reported-by: Harry Wei <jiaweiwei.xiyou@gmail.com>
| | | |   Cc: Petr Vandrovec <petr@vandrovec.name>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fscache: remove dead code under CONFIG_WORKQUEUE_DEBUGFS  [Amerigo Wang, 2011-05-25, 2 files, -23/+0]
| | | |   There is no CONFIG_WORKQUEUE_DEBUGFS any more, so this code is dead.
| | | |
| | | |   Signed-off-by: WANG Cong <amwang@redhat.com>
| | | |   Cc: David Howells <dhowells@redhat.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: allocate storage for numa_maps statistics once  [Stephen Wilson, 2011-05-25, 1 file, -9/+27]
| | | |   In show_numa_map() we collect statistics into a numa_maps structure. Since the number of NUMA nodes can be very large, this structure is not a candidate for stack allocation.
| | | |
| | | |   Instead of going through a kmalloc()+kfree() cycle each time show_numa_map() is invoked, perform the allocation just once when /proc/pid/numa_maps is opened.
| | | |
| | | |   Performing the allocation when numa_maps is opened, and thus before a reference to the target task's mm is taken, eliminates a potential stalemate condition in the oom-killer as originally described by Hugh Dickins:
| | | |
| | | |     ... imagine what happens if the system is out of memory, and the mm we're looking at is selected for killing by the OOM killer: while we wait in __get_free_page for more memory, no memory is freed from the selected mm because it cannot reach exit_mmap while we hold that reference.
| | | |
| | | |   Signed-off-by: Stephen Wilson <wilsons@start.ca>
| | | |   Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: David Rientjes <rientjes@google.com>
| | | |   Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
| | | |   Cc: Alexey Dobriyan <adobriyan@gmail.com>
| | | |   Cc: Christoph Lameter <cl@linux-foundation.org>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: make struct proc_maps_private truly private  [Stephen Wilson, 2011-05-25, 1 file, -0/+8]
| | | |   Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps there is no need to export struct proc_maps_private to the world. Move it to fs/proc/internal.h instead.
| | | |
| | | |   Signed-off-by: Stephen Wilson <wilsons@start.ca>
| | | |   Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: David Rientjes <rientjes@google.com>
| | | |   Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
| | | |   Cc: Alexey Dobriyan <adobriyan@gmail.com>
| | | |   Cc: Christoph Lameter <cl@linux-foundation.org>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: proc: move show_numa_map() to fs/proc/task_mmu.c  [Stephen Wilson, 2011-05-25, 1 file, -2/+182]
| | | |   Moving show_numa_map() from mempolicy.c to task_mmu.c solves several issues.
| | | |
| | | |   - Having the show() operation "miles away" from the corresponding seq_file iteration operations is a maintenance burden.
| | | |   - The need to export ad hoc info like struct proc_maps_private is eliminated.
| | | |   - The implementation of show_numa_map() can be improved in a simple manner by cooperating with the other seq_file operations (start, stop, etc) -- something that would be messy to do without this change.
| | | |
| | | |   Signed-off-by: Stephen Wilson <wilsons@start.ca>
| | | |   Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: David Rientjes <rientjes@google.com>
| | | |   Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
| | | |   Cc: Alexey Dobriyan <adobriyan@gmail.com>
| | | |   Cc: Christoph Lameter <cl@linux-foundation.org>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | tmpfs: implement generic xattr support  [Eric Paris, 2011-05-25, 1 file, -1/+17]
| | | |   Implement generic xattrs for tmpfs filesystems. The Fedora project, while trying to replace suid apps with file capabilities, realized that tmpfs, which is used on the build systems, does not support file capabilities and thus cannot be used to build packages which use file capabilities. Xattrs are also needed for overlayfs.
| | | |
| | | |   The xattr interface is a bit odd. If a filesystem does not implement any {get,set,list}xattr functions the VFS will call into some random LSM hooks and the running LSM can then implement some method for handling xattrs. SELinux for example provides a method to support security.selinux but no other security.* xattrs.
| | | |
| | | |   As it stands today when one enables CONFIG_TMPFS_POSIX_ACL tmpfs will have xattr handler routines specifically to handle acls. Because of this tmpfs would lose the VFS/LSM helpers to support the running LSM. To make up for that tmpfs had stub functions that did nothing but call into the LSM hooks which implement the helpers.
| | | |
| | | |   This new patch does not use the LSM fallback functions and instead just implements a native get/set/list xattr feature for the full security.* and trusted.* namespace like a normal filesystem. This means that tmpfs can now support both security.selinux and security.capability, which was not previously possible.
| | | |
| | | |   The basic implementation is that I attach a:
| | | |
| | | |     struct shmem_xattr {
| | | |         struct list_head list; /* anchored by shmem_inode_info->xattr_list */
| | | |         char *name;
| | | |         size_t size;
| | | |         char value[0];
| | | |     };
| | | |
| | | |   into the struct shmem_inode_info for each xattr that is set. This implementation could easily support the user.* namespace as well, except some care needs to be taken to prevent large amounts of unswappable memory being allocated for unprivileged users.
| | | |
| | | |   [mszeredi@suse.cz: new config option, support trusted.*, support symlinks]
| | | |   Signed-off-by: Eric Paris <eparis@redhat.com>
| | | |   Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| | | |   Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
| | | |   Tested-by: Serge Hallyn <serge.hallyn@ubuntu.com>
| | | |   Cc: Kyle McMartin <kyle@mcmartin.ca>
| | | |   Acked-by: Hugh Dickins <hughd@google.com>
| | | |   Tested-by: Jordi Pujol <jordipujolp@gmail.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
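To make the per-inode xattr list idea more concrete, here is a user-space C sketch of a comparable structure. It is an illustration only: `struct demo_xattr`, `demo_xattr_set()`, and `demo_xattr_get()` are hypothetical names, a plain singly linked list stands in for the kernel's `list_head`, `malloc()` stands in for kernel allocation, and a C99 flexible array member replaces the `value[0]` idiom quoted in the commit message.

```c
#include <stdlib.h>
#include <string.h>

/* One node per attribute; the value bytes are stored inline after the header. */
struct demo_xattr {
    struct demo_xattr *next;
    char *name;
    size_t size;
    char value[];   /* flexible array member */
};

/* Allocate a node holding a private copy of name and value; NULL on failure. */
static struct demo_xattr *demo_xattr_new(const char *name,
                                         const void *value, size_t size)
{
    struct demo_xattr *x = malloc(sizeof(*x) + size);

    if (!x)
        return NULL;
    x->name = malloc(strlen(name) + 1);
    if (!x->name) {
        free(x);
        return NULL;
    }
    strcpy(x->name, name);
    x->size = size;
    memcpy(x->value, value, size);
    x->next = NULL;
    return x;
}

/* Insert a new attribute or replace an existing one. Returns 0 or -1. */
static int demo_xattr_set(struct demo_xattr **head, const char *name,
                          const void *value, size_t size)
{
    struct demo_xattr *x = demo_xattr_new(name, value, size);
    struct demo_xattr **p;

    if (!x)
        return -1;
    for (p = head; *p; p = &(*p)->next) {
        if (strcmp((*p)->name, name) == 0) {
            struct demo_xattr *old = *p;

            x->next = old->next;   /* splice the new node in place of the old */
            *p = x;
            free(old->name);
            free(old);
            return 0;
        }
    }
    x->next = *head;               /* not found: push onto the front */
    *head = x;
    return 0;
}

/* Look an attribute up by name; NULL if it is not set. */
static const struct demo_xattr *demo_xattr_get(const struct demo_xattr *head,
                                               const char *name)
{
    for (; head; head = head->next)
        if (strcmp(head->name, name) == 0)
            return head;
    return NULL;
}
```

Since every value lives in ordinary allocated memory, the sketch also hints at the concern raised in the commit message: without limits, an unprivileged caller could pin an arbitrary amount of memory this way.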
* | | | vmscan: change shrinker API by passing shrink_control struct  [Ying Han, 2011-05-25, 13 files, -26/+49]
| | | |   Change each shrinker's API by consolidating the existing parameters into shrink_control struct. This will simplify any further features added w/o touching each file of shrinker.
| | | |
| | | |   [akpm@linux-foundation.org: fix build]
| | | |   [akpm@linux-foundation.org: fix warning]
| | | |   [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
| | | |   [akpm@linux-foundation.org: fix xfs warning]
| | | |   [akpm@linux-foundation.org: update gfs2]
| | | |   Signed-off-by: Ying Han <yinghan@google.com>
| | | |   Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Minchan Kim <minchan.kim@gmail.com>
| | | |   Acked-by: Pavel Emelyanov <xemul@openvz.org>
| | | |   Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | |   Cc: Mel Gorman <mel@csn.ul.ie>
| | | |   Acked-by: Rik van Riel <riel@redhat.com>
| | | |   Cc: Johannes Weiner <hannes@cmpxchg.org>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: Dave Hansen <dave@linux.vnet.ibm.com>
| | | |   Cc: Steven Whitehouse <swhiteho@redhat.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
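The design choice here, bundling callback arguments into one struct so that new fields can be added without touching every callback signature, can be sketched in user-space C. Everything below is hypothetical: `struct demo_shrink_control`, its two fields, `struct demo_cache`, and `demo_cache_shrink()` are stand-ins chosen for illustration, and the commit message itself does not list the fields of the real struct.

```c
#include <stddef.h>

/*
 * Hypothetical parameter bundle. Adding a field later does not change the
 * signature of any callback that receives a pointer to this struct.
 */
struct demo_shrink_control {
    unsigned int gfp_mask;      /* allocation context flags (assumed field) */
    unsigned long nr_to_scan;   /* how many objects to try to free (assumed field) */
};

/* Hypothetical cache that only tracks how many objects it holds. */
struct demo_cache {
    unsigned long nr_objects;
};

/*
 * Shrinker-style callback: free up to sc->nr_to_scan objects and return
 * the number that remain. With nr_to_scan == 0 it only reports the count.
 */
static unsigned long demo_cache_shrink(struct demo_cache *c,
                                       const struct demo_shrink_control *sc)
{
    unsigned long n = sc->nr_to_scan < c->nr_objects ? sc->nr_to_scan
                                                     : c->nr_objects;

    c->nr_objects -= n;
    return c->nr_objects;
}
```

A caller fills in one struct and hands the same pointer to every registered callback, which is the property the commit message says will simplify later additions.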
* | | | vmscan: change shrink_slab() interfaces by passing shrink_control  [Ying Han, 2011-05-25, 1 file, -1/+5]
| | | |   Consolidate the existing parameters to shrink_slab() into a new shrink_control struct. This is needed later to pass the same struct to shrinkers.
| | | |
| | | |   Signed-off-by: Ying Han <yinghan@google.com>
| | | |   Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Minchan Kim <minchan.kim@gmail.com>
| | | |   Acked-by: Pavel Emelyanov <xemul@openvz.org>
| | | |   Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | |   Cc: Mel Gorman <mel@csn.ul.ie>
| | | |   Acked-by: Rik van Riel <riel@redhat.com>
| | | |   Cc: Johannes Weiner <hannes@cmpxchg.org>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: Dave Hansen <dave@linux.vnet.ibm.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: Convert i_mmap_lock to a mutex  [Peter Zijlstra, 2011-05-25, 2 files, -3/+3]
| | | |   Straightforward conversion of i_mmap_lock to a mutex.
| | | |
| | | |   Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
| | | |   Acked-by: Hugh Dickins <hughd@google.com>
| | | |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| | | |   Cc: David Miller <davem@davemloft.net>
| | | |   Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
| | | |   Cc: Russell King <rmk@arm.linux.org.uk>
| | | |   Cc: Paul Mundt <lethal@linux-sh.org>
| | | |   Cc: Jeff Dike <jdike@addtoit.com>
| | | |   Cc: Richard Weinberger <richard@nod.at>
| | | |   Cc: Tony Luck <tony.luck@intel.com>
| | | |   Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | |   Cc: Mel Gorman <mel@csn.ul.ie>
| | | |   Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Nick Piggin <npiggin@kernel.dk>
| | | |   Cc: Namhyung Kim <namhyung@gmail.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: Remove i_mmap_lock lockbreak  [Peter Zijlstra, 2011-05-25, 1 file, -1/+0]
| | | |   Hugh says:
| | | |
| | | |     "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? "
| | | |
| | | |   Counter points:
| | | |
| | | |   - cpu contention makes the spin stop (need_resched())
| | | |   - zap pages should be freeing pages at a higher rate than reclaim ever can
| | | |
| | | |   I think the simplification of the truncate code is definitely worth it.
| | | |
| | | |   Effectively reverts: 2aa15890f3c ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem.
| | | |
| | | |   Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
| | | |   Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | |   Cc: Hugh Dickins <hughd@google.com>
| | | |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| | | |   Cc: David Miller <davem@davemloft.net>
| | | |   Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
| | | |   Cc: Russell King <rmk@arm.linux.org.uk>
| | | |   Cc: Paul Mundt <lethal@linux-sh.org>
| | | |   Cc: Jeff Dike <jdike@addtoit.com>
| | | |   Cc: Richard Weinberger <richard@nod.at>
| | | |   Cc: Tony Luck <tony.luck@intel.com>
| | | |   Cc: Mel Gorman <mel@csn.ul.ie>
| | | |   Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Nick Piggin <npiggin@kernel.dk>
| | | |   Cc: Namhyung Kim <namhyung@gmail.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: mmu_gather rework  [Peter Zijlstra, 2011-05-25, 1 file, -5/+5]
| | | |   Rework the existing mmu_gather infrastructure.
| | | |
| | | |   The direct purpose of these patches was to allow preemptible mmu_gather, but even without that I think these patches provide an improvement to the status quo.
| | | |
| | | |   The first 9 patches rework the mmu_gather infrastructure. For review purposes I've split them into generic and per-arch patches with the last of those a generic cleanup.
| | | |
| | | |   The next patch provides generic RCU page-table freeing, and the followup is a patch converting s390 to use this. I've also got 4 patches from DaveM lined up (not included in this series) that use this to implement gup_fast() for sparc64.
| | | |
| | | |   Then there is one patch that extends the generic mmu_gather batching.
| | | |
| | | |   After that follow the mm preemptibility patches, these make part of the mm a lot more preemptible. It converts i_mmap_lock and anon_vma->lock to mutexes which together with the mmu_gather rework makes mmu_gather preemptible as well.
| | | |
| | | |   Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
| | | |
| | | |   This also allows for preemptible mmu_notifiers, something that XPMEM I think wants.
| | | |
| | | |   Furthermore, it removes the new and universally detested unmap_mutex.
| | | |
| | | |   This patch:
| | | |
| | | |   Remove the first obstacle towards a fully preemptible mmu_gather.
| | | |
| | | |   The current scheme assumes mmu_gather is always done with preemption disabled and uses per-cpu storage for the page batches. Change this to try and allocate a page for batching and in case of failure, use a small on-stack array to make some progress.
| | | |
| | | |   Preemptible mmu_gather is desired in general and usable once i_mmap_lock becomes a mutex. Doing it before the mutex conversion saves us from having to rework the code by moving the mmu_gather bits inside the pte_lock.
| | | |
| | | |   Also avoid flushing the tlb batches from under the pte lock, this is useful even without the i_mmap_lock conversion as it significantly reduces pte lock hold times.
| | | |
| | | |   [akpm@linux-foundation.org: fix comment tpyo]
| | | |   Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
| | | |   Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| | | |   Cc: David Miller <davem@davemloft.net>
| | | |   Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
| | | |   Cc: Russell King <rmk@arm.linux.org.uk>
| | | |   Cc: Paul Mundt <lethal@linux-sh.org>
| | | |   Cc: Jeff Dike <jdike@addtoit.com>
| | | |   Cc: Richard Weinberger <richard@nod.at>
| | | |   Cc: Tony Luck <tony.luck@intel.com>
| | | |   Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
| | | |   Acked-by: Hugh Dickins <hughd@google.com>
| | | |   Acked-by: Mel Gorman <mel@csn.ul.ie>
| | | |   Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
| | | |   Cc: Nick Piggin <npiggin@kernel.dk>
| | | |   Cc: Namhyung Kim <namhyung@gmail.com>
| | | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: make expand_downwards() symmetrical with expand_upwards()  [Michal Hocko, 2011-05-25, 1 file, -1/+1]
| |_|/
|/| |
| | |   Currently we have expand_upwards exported while expand_downwards is accessible only via expand_stack or expand_stack_downwards.
| | |
| | |   check_stack_guard_page is a nice example of the asymmetry. It uses expand_stack for VM_GROWSDOWN while expand_upwards is called for VM_GROWSUP case.
| | |
| | |   Let's clean this up by exporting both functions and make those names consistent. Let's use expand_{upwards,downwards} because expanding doesn't always involve stack manipulation (an example is ia64_do_page_fault which uses expand_upwards for registers backing store expansion). expand_downwards has to be defined for both CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards version in the early process initialization phase for growsup configuration.
| | |
| | |   Signed-off-by: Michal Hocko <mhocko@suse.cz>
| | |   Acked-by: Hugh Dickins <hughd@google.com>
| | |   Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
| | |   Cc: "Luck, Tony" <tony.luck@intel.com>
| | |   Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| | |   Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6  [Linus Torvalds, 2011-05-24, 6 files, -32/+91]
|\ \ \
| | | |   * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
| | | |     jbd: Fix comment to match the code in journal_start()
| | | |     jbd/jbd2: remove obsolete summarise_journal_usage.
| | | |     jbd: Fix forever sleeping process in do_get_write_access()
| | | |     ext2: fix error msg when mounting fs with too-large blocksize
| | | |     jbd: fix fsync() tid wraparound bug
| | | |     ext3: Fix fs corruption when make_indexed_dir() fails
| | | |     ext3: Fix lock inversion in ext3_symlink()
| * | | jbd: Fix comment to match the code in journal_start()  [Eryu Guan, 2011-05-24, 1 file, -1/+2]
| | | |   journal_start returns an ERR_PTR() value rather than NULL on failure.
| | | |
| | | |   Cc: Jan Kara <jack@suse.cz>
| | | |   Signed-off-by: Eryu Guan <guaneryu@gmail.com>
| | | |   Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd/jbd2: remove obsolete summarise_journal_usage.  [Tao Ma, 2011-05-17, 2 files, -12/+0]
| | | |   summarise_journal_usage seems to have been obsolete for a long time, so remove it.
| | | |
| | | |   Cc: Jan Kara <jack@suse.cz>
| | | |   Signed-off-by: Tao Ma <boyu.mt@taobao.com>
| | | |   Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: Fix forever sleeping process in do_get_write_access()Jan Kara2011-05-171-2/+7
| | | | In do_get_write_access() we wait on BH_Unshadow bit for buffer to get from shadow state. The waking code in journal_commit_transaction() has a bug because it does not issue a memory barrier after the buffer is moved from the shadow state and before wake_up_bit() is called. Thus a waitqueue check can happen before the buffer is actually moved from the shadow state and waiting process may never be woken. Fix the problem by issuing proper barrier. CC: stable@kernel.org Reported-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext2: fix error msg when mounting fs with too-large blocksizeRobin Dong2011-05-171-1/+2
| | | | When ext2 mounts a filesystem, it attempts to set the block device blocksize with a call to sb_set_blocksize, which can fail for several reasons. The current failure message in ext2 prints: EXT2-fs (loop1): error: blocksize is too small which is not correct in all cases. This can be demonstrated by creating a filesystem with # mkfs.ext2 -b 8192 on a 4k page system, and attempting to mount it. Change the error message to a more generic: EXT2-fs (loop1): bad blocksize 8192 to match the error message in ext3. Signed-off-by: Robin Dong <sanbai@taobao.com> Reviewed-by: Coly Li <bosong.ly@taobao.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | jbd: fix fsync() tid wraparound bugTed Ts'o2011-05-171-3/+13
| | | | If an application program does not make any changes to the indirect blocks or extent tree, i_datasync_tid will not get updated. If there are enough commits (i.e., 2**31) such that tid_geq()'s calculations wrap, and there isn't a currently active transaction at the time of the fdatasync() call, this can end up triggering a BUG_ON in fs/jbd/commit.c: J_ASSERT(journal->j_running_transaction != NULL); It's pretty rare that this can happen, since it requires the use of fdatasync() plus *very* frequent and excessive use of fsync(). But with the right workload, it can. We fix this by replacing the use of tid_geq() with an equality test, since there's only one transaction id that is valid for us to start: namely, the currently running transaction (if it exists). CC: stable@kernel.org Reported-by: Martin_Zielinski@McAfee.com Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
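The wraparound described above can be illustrated with a small userspace sketch. It assumes 32-bit transaction ids compared through a signed difference, which is the idea behind a wrap-aware "greater or equal" test; the names `tid_sketch_t`, `tid_geq_sketch`, `start_by_geq_sketch` and `start_by_equality_sketch` are hypothetical and are not the jbd functions themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: transaction ids are 32-bit sequence numbers that wrap. */
typedef uint32_t tid_sketch_t;

/* Wrap-aware "x >= y": only meaningful while x and y are less than 2**31 apart. */
static int tid_geq_sketch(tid_sketch_t x, tid_sketch_t y)
{
    int32_t difference = (int32_t)(uint32_t)(x - y);

    return difference >= 0;
}

/* Ordering-based decision: treat target as startable if it is not older than running. */
static int start_by_geq_sketch(tid_sketch_t running, tid_sketch_t target)
{
    return tid_geq_sketch(target, running);
}

/* Equality-based decision: only the currently running id is startable. */
static int start_by_equality_sketch(tid_sketch_t running, tid_sketch_t target)
{
    return target == running;
}

/* Self-check: a stale id more than 2**31 commits behind looks "newer" to the ordering test. */
static void tid_wrap_selfcheck_sketch(void)
{
    tid_sketch_t stale = 5;
    tid_sketch_t running = stale + 0x80000001u;

    assert(start_by_geq_sketch(running, stale) == 1);      /* misjudged as startable */
    assert(start_by_equality_sketch(running, stale) == 0); /* correctly rejected */
}
```

Once a recorded id falls more than 2**31 commits behind, the signed difference changes sign and the stale id compares as if it were ahead of the running one. An equality test has no such window, which is the property the fix relies on.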
| * | | ext3: Fix fs corruption when make_indexed_dir() failsJan Kara2011-05-171-3/+12
| | | | When make_indexed_dir() fails (e.g. because of ENOSPC) after it has allocated a block for the index tree root, we did not properly mark all changed buffers dirty. This led to only some of these buffers being written out and thus effectively corrupting the directory. Fix the issue by marking all changed data dirty even in the failure case. CC: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
| * | | ext3: Fix lock inversion in ext3_symlink()Jan Kara2011-04-291-11/+56
| | | | ext3_symlink() cannot call __page_symlink() with transaction open. __page_symlink() calls ext3_write_begin() which gets page lock which ranks above transaction start (thus lock ordering is violated) and also ext3_write_begin() waits for a transaction commit when we run out of space which never happens if we hold transaction open. Fix the problem by stopping a transaction before calling __page_symlink() (we have to be careful and put inode to orphan list so that it gets deleted in case of crash) and starting another one after __page_symlink() returns for addition of symlink into a directory. Signed-off-by: Jan Kara <jack@suse.cz>
* | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-248-49/+219
|\ \ \ \ git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: dlm: make plock operation killable dlm: remove shared message stub for recovery dlm: delayed reply message warning dlm: Remove superfluous call to recalc_sigpending()
| * | | | dlm: make plock operation killableDavid Teigland2011-05-231-4/+61
| | | | | Allow processes blocked on plock requests to be interrupted when they are killed. This leaves the problem of cleaning up the lock state in userspace. This has three parts: 1. Add a flag to unlock operations sent to userspace indicating the file is being closed. Userspace will then look for and clear any waiting plock operations that were abandoned by an interrupted process. 2. Queue an unlock-close operation (like in 1) to clean up userspace from an interrupted plock request. This is needed because the vfs will not send a cleanup-unlock if it sees no locks on the file, which it won't if the interrupted operation was the only one. 3. Do not use replies from userspace for unlock-close operations because they are unnecessary (they are just cleaning up for the process which did not make an unlock call). This also simplifies the new unlock-close generated from point 2. Signed-off-by: David Teigland <teigland@redhat.com>
| * | | | dlm: remove shared message stub for recoveryDavid Teigland2011-04-052-33/+50
| | | | | kmalloc a stub message struct during recovery instead of sharing the struct in the lockspace. This leaves the lockspace stub_ms only for faking downconvert replies, where it is never modified and sharing is not a problem. Also improve the debug messages in the same recovery function. Signed-off-by: David Teigland <teigland@redhat.com>
| * | | | dlm: delayed reply message warningDavid Teigland2011-04-016-11/+108
| | | | | Add an option (disabled by default) to print a warning message when a lock has been waiting a configurable amount of time for a reply message from another node. This is mainly for debugging. Signed-off-by: David Teigland <teigland@redhat.com>
| * | | | dlm: Remove superfluous call to recalc_sigpending()Matt Fleming2011-03-281-1/+0
| | | | | recalc_sigpending() is called within sigprocmask(), so there is no need to call it again after sigprocmask() has returned. Signed-off-by: Matt Fleming <matt.fleming@linux.intel.com> Signed-off-by: David Teigland <teigland@redhat.com>
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2011-05-241-0/+8
|\ \ \ \ \ git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (43 commits) TOMOYO: Fix wrong domainname validation. SELINUX: add /sys/fs/selinux mount point to put selinuxfs CRED: Fix load_flat_shared_library() to initialise bprm correctly SELinux: introduce path_has_perm flex_array: allow 0 length elements flex_arrays: allow zero length flex arrays flex_array: flex_array_prealloc takes a number of elements, not an end SELinux: pass last path component in may_create SELinux: put name based create rules in a hashtable SELinux: generic hashtab entry counter SELinux: calculate and print hashtab stats with a generic function SELinux: skip filename trans rules if ttype does not match parent dir SELinux: rename filename_compute_type argument to *type instead of *con SELinux: fix comment to state filename_compute_type takes an objname not a qstr SMACK: smack_file_lock can use the struct path LSM: separate LSM_AUDIT_DATA_DENTRY from LSM_AUDIT_DATA_PATH LSM: split LSM_AUDIT_DATA_FS into _PATH and _INODE SELINUX: Make selinux cache VFS RCU walks safe SECURITY: Move exec_permission RCU checks into security modules SELinux: security_read_policy should take a size_t not ssize_t ...
| * \ \ \ \ Merge branch 'next' into for-linusJames Morris2011-05-241-0/+8
| |\ \ \ \ \