summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
...
| * | | ext4: Fix the delalloc writepages to allocate blocks at the right offset.Aneesh Kumar K.V2009-01-051-17/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When iterating through the pages which have mapped buffer_heads, we failed to update the b_state value. This results in allocating blocks at logical offset 0. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
| * | | ext4: tone down ext4_da_writepages warningsTheodore Ts'o2008-11-051-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the filesystem has errors, ext4_da_writepages() will return a *lot* of errors, including lots and lots of stack dumps. While it's true that we are dropping user data on the floor, which is unfortunate, the stack dumps aren't helpful, and they tend to obscure the true original root cause of the problem. So in the case where the filesystem has aborted, return an EROFS right away. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext4: remove do_blk_alloc()Theodore Ts'o2008-12-123-38/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The convenience function do_blk_alloc() is a static function with only one caller, so fold it into ext4_new_meta_blocks() to simplify the code and to make it easier to understand. To save more stack space, if count is a null pointer in ext4_new_meta_blocks() assume that caller wanted a single block (and if there is an error, no blocks were allocated). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext4: remove ext4_new_meta_block()Theodore Ts'o2008-12-074-22/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There were only two one callers of the function ext4_new_meta_block(), which just a very simpler wrapper function around ext4_new_meta_blocks(). Change those two functions to call ext4_new_meta_blocks() directly, to save code and stack space usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext4: remove ext4_new_blocks() and call ext4_mb_new_blocks() directlyTheodore Ts'o2009-01-013-28/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was only one caller of the compatibility function ext4_new_blocks(), in balloc.c's ext4_alloc_blocks(). Change it to call ext4_mb_new_blocks() directly, and remove ext4_new_blocks() altogether. This cleans up the code, by removing two extra functions from the call chain, and hopefully saving some stack usage. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext3/4: Fix loop index in do_split() so it is signedTheodore Ts'o2008-12-062-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a gcc warning but it doesn't appear able to result in a failure, since the primary way the loop is exited is the first conditional in the for loop, and at least for a consistent filesystem, the signed/unsigned should in practice never be exposed. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext4: Add support for non-native signed/unsigned htree hash algorithmsTheodore Ts'o2008-10-285-10/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original ext3 hash algorithms assumed that variables of type char were signed, as God and K&R intended. Unfortunately, this assumption is not true on some architectures. Userspace support for marking filesystems with non-native signed/unsigned chars was added two years ago, but the kernel-side support was never added (until now). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| * | | ext3: Add support for non-native signed/unsigned htree hash algorithmsTheodore Ts'o2008-10-283-10/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original ext3 hash algorithms assumed that variables of type char were signed, as God and K&R intended. Unfortunately, this assumption is not true on some architectures. Userspace support for marking filesystems with non-native signed/unsigned chars was added two years ago, but the kernel-side support was never added (until now). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org
| * | | ext4: fix printk format warningAlexander Beregalov2008-10-292-3/+3
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | fs/ext4/balloc.c:607: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1822: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' fs/ext4/inode.c:1824: warning: format '%lld' expects type 'long long int', but argument 2 has type 's64' Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds2009-01-081-14/+22
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (45 commits) [SCSI] qla2xxx: Update version number to 8.03.00-k1. [SCSI] qla2xxx: Add ISP81XX support. [SCSI] qla2xxx: Use proper request/response queues with MQ instantiations. [SCSI] qla2xxx: Correct MQ-chain information retrieval during a firmware dump. [SCSI] qla2xxx: Collapse EFT/FCE copy procedures during a firmware dump. [SCSI] qla2xxx: Don't pollute kernel logs with ZIO/RIO status messages. [SCSI] qla2xxx: Don't fallback to interrupt-polling during re-initialization with MSI-X enabled. [SCSI] qla2xxx: Remove support for reading/writing HW-event-log. [SCSI] cxgb3i: add missing include [SCSI] scsi_lib: fix DID_RESET status problems [SCSI] fc transport: restore missing dev_loss_tmo callback to LLDD [SCSI] aha152x_cs: Fix regression that keeps driver from using shared interrupts [SCSI] sd: Correctly handle 6-byte commands with DIX [SCSI] sd: DIF: Fix tagging on platforms with signed char [SCSI] sd: DIF: Show app tag on error [SCSI] Fix error handling for DIF/DIX [SCSI] scsi_lib: don't decrement busy counters when inserting commands [SCSI] libsas: fix test for negative unsigned and typos [SCSI] a2091, gvp11: kill warn_unused_result warnings [SCSI] fusion: Move a dereference below a NULL test ... Fixed up trivial conflict due to moving the async part of sd_probe around in the async probes vs using dev_set_name() in naming.
| * | | [SCSI] block: make blk_rq_map_user take a NULL user-space buffer for WRITEFUJITA Tomonori2009-01-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 818827669d85b84241696ffef2de485db46b0b5e (block: make blk_rq_map_user take a NULL user-space buffer) extended blk_rq_map_user to accept a NULL user-space buffer with a READ command. It was necessary to convert sg to use the block layer mapping API. This patch extends blk_rq_map_user again for a WRITE command. It is necessary to convert st and osst drivers to use the block layer apping API. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
| * | | [SCSI] block: fix the partial mappings with struct rq_map_dataFUJITA Tomonori2009-01-021-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes bio_copy_user_iov to properly handle the partial mappings with struct rq_map_data (which only sg uses for now but st and osst will shortly). It adds the offset member to struct rq_map_data and changes blk_rq_map_user to update it so that bio_copy_user_iov can add an appropriate page frame via bio_add_pc_page(). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
| * | | [SCSI] block: fix bio_add_page misuse with rq_map_dataFUJITA Tomonori2009-01-021-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes bio_add_page misuse in bio_copy_user_iov with rq_map_data, which only sg uses now. rq_map_data carries page frames for bio_add_pc_page. bio_copy_user_iov uses bio_add_pc_page with a larger size than PAGE_SIZE. It's clearly wrong. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
* | | | Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds2009-01-081-0/+14
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-linus' of git://neil.brown.name/md: md: don't retry recovery of raid1 that fails due to error on source drive. md: Allow md devices to be created by name. md: make devices disappear when they are no longer needed. md: centralise all freeing of an 'mddev' in 'md_free' md: move allocation of ->queue from mddev_find to md_probe md: need another print_sb for mdp_superblock_1 md: use list_for_each_entry macro directly md: raid0: make hash_spacing and preshift sector-based. md: raid0: Represent the size of strip zones in sectors. md: raid0 create_strip_zones(): Add KERN_INFO/KERN_ERR to printk's. md: raid0 create_strip_zones(): Make two local variables sector-based. md: raid0: Represent zone->zone_offset in sectors. md: raid0: Represent device offset in sectors. md: raid0_make_request(): Replace local variable block by sector. md: raid0_make_request(): Remove local variable chunk_size. md: raid0_make_request(): Replace chunksize_bits by chunksect_bits. md: use sysfs_notify_dirent to notify changes to md/sync_action. md: fix bitmap-on-external-file bug.
| * | | | md: make devices disappear when they are no longer needed.NeilBrown2009-01-091-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently md devices, once created, never disappear until the module is unloaded. This is essentially because the gendisk holds a reference to the mddev, and the mddev holds a reference to the gendisk, this a circular reference. If we drop the reference from mddev to gendisk, then we need to ensure that the mddev is destroyed when the gendisk is destroyed. However it is not possible to hook into the gendisk destruction process to enable this. So we drop the reference from the gendisk to the mddev and destroy the gendisk when the mddev gets destroyed. However this has a complication. Between the call __blkdev_get->get_gendisk->kobj_lookup->md_probe and the call __blkdev_get->md_open there is no obvious way to hold a reference on the mddev any more, so unless something is done, it will disappear and gendisk will be destroyed prematurely. Also, once we decide to destroy the mddev, there will be an unlockable moment before the gendisk is unlinked (blk_unregister_region) during which a new reference to the gendisk can be created. We need to ensure that this reference can not be used. i.e. the ->open must fail. So: 1/ in md_probe we set a flag in the mddev (hold_active) which indicates that the array should be treated as active, even though there are no references, and no appearance of activity. This is cleared by md_release when the device is closed if it is no longer needed. This ensures that the gendisk will survive between md_probe and md_open. 2/ In md_open we check if the mddev we expect to open matches the gendisk that we did open. If there is a mismatch we return -ERESTARTSYS and modify __blkdev_get to retry from the top in that case. In the -ERESTARTSYS sys case we make sure to wait until the old gendisk (that we succeeded in opening) is really gone so we loop at most once. Some udev configurations will always open an md device when it first appears. If we allow an md device that was just created by an open to disappear on an immediate close, then this can race with such udev configurations and result in an infinite loop the device being opened and closed, then re-open due to the 'ADD' even from the first open, and then close and so on. So we make sure an md device, once created by an open, remains active at least until some md 'ioctl' has been made on it. This means that all normal usage of md devices will allow them to disappear promptly when not needed, but the worst that an incorrect usage will do it cause an inactive md device to be left in existence (it can easily be removed). As an array can be stopped by writing to a sysfs attribute echo clear > /sys/block/mdXXX/md/array_state we need to use scheduled work for deleting the gendisk and other kobjects. This allows us to wait for any pending gendisk deletion to complete by simply calling flush_scheduled_work(). Signed-off-by: NeilBrown <neilb@suse.de>
* | | | | fix similar typos to successfullColy Li2009-01-082-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I review ocfs2 code, find there are 2 typos to "successfull". After doing grep "successfull " in kernel tree, 22 typos found totally -- great minds always think alike :) This patch fixes all the similar typos. Thanks for Randy's ack and comments. Signed-off-by: Coly Li <coyli@suse.de> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Roland Dreier <rolandd@cisco.com> Cc: Jeremy Kerr <jk@ozlabs.org> Cc: Jeff Garzik <jeff@garzik.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Vlad Yasevich <vladislav.yasevich@hp.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | generic swap(): dcache: use swap() instead of private do_switch()Wu Fengguang2009-01-081-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | generic swap(): ext4: remove local swap() macroWu Fengguang2009-01-081-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | generic swap(): ext3: remove local swap() macroWu Fengguang2009-01-081-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new generic implementation. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | remove lots of double-semicolonsFernando Carrijo2009-01-082-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Mark Fasheh <mfasheh@suse.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | romfs: romfs_iget() - unsigned ino >= 0 is always trueroel kluin2009-01-081-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | romfs_strnlen() returns int unsigned X >= 0 is always true [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: roel kluin <roel.kluin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | vmcore: remove saved_max_pfn checkMagnus Damm2009-01-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the saved_max_pfn check from the /proc/vmcore function read_from_oldmem(). No need to verify, we should be able to just trust that "elfcorehdr=" is correctly passed to the crash kernel on the kernel command line like we do with other parameters. The read_from_oldmem() function in fs/proc/vmcore.c is quite similar to read_from_oldmem() in drivers/char/mem.c, but only in the latter it makes sense to use saved_max_pfn. For oldmem it is used to determine when to stop reading. For vmcore we already have the elf header info pointing out the physical memory regions, no need to pass the end-of- old-memory twice. Removing the saved_max_pfn check from vmcore makes it possible for architectures to skip oldmem but still support crash dump through vmcore - without the need for the old saved_max_pfn cruft. Architectures that want to play safe can do the saved_max_pfn check in copy_oldmem_page(). Not sure why anyone would want to do that, but that's even safer than today - the saved_max_pfn check in vmcore removed by this patch only checks the first page. Signed-off-by: Magnus Damm <damm@igel.co.jp> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Simon Horman <horms@verge.net.au> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ELF: implement AT_RANDOM for glibc PRNG seedingKees Cook2009-01-081-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While discussing[1] the need for glibc to have access to random bytes during program load, it seems that an earlier attempt to implement AT_RANDOM got stalled. This implements a random 16 byte string, available to every ELF program via a new auxv AT_RANDOM vector. [1] http://sourceware.org/ml/libc-alpha/2008-10/msg00006.html Ulrich said: glibc needs right after startup a bit of random data for internal protections (stack canary etc). What is now in upstream glibc is that we always unconditionally open /dev/urandom, read some data, and use it. For every process startup. That's slow. ... The solution is to provide a limited amount of random data to the starting process in the aux vector. I suggested 16 bytes and this is what the patch implements. If we need only 16 bytes or less we use the data directly. If we need more we'll use the 16 bytes to see a PRNG. This avoids the costly /dev/urandom use and it allows the kernel to use the most adequate source of random data for this purpose. It might not be the same pool as that for /dev/urandom. Concerns were expressed about the depletion of the randomness pool. But this patch doesn't make the situation worse, it doesn't deplete entropy more than happens now. Signed-off-by: Kees Cook <kees.cook@canonical.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | memcg: synchronized LRUKAMEZAWA Hiroyuki2009-01-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A big patch for changing memcg's LRU semantics. Now, - page_cgroup is linked to mem_cgroup's its own LRU (per zone). - LRU of page_cgroup is not synchronous with global LRU. - page and page_cgroup is one-to-one and statically allocated. - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc); - SwapCache is handled. And, when we handle LRU list of page_cgroup, we do following. pc = lookup_page_cgroup(page); lock_page_cgroup(pc); .....................(1) mz = page_cgroup_zoneinfo(pc); spin_lock(&mz->lru_lock); .....add to LRU spin_unlock(&mz->lru_lock); unlock_page_cgroup(pc); But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct. This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, above sequence will be written as spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU mem_cgroup_add/remove/etc_lru() { pc = lookup_page_cgroup(page); mz = page_cgroup_zoneinfo(pc); if (PageCgroupUsed(pc)) { ....add to LRU } spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU This is much simpler. (*) We're safe even if we don't take lock_page_cgroup(pc). Because.. 1. When pc->mem_cgroup can be modified. - at charge. - at account_move(). 2. at charge the PCG_USED bit is not set before pc->mem_cgroup is fixed. 3. at account_move() the page is isolated and not on LRU. Pros. - easy for maintenance. - memcg can make use of laziness of pagevec. - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup. - LRU status of memcg will be synchronized with global LRU's one. - # of locks are reduced. - account_move() is simplified very much. Cons. - may increase cost of LRU rotation. (no impact if memcg is not configured.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | quota: don't set grace time when user isn't above softlimitJan Kara2009-01-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_set_dqblk() allowed SETDQBLK quotactl to set user's grace time even if user was not above his softlimit. This does not make much sence and by coincidence causes quota code to omit softlimit warning when user really exceeds softlimit. This patch makes do_set_dqblk() reset user's grace time if he has not exceeded softlimit. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTLRichard A. Holden III2009-01-081-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix fs/coda/sysctl.c:14: warning: 'fs_table_header' defined but not used fs/coda/sysctl.c:44: warning: 'fs_table' defined but not used these are only used when CONFIG_SYSCTL is defined. Signed-off-by: Richard A. Holden III <aciddeath@gmail.com> Cc: Jan Harkes <jaharkes@cs.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | jbd: remove excess kernel-doc notationRandy Dunlap2009-01-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove excess kernel-doc from fs/jbd/transaction.c: Warning(linux-2.6.28-git5//fs/jbd/transaction.c:764): Excess function parameter 'credits' description in 'journal_get_write_access' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext3: tighten restrictions on inode flagsDuane Griffin2009-01-082-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and use it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext3: don't inherit inappropriate inode flags from parentDuane Griffin2009-01-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present INDEX is the only flag that new ext3 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext3: allocate ->s_blockgroup_lock separatelyPekka Enberg2009-01-081-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As spotted by kmemtrace, struct ext3_sb_info is 17152 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2 page in the worst case, separately. This shinks down struct ext3_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | jbd: improve fsync batchingJosef Bacik2009-01-082-5/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a flaw with the way jbd handles fsync batching. If we fsync() a file and we were not the last person to run fsync() on this fs then we automatically sleep for 1 jiffie in order to wait for new writers to join into the transaction before forcing the commit. The problem with this is that with really fast storage (ie a Clariion) the time it takes to commit a transaction to disk is way faster than 1 jiffie in most cases, so sleeping means waiting longer with nothing to do than if we just committed the transaction and kept going. Ric Wheeler noticed this when using fs_mark with more than 1 thread, the throughput would plummet as he added more threads. This patch attempts to fix this problem by recording the average time in nanoseconds that it takes to commit a transaction to disk, and what time we started the transaction. If we run an fsync() and we have been running for less time than it takes to commit the transaction to disk, we sleep for the delta amount of time and then commit to disk. We acheive sub-jiffie sleeping using schedule_hrtimeout. This means that the wait time is auto-tuned to the speed of the underlying disk, instead of having this static timeout. I weighted the average according to somebody's comments (Andreas Dilger I think) in order to help normalize random outliers where we take way longer or way less time to commit than the average. I also have a min() check in there to make sure we don't sleep longer than a jiffie in case our storage is super slow, this was requested by Andrew. I unfortunately do not have access to a Clariion, so I had to use a ramdisk to represent a super fast array. I tested with a SATA drive with barrier=1 to make sure there was no regression with local disks, I tested with a 4 way multipathed Apple Xserve RAID array and of course the ramdisk. I ran the following command fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t $i where $i was 2, 4, 8, 16 and 32. I mkfs'ed the fs each time. Here are my results type threads with patch without patch sata 2 24.6 26.3 sata 4 49.2 48.1 sata 8 70.1 67.0 sata 16 104.0 94.1 sata 32 153.6 142.7 xserve 2 246.4 222.0 xserve 4 480.0 440.8 xserve 8 829.5 730.8 xserve 16 1172.7 1026.9 xserve 32 1816.3 1650.5 ramdisk 2 2538.3 1745.6 ramdisk 4 2942.3 661.9 ramdisk 8 2882.5 999.8 ramdisk 16 2738.7 1801.9 ramdisk 32 2541.9 2394.0 Signed-off-by: Josef Bacik <jbacik@redhat.com> Cc: Andreas Dilger <adilger@sun.com> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ric Wheeler <rwheeler@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext2: tighten restrictions on inode flagsDuane Griffin2009-01-082-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment there are few restrictions on which flags may be set on which inodes. Specifically DIRSYNC may only be set on directories and IMMUTABLE and APPEND may not be set on links. Tighten that to disallow TOPDIR being set on non-directories and only NODUMP and NOATIME to be set on non-regular file, non-directories. Introduces a flags masking function which masks flags based on mode and use it during inode creation and when flags are set via the ioctl to facilitate future consistency. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext2: don't inherit inappropriate inode flags from parentDuane Griffin2009-01-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At present BTREE/INDEX is the only flag that new ext2 inodes do NOT inherit from their parent. In addition prevent the flags DIRTY, ECOMPR, INDEX, IMAGIC and TOPDIR from being inherited. List inheritable flags explicitly to prevent future flags from accidentally being inherited. This fixes the TOPDIR flag inheritance bug reported at http://bugzilla.kernel.org/show_bug.cgi?id=9866. Signed-off-by: Duane Griffin <duaneg@dghda.com> Acked-by: Andreas Dilger <adilger@sun.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext2: allocate ->s_blockgroup_lock separatelyPekka J Enberg2009-01-081-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As spotted by kmemtrace, struct ext2_sb_info is 17024 bytes on 64-bit which makes it a very bad fit for SLAB allocators. The culprit of the wasted memory is ->s_blockgroup_lock which can be as big as 16 KB when NR_CPUS >= 32. To fix that, allocate ->s_blockgroup_lock, which fits nicely in a order 2 page in the worst case, separately. This shinks down struct ext2_sb_info enough to fit a 1 KB slab cache so now we allocate 16 KB + 1 KB instead of 32 KB saving 15 KB of memory. Acked-by: Andreas Dilger <adilger@sun.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | ext2: fix ext2_splice_branch() commentsQinghuang Feng2009-01-081-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no argument named @chain in ext2_splice_branch, remove references to it. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | | async: Don't call async_synchronize_full_special() while holding sb_lockDave Kleikamp2009-01-081-1/+1
|/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sync_filesystems() shouldn't be calling async_synchronize_full_special while holding a spinlock. The second while loop in that function is the right place for this anyway. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Grissiom <chaos.proton@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2009-01-0719-435/+1065
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: (67 commits) nfsd: get rid of NFSD_VERSION nfsd: last_byte_offset nfsd: delete wrong file comment from nfsd/nfs4xdr.c nfsd: git rid of nfs4_cb_null_ops declaration nfsd: dprint each op status in nfsd4_proc_compound nfsd: add etoosmall to nfserrno NFSD: FIDs need to take precedence over UUIDs SUNRPC: The sunrpc server code should not be used by out-of-tree modules svc: Clean up deferred requests on transport destruction nfsd: fix double-locks of directory mutex svc: Move kfree of deferral record to common code CRED: Fix NFSD regression NLM: Clean up flow of control in make_socks() function NLM: Refactor make_socks() function nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKT SUNRPC: Ensure the server closes sockets in a timely fashion NFSD: Add documenting comments for nfsctl interface NFSD: Replace open-coded integer with macro NFSD: Fix a handful of coding style issues in write_filehandle() NFSD: clean up failover sysctl function naming ...
| * | | | nfsd: last_byte_offsetBenny Halevy2009-01-071-16/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | refactor the nfs4 server lock code to use last_byte_offset to compute the last byte covered by the lock. Check for overflow so that the last byte is set to NFS4_MAX_UINT64 if offset + len wraps around. Also, use NFS4_MAX_UINT64 for ~(u64)0 where appropriate. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: delete wrong file comment from nfsd/nfs4xdr.cMarc Eshel2009-01-071-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: git rid of nfs4_cb_null_ops declarationBenny Halevy2009-01-071-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no use for nfs4_cb_null_ops's declaration in fs/nfsd/nfs4callback.c Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: dprint each op status in nfsd4_proc_compoundBenny Halevy2009-01-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: add etoosmall to nfserrnoDean Hildebrand2009-01-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | NFSD: FIDs need to take precedence over UUIDsSteve Dickson2009-01-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When determining the fsid_type in fh_compose(), the setting of the FID via fsid= export option needs to take precedence over using the UUID device id. Signed-off-by: Steve Dickson <steved@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: fix double-locks of directory mutexJ. Bruce Fields2009-01-071-3/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A number of nfsd operations depend on the i_mutex to cover more code than just the fsync, so the approach of 4c728ef583b3d8 "add a vfs_fsync helper" doesn't work for nfsd. Revert the parts of those patches that touch nfsd. Note: we can't, however, remove the logic from vfs_fsync that was needed only for the special case of nfsd, because a vfs_fsync(NULL,...) call can still result indirectly from a stackable filesystem that was called by nfsd. (Thanks to Christoph Hellwig for pointing this out.) Reported-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | CRED: Fix NFSD regressionDavid Howells2009-01-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a regression in NFSD's permission checking introduced by the credentials patches. There are two parts to the problem, both in nfsd_setuser(): (1) The return value of set_groups() is -ve if in error, not 0, and should be checked appropriately. 0 indicates success. (2) The UID to use for fs accesses is in new->fsuid, not new->uid (which is 0). This causes CAP_DAC_OVERRIDE to always be set, rather than being cleared if the UID is anything other than 0 after squashing. Reported-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | NLM: Clean up flow of control in make_socks() functionChuck Lever2009-01-071-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Use Bruce's preferred control flow style in make_socks(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | NLM: Refactor make_socks() functionChuck Lever2009-01-071-15/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: extract common logic in NLM's make_socks() function into a helper. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | nfsd: Ensure nfsv4 calls the underlying filesystem on LOCKTJ. Bruce Fields2009-01-071-10/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since nfsv4 allows LOCKT without an open, but the ->lock() method is a file method, we fake up a struct file in the nfsv4 code with just the fields we need initialized. But we forgot to initialize the file operations, with the result that LOCKT never results in a call to the filesystem's ->lock() method (if it exists). We could just add that one more initialization. But this hack of faking up a struct file with only some fields initialized seems the kind of thing that might cause more problems in the future. We should either do an open and get a real struct file, or make lock-testing an inode (not a file) method. This patch does the former. Reported-by: Marc Eshel <eshel@almaden.ibm.com> Tested-by: Marc Eshel <eshel@almaden.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | NFSD: Add documenting comments for nfsctl interfaceChuck Lever2009-01-061-16/+437
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Document the NFSD sysctl interface laid out in fs/nfsd/nfsctl.c. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | | | NFSD: Replace open-coded integer with macroChuck Lever2009-01-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clean up: Instead of open-coding 2049, use the NFS_PORT macro. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
OpenPOWER on IntegriCloud