summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2014-08-2014-22/+269
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull cifs fixes from Steve French: "Most important fixes in this set include three SMB3 fixes for stable (including fix for possible kernel oops), and a workaround to allow writes to Mac servers (only cifs dialect, not more current SMB2.1, worked to Mac servers). Also fallocate support added, and lease fix from Jeff" * 'for-linus' of git://git.samba.org/sfrench/cifs-2.6: [SMB3] Enable fallocate -z support for SMB3 mounts enable fallocate punch hole ("fallocate -p") for SMB3 Incorrect error returned on setting file compressed on SMB2 CIFS: Fix wrong directory attributes after rename CIFS: Fix SMB2 readdir error handling [CIFS] Possible null ptr deref in SMB2_tcon [CIFS] Workaround MacOS server problem with SMB2.1 write response cifs: handle lease F_UNLCK requests properly Cleanup sparse file support by creating worker function for it Add sparse file support to SMB2/SMB3 mounts Add missing definitions for CIFS File System Attributes cifs: remove unused function cifs_oplock_break_wait
| * [SMB3] Enable fallocate -z support for SMB3 mountsSteve French2014-08-171-0/+55
| | | | | | | | | | | | | | | | | | | | | | fallocate -z (FALLOC_FL_ZERO_RANGE) can map to SMB3 FSCTL_SET_ZERO_DATA SMB3 FSCTL but FALLOC_FL_ZERO_RANGE when called without the FALLOC_FL_KEEPSIZE flag set could want the file size changed so we can not support that subcase unless the file is cached (and thus we know the file size). Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org>
| * enable fallocate punch hole ("fallocate -p") for SMB3Steve French2014-08-175-1/+73
| | | | | | | | | | | | | | | | | | | | Implement FALLOC_FL_PUNCH_HOLE (which does not change the file size fortunately so this matches the behavior of the equivalent SMB3 fsctl call) for SMB3 mounts. This allows "fallocate -p" to work. It requires that the server support setting files as sparse (which Windows allows). Signed-off-by: Steve French <smfrench@gmail.com>
| * Incorrect error returned on setting file compressed on SMB2Steve French2014-08-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | When the server (for an SMB2 or SMB3 mount) doesn't support an ioctl (such as setting the compressed flag on a file) we were incorrectly returning EIO instead of EOPNOTSUPP, this is confusing e.g. doing chattr +c to a file on a non-btrfs Samba partition, now the error returned is more intuitive to the user. Also fixes error mapping on setting hardlink to servers which don't support that. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: David Disseldorp <ddiss@suse.de>
| * CIFS: Fix wrong directory attributes after renamePavel Shilovsky2014-08-171-0/+6
| | | | | | | | | | | | | | | | | | | | | | When we requests rename we also need to update attributes of both source and target parent directories. Not doing it causes generic/309 xfstest to fail on SMB2 mounts. Fix this by marking these directories for force revalidating. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
| * CIFS: Fix SMB2 readdir error handlingPavel Shilovsky2014-08-177-8/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SMB2 servers indicates the end of a directory search with STATUS_NO_MORE_FILE error code that is not processed now. This causes generic/257 xfstest to fail. Fix this by triggering the end of search by this error code in SMB2_query_directory. Also when negotiating CIFS protocol we tell the server to close the search automatically at the end and there is no need to do it itself. In the case of SMB2 protocol, we need to close it explicitly - separate close directory checks for different protocols. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
| * [CIFS] Possible null ptr deref in SMB2_tconSteve French2014-08-171-1/+2
| | | | | | | | | | | | | | | | | | As Raphael Geissert pointed out, tcon_error_exit can dereference tcon and there is one path in which tcon can be null. Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> # v3.7+ Reported-by: Raphael Geissert <geissert@debian.org>
| * [CIFS] Workaround MacOS server problem with SMB2.1 writeSteve French2014-08-151-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | response Writes fail to Mac servers with SMB2.1 mounts (works with cifs though) due to them sending an incorrect RFC1001 length for the SMB2.1 Write response. Workaround this problem. MacOS server sends a write response with 3 bytes of pad beyond the end of the SMB itself. The RFC1001 length is 3 bytes more than the sum of the SMB2.1 header length + the write reponse. Incorporate feedback from Jeff and JRA to allow servers to send a tcp frame that is even more than three bytes too long (ie much longer than the SMB2/SMB3 request that it contains) but we do log it once now. In the earlier version of the patch I had limited how far off the length field could be before we fail the request. Signed-off-by: Steve French <smfrench@gmail.com>
| * cifs: handle lease F_UNLCK requests properlyJeff Layton2014-08-151-2/+3
| | | | | | | | | | | | | | | | Currently any F_UNLCK request for a lease just gets back -EAGAIN. Allow them to go immediately to generic_setlease instead. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Steve French <smfrench@gmail.com>
| * Cleanup sparse file support by creating worker function for itSteve French2014-08-151-31/+49
| | | | | | | | | | | | | | | | Simply move code to new function (for clarity). Function sets or clears the sparse file attribute flag. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: David Disseldorp <ddiss@samba.org>
| * Add sparse file support to SMB2/SMB3 mountsSteve French2014-08-133-1/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many Linux filesystes make a file "sparse" when extending a file with ftruncate. This does work for CIFS to Samba (only) but not for SMB2/SMB3 (to Samba or Windows) since there is a "set sparse" fsctl which is supposed to be sent to mark a file as sparse. This patch marks a file as sparse by sending this simple set sparse fsctl if it is extended more than 2 pages. It has been tested to Windows 8.1, Samba and various SMB2/SMB3 servers which do support setting sparse (and MacOS which does not appear to support the fsctl yet). If a server share does not support setting a file as sparse, then we do not retry setting sparse on that share. The disk space savings for sparse files can be quite large (even more significant on Windows servers than Samba). Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
| * Add missing definitions for CIFS File System AttributesSteve French2014-08-121-0/+23
| | | | | | | | | | Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Shirish Pargaonkar <spargaonkar@suse.com>
| * cifs: remove unused function cifs_oplock_break_waitVincent Stehlé2014-08-111-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") has removed the call to cifs_oplock_break_wait, making this function unused; remove it. This fixes the following compilation warning: fs/cifs/misc.c:578:1: warning: ‘cifs_oplock_break_wait’ defined but not used [-Wunused-function] Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net> Cc: Steve French <sfrench@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
* | ext3: Count internal journal as bsddf overhead in ext3_statfsChin-Tsung Cheng2014-08-191-2/+3
| | | | | | | | | | | | | | | | The journal blocks of external journal device should not be counted as overhead. Signed-off-by: Chin-Tsung Cheng <chintzung@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | isofs: Fix unbounded recursion when processing relocated directoriesJan Kara2014-08-193-22/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We did not check relocated directory in any way when processing Rock Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL entry pointing to another CL entry leading to possibly unbounded recursion in kernel code and thus stack overflow or deadlocks (if there is a loop created from CL entries). Fix the problem by not allowing CL entry to point to a directory entry with CL entry (such use makes no good sense anyway) and by checking whether CL entry doesn't point to itself. CC: stable@vger.kernel.org Reported-by: Chris Evans <cevans@google.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | udf: avoid unneeded up_write when fail to add entry in ->symlinkChao Yu2014-08-191-1/+2
| | | | | | | | | | | | | | | | We have released the ->i_data_sem before invoking udf_add_entry(), so in following error path, we should not release this lock again. Signed-off-by: Chao Yu <chao2.yu@samsung.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | Merge branch 'for-linus2' of ↵Linus Torvalds2014-08-1617-305/+541
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "These are all fixes I'd like to get out to a broader audience. The biggest of the bunch is Mark's quota fix, which is also in the SUSE kernel, and makes our subvolume quotas dramatically more accurate. I've been running xfstests with these against your current git overnight, but I'm queueing up longer tests as well" * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: disable strict file flushes for renames and truncates Btrfs: fix csum tree corruption, duplicate and outdated checksums Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch Btrfs: fix compressed write corruption on enospc btrfs: correctly handle return from ulist_add btrfs: qgroup: account shared subtrees during snapshot delete Btrfs: read lock extent buffer while walking backrefs Btrfs: __btrfs_mod_ref should always use no_quota btrfs: adjust statfs calculations according to raid profiles
| * | btrfs: disable strict file flushes for renames and truncatesChris Mason2014-08-158-267/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: fix csum tree corruption, duplicate and outdated checksumsFilipe Manana2014-08-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 |-------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] | |-------------------------------------------------------------------| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 |--------------------------------------------------------------------| | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 | |--------------------------------------------------------------------| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 |----------------------------------------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] | |----------------------------------------------------------------------------------------------------| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: Fix memory corruption by ulist_add_merge() on 32bit archTakashi Iwai2014-08-152-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] *pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void ** cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Cc: <stable@vger.kernel.org> [v3.11+] Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: fix compressed write corruption on enospcLiu Bo2014-08-151-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When failing to allocate space for the whole compressed extent, we'll fallback to uncompressed IO, but we've forgotten to redirty the pages which belong to this compressed extent, and these 'clean' pages will simply skip 'submit' part and go to endio directly, at last we got data corruption as we write nothing. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: correctly handle return from ulist_addMark Fasheh2014-08-151-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ulist_add() can return '1' on sucess, which qgroup_subtree_accounting() doesn't take into account. As a result, that value can be bubbled up to callers, causing an error to be printed. Fix this by only returning the value of ulist_add() when it indicates an error. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: qgroup: account shared subtrees during snapshot deleteMark Fasheh2014-08-153-0/+426
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During its tree walk, btrfs_drop_snapshot() will skip any shared subtrees it encounters. This is incorrect when we have qgroups turned on as those subtrees need to have their contents accounted. In particular, the case we're concerned with is when removing our snapshot root leaves the subtree with only one root reference. In those cases we need to find the last remaining root and add each extent in the subtree to the corresponding qgroup exclusive counts. This patch implements the shared subtree walk and a new qgroup operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of this type is encountered during qgroup accounting, we search for any root references to that extent and in the case that we find only one reference left, we go ahead and do the math on it's exclusive counts. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: read lock extent buffer while walking backrefsFilipe Manana2014-08-151-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | Before processing the extent buffer, acquire a read lock on it, so that we're safe against concurrent updates on the extent buffer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | Btrfs: __btrfs_mod_ref should always use no_quotaJosef Bacik2014-08-153-25/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before I extended the no_quota arg to btrfs_dec/inc_ref because I didn't understand how snapshot delete was using it and assumed that we needed the quota operations there. With Mark's work this has turned out to be not the case, we _always_ need to use no_quota for btrfs_dec/inc_ref, so just drop the argument and make __btrfs_mod_ref call it's process function with no_quota set always. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
| * | btrfs: adjust statfs calculations according to raid profilesDavid Sterba2014-08-151-6/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This has been discussed in thread: http://thread.gmane.org/gmane.comp.file-systems.btrfs/32528 and this patch implements this proposal: http://thread.gmane.org/gmane.comp.file-systems.btrfs/32536 Works fine for "clean" raid profiles where the raid factor correction does the right job. Otherwise it's pessimistic and may show low space although there's still some left. The df nubmers are lightly wrong in case of mixed block groups, but this is not a major usecase and can be addressed later. The RAID56 numbers are wrong almost the same way as before and will be addressed separately. CC: Hugo Mills <hugo@carfax.org.uk> CC: cwillu <cwillu@cwillu.com> CC: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
* | | Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linuxLinus Torvalds2014-08-161-29/+57
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull file locking bugfixes from Jeff Layton: "Most of these patches are to fix a long-standing regression that crept in when the BKL was removed from the file-locking code. The code was converted to use a conventional spinlock, but some fl_release_private ops can block and you can end up sleeping inside the lock. There's also a patch to make /proc/locks show delegations as 'DELEG'" * tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux: locks: update Locking documentation to clarify fl_release_private behavior locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped locks: don't reuse file_lock in __posix_lock_file locks: don't call locks_release_private from locks_copy_lock locks: show delegations as "DELEG" in /proc/locks
| * | | locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlockJeff Layton2014-08-141-9/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no need to call locks_free_lock here while still holding the i_lock. Defer that until the lock has been dropped. Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
| * | | locks: defer freeing locks in locks_delete_lock until after i_lock has been ↵Jeff Layton2014-08-141-8/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | dropped In commit 72f98e72551fa (locks: turn lock_flocks into a spinlock), we moved from using the BKL to a global spinlock. With this change, we lost the ability to block in the fl_release_private operation. This is problematic for NFS (and probably some other filesystems as well). Add a new list_head argument to locks_delete_lock. If that argument is non-NULL, then queue any locks that we want to free to the list instead of freeing them. Then, add a new locks_dispose_list function that will walk such a list and call locks_free_lock on them after the i_lock has been dropped. Finally, change all of the callers of locks_delete_lock to pass in a list_head, except for lease_modify. That function can be called long after the i_lock has been acquired. Deferring the freeing of a lease after unlocking it in that function is non-trivial until we overhaul some of the spinlocking in the lease code. Currently though, no filesystem that sets fl_release_private supports leases, so this is not currently a problem. We'll eventually want to make the same change in the lease code, but it needs a lot more work before we can reasonably do so. Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
| * | | locks: don't reuse file_lock in __posix_lock_fileJeff Layton2014-08-141-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently in the case where a new file lock completely replaces the old one, we end up overwriting the existing lock with the new info. This means that we have to call fl_release_private inside i_lock. Change the code to instead copy the info to new_fl, insert that lock into the correct spot and then delete the old lock. In a later patch, we'll defer the freeing of the old lock until after the i_lock has been dropped. Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
| * | | locks: don't call locks_release_private from locks_copy_lockJeff Layton2014-08-111-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All callers of locks_copy_lock pass in a brand new file_lock struct, so there's no need to call locks_release_private on it. Replace that with a warning that fires in the event that we receive a target lock that doesn't look like it's properly initialized. Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
| * | | locks: show delegations as "DELEG" in /proc/locksJeff Layton2014-08-111-1/+5
| | |/ | |/| | | | | | | | | | | | | | | | Now that they are a distinct lease type, show them as such. Cc: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Jeff Layton <jlayton@primarydata.com>
* | | Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds2014-08-161-56/+30
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull aio updates from Ben LaHaise. * git://git.kvack.org/~bcrl/aio-next: aio: use iovec array rather than the single one aio: fix some comments aio: use the macro rather than the inline magic number aio: remove the needless registration of ring file's private_data aio: remove no longer needed preempt_disable() aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx() aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()
| * | | aio: use iovec array rather than the single oneGu Zheng2014-07-241-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we only offer a single iovec to handle all the read/write cases, so the PREADV/PWRITEV request always need to alloc more iovec buffer when copying user vectors. If we use a tmp iovec array rather than the single one, some small PREADV/PWRITEV workloads(vector size small than the tmp buffer) will not need to alloc more iovec buffer when copying user vectors. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: fix some commentsGu Zheng2014-07-241-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function comments of aio_run_iocb and aio_read_events are out of date, so fix them here. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: use the macro rather than the inline magic numberGu Zheng2014-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the inline magic number with the ready-made macro(AIO_RING_MAGIC), just clean up. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: remove the needless registration of ring file's private_dataGu Zheng2014-07-241-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the registration of ring file's private_data, we do not use it. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | aio: remove no longer needed preempt_disable()Benjamin LaHaise2014-07-221-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on feedback from Jens Axboe on 263782c1c95bbddbb022dc092fd89a36bb8d5577, clean up get/put_reqs_available() to remove the no longer needed preempt_disable() and preempt_enable() pair. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jens Axboe <axboe@kernel.dk>
| * | | Merge ../aio-fixesBenjamin LaHaise2014-07-1446-269/+513
| |\ \ \
| * | | | aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()Oleg Nesterov2014-06-241-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ioctx_add_table() is the writer, it does not need rcu_read_lock() to protect ->ioctx_table. It relies on mm->ioctx_lock and rcu locks just add the confusion. And it doesn't need rcu_dereference() by the same reason, it must see any updates previously done under the same ->ioctx_lock. We could use rcu_dereference_protected() but the patch uses rcu_dereference_raw(), the function is simple enough. The same for kill_ioctx(), although it does not update the pointer. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
| * | | | aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()Oleg Nesterov2014-06-241-26/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On 04/30, Benjamin LaHaise wrote: > > > - ctx->mmap_size = 0; > > - > > - kill_ioctx(mm, ctx, NULL); > > + if (ctx) { > > + ctx->mmap_size = 0; > > + kill_ioctx(mm, ctx, NULL); > > + } > > Rather than indenting and moving the two lines changing mmap_size and the > kill_ioctx() call, why not just do "if (!ctx) ... continue;"? That reduces > the number of lines changed and avoid excessive indentation. OK. To me the code looks better/simpler with "if (ctx)", but this is subjective of course, I won't argue. The patch still removes the empty line between mmap_size = 0 and kill_ioctx(), we reset mmap_size only for kill_ioctx(). But feel free to remove this change. ------------------------------------------------------------------------------- Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock() 1. We can read ->ioctx_table only once and we do not read rcu_read_lock() or even rcu_dereference(). This mm has no users, nobody else can play with ->ioctx_table. Otherwise the code is buggy anyway, if we need rcu_read_lock() in a loop because ->ioctx_table can be updated then kfree(table) is obviously wrong. 2. Update the comment. "exit_mmap(mm) is coming" is the good reason to avoid munmap(), but another reason is that we simply can't do vm_munmap() unless current->mm == mm and this is not true in general, the caller is mmput(). 3. We do not really need to nullify mm->ioctx_table before return, probably the current code does this to catch the potential problems. But in this case RCU_INIT_POINTER(NULL) looks better. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
* | | | | Merge tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2014-08-1330-870/+1124
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: - stable fix for a bug in nfs3_list_one_acl() - speed up NFS path walks by supporting LOOKUP_RCU - more read/write code cleanups - pNFS fixes for layout return on close - fixes for the RCU handling in the rpcsec_gss code - more NFS/RDMA fixes" * tag 'nfs-for-3.17-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (79 commits) nfs: reject changes to resvport and sharecache during remount NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired error SUNRPC: remove all refcounting of groupinfo from rpcauth_lookupcred NFS: fix two problems in lookup_revalidate in RCU-walk NFS: allow lockless access to access_cache NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU NFS: support RCU_WALK in nfs_permission() sunrpc/auth: allow lockless (rcu) lookup of credential cache. NFS: prepare for RCU-walk support but pushing tests later in code. NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. NFS: add checks for returned value of try_module_get() nfs: clear_request_commit while holding i_lock pnfs: add pnfs_put_lseg_async pnfs: find swapped pages on pnfs commit lists too nfs: fix comment and add warn_on for PG_INODE_REF nfs: check wait_on_bit_lock err in page_group_lock sunrpc: remove "ec" argument from encrypt_v2 operation sunrpc: clean up sparse endianness warnings in gss_krb5_wrap.c sunrpc: clean up sparse endianness warnings in gss_krb5_seal.c ...
| * | | | | nfs: reject changes to resvport and sharecache during remountScott Mayhew2014-08-041-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit c8e47028 made it possible to change resvport/noresvport and sharecache/nosharecache via a remount operation, neither of which should be allowed. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Fixes: c8e47028 (nfs: Apply NFS_MOUNT_CMP_FLAGMASK to nfs_compare_remount_data) Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: Avoid infinite loop when RELEASE_LOCKOWNER getting expired errorKinglong Mee2014-08-041-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix Commit 60ea681299 (NFS: Migration support for RELEASE_LOCKOWNER) If getting expired error, client will enter a infinite loop as, client server RELEASE_LOCKOWNER(old clid) -----> <--- expired error RENEW(old clid) -----> <--- expired error SETCLIENTID -----> <--- a new clid SETCLIENTID_CONFIRM (new clid) --> <--- ok RELEASE_LOCKOWNER(old clid) -----> <--- expired error RENEW(new clid) -----> <-- ok RELEASE_LOCKOWNER(old clid) -----> <--- expired error RENEW(new clid) -----> <-- ok ... ... Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> [Trond: replace call to nfs4_async_handle_error() with nfs4_schedule_lease_recovery()] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: fix two problems in lookup_revalidate in RCU-walkNeilBrown2014-08-041-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1/ rcu_dereference isn't correct: that field isn't RCU protected. It could potentially change at any time so ACCESS_ONCE might be justified. changes to ->d_parent are protected by ->d_seq. However that isn't always checked after ->d_revalidate is called, so it is safest to keep the double-check that ->d_parent hasn't changed at the end of these functions. 2/ in nfs4_lookup_revalidate, "->d_parent" was forgotten. So 'parent' was not the parent of 'dentry'. This fails safe is the context is that dentry->d_inode is NULL, and the result of parent->d_inode being NULL is that ECHILD is returned, which is always safe. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: allow lockless access to access_cacheNeilBrown2014-08-031-2/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The access cache is used during RCU-walk path lookups, so it is best to avoid locking if possible as taking a lock kills concurrency. The rbtree is not rcu-safe and cannot easily be made so. Instead we simply check the last (i.e. most recent) entry on the LRU list. If this doesn't match, then we return -ECHILD and retry in lock/refcount mode. This requires freeing the nfs_access_entry struct with rcu, and requires using rcu access primatives when adding entries to the lru, and when examining the last entry. Calling put_rpccred before kfree_rcu looks a bit odd, but as put_rpccred already provides rcu protection, we know that the cred will not actually be freed until the next grace period, so any concurrent access will be safe. This patch provides about 5% performance improvement on a stat-heavy synthetic work load with 4 threads on a 2-core CPU. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCUNeilBrown2014-08-031-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It fails with -ECHILD rather than make an RPC call. This allows nfs_lookup_revalidate to call it in RCU-walk mode. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: teach nfs_neg_need_reval to understand LOOKUP_RCUNeilBrown2014-08-032-17/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This requires nfs_check_verifier to take an rcu_walk flag, and requires an rcu version of nfs_revalidate_inode which returns -ECHILD rather than making an RPC call. With this, nfs_lookup_revalidate can call nfs_neg_need_reval in RCU-walk mode. We can also move the LOOKUP_RCU check past the nfs_check_verifier() call in nfs_lookup_revalidate. If RCU_WALK prevents nfs_check_verifier or nfs_neg_need_reval from doing a full check, they return a status indicating that a revalidation is required. As this revalidation will not be possible in RCU_WALK mode, -ECHILD will ultimately be returned, which is the desired result. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: support RCU_WALK in nfs_permission()NeilBrown2014-08-031-8/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs_permission makes two calls which are not always safe in RCU_WALK, rpc_lookup_cred and nfs_do_access. The second can easily be made rcu-safe by aborting with -ECHILD before making the RPC call. The former can be made rcu-safe by calling rpc_lookup_cred_nonblock() instead. As this will almost always succeed, we use it even when RCU_WALK isn't being used as it still saves some spinlocks in a common case. We only fall back to rpc_lookup_cred() if rpc_lookup_cred_nonblock() fails and MAY_NOT_BLOCK isn't set. This optimisation (always trying rpc_lookup_cred_nonblock()) is particularly important when a security module is active. In that case inode_permission() may return -ECHILD from security_inode_permission() even though ->permission() succeeded in RCU_WALK mode. This leads to may_lookup() retrying inode_permission after performing unlazy_walk(). The spinlock that rpc_lookup_cred() takes is often more expensive than anything security_inode_permission() does, so that spinlock becomes the main bottleneck. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | NFS: prepare for RCU-walk support but pushing tests later in code.NeilBrown2014-08-031-12/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nfs_lookup_revalidate, nfs4_lookup_revalidate, and nfs_permission all need to understand and handle RCU-walk for NFS to gain the benefits of RCU-walk for cached information. Currently these functions all immediately return -ECHILD if the relevant flag (LOOKUP_RCU or MAY_NOT_BLOCK) is set. This patch pushes those tests later in the code so that we only abort immediately before we enter rcu-unsafe code. As subsequent patches make that rcu-unsafe code rcu-safe, several of these new tests will disappear. With this patch there are several paths through the code which will no longer return -ECHILD during an RCU-walk. However these are mostly error paths or other uninteresting cases. A noteworthy change in nfs_lookup_revalidate is that we don't take (or put) the reference to ->d_parent when LOOKUP_RCU is set. Rather we rcu_dereference ->d_parent, and check that ->d_inode is not NULL. We also check that ->d_parent hasn't changed after all the tests. In nfs4_lookup_revalidate we simply avoid testing LOOKUP_RCU on the path that only calls nfs_lookup_revalidate() as that function already performs the required test. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
OpenPOWER on IntegriCloud