summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
...
* | | | | Merge branch 'for-linus' of ↵Linus Torvalds2015-07-0317-59/+292
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace updates from Eric Biederman: "Long ago and far away when user namespaces where young it was realized that allowing fresh mounts of proc and sysfs with only user namespace permissions could violate the basic rule that only root gets to decide if proc or sysfs should be mounted at all. Some hacks were put in place to reduce the worst of the damage could be done, and the common sense rule was adopted that fresh mounts of proc and sysfs should allow no more than bind mounts of proc and sysfs. Unfortunately that rule has not been fully enforced. There are two kinds of gaps in that enforcement. Only filesystems mounted on empty directories of proc and sysfs should be ignored but the test for empty directories was insufficient. So in my tree directories on proc, sysctl and sysfs that will always be empty are created specially. Every other technique is imperfect as an ordinary directory can have entries added even after a readdir returns and shows that the directory is empty. Special creation of directories for mount points makes the code in the kernel a smidge clearer about it's purpose. I asked container developers from the various container projects to help test this and no holes were found in the set of mount points on proc and sysfs that are created specially. This set of changes also starts enforcing the mount flags of fresh mounts of proc and sysfs are consistent with the existing mount of proc and sysfs. I expected this to be the boring part of the work but unfortunately unprivileged userspace winds up mounting fresh copies of proc and sysfs with noexec and nosuid clear when root set those flags on the previous mount of proc and sysfs. So for now only the atime, read-only and nodev attributes which userspace happens to keep consistent are enforced. Dealing with the noexec and nosuid attributes remains for another time. This set of changes also addresses an issue with how open file descriptors from /proc/<pid>/ns/* are displayed. Recently readlink of /proc/<pid>/fd has been triggering a WARN_ON that has not been meaningful since it was added (as all of the code in the kernel was converted) and is not now actively wrong. There is also a short list of issues that have not been fixed yet that I will mention briefly. It is possible to rename a directory from below to above a bind mount. At which point any directory pointers below the renamed directory can be walked up to the root directory of the filesystem. With user namespaces enabled a bind mount of the bind mount can be created allowing the user to pick a directory whose children they can rename to outside of the bind mount. This is challenging to fix and doubly so because all obvious solutions must touch code that is in the performance part of pathname resolution. As mentioned above there is also a question of how to ensure that developers by accident or with purpose do not introduce exectuable files on sysfs and proc and in doing so introduce security regressions in the current userspace that will not be immediately obvious and as such are likely to require breaking userspace in painful ways once they are recognized" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: vfs: Remove incorrect debugging WARN in prepend_path mnt: Update fs_fully_visible to test for permanently empty directories sysfs: Create mountpoints with sysfs_create_mount_point sysfs: Add support for permanently empty directories to serve as mount points. kernfs: Add support for always empty directories. proc: Allow creating permanently empty directories that serve as mount points sysctl: Allow creating permanently empty directories that serve as mountpoints. fs: Add helper functions for permanently empty directories. vfs: Ignore unlocked mounts in fs_fully_visible mnt: Modify fs_fully_visible to deal with locked ro nodev and atime mnt: Refactor the logic for mounting sysfs and proc in a user namespace
| * | | | | vfs: Remove incorrect debugging WARN in prepend_pathEric W. Biederman2015-07-011-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The warning message in prepend_path is unclear and outdated. It was added as a warning that the mechanism for generating names of pseudo files had been removed from prepend_path and d_dname should be used instead. Unfortunately the warning reads like a general warning, making it unclear what to do with it. Remove the warning. The transition it was added to warn about is long over, and I added code several years ago which in rare cases causes the warning to fire on legitimate code, and the warning is now firing and scaring people for no good reason. Cc: stable@vger.kernel.org Reported-by: Ivan Delalande <colona@arista.com> Reported-by: Omar Sandoval <osandov@osandov.com> Fixes: f48cfddc6729e ("vfs: In d_path don't call d_dname on a mount point") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | mnt: Update fs_fully_visible to test for permanently empty directoriesEric W. Biederman2015-07-011-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fs_fully_visible attempts to make fresh mounts of proc and sysfs give the mounter no more access to proc and sysfs than if they could have by creating a bind mount. One aspect of proc and sysfs that makes this particularly tricky is that there are other filesystems that typically mount on top of proc and sysfs. As those filesystems are mounted on empty directories in practice it is safe to ignore them. However testing to ensure filesystems are mounted on empty directories has not been something the in kernel data structures have supported so the current test for an empty directory which checks to see if nlink <= 2 is a bit lacking. proc and sysfs have recently been modified to use the new empty_dir infrastructure to create all of their dedicated mount points. Instead of testing for S_ISDIR(inode->i_mode) && i_nlink <= 2 to see if a directory is empty, test for is_empty_dir_inode(inode). That small change guaranteess mounts found on proc and sysfs really are safe to ignore, because the directories are not only empty but nothing can ever be added to them. This guarantees there is nothing to worry about when mounting proc and sysfs. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | sysfs: Create mountpoints with sysfs_create_mount_pointEric W. Biederman2015-07-015-31/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written. The mount points converted and their filesystems are: /sys/hypervisor/s390/ s390_hypfs /sys/kernel/config/ configfs /sys/kernel/debug/ debugfs /sys/firmware/efi/efivars/ efivarfs /sys/fs/fuse/connections/ fusectl /sys/fs/pstore/ pstore /sys/kernel/tracing/ tracefs /sys/fs/cgroup/ cgroup /sys/kernel/security/ securityfs /sys/fs/selinux/ selinuxfs /sys/fs/smackfs/ smackfs Cc: stable@vger.kernel.org Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | sysfs: Add support for permanently empty directories to serve as mount points.Eric W. Biederman2015-07-011-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add two functions sysfs_create_mount_point and sysfs_remove_mount_point that hang a permanently empty directory off of a kobject or remove a permanently emptpy directory hanging from a kobject. Export these new functions so modular filesystems can use them. Cc: stable@vger.kernel.org Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | kernfs: Add support for always empty directories.Eric W. Biederman2015-07-012-1/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new function kernfs_create_empty_dir that can be used to create directory that can not be modified. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | proc: Allow creating permanently empty directories that serve as mount pointsEric W. Biederman2015-07-014-2/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new function proc_create_mount_point that when used to creates a directory that can not be added to. Add a new function is_empty_pde to test if a function is a mount point. Update the code to use make_empty_dir_inode when reporting a permanently empty directory to the vfs. Update the code to not allow adding to permanently empty directories. Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | sysctl: Allow creating permanently empty directories that serve as mountpoints.Eric W. Biederman2015-07-011-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a magic sysctl table sysctl_mount_point that when used to create a directory forces that directory to be permanently empty. Update the code to use make_empty_dir_inode when accessing permanently empty directories. Update the code to not allow adding to permanently empty directories. Update /proc/sys/fs/binfmt_misc to be a permanently empty directory. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | fs: Add helper functions for permanently empty directories.Eric W. Biederman2015-07-011-0/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To ensure it is safe to mount proc and sysfs I need to check if filesystems that are mounted on top of them are mounted on truly empty directories. Given that some directories can gain entries over time, knowing that a directory is empty right now is insufficient. Therefore add supporting infrastructure for permantently empty directories that proc and sysfs can use when they create mount points for filesystems and fs_fully_visible can use to test for permanently empty directories to ensure that nothing will be gained by mounting a fresh copy of proc or sysfs. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | vfs: Ignore unlocked mounts in fs_fully_visibleEric W. Biederman2015-07-011-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Limit the mounts fs_fully_visible considers to locked mounts. Unlocked can always be unmounted so considering them adds hassle but no security benefit. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | mnt: Modify fs_fully_visible to deal with locked ro nodev and atimeEric W. Biederman2015-06-041-3/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ignore an existing mount if the locked readonly, nodev or atime attributes are less permissive than the desired attributes of the new mount. On success ensure the new mount locks all of the same readonly, nodev and atime attributes as the old mount. The nosuid and noexec attributes are not checked here as this change is destined for stable and enforcing those attributes causes a regression in lxc and libvirt-lxc where those applications will not start and there are no known executables on sysfs or proc and no known way to create exectuables without code modifications Cc: stable@vger.kernel.org Fixes: e51db73532955 ("userns: Better restrictions on when proc and sysfs can be mounted") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * | | | | mnt: Refactor the logic for mounting sysfs and proc in a user namespaceEric W. Biederman2015-05-133-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fresh mounts of proc and sysfs are a very special case that works very much like a bind mount. Unfortunately the current structure can not preserve the MNT_LOCK... mount flags. Therefore refactor the logic into a form that can be modified to preserve those lock bits. Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount of the filesystem be fully visible in the current mount namespace, before the filesystem may be mounted. Move the logic for calling fs_fully_visible from proc and sysfs into fs/namespace.c where it has greater access to mount namespace state. Cc: stable@vger.kernel.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2015-07-0212-894/+1689
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "We have a pile of bug fixes from Ilya, including a few patches that sync up the CRUSH code with the latest from userspace. There is also a long series from Zheng that fixes various issues with snapshots, inline data, and directory fsync, some simplification and improvement in the cap release code, and a rework of the caching of directory contents. To top it off there are a few small fixes and cleanups from Benoit and Hong" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) rbd: use GFP_NOIO in rbd_obj_request_create() crush: fix a bug in tree bucket decode libceph: Fix ceph_tcp_sendpage()'s more boolean usage libceph: Remove spurious kunmap() of the zero page rbd: queue_depth map option rbd: store rbd_options in rbd_device rbd: terminate rbd_opts_tokens with Opt_err ceph: fix ceph_writepages_start() rbd: bump queue_max_segments ceph: rework dcache readdir crush: sync up with userspace crush: fix crash from invalid 'take' argument ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL ceph: pre-allocate data structure that tracks caps flushing ceph: re-send flushing caps (which are revoked) in reconnect stage ceph: send TID of the oldest pending caps flush to MDS ceph: track pending caps flushing globally ceph: track pending caps flushing accurately libceph: fix wrong name "Ceph filesystem for Linux" ceph: fix directory fsync ...
| * | | | | | ceph: fix ceph_writepages_start()Yan, Zheng2015-06-251-14/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before a page get locked, someone else can write data to the page and increase the i_size. So we should re-check the i_size after pages are locked. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: rework dcache readdirYan, Zheng2015-06-256-215/+295
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously our dcache readdir code relies on that child dentries in directory dentry's d_subdir list are sorted by dentry's offset in descending order. When adding dentries to the dcache, if a dentry already exists, our readdir code moves it to head of directory dentry's d_subdir list. This design relies on dcache internals. Al Viro suggests using ncpfs's approach: keeping array of pointers to dentries in page cache of directory inode. the validity of those pointers are presented by directory inode's complete and ordered flags. When a dentry gets pruned, we clear directory inode's complete flag in the d_prune() callback. Before moving a dentry to other directory, we clear the ordered flag for both old and new directory. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: switch some GFP_NOFS memory allocation to GFP_KERNELYan, Zheng2015-06-255-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GFP_NOFS memory allocation is required for page writeback path. But there is no need to use GFP_NOFS in syscall path and readpage path Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: pre-allocate data structure that tracks caps flushingYan, Zheng2015-06-258-16/+102
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: re-send flushing caps (which are revoked) in reconnect stageYan, Zheng2015-06-253-6/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | if flushing caps were revoked, we should re-send the cap flush in client reconnect stage. This guarantees that MDS processes the cap flush message before issuing the flushing caps to other client. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: send TID of the oldest pending caps flush to MDSYan, Zheng2015-06-251-18/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to this information, MDS can trim its completed caps flush list (which is used to detect duplicated cap flush). Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: track pending caps flushing globallyYan, Zheng2015-06-255-57/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So we know TID of the oldest pending caps flushing. Later patch will send this information to MDS, so that MDS can trim its completed caps flush list. Tracking pending caps flushing globally also simplifies syncfs code. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: track pending caps flushing accuratelyYan, Zheng2015-06-255-88/+192
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we do not trace accurate TID for flushing caps. when MDS failovers, we have no choice but to re-send all flushing caps with a new TID. This can cause problem because MDS can has already flushed some caps and has issued the same caps to other client. The re-sent cap flush has a new TID, which makes MDS unable to detect if it has already processed the cap flush. This patch adds code to track pending caps flushing accurately. When re-sending cap flush is needed, we use its original flush TID. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: fix directory fsyncYan, Zheng2015-06-252-64/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fsync() on directory should flush dirty caps and wait for any uncommitted directory opertions to commit. But ceph_dir_fsync() only waits for uncommitted directory opertions. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: fix flushing capsYan, Zheng2015-06-251-24/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current ceph_fsync() only flushes dirty caps and wait for them to be flushed. It doesn't wait for caps that has already been flushing. This patch makes ceph_fsync() wait for pending flushing caps too. Besides, this patch also makes caps_are_flushed() peroperly handle tid wrapping. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: don't include used caps in cap_wantedYan, Zheng2015-06-251-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when copying files to cephfs, file data may stay in page cache after corresponding file is closed. Cached data use Fc capability. If we include Fc capability in cap_wanted, MDS will treat files with cached data as open files, and journal them in an EOpen event when trimming log segment. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: ratelimit warn messages for MDS closes sessionYan, Zheng2015-06-251-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: simplify two mount_timeout sitesIlya Dryomov2015-06-252-18/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | No need to bifurcate wait now that we've got ceph_timeout_jiffies(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | libceph: store timeouts in jiffies, verify user inputIlya Dryomov2015-06-254-10/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * | | | | | ceph: exclude setfilelock requests when calculating oldest tidYan, Zheng2015-06-252-7/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | setfilelock requests can block for a long time, which can prevent client from advancing its oldest tid. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: don't pre-allocate space for cap release messagesYan, Zheng2015-06-254-202/+129
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously we pre-allocate cap release messages for each caps. This wastes lots of memory when there are large amount of caps. This patch make the code not pre-allocate the cap release messages. Instead, we add the corresponding ceph_cap struct to a list when releasing a cap. Later when flush cap releases is needed, we allocate the cap release messages dynamically. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: make sure syncfs flushes all cap snapsYan, Zheng2015-06-254-31/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: don't trim auth cap when there are cap snapsYan, Zheng2015-06-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: take snap_rwsem when accessing snap realm's cached_contextYan, Zheng2015-06-253-7/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ceph inode's i_head_snapc is NULL, __ceph_mark_dirty_caps() accesses snap realm's cached_context. So we need take read lock of snap_rwsem. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: avoid sending unnessesary FLUSHSNAP messageYan, Zheng2015-06-253-45/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when a snap notification contains no new snapshot, we can avoid sending FLUSHSNAP message to MDS. But we still need to create cap_snap in some case because it's required by write path and page writeback path Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: set i_head_snapc when getting CEPH_CAP_FILE_WR referenceYan, Zheng2015-06-254-137/+212
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In most cases that snap context is needed, we are holding reference of CEPH_CAP_FILE_WR. So we can set ceph inode's i_head_snapc when getting the CEPH_CAP_FILE_WR reference, and make codes get snap context from i_head_snapc. This makes the code simpler. Another benefit of this change is that we can handle snap notification more elegantly. Especially when snap context is updated while someone else is doing write. The old queue cap_snap code may set cap_snap's context to ether the old context or the new snap context, depending on if i_head_snapc is set. The new queue capp_snap code always set cap_snap's context to the old snap context. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: use empty snap context for uninline_data and get_pool_permYan, Zheng2015-06-253-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cached_context in ceph_snap_realm is directly accessed by uninline_data() and get_pool_perm(). This is racy in theory. both uninline_data() and get_pool_perm() do not modify existing object, they only create new object. So we can pass the empty snap context to them. Unlike cached_context in ceph_snap_realm, we do not need to protect the empty snap context. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | ceph: check OSD caps before read/writeYan, Zheng2015-06-257-6/+249
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * | | | | | libceph: allow setting osd_req_op's flagsYan, Zheng2015-06-252-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* | | | | | | Merge tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2015-07-0226-192/+1004
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client updates from Trond Myklebust: "Highlights include: Stable patches: - Fix a crash in the NFSv4 file locking code. - Fix an fsync() regression, where we were failing to retry I/O in some circumstances. - Fix an infinite loop in NFSv4.0 OPEN stateid recovery - Fix a memory leak when an attempted pnfs fails. - Fix a memory leak in the backchannel code - Large hostnames were not supported correctly in NFSv4.1 - Fix a pNFS/flexfiles bug that was impeding error reporting on I/O. - Fix a couple of credential issues in pNFS/flexfiles Bugfixes + cleanups: - Open flag sanity checks in the NFSv4 atomic open codepath - More NFSv4 delegation related bugfixes - Various NFSv4.1 backchannel bugfixes and cleanups - Fix the NFS swap socket code - Various cleanups of the NFSv4 SETCLIENTID and EXCHANGE_ID code - Fix a UDP transport deadlock issue Features: - More RDMA client transport improvements - NFSv4.2 LAYOUTSTATS functionality for pnfs flexfiles" * tag 'nfs-for-4.2-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (87 commits) nfs: Remove invalid tk_pid from debug message nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfh nfs: Drop bad comment in nfs41_walk_client_list() nfs: Remove unneeded micro checking of CONFIG_PROC_FS nfs: Don't setting FILE_CREATED flags always nfs: Use remove_proc_subtree() instead remove_proc_entry() nfs: Remove unused argument in nfs_server_set_fsinfo() nfs: Fix a memory leak when meeting an unsupported state protect nfs: take extra reference to fl->fl_file when running a LOCKU operation NFSv4: When returning a delegation, don't reclaim an incompatible open mode. NFSv4.2: LAYOUTSTATS is optional to implement NFSv4.2: Fix up a decoding error in layoutstats pNFS/flexfiles: Fix the reset of struct pgio_header when resending pNFS/flexfiles: Turn off layoutcommit for servers that don't need it pnfs/flexfiles: protect ktime manipulation with mirror lock nfs: provide pnfs_report_layoutstat when NFS42 is disabled nfs: verify open flags before allowing open nfs: always update creds in mirror, even when we have an already connected ds nfs: fix potential credential leak in ff_layout_update_mirror_cred pnfs/flexfiles: report layoutstat regularly ...
| * | | | | | | nfs: Remove invalid tk_pid from debug messageKinglong Mee2015-07-013-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before rpc_run_task(), tk_pid is uninitiated as 0 always. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Remove invalid NFS_ATTR_FATTR_V4_REFERRAL checking in nfs4_get_rootfhKinglong Mee2015-07-011-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NFS_ATTR_FATTR_V4_REFERRAL is only set in nfs4_proc_lookup_common. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Drop bad comment in nfs41_walk_client_list()Kinglong Mee2015-07-011-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7b1f1fd184 "NFSv4/4.1: Fix bugs in nfs4[01]_walk_client_list" have change the logical of the list_for_each_entry(). Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Remove unneeded micro checking of CONFIG_PROC_FSKinglong Mee2015-07-011-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Have checking CONFIG_PROC_FS in include/linux/sunrpc/stats.h. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Don't setting FILE_CREATED flags alwaysKinglong Mee2015-07-011-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5bc2afc2b5 "NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method" have support the opened arguments now. Also, Commit 03da633aa7 "atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c" have change vfs's logical. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Use remove_proc_subtree() instead remove_proc_entry()Kinglong Mee2015-07-011-18/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thanks for Al Viro's comments of killing proc_fs_nfs completely. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Remove unused argument in nfs_server_set_fsinfo()Kinglong Mee2015-07-011-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e38eb6506f "NFS: set_pnfs_layoutdriver() from nfs4_proc_fsinfo()" have remove the using of mntfh from nfs_server_set_fsinfo(). Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: Fix a memory leak when meeting an unsupported state protectKinglong Mee2015-07-011-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | nfs: take extra reference to fl->fl_file when running a LOCKU operationJeff Layton2015-07-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jean reported another crash, similar to the one fixed by feaff8e5b2cf: BUG: unable to handle kernel NULL pointer dereference at 0000000000000148 IP: [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0 PGD 0 Oops: 0000 [#1] SMP Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache vmw_vsock_vmci_transport vsock cfg80211 rfkill coretemp crct10dif_pclmul ppdev vmw_balloon crc32_pclmul crc32c_intel ghash_clmulni_intel pcspkr vmxnet3 parport_pc i2c_piix4 microcode serio_raw parport nfsd floppy vmw_vmci acpi_cpufreq auth_rpcgss shpchp nfs_acl lockd grace sunrpc vmwgfx drm_kms_helper ttm drm mptspi scsi_transport_spi mptscsih ata_generic mptbase i2c_core pata_acpi CPU: 0 PID: 329 Comm: kworker/0:1H Not tainted 4.1.0-rc7+ #2 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/30/2013 Workqueue: rpciod rpc_async_schedule [sunrpc] 30ec000 RIP: 0010:[<ffffffff8124ef7f>] [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0 RSP: 0018:ffff8802330efc08 EFLAGS: 00010296 RAX: ffff8802330efc58 RBX: ffff880097187c80 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000 RBP: ffff8802330efc18 R08: ffff88023fc173d8 R09: 3038b7bf00000000 R10: 00002f1a02000000 R11: 3038b7bf00000000 R12: 0000000000000000 R13: 0000000000000000 R14: ffff8802337a2300 R15: 0000000000000020 FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000148 CR3: 000000003680f000 CR4: 00000000001407f0 Stack: ffff880097187c80 ffff880097187cd8 ffff8802330efc98 ffffffff81250281 ffff8802330efc68 ffffffffa013e7df ffff8802330efc98 0000000000000246 ffff8801f6901c00 ffff880233d2b8d8 ffff8802330efc58 ffff8802330efc58 Call Trace: [<ffffffff81250281>] __posix_lock_file+0x31/0x5e0 [<ffffffffa013e7df>] ? rpc_wake_up_task_queue_locked.part.35+0xcf/0x240 [sunrpc] [<ffffffff8125088b>] posix_lock_file_wait+0x3b/0xd0 [<ffffffffa03890b2>] ? nfs41_wake_and_assign_slot+0x32/0x40 [nfsv4] [<ffffffffa0365808>] ? nfs41_sequence_done+0xd8/0x300 [nfsv4] [<ffffffffa0367525>] do_vfs_lock+0x35/0x40 [nfsv4] [<ffffffffa03690c1>] nfs4_locku_done+0x81/0x120 [nfsv4] [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc] [<ffffffffa013e310>] ? rpc_destroy_wait_queue+0x20/0x20 [sunrpc] [<ffffffffa013e33c>] rpc_exit_task+0x2c/0x90 [sunrpc] [<ffffffffa0134400>] ? call_refreshresult+0x170/0x170 [sunrpc] [<ffffffffa013ece4>] __rpc_execute+0x84/0x410 [sunrpc] [<ffffffffa013f085>] rpc_async_schedule+0x15/0x20 [sunrpc] [<ffffffff810add67>] process_one_work+0x147/0x400 [<ffffffff810ae42b>] worker_thread+0x11b/0x460 [<ffffffff810ae310>] ? rescuer_thread+0x2f0/0x2f0 [<ffffffff810b35d9>] kthread+0xc9/0xe0 [<ffffffff81010000>] ? perf_trace_xen_mmu_set_pmd+0xa0/0x160 [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170 [<ffffffff8173c222>] ret_from_fork+0x42/0x70 [<ffffffff810b3510>] ? kthread_create_on_node+0x170/0x170 Code: a5 81 e8 85 75 e4 ff c6 05 31 ee aa 00 01 eb 98 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 54 49 89 fc 53 <48> 8b 9f 48 01 00 00 48 85 db 74 08 48 89 d8 5b 41 5c 5d c3 83 RIP [<ffffffff8124ef7f>] locks_get_lock_context+0xf/0xa0 RSP <ffff8802330efc08> CR2: 0000000000000148 ---[ end trace 64484f16250de7ef ]--- The problem is almost exactly the same as the one fixed by feaff8e5b2cf. We must take a reference to the struct file when running the LOCKU compound to prevent the final fput from running until the operation is complete. Reported-by: Jean Spector <jean@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | NFSv4: When returning a delegation, don't reclaim an incompatible open mode.NeilBrown2015-06-291-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible to have an active open with one mode, and a delegation for the same file with a different mode. In particular, a WR_ONLY open and an RD_ONLY delegation. This happens if a WR_ONLY open is followed by a RD_ONLY open which provides a delegation, but is then close. When returning the delegation, we currently try to claim opens for every open type (n_rdwr, n_rdonly, n_wronly). As there is no harm in claiming an open for a mode that we already have, this is often simplest. However if the delegation only provides a subset of the modes that we currently have open, this will produce an error from the server. So when claiming open modes prior to returning a delegation, skip the open request if the mode is not covered by the delegation - the open_stateid must already cover that mode, so there is nothing to do. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | NFSv4.2: LAYOUTSTATS is optional to implementTrond Myklebust2015-06-273-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it so, by checking the return value for NFS4ERR_MOTSUPP and caching the information as a server capability. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
| * | | | | | | NFSv4.2: Fix up a decoding error in layoutstatsTrond Myklebust2015-06-271-17/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the spec, the server is only returning the status, which we decode in the op header. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
OpenPOWER on IntegriCloud