summaryrefslogtreecommitdiffstats
path: root/fs/proc
Commit message (Collapse)AuthorAgeFilesLines
* mm: use mmget_not_zero() helperVegard Nossum2017-02-273-5/+5
| | | | | | | | | | | | | | | | | | We already have the helper, we can convert the rest of the kernel mechanically using: git grep -l 'atomic_inc_not_zero.*mm_users' | xargs sed -i 's/atomic_inc_not_zero(&\(.*\)->mm_users)/mmget_not_zero\(\1\)/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. Link: http://lkml.kernel.org/r/20161218123229.22952-3-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: add new mmgrab() helperVegard Nossum2017-02-271-2/+2
| | | | | | | | | | | | | | | | | | | | | Apart from adding the helper function itself, the rest of the kernel is converted mechanically using: git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/' git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/' This is needed for a later patch that hooks into the helper, but might be a worthwhile cleanup on its own. (Michal Hocko provided most of the kerneldoc comment.) Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* /proc/kcore: update physical address for kcore ram and textPratyush Anand2017-02-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently all the p_paddr of PT_LOAD headers are assigned to 0, which is not true and could be misleading, since 0 is a valid physical address. User space tools like makedumpfile needs to know physical address for PT_LOAD segments of direct mapped regions. Therefore this patch updates paddr for such regions. It also sets an invalid paddr (-1) for other regions, so that user space tool can know whether a physical address provided in PT_LOAD is correct or not. I do not know why it was 0, which is a valid physical address. But certainly, it might break some user space tools, and those need to be fixed. For example, see following code from kexec-tools kexec/kexec-elf.c:build_mem_phdrs() if ((phdr->p_paddr + phdr->p_memsz) < phdr->p_paddr) { /* The memory address wraps */ if (probe_debug) { fprintf(stderr, "ELF address wrap around\n"); } return -1; } We do not need to perform above check for an invalid physical address. I think, kexec-tools and makedumpfile will need fixup. I already have those fixup which will be sent upstream once this patch makes through. Pro with this approach is that, it will help to calculate variable like page_offset, phys_base from PT_LOAD even when they are randomized and therefore will reduce many variable and version specific values in user space tools. Having an ASLR offset information can help to translate an identity mapped virtual address to a physical address. But that would be an additional field in PT_LOAD header structure and an arch dependent value. Moreover, sending a valid physical address like 0 does not seem right. So, IMHO it is better to fix that and send valid physical address when available (identity mapped). Link: http://lkml.kernel.org/r/f951340d2917cdd2a329fae9837a83f2059dc3b2.1485318868.git.panand@redhat.com Signed-off-by: Pratyush Anand <panand@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Dave Anderson <anderson@redhat.com> Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp> Cc: Simon Horman <simon.horman@netronome.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* procfs: use an enum for possible hidepid valuesLafcadio Wluiki2017-02-243-6/+7
| | | | | | | | | | | | | | | | | | | | | | Previously, the hidepid parameter was checked by comparing literal integers 0, 1, 2. Let's add a proper enum for this, to make the checking more expressive: 0 → HIDEPID_OFF 1 → HIDEPID_NO_ACCESS 2 → HIDEPID_INVISIBLE This changes the internal labelling only, the userspace-facing interface remains unmodified, and still works with literal integers 0, 1, 2. No functional changes. Link: http://lkml.kernel.org/r/1484572984-13388-2-git-send-email-djalal@gmail.com Signed-off-by: Lafcadio Wluiki <wluikil@gmail.com> Signed-off-by: Djalal Harouni <tixxdz@gmail.com> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: less code duplication in /proc/*/cmdlineAlexey Dobriyan2017-02-241-88/+56
| | | | | | | | | | | After staring at this code for a while I've figured using small 2-entry array describing ARGV and ENVP is the way to address code duplication critique. Link: http://lkml.kernel.org/r/20170105185724.GA12027@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: use rb_entry()Geliang Tang2017-02-241-5/+6
| | | | | | | | | | | | | | | | To make the code clearer, use rb_entry() instead of container_of() to deal with rbtree. Link: http://lkml.kernel.org/r/4fd1f82818665705ce75c5156a060ae7caa8e0a9.1482160150.git.geliangtang@gmail.com Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Juergen Gross <jgross@suse.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* userfaultfd: non-cooperative: add event for memory unmapsMike Rapoport2017-02-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a non-cooperative userfaultfd monitor copies pages in the background, it may encounter regions that were already unmapped. Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely changes in the virtual memory layout. Since there might be different uffd contexts for the affected VMAs, we first should create a temporary representation for the unmap event for each uffd context and then notify them one by one to the appropriate userfault file descriptors. The event notification occurs after the mmap_sem has been released. [arnd@arndb.de: fix nommu build] Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de [mhocko@suse.com: fix nommu build] Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang2017-02-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | ->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2017-02-235-89/+117
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There is a lot here. A lot of these changes result in subtle user visible differences in kernel behavior. I don't expect anything will care but I will revert/fix things immediately if any regressions show up. From Seth Forshee there is a continuation of the work to make the vfs ready for unpriviled mounts. We had thought the previous changes prevented the creation of files outside of s_user_ns of a filesystem, but it turns we missed the O_CREAT path. Ooops. Pavel Tikhomirov and Oleg Nesterov worked together to fix a long standing bug in the implemenation of PR_SET_CHILD_SUBREAPER where only children that are forked after the prctl are considered and not children forked before the prctl. The only known user of this prctl systemd forks all children after the prctl. So no userspace regressions will occur. Holding earlier forked children to the same rules as later forked children creates a semantic that is sane enough to allow checkpoing of processes that use this feature. There is a long delayed change by Nikolay Borisov to limit inotify instances inside a user namespace. Michael Kerrisk extends the API for files used to maniuplate namespaces with two new trivial ioctls to allow discovery of the hierachy and properties of namespaces. Konstantin Khlebnikov with the help of Al Viro adds code that when a network namespace exits purges it's sysctl entries from the dcache. As in some circumstances this could use a lot of memory. Vivek Goyal fixed a bug with stacked filesystems where the permissions on the wrong inode were being checked. I continue previous work on ptracing across exec. Allowing a file to be setuid across exec while being ptraced if the tracer has enough credentials in the user namespace, and if the process has CAP_SETUID in it's own namespace. Proc files for setuid or otherwise undumpable executables are now owned by the root in the user namespace of their mm. Allowing debugging of setuid applications in containers to work better. A bug I introduced with permission checking and automount is now fixed. The big change is to mark the mounts that the kernel initiates as a result of an automount. This allows the permission checks in sget to be safely suppressed for this kind of mount. As the permission check happened when the original filesystem was mounted. Finally a special case in the mount namespace is removed preventing unbounded chains in the mount hash table, and making the semantics simpler which benefits CRIU. The vfs fix along with related work in ima and evm I believe makes us ready to finish developing and merge fully unprivileged mounts of the fuse filesystem. The cleanups of the mount namespace makes discussing how to fix the worst case complexity of umount. The stacked filesystem fixes pave the way for adding multiple mappings for the filesystem uids so that efficient and safer containers can be implemented" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc/sysctl: Don't grab i_lock under sysctl_lock. vfs: Use upper filesystem inode in bprm_fill_uid() proc/sysctl: prune stale dentries during unregistering mnt: Tuck mounts under others instead of creating shadow/side mounts. prctl: propagate has_child_subreaper flag to every descendant introduce the walk_process_tree() helper nsfs: Add an ioctl() to return owner UID of a userns fs: Better permission checking for submounts exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction vfs: open() with O_CREAT should not create inodes with unknown ids nsfs: Add an ioctl() to return the namespace type proc: Better ownership of files for non-dumpable tasks in user namespaces exec: Remove LSM_UNSAFE_PTRACE_CAP exec: Test the ptracer's saved cred to see if the tracee can gain caps exec: Don't reset euid and egid when the tracee has CAP_SETUID inotify: Convert to using per-namespace limits
| * proc/sysctl: Don't grab i_lock under sysctl_lock.Eric W. Biederman2017-02-221-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > This patch has locking problem. I've got lockdep splat under LTP. > > [ 6633.115456] ====================================================== > [ 6633.115502] [ INFO: possible circular locking dependency detected ] > [ 6633.115553] 4.9.10-debug+ #9 Tainted: G L > [ 6633.115584] ------------------------------------------------------- > [ 6633.115627] ksm02/284980 is trying to acquire lock: > [ 6633.115659] (&sb->s_type->i_lock_key#4){+.+...}, at: [<ffffffff816bc1ce>] igrab+0x1e/0x80 > [ 6633.115834] but task is already holding lock: > [ 6633.115882] (sysctl_lock){+.+...}, at: [<ffffffff817e379b>] unregister_sysctl_table+0x6b/0x110 > [ 6633.116026] which lock already depends on the new lock. > [ 6633.116026] > [ 6633.116080] > [ 6633.116080] the existing dependency chain (in reverse order) is: > [ 6633.116117] > -> #2 (sysctl_lock){+.+...}: > -> #1 (&(&dentry->d_lockref.lock)->rlock){+.+...}: > -> #0 (&sb->s_type->i_lock_key#4){+.+...}: > > d_lock nests inside i_lock > sysctl_lock nests inside d_lock in d_compare > > This patch adds i_lock nesting inside sysctl_lock. Al Viro <viro@ZenIV.linux.org.uk> replied: > Once ->unregistering is set, you can drop sysctl_lock just fine. So I'd > try something like this - use rcu_read_lock() in proc_sys_prune_dcache(), > drop sysctl_lock() before it and regain after. Make sure that no inodes > are added to the list ones ->unregistering has been set and use RCU list > primitives for modifying the inode list, with sysctl_lock still used to > serialize its modifications. > > Freeing struct inode is RCU-delayed (see proc_destroy_inode()), so doing > igrab() is safe there. Since we don't drop inode reference until after we'd > passed beyond it in the list, list_for_each_entry_rcu() should be fine. I agree with Al Viro's analsysis of the situtation. Fixes: d6cffbbe9a7e ("proc/sysctl: prune stale dentries during unregistering") Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Tested-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
| * proc/sysctl: prune stale dentries during unregisteringKonstantin Khlebnikov2017-02-133-19/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently unregistering sysctl table does not prune its dentries. Stale dentries could slowdown sysctl operations significantly. For example, command: # for i in {1..100000} ; do unshare -n -- sysctl -a &> /dev/null ; done creates a millions of stale denties around sysctls of loopback interface: # sysctl fs.dentry-state fs.dentry-state = 25812579 24724135 45 0 0 0 All of them have matching names thus lookup have to scan though whole hash chain and call d_compare (proc_sys_compare) which checks them under system-wide spinlock (sysctl_lock). # time sysctl -a > /dev/null real 1m12.806s user 0m0.016s sys 1m12.400s Currently only memory reclaimer could remove this garbage. But without significant memory pressure this never happens. This patch collects sysctl inodes into list on sysctl table header and prunes all their dentries once that table unregisters. Konstantin Khlebnikov <khlebnikov@yandex-team.ru> writes: > On 10.02.2017 10:47, Al Viro wrote: >> how about >> the matching stats *after* that patch? > > dcache size doesn't grow endlessly, so stats are fine > > # sysctl fs.dentry-state > fs.dentry-state = 92712 58376 45 0 0 0 > > # time sysctl -a &>/dev/null > > real 0m0.013s > user 0m0.004s > sys 0m0.008s Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| * proc: Better ownership of files for non-dumpable tasks in user namespacesEric W. Biederman2017-01-243-69/+61
| | | | | | | | | | | | | | | | | | Instead of making the files owned by the GLOBAL_ROOT_USER. Make non-dumpable files whose mm has always lived in a user namespace owned by the user namespace root. This allows the container root to have things work as expected in a container. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
* | Merge branch 'next' of ↵Linus Torvalds2017-02-211-4/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security layer updates from James Morris: "Highlights: - major AppArmor update: policy namespaces & lots of fixes - add /sys/kernel/security/lsm node for easy detection of loaded LSMs - SELinux cgroupfs labeling support - SELinux context mounts on tmpfs, ramfs, devpts within user namespaces - improved TPM 2.0 support" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (117 commits) tpm: declare tpm2_get_pcr_allocation() as static tpm: Fix expected number of response bytes of TPM1.2 PCR Extend tpm xen: drop unneeded chip variable tpm: fix misspelled "facilitate" in module parameter description tpm_tis: fix the error handling of init_tis() KEYS: Use memzero_explicit() for secret data KEYS: Fix an error code in request_master_key() sign-file: fix build error in sign-file.c with libressl selinux: allow changing labels for cgroupfs selinux: fix off-by-one in setprocattr tpm: silence an array overflow warning tpm: fix the type of owned field in cap_t tpm: add securityfs support for TPM 2.0 firmware event log tpm: enhance read_log_of() to support Physical TPM event log tpm: enhance TPM 2.0 PCR extend to support multiple banks tpm: implement TPM 2.0 capability to get active PCR banks tpm: fix RC value check in tpm2_seal_trusted tpm_tis: fix iTPM probe via probe_itpm() function tpm: Begin the process to deprecate user_read_timer tpm: remove tpm_read_index and tpm_write_index from tpm.h ...
| * | proc,security: move restriction on writing /proc/pid/attr nodes to procStephen Smalley2017-01-091-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Processes can only alter their own security attributes via /proc/pid/attr nodes. This is presently enforced by each individual security module and is also imposed by the Linux credentials implementation, which only allows a task to alter its own credentials. Move the check enforcing this restriction from the individual security modules to proc_pid_attr_write() before calling the security hook, and drop the unnecessary task argument to the security hook since it can only ever be the current task. Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: Casey Schaufler <casey@schaufler-ca.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | | Merge branch 'sched-core-for-linus' of ↵Linus Torvalds2017-02-203-45/+42
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this (fairly busy) cycle were: - There was a class of scheduler bugs related to forgetting to update the rq-clock timestamp which can cause weird and hard to debug problems, so there's a new debug facility for this: which uncovered a whole lot of bugs which convinced us that we want to keep the debug facility. (Peter Zijlstra, Matt Fleming) - Various cputime related updates: eliminate cputime and use u64 nanoseconds directly, simplify and improve the arch interfaces, implement delayed accounting more widely, etc. - (Frederic Weisbecker) - Move code around for better structure plus cleanups (Ingo Molnar) - Move IO schedule accounting deeper into the scheduler plus related changes to improve the situation (Tejun Heo) - ... plus a round of sched/rt and sched/deadline fixes, plus other fixes, updats and cleanups" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits) sched/core: Remove unlikely() annotation from sched_move_task() sched/autogroup: Rename auto_group.[ch] to autogroup.[ch] sched/topology: Split out scheduler topology code from core.c into topology.c sched/core: Remove unnecessary #include headers sched/rq_clock: Consolidate the ordering of the rq_clock methods delayacct: Include <uapi/linux/taskstats.h> sched/core: Clean up comments sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds sched/clock: Add dummy clear_sched_clock_stable() stub function sched/cputime: Remove generic asm headers sched/cputime: Remove unused nsec_to_cputime() s390, sched/cputime: Remove unused cputime definitions powerpc, sched/cputime: Remove unused cputime definitions s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs ia64, sched/cputime: Remove unused cputime definitions ia64: Convert vtime to use nsec units directly ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it sched/cputime: Remove jiffies based cputime sched/cputime, vtime: Return nsecs instead of cputime_t to account sched/cputime: Complete nsec conversion of tick based accounting ...
| * | | s390, sched/cputime: Make arch_cpu_idle_time() to return nsecsFrederic Weisbecker2017-02-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This way we don't need to deal with cputime_t details from the core code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-32-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/cputime: Convert task/group cputime to nsecsFrederic Weisbecker2017-02-011-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that most cputime readers use the transition API which return the task cputime in old style cputime_t, we can safely store the cputime in nsecs. This will eventually make cputime statistics less opaque and more granular. Back and forth convertions between cputime_t and nsecs in order to deal with cputime_t random granularity won't be needed anymore. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/cputime: Convert guest time accounting to nsecs (u64)Frederic Weisbecker2017-02-011-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cputime_t is being obsolete and replaced by nsecs units in order to make internal timestamps less opaque and more granular. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | sched/cputime: Convert kcpustat to nsecsFrederic Weisbecker2017-02-012-39/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kernel CPU stats are stored in cputime_t which is an architecture defined type, and hence a bit opaque and requiring accessors and mutators for any operation. Converting them to nsecs simplifies the code and is one step toward the removal of cputime_t in the core code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
* | | | Merge branch 'timers-core-for-linus' of ↵Linus Torvalds2017-02-201-2/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Nothing exciting, just the usual pile of fixes, updates and cleanups: - A bunch of clocksource driver updates - Removal of CONFIG_TIMER_STATS and the related /proc file - More posix timer slim down work - A scalability enhancement in the tick broadcast code - Math cleanups" * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) hrtimer: Catch invalid clockids again math64, tile: Fix build failure clocksource/drivers/arm_arch_timer:: Mark cyclecounter __ro_after_init timerfd: Protect the might cancel mechanism proper timer_list: Remove useless cast when printing time: Remove CONFIG_TIMER_STATS clocksource/drivers/arm_arch_timer: Work around Hisilicon erratum 161010101 clocksource/drivers/arm_arch_timer: Introduce generic errata handling infrastructure clocksource/drivers/arm_arch_timer: Remove fsl-a008585 parameter clocksource/drivers/arm_arch_timer: Add dt binding for hisilicon-161010101 erratum clocksource/drivers/ostm: Add renesas-ostm timer driver clocksource/drivers/ostm: Document renesas-ostm timer DT bindings clocksource/drivers/tcb_clksrc: Use 32 bit tcb as sched_clock clocksource/drivers/gemini: Add driver for the Cortina Gemini clocksource: add DT bindings for Cortina Gemini clockevents: Add a clkevt-of mechanism like clksrc-of tick/broadcast: Reduce lock cacheline contention timers: Omit POSIX timer stuff from task_struct when disabled x86/timer: Make delay() work during early bootup delay: Add explanation of udelay() inaccuracy ...
| * | | | timers: Omit POSIX timer stuff from task_struct when disabledNicolas Pitre2017-01-271-2/+2
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_POSIX_TIMERS is disabled, it is preferable to remove related structures from struct task_struct and struct signal_struct as they won't contain anything useful and shouldn't be relied upon by mistake. Code still referencing those structures is also disabled here. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: John Stultz <john.stultz@linaro.org>
* | | | mm: fix KPF_SWAPCACHE in /proc/kpageflagsHugh Dickins2017-02-071-1/+2
| |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6326fec1122c ("mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked") aliased PG_swapcache to PG_owner_priv_1 (and depending on PageSwapBacked being true). As a result, the KPF_SWAPCACHE bit in '/proc/kpageflags' should now be synthesized, instead of being shown on unrelated pages which just happen to have PG_owner_priv_1 set. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | proc: add a schedule point in proc_pid_readdir()Eric Dumazet2017-01-241-0/+2
| |/ |/| | | | | | | | | | | | | | | | | | | We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [akpm@linux-foundation.org: coding style fix] Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | sysctl: Drop reference added by grab_header in proc_sys_readdirZhou Chengming2017-01-101-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. It can cause any path called unregister_sysctl_table will wait forever. The calltrace of CVE-2016-9191: [ 5535.960522] Call Trace: [ 5535.963265] [<ffffffff817cdaaf>] schedule+0x3f/0xa0 [ 5535.968817] [<ffffffff817d33fb>] schedule_timeout+0x3db/0x6f0 [ 5535.975346] [<ffffffff817cf055>] ? wait_for_completion+0x45/0x130 [ 5535.982256] [<ffffffff817cf0d3>] wait_for_completion+0xc3/0x130 [ 5535.988972] [<ffffffff810d1fd0>] ? wake_up_q+0x80/0x80 [ 5535.994804] [<ffffffff8130de64>] drop_sysctl_table+0xc4/0xe0 [ 5536.001227] [<ffffffff8130de17>] drop_sysctl_table+0x77/0xe0 [ 5536.007648] [<ffffffff8130decd>] unregister_sysctl_table+0x4d/0xa0 [ 5536.014654] [<ffffffff8130deff>] unregister_sysctl_table+0x7f/0xa0 [ 5536.021657] [<ffffffff810f57f5>] unregister_sched_domain_sysctl+0x15/0x40 [ 5536.029344] [<ffffffff810d7704>] partition_sched_domains+0x44/0x450 [ 5536.036447] [<ffffffff817d0761>] ? __mutex_unlock_slowpath+0x111/0x1f0 [ 5536.043844] [<ffffffff81167684>] rebuild_sched_domains_locked+0x64/0xb0 [ 5536.051336] [<ffffffff8116789d>] update_flag+0x11d/0x210 [ 5536.057373] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.064186] [<ffffffff81167acb>] ? cpuset_css_offline+0x1b/0x60 [ 5536.070899] [<ffffffff810fce3d>] ? trace_hardirqs_on+0xd/0x10 [ 5536.077420] [<ffffffff817cf61f>] ? mutex_lock_nested+0x2df/0x450 [ 5536.084234] [<ffffffff8115a9f5>] ? css_killed_work_fn+0x25/0x220 [ 5536.091049] [<ffffffff81167ae5>] cpuset_css_offline+0x35/0x60 [ 5536.097571] [<ffffffff8115aa2c>] css_killed_work_fn+0x5c/0x220 [ 5536.104207] [<ffffffff810bc83f>] process_one_work+0x1df/0x710 [ 5536.110736] [<ffffffff810bc7c0>] ? process_one_work+0x160/0x710 [ 5536.117461] [<ffffffff810bce9b>] worker_thread+0x12b/0x4a0 [ 5536.123697] [<ffffffff810bcd70>] ? process_one_work+0x710/0x710 [ 5536.130426] [<ffffffff810c3f7e>] kthread+0xfe/0x120 [ 5536.135991] [<ffffffff817d4baf>] ret_from_fork+0x1f/0x40 [ 5536.142041] [<ffffffff810c3e80>] ? kthread_create_on_node+0x230/0x230 One cgroup maintainer mentioned that "cgroup is trying to offline a cpuset css, which takes place under cgroup_mutex. The offlining ends up trying to drain active usages of a sysctl table which apprently is not happening." The real reason is that proc_sys_readdir doesn't drop reference added by grab_header when return from !dir_emit_dots path. So this cpuset offline path will wait here forever. See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13 Fixes: f0c3b5093add ("[readdir] convert procfs") Cc: stable@vger.kernel.org Reported-by: CAI Qian <caiqian@redhat.com> Tested-by: Yang Shukui <yangshukui@huawei.com> Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2016-12-2412-12/+12
| | | | | | | | | | | | | This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge uncontroversial parts of branch 'readlink' of ↵Linus Torvalds2016-12-173-28/+0
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull partial readlink cleanups from Miklos Szeredi. This is the uncontroversial part of the readlink cleanup patch-set that simplifies the default readlink handling. Miklos and Al are still discussing the rest of the series. * git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: vfs: make generic_readlink() static vfs: remove ".readlink = generic_readlink" assignments vfs: default to generic_readlink() vfs: replace calling i_op->readlink with vfs_readlink() proc/self: use generic_readlink ecryptfs: use vfs_get_link() bad_inode: add missing i_op initializers
| * vfs: remove ".readlink = generic_readlink" assignmentsMiklos Szeredi2016-12-093-3/+0
| | | | | | | | | | | | | | | | | | | | | | If .readlink == NULL implies generic_readlink(). Generated by: to_del="\.readlink.*=.*generic_readlink" for i in `git grep -l $to_del`; do sed -i "/$to_del"/d $i; done Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * proc/self: use generic_readlinkMiklos Szeredi2016-12-092-27/+2
| | | | | | | | | | | | | | | | The /proc/self and /proc/self-thread symlinks have separate but identical functionality for reading and following. This cleanup utilizes generic_readlink to remove the duplication. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* | Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/auditLinus Torvalds2016-12-141-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit updates from Paul Moore: "After the small number of patches for v4.9, we've got a much bigger pile for v4.10. The bulk of these patches involve a rework of the audit backlog queue to enable us to move the netlink multicasting out of the task/thread that generates the audit record and into the kernel thread that emits the record (just like we do for the audit unicast to auditd). While we were playing with the backlog queue(s) we fixed a number of other little problems with the code, and from all the testing so far things look to be in much better shape now. Doing this also allowed us to re-enable disabling IRQs for some netns operations ("netns: avoid disabling irq for netns id"). The remaining patches fix some small problems that are well documented in the commit descriptions, as well as adding session ID filtering support" * 'stable-4.10' of git://git.infradead.org/users/pcmoore/audit: audit: use proper refcount locking on audit_sock netns: avoid disabling irq for netns id audit: don't ever sleep on a command record/message audit: handle a clean auditd shutdown with grace audit: wake up kauditd_thread after auditd registers audit: rework audit_log_start() audit: rework the audit queue handling audit: rename the queues and kauditd related functions audit: queue netlink multicast sends just like we do for unicast sends audit: fixup audit_init() audit: move kaudit thread start from auditd registration to kaudit init (#2) audit: add support for session ID user filter audit: fix formatting of AUDIT_CONFIG_CHANGE events audit: skip sessionid sentinel value when auto-incrementing audit: tame initialization warning len_abuf in audit_log_execve_info audit: less stack usage for /proc/*/loginuid
| * | audit: less stack usage for /proc/*/loginuidAlexey Dobriyan2016-11-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | %u requires 10 characters at most not 20. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | | Merge branch 'next' of ↵Linus Torvalds2016-12-144-21/+13
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates from James Morris: "Generally pretty quiet for this release. Highlights: Yama: - allow ptrace access for original parent after re-parenting TPM: - add documentation - many bugfixes & cleanups - define a generic open() method for ascii & bios measurements Integrity: - Harden against malformed xattrs SELinux: - bugfixes & cleanups Smack: - Remove unnecessary smack_known_invalid label - Do not apply star label in smack_setprocattr hook - parse mnt opts after privileges check (fixes unpriv DoS vuln)" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (56 commits) Yama: allow access for the current ptrace parent tpm: adjust return value of tpm_read_log tpm: vtpm_proxy: conditionally call tpm_chip_unregister tpm: Fix handling of missing event log tpm: Check the bios_dir entry for NULL before accessing it tpm: return -ENODEV if np is not set tpm: cleanup of printk error messages tpm: replace of_find_node_by_name() with dev of_node property tpm: redefine read_log() to handle ACPI/OF at runtime tpm: fix the missing .owner in tpm_bios_measurements_ops tpm: have event log use the tpm_chip tpm: drop tpm1_chip_register(/unregister) tpm: replace dynamically allocated bios_dir with a static array tpm: replace symbolic permission with octal for securityfs files char: tpm: fix kerneldoc tpm2_unseal_trusted name typo tpm_tis: Allow tpm_tis to be bound using DT tpm, tpm_vtpm_proxy: add kdoc comments for VTPM_PROXY_IOC_NEW_DEV tpm: Only call pm_runtime_get_sync if device has a parent tpm: define a generic open() method for ascii & bios measurements Documentation: tpm: add the Physical TPM device tree binding documentation ...
| * \ \ Merge branch 'stable-4.10' of git://git.infradead.org/users/pcmoore/selinux ↵James Morris2016-11-244-21/+13
| |\ \ \ | | |_|/ | |/| | | | | | into next
| | * | proc: Pass file mode to proc_pid_make_inodeAndreas Gruenbacher2016-11-144-21/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass the file mode of the proc inode to be created to proc_pid_make_inode. In proc_pid_make_inode, initialize inode->i_mode before calling security_task_to_inode. This allows selinux to set isec->sclass right away without introducing "half-initialized" inode security structs. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
* | | | Merge tag 'for-linus-4.10-rc0-tag' of ↵Linus Torvalds2016-12-132-1/+1
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: "Xen features and fixes for 4.10 These are some fixes, a move of some arm related headers to share them between arm and arm64 and a series introducing a helper to make code more readable. The most notable change is David stepping down as maintainer of the Xen hypervisor interface. This results in me sending you the pull requests for Xen related code from now on" * tag 'for-linus-4.10-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (29 commits) xen/balloon: Only mark a page as managed when it is released xenbus: fix deadlock on writes to /proc/xen/xenbus xen/scsifront: don't request a slot on the ring until request is ready xen/x86: Increase xen_e820_map to E820_X_MAX possible entries x86: Make E820_X_MAX unconditionally larger than E820MAX xen/pci: Bubble up error and fix description. xen: xenbus: set error code on failure xen: set error code on failures arm/xen: Use alloc_percpu rather than __alloc_percpu arm/arm64: xen: Move shared architecture headers to include/xen/arm xen/events: use xen_vcpu_id mapping for EVTCHNOP_status xen/gntdev: Use VM_MIXEDMAP instead of VM_IO to avoid NUMA balancing xen-scsifront: Add a missing call to kfree MAINTAINERS: update XEN HYPERVISOR INTERFACE xenfs: Use proc_create_mount_point() to create /proc/xen xen-platform: use builtin_pci_driver xen-netback: fix error handling output xen: make use of xenbus_read_unsigned() in xenbus xen: make use of xenbus_read_unsigned() in xen-pciback xen: make use of xenbus_read_unsigned() in xen-fbfront ...
| * | | | xenfs: Use proc_create_mount_point() to create /proc/xenSeth Forshee2016-11-172-1/+1
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mounting proc in user namespace containers fails if the xenbus filesystem is mounted on /proc/xen because this directory fails the "permanently empty" test. proc_create_mount_point() exists specifically to create such mountpoints in proc but is currently proc-internal. Export this interface to modules, then use it in xenbus when creating /proc/xen. Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com>
* | | | fs/proc: calculate /proc/* and /proc/*/task/* nlink at init timeAlexey Dobriyan2016-12-123-6/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Runtime nlink calculation works but meh. I don't know how to do it at compile time, but I know how to do it at init time. Shift "2+" part into init time as a bonus. Link: http://lkml.kernel.org/r/20161122195549.GB29812@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fs/proc/base.c: save decrement during lookup/readdir in /proc/$PIDAlexey Dobriyan2016-12-121-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Comparison for "<" works equally well as comparison for "<=" but one SUB/LEA is saved (no, it is not optimised away, at least here). Link: http://lkml.kernel.org/r/20161122195143.GA29812@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | fs/proc/array.c: slightly improve render_sigset_tRasmus Villemoes2016-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | format_decode and vsnprintf occasionally show up in perf top, so I went looking for places that might not need the full printf power. With the help of kprobes, I gathered some statistics on which format strings we mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25% of the time, so something apparently reads /proc/pid/status (which does 5*16 printf("%x") calls) a lot. With this patch, reading /proc/pid/status is 30% faster according to this microbenchmark: char buf[4096]; int i, fd; for (i = 0; i < 10000; ++i) { fd = open("/proc/self/status", O_RDONLY); read(fd, buf, sizeof(buf)); close(fd); } Link: http://lkml.kernel.org/r/1474410485-1305-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Acked-by: Andrei Vagin <avagin@openvz.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: tweak comments about 2 stage open and everythingAlexey Dobriyan2016-12-121-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some comments were obsoleted since commit 05c0ae21c034 ("try a saner locking for pde_opener..."). Some new comments added. Some confusing comments replaced with equally confusing ones. Link: http://lkml.kernel.org/r/20161029160231.GD1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: kmalloc struct pde_openerAlexey Dobriyan2016-12-121-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kzalloc is too much, half of the fields will be reinitialized anyway. If proc file doesn't have ->release hook (some still do not), clearing is unnecessary because it will be freed immediately. Link: http://lkml.kernel.org/r/20161029155747.GC1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: fix type of struct pde_opener::closing fieldAlexey Dobriyan2016-12-122-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct pde_opener::closing is boolean. Link: http://lkml.kernel.org/r/20161029155439.GB1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: just list_del() struct pde_openerAlexey Dobriyan2016-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | list_del_init() is too much, structure will be freed in three lines anyway. Link: http://lkml.kernel.org/r/20161029155313.GA1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: make struct struct map_files_info::len unsigned intAlexey Dobriyan2016-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linux doesn't support 4GB+ filenames in /proc, so unsigned long is too much. MOV r64, r/m64 is larger than MOV r32, r/m32. Link: http://lkml.kernel.org/r/20161029161123.GG1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: make struct pid_entry::len unsignedAlexey Dobriyan2016-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "unsigned int" is better on x86_64 because it most of the time it autoexpands to 64-bit value while "int" requires MOVSX instruction. Link: http://lkml.kernel.org/r/20161029160810.GF1246@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | proc: report no_new_privs stateKees Cook2016-12-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to being able to examine if a process has been correctly confined with seccomp, the state of no_new_privs is equally interesting, so this adds it to /proc/$pid/status. Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jann Horn <jann@thejh.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rodrigo Freire <rfreire@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Robert Ho <robert.hu@intel.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Richard W.M. Jones" <rjones@redhat.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | mm: add cond_resched() in gather_pte_stats()Hugh Dickins2016-12-121-0/+1
|/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The other pagetable walks in task_mmu.c have a cond_resched() after walking their ptes: add a cond_resched() in gather_pte_stats() too, for reading /proc/<id>/numa_maps. Only pagemap_pmd_range() has a cond_resched() in its (unusually expensive) pmd_trans_huge case: more should probably be added, but leave them unchanged for now. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | proc: fix NULL dereference when reading /proc/<pid>/auxvLeon Yu2016-10-271-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reading auxv of any kernel thread results in NULL pointer dereferencing in auxv_read() where mm can be NULL. Fix that by checking for NULL mm and bailing out early. This is also the original behavior changed by recent commit c5317167854e ("proc: switch auxv to use of __mem_open()"). # cat /proc/2/auxv Unable to handle kernel NULL pointer dereference at virtual address 000000a8 Internal error: Oops: 17 [#1] PREEMPT SMP ARM CPU: 3 PID: 113 Comm: cat Not tainted 4.9.0-rc1-ARCH+ #1 Hardware name: BCM2709 task: ea3b0b00 task.stack: e99b2000 PC is at auxv_read+0x24/0x4c LR is at do_readv_writev+0x2fc/0x37c Process cat (pid: 113, stack limit = 0xe99b2210) Call chain: auxv_read do_readv_writev vfs_readv default_file_splice_read splice_direct_to_actor do_splice_direct do_sendfile SyS_sendfile64 ret_fast_syscall Fixes: c5317167854e ("proc: switch auxv to use of __mem_open()") Link: http://lkml.kernel.org/r/1476966200-14457-1-git-send-email-chianglungyu@gmail.com Signed-off-by: Leon Yu <chianglungyu@gmail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Janis Danisevskis <jdanis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | proc: don't use FOLL_FORCE for reading cmdline and environmentLinus Torvalds2016-10-241-10/+8
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that Lorenzo cleaned things up and made the FOLL_FORCE users explicit, it becomes obvious how some of them don't really need FOLL_FORCE at all. So remove FOLL_FORCE from the proc code that reads the command line and arguments from user space. The mem_rw() function actually does want FOLL_FORCE, because gdd (and possibly many other debuggers) use it as a much more convenient version of PTRACE_PEEKDATA, but we should consider making the FOLL_FORCE part conditional on actually being a ptracer. This does not actually do that, just moves adds a comment to that effect and moves the gup_flags settings next to each other. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'mm-urgent-for-linus' of ↵Linus Torvalds2016-10-223-41/+25
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull vmap stack fixes from Ingo Molnar: "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack accesses that used to be just somewhat questionable are now totally buggy. These changes try to do it without breaking the ABI: the fields are left there, they are just reporting zero, or reporting narrower information (the maps file change)" * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm: Change vm_is_stack_for_task() to vm_is_stack_for_current() fs/proc: Stop trying to report thread stacks fs/proc: Stop reporting eip and esp in /proc/PID/stat mm/numa: Remove duplicated include from mprotect.c
| * | fs/proc: Stop trying to report thread stacksAndy Lutomirski2016-10-202-37/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts more of: b76437579d13 ("procfs: mark thread stack correctly in proc/<pid>/maps") ... which was partially reverted by: 65376df58217 ("proc: revert /proc/<pid>/maps [stack:TID] annotation") Originally, /proc/PID/task/TID/maps was the same as /proc/TID/maps. In current kernels, /proc/PID/maps (or /proc/TID/maps even for threads) shows "[stack]" for VMAs in the mm's stack address range. In contrast, /proc/PID/task/TID/maps uses KSTK_ESP to guess the target thread's stack's VMA. This is racy, probably returns garbage and, on arches with CONFIG_TASK_INFO_IN_THREAD=y, is also crash-prone: KSTK_ESP is not safe to use on tasks that aren't known to be running ordinary process-context kernel code. This patch removes the difference and just shows "[stack]" for VMAs in the mm's stack range. This is IMO much more sensible -- the actual "stack" address really is treated specially by the VM code, and the current thread stack isn't even well-defined for programs that frequently switch stacks on their own. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux API <linux-api@vger.kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tycho Andersen <tycho.andersen@canonical.com> Link: http://lkml.kernel.org/r/3e678474ec14e0a0ec34c611016753eea2e1b8ba.1475257877.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
OpenPOWER on IntegriCloud