summaryrefslogtreecommitdiffstats
path: root/fs/proc
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2011-12-141-5/+3
|\ | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs/ncpfs: fix error paths and goto statements in ncp_fill_super() configfs: register_filesystem() called too early fuse: register_filesystem() called too early ubifs: too early register_filesystem() ... and the same kind of leak for mqueue procfs: fix a vfsmount longterm reference leak
| * procfs: fix a vfsmount longterm reference leakAl Viro2011-12-091-5/+3
| | | | | | | | | | | | kern_mount() doesn't pair with plain mntput()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | procfs: do not overflow get_{idle,iowait}_time for nohzMichal Hocko2011-12-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit a25cac5198d4 ("proc: Consider NO_HZ when printing idle and iowait times") we are reporting idle/io_wait time also while a CPU is tickless. We rely on get_{idle,iowait}_time functions to retrieve proper data. These functions, however, use usecs_to_cputime to translate micro seconds time to cputime64_t. This is just an alias to usecs_to_jiffies which reduces the data type from u64 to unsigned int and also checks whether the given parameter overflows jiffies_to_usecs(MAX_JIFFY_OFFSET) and returns MAX_JIFFY_OFFSET in that case. When we overflow depends on CONFIG_HZ but especially for CONFIG_HZ_300 it is quite low (1431649781) so we are getting MAX_JIFFY_OFFSET for >3000s! until we overflow unsigned int. Just for reference CONFIG_HZ_100 has an overflow window around 20s, CONFIG_HZ_250 ~8s and CONFIG_HZ_1000 ~2s. This results in a bug when people saw [h]top going mad reporting 100% CPU usage even though there was basically no CPU load. The reason was simply that /proc/stat stopped reporting idle/io_wait changes (and reported MAX_JIFFY_OFFSET) and so the only change happening was for user system time. Let's use nsecs_to_jiffies64 instead which doesn't reduce the precision to 32b type and it is much more appropriate for cumulative time values (unlike usecs_to_jiffies which intended for timeout calculations). Signed-off-by: Michal Hocko <mhocko@suse.cz> Tested-by: Artem S. Tashkinov <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fs/proc/meminfo.c: fix compilation errorClaudio Scordino2011-12-091-3/+4
|/ | | | | | | | | | | Fix the error message "directives may not be used inside a macro argument" which appears when the kernel is compiled for the cris architecture. Signed-off-by: Claudio Scordino <claudio@evidence.eu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "proc: fix races against execve() of /proc/PID/fd**"Linus Torvalds2011-11-091-103/+43
| | | | | | | | | | | | | This reverts commit aa6afca5bcaba8101f3ea09d5c3e4100b2b9f0e5. It escalates of some of the google-chrome SELinux problems with ptrace ("Check failed: pid_ > 0. Did not find zygote process"), and Andrew says that it is also causing mystery lockdep reports. Reported-by: Alex Villacís Lasso <a_villacis@palosanto.com> Requested-by: James Morris <jmorris@namei.org> Requested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds2011-11-061-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
| * fs: add export.h to files using EXPORT_SYMBOL/THIS_MODULE macrosPaul Gortmaker2011-10-311-0/+1
| | | | | | | | | | | | | | | | | | | | These files were getting <linux/module.h> via an implicit include path, but we want to crush those out of existence since they cost time during compiles of processing thousands of lines of headers for no reason. Give them the lightweight header that just contains the EXPORT_SYMBOL infrastructure. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* | Merge branch 'akpm' (Andrew's incoming - part two)Linus Torvalds2011-11-022-43/+149
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Says Andrew: "60 patches. That's good enough for -rc1 I guess. I have quite a lot of detritus to be rechecked, work through maintainers, etc. - most of the remains of MM - rtc - various misc - cgroups - memcg - cpusets - procfs - ipc - rapidio - sysctl - pps - w1 - drivers/misc - aio" * akpm: (60 commits) memcg: replace ss->id_lock with a rwlock aio: allocate kiocbs in batches drivers/misc/vmw_balloon.c: fix typo in code comment drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop w1: disable irqs in critical section drivers/w1/w1_int.c: multiple masters used same init_name drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal drivers/power/ds2780_battery.c: add a nolock function to w1 interface drivers/power/ds2780_battery.c: create central point for calling w1 interface w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it pps gpio client: add missing dependency pps: new client driver using GPIO pps: default echo function include/linux/dma-mapping.h: add dma_zalloc_coherent() sysctl: make CONFIG_SYSCTL_SYSCALL default to n sysctl: add support for poll() RapidIO: documentation update drivers/net/rionet.c: fix ethernet address macros for LE platforms RapidIO: fix potential null deref in rio_setup_device() RapidIO: add mport driver for Tsi721 bridge ...
| * | sysctl: add support for poll()Lucas De Marchi2011-11-021-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding support for poll() in sysctl fs allows userspace to receive notifications of changes in sysctl entries. This adds a infrastructure to allow files in sysctl fs to be pollable and implements it for hostname and domainname. [akpm@linux-foundation.org: s/declare/define/ for definitions] Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | proc: fix races against execve() of /proc/PID/fd**Vasiliy Kulikov2011-11-021-43/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fd* files are restricted to the task's owner, and other users may not get direct access to them. But one may open any of these files and run any setuid program, keeping opened file descriptors. As there are permission checks on open(), but not on readdir() and read(), operations on the kept file descriptors will not be checked. It makes it possible to violate procfs permission model. Reading fdinfo/* may disclosure current fds' position and flags, reading directory contents of fdinfo/ and fd/ may disclosure the number of opened files by the target task. This information is not sensible per se, but it can reveal some private information (like length of a password stored in a file) under certain conditions. Used existing (un)lock_trace functions to check for ptrace_may_access(), but instead of using EPERM return code from it use EACCES to be consistent with existing proc_pid_follow_link()/proc_pid_readlink() return code. If they differ, attacker can guess what fds exist by analyzing stat() return code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*, readdir() and lookup() for fd/ and fdinfo/. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | procfs: report EISDIR when reading sysctl dirs in procPavel Emelyanov2011-11-021-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On reading sysctl dirs we should return -EISDIR instead of -EINVAL. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | filesystems: add set_nlink()Miklos Szeredi2011-11-023-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | filesystems: add missing nlink wrappersMiklos Szeredi2011-11-021-1/+1
|/ / | | | | | | | | | | | | Replace direct i_nlink updates with the respective updater function (inc_nlink, drop_nlink, clear_nlink, inode_dec_link_count). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | mm: distinguish between mlocked and pinned pagesChristoph Lameter2011-10-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some kernel components pin user space memory (infiniband and perf) (by increasing the page count) and account that memory as "mlocked". The difference between mlocking and pinning is: A. mlocked pages are marked with PG_mlocked and are exempt from swapping. Page migration may move them around though. They are kept on a special LRU list. B. Pinned pages cannot be moved because something needs to directly access physical memory. They may not be on any LRU list. I recently saw an mlockalled process where mm->locked_vm became bigger than the virtual size of the process (!) because some memory was accounted for twice: Once when the page was mlocked and once when the Infiniband layer increased the refcount because it needt to pin the RDMA memory. This patch introduces a separate counter for pinned pages and accounts them seperately. Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Mike Marciniszyn <infinipath@qlogic.com> Cc: Roland Dreier <roland@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | oom: remove oom_disable_countDavid Rientjes2011-10-311-13/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This removes mm->oom_disable_count entirely since it's unnecessary and currently buggy. The counter was intended to be per-process but it's currently decremented in the exit path for each thread that exits, causing it to underflow. The count was originally intended to prevent oom killing threads that share memory with threads that cannot be killed since it doesn't lead to future memory freeing. The counter could be fixed to represent all threads sharing the same mm, but it's better to remove the count since: - it is possible that the OOM_DISABLE thread sharing memory with the victim is waiting on that thread to exit and will actually cause future memory freeing, and - there is no guarantee that a thread is disabled from oom killing just because another thread sharing its mm is oom disabled. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | /proc/self/numa_maps: restore "huge" tag for hugetlb vmasAndrew Morton2011-10-311-0/+3
|/ | | | | | | | | | | | | | | | | | | The display of the "huge" tag was accidentally removed in 29ea2f698 ("mm: use walk_page_range() instead of custom page table walking code"). Reported-by: Stephen Hemminger <shemminger@vyatta.com> Tested-by: Stephen Hemminger <shemminger@vyatta.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'timers-core-for-linus' of ↵Linus Torvalds2011-10-261-7/+34
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) time, s390: Get rid of compile warning dw_apb_timer: constify clocksource name time: Cleanup old CONFIG_GENERIC_TIME references that snuck in time: Change jiffies_to_clock_t() argument type to unsigned long alarmtimers: Fix error handling clocksource: Make watchdog reset lockless posix-cpu-timers: Cure SMP accounting oddities s390: Use direct ktime path for s390 clockevent device clockevents: Add direct ktime programming function clockevents: Make minimum delay adjustments configurable nohz: Remove "Switched to NOHz mode" debugging messages proc: Consider NO_HZ when printing idle and iowait times nohz: Make idle/iowait counter update conditional nohz: Fix update_ts_time_stat idle accounting cputime: Clean up cputime_to_usecs and usecs_to_cputime macros alarmtimers: Rework RTC device selection using class interface alarmtimers: Add try_to_cancel functionality alarmtimers: Add more refined alarm state tracking alarmtimers: Remove period from alarm structure alarmtimers: Remove interval cap limit hack ...
| * proc: Consider NO_HZ when printing idle and iowait timesMichal Hocko2011-09-081-7/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | show_stat handler of the /proc/stat file relies on kstat_cpu(cpu) statistics when priting information about idle and iowait times. This is OK if we are not using tickless kernel (CONFIG_NO_HZ) because counters are updated periodically. With NO_HZ things got more tricky because we are not doing idle/iowait accounting while we are tickless so the value might get outdated. Users of /proc/stat will notice that by unchanged idle/iowait values which is then interpreted as 0% idle/iowait time. From the user space POV this is an unexpected behavior and a change of the interface. Let's fix this by using get_cpu_{idle,iowait}_time_us which accounts the total idle/iowait time since boot and it doesn't rely on sampling or any other periodic activity. Fall back to the previous behavior if NO_HZ is disabled or not configured. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Jones <davej@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Link: http://lkml.kernel.org/r/39181366adac1b39cb6aa3cd53ff0f7c78d32676.1314172057.git.mhocko@suse.cz Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | teach /proc/$pid/numa_maps about transparent hugepagesDave Hansen2011-09-211-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This is modeled after the smaps code. It detects transparent hugepages and then does a single gather_stats() for the page as a whole. This has two benifits: 1. It is more efficient since it does many pages in a single shot. 2. It does not have to break down the huge page. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | break out numa_maps gather_pte_stats() checksDave Hansen2011-09-211-15/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | gather_pte_stats() does a number of checks on a target page to see whether it should even be considered for statistics. This breaks that code out in to a separate function so that we can use it in the transparent hugepage case in the next patch. Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | make /proc/$pid/numa_maps gather_stats() take variable page sizeDave Hansen2011-09-211-10/+11
|/ | | | | | | | | | | | | | | | | | | | | | | | | We need to teach the numa_maps code about transparent huge pages. The first step is to teach gather_stats() that the pte it is dealing with might represent more than one page. Note that will we use this in a moment for transparent huge pages since they have use a single pmd_t which _acts_ as a "surrogate" for a bunch of smaller pte_t's. I'm a _bit_ unhappy that this interface counts in hugetlbfs page sizes for hugetlbfs pages and PAGE_SIZE for normal pages. That means that to figure out how many _bytes_ "dirty=1" means, you must first know the hugetlbfs page size. That's easier said than done especially if you don't have visibility in to the mount. But, that's probably a discussion for another day especially since it would change behavior to fix it. But, just in case anyone wonders why this patch only passes a '1' in the hugetlb case... Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: show O_CLOEXE bit properly in /proc/<pid>/fdinfo/<fd> filesLinus Torvalds2011-08-061-1/+9
| | | | | | | | | | | | | | | | | | | | | The CLOEXE bit is magical, and for performance (and semantic) reasons we don't actually maintain it in the file descriptor itself, but in a separate bit array. Which means that when we show f_flags, the CLOEXE status is shown incorrectly: we show the status not as it is now, but as it was when the file was opened. Fix that by looking up the bit properly in the 'fdt->close_on_exec' bit array. Uli needs this in order to re-implement the pfiles program: "For normal file descriptors (not sockets) this was the last piece of information which wasn't available. This is all part of my 'give Solaris users no reason to not switch' effort. I intend to offer the code to the util-linux-ng maintainers." Requested-by: Ulrich Drepper <drepper@akkadia.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom_ajd: don't use WARN_ONCE, just use printk_onceLinus Torvalds2011-08-061-1/+1
| | | | | | | | | | | WARN_ONCE() is very annoying, in that it shows the stack trace that we don't care about at all, and also triggers various user-level "kernel oopsed" logic that we really don't care about. And it's not like the user can do anything about the applications (sshd) in question, it's a distro issue. Requested-by: Andi Kleen <andi@firstfloor.org> (and many others) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: make struct proc_dir_entry::name a terminal array rather than a pointerDavid Howells2011-07-273-5/+4
| | | | | | | | | | | | | | | | | | | | | | Since __proc_create() appends the name it is given to the end of the PDE structure that it allocates, there isn't a need to store a name pointer. Instead we can just replace the name pointer with a terminal char array of _unspecified_ length. The compiler will simply append the string to statically defined variables of PDE type overlapping any hole at the end of the structure and, unlike specifying an explicitly _zero_ length array, won't give a warning if you try to statically initialise it with a string of more than zero length. Also, whilst we're at it: (1) Move namelen to end just prior to name and reduce it to a single byte (name shouldn't be longer than NAME_MAX). (2) Move pde_unload_lock two places further on so that if it's four bytes in size on a 64-bit machine, it won't cause an unused hole in the PDE struct. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* atomic: use <linux/atomic.h>Arun Sharma2011-07-261-1/+1
| | | | | | | | | | | | | | This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* proc: fix a race in do_io_accounting()Vasiliy Kulikov2011-07-261-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | If an inode's mode permits opening /proc/PID/io and the resulting file descriptor is kept across execve() of a setuid or similar binary, the ptrace_may_access() check tries to prevent using this fd against the task with escalated privileges. Unfortunately, there is a race in the check against execve(). If execve() is processed after the ptrace check, but before the actual io information gathering, io statistics will be gathered from the privileged process. At least in theory this might lead to gathering sensible information (like ssh/ftp password length) that wouldn't be available otherwise. Holding task->signal->cred_guard_mutex while gathering the io information should protect against the race. The order of locking is similar to the one inside of ptrace_attach(): first goes cred_guard_mutex, then lock_task_sighand(). Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* procfs: return ENOENT on opening a being-removed proc entryDaisuke Ogino2011-07-261-1/+1
| | | | | | | | | | | | | | Change the return value to ENOENT. This return value is then returned when opening the proc entry that have been removed. For example, open("/proc/bus/pci/XX/YY") when the corresponding device is being hot-removed. Signed-off-by: Daisuke Ogino <ogino.daisuke@jp.fujitsu.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* oom: make deprecated use of oom_adj more verboseDavid Rientjes2011-07-251-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | /proc/pid/oom_adj is deprecated and scheduled for removal in August 2012 according to Documentation/feature-removal-schedule.txt. This patch makes the warning more verbose by making it appear as a more serious problem (the presence of a stack trace and being multiline should attract more attention) so that applications still using the old interface can get fixed. Very popular users of the old interface have been converted since the oom killer rewrite has been introduced. udevd switched to the /proc/pid/oom_score_adj interface for v162, kde switched in 4.6.1, and opensshd switched in 5.7p1. At the start of 2012, this should be changed into a WARN() to emit all such incidents and then finally remove the tunable in August 2012 as scheduled. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2011-07-222-5/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits) vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp isofs: Remove global fs lock jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory fix IN_DELETE_SELF on overwriting rename() on ramfs et.al. mm/truncate.c: fix build for CONFIG_BLOCK not enabled fs:update the NOTE of the file_operations structure Remove dead code in dget_parent() AFS: Fix silly characters in a comment switch d_add_ci() to d_splice_alias() in "found negative" case as well simplify gfs2_lookup() jfs_lookup(): don't bother with . or .. get rid of useless dget_parent() in btrfs rename() and link() get rid of useless dget_parent() in fs/btrfs/ioctl.c fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers drivers: fix up various ->llseek() implementations fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek Ext4: handle SEEK_HOLE/SEEK_DATA generically Btrfs: implement our own ->llseek fs: add SEEK_HOLE and SEEK_DATA flags reiserfs: make reiserfs default to barrier=flush ... Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new shrinker callout for the inode cache, that clashed with the xfs code to start the periodic workers later.
| * fs: seq_file - add event counter to simplify poll() supportKay Sievers2011-07-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving the event counter into the dynamically allocated 'struc seq_file' allows poll() support without the need to allocate its own tracking structure. All current users are switched over to use the new counter. Requested-by: Andrew Morton akpm@linux-foundation.org Acked-by: NeilBrown <neilb@suse.de> Tested-by: Lucas De Marchi lucas.demarchi@profusion.mobi Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to ->permission()Al Viro2011-07-202-2/+2
| | | | | | | | | | | | not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: don't pass flags to generic_permission()Al Viro2011-07-201-1/+1
| | | | | | | | | | | | | | redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ->permission() sanitizing: MAY_NOT_BLOCKAl Viro2011-07-201-1/+1
| | | | | | | | | | | | Duplicate the flags argument into mask bitmap. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill check_acl callback of generic_permission()Al Viro2011-07-201-1/+1
| | | | | | | | | | | | | | its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/miscLinus Torvalds2011-07-222-2/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits) ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction connector: add an event for monitoring process tracers ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() ptrace_init_task: initialize child->jobctl explicitly has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/ ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/ ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented() ptrace: ptrace_reparented() should check same_thread_group() redefine thread_group_leader() as exit_signal >= 0 do not change dead_task->exit_signal kill task_detached() reparent_leader: check EXIT_DEAD instead of task_detached() make do_notify_parent() __must_check, update the callers __ptrace_detach: avoid task_detached(), check do_notify_parent() kill tracehook_notify_death() make do_notify_parent() return bool ptrace: s/tracehook_tracer_task()/ptrace_parent()/ ...
| * ptrace: s/tracehook_tracer_task()/ptrace_parent()/Tejun Heo2011-06-222-2/+2
| | | | | | | | | | | | | | | | | | | | | | tracehook.h is on the way out. Rename tracehook_tracer_task() to ptrace_parent() and move it from tracehook.h to ptrace.h. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: John Johansen <john.johansen@canonical.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Oleg Nesterov <oleg@redhat.com>
* | proc: restrict access to /proc/PID/ioVasiliy Kulikov2011-06-281-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | /proc/PID/io may be used for gathering private information. E.g. for openssh and vsftpd daemons wchars/rchars may be used to learn the precise password length. Restrict it to processes being able to ptrace the target process. ptrace_may_access() is needed to prevent keeping open file descriptor of "io" file, executing setuid binary and gathering io information of the setuid'ed process. Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc_sys_permission() is OK in RCU modeAl Viro2011-06-201-3/+0
| | | | | | | | | | | | | | | | nothing blocking there, since all instances of sysctl ->permissions() method are non-blocking - both of them, that is. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | proc_fd_permission() is doesn't need to bail out in RCU modeAl Viro2011-06-201-5/+1
| | | | | | | | | | | | nothing blocking except generic_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfdLinus Torvalds2011-06-161-3/+6
|\ \ | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd: proc: Fix Oops on stat of /proc/<zombie pid>/ns/net
| * | proc: Fix Oops on stat of /proc/<zombie pid>/ns/netEric W. Biederman2011-06-151-3/+6
| |/ | | | | | | | | | | | | | | | | | | Don't call iput with the inode half setup to be a namespace filedescriptor. Instead rearrange the code so that we don't initialize ei->ns_ops until after I ns_ops->get succeeds, preventing us from invoking ns_ops->put when ns_ops->get failed. Reported-by: Ingo Saitz <Ingo.Saitz@stud.uni-hannover.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
* | fix leak in proc_set_super()Al Viro2011-06-121-5/+6
|/ | | | | | set_anon_super() can fail... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds2011-05-291-0/+9
|\ | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: arch/tile: more /proc and /sys file support
| * arch/tile: more /proc and /sys file supportChris Metcalf2011-05-271-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change introduces a few of the less controversial /proc and /proc/sys interfaces for tile, along with sysfs attributes for various things that were originally proposed as /proc/tile files. It also adjusts the "hardwall" proc API. Arnd Bergmann reviewed the initial arch/tile submission, which included a complete set of all the /proc/tile and /proc/sys/tile knobs that we had added in a somewhat ad hoc way during initial development, and provided feedback on where most of them should go. One knob turned out to be similar enough to the existing /proc/sys/debug/exception-trace that it was re-implemented to use that model instead. Another knob was /proc/tile/grid, which reported the "grid" dimensions of a tile chip (e.g. 8x8 processors = 64-core chip). Arnd suggested looking at sysfs for that, so this change moves that information to a pair of sysfs attributes (chip_width and chip_height) in the /sys/devices/system/cpu directory. We also put the "chip_serial" and "chip_revision" information from our old /proc/tile/board file as attributes in /sys/devices/system/cpu. Other information collected via hypervisor APIs is now placed in /sys/hypervisor. We create a /sys/hypervisor/type file (holding the constant string "tilera") to be parallel with the Xen use of /sys/hypervisor/type holding "xen". We create three top-level files, "version" (the hypervisor's own version), "config_version" (the version of the configuration file), and "hvconfig" (the contents of the configuration file). The remaining information from our old /proc/tile/board and /proc/tile/switch files becomes an attribute group appearing under /sys/hypervisor/board/. Finally, after some feedback from Arnd Bergmann for the previous version of this patch, the /proc/tile/hardwall file is split up into two conceptual parts. First, a directory /proc/tile/hardwall/ which contains one file per active hardwall, each file named after the hardwall's ID and holding a cpulist that says which cpus are enclosed by the hardwall. Second, a /proc/PID file "hardwall" that is either empty (for non-hardwall-using processes) or contains the hardwall ID. Finally, this change pushes the /proc/sys/tile/unaligned_fixup/ directory, with knobs controlling the kernel code for handling the fixup of unaligned exceptions. Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
* | fs/proc/vmcore.c: add hook to read_from_oldmem() to check for non-ram pagesOlaf Hering2011-05-261-3/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The balloon driver in a Xen guest frees guest pages and marks them as mmio. When the kernel crashes and the crash kernel attempts to read the oldmem via /proc/vmcore a read from ballooned pages will generate 100% load in dom0 because Xen asks qemu-dm for the page content. Since the reads come in as 8byte requests each ballooned page is tried 512 times. With this change a hook can be registered which checks wether the given pfn is really ram. The hook has to return a value > 0 for ram pages, a value < 0 on error (because the hypercall is not known) and 0 for non-ram pages. This will reduce the time to read /proc/vmcore. Without this change a 512M guest with 128M crashkernel region needs 200 seconds to read it, with this change it takes just 2 seconds. Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: fix pagemap_read() error caseKOSAKI Motohiro2011-05-261-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, pagemap_read() has three error and/or corner case handling mistake. (1) If ppos parameter is wrong, mm refcount will be leak. (2) If count parameter is 0, mm refcount will be leak too. (3) If the current task is sleeping in kmalloc() and the system is out of memory and oom-killer kill the proc associated task, mm_refcount prevent the task free its memory. then system may hang up. <Quote Hugh's explain why we shold call kmalloc() before get_mm()> check_mem_permission gets a reference to the mm. If we __get_free_page after check_mem_permission, imagine what happens if the system is out of memory, and the mm we're looking at is selected for killing by the OOM killer: while we wait in __get_free_page for more memory, no memory is freed from the selected mm because it cannot reach exit_mmap while we hold that reference. This patch fixes the above three. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jovi Zhang <bookjovi@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Stephen Wilson <wilsons@start.ca> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: put check_mem_permission after __get_free_page in mem_writeKOSAKI Motohiro2011-05-261-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It whould be better if put check_mem_permission after __get_free_page in mem_write, to be same as function mem_read. Hugh Dickins explained the reason. check_mem_permission gets a reference to the mm. If we __get_free_page after check_mem_permission, imagine what happens if the system is out of memory, and the mm we're looking at is selected for killing by the OOM killer: while we wait in __get_free_page for more memory, no memory is freed from the selected mm because it cannot reach exit_mmap while we hold that reference. Reported-by: Jovi Zhang <bookjovi@gmail.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Stephen Wilson <wilsons@start.ca> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc/stat: use defined macro KMALLOC_MAX_SIZEYuanhan Liu2011-05-261-3/+3
| | | | | | | | | | | | | | | | | | | | There is a macro for the max size kmalloc can allocate, so use it instead of a hardcoded number. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | proc: constify status arrayMike Frysinger2011-05-261-2/+2
| | | | | | | | | | | | | | | | | | No need for this local array to be writable, so mark it const. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fs/proc: convert to kstrtoX()Alexey Dobriyan2011-05-262-11/+13
| | | | | | | | | | | | | | | | Convert fs/proc/ from strict_strto*() to kstrto*() functions. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud