op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	maps4: make page monitoring /proc file optional	Matt Mackall	2008-02-05	3	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make /proc/ page monitoring configurable This puts the following files under an embedded config option: /proc/pid/clear_refs /proc/pid/smaps /proc/pid/pagemap /proc/kpagecount /proc/kpageflags [akpm@linux-foundation.org: Kconfig fix] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: add /proc/kpageflags interface	Matt Mackall	2008-02-05	1	-1/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes a subset of physical page flags available to userspace. Together with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors. Exported flags are decoupled from the kernel's internal flags. This allows us to reorder flag bits, and synthesize any bits that get redefined in terms of other bits. [akpm@linux-foundation.org: remove unneeded access_ok()] [akpm@linux-foundation.org: s/0/NULL/] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: add /proc/kpagecount interface	Matt Mackall	2008-02-05	1	-0/+50
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This makes physical page map counts available to userspace. Together with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to monitor memory usage on a per-page basis. [akpm@linux-foundation.org: remove unneeded access_ok()] [bunk@stusta.de: make struct proc_kpagemap static] Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: add /proc/pid/pagemap interface	Matt Mackall	2008-02-05	3	-2/+204
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This interface provides a mapping for each page in an address space to its physical page frame number, allowing precise determination of what pages are mapped and what pages are shared between processes. New in this version: - headers gone again (as recommended by Dave Hansen and Alan Cox) - 64-bit entries (as per discussion with Andi Kleen) - swap pte information exported (from Dave Hansen) - page walker callback for holes (from Dave Hansen) - direct put_user I/O (as suggested by Rusty Russell) This patch folds in cleanups and swap PTE support from Dave Hansen <haveblue@us.ibm.com>. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: regroup task_mmu by interface	Matt Mackall	2008-02-05	1	-207/+210
\| \| \| \| \| \| \| \| \| \| \|	Reorder source so that all the code and data for each interface is together. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: move clear_refs code to task_mmu.c	Matt Mackall	2008-02-05	3	-53/+37
\| \| \| \| \| \| \| \| \| \| \| \|	This puts all the clear_refs code where it belongs and probably lets things compile on MMU-less systems as well. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: simplify interdependence of maps and smaps	Matt Mackall	2008-02-05	1	-26/+26
\| \| \| \| \| \| \| \| \| \| \| \|	This pulls the shared map display code out of show_map and puts it in show_smap where it belongs. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: use pagewalker in clear_refs and smaps	Matt Mackall	2008-02-05	1	-78/+17
\| \| \| \| \| \| \| \| \|	Use the generic pagewalker for smaps and clear_refs Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	maps4: add proportional set size accounting in smaps	Fengguang Wu	2008-02-05	1	-1/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it. So if a process has 1000 pages all to itself, and 1000 shared with one other process, its PSS will be 1500. - lwn.net: "ELC: How much memory are applications really using?" The PSS proposed by Matt Mackall is a very nice metic for measuring an process's memory footprint. So collect and export it via /proc/<pid>/smaps. Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the job. They are comprehensive tools. But for PSS, let's do it in the simple way. Cc: John Berthels <jjberthels@gmail.com> Cc: Bernardo Innocenti <bernie@codewiz.org> Cc: Padraig Brady <P@draigBrady.com> Cc: Denys Vlasenko <vda.linux@googlemail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Hugh Dickins <hugh@veritas.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	hugetlb: allow sticky directory mount option	Ken Chen	2008-02-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow sticky directory mount option for hugetlbfs. This allows admin to create a shared hugetlbfs mount point for multiple users, while prevent accidental file deletion that users may step on each other. It is similiar to default tmpfs mount option, or typical option used on /tmp. Signed-off-by: Ken Chen <kenchen@google.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <hermes@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	bufferhead: revert constructor removal	Christoph Lameter	2008-02-05	1	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The constructor for buffer_head slabs was removed recently. We need the constructor back in slab defrag in order to insure that slab objects always have a definite state even before we allocated them. I think we mistakenly merged the removal of the constuctor into a cleanup patch. You (ie: akpm) had a test that showed that the removal of the constructor led to a small regression. The prior state makes things easier for slab defrag. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	is_vmalloc_addr(): Check if an address is within the vmalloc boundaries	Christoph Lameter	2008-02-05	4	-7/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Checking if an address is a vmalloc address is done in a couple of places. Define a common version in mm.h and replace the other checks. Again the include structures suck. The definition of VMALLOC_START and VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included there. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user	Christoph Lameter	2008-02-05	19	-96/+71
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Simplify page cache zeroing of segments of pages through 3 functions zero_user_segments(page, start1, end1, start2, end2) Zeros two segments of the page. It takes the position where to start and end the zeroing which avoids length calculations and makes code clearer. zero_user_segment(page, start, end) Same for a single segment. zero_user(page, start, length) Length variant for the case where we know the length. We remove the zero_user_page macro. Issues: 1. Its a macro. Inline functions are preferable. 2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this special case everywhere makes the code needlessly complex. The parameter for zeroing is always KM_USER0 except in one single case that we open code. Avoiding KM_USER0 makes a lot of code not having to be dealing with the special casing for HIGHMEM anymore. Dealing with kmap is only necessary for HIGHMEM configurations. In those configurations we use KM_USER0 like we do for a series of other functions defined in highmem.h. Since KM_USER0 is depends on HIGHMEM the existing zero_user_page function could not be a macro. zero_user_* functions introduced here can be be inline because that constant is not used when these functions are called. Also extract the flushing of the caches to be outside of the kmap. [akpm@linux-foundation.org: fix nfs and ntfs build] [akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	timerfd: new timerfd API	Davide Libenzi	2008-02-05	2	-74/+165
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is the new timerfd API as it is implemented by the following patch: int timerfd_create(int clockid, int flags); int timerfd_settime(int ufd, int flags, const struct itimerspec utmr, struct itimerspec otmr); int timerfd_gettime(int ufd, struct itimerspec *otmr); The timerfd_create() API creates an un-programmed timerfd fd. The "clockid" parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME. The timerfd_settime() API give new settings by the timerfd fd, by optionally retrieving the previous expiration time (in case the "otmr" parameter is not NULL). The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit is set in the "flags" parameter. Otherwise it's a relative time. The timerfd_gettime() API returns the next expiration time of the timer, or {0, 0} if the timerfd has not been set yet. Like the previous timerfd API implementation, read(2) and poll(2) are supported (with the same interface). Here's a simple test program I used to exercise the new timerfd APIs: http://www.xmailserver.org/timerfd-test2.c [akpm@linux-foundation.org: coding-style cleanups] [akpm@linux-foundation.org: fix ia64 build] [akpm@linux-foundation.org: fix m68k build] [akpm@linux-foundation.org: fix mips build] [akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds] [heiko.carstens@de.ibm.com: fix s390] [akpm@linux-foundation.org: fix powerpc build] [akpm@linux-foundation.org: fix sparc64 more] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	exec: rework the group exit and fix the race with kill	Oleg Nesterov	2008-02-05	1	-9/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As Roland pointed out, we have the very old problem with exec. de_thread() sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then clears signal->flags. All signals (even fatal ones) sent in this window (which is not too small) will be lost. With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(), the new helper, should be used to detect exit_group() or exec() in progress. It can have more users, but this patch does only strictly necessary changes. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Robin Holt <holt@sgi.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	get_task_comm(): return the result	Andrew Morton	2008-02-05	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	It was dumb to make get_task_comm() return void. Change it to return a pointer to the resulting output for caller convenience. Cc: Ulrich Drepper <drepper@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	lockdep: annotate epoll	Peter Zijlstra	2008-02-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote: > I remember I talked with Arjan about this time ago. Basically, since 1) > you can drop an epoll fd inside another epoll fd 2) callback-based wakeups > are used, you can see a wake_up() from inside another wake_up(), but they > will never refer to the same lock instance. > Think about: > > dfd = socket(...); > efd1 = epoll_create(); > efd2 = epoll_create(); > epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...); > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...); > > When a packet arrives to the device underneath "dfd", the net code will > issue a wake_up() on its poll wake list. Epoll (efd1) has installed a > callback wakeup entry on that queue, and the wake_up() performed by the > "dfd" net code will end up in ep_poll_callback(). At this point epoll > (efd1) notices that it may have some event ready, so it needs to wake up > the waiters on its poll wait list (efd2). So it calls ep_poll_safewake() > that ends up in another wake_up(), after having checked about the > recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to > avoid stack blasting. Never hit the same queue, to avoid loops like: > > epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...); > epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...); > epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...); > epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...); > > The code "if (tncur->wq == wq \|\| ..." prevents re-entering the same > queue/lock. Since the epoll code is very careful to not nest same instance locks allow the recursion. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Acked-by: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial	Linus Torvalds	2008-02-04	14	-16/+16
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits) Jesper Juhl is the new trivial patches maintainer Documentation: mention email-clients.txt in SubmittingPatches fs/binfmt_elf.c: spello fix do_invalidatepage() comment typo fix Documentation/filesystems/porting fixes typo fixes in net/core/net_namespace.c typo fix in net/rfkill/rfkill.c typo fixes in net/sctp/sm_statefuns.c lib/: Spelling fixes kernel/: Spelling fixes include/scsi/: Spelling fixes include/linux/: Spelling fixes include/asm-m68knommu/: Spelling fixes include/asm-frv/: Spelling fixes fs/: Spelling fixes drivers/watchdog/: Spelling fixes drivers/video/: Spelling fixes drivers/ssb/: Spelling fixes drivers/serial/: Spelling fixes drivers/scsi/: Spelling fixes ...
\| *	fs/binfmt_elf.c: spello fix	Ohad Ben-Cohen	2008-02-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	s/litle/little Signed-off-by: Ohad Ben-Cohen <ohad@bencohen.org> Signed-off-by: Adrian Bunk <bunk@kernel.org>
\| *	fs/: Spelling fixes	Joe Perches	2008-02-03	9	-11/+11
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
\| *	Spelling fixes: lenght->length	Paulius Zaleckas	2008-02-03	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Paulius Zaleckas <pauliusz@yahoo.com> Signed-off-by: Adrian Bunk <bunk@kernel.org>
\| *	Typoes: "whith" -> "with"	Robert P. J. Day	2008-02-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Adrian Bunk <bunk@kernel.org>
\| *	Fix a small number of "memeber" typoes.	Robert P. J. Day	2008-02-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Adrian Bunk <bunk@kernel.org>
* \|	Merge branch 'locks' of git://linux-nfs.org/~bfields/linux	Linus Torvalds	2008-02-04	1	-61/+64
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'locks' of git://linux-nfs.org/~bfields/linux: pid-namespaces-vs-locks-interaction file locks: Use wait_event_interruptible_timeout() locks: clarify posix_locks_deadlock
\| * \|	pid-namespaces-vs-locks-interaction	Vitaliy Gusev	2008-02-03	1	-6/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fcntl(F_GETLK,..) can return pid of process for not current pid namespace (if process is belonged to the several namespaces). It is true also for pids in /proc/locks. So correct behavior is saving pointer to the struct pid of the process lock owner. Signed-off-by: Vitaliy Gusev <vgusev@openvz.org> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| * \|	file locks: Use wait_event_interruptible_timeout()	Matthew Wilcox	2008-02-03	1	-28/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	interruptible_sleep_on_locked() is just an open-coded wait_event_interruptible_timeout(), with the one difference that interruptible_sleep_on_locked() doesn't bother to check the condition on which it is waiting, depending instead on the BKL to avoid the case where it blocks after the wakeup has already been called. locks_block_on_timeout() is only used in one place, so it's actually simpler to inline it into its caller. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| * \|	locks: clarify posix_locks_deadlock	J. Bruce Fields	2008-02-03	1	-30/+40
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For such a short function (with such a long comment), posix_locks_deadlock() seems to cause a lot of confusion. Attempt to make it a bit clearer: - Remove the initial posix_same_owner() check, which can never pass (since this is only called in the case that block_fl and caller_fl conflict) - Use an explicit loop (and a helper function) instead of a goto. - Rewrite the comment, attempting a clearer explanation, and removing some uninteresting historical detail. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* \|	vm audit: add VM_DONTEXPAND to mmap for drivers that need it	Nick Piggin	2008-02-04	1	-4/+0
\|/ \| \| \| \| \| \| \| \| \| \| \|	Drivers that register a ->fault handler, but do not range-check the offset argument, must set VM_DONTEXPAND in the vm_flags in order to prevent an expanding mremap from overflowing the resource. I've audited the tree and attempted to fix these problems (usually by adding VM_DONTEXPAND where it is not obvious). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Merge branch 'for-linus' of git://linux-nfs.org/~bfields/linux	Linus Torvalds	2008-02-02	23	-375/+466
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'for-linus' of git://linux-nfs.org/~bfields/linux: (100 commits) SUNRPC: RPC program information is stored in unsigned integers SUNRPC: Move exported symbol definitions after function declaration part 2 NLM: tear down RPC clients in nlm_shutdown_hosts SUNRPC: spin svc_rqst initialization to its own function nfsd: more careful input validation in nfsctl write methods lockd: minor log message fix knfsd: don't bother mapping putrootfh enoent to eperm rdma: makefile rdma: ONCRPC RDMA protocol marshalling rdma: SVCRDMA sendto rdma: SVCRDMA recvfrom rdma: SVCRDMA Core Transport Services rdma: SVCRDMA Transport Module rdma: SVCRMDA Header File svc: Add svc_xprt_names service to replace svc_sock_names knfsd: Support adding transports by writing portlist file svc: Add svc API that queries for a transport instance svc: Add /proc/sys/sunrpc/transport files svc: Add transport hdr size for defer/revisit svc: Move the xprt independent code to the svc_xprt.c file ...
\| *	NLM: tear down RPC clients in nlm_shutdown_hosts	Jeff Layton	2008-02-01	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It's possible for a RPC to outlive the lockd daemon that created it, so we need to make sure that all RPC's are killed when lockd is coming down. When nlm_shutdown_hosts is called, kill off all RPC tasks associated with the host. Since we need to wait until they have all gone away, we might as well just shut down the RPC client altogether. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd: more careful input validation in nfsctl write methods	J. Bruce Fields	2008-02-01	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Neil Brown points out that we're checking buf[size-1] in a couple places without first checking whether size is zero. Actually, given the implementation of simple_transaction_get(), buf[-1] is zero, so in both of these cases the subsequent check of the value of buf[size-1] will catch this case. But it seems fragile to depend on that, so add explicit checks for this case. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Acked-by: NeilBrown <neilb@suse.de>
\| *	lockd: minor log message fix	J. Bruce Fields	2008-02-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Wendy Cheng noticed that function name doesn't agree here. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Wendy Cheng <wcheng@redhat.com>
\| *	knfsd: don't bother mapping putrootfh enoent to eperm	J. Bruce Fields	2008-02-01	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Neither EPERM and ENOENT map to valid errors for PUTROOTFH according to rfc 3530, and, if anything, ENOENT is likely to be slightly more informative; so don't bother mapping ENOENT to EPERM. (Probably this was originally done because one likely cause was that there is an fsid=0 export but that it isn't permitted to this particular client. Now that we allow WRONGSEC returns, this is somewhat less likely.) In the long term we should work to make this situation less likely, perhaps by turning off nfsv4 service entirely in the absence of the pseudofs root, or constructing a pseudofilesystem root ourselves in the kernel as necessary. Thanks to Benny Halevy <bhalevy@panasas.com> for pointing out this problem. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Benny Halevy <bhalevy@panasas.com>
\| *	svc: Add svc_xprt_names service to replace svc_sock_names	Tom Tucker	2008-02-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Create a transport independent version of the svc_sock_names function. The toclose capability of the svc_sock_names service can be implemented using the svc_xprt_find and svc_xprt_close services. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Acked-by: Neil Brown <neilb@suse.de> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	knfsd: Support adding transports by writing portlist file	Tom Tucker	2008-02-01	2	-6/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update the write handler for the portlist file to allow creating new listening endpoints on a transport. The general form of the string is: <transport_name><space><port number> For example: echo "tcp 2049" > /proc/fs/nfsd/portlist This is intended to support the creation of a listening endpoint for RDMA transports without adding #ifdef code to the nfssvc.c file. Transports can also be removed as follows: '-'<transport_name><space><port number> For example: echo "-tcp 2049" > /proc/fs/nfsd/portlist Attempting to add a listener with an invalid transport string results in EPROTONOSUPPORT and a perror string of "Protocol not supported". Attempting to remove an non-existent listener (.e.g. bad proto or port) results in ENOTCONN and a perror string of "Transport endpoint is not connected" Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Acked-by: Neil Brown <neilb@suse.de> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	svc: Add svc API that queries for a transport instance	Tom Tucker	2008-02-01	1	-14/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a new svc function that allows a service to query whether a transport instance has already been created. This is used in lockd to determine whether or not a transport needs to be created when a lockd instance is brought up. Specifying 0 for the address family or port is effectively a wild-card, and will result in matching the first transport in the service's list that has a matching class name. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Acked-by: Neil Brown <neilb@suse.de> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	svc: Make close transport independent	Tom Tucker	2008-02-01	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move sk_list and sk_ready to svc_xprt. This involves close because these lists are walked by svcs when closing all their transports. So I combined the moving of these lists to svc_xprt with making close transport independent. The svc_force_sock_close has been changed to svc_close_all and takes a list as an argument. This removes some svc internals knowledge from the svcs. This code races with module removal and transport addition. Thanks to Simon Holm Thøgersen for a compile fix. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Acked-by: Neil Brown <neilb@suse.de> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Simon Holm Thøgersen <odie@cs.aau.dk>
\| *	svc: Change services to use new svc_create_xprt service	Tom Tucker	2008-02-01	3	-13/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Modify the various kernel RPC svcs to use the svc_create_xprt service. Signed-off-by: Tom Tucker <tom@opengridcomputing.com> Acked-by: Neil Brown <neilb@suse.de> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	Leak in nlmsvc_testlock for async GETFL case	Oleg Drokin	2008-02-01	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix nlm_block leak for the case of supplied blocking lock info. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: clean up access_valid, deny_valid checks.	J. Bruce Fields	2008-02-01	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Document these checks a little better and inline, as suggested by Neil Brown (note both functions have two callers). Remove an obviously bogus check while we're there (checking whether unsigned value is negative). Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Neil Brown <neilb@suse.de>
\| *	nfsd: allow root to set uid and gid on create	J. Bruce Fields	2008-02-01	1	-19/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The server silently ignores attempts to set the uid and gid on create. Based on the comment, this appears to have been done to prevent some overly-clever IRIX client from causing itself problems. Perhaps we should remove that hack completely. For now, at least, it makes sense to allow root (when no_root_squash is set) to set uid and gid. While we're there, since nfsd_create and nfsd_create_v3 share the same logic, pull that out into a separate function. And spell out the individual modifications of ia_valid instead of doing them both at once inside a conditional. Thanks to Roger Willcocks <roger@filmlight.ltd.uk> for the bug report and original patch on which this is based. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	lockd: fix a leak in nlmsvc_testlock asynchronous request handling	Oleg Drokin	2008-02-01	1	-7/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Without the patch, there is a leakage of nlmblock structure refcount that holds a reference nlmfile structure, that holds a reference to struct file, when async GETFL is used (-EINPROGRESS return from file_ops->lock()), and also in some error cases. Fix up a style nit while we're here. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd: Allow AIX client to read dir containing mountpoints	Frank Filz	2008-02-01	2	-6/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch addresses a compatibility issue with a Linux NFS server and AIX NFS client. I have exported /export as fsid=0 with sec=krb5:krb5i I have mount --bind /home onto /export/home I have exported /export/home with sec=krb5i The AIX client mounts / -o sec=krb5:krb5i onto /mnt If I do an ls /mnt, the AIX client gets a permission error. Looking at the network traceIwe see a READDIR looking for attributes FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID. The response gives a NFS4ERR_WRONGSEC which the AIX client is not expecting. Since the AIX client is only asking for an attribute that is an attribute of the parent file system (pseudo root in my example), it seems reasonable that there should not be an error. In discussing this issue with Bruce Fields, I initially proposed ignoring the error in nfsd4_encode_dirent_fattr() if all that was being asked for was FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID, however, Bruce suggested that we avoid calling cross_mnt() if only these attributes are requested. The following patch implements bypassing cross_mnt() if only FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID are called. Since there is some complexity in the code in nfsd4_encode_fattr(), I didn't want to duplicate code (and introduce a maintenance nightmare), so I added a parameter to nfsd4_encode_fattr() that indicates whether it should ignore cross mounts and simply fill in the attribute using the passed in dentry as opposed to it's parent. Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: fix bad seqid on lock request incompatible with open mode	J. Bruce Fields	2008-02-01	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The failure to return a stateowner from nfs4_preprocess_seqid_op() means in the case where a lock request is of a type incompatible with an open (due to, e.g., an application attempting a write lock on a file open for read), means that fs/nfsd/nfs4xdr.c:ENCODE_SEQID_OP_TAIL() never bumps the seqid as it should. The client, attempting to close the file afterwards, then gets an (incorrect) bad sequence id error. Worse, this prevents the open file from ever being closed, so we leak state. Thanks to Benny Halevy and Trond Myklebust for analysis, and to Steven Wilton for the report and extensive data-gathering. Cc: Benny Halevy <bhalevy@panasas.com> Cc: Steven Wilton <steven.wilton@team.eftel.com.au> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	lockd: fix reference count leaks in async locking case	Oleg Drokin	2008-02-01	2	-17/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In a number of places where we wish only to translate nlm_drop_reply to rpc_drop_reply errors we instead return early with rpc_drop_reply, skipping some important end-of-function cleanup. This results in reference count leaks when lockd is doing posix locking on GFS2. Signed-off-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: recognize callback channel failure earlier	J. Bruce Fields	2008-02-01	2	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When the callback channel fails, we inform the client of that by returning a cb_path_down error the next time it tries to renew its lease. If we wait most of a lease period before deciding that a callback has failed and that the callback channel is down, then we decrease the chances that the client will find out in time to do anything about it. So, mark the channel down as soon as we recognize that an rpc has failed. However, continue trying to recall delegations anyway, in hopes it will come back up. This will prevent more delegations from being given out, and ensure cb_path_down is returned to renew calls earlier, while still making the best effort to deliver recalls of existing delegations. Also fix a couple comments and remove a dprink that doesn't seem likely to be useful. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: miscellaneous nfs4state.c style fixes	J. Bruce Fields	2008-02-01	1	-29/+26
\| \| \| \| \| \| \| \| \| \| \| \|	Fix various minor style violations. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: make current_clientid local	J. Bruce Fields	2008-02-01	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Declare this variable in the one function where it's used, and clean up some minor style problems. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd: fix encode_entryplus_baggage() indentation	J. Bruce Fields	2008-02-01	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	Fix bizarre indentation. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
\| *	nfsd4: kill unneeded cl_confirm check	J. Bruce Fields	2008-02-01	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We generate a unique cl_confirm for every new client; so if we've already checked that this cl_confirm agrees with the cl_confirm of unconf, then we already know that it does not agree with the cl_confirm of conf. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>