op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	[PATCH] fix de_thread() vs do_coredump() deadlock	Oleg Nesterov	2005-10-30	1	-3/+13
\| \| \| \| \| \| \| \| \| \| \| \|	de_thread() sends SIGKILL to all sub-threads and waits them to die in 'D' state. It is possible that one of the threads already dequeued coredump signal. When de_thread() unlocks ->sighand->lock that thread can enter do_coredump()->coredump_wait() and cause a deadlock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] fuse: spelling fixes	Miklos Szeredi	2005-10-30	2	-9/+9
\| \| \| \| \| \| \| \| \|	Correct some typos and inconsistent use of "initialise" vs "initialize" in comments. Reported by Ioannis Barkas. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Locking problems while EXT3FS_DEBUG on	Glauber de Oliveira Costa	2005-10-30	3	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I noticed some problems while running ext3 with the debug flag set on. More precisely, I was unable to umount the filesystem. Some investigation took me to the patch that follows. At a first glance , the lock/unlock I've taken out seems really not necessary, as the main code (outside debug) does not lock the super. The only additional danger operations that debug code introduces seems to be related to bitmap, but bitmap operations tends to be all atomic anyway. I also took the opportunity to fix 2 spelling errors. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] coredump_wait() cleanup	Oleg Nesterov	2005-10-30	1	-8/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch deletes pointless code from coredump_wait(). 1. It does useless mm->core_waiters inc/dec under mm->mmap_sem, but any changes to ->core_waiters have no effect until we drop ->mmap_sem. 2. It calls yield() for absolutely unknown reason. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] proc: fix of error path in proc_get_inode()	Kirill Korotaev	2005-10-30	1	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes incorrect error path in proc_get_inode(), when module can't be get due to being unloaded. When try_module_get() fails, this function puts de(!) and still returns inode with non-getted de. There are still unresolved known bugs in proc yet to be fixed: - proc_dir_entry tree is managed without any serialization - create_proc_entry() doesn't setup de->owner anyhow, so setting it later manually is inatomic. - looks like almost all modules do not care whether it's de->owner is set... Signed-Off-By: Denis Lunev <den@sw.ru> Signed-Off-By: Kirill Korotaev <dev@sw.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] fuse: clean up dead code related to nfs exporting	Miklos Szeredi	2005-10-30	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove last remains of NFS exportability support. The code is actually buggy (as reported by Akshat Aranya), since 'alias' will be leaked if it's non-null and alias->d_flags has DCACHE_DISCONNECTED. This is not an active bug, since there will never be any disconnected dentries. But it's better to get rid of the unnecessary complexity anyway. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] fix de_thread vs it_real_fn() deadlock	Oleg Nesterov	2005-10-30	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	de_thread() calls del_timer_sync(->real_timer) under ->sighand->siglock. This is deadlockable, it_real_fn sends a signal and needs this lock too. Also, delete unneeded ->real_timer.data assignment. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] reduce sizeof(struct file)	Eric Dumazet	2005-10-30	4	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead in a memory location that is not anymore used at call_rcu(&f->f_rcuhead, file_free_rcu) time, to reduce the size of this critical kernel object. The trick I used is to move f_rcuhead and f_list in an union called f_u The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list becomes f_u.f_list Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] open: cleanup in lookup_flags()	Miklos Szeredi	2005-10-30	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \|	lookup_flags() is only called from the non-create case, so it needn't check for O_CREAT\|O_EXCL. Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Don't uselessly export task_struct to userspace in core dumps	Eric W. Biederman	2005-10-30	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	task_struct is an internal structure to the kernel with a lot of good information, that is probably interesting in core dumps. However there is no way for user space to know what format that information is in making it useless. I grepped the GDB 6.3 source code and NT_TASKSTRUCT while defined is not used anywhere else. So I would be surprised if anyone notices it is missing. In addition exporting kernel pointers to all the interesting kernel data structures sounds like the very definition of an information leak. I haven't a clue what someone with evil intentions could do with that information, but in any attack against the kernel it looks like this is the perfect tool for aiming that attack. So since NT_TASKSTRUCT is useless as currently defined and is potentially dangerous, let's just not export it. (akpm: Daniel Jacobowitz <dan@debian.org> "would be amazed" if anything was using NT_TASKSTRUCT). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] TIOC* compat ioctl handling	Christoph Hellwig	2005-10-30	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TIOCSTART and TIOCSTOP are defined in asm/ioctls.h and asm/termios.h by various architectures but not actually implemented anywhere but in the IRIX compatibility layer, so remove their COMPATIBLE_IOCTL from parisc, ppc64 and sparc64. Move the TIOCSLTC COMPATIBLE_IOCTL to common code, guided by an ifdef to only show up on architectures that support it (same as the code handling it in tty_ioctl.c), aswell as it's brother TIOCGLTC that wasn't handled so far. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] little de_thread() cleanup	Oleg Nesterov	2005-10-30	1	-4/+3
\| \| \| \| \| \| \| \|	Trivial, saves one 'if' branch in de_thread(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] reiserfs: [kv]free() checking cleanup	James Lamanna	2005-10-30	2	-18/+10
\| \| \| \| \| \| \| \| \|	Signed-off-by: James Lamanna <jlamanna@gmail.com> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] ext3: Fix unmapped buffers in transaction's lists	Jan Kara	2005-10-30	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix the problem (BUG 4964) with unmapped buffers in transaction's t_sync_data list. The problem is we need to call filesystem's own invalidatepage() from block_write_full_page(). block_write_full_page() must call filesystem's invalidatepage(). Otherwise following nasty race can happen: proc 1 proc 2 ------ ------ - write some new data to 'offset' => bh gets to the transactions data list - starts truncate => i_size set to new size - mpage_writepages() - ext3_ordered_writepage() to 'offset' - block_write_full_page() - page->index > end_index+1 - block_invalidatepage() - discard_buffer() - clear_buffer_mapped() - commit triggers and finds unmapped buffer - BOOM! Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] SELinux: canonicalize getxattr()	James Morris	2005-10-30	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch allows SELinux to canonicalize the value returned from getxattr() via the security_inode_getsecurity() hook, which is called after the fs level getxattr() function. The purpose of this is to allow the in-core security context for an inode to override the on-disk value. This could happen in cases such as upgrading a system to a different labeling form (e.g. standard SELinux to MLS) without needing to do a full relabel of the filesystem. In such cases, we want getxattr() to return the canonical security context that the kernel is using rather than what is stored on disk. The implementation hooks into the inode_getsecurity(), adding another parameter to indicate the result of the preceding fs-level getxattr() call, so that SELinux knows whether to compare a value obtained from disk with the kernel value. We also now allow getxattr() to work for mountpoint labeled filesystems (i.e. mount with option context=foo_t), as we are able to return the kernel value to the user. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] CONFIG_IA32	Brian Gerst	2005-10-30	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add CONFIG_X86_32 for i386. This allows selecting options that only apply to 32-bit systems. (X86 && !X86_64) becomes X86_32 (X86 \|\| X86_64) becomes X86 Signed-off-by: Brian Gerst <bgerst@didntduck.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] NFS: Remove unbalanced spin_unlock() calls from nfs_refresh_inode()	Trond Myklebust	2005-10-30	1	-2/+0
\| \| \| \| \| \| \|	Doh! Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] hugetlb: overcommit accounting check	Adam Litke	2005-10-29	1	-10/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Basic overcommit checking for hugetlb_file_map() based on an implementation used with demand faulting in SLES9. Since demand faulting can't guarantee the availability of pages at mmap time, this patch implements a basic sanity check to ensure that the number of huge pages required to satisfy the mmap are currently available. Despite the obvious race, I think it is a good start on doing proper accounting. I'd like to work towards an accounting system that mimics the semantics of normal pages (especially for the MAP_PRIVATE/COW case). That work is underway and builds on what this patch starts. Huge page shared memory segments are simpler and still maintain their commit on shmget semantics. Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] hugetlb: demand fault handler	Adam Litke	2005-10-29	1	-5/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Below is a patch to implement demand faulting for huge pages. The main motivation for changing from prefaulting to demand faulting is so that huge page memory areas can be allocated according to NUMA policy. Thanks to consolidated hugetlb code, switching the behavior requires changing only one fault handler. The bulk of the patch just moves the logic from hugelb_prefault() to hugetlb_pte_fault() and find_get_huge_page(). Signed-off-by: Adam Litke <agl@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] cleanup hugelbfs_forget_inode	Christoph Hellwig	2005-10-29	1	-18/+20
\| \| \| \| \| \| \| \| \| \|	Reformat hugelbfs_forget_inode and add the missing but harmless write_inode_now call. It looks the same as generic_forget_inode now except for the call to truncate_hugepages instead of truncate_inode_pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] kill hugelbfs_do_delete_inode	Christoph Hellwig	2005-10-29	1	-37/+1
\| \| \| \| \| \| \| \| \|	hugetlbfs_do_delete_inode is the same as generic_delete_inode now, so remove it in favour of the latter. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] hugetlbfs: clean up hugetlbfs_delete_inode	Christoph Hellwig	2005-10-29	1	-6/+32
\| \| \| \| \| \| \| \| \| \| \|	Make hugetlbfs looks the same as generic_detelte_inode, fixing a bunch of missing updates to it at the same time. Rename it to hugetlbfs_do_delete_inode and add a real hugetlbfs_delete_inode that implements ->delete_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] hugetlbfs: move free_inodes accounting	Christoph Hellwig	2005-10-29	1	-36/+42
\| \| \| \| \| \| \| \| \| \| \|	Move hugetlbfs accounting into ->alloc_inode / ->destroy_inode. This keeps the code simpler, fixes a loeak where a failing inode allocation wouldn't decrement the counter and moves hugetlbfs_delete_inode and hugetlbfs_forget_inode closer to their generic counterparts. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: split page table lock	Hugh Dickins	2005-10-29	4	-12/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: follow_page with inner ptlock	Hugh Dickins	2005-10-29	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: unmap_vmas with inner ptlock	Hugh Dickins	2005-10-29	1	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: pte_offset_map_lock loops	Hugh Dickins	2005-10-29	1	-11/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: ptd_alloc take ptlock	Hugh Dickins	2005-10-29	1	-9/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: update_hiwaters just in time	Hugh Dickins	2005-10-29	3	-4/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	update_mem_hiwater has attracted various criticisms, in particular from those concerned with mm scalability. Originally it was called whenever rss or total_vm got raised. Then many of those callsites were replaced by a timer tick call from account_system_time. Now Frank van Maarseveen reports that to be found inadequate. How about this? Works for Frank. Replace update_mem_hiwater, a poor combination of two unrelated ops, by macros update_hiwater_rss and update_hiwater_vm. Don't attempt to keep mm->hiwater_rss up to date at timer tick, nor every time we raise rss (usually by 1): those are hot paths. Do the opposite, update only when about to lower rss (usually by many), or just before final accounting in do_exit. Handle mm->hiwater_vm in the same way, though it's much less of an issue. Demand that whoever collects these hiwater statistics do the work of taking the maximum with rss or total_vm. And there has been no collector of these hiwater statistics in the tree. The new convention needs an example, so match Frank's usage by adding a VmPeak line above VmSize to /proc/<pid>/status, and also a VmHWM line above VmRSS (High-Water-Mark or High-Water-Memory). There was a particular anomaly during mremap move, that hiwater_vm might be captured too high. A fleeting such anomaly remains, but it's quickly corrected now, whereas before it would stick. What locking? None: if the app is racy then these statistics will be racy, it's not worth any overhead to make them exact. But whenever it suits, hiwater_vm is updated under exclusive mmap_sem, and hiwater_rss under page_table_lock (for now) or with preemption disabled (later on): without going to any trouble, minimize the time between reading current values and updating, to minimize those occasions when a racing thread bumps a count up and back down in between. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] core remove PageReserved	Nick Piggin	2005-10-29	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: rss = file_rss + anon_rss	Hugh Dickins	2005-10-29	3	-7/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I was lazy when we added anon_rss, and chose to change as few places as possible. So currently each anonymous page has to be counted twice, in rss and in anon_rss. Which won't be so good if those are atomic counts in some configurations. Change that around: keep file_rss and anon_rss separately, and add them together (with get_mm_rss macro) when the total is needed - reading two atomics is much cheaper than updating two atomics. And update anon_rss upfront, typically in memory.c, not tucked away in page_add_anon_rmap. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] mm: mm_init set_mm_counters	Hugh Dickins	2005-10-29	5	-11/+0
\| \| \| \| \| \| \| \| \| \| \| \|	How is anon_rss initialized? In dup_mmap, and by mm_alloc's memset; but that's not so good if an mm_counter_t is a special type. And how is rss initialized? By set_mm_counter, all over the place. Come on, we just need to initialize them both at once by set_mm_counter in mm_init (which follows the memcpy when forking). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] Convert mempolicies to nodemask_t	Andi Kleen	2005-10-29	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \|	The NUMA policy code predated nodemask_t so it used open coded bitmaps. Convert everything to nodemask_t. Big patch, but shouldn't have any actual behaviour changes (except I removed one unnecessary check against node_online_map and one unnecessary BUG_ON) Signed-off-by: "Andi Kleen" <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	[PATCH] usb: Patch for USBDEVFS_IOCTL from 32-bit programs	Pete Zaitcev	2005-10-28	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dell supplied me with the following test: #include<stdio.h> #include<errno.h> #include<sys/ioctl.h> #include<fcntl.h> #include<linux/usbdevice_fs.h> main(int argc,charargv[]) { struct usbdevfs_hub_portinfo hubPortInfo = {0}; struct usbdevfs_ioctl command = {0}; command.ifno = 0; command.ioctl_code = USBDEVFS_HUB_PORTINFO; command.data = (void)&hubPortInfo; int fd, ret; if(argc != 2) { fprintf(stderr,"Usage: %s /proc/bus/usb/<BusNo>/<HubID>\n",argv[0]); fprintf(stderr,"Example: %s /proc/bus/usb/001/001\n",argv[0]); exit(1); } errno = 0; fd = open(argv[1],O_RDWR); if(fd < 0) { perror("open failed:"); exit(errno); } errno = 0; ret = ioctl(fd,USBDEVFS_IOCTL,&command); printf("IOCTL return status:%d\n",ret); if(ret<0) { perror("IOCTL failed:"); close(fd); exit(3); } else { printf("IOCTL passed:Num of ports %d\n",hubPortInfo.nports); close(fd); exit(0); } return 0; } I have verified that it breaks if built in 32 bit mode on x86_64 and that the patch below fixes it. Signed-off-by: Pete Zaitcev <zaitcev@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
*	[PATCH] Fix ext3 warning for unused var	Peter Osterlund	2005-10-28	1	-10/+16
\| \| \| \| \| \| \|	Fix compile warning in ext3 quota code. Signed-off-by: Peter Osterlund <petero2@telia.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
*	Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6	Linus Torvalds	2005-10-28	2	-3/+28
\|\
\| *	Merge ../bleed-2.6	Greg KH	2005-10-28	41	-686/+1379
\| \|\
\| * \|	[PATCH] Driver Core: fix up all callers of class_device_create()	Greg Kroah-Hartman	2005-10-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous patch adding the ability to nest struct class_device changed the paramaters to the call class_device_create(). This patch fixes up all in-kernel users of the function. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
\| * \|	[PATCH] add sysfs attr to re-emit device hotplug event	Kay Sievers	2005-10-28	1	-1/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A "coldplug + udevstart" can be simple like this: for i in /sys/block///uevent; do echo 1 > $i; done for i in /sys/class///uevent; do echo 1 > $i; done for i in /sys/bus//devices//uevent; do echo 1 > $i; done Signed-off-by: Kay Sievers <kay.sievers@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
* \| \|	Merge branch 'for-linus' of ↵	Linus Torvalds	2005-10-28	6	-19/+38
\|\ \ \ \| \|_\|/ \|/\| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/shaggy/jfs-2.6
\| * \|	JFS: make sure right-most xtree pages have header.next set to zero	Dave Kleikamp	2005-10-28	1	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The xtTruncate code was only doing this for leaf pages. When a file is horribly fragmented, we may truncate a file leaving an internal page with an invalid head.next field, which may cause a stale page to be referenced. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
\| * \|	JFS: Corrupted block map should not cause trap	Dave Kleikamp	2005-10-11	1	-5/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace assert statements with better error handling. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
\| * \|	JFS: make special inodes play nicely with page balancing	Dave Kleikamp	2005-10-03	5	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch fixes up a few problems with jfs's reserved inodes. 1. There is no need for the jfs code setting the I_DIRTY bits in i_state. I am ashamed that the code ever did this, and surprised it hasn't been noticed until now. 2. Make sure special inodes are on an inode hash list. If the inodes are unhashed, __mark_inode_dirty will fail to put the inode on the superblock's dirty list, and the data will not be flushed under memory pressure. 3. Force writing journal data to disk when metapage_writepage is unable to write a metadata page due to pending journal I/O. Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
* \| \|	Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block	Linus Torvalds	2005-10-28	1	-1/+1
\|\ \ \
\| * \| \|	[patch] remove gendisk->stamp_idle field	Chen, Kenneth W	2005-10-28	1	-1/+1
\| \| \|/ \| \|/\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	struct gendisk has these two fields: stamp, stamp_idle. Update to stamp_idle is always in sync with stamp and they are always the same. Therefore, it does not add any value in having two fields tracking same timestamp. Suggest to remove it. Also, we should only update gendisk stats with non-zero value. Advantage is that we don't have to needlessly calculate memory address, and then add zero to the content. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Jens Axboe <axboe@suse.de>
* \| \|	[PATCH] gfp_t: reiserfs mapping_set_gfp_mask() use	Al Viro	2005-10-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \| \|	[PATCH] gfp_t: fs/*	Al Viro	2005-10-28	19	-45/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	- ->releasepage() annotated (s/int/gfp_t), instances updated - missing gfp_t in fs/* added - fixed misannotation from the original sweep caught by bitwise checks: XFS used __nocast both for gfp_t and for flags used by XFS allocator. The latter left with unsigned int __nocast; we might want to add a different type for those but for now let's leave them alone. That, BTW, is a case when __nocast use had been actively confusing - it had been used in the same code for two different and similar types, with no way to catch misuses. Switch of gfp_t to bitwise had caught that immediately... One tricky bit is left alone to be dealt with later - mapping->flags is a mix of gfp_t and error indications. Left alone for now. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \| \|	[PATCH] gfp_t: infrastructure	Al Viro	2005-10-28	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Beginning of gfp_t annotations: - -Wbitwise added to CHECKFLAGS - old __bitwise renamed to __bitwise__ - __bitwise defined to either __bitwise__ or nothing, depending on __CHECK_ENDIAN__ being defined - gfp_t switched from __nocast to __bitwise__ - force cast to gfp_t added to __GFP_... constants - new helper - gfp_zone(); extracts zone bits out of gfp_t value and casts the result to int Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* \| \|	NFS: Allow files that are open for write to invalidate caches	Trond Myklebust	2005-10-27	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* \| \|	NFSv4: Convert unnecessary XDR warning messages into dprintk()	Trond Myklebust	2005-10-27	1	-11/+6
\| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>