summaryrefslogtreecommitdiffstats
path: root/include
Commit message (Collapse)AuthorAgeFilesLines
* [PATCH] cpusets: automatic numa mempolicy rebindingPaul Jackson2005-10-301-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch automatically updates a tasks NUMA mempolicy when its cpuset memory placement changes. It does so within the context of the task, without any need to support low level external mempolicy manipulation. If a system is not using cpusets, or if running on a system with just the root (all-encompassing) cpuset, then this remap is a no-op. Only when a task is moved between cpusets, or a cpusets memory placement is changed does the following apply. Otherwise, the main routine below, rebind_policy() is not even called. When mixing cpusets, scheduler affinity, and NUMA mempolicies, the essential role of cpusets is to place jobs (several related tasks) on a set of CPUs and Memory Nodes, the essential role of sched_setaffinity is to manage a jobs processor placement within its allowed cpuset, and the essential role of NUMA mempolicy (mbind, set_mempolicy) is to manage a jobs memory placement within its allowed cpuset. However, CPU affinity and NUMA memory placement are managed within the kernel using absolute system wide numbering, not cpuset relative numbering. This is ok until a job is migrated to a different cpuset, or what's the same, a jobs cpuset is moved to different CPUs and Memory Nodes. Then the CPU affinity and NUMA memory placement of the tasks in the job need to be updated, to preserve their cpuset-relative position. This can be done for CPU affinity using sched_setaffinity() from user code, as one task can modify anothers CPU affinity. This cannot be done from an external task for NUMA memory placement, as that can only be modified in the context of the task using it. However, it easy enough to remap a tasks NUMA mempolicy automatically when a task is migrated, using the existing cpuset mechanism to trigger a refresh of a tasks memory placement after its cpuset has changed. All that is needed is the old and new nodemask, and notice to the task that it needs to rebind its mempolicy. The tasks mems_allowed has the old mask, the tasks cpuset has the new mask, and the existing cpuset_update_current_mems_allowed() mechanism provides the notice. The bitmap/cpumask/nodemask remap operators provide the cpuset relative calculations. This patch leaves open a couple of issues: 1) Updating vma and shmfs/tmpfs/hugetlbfs memory policies: These mempolicies may reference nodes outside of those allowed to the current task by its cpuset. Tasks are migrated as part of jobs, which reside on what might be several cpusets in a subtree. When such a job is migrated, all NUMA memory policy references to nodes within that cpuset subtree should be translated, and references to any nodes outside that subtree should be left untouched. A future patch will provide the cpuset mechanism needed to mark such subtrees. With that patch, we will be able to correctly migrate these other memory policies across a job migration. 2) Updating cpuset, affinity and memory policies in user space: This is harder. Any placement state stored in user space using system-wide numbering will be invalidated across a migration. More work will be required to provide user code with a migration-safe means to manage its cpuset relative placement, while preserving the current API's that pass system wide numbers, not cpuset relative numbers across the kernel-user boundary. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpusets: bitmap and mask remap operatorsPaul Jackson2005-10-303-0/+46
| | | | | | | | | | | | | | | | | | | | | In the forthcoming task migration support, a key calculation will be mapping cpu and node numbers from the old set to the new set while preserving cpuset-relative offset. For example, if a task and its pages on nodes 8-11 are being migrated to nodes 24-27, then pages on node 9 (the 2nd node in the old set) should be moved to node 25 (the 2nd node in the new set.) As with other bitmap operations, the proper way to code this is to provide the underlying calculation in lib/bitmap.c, and then to provide the usual cpumask and nodemask wrappers. This patch provides that. These operations are termed 'remap' operations. Both remapping a single bit and a set of bits is supported. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] cpusets: dual semaphore locking overhaulPaul Jackson2005-10-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Overhaul cpuset locking. Replace single semaphore with two semaphores. The suggestion to use two locks was made by Roman Zippel. Both locks are global. Code that wants to modify cpusets must first acquire the exclusive manage_sem, which allows them read-only access to cpusets, and holds off other would-be modifiers. Before making actual changes, the second semaphore, callback_sem must be acquired as well. Code that needs only to query cpusets must acquire callback_sem, which is also a global exclusive lock. The earlier problems with double tripping are avoided, because it is allowed for holders of manage_sem to nest the second callback_sem lock, and only callback_sem is needed by code called from within __alloc_pages(), where the double tripping had been possible. This is not quite the same as a normal read/write semaphore, because obtaining read-only access with intent to change must hold off other such attempts, while allowing read-only access w/o such intention. Changing cpusets involves several related checks and changes, which must be done while allowing read-only queries (to avoid the double trip), but while ensuring nothing changes (holding off other would be modifiers.) This overhaul of cpuset locking also makes careful use of task_lock() to guard access to the task->cpuset pointer, closing a couple of race conditions noticed while reading this code (thanks, Roman). I've never seen these races fail in any use or test. See further the comments in the code. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] add_timer() of a pending timer is illegalAndrew Morton2005-10-301-1/+2
| | | | | | | | | In the recent timer rework we lost the check for an add_timer() of an already-pending timer. That check was useful for networking, so put it back. Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] unify sys_ptrace prototypeChristoph Hellwig2005-10-3017-18/+1
| | | | | | | | | | Make sure we always return, as all syscalls should. Also move the common prototype to <linux/syscalls.h> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] adjust parisc sys_ptrace prototypeChristoph Hellwig2005-10-301-1/+0
| | | | | | | | | | | | Make the pid argument a long as on every other arcihtecture. Despite pid_t beeing a 32bit type even on 64bit parisc this is not an ABI change due to the parisc calling conventions. And even if it did it wouldn't matter too much because 64bit userspace on parisc is in an embrionic stage. Acked-by: Matthew Wilcox <matthew@wil.cx> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] kill sigqueue->lockOleg Nesterov2005-10-301-1/+0
| | | | | | | | | | This lock is used in sigqueue_free(), but it is always equal to current->sighand->siglock, so we don't need to keep it in the struct sigqueue. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] reduce sizeof(struct file)Eric Dumazet2005-10-301-2/+8
| | | | | | | | | | | | | | | Now that RCU applied on 'struct file' seems stable, we can place f_rcuhead in a memory location that is not anymore used at call_rcu(&f->f_rcuhead, file_free_rcu) time, to reduce the size of this critical kernel object. The trick I used is to move f_rcuhead and f_list in an union called f_u The callers are changed so that f_rcuhead becomes f_u.fu_rcuhead and f_list becomes f_u.f_list Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove timer debug fieldAndrew Morton2005-10-301-5/+0
| | | | | | | | | | | | | | | | Remove timer_list.magic and associated debugging code. I originally added this when a spinlock was added to timer_list - this meant that an all-zeroes timer became illegal and init_timer() was required. That spinlock isn't even there any more, although timer.base must now be initialised. I'll keep this debugging code in -mm. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] remove some more check_region stuffJeff Garzik2005-10-304-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | Removed some more references to check_region(). I checked these changes into the 'checkreg' branch of rsync://rsync.kernel.org/pub/scm/linux/kernel/git/jgarzik/misc-2.6.git The only valid references remaining are in: drivers/scsi/advansys.c drivers/scsi/BusLogic.c drivers/cdrom/sbpcd.c sound/oss/pss.c Remove last vestiges of ide_check_region() drivers/char/specialix: trim trailing whitespace drivers/char/specialix: eliminate use of check_region() Remove outdated and unused references to check_region() [sound oss] remove check_region() usage from cs4232, wavfront [netdrvr eepro] trim trailing whitespace [netdrvr eepro] remove check_region() usage Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] NTP shift_right cleanupjohn stultz2005-10-301-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | Create a macro shift_right() that avoids the numerous ugly conditionals in the NTP code that look like: if(a < 0) b = -(-a >> shift); else b = a >> shift; Replacing it with: b = shift_right(a, shift); This should have zero effect on the logic, however it should probably have a bit of testing just to be sure. Also replace open-coded min/max with the macros. Signed-off-by : John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Add kthread_stop_sem()Alan Stern2005-10-301-0/+12
| | | | | | | | | Enhance the kthread API by adding kthread_stop_sem, for use in stopping threads that spend their idle time waiting on a semaphore. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce setup_timer() helperOleg Nesterov2005-10-301-0/+9
| | | | | | | | | | | Every user of init_timer() also needs to initialize ->function and ->data fields. This patch adds a simple setup_timer() helper for that. The schedule_timeout() is patched as an example of usage. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ext3: Fix unmapped buffers in transaction's listsJan Kara2005-10-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the problem (BUG 4964) with unmapped buffers in transaction's t_sync_data list. The problem is we need to call filesystem's own invalidatepage() from block_write_full_page(). block_write_full_page() must call filesystem's invalidatepage(). Otherwise following nasty race can happen: proc 1 proc 2 ------ ------ - write some new data to 'offset' => bh gets to the transactions data list - starts truncate => i_size set to new size - mpage_writepages() - ext3_ordered_writepage() to 'offset' - block_write_full_page() - page->index > end_index+1 - block_invalidatepage() - discard_buffer() - clear_buffer_mapped() - commit triggers and finds unmapped buffer - BOOM! Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] s390: export ipl device parametersHeiko Carstens2005-10-301-0/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sysfs interface to export ipl device parameters. Dependent on the ipl type the interface will look like this: - ccw ipl: /sys/firmware/ipl/device /ipl_type - fcp ipl: /sys/firmware/ipl/binary_parameter /bootprog /br_lba /device /ipl_type /lun /scp_data /wwpn - otherwise (unknown that is): /sys/firmware/ipl/ipl_type Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] uml: remove old UM_FASTCALL, and make the thing work againPaolo 'Blaisorblade' Giarrusso2005-10-301-0/+8
| | | | | | | | | | | | This was used in the old dark age of 2.4, ARCH_CFLAGS doesn't work any more since some time, and UM_FASTCALL was never used in 2.6. Instead, reintroduce the thing more properly now, directly in include/asm-um/linkage.h. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] uml: reuse i386 cpu-specific tuningPaolo 'Blaisorblade' Giarrusso2005-10-301-4/+15
| | | | | | | | | | | | | | | | | | Make UML share the underlying cpu-specific tuning done on i386. Actually, for now many config options aren't used a lot - but that can be done later. Also, UML relies on GCC optimization for things like memcpy and such more than i386, so specifying the correct -march and -mtune should be enough. Later, we may want to correct some other stuff. For instance, since FPU context switching, for us, is done (at least partially, i.e. between our kernelspace and userspace) by the host, we may allow usage of FPU operations by GCC. This doesn't hold for kernelspace vs. kernelspace, but we don't support preemption. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] m32r: fix #if warningsHirokazu Takata2005-10-301-1/+1
| | | | | | | | Fix warnings for #if directives. Signed-off-by: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce .valid callback for pm_opsShaohua Li2005-10-301-0/+1
| | | | | | | | | | | Add pm_ops.valid callback, so only the available pm states show in /sys/power/state. And this also makes an earlier states error report at enter_state before we do actual suspend/resume. Signed-off-by: Shaohua Li<shaohua.li@intel.com> Acked-by: Pavel Machek<pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: rework memory freeing on resumeRafael J. Wysocki2005-10-301-2/+1
| | | | | | | | | | | | The following patch makes swsusp use the PG_nosave and PG_nosave_free flags to mark pages that should be freed in case of an error during resume. This allows us to simplify the code and to use swsusp_free() in all of the swsusp's resume error paths, which makes them actually work. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] swsusp: move snapshot functionality to separate fileRafael J. Wysocki2005-10-301-0/+6
| | | | | | | | | | | | | | | The following patch moves the functionality of swsusp related to creating and handling the snapshot of memory to a separate file, snapshot.c This should enable us to untangle the code in the future and eventually to implement some parts of swsusp.c in the user space. The patch does not change the code. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] introduce get_cpu_sysdev() to retrieve a sysfs entry for a cpu.Ashok Raj2005-10-301-0/+1
| | | | | | | | | | | | | | | | | | | | | Some modules creating sysfs entries under /sys/devices/system/cpu/cpuX/ need to know the parent sysfs entry to make devices under them. This will just return the sysfs entry for a given cpu. sysfs entries showing under each cpu sysfs can be easily created if such entries can be created by registering a sysfs driver for cpuclass. The issue is when the entry is created the CPU may not be online, hence we would need to defer the creation until the online notification comes. Current users: cache entries for Intel CPU's and cpufreq subsystem. Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Zwane Mwaikambo <zwane@holomorphy.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: inline spin_unlock if !CONFIG_DEBUG_SPINLOCK and !CONFIG_PREEMPTIngo Molnar2005-10-301-6/+25
| | | | | | | Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] Clean up mtrr compat ioctl codeBrian Gerst2005-10-301-0/+33
| | | | | | | | | | Handle 32-bit mtrr ioctls in the mtrr driver instead of the ia32 compatability layer. Signed-off-by: Brian Gerst <bgerst@didntduck.org> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] i386: move apic init in init_IRQsEric W. Biederman2005-10-304-23/+3
| | | | | | | | | | | | | | | | | | | | | | | | All kinds of ugliness exists because we don't initialize the apics during init_IRQs. - We calibrate jiffies in non apic mode even when we are using apics. - We have to have special code to initialize the apics when non-smp. - The legacy i8259 must exist and be setup correctly, even when we won't use it past initialization. - The kexec on panic code must restore the state of the io_apics. - init/main.c needs a special case for !smp smp_init on x86 In addition to pure code movement I needed a couple of non-obvious changes: - Move setup_boot_APIC_clock into APIC_late_time_init for simplicity. - Use cpu_khz to generate a better approximation of loops_per_jiffies so I can verify the timer interrupt is working. - Call setup_apic_nmi_watchdog again after cpu_khz is initialized on the boot cpu. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] ES7000 platform updateNatalie.Protasevich@unisys.com2005-10-301-1/+1
| | | | | | | | | | | This is platform code update for ES7000: disables IRQ overrides for the recent ES7000 (Rascal/Zorro), cleans up the compile warning. The patch only affects the ES7000 subarch. Signed-off-by: <Natalie.Protasevich@unisys.com> Acked-by: Zwane Mwaikambo <zwane@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: add an accessor function for getting the per-CPU gdtZachary Amsden2005-10-301-3/+5
| | | | | | | | | | Add an accessor function for getting the per-CPU gdt. Callee must already have the CPU. Signed-off-by: Zachary Amsden <zach@vmware.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] i386: little pgtable.h consolidation vs 2/3levelPaolo 'Blaisorblade' Giarrusso2005-10-303-10/+5
| | | | | | | | | | Join together some common functions (pmd_page{,_kernel}) over 2level and 3level pages. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] x86: cmpxchg improvementsJan Beulich2005-10-301-3/+30
| | | | | | | | | | | | | | | This adjusts i386's cmpxchg patterns so that - for word and long cmpxchg-es the compiler can utilize all possible registers - cmpxchg8b gets disabled when the minimum specified hardware architectur doesn't support it (like was already happening for the byte, word, and long ones). Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] SELinux: canonicalize getxattr()James Morris2005-10-301-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch allows SELinux to canonicalize the value returned from getxattr() via the security_inode_getsecurity() hook, which is called after the fs level getxattr() function. The purpose of this is to allow the in-core security context for an inode to override the on-disk value. This could happen in cases such as upgrading a system to a different labeling form (e.g. standard SELinux to MLS) without needing to do a full relabel of the filesystem. In such cases, we want getxattr() to return the canonical security context that the kernel is using rather than what is stored on disk. The implementation hooks into the inode_getsecurity(), adding another parameter to indicate the result of the preceding fs-level getxattr() call, so that SELinux knows whether to compare a value obtained from disk with the kernel value. We also now allow getxattr() to work for mountpoint labeled filesystems (i.e. mount with option context=foo_t), as we are able to return the kernel value to the user. Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] CONFIG_IA32Brian Gerst2005-10-301-1/+1
| | | | | | | | | | | | | Add CONFIG_X86_32 for i386. This allows selecting options that only apply to 32-bit systems. (X86 && !X86_64) becomes X86_32 (X86 || X86_64) becomes X86 Signed-off-by: Brian Gerst <bgerst@didntduck.org> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [libata] change ata_qc_complete() to take error mask as second argJeff Garzik2005-10-301-2/+26
| | | | | | | | | | | | | | | The second argument to ata_qc_complete() was being used for two purposes: communicate the ATA Status register to the completion function, and indicate an error. On legacy PCI IDE hardware, the latter is often implicit in the former. On more modern hardware, the driver often completely emulated a Status register value, passing ATA_ERR as an indication that something went wrong. Now that previous code changes have eliminated the need to use drv_stat arg to communicate the ATA Status register value, we can convert it to a mask of possible error classes. This will lead to more flexible error handling in the future.
* Merge branch 'master'Jeff Garzik2005-10-3038-238/+521
|\
| * Merge master.kernel.org:/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds2005-10-291-4/+13
| |\
| | * [PATCH] Introduce sg_set_bufHerbert Xu2005-10-301-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sg_init_one is a nice tool for the block layer. However, users of struct scatterlist in other subsystems don't usually need the DMA attributes. For them it's a waste of time and space to initialise the whole struct scatterlist structure. Therefore this patch adds a new function sg_set_buf to initialise a scatterlist without zeroing the DMA attributes. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
| * | [PATCH] memory hotplug: sysfs and add/remove functionsDave Hansen2005-10-293-0/+130
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds generic memory add/remove and supporting functions for memory hotplug into a new file as well as a memory hotplug kernel config option. Individual architecture patches will follow. For now, disable memory hotplug when swsusp is enabled. There's a lot of churn there right now. We'll fix it up properly once it calms down. Signed-off-by: Matt Tolentino <matthew.e.tolentino@intel.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] memory hotplug locking: zone span seqlockDave Hansen2005-10-292-2/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | See the "fixup bad_range()" patch for more information, but this actually creates a the lock to protect things making assumptions about a zone's size staying constant at runtime. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] memory hotplug locking: node_size_lockDave Hansen2005-10-292-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pgdat->node_size_lock is basically only neeeded in one place in the normal code: show_mem(), which is the arch-specific sysrq-m printing function. Strictly speaking, the architectures not doing memory hotplug do no need this locking in show_mem(). However, they are all included for completeness. This should also make any future consolidation of all of the implementations a little more straightforward. This lock is also held in the sparsemem code during a memory removal, as sections are invalidated. This is the place there pfn_valid() is made false for a memory area that's being removed. The lock is only required when doing pfn_valid() operations on memory which the user does not already have a reference on the page, such as in show_mem(). Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] memory hotplug prep: __section_nr helperDave Hansen2005-10-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | A little helper that we use in the hotplug code. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] memory hotplug prep: kill local_mapnrDave Hansen2005-10-294-21/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following series implements memory hot-add for ppc64 and i386. There are x86_64 and ia64 implementations that will be submitted shortly as well, through the normal maintainers. This patch: local_mapnr is unused, except for in an alpha header. Keep the alpha one, kill the rest. Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: update comments to pte lockHugh Dickins2005-10-292-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | Updated several references to page_table_lock in common code comments. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: fix rss and mmlist lockingHugh Dickins2005-10-291-4/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A couple of oddities were guarded by page_table_lock, no longer properly guarded when that is split. The mm_counters of file_rss and anon_rss: make those an atomic_t, or an atomic64_t if the architecture supports it, in such a case. Definitions by courtesy of Christoph Lameter: who spent considerable effort on more scalable ways of counting, but found insufficient benefit in practice. And adding an mm with swap to the mmlist for swapoff: the list is well- guarded by its own lock, but the list_empty check now has to be repeated inside it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: split page table lockHugh Dickins2005-10-292-11/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: parisc pte atomicityHugh Dickins2005-10-291-16/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a worrying function translation_exists in parisc cacheflush.h, unaffected by split ptlock since flush_dcache_page is using it on some other mm, without any relevant lock. Oh well, make it a slightly more robust by factoring the pfn check within it. And it looked liable to confuse a camouflaged swap or file entry with a good pte: fix that too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: follow_page with inner ptlockHugh Dickins2005-10-291-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Final step in pushing down common core's page_table_lock. follow_page no longer wants caller to hold page_table_lock, uses pte_offset_map_lock itself; and so no page_table_lock is taken in get_user_pages itself. But get_user_pages (and get_futex_key) do then need follow_page to pin the page for them: take Daniel's suggestion of bitflags to follow_page. Need one for WRITE, another for TOUCH (it was the accessed flag before: vanished along with check_user_page_readable, but surely get_numa_maps is wrong to mark every page it finds as accessed), another for GET. And another, ANON to dispose of untouched_anonymous_page: it seems silly for that to descend a second time, let follow_page observe if there was no page table and return ZERO_PAGE if so. Fix minor bug in that: check VM_LOCKED - make_pages_present ought to make readonly anonymous present. Give get_numa_maps a cond_resched while we're there. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: kill check_user_page_readableHugh Dickins2005-10-291-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | check_user_page_readable is a problematic variant of follow_page. It's used only by oprofile's i386 and arm backtrace code, at interrupt time, to establish whether a userspace stackframe is currently readable. This is problematic, because we want to push the page_table_lock down inside follow_page, and later split it; whereas oprofile is doing a spin_trylock on it (in the i386 case, forgotten in the arm case), and needs that to pin perhaps two pages spanned by the stackframe (which might be covered by different locks when we split). I think oprofile is going about this in the wrong way: it doesn't need to know the area is readable (neither i386 nor arm uses read protection of user pages), it doesn't need to pin the memory, it should simply __copy_from_user_inatomic, and see if that succeeds or not. Sorry, but I've not got around to devising the sparse __user annotations for this. Then we can eliminate check_user_page_readable, and return to a single follow_page without the __follow_page variants. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: rmap with inner ptlockHugh Dickins2005-10-291-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rmap's page_check_address descend without page_table_lock. First just pte_offset_map in case there's no pte present worth locking for, then take page_table_lock for the full check, and pass ptl back to caller in the same style as pte_offset_map_lock. __xip_unmap, page_referenced_one and try_to_unmap_one use pte_unmap_unlock. try_to_unmap_cluster also. page_check_address reformatted to avoid progressive indentation. No use is made of its one error code, return NULL when it fails. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: unmap_vmas with inner ptlockHugh Dickins2005-10-292-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the page_table_lock from around the calls to unmap_vmas, and replace the pte_offset_map in zap_pte_range by pte_offset_map_lock: all callers are now safe to descend without page_table_lock. Don't attempt fancy locking for hugepages, just take page_table_lock in unmap_hugepage_range. Which makes zap_hugepage_range, and the hugetlb test in zap_page_range, redundant: unmap_vmas calls unmap_hugepage_range anyway. Nor does unmap_vmas have much use for its mm arg now. The tlb_start_vma and tlb_end_vma in unmap_page_range are now called without page_table_lock: if they're implemented at all, they typically come down to flush_cache_range (usually done outside page_table_lock) and flush_tlb_range (which we already audited for the mprotect case). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: flush_tlb_range outside ptlockHugh Dickins2005-10-291-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was one small but very significant change in the previous patch: mprotect's flush_tlb_range fell outside the page_table_lock: as it is in 2.4, but that doesn't prove it safe in 2.6. On some architectures flush_tlb_range comes to the same as flush_tlb_mm, which has always been called from outside page_table_lock in dup_mmap, and is so proved safe. Others required a deeper audit: I could find no reliance on page_table_lock in any; but in ia64 and parisc found some code which looks a bit as if it might want preemption disabled. That won't do any actual harm, so pending a decision from the maintainers, disable preemption there. Remove comments on page_table_lock from flush_tlb_mm, flush_tlb_range and flush_tlb_page entries in cachetlb.txt: they were rather misleading (what generic code does is different from what usually happens), the rules are now changing, and it's not yet clear where we'll end up (will the generic tlb_flush_mmu happen always under lock? never under lock? or sometimes under and sometimes not?). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
| * | [PATCH] mm: pte_offset_map_lock loopsHugh Dickins2005-10-292-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert those common loops using page_table_lock on the outside and pte_offset_map within to use just pte_offset_map_lock within instead. These all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them. But whereas pte_alloc loops tested with the "atomic" pmd_present, these loops are testing with pmd_none, which on i386 PAE tests both lower and upper halves. That's now unsafe, so add a cast into pmd_none to test only the vital lower half: we lose a little sensitivity to a corrupt middle directory, but not enough to worry about. It appears that i386 and UML were the only architectures vulnerable in this way, and pgd and pud no problem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OpenPOWER on IntegriCloud