| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This extends wait_task_inactive() with a new argument so it can be used in
a "soft" mode where it will check for the task changing state unexpectedly
and back off. There is no change to existing callers. This lays the
groundwork to allow robust, noninvasive tracing that can try to sample a
blocked thread but back off safely if it wakes up.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the generic HAVE_ARCH_TRACEHOOK kconfig item. Each arch should
add to some Kconfig file:
select HAVE_ARCH_TRACEHOOK
if the arch code uses all the latest hooks to enable newfangled tracing
and debugging code. The comment in arch/Kconfig lists all the
prerequisite arch support. When all these are available, setting
HAVE_ARCH_TRACEHOOK will allow enabling any new features that depend on
the modern arch interfaces.
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves all the ptrace hooks related to exec into tracehook.h inlines.
This also lifts the calls for tracing out of the binfmt load_binary hooks
into search_binary_handler() after it calls into the binfmt module. This
change has no effect, since all the binfmt modules' load_binary functions
did the call at the end on success, and now search_binary_handler() does
it immediately after return if successful. We consolidate the repeated
code, and binfmt modules no longer need to import ptrace_notify().
Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Kmem cache passed to constructor is only needed for constructors that are
themselves multiplexeres. Nobody uses this "feature", nor does anybody uses
passed kmem cache in non-trivial way, so pass only pointer to object.
Non-trivial places are:
arch/powerpc/mm/init_64.c
arch/powerpc/mm/hugetlbpage.c
This is flag day, yes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Matt Mackall <mpm@selenic.com>
[akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c]
[akpm@linux-foundation.org: fix mm/slab.c]
[akpm@linux-foundation.org: fix ubifs]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
| |
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement get_user_pages_fast without locking in the fastpath on x86.
Do an optimistic lockless pagetable walk, without taking mmap_sem or any
page table locks or even mmap_sem. Page table existence is guaranteed by
turning interrupts off (combined with the fact that we're always looking
up the current mm, means we can do the lockless page table walk within the
constraints of the TLB shootdown design). Basically we can do this
lockless pagetable walk in a similar manner to the way the CPU's pagetable
walker does not have to take any locks to find present ptes.
This patch (combined with the subsequent ones to convert direct IO to use
it) was found to give about 10% performance improvement on a 2 socket 8
core Intel Xeon system running an OLTP workload on DB2 v9.5
"To test the effects of the patch, an OLTP workload was run on an IBM
x3850 M2 server with 2 processors (quad-core Intel Xeon processors at
2.93 GHz) using IBM DB2 v9.5 running Linux 2.6.24rc7 kernel. Comparing
runs with and without the patch resulted in an overall performance
benefit of ~9.8%. Correspondingly, oprofiles showed that samples from
__up_read and __down_read routines that is seen during thread contention
for system resources was reduced from 2.8% down to .05%. Monitoring the
/proc/vmstat output from the patched run showed that the counter for
fast_gup contained a very high number while the fast_gup_slow value was
zero."
(fast_gup is the old name for get_user_pages_fast, fast_gup_slow is a
counter we had for the number of times the slowpath was invoked).
The main reason for the improvement is that DB2 has multiple threads each
issuing direct-IO. Direct-IO uses get_user_pages, and thus the threads
contend the mmap_sem cacheline, and can also contend on page table locks.
I would anticipate larger performance gains on larger systems, however I
think DB2 uses an adaptive mix of threads and processes, so it could be
that thread contention remains pretty constant as machine size increases.
In which case, we stuck with "only" a 10% gain.
The downside of using get_user_pages_fast is that if there is not a pte
with the correct permissions for the access, we end up falling back to
get_user_pages and so the get_user_pages_fast is a bit of extra work.
However this should not be the common case in most performance critical
code.
[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: Kconfig fix]
[akpm@linux-foundation.org: Makefile fix/cleanup]
[akpm@linux-foundation.org: warning fix]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
| |
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
| |
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
| |
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: Mikael Starvik <starvik@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch implements devices state save/restore before after kexec.
This patch together with features in kexec_jump patch can be used for
following:
- A simple hibernation implementation without ACPI support. You can kexec a
hibernating kernel, save the memory image of original system and shutdown
the system. When resuming, you restore the memory image of original system
via ordinary kexec load then jump back.
- Kernel/system debug through making system snapshot. You can make system
snapshot, jump back, do some thing and make another system snapshot.
- Cooperative multi-kernel/system. With kexec jump, you can switch between
several kernels/systems quickly without boot process except the first time.
This appears like swap a whole kernel/system out/in.
- A general method to call program in physical mode (paging turning
off). This can be used to invoke BIOS code under Linux.
The following user-space tools can be used with kexec jump:
- kexec-tools needs to be patched to support kexec jump. The patches
and the precompiled kexec can be download from the following URL:
source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
- makedumpfile with patches are used as memory image saving tool, it
can exclude free pages from original kernel memory image file. The
patches and the precompiled makedumpfile can be download from the
following URL:
source: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-src_cvs_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile-patches_cvs_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/makedumpfile/makedumpfile_cvs_kh10
- An initramfs image can be used as the root file system of kexeced
kernel. An initramfs image built with "BuildRoot" can be downloaded
from the following URL:
initramfs image: http://khibernation.sourceforge.net/download/release_v10/initramfs/rootfs_cvs_kh10.gz
All user space tools above are included in the initramfs image.
Usage example of simple hibernation:
1. Compile and install patched kernel with following options selected:
CONFIG_X86_32=y
CONFIG_RELOCATABLE=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_PM=y
CONFIG_HIBERNATION=y
CONFIG_KEXEC_JUMP=y
2. Build an initramfs image contains kexec-tool and makedumpfile, or
download the pre-built initramfs image, called rootfs.gz in
following text.
3. Prepare a partition to save memory image of original kernel, called
hibernating partition in following text.
4. Boot kernel compiled in step 1 (kernel A).
5. In the kernel A, load kernel compiled in step 1 (kernel B) with
/sbin/kexec. The shell command line can be as follow:
/sbin/kexec --load-preserve-context /boot/bzImage --mem-min=0x100000
--mem-max=0xffffff --initrd=rootfs.gz
6. Boot the kernel B with following shell command line:
/sbin/kexec -e
7. The kernel B will boot as normal kexec. In kernel B the memory
image of kernel A can be saved into hibernating partition as
follow:
jump_back_entry=`cat /proc/cmdline | tr ' ' '\n' | grep kexec_jump_back_entry | cut -d '='`
echo $jump_back_entry > kexec_jump_back_entry
cp /proc/vmcore dump.elf
Then you can shutdown the machine as normal.
8. Boot kernel compiled in step 1 (kernel C). Use the rootfs.gz as
root file system.
9. In kernel C, load the memory image of kernel A as follow:
/sbin/kexec -l --args-none --entry=`cat kexec_jump_back_entry` dump.elf
10. Jump back to the kernel A as follow:
/sbin/kexec -e
Then, kernel A is resumed.
Implementation point:
To support jumping between two kernels, before jumping to (executing)
the new kernel and jumping back to the original kernel, the devices
are put into quiescent state, and the state of devices and CPU is
saved. After jumping back from kexeced kernel and jumping to the new
kernel, the state of devices and CPU are restored accordingly. The
devices/CPU state save/restore code of software suspend is called to
implement corresponding function.
Known issues:
- Because the segment number supported by sys_kexec_load is limited,
hibernation image with many segments may not be load. This is
planned to be eliminated by adding a new flag to sys_kexec_load to
make a image can be loaded with multiple sys_kexec_load invoking.
Now, only the i386 architecture is supported.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch provides an enhancement to kexec/kdump. It implements the
following features:
- Backup/restore memory used by the original kernel before/after
kexec.
- Save/restore CPU state before/after kexec.
The features of this patch can be used as a general method to call program in
physical mode (paging turning off). This can be used to call BIOS code under
Linux.
kexec-tools needs to be patched to support kexec jump. The patches and
the precompiled kexec can be download from the following URL:
source: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-src_git_kh10.tar.bz2
patches: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec-tools-patches_git_kh10.tar.bz2
binary: http://khibernation.sourceforge.net/download/release_v10/kexec-tools/kexec_git_kh10
Usage example of calling some physical mode code and return:
1. Compile and install patched kernel with following options selected:
CONFIG_X86_32=y
CONFIG_KEXEC=y
CONFIG_PM=y
CONFIG_KEXEC_JUMP=y
2. Build patched kexec-tool or download the pre-built one.
3. Build some physical mode executable named such as "phy_mode"
4. Boot kernel compiled in step 1.
5. Load physical mode executable with /sbin/kexec. The shell command
line can be as follow:
/sbin/kexec --load-preserve-context --args-none phy_mode
6. Call physical mode executable with following shell command line:
/sbin/kexec -e
Implementation point:
To support jumping without reserving memory. One shadow backup page (source
page) is allocated for each page used by kexeced code image (destination
page). When do kexec_load, the image of kexeced code is loaded into source
pages, and before executing, the destination pages and the source pages are
swapped, so the contents of destination pages are backupped. Before jumping
to the kexeced code image and after jumping back to the original kernel, the
destination pages and the source pages are swapped too.
C ABI (calling convention) is used as communication protocol between
kernel and called code.
A flag named KEXEC_PRESERVE_CONTEXT for sys_kexec_load is added to
indicate that the loaded kernel image is used for jumping back.
Now, only the i386 architecture is supported.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Nigel Cunningham <nigel@nigel.suspend2.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The calgary code can give drivers addresses above 4GB which is very bad
for hardware that is only 32bit DMA addressable.
With this patch, the calgary code sets the global dma_ops to swiotlb or
nommu properly, and the dma_ops of devices behind the Calgary/CalIOC2
to calgary_dma_ops. So the calgary code can handle devices safely that
aren't behind the Calgary/CalIOC2.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Alexis Bruemmer <alexisb@us.ibm.com>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add per-device dma_mapping_ops support for CONFIG_X86_64 as POWER
architecture does:
This enables us to cleanly fix the Calgary IOMMU issue that some devices
are not behind the IOMMU (http://lkml.org/lkml/2008/5/8/423).
I think that per-device dma_mapping_ops support would be also helpful for
KVM people to support PCI passthrough but Andi thinks that this makes it
difficult to support the PCI passthrough (see the above thread). So I
CC'ed this to KVM camp. Comments are appreciated.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's
NULL, the system-wide dma_ops pointer is used as before.
If it's useful for KVM people, I plan to implement a mechanism to register
a hook called when a new pci (or dma capable) device is created (it works
with hot plugging). It enables IOMMUs to set up an appropriate
dma_mapping_ops per device.
The major obstacle is that dma_mapping_error doesn't take a pointer to the
device unlike other DMA operations. So x86 can't have dma_mapping_ops per
device. Note all the POWER IOMMUs use the same dma_mapping_error function
so this is not a problem for POWER but x86 IOMMUs use different
dma_mapping_error functions.
The first patch adds the device argument to dma_mapping_error. The patch
is trivial but large since it touches lots of drivers and dma-mapping.h in
all the architecture.
This patch:
dma_mapping_error() doesn't take a pointer to the device unlike other DMA
operations. So we can't have dma_mapping_ops per device.
Note that POWER already has dma_mapping_ops per device but all the POWER
IOMMUs use the same dma_mapping_error function. x86 IOMMUs use device
argument.
[akpm@linux-foundation.org: fix sge]
[akpm@linux-foundation.org: fix svc_rdma]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix bnx2x]
[akpm@linux-foundation.org: fix s2io]
[akpm@linux-foundation.org: fix pasemi_mac]
[akpm@linux-foundation.org: fix sdhci]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc]
[akpm@linux-foundation.org: fix ibmvscsi]
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
The new code in commit 5cbf1565f29eb57a86a305b08836613508e294d7
has a bug in the version supporting the AMD 'syscall' instruction.
It clobbers the user's %ecx register value (with the %ebp value).
This change fixes it.
Signed-off-by: Roland McGrath <roland@redhat.com>
|
|\
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Fix boot problem due to AT_BASE_PLATFORM change
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit 9115d13453dee22473a1e8cacc90a8d64a9c4bc9 ("powerpc: Enable
AT_BASE_PLATFORM aux vector") broke boot on 32-bit powerpc systems; we
have to use PTRRELOC to initialize powerpc_base_platform this early in
boot.
Bug reported by Jon Smirl.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|\ \
| | |
| | |
| | |
| | | |
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc: Wire up new system calls.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This wires up the recently added Wire up signalfd4, eventfd2,
epoll_create1, dup3, pipe2, and inotify_init1 system calls.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \ \
| |_|/
|/| |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] Wire up new system calls
|
| |/
| |
| |
| |
| |
| |
| | |
Six new system calls: signalfd4, eventfd2, epoll_create1,
dup3, pipe2 and inotify_init1.
Signed-off-by: Tony Luck <tony.luck@intel.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (34 commits)
powerpc: Wireup new syscalls
Move update_mmu_cache() declaration from tlbflush.h to pgtable.h
powerpc/pseries: Remove kmalloc call in handling writes to lparcfg
powerpc/pseries: Update arch vector to indicate support for CMO
ibmvfc: Add support for collaborative memory overcommit
ibmvscsi: driver enablement for CMO
ibmveth: enable driver for CMO
ibmveth: Automatically enable larger rx buffer pools for larger mtu
powerpc/pseries: Verify CMO memory entitlement updates with virtual I/O
powerpc/pseries: vio bus support for CMO
powerpc/pseries: iommu enablement for CMO
powerpc/pseries: Add CMO paging statistics
powerpc/pseries: Add collaborative memory manager
powerpc/pseries: Utilities to set firmware page state
powerpc/pseries: Enable CMO feature during platform setup
powerpc/pseries: Split retrieval of processor entitlement data into a helper routine
powerpc/pseries: Add memory entitlement capabilities to /proc/ppc64/lparcfg
powerpc/pseries: Split processor entitlement retrieval and gathering to helper routines
powerpc/pseries: Remove extraneous error reporting for hcall failures in lparcfg
powerpc: Fix compile error with binutils 2.15
...
Fixed up conflict in arch/powerpc/platforms/52xx/Kconfig manually.
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
There are only 4 valid name=value pairs for writes to
/proc/ppc64/lparcfg. Current code allocates a buffer to copy
this information in from the user. Since the longest name=value
pair will easily fit into a buffer of 64 characters, simply
put the buffer on the stack instead of allocating the buffer.
Signed-off-by: Nathan Fotenot <nfont@austin.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Update the architecture vector to indicate that Cooperative Memory
Overcommitment is supported if CONFIG_PPC_SMLPAR is set.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Verify memory entitlement updates can be handled by vio.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This is a large patch but the normal code path is not affected. For
non-pSeries platforms the code is ifdef'ed out and for non-CMO enabled
pSeries systems this does not affect the normal code path. Devices that
do not perform DMA operations do not need modification with this patch.
The function get_desired_dma was renamed from get_io_entitlement for
clarity.
Overview
Cooperative Memory Overcommitment (CMO) allows for a set of OS partitions
to be run with less RAM than the aggregate needs of the group of
partitions. The firmware will balance memory between the partitions
and page in/out memory as needed. Based on the number and type of IO
adpaters preset each partition is allocated an amount of memory for
DMA operations and this allocation will be guaranteed to the partition;
this is referred to as the partition's 'entitlement'.
Partitions running in a CMO environment can only have virtual IO devices
present. The VIO bus layer will manage the IO entitlement for the system.
Accounting, at a system and per-device level, is tracked in the VIO bus
code and exposed via sysfs. A set of dma_ops functions are added to
the bus to allow for this accounting.
Bus initialization
At initialization, the bus will calculate the minimum needs of the system
based on providing each device present with a standard minimum entitlement
along with a spare allocation for the bus to handle hotplug events.
If the minimum needs can not be met the system boot will be halted.
Device changes
The significant changes for devices while running under CMO are that the
devices must specify how much dedicated IO entitlement they desire and
must also handle DMA mapping errors that can occur due to constrained
IO memory. The virtual IO drivers are modified to silence errors when
DMA mappings fail for CMO and handle these failures gracefully.
Each devices will be guaranteed a minimum entitlement that can always
be mapped. Devices will specify how much entitlement they desire and
the VIO bus will attempt to provide for this. Devices can change their
desired entitlement level at any point in time to address particular needs
(via vio_cmo_set_dev_desired()), not just at device probe time.
VIO bus changes
The system will have a particular entitlement level available from which
it can provide memory to the devices. The bus defines two pools of memory
within this entitlement, the reserved and excess pools. Each device is
provided with it's own entitlement no less than a system defined minimum
entitlement and no greater than what the device has specified as it's
desired entitlement. The entitlement provided to devices comes from the
reserve pool. The reserve pool can also contain a spare allocation as
large as the system defined minimum entitlement which is used for device
hotplug events. Any entitlement not needed to fulfill the needs of a
reserve pool is placed in the excess pool. Each device is guaranteed
that it can map up to it's entitled level; additional mapping are possible
as long as there is unmapped memory in the excess pool.
Bus probe
As the system starts, each device is given an entitlement equal only
to the system defined minimum entitlement. The reserve pool is equal
to the sum of these entitlements, plus a spare allocation. The VIO bus
also tracks the aggregate desired entitlement of all the devices. If the
system desired entitlement is greater than the size of the reserve pool,
when devices unmap IO memory it will be reserved and a balance operation
will be scheduled for some time in the future.
Entitlement balancing
The balance function tries to fairly distribute entitlement between the
devices in the system with the goal of providing each device with it's
desired amount of entitlement. Devices using more than what would be
ideal will have their entitled set-point adjusted; this will effectively
set a goal for lower IO memory usage as future mappings can fail and
deallocations will trigger a balance operation to distribute the newly
unmapped memory. A fair distribution of entitlement can take several
balance operations to achieve. Entitlement changes and device DLPAR
events will alter the state of CMO and will trigger balance operations.
Hotplug events
The VIO bus allows for changes in system entitlement at run-time via
'vio_cmo_entitlement_update()'. When devices are added the hotplug
device event will be preceded by a system entitlement increase and this
is reversed when devices are removed.
The following changes are made that the VIO bus layer for CMO:
* add IO memory accounting per device structure.
* add IO memory entitlement query function to driver structure.
* during vio bus probe, if CMO is enabled, check that driver has
memory entitlement query function defined. Fail if function not defined.
* fail to register driver if io entitlement function not defined.
* create set of dma_ops at vio level for CMO that will track allocations
and return DMA failures once entitlement is reached. Entitlement will
limited by overall system entitlement. Devices will have a reserved
quantity of memory that is guaranteed, the rest can be used as available.
* expose entitlement, current allocation, desired allocation, and the
allocation error counter for devices to the user through sysfs
* provide mechanism for changing a device's desired entitlement at run time
for devices as an exported function and sysfs tunable
* track any DMA failures for entitled IO memory for each vio device.
* check entitlement against available system entitlement on device add
* track entitlement metrics (high water mark, current usage)
* provide function to reset high water mark
* provide minimum and desired entitlement numbers at a bus level
* provide drivers with a minimum guaranteed entitlement
* balance available entitlement between devices to satisfy their needs
* handle system entitlement changes and device hotplug
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
To support Cooperative Memory Overcommitment (CMO), we need to check
for failure from some of the tce hcalls.
These changes for the pseries platform affect the powerpc architecture;
patches for the other affected platforms are included in this patch.
pSeries platform IOMMU code changes:
* platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors and
return an error.
Architecture IOMMU code changes:
* Calls to ppc_md.tce_build need to check return values and return
DMA_MAPPING_ERROR for transient errors.
Architecture changes:
* struct machdep_calls for tce_build*_pSeriesLP functions need to change
to indicate failure.
* all other platforms will need updates to iommu functions to match the new
calling semantics; they will return 0 on success. The other platforms
default configs have been built, but no further testing was performed.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
With the addition of Cooperative Memory Overcommitment (CMO) support
for IBM Power Systems, two fields have been added to the VPA to report
paging statistics. Add support in lparcfg to report them to userspace.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Adds a collaborative memory manager, which acts as a simple balloon driver
for System p machines that support cooperative memory overcommitment
(CMO).
Adds a platform configuration option for CMO called PPC_SMLPAR.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Newer versions of firmware support page states, which are used by the
collaborative memory manager (future patch) to "loan" pages to the
hypervisor for use by other partitions.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For Cooperative Memory Overcommitment (CMO), set the FW_FEATURE_CMO
flag in powerpc_firmware_features from the rtas ibm,get-system-parameters
table prior to calling iommu_init_early_pSeries.
With this, any CMO specific functionality can be controlled by checking:
firmware_has_feature(FW_FEATURE_CMO)
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
routine
Split the retrieval of processor entitlement data returned in the H_GET_PPP
hcall into its own helper routine.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Update /proc/ppc64/lparcfg to display Cooperative Memory
Overcommitment statistics as reported by the H_GET_MPP hcall. This
also updates the lparcfg interface to allow setting memory entitlement
and weight.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
helper routines
Split the retrieval and setting of processor entitlement and weight into
helper routines. This also removes the printing of the raw values
returned from h_get_ppp, the values are already parsed and printed.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Remove the extraneous error reporting used when a hcall made from lparcfg fails.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
My previous patch to fix compilation with binutils-2.17 causes
a "file truncated" build error from ld with binutils 2.15 (and
possibly older), and a warning with 2.16 and 2.17.
This fixes it.
Signed-off-by: Segher Boessenkool <segher@kernel.crashing.org>
Acked-by: Chuck Meade <chuckmeade@mindspring.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
At the moment the fixed mapping is by default strongly ordered (the
iommu_fixed=weak boot option must be used to make the fixed mapping weakly
ordered). If we're on a setup where the southbridge is being used in
endpoint mode (triblade and CAB boards) the default should be a weakly
ordered fixed mapping.
This adds a check so that if a node of type pcie-endpoint can be found in
the device tree the fixed mapping is set to be weak by default (but can be
overridden using iommu_fixed=strong).
Signed-off-by: Mark Nelson <markn@au1.ibm.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch implements support for HW based watchpoint via the
DBSR_DAC (Data Address Compare) facility of the BookE processors.
It does so by interfacing with the existing DABR breakpoint code
and adding the necessary bits and pieces for the new bits to
be properly set or cleared
Signed-off-by: Luis Machado <luisgpm@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
A struct sysdev_attribute * parameter was added to the show routine by
commit 4a0b2b4dbe1335b8b9886ba3dc85a145d5d938ed "sysdev: Pass the
attribute to the low level sysdev show/store function".
This eliminates a warning:
arch/powerpc/kernel/sysfs.c:538: warning: initialization from incompatible pointer type
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Stash the first platform string matched by identify_cpu() in
powerpc_base_platform, and supply that to the ELF loader for the value
of AT_BASE_PLATFORM.
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
| |\ \ |
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This patch adjusts the placement of a reference context from
a spu affinity chain. The reference context can now be placed
only on nodes that have enough spus not intended to be used by
another gang (already running on the node).
Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currenlt,, it is possible to lock aff_mutex and
cbe_spu_info[n].list_mutex in different orders, allowing a deadlock to
occur. With this change, aff_mutex is not taken within a list_mutex
critical section anymore.
Signed-off-by: Andre Detsch <adetsch@br.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
kcalloc is supposed to be called with the count as its first argument and
the element size as the second.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
|
| |\ \ \
| | |_|/
| |/| | |
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
CONFIG_GENERIC_GPIO is needed for the gpio driver to work.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
| | | |
| | | |
| | | |
| | | |
| | | | |
Signed-off-by: Jon Smirl <jonsmirl@gmail.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The h8300 and sparc options somehow survived when the code stopped using
CONFIG_UNIX98_PTY_COUNT.
Reviewed-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
kdump kernel fails to boot with calgary iommu and aacraid driver on a x366
box. The ongoing dma's of aacraid from the first kernel continue to exist
until the driver is loaded in the kdump kernel. Calgary is initialized
prior to aacraid and creation of new tce tables causes wrong dma's to
occur. Here we try to get the tce tables of the first kernel in kdump
kernel and use them. While in the kdump kernel we do not allocate new tce
tables but instead read the base address register contents of calgary
iommu and use the tables that the registers point to. With these changes
the kdump kernel and hence aacraid now boots normally.
Signed-off-by: Chandru Siddalingappa <chandru@in.ibm.com>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Now that it is safe to use get_online_cpus() we can revert
[S390] cpu topology: Fix possible deadlock.
commit: fd781fa25c9e9c6fd1599df060b05e7c4ad724e5
and call arch_reinit_sched_domains() directly from topology_work_fn().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Remove some unused #include <linux/dirent.h>'s.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|