summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* init: make rootdelay=N consistent with rootwait behaviourPaul Gortmaker2014-08-081-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently rootdelay=N and rootwait behave differently (aside from the obvious unbounded wait duration) because they are at different places in the init sequence. The difference manifests itself for md devices because the call to md_run_setup() lives between rootdelay and rootwait, so if you try to use rootdelay=20 to try and allow a slow RAID0 array to assemble, you get this: [ 4.526011] sd 6:0:0:0: [sdc] Attached SCSI removable disk [ 22.972079] md: Waiting for all devices to be available before autodetect i.e. you've achieved nothing other than delaying the probing 20s, when what you wanted was a 20s delay _after_ the probing for md devices was initiated. Here we move the rootdelay code to be right beside the rootwait code, so that their behaviour is consistent. It should be noted that in doing so, the actions based on the saved_root_name[0] and initrd_load() were previously put on hold by rootdelay=N and now currently will not be delayed. However, I think consistent behaviour is more important than matching historical behaviour of delaying the above two operations. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ramfs/file-nommu.c: replace count*size kzalloc by kcallocFabian Frederick2014-08-081-1/+1
| | | | | | | Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Axel Lin <axel.lin@ingics.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel/kallsyms.c: fix %pB when there's no symbol at the addressNamhyung Kim2014-08-081-1/+1
| | | | | | | | | | | | | | | | | | | | | __sprint_symbol() should restore original address when kallsyms_lookup() failed to find a symbol. It's reported when dumpstack shows an address in a dynamically allocated trampoline for ftrace. [ 1314.612287] [<ffffffff81700312>] dump_stack+0x45/0x56 [ 1314.612290] [<ffffffff8125f5b0>] ? meminfo_proc_open+0x30/0x30 [ 1314.612293] [<ffffffffa080a494>] kpatch_ftrace_handler+0x14/0xf0 [kpatch] [ 1314.612306] [<ffffffffa00160c4>] 0xffffffffa00160c3 You can see a difference in the hex address - c4 and c3. Fix it. Signed-off-by: Namhyung Kim <namhyung@kernel.org> Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/efs/namei.c: return is not a functionFabian Frederick2014-08-081-5/+6
| | | | | | | | | | Fix checkpatch errors: "ERROR: return is not a function, parentheses are not required" Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/zswap.c: add __init to zswap_entry_cache_destroy()Fabian Frederick2014-08-081-2/+2
| | | | | | | | | | | zswap_entry_cache_destroy() is only called by __init init_zswap(). This patch also fixes function name zswap_entry_cache_ s/destory/destroy Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: avoid charge statistics churn during page migrationJohannes Weiner2014-08-081-25/+10
| | | | | | | | | | | | | | | Charge migration currently disables IRQs twice to update the charge statistics for the old page and then again for the new page. But migration is a seamless transition of a charge from one physical page to another one of the same size, so this should be a non-event from an accounting point of view. Leave the statistics alone. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* memcg: remove lookup_cgroup_page() prototypeGreg Thelen2014-08-081-1/+0
| | | | | | | | | | | | | Commit 6b208e3f6e35 ("mm: memcg: remove unused node/section info from pc->flags") deleted lookup_cgroup_page() but left a prototype for it. Kill the vestigial prototype. Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-cgroup: get rid of NR_PCG_FLAGSVladimir Davydov2014-08-082-8/+0
| | | | | | | | | | It's not used anywhere today, so let's remove it. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* page-cgroup: trivial cleanupVladimir Davydov2014-08-081-11/+11
| | | | | | | | | | | | | | Add forward declarations for struct pglist_data, mem_cgroup. Remove __init, __meminit from function prototypes and inline functions. Remove redundant inclusion of bit_spinlock.h. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: use page lists for uncharge batchingJohannes Weiner2014-08-086-129/+117
| | | | | | | | | | | | | | | | | | | Pages are now uncharged at release time, and all sources of batched uncharges operate on lists of pages. Directly use those lists, and get rid of the per-task batching state. This also batches statistics accounting, in addition to the res counter charges, to reduce IRQ-disabling and re-enabling. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: rewrite uncharge APIJohannes Weiner2014-08-0816-768/+389
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: Jet Chen <jet.chen@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: memcontrol: rewrite charge APIJohannes Weiner2014-08-0812-395/+338
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vm_is_stack: use for_each_thread() rather then buggy while_each_thread()Oleg Nesterov2014-08-081-6/+3
| | | | | | | | | | | | | | | | Aleksei hit the soft lockup during reading /proc/PID/smaps. David investigated the problem and suggested the right fix. while_each_thread() is racy and should die, this patch updates vm_is_stack(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com> Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com> Suggested-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "slab: remove BAD_ALIEN_MAGIC"Joonsoo Kim2014-08-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC"). commit a640616822b2 ("slab: remove BAD_ALIEN_MAGIC") assumes that the system with !CONFIG_NUMA has only one memory node. But, it turns out to be false by the report from Geert. His system, m68k, has many memory nodes and is configured in !CONFIG_NUMA. So it couldn't boot with above change. Here goes his failure report. With latest mainline, I'm getting a crash during bootup on m68k/ARAnyM: enable_cpucache failed for radix_tree_node, error 12. kernel BUG at /scratch/geert/linux/linux-m68k/mm/slab.c:1522! *** TRAP #7 *** FORMAT=0 Current process id is 0 BAD KERNEL TRAP: 00000000 Modules linked in: PC: [<0039c92c>] kmem_cache_init_late+0x70/0x8c SR: 2200 SP: 00345f90 a2: 0034c2e8 d0: 0000003d d1: 00000000 d2: 00000000 d3: 003ac942 d4: 00000000 d5: 00000000 a0: 0034f686 a1: 0034f682 Process swapper (pid: 0, task=0034c2e8) Frame format=0 Stack from 00345fc4: 002f69ef 002ff7e5 000005f2 000360fa 0017d806 003921d4 00000000 00000000 00000000 00000000 00000000 00000000 003ac942 00000000 003912d6 Call Trace: [<000360fa>] parse_args+0x0/0x2ca [<0017d806>] strlen+0x0/0x1a [<003921d4>] start_kernel+0x23c/0x428 [<003912d6>] _sinittext+0x2d6/0x95e Code: f7e5 4879 002f 69ef 61ff ffca 462a 4e47 <4879> 0035 4b1c 61ff fff0 0cc4 7005 23c0 0037 fd20 588f 265f 285f 4e75 48e7 301c Disabling lock debugging due to kernel taint Kernel panic - not syncing: Attempted to kill the idle task! Although there is a alternative way to fix this issue such as disabling use of alien cache on !CONFIG_NUMA, but, reverting issued commit is better to me in this time. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'microblaze-3.17-rc1' of git://git.monstr.eu/linux-2.6-microblazeLinus Torvalds2014-08-074-18/+28
|\ | | | | | | | | | | | | | | | | | | | | | | | | Pull microblaze updates from Michal Simek: - add new syscall and fix comment - fix udelay implementation - fix libgcc for modules * tag 'microblaze-3.17-rc1' of git://git.monstr.eu/linux-2.6-microblaze: microblaze: Change libgcc-style functions from lib-y to obj-y microblaze: Wire-up renameat2 syscall microblaze: Add syscall number comment microblaze: delay.h fix udelay and ndelay for 8 bit args
| * microblaze: Change libgcc-style functions from lib-y to obj-yDavid Holsgrove2014-07-181-11/+3
| | | | | | | | | | | | | | | | | | | | | | | | Following precedence set by MIPs: "[MIPS] Change libgcc-style functions from lib-y to obj-y" (sha1: f7c2778151f32581ea9ec567d01d5d85209fcfe6), switch the goal definition for kbuild to obj-y to ensure object files linked in vmlinux if only modules were users of these functions. Signed-off-by: David Holsgrove <david.holsgrove@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com>
| * microblaze: Wire-up renameat2 syscallMichal Simek2014-07-092-0/+2
| | | | | | | | | | | | Add new renameat2 syscall. Signed-off-by: Michal Simek <michal.simek@xilinx.com>
| * microblaze: Add syscall number commentMichal Simek2014-07-091-1/+1
| | | | | | | | | | | | Trivial fix. Signed-off-by: Michal Simek <michal.simek@xilinx.com>
| * microblaze: delay.h fix udelay and ndelay for 8 bit argsMichal Simek2014-07-091-6/+22
| | | | | | | | | | | | | | | | Based on: "asm-generic: delay.h fix udelay and ndelay for 8 bit args" (sha1: a87e553fabe8ceadc6f90889066559234cf194c7) Signed-off-by: Michal Simek <michal.simek@xilinx.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-08-071-3/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32 Pull avr32 fix from Hans-Christian Egtvedt. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/egtvedt/linux-avr32: avr32: fix error return code
| * | avr32: fix error return codeJulia Lawall2014-08-071-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert a zero return value on error to a negative one, as returned elsewhere in the function. A simplified version of the semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier ret; expression e1,e2; @@ ( if (\(ret < 0\|ret != 0\)) { ... return ret; } | ret = 0 ) ... when != ret = e1 when != &ret *if(...) { ... when != ret = e2 when forall return ret; } // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no>
* | | Merge branch 'next' of ↵Linus Torvalds2014-08-07161-3284/+5477
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc Pull powerpc updates from Ben Herrenschmidt: "This is the powerpc new goodies for 3.17. The short story: The biggest bit is Michael removing all of pre-POWER4 processor support from the 64-bit kernel. POWER3 and rs64. This gets rid of a ton of old cruft that has been bitrotting in a long while. It was broken for quite a few versions already and nobody noticed. Nobody uses those machines anymore. While at it, he cleaned up a bunch of old dusty cabinets, getting rid of a skeletton or two. Then, we have some base VFIO support for KVM, which allows assigning of PCI devices to KVM guests, support for large 64-bit BARs on "powernv" platforms, support for HMI (Hardware Management Interrupts) on those same platforms, some sparse-vmemmap improvements (for memory hotplug), There is the usual batch of Freescale embedded updates (summary in the merge commit) and fixes here or there, I think that's it for the highlights" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (102 commits) powerpc/eeh: Export eeh_iommu_group_to_pe() powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_API powerpc: Reduce scariness of interrupt frames in stack traces powerpc: start loop at section start of start in vmemmap_populated() powerpc: implement vmemmap_free() powerpc: implement vmemmap_remove_mapping() for BOOK3S powerpc: implement vmemmap_list_free() powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTE powerpc/book3s: Fix endianess issue for HMI handling on napping cpus. powerpc/book3s: handle HMIs for cpus in nap mode. powerpc/powernv: Invoke opal call to handle hmi. powerpc/book3s: Add basic infrastructure to handle HMI in Linux. powerpc/iommu: Fix comments with it_page_shift powerpc/powernv: Handle compound PE in config accessors powerpc/powernv: Handle compound PE for EEH powerpc/powernv: Handle compound PE powerpc/powernv: Split ioda_eeh_get_state() powerpc/powernv: Allow to freeze PE powerpc/powernv: Enable M64 aperatus for PHB3 powerpc/eeh: Aux PE data for error log ...
| * | | powerpc/eeh: Export eeh_iommu_group_to_pe()Gavin Shan2014-08-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function is used by VFIO driver, which might be built as a dynamic module. So it should be exported. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Add missing #ifdef CONFIG_IOMMU_APIBenjamin Herrenschmidt2014-08-051-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some new functions are exposed for use by the IOMMU code but won't build when CONFIG_IOMMU_API isn't set, so shield them appropriately. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: Reduce scariness of interrupt frames in stack tracesPaul Mackerras2014-08-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some people see things like "Exception: 501" in stack traces in dmesg and assume that means that something has gone badly wrong, when in fact "Exception: 501" just means a device interrupt was taken. This changes "Exception" to "interrupt" to make it clearer that we are just recording the fact of a change in control flow rather than some error condition. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: start loop at section start of start in vmemmap_populated()Li Zhong2014-08-051-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmemmap_populated() checks whether the [start, start + page_size) has valid pfn numbers, to know whether a vmemmap mapping has been created that includes this range. Some range before end might not be checked by this loop: sec11start......start11..sec11end/sec12start..end....start12..sec12end as the above, for start11(section 11), it checks [sec11start, sec11end), and loop ends as the next start(start12) is bigger than end. However, [sec11end/sec12start, end) is not checked here. So before the loop, adjust the start to be the start of the section, so we don't miss ranges like the above. After we adjust start to be the start of the section, it also means it's aligned with vmemmap as of the sizeof struct page, so we could use page_to_pfn directly in the loop. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: implement vmemmap_free()Li Zhong2014-08-051-21/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vmemmap_free() does the opposite of vmemap_populate(). This patch also puts vmemmap_free() and vmemmap_list_free() into CONFIG_MEMMORY_HOTPLUG. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: implement vmemmap_remove_mapping() for BOOK3SLi Zhong2014-08-052-1/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is to be called in vmemmap_free(), leave the implementation on BOOK3E empty as before. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: implement vmemmap_list_free()Li Zhong2014-08-051-10/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements vmemmap_list_free() for vmemmap_free(). The freed entries will be removed from vmemmap_list, and form a freed list, with next as the header. The next position in the last allocated page is kept at the list tail. When allocation, if there are freed entries left, get it from the freed list; if no freed entries left, get it like before from the last allocated pages. With this change, realmode_pfn_to_page() also needs to be changed to walk all the entries in the vmemmap_list, as the virt_addr of the entries might not be stored in order anymore. It helps to reuse the memory when continuous doing memory hot-plug/remove operations, but didn't reclaim the pages already allocated, so the memory usage will only increase, but won't exceed the value for the largest memory configuration. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Acked-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc: Fail remap_4k_pfn() if PFN doesn't fit inside PTEMadhusudanan Kandasamy2014-08-051-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | remap_4k_pfn() silently truncates upper bits of input 4K PFN if it cannot be contained in PTE. This leads invalid memory mapping and could result in a system crash when the memory is accessed. This patch fails remap_4k_pfn() and returns -EINVAL if the input 4K PFN cannot be contained in PTE. V3 : Added parentheses to protect 'pfn' and entire macro as suggested by Brian. V2 : Rewritten to avoid helper function as suggested by Stephen Rothwell. Signed-off-by: Madhusudanan Kandasamy <kmadhu@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/book3s: Fix endianess issue for HMI handling on napping cpus.Mahesh Salgaonkar2014-08-051-12/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (NOTE: This patch depends on upstream HMI handling patchset at https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-July/119731.html) The current HMI handling on napping cpus does not take care of endianess issue. On LE host kernel when we wake up from nap due to HMI interrupt we would checkstop while jumping into opal call. There is a similar issue in case of fast sleep wakeup where the code invokes opal_resync_tb opal call without handling LE issue. This patch fixes that as well. With this patch applied, HMIs handling on LE host kernel works fine. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/book3s: handle HMIs for cpus in nap mode.Mahesh Salgaonkar2014-08-051-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HMIs are thread specific and can come while thread is in sleep/nap mode. Hence with SMT=off mode we can receive HMIs on sleeping threads. For interrupt received in nap mode, cpu wakes up at system reset vector, clears the interrupt and go back to nap mode again. But HMIs are sticky and they keep happening until we clear reason bits from HMER. Hence add a special check for HMI in reset vector (through power7_wakeup_* functions) and invoke opal call to handle HMI. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Invoke opal call to handle hmi.Mahesh Salgaonkar2014-08-056-7/+267
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we hit the HMI in Linux, invoke opal call to handle/recover from HMI errors in real mode and then in virtual mode during check_irq_replay() invoke opal_poll_events()/opal_do_notifier() to retrieve HMI event from OPAL and act accordingly. Now that we are ready to handle HMI interrupt directly in linux, remove the HMI interrupt registration with firmware. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/book3s: Add basic infrastructure to handle HMI in Linux.Mahesh Salgaonkar2014-08-0513-3/+139
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handle Hypervisor Maintenance Interrupt (HMI) in Linux. This patch implements basic infrastructure to handle HMI in Linux host. The design is to invoke opal handle hmi in real mode for recovery and set irq_pending when we hit HMI. During check_irq_replay pull opal hmi event and print hmi info on console. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/iommu: Fix comments with it_page_shiftAlexey Kardashevskiy2014-08-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a couple of commented debug prints which still use IOMMU_PAGE_SHIFT() which is not defined for POWERPC anymore, replace them with it_page_shift. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Handle compound PE in config accessorsGavin Shan2014-08-051-28/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PCI config accessors check for PE frozen state and clear it if EEH isn't functional. The patch handles compound PE in config accessors if PHB supports it. For consistency, all PEs will be put into frozen state if any one in compound group gets frozen by hardware. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Handle compound PE for EEHGavin Shan2014-08-051-47/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch handles compound PE for EEH backend. If one specific PE in compound group has been frozen, we enforces to freeze all PEs in the group. If we're enable DMA or MMIO for one PE in compound group, DMA or MMIO of all PEs in the group will be enabled. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Handle compound PEGavin Shan2014-08-052-0/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch introduces 3 PHB callbacks: compound PE state retrieval, force freezing and unfreezing compound PE. The PCI config accessors and PowerNV EEH backend can use them in subsequent patches. We don't export the capability of compound PE to EEH core, which helps avoiding more complexity to EEH core. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Split ioda_eeh_get_state()Gavin Shan2014-08-051-80/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function ioda_eeh_get_state() is used to fetch EEH state for PHB or PE. We're going to support compound PE and the function becomes more complicated with that. The patch splits the function into two functions for PHB and PE cases separately to improve readability. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Allow to freeze PEGavin Shan2014-08-052-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch synchronizes header file with firmware to have new OPAL API opal_pci_eeh_freeze_set(), which is used to freeze the specified PE in order to support "compound" PE. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Enable M64 aperatus for PHB3Guo Chao2014-08-053-22/+307
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch enables M64 aperatus for PHB3. We already had platform hook (ppc_md.pcibios_window_alignment) to affect the PCI resource assignment done in PCI core so that each PE's M32 resource was built on basis of M32 segment size. Similarly, we're using that for M64 assignment on basis of M64 segment size. * We're using last M64 BAR to cover M64 aperatus, and it's shared by all 256 PEs. * We don't support P7IOC yet. However, some function callbacks are added to (struct pnv_phb) so that we can reuse them on P7IOC in future. * PE, corresponding to PCI bus with large M64 BAR device attached, might span multiple M64 segments. We introduce "compound" PE to cover the case. The compound PE is a list of PEs and the master PE is used as before. The slave PEs are just for MMIO isolation. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Aux PE data for error logGavin Shan2014-08-054-15/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch allows PE (struct eeh_pe) instance to have auxillary data, whose size is configurable on basis of platform. For PowerNV, the auxillary data will be used to cache PHB diag-data for that PE (frozen PE or fenced PHB). In turn, we can retrieve the diag-data at any later points. It's useful for the case of VFIO PCI devices where the error log should be cached, and then be retrieved by the guest at later point. Also, it can avoid PHB diag-data overwritting if another frozen PE reported and the previous diag-data isn't fetched by guest. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Make diag-data not endian dependentGavin Shan2014-08-053-108/+139
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's followup of commit ddf0322a ("powerpc/powernv: Fix endianness problems in EEH"). The patch helps to get non-endian-dependent diag-data. Cc: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Replace pr_warning() with pr_warn()Gavin Shan2014-08-058-44/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pr_warn() is equal to pr_warning(), but the former is a bit more formal according to commit fc62f2f ("kernel.h: add pr_warn for symmetry to dev_warn, netdev_warn"). The patch replaces pr_warning() with pr_warn(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Reduce lines of log dumpGavin Shan2014-08-051-5/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch prints 4 PCIE or AER config registers each line, which is part of the EEH log so that it looks a bit more compact. Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Selectively enable IO for error logGavin Shan2014-08-054-6/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to the experiment I did, PCI config access is blocked on P7IOC frozen PE by hardware, but PHB3 doesn't do that. That means we always get 0xFF's while dumping PCI config space of the frozen PE on P7IOC. We don't have the problem on PHB3. So we have to enable I/O prioir to collecting error log. Otherwise, meaningless 0xFF's are always returned. The patch fixes it by EEH flag (EEH_ENABLE_IO_FOR_LOG), which is selectively set to indicate the case for: P7IOC on PowerNV platform, pSeries platform. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Refactor EEH flag accessorsGavin Shan2014-08-056-38/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are multiple global EEH flags. Almost each flag has its own accessor, which doesn't make sense. The patch refactors EEH flag accessors so that they look unified: eeh_add_flag(): Add EEH flag eeh_clear_flag(): Clear EEH flag eeh_has_flag(): Check if one specific flag has been set eeh_enabled(): Check if EEH functionality has been enabled Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Fetch IOMMU table in reliable wayGavin Shan2014-08-051-11/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Function eeh_iommu_group_to_pe() iterates each PCI device to check the binding IOMMU group with get_iommu_table_base(), which possibly fetches pdev->dev.archdata.dma_data.dma_offset. It's (0x1 << 59) for "bypass" cases. The patch fixes the issue by iterating devices hooked to the IOMMU group and fetch IOMMU table there. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/powernv: Fix IOMMU table for VFIO devGavin Shan2014-08-051-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On PHB3, PCI devices can bypass IOMMU for DMA access. If we pass through one PCI device, whose hose driver ever enable the bypass mode, pdev->dev.archdata.dma_data.iommu_table_base isn't IOMMU table. However, EEH needs access the IOMMU table when the device is owned by guest. The patch fixes pdev->dev.archdata.dma_data.iommu_table when passing through the device to guest in pnv_pci_ioda2_set_bypass(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
| * | | powerpc/eeh: Wrong place to call pci_get_slot()Mike Qiu2014-08-051-33/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pci_get_slot() is called with hold of PCI bus semaphore and it's not safe to be called in interrupt context. However, we possibly checks EEH error and calls the function in interrupt context. To avoid using pci_get_slot(), we turn into device tree for fetching location code. Otherwise, we might run into WARN_ON() as following messages indicate: WARNING: at drivers/pci/search.c:223 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc3+ #72 task: c000000001367af0 ti: c000000001444000 task.ti: c000000001444000 NIP: c000000000497b70 LR: c000000000037530 CTR: 000000003003d114 REGS: c000000001446fa0 TRAP: 0700 Not tainted (3.16.0-rc3+) MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 48002422 XER: 20000000 CFAR: c00000000003752c SOFTE: 0 : NIP [c000000000497b70] .pci_get_slot+0x40/0x110 LR [c000000000037530] .eeh_pe_loc_get+0x150/0x190 Call Trace: .of_get_property+0x30/0x60 (unreliable) .eeh_pe_loc_get+0x150/0x190 .eeh_dev_check_failure+0x1b4/0x550 .eeh_check_failure+0x90/0xf0 .lpfc_sli_check_eratt+0x504/0x7c0 [lpfc] .lpfc_poll_eratt+0x64/0x100 [lpfc] .call_timer_fn+0x64/0x190 .run_timer_softirq+0x2cc/0x3e0 Cc: stable@vger.kernel.org Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com> Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OpenPOWER on IntegriCloud