summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* crypto: testmgr - add larger cast5 testvectorsJohannes Goetzfried2012-08-014-2/+871
| | | | | | | | | New ECB, CBC and CTR testvectors for cast5. We need larger testvectors to check parallel code paths in the optimized implementation. Tests have also been added to the tcrypt module. Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: cast5 - prepare generic module for optimized implementationsJohannes Goetzfried2012-08-013-34/+69
| | | | | | | | | Rename cast5 module to cast5_generic to allow autoloading of optimized implementations. Generic functions and s-boxes are exported to be able to use them within optimized implementations. Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: arch/s390 - cleanup - remove unneeded cra_list initializationJussi Kivilinna2012-08-013-16/+0
| | | | | | | | | | | | | | Initialization of cra_list is currently mixed, most ciphers initialize this field and most shashes do not. Initialization however is not needed at all since cra_list is initialized/overwritten in __crypto_register_alg() with list_add(). Therefore perform cleanup to remove all unneeded initializations of this field in 'arch/s390/crypto/' Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: linux-s390@vger.kernel.org Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: drivers - remove cra_list initializationJussi Kivilinna2012-08-0112-21/+0
| | | | | | | | | | | | | | | | | | | Initialization of cra_list is currently mixed, most ciphers initialize this field and most shashes do not. Initialization however is not needed at all since cra_list is initialized/overwritten in __crypto_register_alg() with list_add(). Therefore perform cleanup to remove all unneeded initializations of this field in 'crypto/drivers/'. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: linux-geode@lists.infradead.org Cc: Michal Ludvig <michal@logix.cz> Cc: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Cc: Varun Wadekar <vwadekar@nvidia.com> Cc: Eric Bénard <eric@eukrea.com> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Acked-by: Kent Yoder <key@linux.vnet.ibm.com> Acked-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: arch/x86 - cleanup - remove unneeded crypto_alg.cra_list initializationsJussi Kivilinna2012-08-0111-54/+1
| | | | | | | | | | | Initialization of cra_list is currently mixed, most ciphers initialize this field and most shashes do not. Initialization however is not needed at all since cra_list is initialized/overwritten in __crypto_register_alg() with list_add(). Therefore perform cleanup to remove all unneeded initializations of this field in 'arch/x86/crypto/'. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: cleanup - remove unneeded crypto_alg.cra_list initializationsJussi Kivilinna2012-08-0115-15/+0
| | | | | | | | | | | Initialization of cra_list is currently mixed, most ciphers initialize this field and most shashes do not. Initialization however is not needed at all since cra_list is initialized/overwritten in __crypto_register_alg() with list_add(). Therefore perform cleanup to remove all unneeded initializations of this field in 'crypto/'. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: whirlpool - use crypto_[un]register_shashesJussi Kivilinna2012-08-011-33/+6
| | | | | | | | Combine all shash algs to be registered and use new crypto_[un]register_shashes functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: sha512 - use crypto_[un]register_shashesJussi Kivilinna2012-08-011-15/+5
| | | | | | | | Combine all shash algs to be registered and use new crypto_[un]register_shashes functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: sha256 - use crypto_[un]register_shashesJussi Kivilinna2012-08-011-20/+5
| | | | | | | | Combine all shash algs to be registered and use new crypto_[un]register_shashes functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: tiger - use crypto_[un]register_shashesJussi Kivilinna2012-08-011-32/+6
| | | | | | | | Combine all shash algs to be registered and use new crypto_[un]register_shashes functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: add crypto_[un]register_shashes for [un]registering multiple shash ↵Jussi Kivilinna2012-08-012-0/+38
| | | | | | | | | | entries at once Add crypto_[un]register_shashes() to allow simplifying init/exit code of shash crypto modules that register multiple algorithms. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: ansi_cprng - use crypto_[un]register_algsJussi Kivilinna2012-08-011-40/+23
| | | | | | | | | Combine all crypto_alg to be registered and use new crypto_[un]register_algs functions. This simplifies init/exit code. Cc: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: serpent - use crypto_[un]register_algsJussi Kivilinna2012-08-011-34/+19
| | | | | | | | Combine all crypto_alg to be registered and use new crypto_[un]register_algs functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: des - use crypto_[un]register_algsJussi Kivilinna2012-08-011-20/+5
| | | | | | | | Combine all crypto_alg to be registered and use new crypto_[un]register_algs functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: crypto_null - use crypto_[un]register_algsJussi Kivilinna2012-08-011-39/+18
| | | | | | | | Combine all crypto_alg to be registered and use new crypto_[un]register_algs functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: tea - use crypto_[un]register_algsJussi Kivilinna2012-08-011-35/+6
| | | | | | | | Combine all crypto_alg to be registered and use new crypto_[un]register_algs functions. This simplifies init/exit code. Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* Merge tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds2012-07-317-169/+248
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull irqdomain changes from Grant Likely: "Round of refactoring and enhancements to irq_domain infrastructure. This series starts the process of simplifying irqdomain. The ultimate goal is to merge LEGACY, LINEAR and TREE mappings into a single system, but had to back off from that after some last minute bugs. Instead it mainly reorganizes the code and ensures that the reverse map gets populated when the irq is mapped instead of the first time it is looked up. Merging of the irq_domain types is deferred to v3.7 In other news, this series adds helpers for creating static mappings on a linear or tree mapping." * tag 'irqdomain-for-linus' of git://git.secretlab.ca/git/linux-2.6: irqdomain: Improve diagnostics when a domain mapping fails irqdomain: eliminate slow-path revmap lookups irqdomain: Fix irq_create_direct_mapping() to test irq_domain type. irqdomain: Eliminate dedicated radix lookup functions irqdomain: Support for static IRQ mapping and association. irqdomain: Always update revmap when setting up a virq irqdomain: Split disassociating code into separate function irq_domain: correct a minor wrong comment for linear revmap irq_domain: Standardise legacy/linear domain selection irqdomain: Make ops->map hook optional irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACY irqdomain: Simple NUMA awareness. devicetree: add helper inline for retrieving a node's full name
| * irqdomain: Improve diagnostics when a domain mapping failsMark Brown2012-07-241-6/+11
| | | | | | | | | | | | | | | | | | When the map operation fails log the error code we get and add a WARN_ON() so we get a backtrace (which should help work out which interrupt is the source of the issue). Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * irqdomain: eliminate slow-path revmap lookupsGrant Likely2012-07-241-40/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the current state of irq_domain, the reverse map is always updated when new IRQs get mapped. This means that the irq_find_mapping() function can be simplified to execute the revmap lookup functions unconditionally This patch adds lookup functions for the revmaps that don't yet have one and removes the slow path lookup code path. v8: Broke out unrelated changes into separate patches. Rebased on Paul's irq association patches. v7: Rebased to irqdomain/next for v3.4 and applied before the removal of 'hint' v6: Remove the slow path entirely. The only place where the slow path could get called is for a linear mapping if the hwirq number is larger than the linear revmap size. There shouldn't be any interrupt controllers that do that. v5: rewrite to not use a ->revmap() callback. It is simpler, smaller, safer and faster to open code each of the revmap lookups directly into irq_find_mapping() via a switch statement. v4: Fix build failure on incorrect variable reference. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Milton Miller <miltonm@bga.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Rob Herring <rob.herring@calxeda.com>
| * Merge remote-tracking branch 'origin' into irqdomain/nextGrant Likely2012-07-244548-104854/+181547
| |\
| * | irqdomain: Fix irq_create_direct_mapping() to test irq_domain type.Grant Likely2012-07-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | irq_create_direct_mapping can only be used with the NOMAP type. Make the function test to ensure it is passed the correct type of irq_domain. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | irqdomain: Eliminate dedicated radix lookup functionsGrant Likely2012-07-114-65/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation to remove the slow revmap path, eliminate the public radix revmap lookup functions. This simplifies the code and makes the slowpath removal patch a lot simpler. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | irqdomain: Support for static IRQ mapping and association.Grant Likely2012-07-112-25/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new strict mapping API for supporting creation of linux IRQs at existing positions within the domain. The new routines are as follows: For dynamic allocation and insertion to specified ranges: - irq_create_identity_mapping() - irq_create_strict_mappings() These will allocate and associate a range of linux IRQs at the specified location. This can be used by controllers that have their own static linux IRQ definitions to map a hwirq range to, as well as for platforms that wish to establish 1:1 identity mapping between linux and hwirq space. For insertion to specified ranges by platforms that do their own irq_desc management: - irq_domain_associate() - irq_domain_associate_many() These in turn call back in to the domain's ->map() routine, for further processing by the platform. Disassociation of IRQs get handled through irq_dispose_mapping() as normal. With these in place it should be possible to begin migration of legacy IRQ domains to linear ones, without requiring special handling for static vs dynamic IRQ definitions in DT vs non-DT paths. This also makes it possible for domains with static mappings to adopt whichever tree model best fits their needs, rather than simply restricting them to linear revmaps. Signed-off-by: Paul Mundt <lethal@linux-sh.org> [grant.likely: Reorganized irq_domain_associate{,_many} to have all logic in one place] [grant.likely: Add error checking for unallocated irq_descs at associate time] Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | irqdomain: Always update revmap when setting up a virqGrant Likely2012-07-112-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At irq_setup_virq() time all of the data needed to update the reverse map is available, but the current code ignores it and relies upon the slow path to insert revmap records. This patch adds revmap updating to the setup path so the slow path will no longer be necessary. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | irqdomain: Split disassociating code into separate functionGrant Likely2012-07-111-28/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch moves the irq disassociation code out into a separate function in preparation to extend irq_setup_virq to handle multiple irqs and rename it for use by interrupt controller drivers. The new function will be used by irq_setup_virq() in its error path. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | Merge tag 'v3.5-rc6' into irqdomain/nextGrant Likely2012-07-11995-5037/+9700
| |\ \ | | | | | | | | | | | | Linux 3.5-rc6
| * | | irq_domain: correct a minor wrong comment for linear revmapDong Aisheng2012-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The revmap type should be linear for irq_domain_add_linear function. Signed-off-by: Dong Aisheng <dong.aisheng@linaro.org> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | irq_domain: Standardise legacy/linear domain selectionMark Brown2012-07-113-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A large proportion of interrupt controllers that support legacy mappings do so because non-DT systems need to use fixed IRQ numbers when registering devices via buses but can otherwise use a linear mapping. The interrupt controller itself typically is not affected by the mapping used and best practice is to use a linear mapping where possible so drivers frequently select at runtime depending on if a legacy range has been allocated to them. Standardise this behaviour by providing irq_domain_register_simple() which will allocate a linear mapping unless a positive first_irq is provided in which case it will fall back to a legacy mapping. This helps make best practice for irq_domain adoption clearer. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | irqdomain: Make ops->map hook optionalGrant Likely2012-06-171-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There isn't a really compelling reason to force ->map to be populated, so allow it to be left unset. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | irqdomain: Remove unnecessary test for IRQ_DOMAIN_MAP_LEGACYGrant Likely2012-06-151-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Where irq_domain_associate() is called in irq_create_mapping, there is no need to test for IRQ_DOMAIN_MAP_LEGACY because it is already tested for earlier in the routine. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com>
| * | | irqdomain: Simple NUMA awareness.Paul Mundt2012-06-152-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While common irqdesc allocation is node aware, the irqdomain code is not. Presently we observe a number of regressions/inconsistencies on NUMA-capable platforms: - Platforms using irqdomains with legacy mappings, where the irq_descs are allocated node-local and the irqdomain data structure is not. - Drivers implementing irqdomains will lose node locality regardless of the underlying struct device's node id. This plugs in NUMA node id proliferation across the various allocation callsites by way of_node_to_nid() node lookup. While of_node_to_nid() does the right thing for OF-capable platforms it doesn't presently handle the non-DT case. This is trivially dealt with by simply wraping in to numa_node_id() unconditionally. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
| * | | devicetree: add helper inline for retrieving a node's full nameGrant Likely2012-06-1510-21/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pattern (np ? np->full_name : "<none>") is rather common in the kernel, but can also make for quite long lines. This patch adds a new inline function, of_node_full_name() so that the test for a valid node pointer doesn't need to be open coded at all call sites. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de>
* | | | Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds2012-07-31131-1060/+3174
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge Andrew's second set of patches: - MM - a few random fixes - a couple of RTC leftovers * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits) rtc/rtc-88pm80x: remove unneed devm_kfree rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables tmpfs: distribute interleave better across nodes mm: remove redundant initialization mm: warn if pg_data_t isn't initialized with zero mips: zero out pg_data_t when it's allocated memcg: gix memory accounting scalability in shrink_page_list mm/sparse: remove index_init_lock mm/sparse: more checks on mem_section number mm/sparse: optimize sparse_index_alloc memcg: add mem_cgroup_from_css() helper memcg: further prevent OOM with too many dirty pages memcg: prevent OOM with too many dirty pages mm: mmu_notifier: fix freed page still mapped in secondary MMU mm: memcg: only check anon swapin page charges for swap cache mm: memcg: only check swap cache pages for repeated charging mm: memcg: split swapin charge function into private and public part mm: memcg: remove needless !mm fixup to init_mm when charging mm: memcg: remove unneeded shmem charge type ...
| * | | | rtc/rtc-88pm80x: remove unneed devm_kfreeDevendra Naga2012-07-311-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | devm_kzalloc() doesn't need a matching devm_kfree(), the freeing mechanism will trigger when driver unloads. Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ashish Jangam <ashish.jangam@kpitcummins.com> Cc: David Dajun Chen <dchen@diasemi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | rtc/rtc-88pm80x: assign ret only when rtc_register_driver failsDevendra Naga2012-07-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the probe we are assigning ret to return value of PTR_ERR right after the rtc_register_drive()r, as we would have done it in the if (IS_ERR(ptr)) check, since the function fails and goes inside that case Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ashish Jangam <ashish.jangam@kpitcummins.com> Cc: David Dajun Chen <dchen@diasemi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: hugetlbfs: close race during teardown of hugetlbfs shared page tablesMel Gorman2012-07-313-3/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c. This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably was introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [ ..........] Lots of bad pmd messages followed by this [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). [ 127.186778] ------------[ cut here ]------------ [ 127.186781] kernel BUG at mm/filemap.c:134! [ 127.186782] invalid opcode: 0000 [#1] SMP [ 127.186783] CPU 7 [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod [ 127.186801] [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) [ 127.186821] Stack: [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 [ 127.186827] Call Trace: [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186870] RSP <ffff8804144b5c08> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM. $ hugeadm --pool-pages-max DEFAULT:13G $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall $ for i in `seq 1 9000`; do ./hugetlbfs-test; done On my particular machine, it usually triggers within 10 minutes but enabling debug options can change the timing such that it never hits. Once the bug is triggered, the machine is in trouble and needs to be rebooted. The machine will respond but processes accessing proc like "ps aux" will hang due to the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. The basic problem is a race between page table sharing and teardown. For the most part page table sharing depends on i_mmap_mutex. In some cases, it is also taking the mm->page_table_lock for the PTE updates but with shared page tables, it is the i_mmap_mutex that is more important. Unfortunately it appears to be also insufficient. Consider the following situation Process A Process B --------- --------- hugetlb_fault shmdt LockWrite(mmap_sem) do_munmap unmap_region unmap_vmas unmap_single_vma unmap_hugepage_range Lock(i_mmap_mutex) Lock(mm->page_table_lock) huge_pmd_unshare/unmap tables <--- (1) Unlock(mm->page_table_lock) Unlock(i_mmap_mutex) huge_pte_alloc ... Lock(i_mmap_mutex) ... vma_prio_walk, find svma, spte ... Lock(mm->page_table_lock) ... share spte ... Unlock(mm->page_table_lock) ... Unlock(i_mmap_mutex) ... hugetlb_no_page <--- (2) free_pgtables unlink_file_vma hugetlb_free_pgd_range remove_vma_list In this scenario, it is possible for Process A to share page tables with Process B that is trying to tear them down. The i_mmap_mutex on its own does not prevent Process A walking Process B's page tables. At (1) above, the page tables are not shared yet so it unmaps the PMDs. Process A sets up page table sharing and at (2) faults a new entry. Process B then trips up on it in free_pgtables. This patch fixes the problem by adding a new function __unmap_hugepage_range_final that is only called when the VMA is about to be destroyed. This function clears VM_MAYSHARE during unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA ineligible for sharing and avoids the race. Superficially this looks like it would then be vunerable to truncate and madvise issues but hugetlbfs has its own truncate handlers so does not use unmap_mapping_range() and does not support madvise(DONTNEED). This should be treated as a -stable candidate if it is merged. Test program is as follows. The test case was mostly written by Michal Hocko with a few minor changes to reproduce this bug. ==== CUT HERE ==== static size_t huge_page_size = (2UL << 20); static size_t nr_huge_page_A = 512; static size_t nr_huge_page_B = 5632; unsigned int get_random(unsigned int max) { struct timeval tv; gettimeofday(&tv, NULL); srandom(tv.tv_usec); return random() % max; } static void play(void *addr, size_t size) { unsigned char *start = addr, *end = start + size, *a; start += get_random(size/2); /* we could itterate on huge pages but let's give it more time. */ for (a = start; a < end; a += 4096) *a = 0; } int main(int argc, char **argv) { key_t key = IPC_PRIVATE; size_t sizeA = nr_huge_page_A * huge_page_size; size_t sizeB = nr_huge_page_B * huge_page_size; int shmidA, shmidB; void *addrA = NULL, *addrB = NULL; int nr_children = 300, n = 0; if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } fork_child: switch(fork()) { case 0: switch (n%3) { case 0: play(addrA, sizeA); break; case 1: play(addrB, sizeB); break; case 2: break; } break; case -1: perror("fork:"); break; default: if (++n < nr_children) goto fork_child; play(addrA, sizeA); break; } shmdt(addrA); shmdt(addrB); do { wait(NULL); } while (--n > 0); shmctl(shmidA, IPC_RMID, NULL); shmctl(shmidB, IPC_RMID, NULL); return 0; } [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | tmpfs: distribute interleave better across nodesNathan Zimmer2012-07-311-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When tmpfs has the interleave memory policy, it always starts allocating for each file from node 0 at offset 0. When there are many small files, the lower nodes fill up disproportionately. This patch spreads out node usage by starting files at nodes other than 0, by using the inode number to bias the starting node for interleave. Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: remove redundant initializationMinchan Kim2012-07-312-12/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | pg_data_t is zeroed before reaching free_area_init_core(), so remove the now unnecessary initializations. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: warn if pg_data_t isn't initialized with zeroMinchan Kim2012-07-311-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero when it is allocated. Arch code and memory hotplug already initiailize pg_data_t. So this warning should never happen. I select fields randomly near the beginning, middle and end of pg_data_t for checking. This patch isn't for performance but for removing initialization code which is necessary to add whenever we adds new field to pg_data_t or zone. Firstly, Andrew suggested clearing out of pg_data_t in MM core part but Tejun doesn't like it because in the future, some archs can initialize some fields in arch code and pass them into general MM part so blindly clearing it out in mm core part would be very annoying. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mips: zero out pg_data_t when it's allocatedMinchan Kim2012-07-311-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is preparation for the next patch which removes the zeroing of the pg_data_t in core MM. All archs except MIPS already do this. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | memcg: gix memory accounting scalability in shrink_page_listTim Chen2012-07-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed in a multi-process parallel files reading benchmark I ran on a 8 socket machine, throughput slowed down by a factor of 8 when I ran the benchmark within a cgroup container. I traced the problem to the following code path (see below) when we are trying to reclaim memory from file cache. The res_counter_uncharge function is called on every page that's reclaimed and created heavy lock contention. The patch below allows the reclaimed pages to be uncharged from the resource counter in batch and recovered the regression. Tim 40.67% usemem [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--92.61%-- res_counter_uncharge | | | |--100.00%-- __mem_cgroup_uncharge_common | | | | | |--100.00%-- mem_cgroup_uncharge_cache_page | | | __remove_mapping | | | shrink_page_list | | | shrink_inactive_list | | | shrink_mem_cgroup_zone | | | shrink_zone | | | do_try_to_free_pages | | | try_to_free_pages | | | __alloc_pages_nodemask | | | alloc_pages_current Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/sparse: remove index_init_lockGavin Shan2012-07-311-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sparse_index_init() uses the index_init_lock spinlock to protect root mem_section assignment. The lock is not necessary anymore because the function is called only during boot (during paging init which is executed only from a single CPU) and from the hotplug code (by add_memory() via arch_add_memory()) which uses mem_hotplug_mutex. The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug preparation") and sparse_index_init() was used only during boot at that time. Later when the hotplug code (and add_memory()) was introduced there was no synchronization so it was possible to online more sections from the same root probably (though I am not 100% sure about that). The first synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and hibernation in the same kernel") which was later replaced by the mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce {un}lock_memory_hotplug()"). Let's remove the lock as it is not needed and it makes the code more confusing. [mhocko@suse.cz: changelog] Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/sparse: more checks on mem_section numberGavin Shan2012-07-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __section_nr() was implemented to retrieve the corresponding memory section number according to its descriptor. It's possible that the specified memory section descriptor doesn't exist in the global array. So add more checking on that and report an error for a wrong case. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm/sparse: optimize sparse_index_allocGavin Shan2012-07-311-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section descriptors are allocated from slab or bootmem. When allocating from slab, let slab/bootmem allocator clear the memory chunk. We needn't clear it explicitly. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | memcg: add mem_cgroup_from_css() helperWanpeng Li2012-07-311-8/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a mem_cgroup_from_css() helper to replace open-coded invokations of container_of(). To clarify the code and to add a little more type safety. [akpm@linux-foundation.org: fix extensive breakage] Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | memcg: further prevent OOM with too many dirty pagesHugh Dickins2012-07-311-9/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The may_enter_fs test turns out to be too restrictive: though I saw no problem with it when testing on 3.5-rc6, it very soon OOMed when I tested on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just slightly changed the way I started off the testing: dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick. ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why the transaction needs to be started even before allocating pagecache memory. But it may not be worth worrying about these days: if direct reclaim avoids FS writeback, does __GFP_FS now mean anything? Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop device; but since that also masks off __GFP_IO, we can test for __GFP_IO directly, ignoring may_enter_fs and __GFP_FS. But even so, the test still OOMs sometimes: when originally testing on 3.5-rc6, it OOMed about one time in five or ten; when testing just now on 3.5-rc6-mm1, it OOMed on the first iteration. This residual problem comes from an accumulation of pages under ordinary writeback, not marked PageReclaim, so rightly not causing the memcg check to wait on their writeback: these too can prevent shrink_page_list() from freeing any pages, so many times that memcg reclaim fails and OOMs. Deal with these in the same way as direct reclaim now deals with dirty FS pages: mark them PageReclaim. It is appropriate to rotate these to tail of list when writepage completes, but more importantly, the PageReclaim flag makes memcg reclaim wait on them if encountered again. Increment NR_VMSCAN_IMMEDIATE? That's arguable: I chose not. Setting PageReclaim here may occasionally race with end_page_writeback() clearing it: lru_deactivate_fn() already faced the same race, and correctly concluded that the window is small and the issue non-critical. With these changes, the test runs indefinitely without OOMing on ext4, ext3 and ext2: I'll move on to test with other filesystems later. Trivia: invert conditions for a clearer block without an else, and goto keep_locked to do the unlock_page. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | memcg: prevent OOM with too many dirty pagesMichal Hocko2012-07-311-3/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current implementation of dirty pages throttling is not memcg aware which makes it easy to have memcg LRUs full of dirty pages. Without throttling, these LRUs can be scanned faster than the rate of writeback, leading to memcg OOM conditions when the hard limit is small. This patch fixes the problem by throttling the allocating process (possibly a writer) during the hard limit reclaim by waiting on PageReclaim pages. We are waiting only for PageReclaim pages because those are the pages that made one full round over LRU and that means that the writeback is much slower than scanning. The solution is far from being ideal - long term solution is memcg aware dirty throttling - but it is meant to be a band aid until we have a real fix. We are seeing this happening during nightly backups which are placed into containers to prevent from eviction of the real working set. The change affects only memcg reclaim and only when we encounter PageReclaim pages which is a signal that the reclaim doesn't catch up on with the writers so somebody should be throttled. This could be potentially unfair because it could be somebody else from the group who gets throttled on behalf of the writer but as writers need to allocate as well and they allocate in higher rate the probability that only innocent processes would be penalized is not that high. I have tested this change by a simple dd copying /dev/zero to tmpfs or ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G containers) and dd got killed by OOM killer every time. With the patch I could run the dd with the same size under 5M controller without any OOM. The issue is more visible with slower devices for output. * With the patch ================ * tmpfs size=2G --------------- $ vim cgroup_cache_oom_test.sh $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s * Without the patch =================== * tmpfs size=2G --------------- $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n] [hughd@google.com: fix deadlock with loop driver] Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: mmu_notifier: fix freed page still mapped in secondary MMUXiao Guangrong2012-07-311-22/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mmu_notifier_release() is called when the process is exiting. It will delete all the mmu notifiers. But at this time the page belonging to the process is still present in page tables and is present on the LRU list, so this race will happen: CPU 0 CPU 1 mmu_notifier_release: try_to_unmap: hlist_del_init_rcu(&mn->hlist); ptep_clear_flush_notify: mmu nofifler not found free page !!!!!! /* * At the point, the page has been * freed, but it is still mapped in * the secondary MMU. */ mn->ops->release(mn, mm); Then the box is not stable and sometimes we can get this bug: [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076 [ 738.075936] page flags: 0x20000000000014(referenced|dirty) The same issue is present in mmu_notifier_unregister(). We can call ->release before deleting the notifier to ensure the page has been unmapped from the secondary MMU before it is freed. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: memcg: only check anon swapin page charges for swap cacheJohannes Weiner2012-07-311-8/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shmem knows for sure that the page is in swap cache when attempting to charge a page, because the cache charge entry function has a check for it. Only anon pages may be removed from swap cache already when trying to charge their swapin. Adjust the comment, though: '4969c11 mm: fix swapin race condition' added a stable PageSwapCache check under the page lock in the do_swap_page() before calling the memory controller, so it's unuse_pte()'s pte_same() that may fail. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | | mm: memcg: only check swap cache pages for repeated chargingJohannes Weiner2012-07-311-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only anon and shmem pages in the swap cache are attempted to be charged multiple times, from every swap pte fault or from shmem_unuse(). No other pages require checking PageCgroupUsed(). Charging pages in the swap cache is also serialized by the page lock, and since both the try_charge and commit_charge are called under the same page lock section, the PageCgroupUsed() check might as well happen before the counter charging, let alone reclaim. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud