summaryrefslogtreecommitdiffstats
path: root/lib/radix-tree.c
Commit message (Collapse)AuthorAgeFilesLines
* lib/radix-tree.c: update the kmemleak stack trace for radix tree allocationsCatalin Marinas2014-06-061-0/+6
| | | | | | | | | | | Since radix_tree_preload() stack trace is not always useful for debugging an actual radix tree memory leak, this patch updates the kmemleak allocation stack trace in the radix_tree_node_alloc() function. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/radix-tree.c: kernel-doc warning fixFabian Frederick2014-06-041-1/+0
| | | | | | | | | | index has been removed from __radix_tree_delete_node in 449dd6984d0e47 ("mm: keep page cache radix tree nodes in check") Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: replace __get_cpu_var uses with this_cpu_ptrChristoph Lameter2014-06-041-3/+3
| | | | | | | | | | | Replace places where __get_cpu_var() is used for an address calculation with this_cpu_ptr(). Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: keep page cache radix tree nodes in checkJohannes Weiner2014-04-031-14/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, page cache radix tree nodes were freed after reclaim emptied out their page pointers. But now reclaim stores shadow entries in their place, which are only reclaimed when the inodes themselves are reclaimed. This is problematic for bigger files that are still in use after they have a significant amount of their cache reclaimed, without any of those pages actually refaulting. The shadow entries will just sit there and waste memory. In the worst case, the shadow entries will accumulate until the machine runs out of memory. To get this under control, the VM will track radix tree nodes exclusively containing shadow entries on a per-NUMA node list. Per-NUMA rather than global because we expect the radix tree nodes themselves to be allocated node-locally and we want to reduce cross-node references of otherwise independent cache workloads. A simple shrinker will then reclaim these nodes on memory pressure. A few things need to be stored in the radix tree node to implement the shadow node LRU and allow tree deletions coming from the list: 1. There is no index available that would describe the reverse path from the node up to the tree root, which is needed to perform a deletion. To solve this, encode in each node its offset inside the parent. This can be stored in the unused upper bits of the same member that stores the node's height at no extra space cost. 2. The number of shadow entries needs to be counted in addition to the regular entries, to quickly detect when the node is ready to go to the shadow node LRU list. The current entry count is an unsigned int but the maximum number of entries is 64, so a shadow counter can easily be stored in the unused upper bits. 3. Tree modification needs tree lock and tree root, which are located in the address space, so store an address_space backpointer in the node. The parent pointer of the node is in a union with the 2-word rcu_head, so the backpointer comes at no extra cost as well. 4. The node needs to be linked to an LRU list, which requires a list head inside the node. This does increase the size of the node, but it does not change the number of objects that fit into a slab page. [akpm@linux-foundation.org: export the right function] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib: radix_tree: tree node interfaceJohannes Weiner2014-04-031-115/+148
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make struct radix_tree_node part of the public interface and provide API functions to create, look up, and delete whole nodes. Refactor the existing insert, look up, delete functions on top of these new node primitives. This will allow the VM to track and garbage collect page cache radix tree nodes. [sasha.levin@oracle.com: return correct error code on insertion failure] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: filemap: move radix tree hole searching hereJohannes Weiner2014-04-031-75/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib: radix-tree: add radix_tree_delete_item()Johannes Weiner2014-04-031-4/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlockHugh Dickins2014-03-041-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running fsx on tmpfs with concurrent memhog-swapoff-swapon, lots of BUG: sleeping function called from invalid context at kernel/fork.c:606 in_atomic(): 0, irqs_disabled(): 0, pid: 1394, name: swapoff 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 followed by ================================================ [ BUG: lock held when returning to user space! ] 3.14.0-rc1 #3 Not tainted ------------------------------------------------ swapoff/1394 is leaving the kernel with locks still held! 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 after which the system recovered nicely. Whoops, I long ago forgot the rcu_read_unlock() on one unlikely branch. Fixes e504f3fdd63d ("tmpfs radix_tree: locate_item to speed up swapoff") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib/radix-tree.c: make radix_tree_node_alloc() work correctly within interruptJan Kara2013-09-111-2/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With users of radix_tree_preload() run from interrupt (block/blk-ioc.c is one such possible user), the following race can happen: radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; <interrupt> ... radix_tree_preload() ... radix_tree_insert() radix_tree_node_alloc() if (rtp->nr) { ret = rtp->nodes[rtp->nr - 1]; And we give out one radix tree node twice. That clearly results in radix tree corruption with different results (usually OOPS) depending on which two users of radix tree race. We fix the problem by making radix_tree_node_alloc() always allocate fresh radix tree nodes when in interrupt. Using preloading when in interrupt doesn't make sense since all the allocations have to be atomic anyway and we cannot steal nodes from process-context users because some users rely on radix_tree_insert() succeeding after radix_tree_preload(). in_interrupt() check is somewhat ugly but we cannot simply key off passed gfp_mask as that is acquired from root_gfp_mask() and thus the same for all preload users. Another part of the fix is to avoid node preallocation in radix_tree_preload() when passed gfp_mask doesn't allow waiting. Again, preallocation in such case doesn't make sense and when preallocation would happen in interrupt we could possibly leak some allocated nodes. However, some users of radix_tree_preload() require following radix_tree_insert() to succeed. To avoid unexpected effects for these users, radix_tree_preload() only warns if passed gfp mask doesn't allow waiting and we provide a new function radix_tree_maybe_preload() for those users which get different gfp mask from different call sites and which are prepared to handle radix_tree_insert() failure. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: fix contiguous iteratorKonstantin Khlebnikov2012-06-051-0/+3
| | | | | | | | | | | | | | | | This patch fixes bug in macro radix_tree_for_each_contig(). If radix_tree_next_slot() sees NULL in next slot it returns NULL, but following radix_tree_next_chunk() switches iterating into next chunk. As result iterating becomes non-contiguous and breaks vfs "splice" and all its users. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-and-bisected-by: Hans de Bruin <jmdebruin@xmsnet.nl> Reported-and-bisected-by: Ondrej Zary <linux@rainbow-software.org> Reported-bisected-and-tested-by: Toralf Förster <toralf.foerster@gmx.de> Link: https://lkml.org/lkml/2012/6/5/64 Cc: stable <stable@vger.kernel.org> # 3.4.x Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: fix preload vector sizeNick Piggin2012-05-291-1/+14
| | | | | | | | We are not preallocating a sufficient number of nodes. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: rewrite gang lookup using iteratorKonstantin Khlebnikov2012-03-281-258/+33
| | | | | | | | | | | Rewrite radix_tree_gang_lookup_* functions using the new radix-tree iterator. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: introduce bit-optimized iteratorKonstantin Khlebnikov2012-03-281-0/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A series of radix tree cleanups, and usage of them in the core pagecache code. Micro-benchmark: lookup 14 slots (typical page-vector size) in radix-tree there earch <step> slot filled and tagged before/after - nsec per full scan through tree * Intel Sandy Bridge i7-2620M 4Mb L3 New code always faster * AMD Athlon 6000+ 2x1Mb L2, without L3 New code generally faster, Minor degradation (marked with "*") for huge sparse trees * i386 on Sandy Bridge New code faster for common cases: tagged and dense trees. Some degradations for non-tagged lookup on sparse trees. Ideally, there might help __ffs() analog for searching first non-zero long element in array, gcc sometimes cannot optimize this loop corretly. Numbers: CPU: Intel Sandy Bridge i7-2620M 4Mb L3 radix-tree with 1024 slots: tagged lookup step 1 before 7156 after 3613 step 2 before 5399 after 2696 step 3 before 4779 after 1928 step 4 before 4456 after 1429 step 5 before 4292 after 1213 step 6 before 4183 after 1052 step 7 before 4157 after 951 step 8 before 4016 after 812 step 9 before 3952 after 851 step 10 before 3937 after 732 step 11 before 4023 after 709 step 12 before 3872 after 657 step 13 before 3892 after 633 step 14 before 3720 after 591 step 15 before 3879 after 578 step 16 before 3561 after 513 normal lookup step 1 before 4266 after 3301 step 2 before 2695 after 2129 step 3 before 2083 after 1712 step 4 before 1801 after 1534 step 5 before 1628 after 1313 step 6 before 1551 after 1263 step 7 before 1475 after 1185 step 8 before 1432 after 1167 step 9 before 1373 after 1092 step 10 before 1339 after 1134 step 11 before 1292 after 1056 step 12 before 1319 after 1030 step 13 before 1276 after 1004 step 14 before 1256 after 987 step 15 before 1228 after 992 step 16 before 1247 after 999 radix-tree with 1024*1024*128 slots: tagged lookup step 1 before 1086102841 after 674196409 step 2 before 816839155 after 498138306 step 7 before 599728907 after 240676762 step 15 before 555729253 after 185219677 step 63 before 606637748 after 128585664 step 64 before 608384432 after 102945089 step 65 before 596987114 after 123996019 step 128 before 304459225 after 56783056 step 256 before 158846855 after 31232481 step 512 before 86085652 after 18950595 step 12345 before 6517189 after 1674057 normal lookup step 1 before 626064869 after 544418266 step 2 before 418809975 after 336321473 step 7 before 242303598 after 207755560 step 15 before 208380563 after 176496355 step 63 before 186854206 after 167283638 step 64 before 176188060 after 170143976 step 65 before 185139608 after 167487116 step 128 before 88181865 after 86913490 step 256 before 45733628 after 45143534 step 512 before 24506038 after 23859036 step 12345 before 2177425 after 2018662 * AMD Athlon 6000+ 2x1Mb L2, without L3 radix-tree with 1024 slots: tag-lookup step 1 before 8164 after 5379 step 2 before 5818 after 5581 step 3 before 4959 after 4213 step 4 before 4371 after 3386 step 5 before 4204 after 2997 step 6 before 4950 after 2744 step 7 before 4598 after 2480 step 8 before 4251 after 2288 step 9 before 4262 after 2243 step 10 before 4175 after 2131 step 11 before 3999 after 2024 step 12 before 3979 after 1994 step 13 before 3842 after 1929 step 14 before 3750 after 1810 step 15 before 3735 after 1810 step 16 before 3532 after 1660 normal-lookup step 1 before 7875 after 5847 step 2 before 4808 after 4071 step 3 before 4073 after 3462 step 4 before 3677 after 3074 step 5 before 4308 after 2978 step 6 before 3911 after 3807 step 7 before 3635 after 3522 step 8 before 3313 after 3202 step 9 before 3280 after 3257 step 10 before 3166 after 3083 step 11 before 3066 after 3026 step 12 before 2985 after 2982 step 13 before 2925 after 2924 step 14 before 2834 after 2808 step 15 before 2805 after 2803 step 16 before 2647 after 2622 radix-tree with 1024*1024*128 slots: tag-lookup step 1 before 1288059720 after 951736580 step 2 before 961292300 after 884212140 step 7 before 768905140 after 547267580 step 15 before 771319480 after 456550640 step 63 before 504847640 after 242704304 step 64 before 392484800 after 177920786 step 65 before 491162160 after 246895264 step 128 before 208084064 after 97348392 step 256 before 112401035 after 51408126 step 512 before 75825834 after 29145070 step 12345 before 5603166 after 2847330 normal-lookup step 1 before 1025677120 after 861375100 step 2 before 647220080 after 572258540 step 7 before 505518960 after 484041813 step 15 before 430483053 after 444815320 * step 63 before 388113453 after 404250546 * step 64 before 374154666 after 396027440 * step 65 before 381423973 after 396704853 * step 128 before 190078700 after 202619384 * step 256 before 100886756 after 102829108 * step 512 before 64074505 after 56158720 step 12345 before 4237289 after 4422299 * * i686 on Sandy bridge radix-tree with 1024 slots: tagged lookup step 1 before 7990 after 4019 step 2 before 5698 after 2897 step 3 before 5013 after 2475 step 4 before 4630 after 1721 step 5 before 4346 after 1759 step 6 before 4299 after 1556 step 7 before 4098 after 1513 step 8 before 4115 after 1222 step 9 before 3983 after 1390 step 10 before 4077 after 1207 step 11 before 3921 after 1231 step 12 before 3894 after 1116 step 13 before 3840 after 1147 step 14 before 3799 after 1090 step 15 before 3797 after 1059 step 16 before 3783 after 745 normal lookup step 1 before 5103 after 3499 step 2 before 3299 after 2550 step 3 before 2489 after 2370 step 4 before 2034 after 2302 * step 5 before 1846 after 2268 * step 6 before 1752 after 2249 * step 7 before 1679 after 2164 * step 8 before 1627 after 2153 * step 9 before 1542 after 2095 * step 10 before 1479 after 2109 * step 11 before 1469 after 2009 * step 12 before 1445 after 2039 * step 13 before 1411 after 2013 * step 14 before 1374 after 2046 * step 15 before 1340 after 1975 * step 16 before 1331 after 2000 * radix-tree with 1024*1024*128 slots: tagged lookup step 1 before 1225865377 after 667153553 step 2 before 842427423 after 471533007 step 7 before 609296153 after 276260116 step 15 before 544232060 after 226859105 step 63 before 519209199 after 141343043 step 64 before 588980279 after 141951339 step 65 before 521099710 after 138282060 step 128 before 298476778 after 83390628 step 256 before 149358342 after 43602609 step 512 before 76994713 after 22911077 step 12345 before 5328666 after 1472111 normal lookup step 1 before 819284564 after 533635310 step 2 before 512421605 after 364956155 step 7 before 271443305 after 305721345 * step 15 before 223591630 after 273960216 * step 63 before 190320247 after 217770207 * step 64 before 178538168 after 267411372 * step 65 before 186400423 after 215347937 * step 128 before 88106045 after 140540612 * step 256 before 44812420 after 70660377 * step 512 before 24435438 after 36328275 * step 12345 before 2123924 after 2148062 * bloat-o-meter delta for this patchset + patchset with related shmem cleanups bloat-o-meter: x86_64 add/remove: 4/3 grow/shrink: 5/6 up/down: 928/-939 (-11) function old new delta radix_tree_next_chunk - 499 +499 shmem_unuse 428 554 +126 shmem_radix_tree_replace 131 227 +96 find_get_pages_tag 354 419 +65 find_get_pages_contig 345 407 +62 find_get_pages 362 396 +34 __kstrtab_radix_tree_next_chunk - 22 +22 __ksymtab_radix_tree_next_chunk - 16 +16 __kcrctab_radix_tree_next_chunk - 8 +8 radix_tree_gang_lookup_slot 204 203 -1 static.shmem_xattr_set 384 381 -3 radix_tree_gang_lookup_tag_slot 208 191 -17 radix_tree_gang_lookup 231 187 -44 radix_tree_gang_lookup_tag 247 199 -48 shmem_unlock_mapping 278 190 -88 __lookup 217 - -217 __lookup_tag 242 - -242 radix_tree_locate_item 279 - -279 bloat-o-meter: i386 add/remove: 3/3 grow/shrink: 8/9 up/down: 1075/-1275 (-200) function old new delta radix_tree_next_chunk - 757 +757 shmem_unuse 352 449 +97 find_get_pages_contig 269 322 +53 shmem_radix_tree_replace 113 154 +41 find_get_pages_tag 277 318 +41 dcache_dir_lseek 426 458 +32 __kstrtab_radix_tree_next_chunk - 22 +22 vc_do_resize 968 977 +9 snd_pcm_lib_read1 725 733 +8 __ksymtab_radix_tree_next_chunk - 8 +8 netlbl_cipsov4_list 1120 1127 +7 find_get_pages 293 291 -2 new_slab 467 459 -8 bitfill_unaligned_rev 425 417 -8 radix_tree_gang_lookup_tag_slot 177 146 -31 blk_dump_cmd 267 229 -38 radix_tree_gang_lookup_slot 212 134 -78 shmem_unlock_mapping 221 128 -93 radix_tree_gang_lookup_tag 275 162 -113 radix_tree_gang_lookup 255 126 -129 __lookup 227 - -227 __lookup_tag 271 - -271 radix_tree_locate_item 277 - -277 This patch: Implement a clean, simple and effective radix-tree iteration routine. Iterating divided into two phases: * lookup next chunk in radix-tree leaf node * iterating through slots in this chunk Main iterator function radix_tree_next_chunk() returns pointer to first slot, and stores in the struct radix_tree_iter index of next-to-last slot. For tagged-iterating it also constuct bitmask of tags for retunted chunk. All additional logic implemented as static-inline functions and macroses. Also adds radix_tree_find_next_bit() static-inline variant of find_next_bit() optimized for small constant size arrays, because find_next_bit() too heavy for searching in an array with one/two long elements. [akpm@linux-foundation.org: rework comments a bit] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* lib: reduce the use of module.h wherever possiblePaul Gortmaker2012-03-071-1/+1
| | | | | | | | | | For files only using THIS_MODULE and/or EXPORT_SYMBOL, map them onto including export.h -- or if the file isn't even using those, then just delete the include. Fix up any implicit include dependencies that were being masked by module.h along the way. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
* radix_tree: take radix_tree_path off stackHugh Dickins2012-01-121-78/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Down, down in the deepest depths of GFP_NOIO page reclaim, we have shrink_page_list() calling __remove_mapping() calling __delete_from_ swap_cache() or __delete_from_page_cache(). You would not expect those to need much stack, but in fact they call radix_tree_delete(): which declares a 192-byte radix_tree_path array on its stack (to record the node,offsets it visits when descending, in case it needs to ascend to update them). And if any tag is still set [1], that calls radix_tree_tag_clear(), which declares a further such 192-byte radix_tree_path array on the stack. (At least we have interrupts disabled here, so won't then be pushing registers too.) That was probably a good choice when most users were 32-bit (array of half the size), and adding fields to radix_tree_node would have bloated it unnecessarily. But nowadays many are 64-bit, and each radix_tree_node contains a struct rcu_head, which is only used when freeing; whereas the radix_tree_path info is only used for updating the tree (deleting, clearing tags or setting tags if tagged) when a lock must be held, of no interest when accessing the tree locklessly. So add a parent pointer to the radix_tree_node, in union with the rcu_head, and remove all uses of the radix_tree_path. There would be space in that union to save the offset when descending as before (we can argue that a lock must already be held to exclude other users), but recalculating it when ascending is both easy (a constant shift and a constant mask) and uncommon, so it seems better just to do that. Two little optimizations: no need to decrement height when descending, adjusting shift is enough; and once radix_tree_tag_if_tagged() has set tag on a node and its ancestors, it need not ascend from that node again. perf on the radix tree test harness reports radix_tree_insert() as 2% slower (now having to set parent), but radix_tree_delete() 24% faster. Surely that's an exaggeration from rtth's artificially low map shift 3, but forcing it back to 6 still rates radix_tree_delete() 8% faster. [1] Can a pagecache tag (dirty, writeback or towrite) actually still be set at the time of radix_tree_delete()? Perhaps not if the filesystem is well-behaved. But although I've not tracked any stack overflow down to this cause, I have observed a curious case in which a dirty tag is set and left set on tmpfs: page migration's migrate_page_copy() happens to use __set_page_dirty_nobuffers() to set PageDirty on the newpage, and that sets PAGECACHE_TAG_DIRTY as a side-effect - harmless to a filesystem which doesn't use tags, except for this stack depth issue. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nai Xia <nai.xia@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix_tree: clean away saw_unset_tag leftoversHugh Dickins2011-10-311-8/+2
| | | | | | | | | | | | radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind. Remove it now, and return 0 as soon as we see unset tag - we already rely upon the root tag to be correct, returning 0 immediately if it's not set. Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* tmpfs radix_tree: locate_item to speed up swapoffHugh Dickins2011-08-031-0/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have already acknowledged that swapoff of a tmpfs file is slower than it was before conversion to the generic radix_tree: a little slower there will be acceptable, if the hotter paths are faster. But it was a shock to find swapoff of a 500MB file 20 times slower on my laptop, taking 10 minutes; and at that rate it significantly slows down my testing. Now, most of that turned out to be overhead from PROVE_LOCKING and PROVE_RCU: without those it was only 4 times slower than before; and more realistic tests on other machines don't fare as badly. I've tried a number of things to improve it, including tagging the swap entries, then doing lookup by tag: I'd expected that to halve the time, but in practice it's erratic, and often counter-productive. The only change I've so far found to make a consistent improvement, is to short-circuit the way we go back and forth, gang lookup packing entries into the array supplied, then shmem scanning that array for the target entry. Scanning in place doubles the speed, so it's now only twice as slow as before (or three times slower when the PROVEs are on). So, add radix_tree_locate_item() as an expedient, once-off, single-caller hack to do the lookup directly in place. #ifdef it on CONFIG_SHMEM and CONFIG_SWAP, as much to document its limited applicability as save space in other configurations. And, sadly, #include sched.h for cond_resched(). Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix_tree: exceptional entries and indicesHugh Dickins2011-08-031-10/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A patchset to extend tmpfs to MAX_LFS_FILESIZE by abandoning its peculiar swap vector, instead keeping a file's swap entries in the same radix tree as its struct page pointers: thus saving memory, and simplifying its code and locking. This patch: The radix_tree is used by several subsystems for different purposes. A major use is to store the struct page pointers of a file's pagecache for memory management. But what if mm wanted to store something other than page pointers there too? The low bit of a radix_tree entry is already used to denote an indirect pointer, for internal use, and the unlikely radix_tree_deref_retry() case. Define the next bit as denoting an exceptional entry, and supply inline functions radix_tree_exception() to return non-0 in either unlikely case, and radix_tree_exceptional_entry() to return non-0 in the second case. If a subsystem already uses radix_tree with that bit set, no problem: it does not affect internal workings at all, but is defined for the convenience of those storing well-aligned pointers in the radix_tree. The radix_tree_gang_lookups have an implicit assumption that the caller can deduce the offset of each entry returned e.g. by the page->index of a struct page. But that may not be feasible for some kinds of item to be stored there. radix_tree_gang_lookup_slot() allow for an optional indices argument, output array in which to return those offsets. The same could be added to other radix_tree_gang_lookups, but for now keep it to the only one for which we need it. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix_tree: radix_tree_gang_lookup_tag_slot() may never returnToshiyuki Okajima2011-01-261-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Executed command: fsstress -d /mnt -n 600 -p 850 crash> bt PID: 7947 TASK: ffff880160546a70 CPU: 0 COMMAND: "fsstress" #0 [ffff8800dfc07d00] machine_kexec at ffffffff81030db9 #1 [ffff8800dfc07d70] crash_kexec at ffffffff810a7952 #2 [ffff8800dfc07e40] oops_end at ffffffff814aa7c8 #3 [ffff8800dfc07e70] die_nmi at ffffffff814aa969 #4 [ffff8800dfc07ea0] do_nmi_callback at ffffffff8102b07b #5 [ffff8800dfc07f10] do_nmi at ffffffff814aa514 #6 [ffff8800dfc07f50] nmi at ffffffff814a9d60 [exception RIP: __lookup_tag+100] RIP: ffffffff812274b4 RSP: ffff88016056b998 RFLAGS: 00000287 RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000006 RDX: 000000000000001d RSI: ffff88016056bb18 RDI: ffff8800c85366e0 RBP: ffff88016056b9c8 R8: ffff88016056b9e8 R9: 0000000000000000 R10: 000000000000000e R11: ffff8800c8536908 R12: 0000000000000010 R13: 0000000000000040 R14: ffffffffffffffc0 R15: ffff8800c85366e0 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #7 [ffff88016056b998] __lookup_tag at ffffffff812274b4 #8 [ffff88016056b9d0] radix_tree_gang_lookup_tag_slot at ffffffff81227605 #9 [ffff88016056ba20] find_get_pages_tag at ffffffff810fc110 #10 [ffff88016056ba80] pagevec_lookup_tag at ffffffff81105e85 #11 [ffff88016056baa0] write_cache_pages at ffffffff81104c47 #12 [ffff88016056bbd0] generic_writepages at ffffffff81105014 #13 [ffff88016056bbe0] do_writepages at ffffffff81105055 #14 [ffff88016056bbf0] __filemap_fdatawrite_range at ffffffff810fb2cb #15 [ffff88016056bc40] filemap_write_and_wait_range at ffffffff810fb32a #16 [ffff88016056bc70] generic_file_direct_write at ffffffff810fb3dc #17 [ffff88016056bce0] __generic_file_aio_write at ffffffff810fcee5 #18 [ffff88016056bda0] generic_file_aio_write at ffffffff810fd085 #19 [ffff88016056bdf0] do_sync_write at ffffffff8114f9ea #20 [ffff88016056bf00] vfs_write at ffffffff8114fcf8 #21 [ffff88016056bf30] sys_write at ffffffff81150691 #22 [ffff88016056bf80] system_call_fastpath at ffffffff8100c0b2 I think this root cause is the following: radix_tree_range_tag_if_tagged() always tags the root tag with settag if the root tag is set with iftag even if there are no iftag tags in the specified range (Of course, there are some iftag tags outside the specified range). =============================================================================== [[[Detailed description]]] (1) Why cannot radix_tree_gang_lookup_tag_slot() return forever? __lookup_tag(): - Return with 0. - Return with the index which is not bigger than the old one as the input parameter. Therefore the following "while" repeats forever because the above conditions cause "ret" not to be updated and the cur_index cannot be changed into the bigger one. (So, radix_tree_gang_lookup_tag_slot() cannot return forever.) radix_tree_gang_lookup_tag_slot(): 1178 while (ret < max_items) { 1179 unsigned int slots_found; 1180 unsigned long next_index; /* Index of next search */ 1181 1182 if (cur_index > max_index) 1183 break; 1184 slots_found = __lookup_tag(node, results + ret, 1185 cur_index, max_items - ret, &next_index, tag); 1186 ret += slots_found; // cannot update ret because slots_found == 0. // so, this while loops forever. 1187 if (next_index == 0) 1188 break; 1189 cur_index = next_index; 1190 } (2) Why does __lookup_tag() return with 0 and doesn't update the index? Assuming the following: - the one of the slot in radix_tree_node is NULL. - the one of the tag which corresponds to the slot sets with PAGECACHE_TAG_TOWRITE or other. - In a certain height(!=0), the corresponding index is 0. a) __lookup_tag() notices that the tag is set. 1005 static unsigned int 1006 __lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index, 1007 unsigned int max_items, unsigned long *next_index, unsigned int tag) 1008 { 1009 unsigned int nr_found = 0; 1010 unsigned int shift, height; 1011 1012 height = slot->height; 1013 if (height == 0) 1014 goto out; 1015 shift = (height-1) * RADIX_TREE_MAP_SHIFT; 1016 1017 while (height > 0) { 1018 unsigned long i = (index >> shift) & RADIX_TREE_MAP_MASK ; 1019 1020 for (;;) { 1021 if (tag_get(slot, tag, i)) 1022 break; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ * the index is not updated yet. b) __lookup_tag() notices that the slot is NULL. 1023 index &= ~((1UL << shift) - 1); 1024 index += 1UL << shift; 1025 if (index == 0) 1026 goto out; /* 32-bit wraparound */ 1027 i++; 1028 if (i == RADIX_TREE_MAP_SIZE) 1029 goto out; 1030 } 1031 height--; 1032 if (height == 0) { /* Bottom level: grab some items */ ... 1055 } 1056 shift -= RADIX_TREE_MAP_SHIFT; 1057 slot = rcu_dereference_raw(slot->slots[i]); 1058 if (slot == NULL) 1059 break; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ c) __lookup_tag() doesn't update the index and return with 0. 1060 } 1061 out: 1062 *next_index = index; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1063 return nr_found; 1064 } (3) Why is the slot NULL even if the tag is set? Because radix_tree_range_tag_if_tagged() always sets the root tag with PAGECACHE_TAG_TOWRITE if the root tag is set with PAGECACHE_TAG_DIRTY, even if there is no tag which can be set with PAGECACHE_TAG_TOWRITE in the specified range (from *first_indexp to last_index). Of course, some PAGECACHE_TAG_DIRTY nodes must exist outside the specified range. (radix_tree_range_tag_if_tagged() is called only from tag_pages_for_writeback()) 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, 641 unsigned long *first_indexp, unsigned long last_index, 642 unsigned long nr_to_tag, 643 unsigned int iftag, unsigned int settag) 644 { 645 unsigned int height = root->height; 646 struct radix_tree_path path[height]; 647 struct radix_tree_path *pathp = path; 648 struct radix_tree_node *slot; 649 unsigned int shift; 650 unsigned long tagged = 0; 651 unsigned long index = *first_indexp; 652 653 last_index = min(last_index, radix_tree_maxindex(height)); 654 if (index > last_index) 655 return 0; 656 if (!nr_to_tag) 657 return 0; 658 if (!root_tag_get(root, iftag)) { 659 *first_indexp = last_index + 1; 660 return 0; 661 } 662 if (height == 0) { 663 *first_indexp = last_index + 1; 664 root_tag_set(root, settag); 665 return 1; 666 } ... 733 root_tag_set(root, settag); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 734 *first_indexp = index; 735 736 return tagged; 737 } As the result, there is no radix_tree_node which is set with PAGECACHE_TAG_TOWRITE but the root tag(radix_tree_root) is set with PAGECACHE_TAG_TOWRITE. [figure: inside radix_tree] (Please see the figure with typewriter font) =========================================== [roottag = DIRTY] | tag=0:NOTHING tag[0 0 0 1] 1:DIRTY [x x x +] 2:WRITEBACK | 3:DIRTY,WRITEBACK p 4:TOWRITE <---> 5:DIRTY,TOWRITE ... specified range (index: 0 to 2) * There is no DIRTY tag within the specified range. (But there is a DIRTY tag outside that range.) | | | | | | | | | after calling tag_pages_for_writeback() | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | p is "page". tag[0 0 0 1] x is NULL. [x x x +] +- is a pointer to "page". | p * But TOWRITE tag is set on the root tag. ============================================ After that, radix_tree_extend() via radix_tree_insert() is called when the page is added. This function sets the new radix_tree_node with PAGECACHE_TAG_TOWRITE to succeed the status of the root tag. 246 static int radix_tree_extend(struct radix_tree_root *root, unsigned long index) 247 { 248 struct radix_tree_node *node; 249 unsigned int height; 250 int tag; 251 252 /* Figure out what the height should be. */ 253 height = root->height + 1; 254 while (index > radix_tree_maxindex(height)) 255 height++; 256 257 if (root->rnode == NULL) { 258 root->height = height; 259 goto out; 260 } 261 262 do { 263 unsigned int newheight; 264 if (!(node = radix_tree_node_alloc(root))) 265 return -ENOMEM; 266 267 /* Increase the height. */ 268 node->slots[0] = radix_tree_indirect_to_ptr(root->rnode); 269 270 /* Propagate the aggregated tag info into the new root */ 271 for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) { 272 if (root_tag_get(root, tag)) 273 tag_set(node, tag, 0); ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 274 } =========================================== [roottag = DIRTY,TOWRITE] | : tag[0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p (new page) | | | | | | | | | after calling radix_tree_insert | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | tag [5 0 0 0] * DIRTY and TOWRITE tags are [+ + x x] succeeded to the new node. | | tag [0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p ============================================ After that, the index 3 page is released by remove_from_page_cache(). Then we can make the situation that the tag is set with PAGECACHE_TAG_TOWRITE and that the slot which corresponds to the tag is NULL. =========================================== [roottag = DIRTY,TOWRITE] | tag [5 0 0 0] [+ + x x] | | tag [0 0 0 1] [0 0 0 0] [x x x +] [+ x x x] | | p p (remove) | | | | | | | | | after calling remove_page_cache | | | | | | | | | v v v v v v v v v [roottag = DIRTY,TOWRITE] | tag [4 0 0 0] * Only DIRTY tag is cleared [x + x x] because no TOWRITE tag is existed | in the bottom node. [0 0 0 0] [+ x x x] | p ============================================ To solve this problem Change to that radix_tree_tag_if_tagged() doesn't tag the root tag if it doesn't set any tags within the specified range. Like this. ============================================ 640 unsigned long radix_tree_range_tag_if_tagged(struct radix_tree_root *root, 641 unsigned long *first_indexp, unsigned long last_index, 642 unsigned long nr_to_tag, 643 unsigned int iftag, unsigned int settag) 644 { 650 unsigned long tagged = 0; ... 733 if (tagged) ^^^^^^^^^^^^^^^^^^^^^^^^ 734 root_tag_set(root, settag); 735 *first_indexp = index; 736 737 return tagged; 738 } ============================================ Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: fix RCU bugNick Piggin2010-11-121-25/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Salman Qazi describes the following radix-tree bug: In the following case, we get can get a deadlock: 0. The radix tree contains two items, one has the index 0. 1. The reader (in this case find_get_pages) takes the rcu_read_lock. 2. The reader acquires slot(s) for item(s) including the index 0 item. 3. The non-zero index item is deleted, and as a consequence the other item is moved to the root of the tree. The place where it used to be is queued for deletion after the readers finish. 3b. The zero item is deleted, removing it from the direct slot, it remains in the rcu-delayed indirect node. 4. The reader looks at the index 0 slot, and finds that the page has 0 ref count 5. The reader looks at it again, hoping that the item will either be freed or the ref count will increase. This never happens, as the slot it is looking at will never be updated. Also, this slot can never be reclaimed because the reader is holding rcu_read_lock and is in an infinite loop. The fix is to re-use the same "indirect" pointer case that requires a slot lookup retry into a general "retry the lookup" bit. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reported-by: Salman Qazi <sqazi@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'rcu/urgent' of ↵Ingo Molnar2010-10-071-15/+48
|\ | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
| * Merge branch 'radix-tree' of ↵Linus Torvalds2010-08-221-15/+48
| |\ | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev * 'radix-tree' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/xfsdev: radix-tree: radix_tree_range_tag_if_tagged() can set incorrect tags radix-tree: clear all tags in radix_tree_node_rcu_free
| | * radix-tree: radix_tree_range_tag_if_tagged() can set incorrect tagsDave Chinner2010-08-231-13/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ebf8aa44beed48cd17893a83d92a4403e5f9d9e2 ("radix-tree: omplement function radix_tree_range_tag_if_tagged") does not safely set tags on on intermediate tree nodes. The code walks down the tree setting tags before it has fully resolved the path to the leaf under the assumption there will be a leaf slot with the tag set in the range it is searching. Unfortunately, this is not a valid assumption - we can abort after setting a tag on an intermediate node if we overrun the number of tags we are allowed to set in a batch, or stop scanning because we we have passed the last scan index before we reach a leaf slot with the tag we are searching for set. As a result, we can leave the function with tags set on intemediate nodes which can be tripped over later by tag-based lookups. The result of these stale tags is that lookup may end prematurely or livelock because the lookup cannot make progress. The fix for the problem involves reocrding the traversal path we take to the leaf nodes, and only propagating the tags back up the tree once the tag is set in the leaf node slot. We are already recording the path for efficient traversal, so there is no additional overhead to do the intermediately node tag setting in this manner. This fixes a radix tree lookup livelock triggered by the new writeback sync livelock avoidance code introduced in commit f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback livelock avoidance using page tagging"). Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Jan Kara <jack@suse.cz>
| | * radix-tree: clear all tags in radix_tree_node_rcu_freeDave Chinner2010-08-231-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit f446daaea9d4a420d16c606f755f3689dcb2d0ce ("mm: implement writeback livelock avoidance using page tagging") introduced a new radix tree tag, increasing the number of tags in each node from 2 to 3. It did not, however, fix up the code in radix_tree_node_rcu_free() that cleans up after radix_tree_shrink() and hence could leave stray tags set in the new tag array. The result is that the livelock avoidance code added in the the above commit would hit stale tags when doing tag based lookups, resulting in livelocks when trying to traverse the tree. Fix this problem in radix_tree_node_rcu_free() so it doesn't happen again in the future by using a loop to walk all the tags up to RADIX_TREE_MAX_TAGS to clear the stray tags radix_tree_shrink() leaves behind. Signed-off-by: Dave Chinner <dchinner@redhat.com> Acked-by: Nick Piggin <npiggin@kernel.dk> Acked-by: Jan Kara <jack@suse.cz>
* | | Merge branch 'rcu/next' of ↵Ingo Molnar2010-08-231-1/+1
|\ \ \ | |/ / |/| | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
| * | radix-tree: __rcu annotationsArnd Bergmann2010-08-191-1/+1
| |/ | | | | | | | | | | | | Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* | lib/radix-tree.c: fix overflow in radix_tree_range_tag_if_tagged()Jan Kara2010-08-201-1/+4
|/ | | | | | | | | | | | When radix_tree_maxindex() is ~0UL, it can happen that scanning overflows index and tree traversal code goes astray reading memory until it hits unreadable memory. Check for overflow and exit in that case. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: omplement function radix_tree_range_tag_if_taggedJan Kara2010-08-091-0/+94
| | | | | | | | | | | | | | Implement function for setting one tag if another tag is set for each item in given range. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Chris Mason <chris.mason@oracle.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: fix radix_tree_prev_hole() underflow caseCesar Eduardo Barros2010-05-271-2/+2
| | | | | | | | | | | radix_tree_prev_hole() used LONG_MAX to detect underflow; however, ULONG_MAX is clearly what was intended, both here and by its only user (count_history_pages at mm/readahead.c). Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix_tree_tag_get() is not as safe as the docs make out [ver #2]David Howells2010-04-091-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set() or radix_tree_tag_clear(). The problem is that the double tag_get() in radix_tree_tag_get(): if (!tag_get(node, tag, offset)) saw_unset_tag = 1; if (height == 1) { int ret = tag_get(node, tag, offset); may see the value change due to the action of set/clear. RCU is no protection against this as no pointers are being changed, no nodes are being replaced according to a COW protocol - set/clear alter the node directly. The documentation in linux/radix-tree.h, however, says that radix_tree_tag_get() is an exception to the rule that "any function modifying the tree or tags (...) must exclude other modifications, and exclude any functions reading the tree". The problem is that the next statement in radix_tree_tag_get() checks that the tag doesn't vary over time: BUG_ON(ret && saw_unset_tag); This has been seen happening in FS-Cache: https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various comments that the value of the tag may change whilst the RCU read lock is held, and thus that the return value of radix_tree_tag_get() may not be relied upon unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from running concurrently with it. Reported-by: Romain DEGEZ <romain.degez@smartjog.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* radix-tree: Disable RCU lockdep checking in radix treePaul E. McKenney2010-02-251-12/+12
| | | | | | | | | | | | | | | | | | | | | | Because the radix tree is used with many different locking designs, we cannot do any effective checking without changing the radix-tree APIs. It might make sense to do this later, but only if the RCU lockdep checking proves itself sufficiently valuable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-10-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
* FS-Cache: Don't delete pending pages from the page-store tracking treeDavid Howells2009-11-191-2/+0
| | | | | | | Don't delete pending pages from the page-store tracking tree, but rather send them for another write as they've presumably been updated. Signed-off-by: David Howells <dhowells@redhat.com>
* FS-Cache: Use radix tree preload correctly in tracking of pages to be storedDavid Howells2009-11-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __fscache_write_page() attempts to load the radix tree preallocation pool for the CPU it is on before calling radix_tree_insert(), as the insertion must be done inside a pair of spinlocks. Use of the preallocation pool, however, is contingent on the radix tree being initialised without __GFP_WAIT specified. __fscache_acquire_cookie() was passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT. The solution is to AND out __GFP_WAIT. Additionally, the banner comment to radix_tree_preload() is altered to make note of this prerequisite. Possibly there should be a WARN_ON() too. Without this fix, I have seen the following recursive deadlock caused by radix_tree_insert() attempting to allocate memory inside the spinlocked region, which resulted in FS-Cache being called back into to release memory - which required the spinlock already held. ============================================= [ INFO: possible recursive locking detected ] 2.6.32-rc6-cachefs #24 --------------------------------------------- nfsiod/7916 is trying to acquire lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] but task is already holding lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] other info that might help us debug this: 5 locks held by nfsiod/7916: #0: (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #1: (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #2: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] #3: (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache] #4: (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache] stack backtrace: Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24 Call Trace: [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3 [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3 [<ffffffff8100e27d>] ? dump_trace+0x248/0x257 [<ffffffff8105ad70>] lock_acquire+0x57/0x6d [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffff8135467c>] _spin_lock+0x2c/0x3b [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache] [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs] [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs] [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31 [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff810934ca>] shrink_list+0x8d/0x8f [<ffffffff81093744>] shrink_zone+0x278/0x33c [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392 [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212 [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1 [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118 [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache] [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs] [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs] [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs] [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc] [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc] [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc] [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2 [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2 [<ffffffff81352433>] ? thread_return+0x3e/0x101 [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc] [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34 [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2 [<ffffffff8104bd21>] kthread+0x7a/0x82 [<ffffffff8100beda>] child_rip+0xa/0x20 [<ffffffff8100b87c>] ? restore_args+0x0/0x30 [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44 [<ffffffff8104bca7>] ? kthread+0x0/0x82 [<ffffffff8100bed0>] ? child_rip+0x0/0x20 Signed-off-by: David Howells <dhowells@redhat.com>
* lib: do code optimization for radix_tree_lookup() and radix_tree_lookup_slot()Huang Shijie2009-06-161-47/+26
| | | | | | | | | | | | | | | | | | | | radix_tree_lookup() and radix_tree_lookup_slot() have much the same code except for the return value. Introduce radix_tree_lookup_element() to do the real work. /* * is_slot == 1 : search for the slot. * is_slot == 0 : search for the node. */ static void * radix_tree_lookup_element(struct radix_tree_root *root, unsigned long index, int is_slot); Signed-off-by: Huang Shijie <shijie8@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: add radix_tree_prev_hole()Wu Fengguang2009-06-161-0/+37
| | | | | | | | | | | | | The counterpart of radix_tree_next_hole(). To be used by context readahead. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2009-01-071-5/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) trivial: chack -> check typo fix in main Makefile trivial: Add a space (and a comma) to a printk in 8250 driver trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx trivial: Fix misspelling of "firmware" in powerpc Makefile trivial: Fix misspelling of "firmware" in usb.c trivial: Fix misspelling of "firmware" in qla1280.c trivial: Fix misspelling of "firmware" in a100u2w.c trivial: Fix misspelling of "firmware" in megaraid.c trivial: Fix misspelling of "firmware" in ql4_mbx.c trivial: Fix misspelling of "firmware" in acpi_memhotplug.c trivial: Fix misspelling of "firmware" in ipw2100.c trivial: Fix misspelling of "firmware" in atmel.c trivial: Fix misspelled firmware in Kconfig trivial: fix an -> a typos in documentation and comments trivial: fix then -> than typos in comments and documentation trivial: update Jesper Juhl CREDITS entry with new email trivial: fix singal -> signal typo trivial: Fix incorrect use of "loose" in event.c trivial: printk: fix indentation of new_text_line declaration trivial: rtc-stk17ta8: fix sparse warning ...
| * trivial: radix-tree: document wrap-around issue of radix_tree_next_hole()Wu Fengguang2009-01-061-5/+6
| | | | | | | | | | | | | | And some 80-line cleanups. Signed-off-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | lib: radix_tree.c make percpu variable staticHarvey Harrison2009-01-061-1/+1
|/ | | | | | | | | | | radix_tree_preloads is unused outside of this file, make it static. Noticed by sparse: lib/radix-tree.c:84:1: warning: symbol 'per_cpu__radix_tree_preloads' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* SL*B: drop kmem cache argument from constructorAlexey Dobriyan2008-07-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: add gang_lookup_slot, gang_lookup_slot_tagNick Piggin2008-07-261-23/+155
| | | | | | | | | | | | | | Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which are used by lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Christoph has movedChristoph Lameter2008-07-041-1/+1
| | | | | | | | | | | | | | | Remove all clameter@sgi.com addresses from the kernel tree since they will become invalid on June 27th. Change my maintainer email address for the slab allocators to cl@linux-foundation.org (which will be the new email address for the future). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: fix small lockless radix-tree bugNick Piggin2008-06-121-58/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We shrink a radix tree when its root node has only one child, in the left most slot. The child becomes the new root node. To perform this operation in a manner compatible with concurrent lockless lookups, we atomically switch the root pointer from the parent to its child. However a concurrent lockless lookup may now have loaded a pointer to the parent (and is presently deciding what to do next). For this reason, we also have to keep the parent node in a valid state after shrinking the tree, until the next RCU grace period -- otherwise this lookup with the parent pointer may not do the right thing. Notably, we need to keep the child in the left most slot there in case that is requested by the lookup. This is all pretty standard RCU stuff. It is worth repeating because in my eagerness to obey the radix tree node constructor scheme, I had broken it by zeroing the radix tree node before the grace period. What could happen is that a lookup can load the parent pointer, then decide it wants to follow the left most child slot, only to find the slot contained NULL due to the concurrent shrinker having zeroed the parent node before waiting for a grace period. The lookup would return a false negative as a result. Fix it by doing that clearing in the RCU callback. I would normally want to rip out the constructor entirely, but radix tree nodes are one of those places where they make sense (only few cachelines will be touched soon after allocation). This was never actually found in any lockless pagecache testing or by the test harness, but by seeing the odd problem with my scalable vmap rewrite. I have not tickled the test harness into reproducing it yet, but I'll keep working at it. Fortunately, it is not a problem anywhere lockless pagecache is used in mainline kernels (pagecache probe is not a guarantee, and brd does not have concurrent lookups and deletes). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove set_migrateflags()Christoph Lameter2008-04-281-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | Migrate flags must be set on slab creation as agreed upon when the antifrag logic was reviewed. Otherwise some slabs of a slabcache will end up in the unmovable and others in the reclaimable section depending on which flag was active when a new slab page was allocated. This likely slid in somehow when antifrag was merged. Remove it. The buffer_heads are always allocated with __GFP_RECLAIMABLE because the SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any effect there. Radix tree allocations are not directly reclaimable but they are allocated with __GFP_RECLAIMABLE set on each allocation. We now set SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix tree slabs are consistently placed in the reclaimable section. Radix tree slabs will also be accounted as such. There is then no user left of set_migratepages. So remove it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: avoid atomic allocations for preloaded insertionsNick Piggin2008-02-051-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* avoid negative (and full-width) shifts in radix-tree.cPeter Lund2007-10-171-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Negative shifts are not allowed in C (the result is undefined). Same thing with full-width shifts. It works on most platforms but not on the VAX with gcc 4.0.1 (it results in an "operand reserved" fault). Shifting by more than the width of the value on the left is also not allowed. I think the extra '>> 1' tacked on at the end in the original code was an attempt to work around that. Getting rid of that is an extra feature of this patch. Here's the chapter and verse, taken from the final draft of the C99 standard ("6.5.7 Bitwise shift operators", paragraph 3): "The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined." Thank you to Jan-Benedict Glaw, Christoph Hellwig, Maciej Rozycki, Pekka Enberg, Andreas Schwab, and Christoph Lameter for review. Special thanks to Andreas for spotting that my fix only removed half the undefined behaviour. Signed-off-by: Peter Lund <firefly@vax64.dk> Christoph Lameter <clameter@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andreas Schwab <schwab@suse.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Slab API: remove useless ctor parameter and reorder parametersChristoph Lameter2007-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Group short-lived and reclaimable kernel allocationsMel Gorman2007-10-161-2/+4
| | | | | | | | | | | | | | | | | | This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix the max path calculation in radix-tree.cJeff Moyer2007-10-161-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A while back, Nick Piggin introduced a patch to reduce the node memory usage for small files (commit cfd9b7df4abd3257c9e381b0e445817b26a51c0c): -#define RADIX_TREE_MAP_SHIFT 6 +#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) Unfortunately, he didn't take into account the fact that the calculation of the maximum path was based on an assumption of having to round up: #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2) So, if CONFIG_BASE_SMALL is set, you will end up with a RADIX_TREE_MAX_PATH that is one greater than necessary. The practical upshot of this is just a bit of wasted memory (one long in the height_to_maxindex array, an extra pre-allocated radix tree node per cpu, and extra stack usage in a couple of functions), but it seems worth getting right. It's also worth noting that I never build with CONFIG_BASE_SMALL. What I did to test this was duplicate the code in a small user-space program and check the results of the calculations for max path and the contents of the height_to_maxindex array. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* radix-tree: use indirect bitNick Piggin2007-10-161-26/+43
| | | | | | | | | | | | | | | | Rather than sign direct radix-tree pointers with a special bit, sign the indirect one that hangs off the root. This means that, given a lookup_slot operation, the invalid result will be differentiated from the valid (previously, valid results could have the bit either set or clear). This does not affect slot lookups which occur under lock -- they can never return an invalid result. Is needed in future for lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud