op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	UBIFS: check buffer length when scanning for LPT nodes	Adrian Hunter	2008-09-30	1	-0/+2
	'is_a_node()' function was reading from a buffer before checking the buffer length, resulting in an OOPS as follows:

	BUG: unable to handle kernel paging request at f8f74002
	IP: [<f8f9783f>] :ubifs:ubifs_unpack_bits+0xca/0x233
	pde = 19e95067 pte = 00000000
	Oops: 0000 [#1] PREEMPT SMP
	Modules linked in: ubifs ubi mtdchar bio2mtd mtd brd video output [last unloaded: mtd]
	Pid: 6414, comm: integck Not tainted (2.6.27-rc6ubifs34 #23)
	EIP: 0060:[<f8f9783f>] EFLAGS: 00010246 CPU: 0
	EIP is at ubifs_unpack_bits+0xca/0x233 [ubifs]
	EAX: 00000000 EBX: f6090630 ECX: d9badcfc EDX: 00000000
	ESI: 00000004 EDI: f8f74002 EBP: d9badcec ESP: d9badcc0
	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
	Process integck (pid: 6414, ti=d9bac000 task=f727dae0 task.ti=d9bac000)
	Stack: 00000006 f7306240 00000002 00000000 d9badcfc d9badd00 0000001c 00000000 f6090630 f6090630 f8f74000 d9badd10 f8fa1cc9 00000000 f8f74002 00000000 f8f74002 f60fe128 f6090630 f8f74000 d9badd68 f8fa1e46 00000000 0001e000
	Call Trace:
	 [<f8fa1cc9>] ? is_a_node+0x30/0x90 [ubifs]
	 [<f8fa1e46>] ? dbg_check_ltab+0x11d/0x5bd [ubifs]
	 [<f8fa388f>] ? ubifs_lpt_start_commit+0x42/0xed3 [ubifs]
	 [<c038e76a>] ? mutex_unlock+0x8/0xa
	 [<f8f9625d>] ? ubifs_tnc_start_commit+0x1c8/0xedb [ubifs]
	 [<f8f8d90b>] ? do_commit+0x187/0x523 [ubifs]
	 [<c038e76a>] ? mutex_unlock+0x8/0xa
	 [<f8f7ca17>] ? bud_wbuf_callback+0x22/0x28 [ubifs]
	 [<f8f8dd1d>] ? ubifs_run_commit+0x76/0xc0 [ubifs]
	 [<f8f8032c>] ? ubifs_sync_fs+0xd2/0xe6 [ubifs]
	 [<c01a2e97>] ? vfs_quota_sync+0x0/0x17e
	 [<c01a5ba6>] ? quota_sync_sb+0x26/0xbb
	 [<c01a2e97>] ? vfs_quota_sync+0x0/0x17e
	 [<c01a5c5d>] ? sync_dquots+0x22/0x12c
	 [<c0173d1b>] ? __fsync_super+0x19/0x68
	 [<c0173d75>] ? fsync_super+0xb/0x19
	 [<c0174065>] ? generic_shutdown_super+0x22/0xe7
	 [<c01a31fc>] ? vfs_quota_off+0x0/0x5fd
	 [<f8f7cf4d>] ? ubifs_kill_sb+0x31/0x35 [ubifs]
	 [<c01741f9>] ? deactivate_super+0x5e/0x71
	 [<c0187610>] ? mntput_no_expire+0x82/0xe4
	 [<c0187905>] ? sys_umount+0x4c/0x2f6
	 [<c0187bc8>] ? sys_oldumount+0x19/0x1b
	 [<c0103b71>] ? sysenter_do_call+0x12/0x25
	=======================
	Code: c1 f8 03 8d 04 07 8b 4d e8 89 01 8b 45 e4 89 10 89 d8 89 f1 d3 e8 85 c0 74 07 29 d6 83 fe 20 75 2a 89 d8 83 c4 20 5b 5e 5f 5d
	EIP: [<f8f9783f>] ubifs_unpack_bits+0xca/0x233 [ubifs] SS:ESP 0068:d9badcc0
	---[ end trace 1f02572436518c13 ]---

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
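The fix described above amounts to validating the remaining buffer length before touching any byte of it. A minimal user-space sketch of that ordering follows; the function name, the two-byte CRC prefix and the four-bit type field are assumptions made for illustration, not the actual UBIFS LPT layout or code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed layout, for illustration only: 2 CRC bytes followed by one
 * byte whose low 4 bits hold the node type.
 */
#define SKETCH_CRC_BYTES 2
#define SKETCH_TYPE_MASK 0x0f

/* Return 1 if buf plausibly starts a node of expected_type, else 0. */
int looks_like_node(const uint8_t *buf, size_t len, unsigned int expected_type)
{
	assert(buf != NULL);

	/* Check the length first: never read past the end of the buffer. */
	if (len < SKETCH_CRC_BYTES + 1)
		return 0;

	/* Only now is it safe to look at the type byte. */
	return (buf[SKETCH_CRC_BYTES] & SKETCH_TYPE_MASK) == expected_type;
}
```

A caller scanning a buffer would pass the number of bytes still available at the current position, so a scan that reaches the last few bytes returns 0 instead of faulting.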
*	UBIFS: correct condition to eliminate unecessary assignment	Adrian Hunter	2008-09-30	1	-1/+1
	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: add more debugging messages for LPT	Adrian Hunter	2008-09-30	5	-13/+228
	Also add debugging checks for LPT size and separate out c->check_lpt_free from unrelated bitfields.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: fix bulk-read handling uptodate pages	Adrian Hunter	2008-09-30	1	-5/+11
	Bulk-read skips uptodate pages but this was putting its array index out and causing it to treat subsequent pages as holes.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: improve garbage collection	Adrian Hunter	2008-09-30	1	-10/+72
	Make garbage collection try to keep data nodes from the same inode together and in ascending order. This improves performance when reading those nodes, especially when bulk-read is used.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: allow for sync_fs when read-only	Adrian Hunter	2008-09-30	1	-9/+10
	sync_fs can be called even if the file system is mounted read-only. Ensure the commit is not run in that case.

	Reported-by: Zoltan Sogor <weth@inf.u-szeged.hu>
	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: commit on sync_fs	Artem Bityutskiy	2008-09-30	1	-0/+12
	Commit the journal when the FS is sync'ed. This will make statfs provide a better free space report. And we advise our users to sync the FS anyway if they want a better statfs report.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: correct comment for commit_on_unmount	Artem Bityutskiy	2008-09-30	1	-6/+3
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: update dbg_dump_inode	Artem Bityutskiy	2008-09-30	1	-17/+25
	'dbg_dump_inode()' is quite outdated and does not print all the fields.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: fix commentary	Artem Bityutskiy	2008-09-30	1	-2/+2
	A znode may refer to both data nodes and indexing nodes.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: fix races in bit-fields	Artem Bityutskiy	2008-09-30	1	-8/+9
	We cannot store bit-fields together if the processes which change them may race, unless we serialize them. Thus, move the nospc and nospc_rp bit-fields away from the mount option/constant bit-fields, to avoid races.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
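The reason adjacent bit-fields can race is that they usually live in the same machine word, so updating one is a read-modify-write of the word that also holds the other. The sketch below probes this in user space with two hypothetical structs (they are not the real `struct ubifs_info`): one packs both flags together, the other puts an ordinary member between them. The result depends on the ABI; the comments describe what common GCC/Clang targets do.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layouts for illustration only. */
struct flags_shared {
	unsigned int bulk_read:1;	/* set once, at mount time */
	unsigned int nospc:1;		/* toggled at run time */
};

struct flags_split {
	unsigned int bulk_read:1;
	int mount_generation;		/* ordinary member ends the bit-field unit */
	unsigned int nospc:1;
};

/* Index of the unsigned-int-sized word holding the first non-zero byte. */
static int word_of_first_set_byte(const void *obj, size_t len)
{
	const unsigned char *p = obj;

	for (size_t i = 0; i < len; i++)
		if (p[i])
			return (int)(i / sizeof(unsigned int));
	return -1;
}

/* Returns 1 when both flags of flags_shared land in the same word. */
int shared_layout_shares_word(void)
{
	struct flags_shared a, b;

	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.bulk_read = 1;
	b.nospc = 1;

	int wa = word_of_first_set_byte(&a, sizeof(a));
	int wb = word_of_first_set_byte(&b, sizeof(b));
	assert(wa >= 0 && wb >= 0);
	return wa == wb;
}

/* Returns 1 when both flags of flags_split land in the same word. */
int split_layout_shares_word(void)
{
	struct flags_split a, b;

	memset(&a, 0, sizeof(a));
	memset(&b, 0, sizeof(b));
	a.bulk_read = 1;
	b.nospc = 1;

	int wa = word_of_first_set_byte(&a, sizeof(a));
	int wb = word_of_first_set_byte(&b, sizeof(b));
	assert(wa >= 0 && wb >= 0);
	return wa == wb;
}
```

On typical targets the first function reports a shared word and the second does not, which is the property the commit relies on when it moves the run-time flags away from the mount-time ones.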
*	UBIFS: ensure data read beyond i_size is zeroed out correctly	Adrian Hunter	2008-09-30	2	-3/+8
	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: correct key comparison	Adrian Hunter	2008-09-30	1	-2/+2
	The comparison was working, but more by accident than design.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: use bit-fields when possible	Artem Bityutskiy	2008-09-30	1	-13/+11
	The "bulk_read" and "no_chk_data_crc" have only 2 values - 0 and 1. We already have bit-fields in corresponding data structures, so make "bulk_read" and "no_chk_data_crc" bit-fields as well.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: check data CRC when in error state	Artem Bityutskiy	2008-09-30	1	-0/+1
	When UBIFS switches to R/O mode because of an error, it is reasonable to enable data CRC checking.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: improve znode splitting rules	Adrian Hunter	2008-09-30	1	-21/+33
	When inserting into a full znode it is split into two znodes. Because data node keys are usually consecutive, it is better to try to keep them together. This patch does a better job of that.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: add no_chk_data_crc mount option	Adrian Hunter	2008-09-30	5	-9/+55
	UBIFS read performance can be improved by skipping the CRC check when data nodes are read. This option can be used if the underlying media is considered to be highly reliable. Note that CRCs are always checked for metadata.

	Read speed on Arm platform with OneNAND goes from 19 MiB/s to 27 MiB/s with data CRC checking disabled.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: add bulk-read facility	Adrian Hunter	2008-09-30	5	-3/+626
	Some flash media are capable of reading sequentially at faster rates. UBIFS bulk-read facility is designed to take advantage of that, by reading in one go consecutive data nodes that are also located consecutively in the same LEB.

	Read speed on Arm platform with OneNAND goes from 17 MiB/s to 19 MiB/s.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
*	UBIFS: use an IS_ERR test rather than a NULL test	Julien Brunel	2008-09-30	1	-4/+0
	In case of error, the function kthread_create returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes before an IS_ERR test should be deleted.

	The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/)

	// <smpl>
	@match_bad_null_test@
	expression x, E;
	statement S1,S2;
	@@
	x = kthread_create(...)
	... when != x = E
	* if (x == NULL)
	S1 else S2
	// </smpl>

	Signed-off-by: Julien Brunel <brunel@diku.dk>
	Signed-off-by: Julia Lawall <julia@diku.dk>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
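To see why a NULL test can never catch this kind of failure, it helps to look at how error pointers are encoded: a small negative errno is cast to a pointer, which puts it at the very top of the address space rather than at zero. The following is a user-space re-creation of that convention for illustration, not the kernel's `<linux/err.h>`; `create_worker()` is a hypothetical stand-in for a function such as `kthread_create()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* errno values are assumed to fit in 1..4095 */

/* Encode a negative errno as a pointer value. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Recover the negative errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True only for pointers in the top MAX_ERRNO bytes of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int dummy_worker;

/* Hypothetical creator: returns a valid pointer or an error pointer, never NULL. */
void *create_worker(int fail)
{
	if (fail)
		return ERR_PTR(-ENOMEM);
	return &dummy_worker;
}

/* Correct caller: tests IS_ERR() only; a NULL test here would be dead code. */
int start_worker(int fail)
{
	void *w = create_worker(fail);

	if (IS_ERR(w))
		return (int)PTR_ERR(w);
	return 0;
}
```

Because `IS_ERR(NULL)` is false and the creator never returns NULL, an `if (x == NULL)` branch placed before the `IS_ERR()` check can never fire, which is what the semantic patch flags.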
*	UBIFS: inline one-line functions	Artem Bityutskiy	2008-09-30	3	-30/+27
	'ubifs_get_lprops()' and 'ubifs_release_lprops()' basically wrap mutex lock and unlock. We have them because we want the lprops subsystem to be separate and as independent as possible. And we planned better locking rules for lprops. Anyway, because they are short, it is better to inline them.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: remove unneeded unlikely()	Hirofumi Nakagawa	2008-09-30	4	-9/+9
	IS_ERR() macro already has unlikely(), so do not use constructions like 'if (unlikely(IS_ERR())'.

	Signed-off-by: Hirofumi Nakagawa <hnakagawa@miraclelinux.com>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	UBIFS: add a print, fix comments and more minor stuff	Artem Bityutskiy	2008-09-30	3	-24/+24
	This commit adds a reserved pool size print and tweaks the prints to make them look nicer. It also fixes and cleans-up some comments. Additionally, it deletes some blank lines to make the code look a little nicer. In other words, nothing essential.

	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
*	mm owner: fix race between swapoff and exit	Balbir Singh	2008-09-29	1	-1/+1
	There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration.

	CPU0                                    CPU1
	                                        try_to_unuse
	                                        looks at mm = task0->mm
	                                        increments mm->mm_users
	task 0 exits
	mm->owner needs to be updated, but no
	new owner is found (mm_users > 1, but
	no other task has task->mm = task0->mm)
	mm_update_next_owner() leaves
	                                        mmput(mm) decrements mm->mm_users
	task0 freed
	                                        dereferencing mm->owner fails

	The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL.

	Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops.

	Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops.

	Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm.

	Reported-by: Hugh Dickins <hugh@veritas.com>
	Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
	Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
	Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
	Signed-off-by: Hugh Dickins <hugh@veritas.com>
	Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
	Cc: Paul Menage <menage@google.com>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	Fix NULL pointer dereference in proc_sys_compare	Linus Torvalds	2008-09-29	1	-4/+6
	The VFS interface for the 'd_compare()' is a bit special (read: 'odd'), because it really just essentially replaces a memcmp(). The filesystem is supposed to just compare the two names with whatever case-independent or other function.

	And when I say 'is supposed to', I obviously mean that 'procfs does odd things, and actually looks at the dentry that we don't even pass down, rather than just the name'. Which results in problems, because we actually call d_compare before we have even verified that the dentry is still hashed at all.

	And that causes a problem since the inode that procfs looks at may have been free'd and the d_inode pointer is NULL. procfs just assumes that all dentries are positive, since procfs itself never generates a negative one. But memory pressure will still result in the dentry getting torn down, and as it is removed by RCU, it still remains visible on some lists - and to d_compare.

	If the filesystem just did a name comparison, we wouldn't care. And we could just fix procfs to know about negative dentries too. But rather than have the low-level filesystems know about internal VFS details, just move the check for an unhashed dentry up a bit, so that we will only call d_compare on dentries that are still active.

	The actual oops this caused didn't look like a NULL pointer dereference because procfs did a 'container_of(inode, struct proc_inode, vfs_inode)' to get at its internal proc_inode information from the inode pointer, and accessed a field below the inode. So the oops would look something like

	 BUG: unable to handle kernel paging request at fffffffffffffff0
	 IP: [<ffffffff802bc6c6>] proc_sys_compare+0x36/0x50

	and was seen on both x86-64 (Alexey Dobriyan and Hugh Dickins) and ppc64 (Hugh Dickins).

	Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
	Acked-by: Hugh Dickins <hugh@veritas.com>
	Cc: Al Viro <viro@ZenIV.linux.org.uk>
	Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
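The address arithmetic behind that oops can be sketched without dereferencing anything: `container_of()` subtracts the member offset from the member pointer, so a NULL inode pointer yields an address just below zero, which wraps to the top of the address space. The struct names and field order below are stand-ins chosen for illustration and are not claimed to match the real `struct proc_inode`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types; the real kernel structures are much larger. */
struct inode_stub {
	long i_ino;
};

struct proc_inode_stub {
	void *sysctl;			/* fields placed before the embedded inode */
	void *sysctl_entry;
	struct inode_stub vfs_inode;	/* embedded inode, as with container_of() users */
};

/* Address container_of() would compute for a given inode address. */
uintptr_t container_addr(uintptr_t inode_addr)
{
	return inode_addr - offsetof(struct proc_inode_stub, vfs_inode);
}

/* Address of the 'sysctl' field when the inode pointer is NULL. */
uintptr_t field_addr_for_null_inode(void)
{
	assert(offsetof(struct proc_inode_stub, vfs_inode) > 0);
	return container_addr(0) + offsetof(struct proc_inode_stub, sysctl);
}
```

With 8-byte pointers this stub layout puts the embedded inode at offset 16, so the computed address is 16 bytes below zero; the fault then shows up as a paging request near the top of the address space rather than at address 0.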
*	Merge git://oss.sgi.com:8090/xfs/linux-2.6	Linus Torvalds	2008-09-26	1	-91/+3
	* git://oss.sgi.com:8090/xfs/linux-2.6:
	  [XFS] Remove xfs_iext_irec_compact_full()
	  [XFS] Fix extent list corruption in xfs_iext_irec_compact_full().
\| *	[XFS] Remove xfs_iext_irec_compact_full()	Lachlan McIlroy	2008-09-26	1	-92/+3
	Yet another bug was found in xfs_iext_irec_compact_full() and while the source of the bug was found it wasn't an easy task to track it down because the conditions are very difficult to reproduce.

	A HUGE thank-you goes to Russell Cattelan and Eric Sandeen for their significant effort in tracking down the source of this corruption.

	xfs_iext_irec_compact_full() and xfs_iext_irec_compact_pages() are almost identical - they both compact indirect extent lists by moving extents from subsequent buffers into earlier ones. xfs_iext_irec_compact_pages() only moves extents if all of the extents in the next buffer will fit into the empty space in the buffer before it. xfs_iext_irec_compact_full() will go a step further and move part of the next buffer if all the extents won't fit. It will then shift the remaining extents in the next buffer up to the start of the buffer. The bug here was that we did not update er_extoff and this caused extent list corruption.

	It does not appear that this extra functionality gains us much. Calling xfs_iext_irec_compact_pages() instead will do a good enough job at compacting the indirect list and will be quicker too.

	For the case in xfs_iext_indirect_to_direct() the total number of extents in the indirect list will fit into one buffer so we will never need the extra functionality of xfs_iext_irec_compact_full() there.

	Also xfs_iext_irec_compact_pages() doesn't need to do a memmove() (the buffers will never overlap) so we don't want the performance hit that can incur.

	SGI-PV: 987159
	SGI-Modid: xfs-linux-melb:xfs-kern:32166a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
\| *	[XFS] Fix extent list corruption in xfs_iext_irec_compact_full().	Lachlan McIlroy	2008-09-26	1	-0/+1
	If we don't move all the records from the next buffer into the current buffer then we need to update the er_extoff field of the next buffer as we shift the remaining records to the start of the buffer.

	SGI-PV: 987159
	SGI-Modid: xfs-linux-melb:xfs-kern:32165a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
	Signed-off-by: Russell Cattelan <cattelan@thebarn.com>
* \|	Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6	Linus Torvalds	2008-09-26	6	-9/+15
	* 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6:
	  UBIFS: fix printk format warnings
	  UBIFS: remove incorrect assert
	  UBIFS: TNC / GC race fixes
	  UBIFS: create the name of the background thread in every case
\| *	UBIFS: fix printk format warnings	Alexander Beregalov	2008-09-18	2	-2/+2
	fs/ubifs/dir.c:428: warning: format '%llu' expects type 'long long unsigned int', but argument 5 has type 'long unsigned int'
	fs/ubifs/debug.c:541: warning: format '%llu' expects type 'long long unsigned int', but argument 2 has type 'long unsigned int'

	Signed-off-by: Alexander Beregalov <a.beregalov@gmail.com>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
\| *	UBIFS: remove incorrect assert	Adrian Hunter	2008-09-17	1	-1/+0
	The assert was not valid because one of the variables 'taken_empty_lebs' has transient values out of sync with the other variables.

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
\| *	UBIFS: TNC / GC race fixes	Adrian Hunter	2008-09-17	2	-4/+12
	- update GC sequence number if any nodes may have been moved even if GC did not finish the LEB
	- don't ignore error return when reading

	Signed-off-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
\| *	UBIFS: create the name of the background thread in every case	Sebastian Siewior	2008-09-17	1	-2/+1
	If the ubifs partition is mounted RO and then remounted RW we end up with no thread name in ubifs_remount_rw() and the thread appears nameless.

	Signed-off-by: Sebastian Siewior <bigeasy@linutronix.de>
	Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* \|	9p: use an IS_ERR test rather than a NULL test	Julien Brunel	2008-09-24	1	-2/+1
	In case of error, the function p9_client_walk returns an ERR pointer, but never returns a NULL pointer. So a NULL test that comes after an IS_ERR test should be deleted.

	The semantic match that finds this problem is as follows: (http://www.emn.fr/x-info/coccinelle/)

	// <smpl>
	@match_bad_null_test@
	expression x, E;
	statement S1,S2;
	@@
	x = p9_client_walk(...)
	... when != x = E
	* if (x != NULL)
	S1 else S2
	// </smpl>

	Signed-off-by: Julien Brunel <brunel@diku.dk>
	Signed-off-by: Julia Lawall <julia@diku.dk>
	Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* \|	[XFS] Don't do I/O beyond eof when unreserving space	Lachlan McIlroy	2008-09-17	1	-0/+18
	When unreserving space with boundaries that are not block aligned we round up the start and round down the end boundaries and then use this function, xfs_zero_remaining_bytes(), to zero the parts of the blocks that got dropped during the rounding. The problem is we don't consider if these blocks are beyond eof. Worse still is if we encounter delayed allocations beyond eof we will try to use the magic delayed allocation block number as a real block number. If the file size is ever extended to expose these blocks then we'll go through xfs_zero_eof() to zero them anyway.

	SGI-PV: 983683
	SGI-Modid: xfs-linux-melb:xfs-kern:32055a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
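The rounding described above can be worked through with plain integer arithmetic. The sketch below is a simplified model, not XFS code: `partial_block_ranges()`, `struct byte_range` and the `eof` clamp are names and choices made up for illustration. Given a byte range to unreserve, it returns the two partial-block pieces left over after rounding (the pieces that would need zeroing), each clipped so that it never extends past end of file.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open byte range [start, end); empty when start >= end. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t round_down_to(uint64_t x, uint64_t block_size)
{
	return x - (x % block_size);
}

static uint64_t round_up_to(uint64_t x, uint64_t block_size)
{
	return round_down_to(x + block_size - 1, block_size);
}

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Split [start, end) into the partial-block pieces at its head and tail.
 * Whole blocks lie in [round_up(start), round_down(end)) and are not
 * reported here. Both pieces are clamped to eof.
 */
void partial_block_ranges(uint64_t start, uint64_t end, uint64_t block_size,
			  uint64_t eof, struct byte_range *head,
			  struct byte_range *tail)
{
	assert(block_size > 0 && start <= end);

	uint64_t first_whole = round_up_to(start, block_size);
	uint64_t last_whole = round_down_to(end, block_size);

	if (first_whole > last_whole) {
		/* start and end fall inside the same block: one piece only */
		head->start = start;
		head->end = end;
		tail->start = end;
		tail->end = end;
	} else {
		head->start = start;
		head->end = first_whole;
		tail->start = last_whole;
		tail->end = end;
	}

	/* Never report bytes at or beyond end of file. */
	head->end = min_u64(head->end, eof);
	tail->end = min_u64(tail->end, eof);
}
```

For example, with 4096-byte blocks, a request for bytes 1000 to 10000 on a 9000-byte file gives a head piece of [1000, 4096) and a tail piece of [8192, 9000); the bytes from 9000 onward are left alone because they lie beyond eof.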
* \|	[XFS] Fix use-after-free with buffers	Lachlan McIlroy	2008-09-17	1	-24/+20
	We have a use-after-free issue where log completions access buffers via the buffer log item and the buffer has already been freed. Fix this by taking a reference on the buffer when attaching the buffer log item and release the hold when the buffer log item is detached and we no longer need the buffer. Also create a new function xfs_buf_item_free() to combine some common code.

	SGI-PV: 985757
	SGI-Modid: xfs-linux-melb:xfs-kern:32025a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
* \|	[XFS] Prevent lockdep false positives when locking two inodes.	David Chinner	2008-09-17	2	-1/+16
	If we call xfs_lock_two_inodes() to grab both the iolock and the ilock, then drop the ilocks on both inodes, then grab them again (as xfs_swap_extents() does) then lockdep will report a locking order problem. This is a false positive.

	To avoid this, disallow xfs_lock_two_inodes() from locking both inode locks at once - force callers to make two separate calls. This means that nested dropping and regaining of the ilocks will retain the same lockdep subclass and so lockdep will not see anything wrong with this code.

	SGI-PV: 986238
	SGI-Modid: xfs-linux-melb:xfs-kern:31999a
	Signed-off-by: David Chinner <david@fromorbit.com>
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
	Signed-off-by: Peter Leckie <pleckie@sgi.com>
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* \|	[XFS] Fix barrier status change detection.	David Chinner	2008-09-17	1	-1/+1
	The current code in xlog_iodone() uses the wrong macro to check if the barrier has been cleared due to an EOPNOTSUPP error from the lower layer.

	SGI-PV: 986143
	SGI-Modid: xfs-linux-melb:xfs-kern:31984a
	Signed-off-by: David Chinner <david@fromorbit.com>
	Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
	Signed-off-by: Peter Leckie <pleckie@sgi.com>
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* \|	[XFS] Prevent direct I/O from mapping extents beyond eof	Lachlan McIlroy	2008-09-17	1	-0/+4
	With help from some tracing I found that we try to map extents beyond eof when doing a direct I/O read. It appears that the way to inform the generic direct I/O path (ie do_direct_IO()) that we have breached eof is to return an unmapped buffer from xfs_get_blocks_direct(). This will cause do_direct_IO() to jump to the hole handling code where it will check for eof and then abort.

	This problem was found because a direct I/O read was trying to map beyond eof and was encountering delayed allocations. The delayed allocations beyond eof are speculative allocations and they didn't get converted when the direct I/O flushed the file because there was only enough space in the current AG to convert and write out the dirty pages within eof. Note that xfs_iomap_write_allocate() won't necessarily convert all the delayed allocation passed to it - it will return after allocating the first extent - so if the delayed allocation extends beyond eof then it will stay that way.

	SGI-PV: 983683
	SGI-Modid: xfs-linux-melb:xfs-kern:31929a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
* \|	[XFS] Fix regression introduced by remount fixup	Christoph Hellwig	2008-09-17	1	-0/+20
	Logically we would return an error in xfs_fs_remount code to prevent users from believing they might have changed mount options using remount which can't be changed.

	But unfortunately mount(8) adds all options from mtab and fstab to the mount arguments in some cases so we can't blindly reject options, but have to check for each specified option if it actually differs from the currently set option and only reject it if that's the case.

	Until that is implemented we return success for every remount request, and silently ignore all options that we can't actually change.

	SGI-PV: 985710
	SGI-Modid: xfs-linux-melb:xfs-kern:31908a
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
	Signed-off-by: Tim Shimmin <tes@sgi.com>
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
* \|	[XFS] Move memory allocations for log tracing out of the critical path	Lachlan McIlroy	2008-09-17	2	-21/+40
	Memory allocations for log->l_grant_trace and iclog->ic_trace are done on demand when the first event is logged. In xlog_state_get_iclog_space() we call xlog_trace_iclog() under a spinlock and allocating memory here can cause us to sleep with a spinlock held and deadlock the system.

	For the log grant tracing we use KM_NOSLEEP but that means we can lose trace entries. Since there is no locking to serialize the log grant tracing we could race and have multiple allocations and leak memory.

	So move the allocations to where we initialize the log/iclog structures. Use KM_NOFS to avoid recursing into the filesystem and drop log->l_trace since it's not even used.

	SGI-PV: 983738
	SGI-Modid: xfs-linux-melb:xfs-kern:31896a
	Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
	Signed-off-by: Christoph Hellwig <hch@infradead.org>
* \|	rescan_partitions(): make device capacity errors non-fatal	Andrew Morton	2008-09-13	1	-2/+2
	Herton Krzesinski reports that the error-checking changes in 04ebd4aee52b06a2c38127d9208546e5b96f3a19 ("block/ioctl.c and fs/partition/check.c: check value returned by add_partition") cause his buggy USB camera to no longer mount. "The camera is an Olympus X-840. The original issue comes from the camera itself: its format program creates a partition with an off by one error".

	Buggy devices happen. It is better for the kernel to warn and to proceed with the mount.

	Reported-by: Herton Ronaldo Krzesinski <herton@mandriva.com.br>
	Cc: Abdel Benamrouche <draconux@gmail.com>
	Cc: Jens Axboe <jens.axboe@oracle.com>
	Cc: Alan Stern <stern@rowland.harvard.edu>
	Cc: David Brownell <david-b@pacbell.net>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	mm: ifdef Quicklists in /proc/meminfo	Hugh Dickins	2008-09-13	1	-4/+8
	A "Quicklists: 0 kB" line has just started appearing in /proc/meminfo, but most architectures (including x86) don't have them configured, so #ifdef it, like the highmem lines.

	And those architectures which do have quicklists configured are using them for page tables: so let's place it next to PageTables.

	Signed-off-by: Hugh Dickins <hugh@veritas.com>
	Acked-by: Christoph Lameter <cl@linux-foundation.org>
	Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	bfs: fix Lockdep warning	Eric Sesterhenn	2008-09-13	1	-1/+1
	This fixes:

	=============================================
	[ INFO: possible recursive locking detected ]
	2.6.27-rc5-00283-g70bb089 #68
	---------------------------------------------
	touch/6855 is trying to acquire lock:
	 (&info->bfs_lock){--..}, at: [<c02262f5>] bfs_delete_inode+0x9e/0x18c

	but task is already holding lock:
	 (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187

	other info that might help us debug this:
	2 locks held by touch/6855:
	 #0: (&type->i_mutex_dir_key#5){--..}, at: [<c018ad13>] do_filp_open+0x10b/0x62f
	 #1: (&info->bfs_lock){--..}, at: [<c0226c00>] bfs_create+0x45/0x187

	stack backtrace:
	Pid: 6855, comm: touch Not tainted 2.6.27-rc5-00283-g70bb089 #68
	 [<c013e769>] validate_chain+0x458/0x9f4
	 [<c013bece>] ? trace_hardirqs_off+0xb/0xd
	 [<c013f36b>] __lock_acquire+0x666/0x6e0
	 [<c013f440>] lock_acquire+0x5b/0x77
	 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
	 [<c06aab74>] mutex_lock_nested+0xbc/0x234
	 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
	 [<c02262f5>] ? bfs_delete_inode+0x9e/0x18c
	 [<c02262f5>] bfs_delete_inode+0x9e/0x18c
	 [<c0226257>] ? bfs_delete_inode+0x0/0x18c
	 [<c01925e1>] generic_delete_inode+0x94/0xfe
	 [<c019265d>] generic_drop_inode+0x12/0x12f
	 [<c0191b7e>] iput+0x4b/0x4e
	 [<c0226d1e>] bfs_create+0x163/0x187
	 [<c0188b42>] vfs_create+0xa6/0x114
	 [<c018adb5>] do_filp_open+0x1ad/0x62f
	 [<c0107cdc>] ? native_sched_clock+0x82/0x96
	 [<c06ac309>] ? _spin_unlock+0x27/0x3c
	 [<c019379e>] ? alloc_fd+0xbf/0xc9
	 [<c06ae2f4>] ? sub_preempt_count+0x9d/0xab
	 [<c019379e>] ? alloc_fd+0xbf/0xc9
	 [<c0180391>] do_sys_open+0x42/0xb8
	 [<c041d564>] ? trace_hardirqs_on_thunk+0xc/0x10
	 [<c0180449>] sys_open+0x1e/0x26
	 [<c01038bd>] sysenter_do_call+0x12/0x31
	=======================

	The problem is that we don't unlock the bfs->lock mutex before calling iput (we do in the other cases).

	Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de>
	Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	proc: more debugging for "already registered" case	Alexey Dobriyan	2008-09-13	1	-2/+2
	Print parent directory name as well.

	The aim is to catch non-creation of parent directory when proc_mkdir will return NULL and all subsequent registrations go directly in /proc instead of intended directory.

	Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	[ Fixed insane printk string while at it. - Linus ]
	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6	Linus Torvalds	2008-09-11	2	-25/+20
	* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
	  udf: add llseek method
	  udf: Fix error paths in udf_new_inode()
	  udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)
\| * \|	udf: add llseek method	Christoph Hellwig	2008-09-08	1	-0/+1
	UDF currently doesn't set a llseek method for regular files, which means it will fall back to default_llseek. This means no one can seek beyond 2 Gigabytes on udf, and that there's no protection vs the i_size updates from writers.

	Signed-off-by: Christoph Hellwig <hch@lst.de>
	Signed-off-by: Jan Kara <jack@suse.cz>
\| * \|	udf: Fix error paths in udf_new_inode()	Jan Kara	2008-08-19	1	-23/+18
\|	In case we failed to allocate memory for the inode when creating it, we did not properly free the block already allocated for this inode. Move the memory allocation before the block allocation, which fixes this issue (thanks for the idea go to Ingo Oeser <ioe-lkml@rameria.de>). Also remove a few superfluous initializations already done in udf_alloc_inode().
\|
\|	Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de>
\|	Signed-off-by: Jan Kara <jack@suse.cz>
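The reordering described in the udf_new_inode() fix above can be sketched with a toy model. All names here (`toy_fs`, `toy_new_inode_old`, `toy_new_inode_new`, the `simulate_oom` flag) are hypothetical and exist only for illustration; this is not the UDF code, just the general idea that doing the step that can fail first leaves nothing to undo.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy filesystem state: only counts allocated blocks. */
struct toy_fs {
    int blocks_in_use;
};

/* Old ordering: the block is taken first, so a later memory allocation
 * failure returns with the block still accounted as in use. */
int toy_new_inode_old(struct toy_fs *fs, bool simulate_oom)
{
    assert(fs != NULL);
    fs->blocks_in_use++;      /* block allocated */
    if (simulate_oom)
        return -1;            /* error path forgets to free the block */
    return 0;
}

/* New ordering: the memory allocation comes first, so a failure there
 * happens before any block has been taken. */
int toy_new_inode_new(struct toy_fs *fs, bool simulate_oom)
{
    assert(fs != NULL);
    if (simulate_oom)
        return -1;            /* nothing allocated yet, nothing to undo */
    fs->blocks_in_use++;      /* block allocated only after memory succeeded */
    return 0;
}
```

With a simulated allocation failure, the old ordering leaves `blocks_in_use` at 1, while the new ordering leaves it at 0.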
\| * \|	udf: Fix lock inversion between iprune_mutex and alloc_mutex (v2)	Jan Kara	2008-08-19	1	-2/+1
\|	A memory allocation inside alloc_mutex must not recurse back into the filesystem itself because that leads to lock inversion between iprune_mutex and alloc_mutex (and thus to deadlocks - see traces below). alloc_mutex is actually needed only to update allocation statistics in the superblock so we can drop it before we start allocating memory for the inode.
\|
\|	tar D ffff81015b9c8c90 0 6614 6612
\|	 ffff8100d5a21a20 0000000000000086 0000000000000000 00000000ffff0000
\|	 ffff81015b9c8c90 ffff81015b8f0cd0 ffff81015b9c8ee0 0000000000000000
\|	 0000000000000003 0000000000000000 0000000000000000 0000000000000000
\|	Call Trace:
\|	 [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b
\|	 [<ffffffff803c1bef>] mutex_lock+0xa/0xb
\|	 [<ffffffff8027f8c2>] shrink_icache_memory+0x38/0x200
\|	 [<ffffffff80257742>] shrink_slab+0xe3/0x15b
\|	 [<ffffffff802579db>] try_to_free_pages+0x221/0x30d
\|	 [<ffffffff8025657e>] isolate_pages_global+0x0/0x31
\|	 [<ffffffff8025324b>] __alloc_pages_internal+0x252/0x3ab
\|	 [<ffffffff8026b08b>] cache_alloc_refill+0x22e/0x47b
\|	 [<ffffffff8026ae37>] kmem_cache_alloc+0x3b/0x61
\|	 [<ffffffff8026b15b>] cache_alloc_refill+0x2fe/0x47b
\|	 [<ffffffff8026b34e>] __kmalloc+0x76/0x9c
\|	 [<ffffffffa00751f2>] :udf:udf_new_inode+0x202/0x2e2
\|	 [<ffffffffa007ae5e>] :udf:udf_create+0x2f/0x16d
\|	 [<ffffffffa0078f27>] :udf:udf_lookup+0xa6/0xad
\|	 ...
\|
\|	kswapd0 D ffff81015b9d9270 0 125 2
\|	 ffff81015b903c28 0000000000000046 ffffffff8028cbb0 00000000fffffffb
\|	 ffff81015b9d9270 ffff81015b8f0cd0 ffff81015b9d94c0 000000000271b490
\|	 ffffe2000271b458 ffffe2000271b420 ffffe20002728dc8 ffffe20002728d90
\|	Call Trace:
\|	 [<ffffffff8028cbb0>] __set_page_dirty+0xeb/0xf5
\|	 [<ffffffff8025403a>] get_dirty_limits+0x1d/0x22f
\|	 [<ffffffff803c1d8a>] __mutex_lock_slowpath+0x64/0x9b
\|	 [<ffffffff803c1bef>] mutex_lock+0xa/0xb
\|	 [<ffffffffa0073f58>] :udf:udf_bitmap_free_blocks+0x47/0x1eb
\|	 [<ffffffffa007df31>] :udf:udf_discard_prealloc+0xc6/0x172
\|	 [<ffffffffa007875a>] :udf:udf_clear_inode+0x1e/0x48
\|	 [<ffffffff8027f121>] clear_inode+0x6d/0xc4
\|	 [<ffffffff8027f7f2>] dispose_list+0x56/0xee
\|	 [<ffffffff8027fa5a>] shrink_icache_memory+0x1d0/0x200
\|	 [<ffffffff80257742>] shrink_slab+0xe3/0x15b
\|	 [<ffffffff80257e93>] kswapd+0x346/0x447
\|	 ...
\|
\|	Reported-by: Tibor Tajti <tibor.tajti@gmail.com>
\|	Reviewed-by: Ingo Oeser <ioe-lkml@rameria.de>
\|	Signed-off-by: Jan Kara <jack@suse.cz>
* \| \|	ocfs2: Fix a bug in direct IO read.	Tao Ma	2008-09-10	1	-1/+1
\| \|/ \|/\|	ocfs2 will become read-only if we try to read bytes past the end of i_size. This can be easily reproduced by the following steps:
\|
\|	1. mkfs an ocfs2 volume with bs=4k cs=4k and nosparse.
\|	2. create a small file (say less than 100 bytes); the file is allocated 1 cluster.
\|	3. read 8196 bytes from the kernel using O_DIRECT which exceeds the limit.
\|	4. The ocfs2 volume becomes read-only and dmesg shows:
\|	   OCFS2: ERROR (device sda13): ocfs2_direct_IO_get_blocks: Inode 66010 has a hole at block 1
\|	   File system is now read-only due to the potential of on-disk corruption. Please run fsck.ocfs2 once the file system is unmounted.
\|
\|	So suppress the ERROR message.
\|
\|	Signed-off-by: Tao Ma <tao.ma@oracle.com>
\|	Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	Merge branch 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6	Linus Torvalds	2008-09-09	10	-141/+221
\|\ \	* 'linux-next' of git://git.infradead.org/~dedekind/ubifs-2.6:
\|	  UBIFS: make minimum fanout 3
\|	  UBIFS: fix division by zero
\|	  UBIFS: amend f_fsid
\|	  UBIFS: fill f_fsid
\|	  UBIFS: improve statfs reporting even more
\|	  UBIFS: introduce LEB overhead
\|	  UBIFS: add forgotten gc_idx_lebs component
\|	  UBIFS: fix assertion
\|	  UBIFS: improve statfs reporting
\|	  UBIFS: remove incorrect index space check
\|	  UBIFS: push empty flash hack down
\|	  UBIFS: do not update min_idx_lebs in stafs
\|	  UBIFS: allow for racing between GC and TNC
\|	  UBIFS: always read hashed-key nodes under TNC mutex
\|	  UBIFS: fix zero-length truncations