op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	cifs: convert oplock breaks to use slow_work facility (try #4)	Jeff Layton	2009-09-24	10	-175/+119
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is the fourth respin of the patch to convert oplock breaks to use the slow_work facility. A customer of ours was testing a backport of one of the earlier patchsets, and hit a "Busy inodes after umount..." problem. An oplock break job had raced with a umount, and the superblock got torn down and its memory reused. When the oplock break job tried to dereference the inode->i_sb, the kernel oopsed. This patchset has the oplock break job hold an inode and vfsmount reference until the oplock break completes. With this, there should be no need to take a tcon reference (the vfsmount implicitly holds one already). Currently, when an oplock break comes in there's a chance that the oplock break job won't occur if the allocation of the oplock_q_entry fails. There are also some rather nasty races in the allocation and handling these structs. Rather than allocating oplock queue entries when an oplock break comes in, add a few extra fields to the cifsFileInfo struct. Get rid of the dedicated cifs_oplock_thread as well and queue the oplock break job to the slow_work thread pool. This approach also has the advantage that the oplock break jobs can potentially run in parallel rather than be serialized like they are today. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	cifs: have cifsFileInfo hold an extra inode reference	Jeff Layton	2009-09-15	3	-3/+5
\| \| \| \| \| \| \| \| \|	It's possible that this struct will outlive the filp to which it is attached. If it does and it needs to do some work on the inode, then it'll need a reference. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	cifs: take read lock on GlobalSMBSes_lock in is_valid_oplock_break	Jeff Layton	2009-09-15	1	-3/+3
\| \| \| \| \| \| \| \|	...rather than a write lock. It doesn't change the list so a read lock should be sufficient. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	cifs: remove cifsInodeInfo.oplockPending flag	Jeff Layton	2009-09-15	2	-2/+0
\| \| \| \| \| \| \|	It's set on oplock break but nothing ever looks at it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	cifs: fix oplock request handling in posix codepath	Jeff Layton	2009-09-15	3	-11/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cifs_posix_open takes a "poplock" argument that's intended to be used in the actual posix open call to set the "Flags" field. It ignores this value however and declares an "oplock" parameter on the stack that it passes uninitialized to the CIFSPOSIXOpen function. Not only does this mean that the oplock request flags are bogus, but the result that's expected to be in that variable is unchanged. Fix this, and also clean up the type of the oplock parameter used. Since it's expected to be __u32, we should use that everywhere and not implicitly cast it from a signed type. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	[CIFS] Re-enable Lanman security	Chuck Ebbert	2009-09-15	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	commit ac68392460ffefed13020967bae04edc4d3add06 ("[CIFS] Allow raw ntlmssp code to be enabled with sec=ntlmssp") added a new bit to the allowed security flags mask but seems to have inadvertently removed Lanman security from the allowed flags. Add it back. CC: Stable <stable@kernel.org> Signed-off-by: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
*	Merge branch 'osync_cleanup' of ↵	Linus Torvalds	2009-09-14	13	-247/+137
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'osync_cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: fsync: wait for data writeout completion before calling ->fsync vfs: Remove generic_osync_inode() and sync_page_range{_nolock}() fat: Opencode sync_page_range_nolock() pohmelfs: Use new syncing helper xfs: Convert sync_page_range() to simple filemap_write_and_wait_range() ocfs2: Update syncing after splicing to match generic version ntfs: Use new syncing helpers and update comments ext4: Remove syncing logic from ext4_file_write ext3: Remove syncing logic from ext3_file_write ext2: Update comment about generic_osync_inode vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode vfs: Rename generic_file_aio_write_nolock ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write() vfs: Export __generic_file_aio_write() and add some comments vfs: Introduce filemap_fdatawait_range
\| *	fsync: wait for data writeout completion before calling ->fsync	Christoph Hellwig	2009-09-14	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out the data, the calls into ->fsync to write out the metadata and then finally calls filemap_fdatawait to wait for the data I/O to complete. What sounds like a clever micro-optimization actually is nast trap for many filesystems. For many modern filesystems i_size or other inode information is only updated on I/O completion and we need to wait for I/O to finish before we can write out the metadata. For old fashionen filesystems that instanciate blocks during the actual write and also update the metadata at that point it opens up a large window were we could expose uninitialized blocks after a crash. While a few filesystems that need it already wait for the I/O to finish inside their ->fsync methods it is rather suboptimal as it is done under the i_mutex and also always for the whole file instead of just a part as we could do for O_SYNC handling. Here is a small audit of all fsync instances in the tree: - spufs_mfc_fsync: - ps3flash_fsync: - vol_cdev_fsync: - printer_fsync: - fb_deferred_io_fsync: - bad_file_fsync: - simple_sync_file: don't care - filesystems/drivers do't use the page cache or are purely in-memory. - simple_fsync: - file_fsync: - affs_file_fsync: - fat_file_fsync: - jfs_fsync: - ubifs_fsync: - reiserfs_dir_fsync: - reiserfs_sync_file: never touch pagecache themselves. We need to wait before if we do not want to expose stale data after an allocation. - afs_fsync: - fuse_fsync_common: do the waiting writeback itself in awkward ways, would benefit from proper semantics - block_fsync: Does a filemap_write_and_wait on the block device inode. Because we now have f_mapping that is the same inode we call it on in vfs_fsync. So just removing it and letting the VFS do the work in one go would be an improvement. - btrfs_sync_file: - cifs_fsync: - xfs_file_fsync: need the wait first and currently do it themselves. would benefit from doing it outside i_mutex. - coda_fsync: - ecryptfs_fsync: - exofs_file_fsync: - shm_fsync: only passes the fsync through to the lower layer - ext3_sync_file: doesn't seem to care, comments are confusing. - ext4_sync_file: would need the wait to work correctly for delalloc mode with late i_size updates. Otherwise the ext3 comment applies. currently implemens it's own writeback and wait in an odd way, could benefit from doing it properly. - gfs2_fsync: not needed for journaled data mode, but probably harmless there. Currently writes back data asynchronously itself. Needs some major audit. - hostfs_fsync: just calls fsync/datasync on the host FD. Without the wait before data might not even be inflight yet if we're unlucky. - hpfs_file_fsync: - ncp_fsync: no-ops. Dangerous before and after. - jffs2_fsync: just calls jffs2_flush_wbuf_gc, not sure how this relates to data. - nfs_fsync_dir: just increments stats, claims all directory operations are synchronous - nfs_file_fsync: only writes out data??? Looks very odd. - nilfs_sync_file: looks like it expects all data done, but not sure from the code - ntfs_dir_fsync: - ntfs_file_fsync: appear to do their own data writeback. Very convoluted code. - ocfs2_sync_file: does it's own data writeback, but no wait. probably needs the wait. - smb_fsync: according to a comment expects all pages written already, probably needs the wait before. This patch only changes vfs_fsync_range, removal of the wait in the methods that have it is left to the filesystem maintainers. Note that most filesystems really do need an audit for their fsync methods given the gems found in this very brief audit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()	Jan Kara	2009-09-14	1	-54/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Remove these three functions since nobody uses them anymore. Signed-off-by: Jan Kara <jack@suse.cz>
\| *	fat: Opencode sync_page_range_nolock()	Jan Kara	2009-09-14	2	-4/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the only user of generic_osync_inode() which does not have a file open. So opencode needed actions for FAT so that we can convert generic_osync_inode() to a standard syncing path. Update a comment about generic_osync_inode(). CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	xfs: Convert sync_page_range() to simple filemap_write_and_wait_range()	Jan Kara	2009-09-14	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Christoph Hellwig says that it is enough for XFS to call filemap_write_and_wait_range() instead of sync_page_range() because we do all the metadata syncing when forcing the log. CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ocfs2: Update syncing after splicing to match generic version	Jan Kara	2009-09-14	1	-21/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update ocfs2 specific splicing code to use generic syncing helper. The sync now does not happen under rw_lock because generic_write_sync() acquires i_mutex which ranks above rw_lock. That should not matter because standard fsync path does not hold it either. Acked-by: Joel Becker <Joel.Becker@oracle.com> Acked-by: Mark Fasheh <mfasheh@suse.com> CC: ocfs2-devel@oss.oracle.com Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ntfs: Use new syncing helpers and update comments	Jan Kara	2009-09-14	2	-19/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use new syncing helpers in .write and .aio_write functions. Also remove superfluous syncing in ntfs_file_buffered_write() and update comments about generic_osync_inode(). CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ext4: Remove syncing logic from ext4_file_write	Jan Kara	2009-09-14	1	-51/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The syncing is now properly handled by generic_file_aio_write() so no special ext4 code is needed. CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ext3: Remove syncing logic from ext3_file_write	Jan Kara	2009-09-14	1	-60/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Syncing is now properly done by generic_file_aio_write() so no special logic is needed in ext3. CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ext2: Update comment about generic_osync_inode	Jan Kara	2009-09-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We rely on generic_write_sync() now. CC: linux-ext4@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
\| *	vfs: Introduce new helpers for syncing after writing to O_SYNC file or ↵	Jan Kara	2009-09-14	2	-23/+54
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	IS_SYNC inode Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	vfs: Rename generic_file_aio_write_nolock	Christoph Hellwig	2009-09-14	1	-1/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	generic_file_aio_write_nolock() is now used only by block devices and raw character device. Filesystems should use __generic_file_aio_write() in case generic_file_aio_write() doesn't suit them. So rename the function to blkdev_aio_write() and move it to fs/blockdev.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
\| *	ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock	Jan Kara	2009-09-14	1	-10/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the new helper. We have to submit data pages ourselves in case of O_SYNC write because __generic_file_aio_write does not do it for us. OCFS2 developpers might think about moving the sync out of i_mutex which seems to be easily possible but that's out of scope of this patch. CC: ocfs2-devel@oss.oracle.com Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz>
* \|	Merge branch 'master' of ↵	Linus Torvalds	2009-09-14	19	-806/+556
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Whitespace fixes GFS2: Remove unused sysfs file GFS2: Be extra careful about deallocating inodes GFS2: Remove no_formal_ino generating code GFS2: Rename eattr.[ch] as xattr.[ch] GFS2: Clean up of extended attribute support GFS2: Add explanation of extended attr on-disk format GFS2: Add "-o errors=panic\|withdraw" mount options GFS2: jumping to wrong label? GFS2: free disk inode which is deleted by remote node -V2 GFS2: Add a document explaining GFS2's uevents GFS2: Add sysfs link to device GFS2: Replace assertion with proper error handling GFS2: Improve error handling in inode allocation GFS2: Add some more info to uevents GFS2: Add online uevent to GFS2
\| * \|	GFS2: Whitespace fixes	Steven Whitehouse	2009-09-14	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reported-by: Daniel Walker <dwalker@fifo99.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Remove unused sysfs file	Steven Whitehouse	2009-09-09	3	-14/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The /sys/fs/gfs2/<fsname>/lock_module/id file has been unused for some time now, so we can remove it. We still accept the mount option though, as userspace still sends that. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Be extra careful about deallocating inodes	Steven Whitehouse	2009-09-08	4	-35/+51
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a potential race in the inode deallocation code if two nodes try to deallocate the same inode at the same time. Most of the issue is solved by the iopen locking. There is still a small window which is not covered by the iopen lock. This patches fixes that and also makes the deallocation code more robust in the face of any errors in the rgrp bitmaps, or erroneous iopen callbacks from other nodes. This does introduce one extra disk read, but that is generally not an issue since its the same block that must be written to later in the deallocation process. The total disk accesses therefore stay the same, Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Remove no_formal_ino generating code	Steven Whitehouse	2009-08-27	5	-190/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The inum structure used throughout GFS2 has two fields. One no_addr is the disk block number of the inode in question and is used everywhere as the inode number. The other, no_formal_ino, is used only as the generation number for NFS. Historically the no_formal_ino field was set using a complicated system of one global and one per-node file containing inode numbers in order to ensure that each no_formal_ino was unique. Also this code made no provision for what would happen when eventually the (64 bit) numbers ran out. Now I know that is pretty unlikely to happen given the large space of numbers, but it is possible nevertheless. The only guarantee required for no_formal_ino is that, for any single inode, the same number doesn't get reused too quickly. We already have a generation number which is kept in the inode and initialised from a counter in the resource group (almost no overhead, since we have to touch the resource group anyway in order to allocate an inode in the first place). Aside from ensuring that we never use the value 0 in the no_formal_ino field, we can use that counter directly. As a result of that change, we lose about 200 lines of code and also gain about 10 creates/sec on the postmark benchmark (on my test machine). Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Rename eattr.[ch] as xattr.[ch]	Steven Whitehouse	2009-08-26	7	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the more conventional name for the extended attribute support code. Update all the places which care. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Clean up of extended attribute support	Steven Whitehouse	2009-08-26	11	-526/+333
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This has been on my list for some time. We need to change the way in which we handle extended attributes to allow faster file creation times (by reducing the number of transactions required) and the extended attribute code is the main obstacle to this. In addition to that, the VFS provides a way to demultiplex the xattr calls which we ought to be using, rather than rolling our own. This patch changes the GFS2 code to use that VFS feature and as a result the code shrinks by a couple of hundred lines or so, and becomes easier to read. I'm planning on doing further clean up work in this area, but this patch is a good start. The cleaned up code also uses the more usual "xattr" shorthand, I plan to eliminate the use of "eattr" eventually and in the mean time it serves as a flag as to which bits of the code have been updated. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Add "-o errors=panic\|withdraw" mount options	Bob Peterson	2009-08-24	4	-14/+71
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds "-o errors=panic" and "-o errors=withdraw" to the gfs2 mount options. The "errors=withdraw" option is today's current behaviour, meaning to withdraw from the file system if a non-serious gfs2 error occurs. The new "errors=panic" option tells gfs2 to force a kernel panic if a non-serious gfs2 file system error occurs. This may be useful, for example, where fabric-level fencing is used that has no way to reboot (such as fence_scsi). Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: jumping to wrong label?	Roel Kluin	2009-08-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also a gfs2_glock_dq() is required here. Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: free disk inode which is deleted by remote node -V2	Wengang Wang	2009-08-18	1	-0/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	this patch is for the same problem that Benjamin Marzinski fixes at commit b94a170e96dc416828af9d350ae2e34b70ae7347 quotation of the original problem: ---cut here--- When a file is deleted from a gfs2 filesystem on one node, a dcache entry for it may still exist on other nodes in the cluster. If this happens, gfs2 will be unable to free this file on disk. Because of this, it's possible to have a gfs2 filesystem with no files on it and no free space. With this patch, when a node receives a callback notifying it that the file is being deleted on another node, it schedules a new workqueue thread to remove the file's dcache entry. ---end cut--- after applying Benjamin's patch, I think there is still a case in which the disk inode remains even when "no space" is hit. the case is that when running d_prune_aliases() against the inode, there are one or more dentries(aliases) which have reference count number > 0. in this case the dentries won't be pruned. and even later, the reference count becomes to 0, the dentries can still be cached in memory. unfortunately, no callback come again, things come back to the state before the callback runs. thus the on disk inode remains there until in memoryinode is removed for some other reason(shrinking inode cache or unmount the volume..). this patch is to remove those dentries when their reference count becomes to 0 and the inode is deleted by remote node. for implementation, gfs2_dentry_delete() is added as dentry_operations.d_delete. the function returns true when the inode is deleted by remote node. in dput(), gfs2_dentry_delete() is called and since it returns true, the dentry is unhashed from dcache and then removed. when all dentries are removed, the in memory inode get removed so that the on disk inode is freed. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Add sysfs link to device	Steven Whitehouse	2009-08-17	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds a link from the per-gfs2 sb sysfs directory to the block device upon which the filesystem is mounted. The link is called "device", strangely enough :-) Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Replace assertion with proper error handling	Steven Whitehouse	2009-08-17	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	One fewer assert, one more place we can recover gracefully if there is an error. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Improve error handling in inode allocation	Steven Whitehouse	2009-08-17	3	-13/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A little while back, block allocation was given some improved error handling which meant that -EIO was returned in the case of there being a problem in the resource group data. In addition a message is printed explaning what went wrong and how to fix it. This extends that error handling so that it also covers inode allocation too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Add some more info to uevents	Steven Whitehouse	2009-08-17	1	-3/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With each uevent, we now always include the journal ID. We can't call it JID since that is already in use by some of the individual events relating to recovery, so we use JOURNALID instead. We don't send the JOURNALID for spectator mounts, since there isn't one. Also the ADD event now has both RDONLY and SPECTATOR information to match that of the ONLINE event. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
\| * \|	GFS2: Add online uevent to GFS2	Steven Whitehouse	2009-08-17	3	-3/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We already have an offline uevent (used when a withdraw occurs) but no online uevent. This adds an online uevent so that userspace will be able to detect a successful mount by means other than not receiving a remove event after the add & recovery (change) uevents. It has also been added to the remount path as well - we can't use a change uevent there as older GFS2 userspace acts on change uevents according to the state that it thinks the fs is in, so we can't easily add any new ones. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
* \| \|	Merge branch 'for_linus' of ↵	Linus Torvalds	2009-09-14	5	-100/+12
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: udf: Fix possible corruption when close races with write udf: Perform preallocation only for regular files udf: Remove wrong assignment in udf_symlink udf: Remove dead code
\| * \| \|	udf: Fix possible corruption when close races with write	Jan Kara	2009-09-14	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we close a file, we remove preallocated blocks from it. But this truncation was not protected by i_mutex and thus it could have raced with a write through a different fd and cause crashes or even filesystem corruption. Signed-off-by: Jan Kara <jack@suse.cz>
\| * \| \|	udf: Perform preallocation only for regular files	Jan Kara	2009-09-14	1	-9/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	So far we preallocated blocks also for directories but that brings a problem, when to get rid of preallocated blocks we don't need. So far we removed them in udf_clear_inode() which has a disadvantage that 1) blocks are unavailable long after writing to a directory finished and thus one can get out of space unnecessarily early 2) releasing blocks from udf_clear_inode is problematic because VFS does not expect us to redirty inode there and it also slows down memory reclaim. So preallocate blocks only for regular files where we can drop preallocation in udf_release_file. Signed-off-by: Jan Kara <jack@suse.cz>
\| * \| \|	udf: Remove wrong assignment in udf_symlink	Jan Kara	2009-09-14	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recomputation of the pointer was wrong (it should have been just increment). Luckily, we never use the computed value. Remove it. Signed-off-by: Jan Kara <jack@suse.cz>
\| * \| \|	udf: Remove dead code	Jan Kara	2009-09-14	2	-90/+0
\| \| \|/ \| \|/\| \| \| \| \| \| \| \| \| \| \| \| \|	Remove code that gets never used. Signed-off-by: Jan Kara <jack@suse.cz>
* \| \|	Merge branch 'for-linus' of ↵	Linus Torvalds	2009-09-14	22	-759/+567
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (21 commits) fs/Kconfig: move nilfs2 outside misc filesystems nilfs2: convert nilfs_bmap_lookup to an inline function nilfs2: allow btree code to directly call dat operations nilfs2: add update functions of virtual block address to dat nilfs2: remove individual gfp constants for each metadata file nilfs2: stop zero-fill of btree path just before free it nilfs2: remove unused btree argument from btree functions nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free nilfs2: shorten freeze period due to GC in write operation v3 nilfs2: add more check routines in mount process nilfs2: An unassigned variable is assigned to a never used structure member nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT nilfs2: stop using periodic write_super callback nilfs2: clean up nilfs_write_super nilfs2: fix disorder of nilfs_write_super in nilfs_sync_fs nilfs2: remove redundant super block commit nilfs2: implement nilfs_show_options to display mount options in /proc/mounts nilfs2: always lookup disk block address before reading metadata block nilfs2: use semaphore to protect pointer to a writable FS-instance nilfs2: fix format string compile warning (ino_t) ...
\| * \| \|	fs/Kconfig: move nilfs2 outside misc filesystems	Ryusuke Konishi	2009-09-14	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some people asked me questions like the following: On Wed, 15 Jul 2009 13:11:21 +0200, Leon Woestenberg wrote: > just wondering, any reasons why NILFS2 is one of the miscellaneous > filesystems and, for example, btrfs, is not in Kconfig? Actually, nilfs is NOT a filesystem came from other operating systems, but a filesystem created purely for Linux. Nor is it a flash filesystem but that for generic block devices. So, this moves nilfs outside the misc category as I responded in LKML "Re: Why does NILFS2 hide under Miscellaneous filesystems?" (Message-Id: <20090716.002526.93465395.ryusuke@osrg.net>). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: convert nilfs_bmap_lookup to an inline function	Ryusuke Konishi	2009-09-14	3	-36/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The nilfs_bmap_lookup() is now a wrapper function of nilfs_bmap_lookup_at_level(). This moves the nilfs_bmap_lookup() to a header file converting it to an inline function and gives an opportunity for optimization. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: allow btree code to directly call dat operations	Ryusuke Konishi	2009-09-14	4	-299/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current btree code is written so that btree functions call dat operations via wrapper functions in bmap.c when they allocate, free, or modify virtual block addresses. This abstraction requires additional function calls and causes frequent call of nilfs_bmap_get_dat() function since it is used in the every wrapper function. This removes the wrapper functions and makes them available from btree.c and direct.c, which will increase the opportunity of compiler optimization. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: add update functions of virtual block address to dat	Ryusuke Konishi	2009-09-14	3	-20/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a preparation for the successive cleanup ("nilfs2: allow btree to directly call dat operations"). This adds functions bundling a few operations to change an entry of virtual block address on the dat file. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: remove individual gfp constants for each metadata file	Ryusuke Konishi	2009-09-14	8	-19/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This gets rid of NILFS_CPFILE_GFP, NILFS_SUFILE_GFP, NILFS_DAT_GFP, and NILFS_IFILE_GFP. All of these constants refer to NILFS_MDT_GFP, and can be removed. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: stop zero-fill of btree path just before free it	Ryusuke Konishi	2009-09-14	1	-24/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The btree path object is cleared just before it is freed. This will remove the code doing the unnecessary clear operation. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: remove unused btree argument from btree functions	Ryusuke Konishi	2009-09-14	1	-248/+208
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Even though many btree functions take a btree object as their first argument, most of them are not used in their functions. This sticky use of the btree argument is hurting code readability and giving the possibility of inefficient code generation. So, this removes the unnecessary btree arguments. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free	Ryusuke Konishi	2009-09-14	2	-12/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	These functions are not called from any functions. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: shorten freeze period due to GC in write operation v3	Jiro SEKIBA	2009-09-14	2	-7/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a re-revised patch to shorten freeze period. This version include a fix of the bug Konishi-san mentioned last time. When GC is runnning, GC moves live block to difference segments. Copying live blocks into memory is done in a transaction, however it is not necessarily to be in the transaction. This patch will get the nilfs_ioctl_move_blocks() out from transaction lock and put it before the transaction. I ran sysbench fileio test against nilfs partition. I copied some DVD/CD images and created snapshot to create live blocks before starting the benchmark. Followings are summary of rc8 and rc8 w/ the patch of per-request statistics, which is min/max and avg. I ran each test three times and bellow is average of those numers. According to this benchmark result, average time is slightly degrated. However, worstcase (max) result is significantly improved. This can address a few seconds write freeze. - random write per-request performance of rc8 min 0.843ms max 680.406ms avg 3.050ms - random write per-request performance of rc8 w/ this patch min 0.843ms -> 100.00% max 380.490ms -> 55.90% avg 3.233ms -> 106.00% - sequential write per-request performance of rc8 min 0.736ms max 774.343ms avg 2.883ms - sequential write per-request performance of rc8 w/ this patch min 0.720ms -> 97.80% max 644.280ms-> 83.20% avg 3.130ms -> 108.50% -----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<----- protection_period 150 selection_policy timestamp # timestamp in ascend order nsegments_per_clean 2 cleaning_interval 2 retry_interval 60 use_mmap log_priority info -----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<----- Signed-off-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
\| * \| \|	nilfs2: add more check routines in mount process	Zhu Yanhai	2009-09-14	2	-4/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nilfs2: Add more safeguard routines and protections in mount process, which also makes nilfs2 report consistency error messages when checkpoint number is invalid. Signed-off-by: Zhu Yanhai <zhu.yanhai@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>