op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable	Linus Torvalds	2009-04-03	17	-544/+982
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: BUG to BUG_ON changes Btrfs: remove dead code Btrfs: remove dead code Btrfs: fix typos in comments Btrfs: remove unused ftrace include Btrfs: fix __ucmpdi2 compile bug on 32 bit builds Btrfs: free inode struct when btrfs_new_inode fails Btrfs: fix race in worker_loop Btrfs: add flushoncommit mount option Btrfs: notreelog mount option Btrfs: introduce btrfs_show_options Btrfs: rework allocation clustering Btrfs: Optimize locking in btrfs_next_leaf() Btrfs: break up btrfs_search_slot into smaller pieces Btrfs: kill the pinned_mutex Btrfs: kill the block group alloc mutex Btrfs: clean up find_free_extent Btrfs: free space cache cleanups Btrfs: unplug in the async bio submission threads Btrfs: keep processing bios for a given bdev if our proc is batching
\| *	Btrfs: BUG to BUG_ON changes	Stoyan Gaydarov	2009-04-02	3	-6/+3
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: remove dead code	Dan Carpenter	2009-04-02	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Remove an unneeded return statement and conditional Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: remove dead code	Dan Carpenter	2009-04-02	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	merge is always NULL at this point. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: fix typos in comments	Wu Fengguang	2009-04-02	3	-12/+14
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: remove unused ftrace include	Jim Owens	2009-04-02	2	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \|	Signed-off-by: jim owens <jowens@hp.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: fix __ucmpdi2 compile bug on 32 bit builds	Heiko Carstens	2009-04-03	1	-11/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We get this on 32 builds: fs/built-in.o: In function `extent_fiemap': (.text+0x1019f2): undefined reference to `__ucmpdi2' Happens because of a switch statement with a 64 bit argument. Convert this to an if statement to fix this. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: free inode struct when btrfs_new_inode fails	Shen Feng	2009-04-02	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	btrfs_new_inode doesn't call iput to free the inode when it fails. Signed-off-by: Shen Feng <shen@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: fix race in worker_loop	Amit Gud	2009-04-02	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Need to check kthread_should_stop after schedule_timeout() before calling schedule(). This causes threads to sleep with potentially no one to wake them up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop. Signed-off-by: Amit Gud <gud@ksu.edu> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: add flushoncommit mount option	Sage Weil	2009-04-02	3	-5/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The 'flushoncommit' mount option forces any data dirtied by a write in a prior transaction to commit as part of the current commit. This makes the committed state a fully consistent view of the file system from the application's perspective (i.e., it includes all completed file system operations). This was previously the behavior only when a snapshot is created. This is used by Ceph to ensure that completed writes make it to the platter along with the metadata operations they are bound to (by BTRFS_IOC_TRANS_{START,END}). Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: notreelog mount option	Sage Weil	2009-04-02	3	-1/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a 'notreelog' mount option to disable the tree log (used by fsync, O_SYNC writes). This is much slower, but the tree logging produces inconsistent views into the FS for ceph. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: introduce btrfs_show_options	Eric Paris	2009-04-02	1	-1/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	btrfs options can change at times other than mount, yet /proc/mounts shows the options string used when the fs was mounted (an example would be when btrfs determines that barriers aren't useful and turns them off.) This patch instead outputs the actual options in use by btrfs. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: rework allocation clustering	Chris Mason	2009-04-03	6	-74/+509
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Because btrfs is copy-on-write, we end up picking new locations for blocks very often. This makes it fairly difficult to maintain perfect read patterns over time, but we can at least do some optimizations for writes. This is done today by remembering the last place we allocated and trying to find a free space hole big enough to hold more than just one allocation. The end result is that we tend to write sequentially to the drive. This happens all the time for metadata and it happens for data when mounted -o ssd. But, the way we record it is fairly racey and it tends to fragment the free space over time because we are trying to allocate fairly large areas at once. This commit gets rid of the races by adding a free space cluster object with dedicated locking to make sure that only one process at a time is out replacing the cluster. The free space fragmentation is somewhat solved by allowing a cluster to be comprised of smaller free space extents. This part definitely adds some CPU time to the cluster allocations, but it allows the allocator to consume the small holes left behind by cow. Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: Optimize locking in btrfs_next_leaf()	Chris Mason	2009-04-03	1	-23/+65
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	btrfs_next_leaf was using blocking locks when it could have been using faster spinning ones instead. This adds a few extra checks around the pieces that block and switches over to spinning locks. Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: break up btrfs_search_slot into smaller pieces	Chris Mason	2009-04-03	1	-90/+131
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	btrfs_search_slot was doing too many things at once. This breaks it up into more reasonable units. Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: kill the pinned_mutex	Josef Bacik	2009-04-03	4	-16/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes the pinned_mutex. The extent io map has an internal tree lock that protects the tree itself, and since we only copy the extent io map when we are committing the transaction we don't need it there. We also don't need it when caching the block group since searching through the tree is also protected by the internal map spin lock. Signed-off-by: Josef Bacik <jbacik@redhat.com>
\| *	Btrfs: kill the block group alloc mutex	Josef Bacik	2009-04-03	3	-157/+83
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes the block group alloc mutex used to protect the free space tree for allocations and replaces it with a spin lock which is used only to protect the free space rb tree. This means we only take the lock when we are directly manipulating the tree, which makes us a touch faster with multi-threaded workloads. This patch also gets rid of btrfs_find_free_space and replaces it with btrfs_find_space_for_alloc, which takes the number of bytes you want to allocate, and empty_size, which is used to indicate how much free space should be at the end of the allocation. It will return an offset for the allocator to use. If we don't end up using it we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to kill the alloc_mutex, since we need to make sure nobody else comes along and takes our space. Signed-off-by: Josef Bacik <jbacik@redhat.com>
\| *	Btrfs: clean up find_free_extent	Josef Bacik	2009-04-03	1	-165/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	I've replaced the strange looping constructs with a list_for_each_entry on space_info->block_groups. If we have a hint we just jump into the loop with the block group and start looking for space. If we don't find anything we start at the beginning and start looking. We never come out of the loop with a ref on the block_group _unless_ we found space to use, then we drop it after we set the trans block_group. Signed-off-by: Josef Bacik <jbacik@redhat.com>
\| *	Btrfs: free space cache cleanups	Josef Bacik	2009-04-03	2	-51/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch cleans up the free space cache code a bit. It better documents the idiosyncrasies of tree_search_offset and makes the code make a bit more sense. I took out the info allocation at the start of __btrfs_add_free_space and put it where it makes more sense. This was left over cruft from when alloc_mutex existed. Also all of the re-searches we do to make sure we inserted properly. Signed-off-by: Josef Bacik <jbacik@redhat.com>
\| *	Btrfs: unplug in the async bio submission threads	Chris Mason	2009-04-03	1	-1/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Btrfs pages being written get set to writeback, and then may go through a number of steps before they hit the block layer. This includes compression, checksumming and async bio submission. The end result is that someone who writes a page and then does wait_on_page_writeback is likely to unplug the queue before the bio they cared about got there. We could fix this by marking bios sync, or by doing more frequent unplugs, but this commit just changes the async bio submission code to unplug after it has processed all the bios for a device. The async bio submission does a fair job of collection bios, so this shouldn't be a huge problem for reducing merging at the elevator. For streaming O_DIRECT writes on a 5 drive array, it boosts performance from 386MB/s to 460MB/s. Thanks to Hisashi Hifumi for helping with this work. Signed-off-by: Chris Mason <chris.mason@oracle.com>
\| *	Btrfs: keep processing bios for a given bdev if our proc is batching	Chris Mason	2009-04-03	1	-0/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Btrfs uses async helper threads to submit write bios so the checksumming helper threads don't block on the disk. The submit bio threads may process bios for more than one block device, so when they find one device congested they try to move on to other devices instead of blocking in get_request_wait for one device. This does a pretty good job of keeping multiple devices busy, but the congested flag has a number of problems. A congested device may still give you a request, and other procs that aren't backing off the congested device may starve you out. This commit uses the io_context stored in current to decide if our process has been made a batching process by the block layer. If so, it keeps sending IO down for at least one batch. This helps make sure we do a good amount of work each time we visit a bdev, and avoids large IO stalls in multi-device workloads. It's also very ugly. A better solution is in the works with Jens Axboe. Signed-off-by: Chris Mason <chris.mason@oracle.com>
* \|	ocfs2: recover orphans in offline slots during recovery and mount	Srinivas Eeda	2009-04-03	4	-18/+132
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	During recovery, a node recovers orphans in it's slot and the dead node(s). But if the dead nodes were holding orphans in offline slots, they will be left unrecovered. If the dead node is the last one to die and is holding orphans in other slots and is the first one to mount, then it only recovers it's own slot, which leaves orphans in offline slots. This patch queues complete_recovery to clean orphans for all offline slots during mount and node recovery. Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Pagecache usage optimization on ocfs2	Hisashi Hifumi	2009-04-03	1	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: fix rare stale inode errors when exporting via nfs	wengang wang	2009-04-03	9	-8/+319
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For nfs exporting, ocfs2_get_dentry() returns the dentry for fh. ocfs2_get_dentry() may read from disk when the inode is not in memory, without any cross cluster lock. this leads to the file system loading a stale inode. This patch fixes above problem. Solution is that in case of inode is not in memory, we get the cluster lock(PR) of alloc inode where the inode in question is allocated from (this causes node on which deletion is done sync the alloc inode) before reading out the inode itsself. then we check the bitmap in the group (the inode in question allcated from) to see if the bit is clear. if it's clear then it's stale. if the bit is set, we then check generation as the existing code does. We have to read out the inode in question from disk first to know its alloc slot and allot bit. And if its not stale we read it out using ocfs2_iget(). The second read should then be from cache. And also we have to add a per superblock nfs_sync_lock to cover the lock for alloc inode and that for inode in question. this is because ocfs2_get_dentry() and ocfs2_delete_inode() lock on them in reverse order. nfs_sync_lock is locked in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(). so that mutliple ocfs2_delete_inode() can run concurrently in normal case. [mfasheh@suse.com: build warning fixes and comment cleanups] Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Tweak mle_state output	Sunil Mushran	2009-04-03	1	-2/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The debugfs file, mle_state, now prints the number of largest number of mles in one hash link. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Do not purge lockres that is being migrated dlm_purge_lockres()	Sunil Mushran	2009-04-03	1	-2/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch attempts to fix a fine race between purging and migration. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Remove struct dlm_lock_name in struct dlm_master_list_entry	Sunil Mushran	2009-04-03	3	-71/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes struct dlm_lock_name and adds the entries directly to struct dlm_master_list_entry. Under the new scheme, both mles that are backed by a lockres or not, will have the name populated in mle->mname. This allows us to get rid of code that was figuring out the location of the mle name. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Show the number of lockres/mles in dlm_state	Sunil Mushran	2009-04-03	1	-0/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch shows the number of lockres' and mles in the debugfs file, dlm_state. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: dlm_set_lockres_owner() and dlm_change_lockres_owner() inlined	Sunil Mushran	2009-04-03	2	-22/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch inlines dlm_set_lockres_owner() and dlm_change_lockres_owner(). Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Improve lockres counts	Sunil Mushran	2009-04-03	4	-38/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch replaces the lockres counts that tracked the number number of locally and remotely mastered lockres' with a current and total count. The total count is the number of lockres' that have been created since the dlm domain was created. The number of locally and remotely mastered counts can be computed using the locking_state output. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Track number of mles	Sunil Mushran	2009-04-03	3	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The lifetime of a mle is limited to the duration of the lockres mastery process. While typically this lifetime is fairly short, we have noticed the number of mles explode under certain circumstances. This patch tracks the number of each different types of mles and should help us determine how best to speed up the mastery process. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Indent dlm_cleanup_master_list()	Sunil Mushran	2009-04-03	1	-54/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The previous patch explicitly did not indent dlm_cleanup_master_list() so as to make the patch readable. This patch properly indents the function. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Activate dlm->master_hash for master list entries	Sunil Mushran	2009-04-03	4	-30/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With this patch, the mles are stored in a hash and not a simple list. This should improve the mle lookup time when the number of outstanding masteries is large. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Create and destroy the dlm->master_hash	Sunil Mushran	2009-04-03	2	-0/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds code to create and destroy the dlm->master_hash. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Refactor dlm_clean_master_list()	Sunil Mushran	2009-04-03	1	-63/+85
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch refactors dlm_clean_master_list() so as to make it easier to convert the mle list to a hash. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Clean up struct dlm_lock_name	Sunil Mushran	2009-04-03	3	-44/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For master mle, the name it stored in the attached lockres in struct qstr. For block and migration mle, the name is stored inline in struct dlm_lock_name. This patch attempts to make struct dlm_lock_name look like a struct qstr. While we could use struct qstr, we don't because we want to avoid having to malloc and free the lockname string as the mle's lifetime is fairly short. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2/dlm: Encapsulate adding and removing of mle from dlm->master_list	Sunil Mushran	2009-04-03	2	-11/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch encapsulates adding and removing of the mle from the dlm->master_list. This patch is part of the series of patches that converts the mle list to a mle hash. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Optimize inode group allocation by recording last used group.	Tao Ma	2009-04-03	2	-4/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In ocfs2, the block group search looks for the "emptiest" group to allocate from. So if the allocator has many equally(or almost equally) empty groups, new block group will tend to get spread out amongst them. So we add osb_inode_alloc_group in ocfs2_super to record the last used inode allocation group. For more details, please see http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy. I have done some basic test and the results are a ten times improvement on some cold-cache stat workloads. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Allocate inode groups from global_bitmap.	Tao Ma	2009-04-03	1	-10/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Inode groups used to be allocated from local alloc file, but since we want all inodes to be contiguous enough, we will try to allocate them directly from global_bitmap. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Optimize inode allocation by remembering last group	Tao Ma	2009-04-03	5	-2/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In ocfs2, the inode block search looks for the "emptiest" inode group to allocate from. So if an inode alloc file has many equally (or almost equally) empty groups, new inodes will tend to get spread out amongst them, which in turn can put them all over the disk. This is undesirable because directory operations on conceptually "nearby" inodes force a large number of seeks. So we add ip_last_used_group in core directory inodes which records the last used allocation group. Another field named ip_last_used_slot is also added in case inode stealing happens. When claiming new inode, we passed in directory's inode so that the allocation can use this information. For more details, please see http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance()	Mark Fasheh	2009-04-03	2	-2/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs rebalancing. Since we rebalance an entire cluster at a time however, this function needs to calculate the beginning of that cluster, in blocks. The calculation was wrong, which would result in a read of non-leaf blocks. Fix the calculation by adding ocfs2_block_to_cluster_start() which is a more straight-forward way of determining this. Reported-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: re-order ocfs2_empty_dir checks	Mark Fasheh	2009-04-03	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ocfs2_empty_dir() is far more expensive than checking link count. Since both need to be checked at the same time, we can improve performance by checking link count first. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Enable indexed directories	Mark Fasheh	2009-04-03	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since the disk format is finalized, we can set this feature bit in the supported mask. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <Joel.Becker@oracle.com>
* \|	ocfs2: Add total entry count to dx_root_block	Mark Fasheh	2009-04-03	2	-44/+124
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This little bit of extra accounting speeds up ocfs2_empty_dir() dramatically by allowing us to short-circuit the full directory scan. Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* \|	ocfs2: Increase max links count	Mark Fasheh	2009-04-03	4	-26/+66
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since we've now got a directory format capable of handling a large number of entries, we can increase the maximum link count supported. This only gets increased if the directory indexing feature is turned on. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>
* \|	ocfs2: Introduce dir free space list	Mark Fasheh	2009-04-03	4	-93/+490
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The only operation which doesn't get faster with directory indexing is insert, which still has to walk the entire unindexed directory portion to find a free block. This patch provides an improvement in directory insert performance by maintaining a singly linked list of directory leaf blocks which have space for additional dirents. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>
* \|	ocfs2: Store dir index records inline	Mark Fasheh	2009-04-03	5	-145/+471
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Allow us to store a small number of directory index records in the ocfs2_dx_root_block. This saves us a disk read on small to medium sized directories (less than about 250 entries). The inline root is automatically turned into a root block with extents if the directory size increases beyond it's capacity. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>
* \|	ocfs2: Add a name indexed b-tree to directory inodes	Mark Fasheh	2009-04-03	13	-104/+2157
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch makes use of Ocfs2's flexible btree code to add an additional tree to directory inodes. The new tree stores an array of small, fixed-length records in each leaf block. Each record stores a hash value, and pointer to a block in the traditional (unindexed) directory tree where a dirent with the given name hash resides. Lookup exclusively uses this tree to find dirents, thus providing us with constant time name lookups. Some of the hashing code was copied from ext3. Unfortunately, it has lots of unfixed checkpatch errors. I left that as-is so that tracking changes would be easier. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>
* \|	ocfs2: Introduce dir lookup helper struct	Mark Fasheh	2009-04-03	3	-131/+148
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Many directory manipulation calls pass around a tuple of dirent, and it's containing buffer_head. Dir indexing has a bit more state, but instead of adding yet more arguments to functions, we introduce 'struct ocfs2_dir_lookup_result'. In this patch, it simply holds the same tuple, but future patches will add more state. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <joel.becker@oracle.com>
* \|	ocfs2: Remove debugfs file local_alloc_stats	Sunil Mushran	2009-04-03	2	-91/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch removes the debugfs file local_alloc_stats as that information is now included in the fs_state debugfs file. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>