op-kernel-dev - Development kernel branch for OpenPOWER systems

	Commit message	Author	Age	Files	Lines
*	cfq: set workload as expired if it doesn't have any slice left	Gui Jianfeng	2009-12-15	1	-1/+3
	When a group is resumed and has no workload slice left, we should set workload_expires as expired. Otherwise, we might mistakenly start from where we left off in the previous group. Thanks to Corrado for the idea. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Fix a CFQ crash in "for-2.6.33" branch of block tree	Vivek Goyal	2009-12-10	1	-3/+4
	I think my previous patch introduced a bug which can lead to CFQ hitting BUG_ON(). The offending commit in the for-2.6.33 branch is:

	commit 7667aa0630407bc07dc38dcc79d29cc0a65553c1
	Author: Vivek Goyal <vgoyal@redhat.com>
	Date: Tue Dec 8 17:52:58 2009 -0500
	cfq-iosched: Take care of corner cases of group losing share due to deletion

	While doing some stress testing on my box, I encountered the following:

	login: [ 3165.148841] BUG: scheduling while atomic: swapper/0/0x10000100
	[ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan]
	[ 3165.149821] Pid: 0, comm: swapper Not tainted 2.6.32-block-for-33-merged-new #3
	[ 3165.149821] Call Trace:
	[ 3165.149821] <IRQ> [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60
	[ 3165.149821] [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d
	[ 3165.149821] [<ffffffff8153a979>] schedule+0xe3/0x7bc
	[ 3165.149821] [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f
	[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched]
	[ 3165.149821] [<ffffffff810422d8>] __cond_resched+0x2a/0x35
	[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched]
	[ 3165.149821] [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37
	[ 3165.149821] [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f
	[ 3165.149821] [<ffffffff811e4161>] report_bug+0x18/0xac
	[ 3165.149821] [<ffffffff8100f1fc>] die+0x39/0x63
	[ 3165.149821] [<ffffffff8153cde1>] do_trap+0x11a/0x129
	[ 3165.149821] [<ffffffff8100d470>] do_invalid_op+0x96/0x9f
	[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched]
	[ 3165.149821] [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67
	[ 3165.149821] [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13
	[ 3165.149821] [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4
	[ 3165.149821] [<ffffffff8100c935>] invalid_op+0x15/0x20
	[ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched]
	[ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
	[ 3165.149821] [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7
	[ 3165.149821] [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21
	[ 3165.149821] [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df
	[ 3165.149821] [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17
	[ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f
	[ 3165.149821] [<ffffffff811d931f>] __blk_run_queue+0x42/0x71
	[ 3165.149821] [<ffffffff811d9403>] blk_run_queue+0x26/0x3a
	[ 3165.149821] [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375
	[ 3165.149821] [<ffffffff812b60ac>] ? put_device+0x17/0x19
	[ 3165.149821] [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b
	[ 3165.149821] [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5
	[ 3165.149821] [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe

	I think I have hit the following BUG_ON() in cfq_dispatch_request():

	BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list));

	Please find attached the patch to fix it. I have done some stress testing with it and have not seen it happening again.
	o We should wait on a queue even after slice expiry only if it is empty. If the queue is not empty then continue to expire it.
	o If we decide to keep the queue then make cfqq=NULL. Otherwise select_queue() will return a valid cfqq and cfq_dispatch_request() can hit the following BUG_ON(): BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list))

	Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq: Remove wait_request flag when idle time is being deleted	Gui Jianfeng	2009-12-10	1	-0/+1
\| \| \| \| \| \| \| \|	Remove wait_request flag when idle time is being deleted, otherwise it'll hit this path every time when a request is enqueued. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: commenting non-obvious initialization	Corrado Zoccolo	2009-12-09	1	-0/+4
\| \| \| \| \| \| \| \|	Added a comment to explain the initialization of last_delayed_sync. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: Take care of corner cases of group losing share due to deletion	Vivek Goyal	2009-12-09	1	-6/+48
	If there is a sequential reader running in a group, we wait for the next request to come in that group after slice expiry and, once the new request is in, we expire the queue. Otherwise we delete the group from the service tree and the group loses its fair share. So far I was marking a queue as wait_busy if it had consumed its slice and it was the last queue in the group. But this condition did not cover the following two cases. 1. A request completed and the slice has not expired yet. The next request comes in and is dispatched to disk. Now select_queue() hits and the slice has expired. This group will be deleted. Because the request is still in the disk, this queue will never get a chance to wait_busy. 2. A request completed and the slice has not expired yet. Before the next request comes in (delay due to think time), select_queue() hits and expires the queue, hence the group. This queue never got a chance to wait busy. Gui was hitting boundary condition 1 and not getting fairness numbers proportional to weight. This patch adds checks for the above two conditions and improves the fairness numbers for sequential workloads on rotational media. The check in select_queue() takes care of case 1 and the additional check in should_wait_busy() takes care of case 2. Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: Get rid of cfqq wait_busy_done flag	Vivek Goyal	2009-12-09	1	-9/+8
\| \| \| \| \| \| \| \| \| \|	o Get rid of wait_busy_done flag. This flag only tells we were doing wait busy on a queue and that queue got request so expire it. That information can easily be obtained by (cfq_cfqq_wait_busy() && queue_is_not_empty). So remove this flag and keep code simple. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
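The equivalence described in this message can be illustrated with a minimal userspace sketch. The struct and helper names below are hypothetical simplifications, not the kernel's `struct cfq_queue` or its flag accessors; the point is only that the old flag's meaning can be derived from two pieces of state that already exist.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a CFQ queue (not the kernel struct). */
struct queue_sketch {
    bool wait_busy;       /* we are waiting for this queue to get busy again */
    unsigned int queued;  /* number of requests currently queued */
};

/*
 * What the removed wait_busy_done flag used to record:
 * we were doing wait-busy on the queue AND the queue has a request now.
 */
static bool wait_busy_is_done(const struct queue_sketch *q)
{
    return q->wait_busy && q->queued > 0;
}
```

Deriving the condition on demand means there is one less piece of state to keep in sync when a request arrives.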
*	cfq: Optimization for close cooperating queue searching	Gui Jianfeng	2009-12-09	1	-0/+6
\| \| \| \| \| \| \| \|	It doesn't make any sense to try to find out a close cooperating queue if current cfqq is the only one in the group. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: reduce write depth only if sync was delayed	Corrado Zoccolo	2009-12-09	1	-4/+5
\| \| \| \| \| \| \| \| \| \|	The introduction of ramp-up formula for async queue depths has slowed down dirty page reclaim, by reducing async write performance. This patch makes sure the formula kicks in only when sync request was recently delayed. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: Do not access cfqq after freeing it	Vivek Goyal	2009-12-07	1	-3/+4
\| \| \| \| \| \| \| \| \| \|	Fix a crash during boot reported by Jeff Moyer. Fix the issue of accessing cfqq after freeing it. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@carl.(none)>
*	cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit	Jens Axboe	2009-12-06	1	-2/+8
	After the merge of the IO controller patches, booting on my megaraid box ran much slower. Vivek Goyal traced it down to megaraid discovery creating tons of devices, each suffering a grace period when they later kill that queue (if no device is found). So let's use call_rcu() to batch these deferred frees, instead of taking the grace period hit for each one. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Implement dynamic io controlling policy registration	Vivek Goyal	2009-12-04	1	-1/+13
	o One of the goals of the block IO controller is that it should be able to support multiple IO control policies, some of which may be operational at a higher level in the storage hierarchy. o To begin with, we had one IO controlling policy implemented by CFQ, and I hard coded the CFQ functions called by blkio. This created issues when CFQ is compiled as a module. o This patch implements basic dynamic IO controlling policy registration functionality in blkio. This is similar to the elevator functionality where IO schedulers register their functions dynamically. o In future, when more IO controlling policies are implemented, these can dynamically register with the block IO controller. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Export some symbols from blkio as its user CFQ can be a module	Vivek Goyal	2009-12-04	1	-2/+2
\| \| \| \| \| \| \| \| \|	o blkio controller is inside the kernel and cfq makes use of interfaces exported by blkio. CFQ can be a module too, hence export symbols used by CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: make nonrot check logic consistent	Shaohua Li	2009-12-04	1	-1/+2
	cfq_arm_slice_timer() has logic to disable the idle window for SSD devices. The same thing should be done in cfq_select_queue() too, otherwise we will still see the idle window. This makes the nonrot check logic consistent in cfq. In tests on an Intel SSD with the low_latency knob off, this patch can triple disk throughput for multi-threaded sequential reads. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: move IO controller declerations to a header file	Jens Axboe	2009-12-04	1	-0/+1
\| \| \| \| \| \| \|	They should not be declared inside some other file that's not related to CFQ. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Wait on sync-noidle queue even if rq_noidle = 1	Vivek Goyal	2009-12-03	1	-1/+2
	o rq_noidle() is supposed to tell cfq not to expect a request after this one, hence don't idle. But this does not seem to work very well. For example, for direct random readers, rq_noidle = 1 but there is a next request coming after this one. Not idling leads to a group not getting its share even if group_isolation=1. o The right solution for this issue is to scan the higher layers and set the right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single line fix helps. This should not have any significant impact when we are not using cgroups. I will later figure out the IO paths in higher layers and fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Implement group_isolation tunable	Vivek Goyal	2009-12-03	1	-1/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	o If a group is running only a random reader, then it will not have enough traffic to keep disk busy and we will reduce overall throughput. This should result in better latencies for random reader though. If we don't idle on random reader service tree, then this random reader will experience large latencies if there are other groups present in system with sequential readers running in these. o One solution suggested by corrado is that by default keep the random readers or sync-noidle workload in root group so that during one dispatch round we idle only once on sync-noidle tree. This means that all the sync-idle workload queues will be in their respective group and we will see service differentiation in those but not on sync-noidle workload. o Provide a tunable group_isolation. If set, this will make sure that even sync-noidle queues go in their respective group and we wait on these. This provides stronger isolation between groups but at the expense of throughput if group does not have enough traffic to keep the disk busy. o By default group_isolation = 0 Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Determine async workload length based on total number of queues	Vivek Goyal	2009-12-03	1	-5/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	o Async queues are not per group. Instead these are system wide and maintained in root group. Hence their workload slice length should be calculated based on total number of queues in the system and not just queues in the root group. o As root group's default weight is 1000, make sure to charge async queue more in terms of vtime so that it does not get more time on disk because root group has higher weight. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Wait for cfq queue to get backlogged if group is empty	Vivek Goyal	2009-12-03	1	-5/+29
	o If a queue consumes its slice and then gets deleted from the service tree, its associated group will also get deleted from the service tree if this was the only queue in the group. That will make the group lose its share. o For the queues on which we have idling on, and if these have used their slice, wait a bit for these queues to get backlogged again and then expire them so that the group does not lose its share. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Propagate cgroup weight updation to cfq groups	Vivek Goyal	2009-12-03	1	-0/+6
\| \| \| \| \| \| \|	o Propagate blkio cgroup weight updation to associated cfq groups. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Drop the reference to queue once the task changes cgroup	Vivek Goyal	2009-12-03	1	-0/+39
\| \| \| \| \| \| \| \| \|	o If a task changes cgroup, drop reference to the cfqq associated with io context and set cfqq pointer stored in ioc to NULL so that upon next request arrival we will allocate a new queue in new group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Provide some isolation between groups	Vivek Goyal	2009-12-03	1	-10/+20
	o Do not allow the following three operations across groups, for isolation: - selection of co-operating queues - preemptions across groups - request merging across groups. o Async queues are currently global and not per group. Allow preemption of an async queue if a sync queue in another group gets backlogged. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Export disk time and sectors used by a group to user space	Vivek Goyal	2009-12-03	1	-3/+16
	o Export disk time and sectors used by a group to user space through the cgroup interface. o Also export a "dequeue" interface to cgroup which keeps track of how many times a group was deleted from the service tree. Helps in debugging. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Some debugging aids for CFQ	Vivek Goyal	2009-12-03	1	-1/+18
\| \| \| \| \| \| \|	o Some debugging aids for CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Take care of cgroup deletion and cfq group reference counting	Vivek Goyal	2009-12-03	1	-0/+95
\| \| \| \| \| \| \| \| \| \|	o One can choose to change elevator or delete a cgroup. Implement group reference counting so that both elevator exit and cgroup deletion can take place gracefully. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Nauman Rafique <nauman@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Dynamic cfq group creation based on cgroup tasks belongs to	Vivek Goyal	2009-12-03	1	-11/+100
\| \| \| \| \| \| \| \| \| \| \| \|	o Determine the cgroup IO submitting task belongs to and create the cfq group if it does not exist already. o Also link cfqq and associated cfq group. o Currently all async IO is mapped to root group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Group time used accounting and workload context save restore	Vivek Goyal	2009-12-03	1	-0/+79
	o This patch introduces the functionality to do the accounting of group time when a queue expires. The time used decides which group goes next. o Also introduce the functionality to save and restore the workload type context within a group. It might happen that once we expire the cfq queue and group, a different group will schedule in and we will lose the context of the workload type. Hence save and restore it upon queue expiry. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Implement per cfq group latency target and busy queue avg	Vivek Goyal	2009-12-03	1	-20/+45
\| \| \| \| \| \| \| \| \| \| \|	o So far we had 300ms soft target latency system wide. Now with the introduction of cfq groups, divide that latency by number of groups so that one can come up with group target latency which will be helpful in determining the workload slice with-in group and also the dynamic slice length of the cfq queue. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
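As a rough illustration of the arithmetic in this message, the sketch below splits the 300ms system-wide soft target equally across groups. The macro and function names are hypothetical, and the equal split simply follows the wording above; a weight-aware scheme would scale each group's share differently.

```c
/* System-wide soft latency target named in the commit message, in ms. */
#define TARGET_LATENCY_MS 300u

/*
 * Hypothetical helper: per-group latency target under an equal split.
 * Treats "no groups" as a single group to avoid dividing by zero.
 */
static unsigned int group_target_latency_ms(unsigned int nr_groups)
{
    if (nr_groups == 0)
        nr_groups = 1;
    return TARGET_LATENCY_MS / nr_groups;
}
```

With three busy groups each group would get a 100ms target, which then bounds the workload slice and the per-queue slice inside that group.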
*	blkio: Introduce per cfq group weights and vdisktime calculations	Vivek Goyal	2009-12-03	1	-1/+61
\| \| \| \| \| \| \| \| \|	o Bring in the per cfq group weight and how vdisktime is calculated for the group. Also bring in the functionality of updating the min_vdisktime of the group service tree. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
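A weight-scaled virtual disk time can be sketched as follows. This is a minimal userspace illustration, not the patch itself: the struct, the function names, and the two constants (a default weight of 500 and a 12-bit fixed-point shift) are assumptions chosen for the example.

```c
#include <stdint.h>

#define SKETCH_DEFAULT_WEIGHT 500u /* assumed default group weight */
#define SKETCH_SERVICE_SHIFT  12   /* assumed fixed-point precision shift */

/* Hypothetical, simplified stand-in for a CFQ group. */
struct group_sketch {
    unsigned int weight;  /* higher weight means a larger disk share */
    uint64_t vdisktime;   /* virtual disk time consumed so far */
};

/* Scale real service time inversely to the group's weight. */
static uint64_t scale_slice(uint64_t used, unsigned int weight)
{
    uint64_t d = used << SKETCH_SERVICE_SHIFT;

    if (weight == 0)
        weight = 1; /* guard against division by zero in the sketch */
    return d * SKETCH_DEFAULT_WEIGHT / weight;
}

/* Charge a group for the disk time it just used. */
static void charge_group(struct group_sketch *g, uint64_t used)
{
    g->vdisktime += scale_slice(used, g->weight);
}
```

Because a heavier group accumulates virtual time more slowly for the same real service, always picking the group with the smallest vdisktime gives it proportionally more disk time.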
*	blkio: Introduce the root service tree for cfq groups	Vivek Goyal	2009-12-03	1	-3/+134
\| \| \| \| \| \| \| \| \|	o So far we just had one cfq_group in cfq_data. To create space for more than one cfq_group, we need to have a service tree of groups where all the groups can be queued if they have active cfq queues backlogged in these. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Keep queue on service tree until we expire it	Vivek Goyal	2009-12-03	1	-21/+49
\| \| \| \| \| \| \| \| \| \| \| \| \|	o Currently cfqq deletes a queue from service tree if it is empty (even if we might idle on the queue). This patch keeps the queue on service tree hence associated group remains on the service tree until we decide that we are not going to idle on the queue and expire it. o This just helps in time accounting for queue/group and in implementation of rest of the patches. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Implement macro to traverse each service tree in group	Vivek Goyal	2009-12-03	1	-16/+25
	o Implement a macro to traverse each service tree in the group. This avoids usage of a double for loop and a special condition for the idle tree 4 times. o The macro is a little twisted because of the special handling of the idle class service tree. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Introduce the notion of cfq groups	Vivek Goyal	2009-12-03	1	-33/+75
	o This patch introduces the notion of cfq groups. Soon we will be able to have multiple groups of different weights in the system. o Various service trees (prioclass and workload type trees) will become per cfq group. So the hierarchy looks as follows: cfq_groups -> workload type -> cfq queue. o When a scheduling decision has to be taken, first we select the cfq group, then the workload within the group, and then the cfq queue within the workload type. o This patch just makes the various workload service trees per cfq group and introduces the function to be able to choose a group for scheduling. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	blkio: Set must_dispatch only if we decided to not dispatch the request	Vivek Goyal	2009-12-03	1	-3/+3
\| \| \| \| \| \| \| \|	o must_dispatch flag should be set only if we decided not to run the queue and dispatch the request. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: no dispatch limit for single queue	Shaohua Li	2009-12-03	1	-2/+2
	Since commit 2f5cb7381b737e24c8046fd4aeab571fb71315f5, each queue can send up to 4 * 4 requests if only one queue exists. I wonder why we have such a limit. Devices that support tagging can send more requests. For example, AHCI can send 31 requests. A test (direct aio randread) shows the limit reduces disk throughput by about 4%. On the other hand, since we send one request at a time, if another queue pops up while the current one is sending more than cfq_quantum requests, the current queue will stop sending requests soon after one request, so it sounds like there is no big latency impact. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Revert "cfq: Make use of service count to estimate the rb_key offset"	Jens Axboe	2009-11-30	1	-6/+2
\| \| \| \| \| \| \| \| \| \|	This reverts commit 3586e917f2c7df769d173c4ec99554cb40a911e5. Corrado Zoccolo <czoccolo@gmail.com> correctly points out, that we need consistency of rb_key offset across groups. This means we cannot properly use the per-service_tree service count. Revert this change. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: fix corner cases in idling logic	Corrado Zoccolo	2009-11-26	1	-10/+22
	Idling logic was disabled in some corner cases, leading to unfair share for noidle queues. * The idle timer was not armed if there were other requests in the driver. Unfortunately, those requests could come from other workloads, or queues for which we don't enable idling. So we will check only pending requests from the active queue. * The rq_noidle check on a no-idle queue could disable the end of tree idle if the last completed request was rq_noidle. Now, we will disable that idle only if all the queues served in the no-idle tree had rq_noidle requests. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: idling on deep seeky sync queues	Corrado Zoccolo	2009-11-26	1	-1/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Seeky sync queues with large depth can gain unfairly big share of disk time, at the expense of other seeky queues. This patch ensures that idling will be enabled for queues with I/O depth at least 4, and small think time. The decision to enable idling is sticky, until an idle window times out without seeing a new request. The reasoning behind the decision is that, if an application is using large I/O depth, it is already optimized to make full utilization of the hardware, and therefore we reserve a slice of exclusive use for it. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
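The sticky "deep queue" decision described in this message might look like the sketch below. The depth threshold of 4 comes from the message; the struct, the function names, and the way think time is passed in as a boolean are hypothetical simplifications.

```c
#include <stdbool.h>

#define DEEP_DEPTH_THRESHOLD 4u /* "I/O depth at least 4" from the message */

/* Hypothetical, simplified per-queue state. */
struct seeky_queue_sketch {
    bool deep; /* sticky: idling stays enabled while this is set */
};

/* Called when requests are queued: may turn the sticky flag on, never off. */
static void update_deep(struct seeky_queue_sketch *q,
                        unsigned int queued, bool small_think_time)
{
    if (queued >= DEEP_DEPTH_THRESHOLD && small_think_time)
        q->deep = true;
}

/* Called when an idle window expires without a new request arriving. */
static void idle_window_timed_out(struct seeky_queue_sketch *q)
{
    q->deep = false;
}
```

Clearing the flag only on an idle timeout keeps a briefly shallow queue from losing its exclusive slice between bursts.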
*	cfq-iosched: fix no-idle preemption logic	Corrado Zoccolo	2009-11-26	1	-2/+3
	An incoming no-idle queue should preempt the active no-idle queue only if the active queue is idling due to service tree empty. Previous code was buggy in two ways: * it relied on service_tree field to be set on the active queue, while it is not set when the code is idling for a new request * it didn't check for the service tree empty condition, so could lead to LIFO behaviour if multiple queues with depth > 1 were preempting each other on a non-NCQ device. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: fix ncq detection code	Corrado Zoccolo	2009-11-26	1	-9/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CFQ's detection of queueing devices initially assumes a queuing device and detects if the queue depth reaches a certain threshold. However, it will reconsider this choice periodically. Unfortunately, if device is considered not queuing, CFQ will force a unit queue depth for some workloads, thus defeating the detection logic. This leads to poor performance on queuing hardware, since the idle window remains enabled. Given this premise, switching to hw_tag = 0 after we have proved at least once that the device is NCQ capable is not a good choice. The new detection code starts in an indeterminate state, in which CFQ behaves as if hw_tag = 1, and then, if for a long observation period we never saw large depth, we switch to hw_tag = 0, otherwise we stick to hw_tag = 1, without reconsidering it again. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
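The three-state detection described in this message can be sketched as a small state machine. Everything named here is an assumption for illustration: the threshold, the length of the observation period, and the type and function names are not taken from the patch. The sketch also lets a device already marked non-queuing be promoted if a large depth is seen later, which is a choice of this example rather than something the message states.

```c
#define NCQ_DEPTH_THRESHOLD 4   /* assumed "large depth" threshold */
#define NCQ_OBSERVE_SAMPLES 50  /* assumed length of the observation period */

enum hw_tag_state {
    HW_TAG_UNKNOWN = -1, /* indeterminate: behave as if queuing */
    HW_TAG_NO = 0,
    HW_TAG_YES = 1       /* proven once: never reconsidered */
};

struct ncq_detect_sketch {
    enum hw_tag_state state;
    int samples; /* shallow samples seen while still undecided */
};

static void ncq_detect_init(struct ncq_detect_sketch *d)
{
    d->state = HW_TAG_UNKNOWN;
    d->samples = 0;
}

/* Feed one observation of how many requests are in the driver. */
static void ncq_detect_observe(struct ncq_detect_sketch *d, int depth)
{
    if (d->state == HW_TAG_YES)
        return; /* sticky: do not downgrade a proven device */
    if (depth > NCQ_DEPTH_THRESHOLD) {
        d->state = HW_TAG_YES;
        return;
    }
    if (d->state == HW_TAG_UNKNOWN && ++d->samples >= NCQ_OBSERVE_SAMPLES)
        d->state = HW_TAG_NO;
}

/* The scheduler treats both UNKNOWN and YES as a queuing device. */
static int ncq_behaves_as_queuing(const struct ncq_detect_sketch *d)
{
    return d->state != HW_TAG_NO;
}
```

Starting in the indeterminate state avoids forcing a unit queue depth before the device has had a chance to show that it can queue.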
*	cfq-iosched: cleanup unreachable code	Corrado Zoccolo	2009-11-26	1	-13/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cfq_should_idle returns false for no-idle queues that are not the last, so the control flow will never reach the removed code in a state that satisfies the if condition. The unreachable code was added to emulate previous cfq behaviour for non-NCQ rotational devices. My tests show that even without it, the performances and fairness are comparable with previous cfq, thanks to the fact that all seeky queues are grouped together, and that we idle at the end of the tree. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq: Make use of service count to estimate the rb_key offset	Gui Jianfeng	2009-11-26	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \|	For the moment, different workload cfq queues are put into different service trees. But CFQ still uses "busy_queues" to estimate rb_key offset when inserting a cfq queue into a service tree. I think this isn't appropriate, and it should make use of service tree count to do this estimation. This patch is for for-2.6.33 branch. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	block: jiffies fixes	Randy Dunlap	2009-11-11	1	-0/+1
\| \| \| \| \| \| \| \| \|	Use HZ-independent calculation of milliseconds. Add jiffies.h where it was missing since functions or macros from it are used. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: fix next_rq computation	Corrado Zoccolo	2009-11-08	1	-6/+7
	Cfq has a bug in computation of next_rq, that affects transition between multiple sequential request streams in a single queue (e.g.: two sequential buffered writers of the same priority), causing the alternation between the two streams for a transient period.

	8,0 1 18737 0.260400660 5312 D W 141653311 + 256
	8,0 1 20839 0.273239461 5400 D W 141653567 + 256
	8,0 1 20841 0.276343885 5394 D W 142803919 + 256
	8,0 1 20843 0.279490878 5394 D W 141668927 + 256
	8,0 1 20845 0.292459993 5400 D W 142804175 + 256
	8,0 1 20847 0.295537247 5400 D W 141668671 + 256
	8,0 1 20849 0.298656337 5400 D W 142804431 + 256
	8,0 1 20851 0.311481148 5394 D W 141668415 + 256
	8,0 1 20853 0.314421305 5394 D W 142804687 + 256
	8,0 1 20855 0.318960112 5400 D W 142804943 + 256

	The fix makes sure that the next_rq is computed from the last dispatched request, and not affected by merging.

	8,0 1 37776 4.305161306 0 D W 141738087 + 256
	8,0 1 37778 4.308298091 0 D W 141738343 + 256
	8,0 1 37780 4.312885190 0 D W 141738599 + 256
	8,0 1 37782 4.315933291 0 D W 141738855 + 256
	8,0 1 37784 4.319064459 0 D W 141739111 + 256
	8,0 1 37786 4.331918431 5672 D W 142803007 + 256
	8,0 1 37788 4.334930332 5672 D W 142803263 + 256
	8,0 1 37790 4.337902723 5672 D W 142803519 + 256
	8,0 1 37792 4.342359774 5672 D W 142803775 + 256
	8,0 1 37794 4.345318286 0 D W 142804031 + 256

	Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: get rid of the coop_preempt flag	Jens Axboe	2009-11-04	1	-19/+2
\| \| \| \| \| \| \|	We need to rework this logic post the cooperating cfq_queue merging, for now just get rid of it and Jeff Moyer will fix the fall out. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	cfq-iosched: fix merge error	Jens Axboe	2009-11-03	1	-1/+0
\| \| \| \| \| \| \|	We ended up with testing the same condition twice, pretty pointless. Remove that first if. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
*	Merge branch 'for-linus' into for-2.6.33	Jens Axboe	2009-11-03	1	-2/+20
\|\ \| \| \| \| \| \| \| \| \| \| \| \|	Conflicts: block/cfq-iosched.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
\| *	cfq-iosched: limit coop preemption	Shaohua Li	2009-11-03	1	-2/+15
\| 	CFQ has an optimization for cooperating applications: if several io-contexts issue requests that are close to each other, they get a boost. But the optimization can be abused. Consider threads a and b working on one file. a reads sectors s, s+2, s+4, ...; b reads sectors s+1, s+3, s+5, ... Both a and b are sequential readers, so they can open an idle window. a reads sector s, goes into its idle window and wakes up b. b wants sector s+1; since in the current implementation cfq_should_preempt() thinks a and b are cooperators, b preempts a. b then reads sector s+1, goes into its idle window and wakes up a. For the same reason, a preempts b and reads s+2. a and b continue this cycle, which can last very long, and together they occupy the whole disk queue. Other applications get almost no chance to run. Fix this by limiting coop preemption until a queue is scheduled normally again. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
\| *	cfq-iosched: fix bad return value cfq_should_preempt()	Jens Axboe	2009-11-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit a6151c3a5c8e1ff5a28450bc8d6a99a2a0add0a7 inadvertently reversed a preempt condition check, potentially causing a performance regression. Make the meta check correct again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* \|	Merge branch 'cfq-2.6.33' into for-2.6.33	Jens Axboe	2009-11-03	1	-52/+321
\|\ \
\| * \|	cfq-iosched: fix style issue in cfq_get_avg_queues()	Jens Axboe	2009-10-28	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Line breaks and bad brace placement. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>