summaryrefslogtreecommitdiffstats
path: root/include/linux/cgroup.h
Commit message (Collapse)AuthorAgeFilesLines
* cgroup: kill css_idLi Zefan2013-09-231-37/+0
| | | | | | | | | | The only user of css_id was memcg, and it has been convered to use cgroup->id, so kill css_id. Signed-off-by: Li Zefan <lizefan@huwei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
* cgroup: implement CFTYPE_NO_PREFIXTejun Heo2013-08-261-0/+1
| | | | | | | | | | | | | | | When cgroup files are created, cgroup core automatically prepends the name of the subsystem as prefix. This patch adds CFTYPE_NO_ which disables the automatic prefix. This is to work around historical baggages and shouldn't be used for new files. This will be used to move "cgroup.event_control" from cgroup core to memcg. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Glauber Costa <glommer@gmail.com>
* cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntaxTejun Heo2013-08-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup_css_from_dir() will grow another user. In preparation, make the following changes. * All css functions are prefixed with just "css_", rename it to css_from_dir(). * Take dentry * instead of file * as dentry is what ultimately identifies a cgroup and file may not always be available. Note that the function now checkes whether @dentry->d_inode is NULL as the caller now may specify a negative dentry. * Make it take cgroup_subsys * instead of integer subsys_id. This simplifies the function and allows specifying no subsystem for cgroup->dummy_css. * Make return section a bit less verbose. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com>
* cgroup: change cgroup_from_id() to css_from_id()Li Zefan2013-08-191-18/+2
| | | | | | | | | | | | | Now we want cgroup core to always provide the css to use to the subsystems, so change this API to css_from_id(). Uninline css_from_id(), because it's getting bigger and cgroup_css() has been unexported. While at it, remove the #ifdef, and shuffle the order of the args. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* cgroup: RCU protect each cgroup_subsys_state releaseTejun Heo2013-08-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the planned unified hierarchy, individual css's will be created and destroyed dynamically across the lifetime of a cgroup. To enable such usages, css destruction is being decoupled from cgroup destruction. Most of the destruction path has been decoupled but the actual free of css still depends on cgroup free path. When all css refs are drained, css_release() kicks off css_free_work_fn() which puts the cgroup. When the cgroup refcnt reaches zero, cgroup_diput() is invoked which in turn schedules RCU free of the cgroup. After a grace period, all css's are freed along with the cgroup itself. This patch moves the RCU grace period and css freeing from cgroup release path to css release path. css_release(), instead of kicking off css_free_work_fn() directly, schedules RCU callback css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a RCU grace period. css_free_work_fn() is updated to free the css directly. The five-way punting - percpu ref kill confirmation, a work item, percpu ref release, RCU grace period, and again a work item - is quite hairy but the work items are there only to provide process context and the actual sequence is kill confirm -> release -> RCU free, which isn't simple but not too crazy. This removes cgroup_css() usage after offline_css() allowing clearing cgroup->subsys[] from offline_css(), which makes it consistent with online_css() and brings it closer to proper lifetime management for individual css's. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: decouple cgroup_subsys_state destruction from cgroup destructionTejun Heo2013-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | Currently, css (cgroup_subsys_state) lifetime is tied to that of the associated cgroup. css's are created when the associated cgroup is created and destroyed when it gets destroyed. Also, individual css's aren't RCU protected but the whole cgroup is. With the planned unified hierarchy, css's will need to be dynamically created and destroyed within the lifetime of a cgroup. To enable such usages, this patch decouples css destruction from cgroup destruction - offline_css() invocation and the final css_put() are moved from cgroup_destroy_css_killed() to css_killed_work_fn(). Now each css is individually offlined and put as its reference count is killed instead of waiting for all css's attached to the cgroup to finish refcnt killing and then proceeding to offlining and putting them together. While this changes the order of destruction operations, the changes shouldn't be noticeable to cgroup subsystems or userland. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: replace cgroup->css_kill_cnt with ->nr_cssTejun Heo2013-08-131-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, css (cgroup_subsys_state) lifetime is tied to that of the associated cgroup. With the planned unified hierarchy, css's will be dynamically created and destroyed within the lifetime of a cgroup. To enable such usages, css's will be individually RCU protected instead of being tied to the cgroup. cgroup->css_kill_cnt is used during cgroup destruction to wait for css reference count disable; however, this model doesn't work once css's lifetimes are managed separately from cgroup's. This patch replaces it with cgroup->nr_css which is an cgroup_mutex protected integer counting the number of attached css's. The count is incremented from online_css() and decremented after refcnt kill is confirmed. If the count reaches zero and the cgroup is marked dead, the second stage of cgroup destruction is kicked off. If a cgroup doesn't have any css attached at the time of rmdir, cgroup_destroy_locked() now invokes the second stage directly as no css kill confirmation would happen. cgroup_offline_fn() - the second step of cgroup destruction - is renamed to cgroup_destroy_css_killed() and now expects to be called with cgroup_mutex held. While this patch changes how css destruction is punted to work items, it shouldn't change any visible behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: add __rcu modifier to cgroup->subsys[]Tejun Heo2013-08-131-1/+1
| | | | | | | | | | | | | | | | | For the planned unified hierarchy, each css (cgroup_subsys_state) will be RCU protected so that it can be created and destroyed individually while allowing RCU accesses. Previous changes ensured that all cgroup->subsys[] accesses use the cgroup_css() accessor. This patch adds __rcu modifier to cgroup->subsys[], add matching RCU dereference in cgroup_css() and convert all assignments to either rcu_assign_pointer() or RCU_INIT_POINTER(). This change prepares for the actual RCUfication of css's and doesn't introduce any visible behavior change. The conversion is verified with sparse and all accesses are properly RCU annotated. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: add cgroup_subsys_state->parentTejun Heo2013-08-131-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | With the planned unified hierarchy, css's (cgroup_subsys_state) will be RCU protected and allowed to be attached and detached dynamically over the course of a cgroup's lifetime. This means that css's will stay accessible after being detached from its cgroup - the matching pointer in cgroup->subsys[] cleared - for ref draining and RCU grace period. cgroup core still wants to guarantee that the parent css is never destroyed before its children and css_parent() always returns the parent regardless of the state of the child css as long as it's accessible. This patch makes css's hold onto their parents and adds css->parent so that the parent css is never detroyed before its children and can be determined without consulting the cgroups. cgroup->dummy_css is also updated to point to the parent dummy_css; however, it doesn't need to worry about object lifetime as the parent cgroup is already pinned by the child. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: rename cgroup_subsys_state->dput_work and its callback functionTejun Heo2013-08-131-1/+1
| | | | | | | | | | | | | | css (cgroup_subsys_state) will become RCU protected and there will be two stages which require punting to work item during release. To prepare for using the work item for multiple times, rename css->dput_work to css->destroy_work and css_dput_fn() to css_free_work_fn() and move work item initialization from css init to right before the actual usage. This reorganization doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: make css_for_each_descendant() and friends include the origin css in ↵Tejun Heo2013-08-081-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | the iteration Previously, all css descendant iterators didn't include the origin (root of subtree) css in the iteration. The reasons were maintaining consistency with css_for_each_child() and that at the time of introduction more use cases needed skipping the origin anyway; however, given that css_is_descendant() considers self to be a descendant, omitting the origin css has become more confusing and looking at the accumulated use cases rather clearly indicates that including origin would result in simpler code overall. While this is a change which can easily lead to subtle bugs, cgroup API including the iterators has recently gone through major restructuring and no out-of-tree changes will be applicable without adjustments making this a relatively acceptable opportunity for this type of change. The conversions are mostly straight-forward. If the iteration block had explicit origin handling before or after, it's moved inside the iteration. If not, if (pos == origin) continue; is added. Some conversions add extra reference get/put around origin handling by consolidating origin handling and the rest. While the extra ref operations aren't strictly necessary, this shouldn't cause any noticeable difference. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
* cgroup: unexport cgroup_css()Tejun Heo2013-08-081-13/+0
| | | | | | | | | | | | | | cgroup_css() no longer has any user left outside cgroup.c proper and we don't want subsystems to grow new usages of the function. cgroup core should always provide the css to use to the subsystems, which will make dynamic creation and destruction of css's across the lifetime of a cgroup much more manageable than exposing the cgroup directly to subsystems and let them dereference css's from it. Make cgroup_css() a static function in cgroup.c. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroupTejun Heo2013-08-081-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cgroup_taskset which is used by the subsystem attach methods is the last cgroup subsystem API which isn't using css as the handle. Update cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp. The conversions are pretty mechanical. One exception is cpuset::cgroup_cs(), which lost its last user and got removed. This patch shouldn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Steven Rostedt <rostedt@goodmis.org>
* cgroup: make cftype->[un]register_event() deal with cgroup_subsys_state ↵Tejun Heo2013-08-081-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | instead of cgroup cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. cftype->[un]register_event() is among the remaining couple interfaces which still use struct cgroup. Convert it to cgroup_subsys_state. The conversion is mostly mechanical and removes the last users of mem_cgroup_from_cont() and cg_to_vmpressure(), which are removed. v2: indentation update as suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
* cgroup: make task iterators deal with cgroup_subsys_state instead of cgroupTejun Heo2013-08-081-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is in the process of converting to css (cgroup_subsys_state) from cgroup as the principal subsystem interface handle. This is mostly to prepare for the unified hierarchy support where css's will be created and destroyed dynamically but also helps cleaning up subsystem implementations as css is usually what they are interested in anyway. This patch converts task iterators to deal with css instead of cgroup. Note that under unified hierarchy, different sets of tasks will be considered belonging to a given cgroup depending on the subsystem in question and making the iterators deal with css instead cgroup provides them with enough information about the iteration. While at it, fix several function comment formats in cpuset.c. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com>
* cgroup: remove struct cgroup_scannerTejun Heo2013-08-081-10/+6
| | | | | | | | | | | | | | | | | | | cgroup_scan_tasks() takes a pointer to struct cgroup_scanner as its sole argument and the only function of that struct is packing the arguments of the function call which are consisted of five fields. It's not too unusual to pack parameters into a struct when the number of arguments gets excessive or the whole set needs to be passed around a lot, but neither holds here making it just weird. Drop struct cgroup_scanner and pass the params directly to cgroup_scan_tasks(). Note that struct cpuset_change_nodemask_arg was added to cpuset.c to pass both ->cs and ->newmems pointer to cpuset_change_nodemask() using single data pointer. This doesn't make any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: make cgroup_task_iter remember the cgroup being iteratedTejun Heo2013-08-081-3/+3
| | | | | | | | | | | | | | | | Currently all cgroup_task_iter functions require @cgrp to be passed in, which is superflous and increases chance of usage error. Make cgroup_task_iter remember the cgroup being iterated and drop @cgrp argument from next and end functions. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
* cgroup: rename cgroup_iter to cgroup_task_iterTejun Heo2013-08-081-23/+8
| | | | | | | | | | | | | | | | | | | | cgroup now has multiple iterators and it's quite confusing to have something which walks over tasks of a single cgroup named cgroup_iter. Let's rename it to cgroup_task_iter. While at it, reformat / update comments and replace the overview comment above the interface function decls with proper function comments. Such overview can be useful but function comments should be more than enough here. This is pure rename and doesn't introduce any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
* cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroupTejun Heo2013-08-081-42/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using css (cgroup_subsys_state) as the primary handle instead of cgroup in subsystem API. For hierarchy iterators, this is beneficial because * In most cases, css is the only thing subsystems care about anyway. * On the planned unified hierarchy, iterations for different subsystems will need to skip over different subtrees of the hierarchy depending on which subsystems are enabled on each cgroup. Passing around css makes it unnecessary to explicitly specify the subsystem in question as css is intersection between cgroup and subsystem * For the planned unified hierarchy, css's would need to be created and destroyed dynamically independent from cgroup hierarchy. Having cgroup core manage css iteration makes enforcing deref rules a lot easier. Most subsystem conversions are straight-forward. Noteworthy changes are * blkio: cgroup_to_blkcg() is no longer used. Removed. * freezer: cgroup_freezer() is no longer used. Removed. * devices: cgroup_to_devcgroup() is no longer used. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk>
* cgroup: always use cgroup_next_child() to walk the children listTejun Heo2013-08-081-3/+2
| | | | | | | | | | | | | | There are several places where the children list is accessed directly. This patch converts those places to use cgroup_next_child(). This will help updating the hierarchy iterators to use @css instead of @cgrp. While cgroup_next_child() can be heavy in pathological cases - e.g. a lot of dead children, this shouldn't cause any noticeable behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: convert cgroup_next_sibling() to cgroup_next_child()Tejun Heo2013-08-081-2/+2
| | | | | | | | | | | | | | | | | | | | cgroup is transitioning to using css (cgroup_subsys_state) as the main subsys interface handle instead of cgroup and the iterators will be updated to use css too. The iterators need to walk the cgroup hierarchy and return the css's matching the origin css, which is a bit cumbersome to open code. This patch converts cgroup_next_sibling() to cgroup_next_child() so that it can handle all steps of direct child iteration. This will be used to update iterators to take @css instead of @cgrp. In addition to the new iteration init handling, cgroup_next_child() is restructured so that the different branches share the end of iteration condition check. This patch doesn't change any behavior. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: pass around cgroup_subsys_state instead of cgroup in file methodsTejun Heo2013-08-081-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup. Please see the previous commit which converts the subsystem methods for rationale. This patch converts all cftype file operations to take @css instead of @cgroup. cftypes for the cgroup core files don't have their subsytem pointer set. These will automatically use the dummy_css added by the previous patch and can be converted the same way. Most subsystem conversions are straight forwards but there are some interesting ones. * freezer: update_if_frozen() is also converted to take @css instead of @cgroup for consistency. This will make the code look simpler too once iterators are converted to use css. * memory/vmpressure: mem_cgroup_from_css() needs to be exported to vmpressure while mem_cgroup_from_cont() can be made static. Updated accordingly. * cpu: cgroup_tg() doesn't have any user left. Removed. * cpuacct: cgroup_ca() doesn't have any user left. Removed. * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left. Removed. * net_cls: cgrp_cls_state() doesn't have any user left. Removed. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
* cgroup: add cgroup->dummy_cssTejun Heo2013-08-081-1/+10
| | | | | | | | | | | | | | | | | | | cgroup subsystem API is being converted to use css (cgroup_subsys_state) as the main handle, which makes things a bit awkward for subsystem agnostic core features - the "cgroup.*" interface files and various iterations - a bit awkward as they don't have a css to use. This patch adds cgroup->dummy_css which has NULL ->ss and whose only role is pointing back to the cgroup. This will be used to support subsystem agnostic features on the coming css based API. css_parent() is updated to handle dummy_css's. Note that css will soon grow its own ->parent field and css_parent() will be made trivial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: add subsys backlink pointer to cftypeTejun Heo2013-08-081-1/+7
| | | | | | | | | | | | | | | | | | | | | cgroup is transitioning to using css (cgroup_subsys_state) instead of cgroup as the primary subsystem handle. The cgroupfs file interface will be converted to use css's which requires finding out the subsystem from cftype so that the matching css can be determined from the cgroup. This patch adds cftype->ss which points to the subsystem the file belongs to. The field is initialized while a cftype is being registered. This makes it unnecessary to explicitly specify the subsystem for other cftype handling functions. @ss argument dropped from various cftype handling functions. This patch shouldn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk>
* cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methodsTejun Heo2013-08-081-10/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup is currently in the process of transitioning to using struct cgroup_subsys_state * as the primary handle instead of struct cgroup * in subsystem implementations for the following reasons. * With unified hierarchy, subsystems will be dynamically bound and unbound from cgroups and thus css's (cgroup_subsys_state) may be created and destroyed dynamically over the lifetime of a cgroup, which is different from the current state where all css's are allocated and destroyed together with the associated cgroup. This in turn means that cgroup_css() should be synchronized and may return NULL, making it more cumbersome to use. * Differing levels of per-subsystem granularity in the unified hierarchy means that the task and descendant iterators should behave differently depending on the specific subsystem the iteration is being performed for. * In majority of the cases, subsystems only care about its part in the cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods often obtain the matching css pointer from the cgroup and don't bother with the cgroup pointer itself. Passing around css fits much better. This patch converts all cgroup_subsys methods to take @css instead of @cgroup. The conversions are mostly straight-forward. A few noteworthy changes are * ->css_alloc() now takes css of the parent cgroup rather than the pointer to the new cgroup as the css for the new cgroup doesn't exist yet. Knowing the parent css is enough for all the existing subsystems. * In kernel/cgroup.c::offline_css(), unnecessary open coded css dereference is replaced with local variable access. This patch shouldn't cause any behavior differences. v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced with local variable @css as suggested by Li Zefan. Rebased on top of new for-3.12 which includes for-3.11-fixes so that ->css_free() invocation added by da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails") is converted too. Suggested by Li Zefan. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Steven Rostedt <rostedt@goodmis.org>
* cgroup: add css_parent()Tejun Heo2013-08-081-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, controllers have to explicitly follow the cgroup hierarchy to find the parent of a given css. cgroup is moving towards using cgroup_subsys_state as the main controller interface construct, so let's provide a way to climb the hierarchy using just csses. This patch implements css_parent() which, given a css, returns its parent. The function is guarnateed to valid non-NULL parent css as long as the target css is not at the top of the hierarchy. freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices are converted to use css_parent() instead of accessing cgroup->parent directly. * __parent_ca() is dropped from cpuacct and its usage is replaced with parent_ca(). The only difference between the two was NULL test on cgroup->parent which is now embedded in css_parent() making the distinction moot. Note that eventually a css->parent field will be added to css and the NULL check in css_parent() will go away. This patch shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: add subsystem pointer to cgroup_subsys_stateTejun Heo2013-08-081-5/+4
| | | | | | | | | | | | | | | | Currently, given a cgroup_subsys_state, there's no way to find out which subsystem the css is for, which we'll need to convert the cgroup controller API to primarily use @css instead of @cgroup. This patch adds cgroup_subsys_state->ss which points to the subsystem the @css belongs to. While at it, remove the comment about accessing @css->cgroup to determine the hierarchy. cgroup core will provide API to traverse hierarchy of css'es and we don't want subsystems to directly walk cgroup hierarchies anymore. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/Tejun Heo2013-08-081-12/+19
| | | | | | | | | | | | | | | | | | The names of the two struct cgroup_subsys_state accessors - cgroup_subsys_state() and task_subsys_state() - are somewhat awkward. The former clashes with the type name and the latter doesn't even indicate it's somehow related to cgroup. We're about to revamp large portion of cgroup API, so, let's rename them so that they're less awkward. Most per-controller usages of the accessors are localized in accessor wrappers and given the amount of scheduled changes, this isn't gonna add any noticeable headache. Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state() to task_css(). This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* cgroup: Merge branch 'for-3.11-fixes' into for-3.12Tejun Heo2013-08-021-2/+1
|\ | | | | | | | | | | | | | | | | | | for-3.12 branch is about to receive invasive updates which are dependent on da0a12caff ("cgroup: fix a leak when percpu_ref_init() fails"). Given the amount of scheduled changes, I think it'd less painful to pull in for-3.11-fixes as preparation. Pull in for-3.11-fixes into for-3.12. Signed-off-by: Tejun Heo <tj@kernel.org>
| * cgroup: replace task_cgroup_path_from_hierarchy() with task_cgroup_path()Tejun Heo2013-07-121-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task_cgroup_path_from_hierarchy() was added for the planned new users and none of the currently planned users wants to know about multiple hierarchies. This patch drops the multiple hierarchy part and makes it always return the path in the first non-dummy hierarchy. As unified hierarchy will always have id 1, this is guaranteed to return the path for the unified hierarchy if mounted; otherwise, it will return the path from the hierarchy which happens to occupy the lowest hierarchy id, which will usually be the first hierarchy mounted after boot. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Lennart Poettering <lennart@poettering.net> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kaluža <jkaluza@redhat.com>
* | cgroup: implement cgroup_from_id()Li Zefan2013-07-311-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This will be used as a replacement for css_lookup(). There's a difference with cgroup id and css id. cgroup id starts with 0, while css id starts with 1. v4: - also check if cggroup_mutex is held. - make it an inline function. Signed-off-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
* | cgroup: document how cgroup IDs are assignedLi Zefan2013-07-311-1/+7
| | | | | | | | | | | | | | | | | | | | | | As cgroup id has been used in netprio cgroup and will be used in memcg, it's important to make it clear how a cgroup id is allocated. For example, in netprio cgroup, the id is used as index of anarray. Signed-off-by: Li Zefan <lizefan@huwei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
* | cgroup: convert cgroup_ida to cgroup_idrLi Zefan2013-07-311-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This enables us to lookup a cgroup by its id. v4: - add a comment for idr_remove() in cgroup_offline_fn(). v3: - on success, idr_alloc() returns the id but not 0, so fix the BUG_ON() in cgroup_init(). - pass the right value to idr_alloc() so that the id for dummy cgroup is 0. Signed-off-by: Li Zefan <lizefan@huawei.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
* | cgroup: more naming cleanupsLi Zefan2013-07-311-3/+3
| | | | | | | | | | | | | | | | Constantly use @cset for css_set variables and use @cgrp as cgroup variables. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-blockLinus Torvalds2013-07-111-0/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull core block IO updates from Jens Axboe: "Here are the core IO block bits for 3.11. It contains: - A tweak to the reserved tag logic from Jan, for weirdo devices with just 3 free tags. But for those it improves things substantially for random writes. - Periodic writeback fix from Jan. Marked for stable as well. - Fix for a race condition in IO scheduler switching from Jianpeng. - The hierarchical blk-cgroup support from Tejun. This is the grunt of the series. - blk-throttle fix from Vivek. Just a note that I'm in the middle of a relocation, whole family is flying out tomorrow. Hence I will be awal the remainder of this week, but back at work again on Monday the 15th. CC'ing Tejun, since any potential "surprises" will most likely be from the blk-cgroup work. But it's been brewing for a while and sitting in my tree and linux-next for a long time, so should be solid." * 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits) elevator: Fix a race in elevator switching block: Reserve only one queue tag for sync IO if only 3 tags are available writeback: Fix periodic writeback after fs mount blk-throttle: implement proper hierarchy support blk-throttle: implement throtl_grp->has_rules[] blk-throttle: Account for child group's start time in parent while bio climbs up blk-throttle: add throtl_qnode for dispatch fairness blk-throttle: make throtl_pending_timer_fn() ready for hierarchy blk-throttle: make tg_dispatch_one_bio() ready for hierarchy blk-throttle: make blk_throtl_bio() ready for hierarchy blk-throttle: make blk_throtl_drain() ready for hierarchy blk-throttle: dispatch from throtl_pending_timer_fn() blk-throttle: implement dispatch looping blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log() blk-throttle: add throtl_service_queue->parent_sq blk-throttle: generalize update_disptime optimization in blk_throtl_bio() blk-throttle: dispatch to throtl_data->service_queue.bio_lists[] blk-throttle: move bio_lists[] and friends to throtl_service_queue ...
| * blk-throttle: implement proper hierarchy supportTejun Heo2013-05-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the recent updates, blk-throttle is finally ready for proper hierarchy support. Dispatching now honors service_queue->parent_sq and propagates correctly. The only thing missing is setting ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup hierarchy. This patch updates throtl_pd_init() such that service_queues form the same hierarchy as the cgroup hierarchy if sane_behavior is enabled. As this concludes proper hierarchy support for blkcg, the shameful .broken_hierarchy tag is removed from blkio_subsys. v2: Updated blkio-controller.txt as suggested by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Li Zefan <lizefan@huawei.com>
* | Merge branch 'for-3.11-cpuset' of ↵Linus Torvalds2013-07-021-1/+8
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cpuset changes from Tejun Heo: "cpuset has always been rather odd about its configurations - a cgroup right after creation didn't allow any task executions before configuration, changing configuration in the parent modifies the descendants irreversibly and so on. These behaviors are inherently nasty and almost hostile against sharing the hierarchy with other controllers making it very difficult to use in unified hierarchy. Li is currently in the process of updating the behaviors for __DEVEL__sane_behavior which is the bulk of changes in this pull request. It isn't complete yet and the behaviors will change further but all changes are gated behind sane_behavior. In the process, the rather hairy work-item punting which was used to work around the limitations of cgroup descendant iterator was simplified." * 'for-3.11-cpuset' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: rename @cont to @cgrp cpuset: fix to migrate mm correctly in a corner case cpuset: allow to move tasks to empty cpusets cpuset: allow to keep tasks in empty cpusets cpuset: introduce effective_{cpumask|nodemask}_cpuset() cpuset: record old_mems_allowed in struct cpuset cpuset: remove async hotplug propagation work cpuset: let hotplug propagation work wait for task attaching cpuset: re-structure update_cpumask() a bit cpuset: remove cpuset_test_cpumask() cpuset: remove unnecessary variable in cpuset_attach() cpuset: cleanup guarantee_online_{cpus|mems}() cpuset: remove redundant check in cpuset_cpus_allowed_fallback()
| * | cpuset: allow to move tasks to empty cpusetsLi Zefan2013-06-131-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently some cpuset behaviors are not friendly when cpuset is co-mounted with other cgroup controllers. Now with this patchset if cpuset is mounted with sane_behavior option, it behaves differently: - Tasks will be kept in empty cpusets when hotplug happens and take masks of ancestors with non-empty cpus/mems, instead of being moved to an ancestor. - A task can be moved into an empty cpuset, and again it takes masks of ancestors, so the user can drop a task into a newly created cgroup without having to do anything for it. As tasks can reside in empy cpusets, here're some rules: - They can be moved to another cpuset, regardless it's empty or not. - Though it takes masks from ancestors, it takes other configs from the empty cpuset. - If the ancestors' masks are changed, those tasks will also be updated to take new masks. v2: add documentation in include/linux/cgroup.h Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
| * | cpuset: allow to keep tasks in empty cpusetsLi Zefan2013-06-131-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To achieve this: - We call update_tasks_cpumask/nodemask() for empty cpusets when hotplug happens, instead of moving tasks out of them. - When a cpuset's masks are changed by writing cpuset.cpus/mems, we also update tasks in child cpusets which are empty. v3: - do propagation work in one place for both hotplug and unplug v2: - drop rcu_read_lock before calling update_task_nodemask() and update_task_cpumask(), instead of using workqueue. - add documentation in include/linux/cgroup.h Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount optionsTejun Heo2013-06-271-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1672d04070 ("cgroup: fix cgroupfs_root early destruction path") introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of subsys binding on a new root; however, this broke remounts. cgroup_remount() doesn't allow changing root options via remount and CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots, makes the function reject all remounts. Fix it by putting the options part in the lower 16 bits of root->flags and masking the comparions. While at it, make cgroup_remount() emit an error message explaining why it's rejecting a remount request, so that it's less of a mystery. Signed-off-by: Tejun Heo <tj@kernel.org>
* | | cgroup: fix RCU accesses to task->cgroupsTejun Heo2013-06-261-10/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | task->cgroups is a RCU pointer pointing to struct css_set. A task switches to a different css_set on cgroup migration but a css_set doesn't change once created and its pointers to cgroup_subsys_states aren't RCU protected. task_subsys_state[_check]() is the macro to acquire css given a task and subsys_id pair. It RCU-dereferences task->cgroups->subsys[] not task->cgroups, so the RCU pointer task->cgroups ends up being dereferenced without read_barrier_depends() after it. It's broken. Fix it by introducing task_css_set[_check]() which does RCU-dereference on task->cgroups. task_subsys_state[_check]() is reimplemented to directly dereference ->subsys[] of the css_set returned from task_css_set[_check](). This removes some of sparse RCU warnings in cgroup. v2: Fixed unbalanced parenthsis and there's no need to use rcu_dereference_raw() when !CONFIG_PROVE_RCU. Both spotted by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: stable@vger.kernel.org
* | | cgroup: fix cgroupfs_root early destruction pathTejun Heo2013-06-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroupfs_root used to have ->actual_subsys_mask in addition to ->subsys_mask. a8a648c4ac ("cgroup: remove cgroup->actual_subsys_mask") removed it noting that the subsys_mask is essentially temporary and doesn't belong in cgroupfs_root; however, the patch made it impossible to tell whether a cgroupfs_root actually has the subsystems bound or just have the bits set leading to the following BUG when trying to mount with subsystems which are already mounted elsewhere. kernel BUG at kernel/cgroup.c:1038! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ... CPU: 1 PID: 7973 Comm: mount Tainted: G W 3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000 RIP: 0010:[<ffffffff81249b29>] [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0 ... Call Trace: [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210 [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90 [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0 [<ffffffff81257169>] cpuset_mount+0xd9/0x110 [<ffffffff813d2580>] mount_fs+0xb0/0x2d0 [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180 [<ffffffff814070b5>] do_new_mount+0x145/0x2c0 [<ffffffff814085d6>] do_mount+0x356/0x3c0 [<ffffffff8140873d>] SyS_mount+0xfd/0x140 [<ffffffff854eb600>] tracesys+0xdd/0xe2 We still want rebind_subsystems() to take added/removed masks, so let's fix it by marking whether a cgroupfs_root has finished binding or not. Also, document what's going on around ->subsys_mask initialization so that similar mistakes aren't repeated. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Li Zefan <lizefan@huawei.com>
* | | cgroup: remove cgroup->actual_subsys_maskTejun Heo2013-06-241-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup curiously has two subsystem masks, ->subsys_mask and ->actual_subsys_mask. The latter only exists because the new target subsys_mask is passed into rebind_subsystems() via @root>subsys_mask. rebind_subsystems() needs to know what the current mask is to decide how to reach the target mask so ->actual_subsys_mask is used as the temp location to remember the current state. Adding a temporary field to a permanent data structure is rather silly and can be misleading. Update rebind_subsystems() to take @added_mask and @removed_mask instead and remove @root->actual_subsys_mask. This patch shouldn't introduce any behavior changes. v2: Comment and description updated as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* | | cgroup: convert CFTYPE_* flags to enumsTejun Heo2013-06-241-3/+5
| | | | | | | | | | | | | | | | | | | | | Purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* | | cgroup: rename cont to cgrpLi Zefan2013-06-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cont is short for container. control group was named process container at first, but then people found container already has a meaning in linux kernel. Clean up the leftover variable name @cont. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | cgroup: convert cgroup_cft_commit() to use cgroup_for_each_descendant_pre()Li Zefan2013-06-181-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used root->allcg_list to iterate cgroup hierarchy because at that time cgroup_for_each_descendant_pre() hasn't been invented. tj: In cgroup_cfts_commit(), s/@serial_nr/@update_upto/, move the assignment right above releasing cgroup_mutex and explain what's going on there. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
* | | cgroup: disallow rename(2) if sane_behaviorTejun Heo2013-06-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cgroup's rename(2) isn't a proper migration implementation - it can't move the cgroup to a different parent in the hierarchy. All it can do is swapping the name string for that cgroup. This isn't useful and can mislead users to think that cgroup supports proper cgroup-level migration. Disallow rename(2) if sane_behavior. v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs return value when ->rename is not implemented as suggested by Li. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
* | | cgroup: update sane_behavior documentationTejun Heo2013-06-131-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | f12dc02014 ("cgroup: mark "tasks" cgroup file as insane") and cc5943a781 ("cgroup: mark "notify_on_release" and "release_agent" cgroup files insane") forgot to update the changed behavior documentation in cgroup.h. Update it. Signed-off-by: Tejun Heo <tj@kernel.org>
* | | cgroup: use percpu refcnt for cgroup_subsys_statesTejun Heo2013-06-131-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A css (cgroup_subsys_state) is how each cgroup is represented to a controller. As such, it can be used in hot paths across the various subsystems different controllers are associated with. One of the common operations is reference counting, which up until now has been implemented using a global atomic counter and can have significant adverse impact on scalability. For example, css refcnt can be gotten and put multiple times by blkcg for each IO request. For highops configurations which try to do as much per-cpu as possible, the global frequent refcnting can be very expensive. In general, given the various and hugely diverse paths css's end up being used from, we need to make it cheap and highly scalable. In its usage, css refcnting isn't very different from module refcnting. This patch converts css refcnting to use the recently added percpu_ref. css_get/tryget/put() directly maps to the matching percpu_ref operations and the deactivation logic is no longer necessary as percpu_ref already has refcnt killing. The only complication is that as the refcnt is per-cpu, percpu_ref_kill() in itself doesn't ensure that further tryget operations will fail, which we need to guarantee before invoking ->css_offline()'s. This is resolved collecting kill confirmation using percpu_ref_kill_and_confirm() and initiating the offline phase of destruction after all css refcnt's are confirmed to be seen as killed on all CPUs. The previous patches already splitted destruction into two phases, so percpu_ref_kill_and_confirm() can be hooked up easily. This patch removes css_refcnt() which is used for rcu dereference sanity check in css_id(). While we can add a percpu refcnt API to ask the same question, css_id() itself is scheduled to be removed fairly soon, so let's not bother with it. Just drop the sanity check and use rcu_dereference_raw() instead. v2: - init_cgroup_css() was calling percpu_ref_init() without checking the return value. This causes two problems - the obvious lack of error handling and percpu_ref_init() being called from cgroup_init_subsys() before the allocators are up, which triggers warnings but doesn't cause actual problems as the refcnt isn't used for roots anyway. Fix both by moving percpu_ref_init() to cgroup_create(). - The base references were put too early by percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the refs one extra time. This wasn't noticeable because css's go through another RCU grace period before being freed. Update cgroup_destroy_locked() to grab an extra reference before killing the refcnts. This problem was noticed by Kent. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <koverstreet@google.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Alasdair G. Kergon" <agk@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Glauber Costa <glommer@gmail.com>
* | | cgroup: split cgroup destruction into two stepsTejun Heo2013-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Split cgroup_destroy_locked() into two steps and put the latter half into cgroup_offline_fn() which is executed from a work item. The latter half is responsible for offlining the css's, removing the cgroup from internal lists, and propagating release notification to the parent. The separation is to allow using percpu refcnt for css. Note that this allows for other cgroup operations to happen between the first and second halves of destruction, including creating a new cgroup with the same name. As the target cgroup is marked DEAD in the first half and cgroup internals don't care about the names of cgroups, this should be fine. A comment explaining this will be added by the next patch which implements the actual percpu refcnting. As RCU freeing is guaranteed to happen after the second step of destruction, we can use the same work item for both. This patch renames cgroup->free_work to ->destroy_work and uses it for both purposes. INIT_WORK() is now performed right before queueing the work item. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com>
OpenPOWER on IntegriCloud