path: root/fs/ocfs2/super.c
Commit message (Author, Age, Files, Lines)
* ocfs2: clean up some dead code (Jun Piao, 2017-09-06, 1 file, -1/+0)

    Clean up some unused functions and parameters.

    Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
    Signed-off-by: Jun Piao <piaojun@huawei.com>
    Reviewed-by: Alex Chen <alex.chen@huawei.com>
    Cc: Mark Fasheh <mfasheh@versity.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <jiangqi903@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: switch ->s_uuid to uuid_t (Christoph Hellwig, 2017-06-05, 1 file, -1/+1)

    For some file systems we still memcpy into it, but in various places
    this already allows us to use the proper uuid helpers. More to come.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Amir Goldstein <amir73il@gmail.com>
    Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> (Changes to IMA/EVM)
    Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
* sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency (Ingo Molnar, 2017-03-02, 1 file, -0/+1)

    Instead of including the full <linux/signal.h>, we are going to include
    the types-only <linux/signal_types.h> header in <linux/sched.h>, to
    further decouple the scheduler header from the signal headers.

    This means that various files which relied on the full <linux/signal.h>
    need to be updated to gain an explicit dependency on it.

    Update the code that relies on sched.h's inclusion of the
    <linux/signal.h> header.

    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs (Linus Torvalds, 2016-12-19, 1 file, -1/+0)
|\

    Pull quota, fsnotify and ext2 updates from Jan Kara:
     "Changes to locking of some quota operations from dedicated quota
      mutex to s_umount semaphore, a fsnotify fix and a simple ext2 fix"

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
      quota: Fix bogus warning in dquot_disable()
      fsnotify: Fix possible use-after-free in inode iteration on umount
      ext2: reject inodes with negative size
      quota: Remove dqonoff_mutex
      ocfs2: Use s_umount for quota recovery protection
      quota: Remove dqonoff_mutex from dquot_scan_active()
      ocfs2: Protect periodic quota syncing with s_umount semaphore
      quota: Use s_umount protection for quota operations
      quota: Hold s_umount in exclusive mode when enabling / disabling quotas
      fs: Provide function to get superblock with exclusive s_umount
| * ocfs2: Protect periodic quota syncing with s_umount semaphore (Jan Kara, 2016-11-30, 1 file, -1/+0)

    New quota locking rules will require the s_umount semaphore for all
    quota scanning functions. Add it for periodic quota syncing.

    Tested-by: Eric Ren <zren@suse.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
* | ocfs2: use time64_t to represent orphan scan times (Deepa Dinamani, 2016-12-12, 1 file, -1/+1)
|/

    struct timespec is not y2038 safe. Use time64_t, which is y2038 safe,
    to represent orphan scan times. time64_t is sufficient here as only the
    seconds delta times are relevant.

    Also use appropriate time functions that return time in time64_t
    format. Time functions now return monotonic time instead of real time
    as only delta scan times are relevant and these values are not
    persistent across reboots.

    The format string for the debug print is still using long as this is
    only the time elapsed since the last scan and long is sufficient to
    represent this value.

    Link: http://lkml.kernel.org/r/1475365138-20567-1-git-send-email-deepa.kernel@gmail.com
    Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
    Reviewed-by: Arnd Bergmann <arnd@arndb.de>
    Cc: Mark Fasheh <mfasheh@versity.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <jiangqi903@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
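A minimal userspace sketch of the idea in the time64_t commit above: keep scan timestamps as 64-bit seconds read from a monotonic clock, and narrow only the elapsed delta to long for printing. The names `secs64_t`, `monotonic_seconds()` and `seconds_since()` are hypothetical stand-ins chosen for illustration; this is not the kernel code touched by the commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the kernel's time64_t: signed 64-bit seconds. */
typedef int64_t secs64_t;

/* Read whole seconds from a monotonic clock (not affected by wall-clock changes). */
static secs64_t monotonic_seconds(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (secs64_t)ts.tv_sec;
}

/* Elapsed seconds since the last scan; only this small delta is narrowed to long. */
static long seconds_since(secs64_t last_scan, secs64_t now)
{
	return (long)(now - last_scan);
}
```

A caller would store `monotonic_seconds()` when a scan finishes and later print `seconds_since(stored, monotonic_seconds())` with a `%ld` format.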
* fs/ocfs2/super: remove deprecated create_singlethread_workqueue() (Bhaktipriya Shridhar, 2016-10-07, 1 file, -1/+1)

    The workqueue "ocfs2_wq" queues multiple work items, viz.
    &osb->la_enable_wq, &journal->j_recovery_work,
    &os->os_orphan_scan_work and &osb->osb_truncate_log_wq, which require
    strict execution ordering. Hence, an ordered dedicated workqueue has
    been used.

    WQ_MEM_RECLAIM has been set to ensure forward progress under memory
    pressure because the workqueue is being used on a memory reclaim path.

    Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
    Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
    Acked-by: Tejun Heo <tj@kernel.org>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <joseph.qi@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew) (Linus Torvalds, 2016-07-26, 1 file, -1/+0)
|\

    Merge updates from Andrew Morton:

     - a few misc bits
     - ocfs2
     - most(?) of MM

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
      thp: fix comments of __pmd_trans_huge_lock()
      cgroup: remove unnecessary 0 check from css_from_id()
      cgroup: fix idr leak for the first cgroup root
      mm: memcontrol: fix documentation for compound parameter
      mm: memcontrol: remove BUG_ON in uncharge_list
      mm: fix build warnings in <linux/compaction.h>
      mm, thp: convert from optimistic swapin collapsing to conservative
      mm, thp: fix comment inconsistency for swapin readahead functions
      thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
      shmem: split huge pages beyond i_size under memory pressure
      thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
      khugepaged: add support of collapse for tmpfs/shmem pages
      shmem: make shmem_inode_info::lock irq-safe
      khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
      thp: extract khugepaged from mm/huge_memory.c
      shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
      shmem: add huge pages support
      shmem: get_unmapped_area align huge page
      shmem: prepare huge= mount option and sysfs knob
      mm, rmap: account shmem thp pages
      ...
| * ocfs2: cleanup implemented prototypes (Joseph Qi, 2016-07-26, 1 file, -1/+0)

    Several prototypes in inode.h are just defined but not actually
    implemented and used, so remove them.

    Link: http://lkml.kernel.org/r/57763787.4020706@huawei.com
    Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fs: have ll_rw_block users pass in op and flags separately (Mike Christie, 2016-06-07, 1 file, -1/+1)
|/

    This has ll_rw_block users pass in the operation and flags separately,
    so ll_rw_block can set up the bio op and bi_rw flags on the bio that is
    submitted.

    Signed-off-by: Mike Christie <mchristi@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Hannes Reinecke <hare@suse.com>
    Signed-off-by: Jens Axboe <axboe@fb.com>
* mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros (Kirill A. Shutemov, 2016-04-04, 1 file, -2/+2)

    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized. And unlikely will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE. And it's a constant source of confusion on whether
    PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straight-forward:

     - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
     - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
     - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
     - page_cache_get() -> get_page();
     - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    the script below. For some reason, coccinelle doesn't patch header
    files. I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code where coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local (jiangyiwen, 2016-03-25, 1 file, -22/+15)

    This patch fixes a deadlock, as follows:

    1) Node 1: volumes a and b are mounted.
       Node 2: only vol a is mounted.
       Node 3: only vol b is mounted.

    2) Node 2: starts to mount b.
       Node 3: starts to mount a.

    3) Node 2: checks hb of Node 3 in vol a, qs_holds++.
       Node 3: checks hb of Node 2 in vol b, qs_holds++.

    4) All nodes' network goes down.

    5) Node 2: the mount of b fails and then calls ocfs2_dismount_volume,
       but the process hangs, since there is a work item in ocfs2_wq that
       cannot be completed. This work is about vol a, because ocfs2_wq is
       a global wq. BTW, the work scheduled in ocfs2_wq is
       ocfs2_orphan_scan_work, and the context in this work needs to take
       the inode lock of orphan_dir. Because the lockres owner is Node 1
       and all nodes' network has been down at the same time, it can't
       get the inode lock.
       Node 3: the same situation as Node 2.

    6) Why can't this node be fenced when the network is disconnected?
       Because the mount process is hung, which causes qs_holds to be
       not equal to 0.

    Because all works in the ocfs2_wq are relative to the super block.

    The solution is to change the ocfs2_wq from global to local. In other
    words, move it into struct ocfs2_super.

    Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
    Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Xue jiufei <xuejiufei@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix ip_unaligned_aio deadlock with dio work queue (Ryan Ding, 2016-03-25, 1 file, -1/+0)

    In the current implementation of unaligned aio+dio, the lock order
    behaves as follows:

    in user process context:
      -> call io_submit()
        -> get i_mutex
              <== window1
        -> get ip_unaligned_aio
        -> submit direct io to block device
        -> release i_mutex
      -> io_submit() return

    in dio work queue context (the work queue is created in
    __blockdev_direct_IO):
      -> release ip_unaligned_aio
              <== window2
      -> get i_mutex
      -> clear unwritten flag & change i_size
      -> release i_mutex

    There is a limit on the number of threads in the dio work queue, 256
    by default. If all 256 threads are in the above 'window2' stage, and
    there is a user process in the 'window1' stage, the system will
    deadlock: the user process holds i_mutex while waiting for the
    ip_unaligned_aio lock, and a direct bio holds the ip_unaligned_aio
    mutex while waiting for a dio work queue thread to be scheduled. But
    all the dio work queue threads are waiting for the i_mutex lock in
    'window2'.

    This case only happened in a test which sends a large number (more
    than 256) of aio in one io_submit() call.

    My design is to remove the ip_unaligned_aio lock and change it to a
    sync io instead. Just like the ip_unaligned_aio lock, this serializes
    the unaligned aio dio.

    [akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
    Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
    Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <joseph.qi@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: record UNWRITTEN extents when populate write desc (Ryan Ding, 2016-03-25, 1 file, -0/+1)

    To support direct io in ocfs2_write_begin_nolock &
    ocfs2_write_end_nolock.

    There is still one issue in the direct write procedure.

    phase 1: alloc extent with UNWRITTEN flag
    phase 2: submit direct data to disk, add zero page to page cache
    phase 3: clear UNWRITTEN flag when data has been written to disk

    Suppose there are 2 direct writes, A (0~3KB) and B (4~7KB), writing to
    the same cluster 0~7KB (cluster size 8KB). Write request A arrives at
    phase 2 first, so it will zero the region (4~7KB). Before request A
    enters phase 3, request B arrives at phase 2 and zeroes the region
    (0~3KB). In effect, request B steps on request A.

    To resolve this issue, we should let request B know that this cluster
    is already being zeroed, to prevent it from stepping on the previous
    write request.

    This patch adds the function ocfs2_unwritten_check() to do this job.
    It records all clusters that are under direct write (in the
    'ip_unwritten_list' member of the inode info), and prevents a later
    direct write to the same cluster from doing the zero work again.

    Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
    Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <joseph.qi@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
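The bookkeeping described in the UNWRITTEN extents commit above can be sketched in userspace roughly as follows: a per-inode list remembers which clusters are already under direct write, so only the first writer to a cluster does the zeroing. This is an illustrative sketch only; `struct unwritten_entry`, `claim_cluster_for_zeroing()` and `release_clusters()` are hypothetical names and do not reproduce ocfs2_unwritten_check() or its locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical list node: one cluster currently under direct write. */
struct unwritten_entry {
	uint32_t cluster;
	struct unwritten_entry *next;
};

/*
 * Returns true if the caller is the first writer to this cluster and must
 * zero the untouched part; false if an earlier write already recorded it.
 */
static bool claim_cluster_for_zeroing(struct unwritten_entry **head,
				      uint32_t cluster)
{
	struct unwritten_entry *e;

	assert(head != NULL);
	for (e = *head; e != NULL; e = e->next)
		if (e->cluster == cluster)
			return false;

	e = malloc(sizeof(*e));
	if (e == NULL)
		abort();
	e->cluster = cluster;
	e->next = *head;
	*head = e;
	return true;
}

/* Drop all records, e.g. once the writes have completed. */
static void release_clusters(struct unwritten_entry **head)
{
	assert(head != NULL);
	while (*head != NULL) {
		struct unwritten_entry *e = *head;

		*head = e->next;
		free(e);
	}
}
```

In the A/B example from the commit message, request A would get `true` for the shared cluster and zero the rest of it, while request B would get `false` and skip the zeroing.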
* ocfs2: create/remove sysfile for online file check (Gang He, 2016-03-22, 1 file, -0/+5)

    Create the online file check sysfile when ocfs2 mounts, and remove the
    related sysfile when ocfs2 unmounts.

    Signed-off-by: Gang He <ghe@suse.com>
    Reviewed-by: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Joseph Qi <joseph.qi@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump() (jiangyiwen, 2016-03-15, 1 file, -2/+3)

    Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert
    lock") missed a place in ocfs2_osb_dump(), so a deadlock scenario still
    exists.

      ocfs2_wake_downconvert_thread
      ocfs2_rw_unlock
      ocfs2_dio_end_io
      dio_complete
      .....
      bio_endio
      req_bio_endio
      ....
      scsi_io_completion
      blk_done_softirq
      __do_softirq
      do_softirq
      irq_exit
      do_IRQ
      ocfs2_osb_dump
      cat /sys/kernel/debug/ocfs2/${uuid}/fs_state

    This patch uses spin_lock_irqsave() in place of spin_lock() to resolve
    this situation.

    Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
    Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kmemcg: account certain kmem allocations to memcg (Vladimir Davydov, 2016-01-14, 1 file, -1/+1)

    Mark those kmem allocations that are known to be easily triggered from
    userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
    memcg. For the list, see below:

     - threadinfo
     - task_struct
     - task_delay_info
     - pid
     - cred
     - mm_struct
     - vm_area_struct and vm_region (nommu)
     - anon_vma and anon_vma_chain
     - signal_struct
     - sighand_struct
     - fs_struct
     - files_struct
     - fdtable and fdtable->full_fds_bits
     - dentry and external_name
     - inode for all filesystems. This is the most tedious part, because
       most filesystems overwrite the alloc_inode method.

    The list is far from complete, so feel free to add more objects.
    Nevertheless, it should be close to the "account everything" approach
    and keep most workloads within bounds. Malevolent users will be able to
    breach the limit, but this was possible even with the former "account
    everything" approach (simply because it did not account everything in
    fact).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Pekka Enberg <penberg@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: clean up redundant NULL check before iput (Joseph Qi, 2016-01-14, 1 file, -2/+1)

    Since iput takes care of the NULL check itself, a NULL check before
    calling it is redundant. So clean them up.

    Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: optimize bad declarations and redundant assignment (Norton.Zhu, 2016-01-14, 1 file, -6/+2)

    In ocfs2_parse_options:

    a) it's better to declare (small) variables outside of the while loop;

    b) 'option' will be set by match_int, so 'option = 0;' makes no sense;
       if match_int fails, it just goes to bail and returns.

    Signed-off-by: Norton.Zhu <norton.zhu@huawei.com>
    Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Gang He <ghe@suse.com>
    Cc: Mark Fasheh <mfasheh@suse.de>
    Acked-by: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: create and use seq_show_option for escaping (Kees Cook, 2015-09-04, 1 file, -2/+2)

    Many file systems that implement the show_options hook fail to
    correctly escape their output which could lead to unescaped characters
    (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo
    files. This could lead to confusion, spoofed entries (resulting in
    things like systemd issuing false d-bus "mount" notifications), and who
    knows what else. This looks like it would only be the root user
    stepping on themselves, but it's possible weird things could happen in
    containers or in other situations with delegated mount privileges.

    Here's an example using overlay with setuid fusermount trusting the
    contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use
    of "sudo" is something more sneaky:

      $ BASE="ovl"
      $ MNT="$BASE/mnt"
      $ LOW="$BASE/lower"
      $ UP="$BASE/upper"
      $ WORK="$BASE/work/ 0 0
      none /proc fuse.pwn user_id=1000"
      $ mkdir -p "$LOW" "$UP" "$WORK"
      $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
      $ cat /proc/mounts
      none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
      none /proc fuse.pwn user_id=1000 0 0
      $ fusermount -u /proc
      $ cat /proc/mounts
      cat: /proc/mounts: No such file or directory

    This fixes the problem by adding new seq_show_option and
    seq_show_option_n helpers, and updating the vulnerable show_option
    handlers to use them as needed. Some, like SELinux, need to be open
    coded due to unusual existing escape mechanisms.

    [akpm@linux-foundation.org: add lost chunk, per Kees]
    [keescook@chromium.org: seq_show_option should be using const parameters]
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
    Acked-by: Jan Kara <jack@suse.com>
    Acked-by: Paul Moore <paul@paul-moore.com>
    Cc: J. R. Okajima <hooanon05g@gmail.com>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
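To make the escaping idea in the seq_show_option commit above concrete, here is a small userspace sketch that rewrites separator and whitespace bytes as backslash plus three octal digits, so an embedded newline can no longer start a fake /proc/mounts line. The function `escape_option()` and the escape set passed to it are assumptions for illustration; this is not the kernel helper itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy 'src' into 'dst', replacing every byte that appears in 'esc' with a
 * backslash followed by three octal digits. Returns the number of bytes
 * written (excluding the terminating NUL), or -1 if 'dst' is too small.
 */
static int escape_option(char *dst, size_t size, const char *src,
			 const char *esc)
{
	size_t used = 0;

	assert(dst != NULL && src != NULL && esc != NULL);
	if (size == 0)
		return -1;

	for (; *src != '\0'; src++) {
		unsigned char c = (unsigned char)*src;
		size_t need = strchr(esc, c) != NULL ? 4 : 1;

		/* Keep one byte in reserve for the terminating NUL. */
		if (used + need + 1 > size)
			return -1;
		if (need == 4)
			snprintf(dst + used, 5, "\\%03o", (unsigned int)c);
		else
			dst[used] = (char)c;
		used += need;
	}
	dst[used] = '\0';
	return (int)used;
}
```

With an escape set of `",= \t\n\\"` (an assumed choice), the malicious workdir suffix from the example would be emitted on a single line, with each space as `\040` and the newline as `\012`.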
* ocfs2: neaten do_error, ocfs2_error and ocfs2_abort (Joe Perches, 2015-09-04, 1 file, -2/+2)

    These uses sometimes do and sometimes don't have '\n' terminations.
    Make the uses consistently use '\n' terminations and remove the newline
    from the functions.

    Miscellanea:

    o Coalesce formats
    o Realign arguments

    Signed-off-by: Joe Perches <joe@perches.com>
    Reviewed-by: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: add errors=continue (Goldwyn Rodrigues, 2015-09-04, 1 file, -19/+44)

    OCFS2 is often used in high-availability systems. However, ocfs2
    converts the filesystem to read-only at the drop of a hat. This may
    not be necessary, since turning the filesystem read-only would affect
    other running processes as well, decreasing availability.

    This attempt is to add errors=continue, which would return the EIO to
    the calling process and terminate further processing so that the
    filesystem is not corrupted further. However, the filesystem is not
    converted to read-only.

    As a future plan, I intend to create a small utility or extend
    fsck.ocfs2 to fix small errors such as in the inode. The input to the
    utility such as the inode can come from the kernel logs so we don't
    have to schedule a downtime for fixing small-enough errors.

    The patch changes ocfs2_error to return an error. The error returned
    depends on the mount option set. If none is set, the default is to
    turn the filesystem read-only.

    Perhaps errors=continue is not the best option name. Historically it is
    used for making an attempt to progress in the current process itself.
    Should we call it errors=eio? or errors=killproc? Suggestions/Comments
    welcome.

    Sources are available at:
    https://github.com/goldwynr/linux/tree/error-cont

    Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
    Signed-off-by: Mark Fasheh <mfasheh@suse.de>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix race between dio and recover orphan (Joseph Qi, 2015-09-04, 1 file, -2/+0)

    During direct io the inode will be added to orphan first and then
    deleted from orphan. There is a race window that the orphan entry will
    be deleted twice and thus trigger the BUG when validating
    OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

    ocfs2_direct_IO_write
        ...
        ocfs2_add_inode_to_orphan
        >>>>>>>> race window.
                 1) another node may rm the file and then down, this node
                 take care of orphan recovery and clear flag
                 OCFS2_DIO_ORPHANED_FL.
                 2) since rw lock is unlocked, it may race with another
                 orphan recovery and append dio.
        ocfs2_del_inode_from_orphan

    So take inode mutex lock when recovering orphans and make rw unlock at
    the end of aio write in case of append dio.

    Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
    Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
    Cc: Weiwei Wang <wangww631@huawei.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Revert "ocfs2: incorrect check for debugfs returns" (Linus Torvalds, 2015-04-21, 1 file, -5/+4)

    This reverts commit e2ac55b6a8e337fac7cc59c6f452caac92ab5ee6.

    Huang Ying reports that this causes a hang at boot with debugfs
    disabled.

    It is true that the debugfs error checks are kind of confusing, and
    this code certainly merits more cleanup and thinking about it, but
    there's something wrong with the trivial "check not just for NULL, but
    for error pointers too" patch.

    Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir
    pointer variable to an error pointer (-ENODEV), and then continue as if
    everything was fine. But since debugfs is disabled, all the _users_ of
    that pointer end up being compiled away, so even though the pointer can
    not be dereferenced, that's still fine.

    So it's confusing and somewhat questionable, but the "more correct"
    error checks end up causing more trouble than they fix.

    Reported-by: Huang Ying <ying.huang@intel.com>
    Acked-by: Andrew Morton <akpm@linux-foundation.org>
    Acked-by: Chengyu Song <csong84@gatech.edu>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* cleancache: zap uuid arg of cleancache_init_shared_fs (Vladimir Davydov, 2015-04-14, 1 file, -1/+1)

    Use super_block->s_uuid instead. Every shared filesystem using
    cleancache must now initialize super_block->s_uuid before calling
    cleancache_init_shared_fs. The only one in the tree, ocfs2, already
    meets this requirement.

    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Stefan Hengelein <ilendir@googlemail.com>
    Cc: Florian Schmaus <fschmaus@gmail.com>
    Cc: Andor Daam <andor.daam@googlemail.com>
    Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
    Cc: Bob Liu <lliubbo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: copy fs uuid to superblock (Vladimir Davydov, 2015-04-14, 1 file, -0/+2)

    Currently, the maximal number of cleancache enabled filesystems equals
    32, which is insufficient nowadays, because a Linux host can have
    hundreds of containers on board, each of which might want its own
    filesystem. This patch set targets removing this limitation - see
    patch 4 for more details. Patches 1-3 prepare the code for this change.

    This patch (of 4):

    This will allow us to remove the uuid argument from
    cleancache_init_shared_fs.

    Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
    Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
    Cc: David Vrabel <david.vrabel@citrix.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Stefan Hengelein <ilendir@googlemail.com>
    Cc: Florian Schmaus <fschmaus@gmail.com>
    Cc: Andor Daam <andor.daam@googlemail.com>
    Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
    Cc: Bob Liu <lliubbo@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: logging: remove static buffer, use vsprintf extension %pV (Joe Perches, 2015-04-14, 1 file, -15/+18)

    Use the vsprintf %pV extension to avoid using a static buffer and
    remove the now unnecessary buffer.

    Signed-off-by: Joe Perches <joe@perches.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: incorrect check for debugfs returns (Chengyu Song, 2015-04-14, 1 file, -4/+5)

    debugfs_create_dir and debugfs_create_file may return -ENODEV when
    debugfs is not configured, so the return value should be checked
    against ERROR_VALUE as well, otherwise the later dereference of the
    dentry pointer would crash the kernel.

    This patch tries to solve this problem by fixing certain checks.
    However, I have found that other call sites are protected by #ifdef
    CONFIG_DEBUG_FS. In the current implementation, if CONFIG_DEBUG_FS is
    defined, then the above two functions will never return any
    ERROR_VALUE. So another possibility to fix this is to surround all the
    buggy checks/functions with the same #ifdef CONFIG_DEBUG_FS. But I'm
    not sure if this would break any functionality, as only OCFS2_FS_STATS
    declares a dependency on DEBUG_FS.

    Signed-off-by: Chengyu Song <csong84@gatech.edu>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: wait for orphan recovery first once append O_DIRECT write crash (Joseph Qi, 2015-02-16, 1 file, -0/+2)

    If one node has crashed with an orphan entry left over, another node
    which does an append O_DIRECT write to the same file will override the
    i_dio_orphaned_slot. Then the old entry will never be cleaned up. If
    this case happens, we let it wait for orphan recovery first.

    Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
    Cc: Weiwei Wang <wangww631@huawei.com>
    Cc: Joel Becker <jlbec@evilplan.org>
    Cc: Junxiao Bi <junxiao.bi@oracle.com>
    Cc: Mark Fasheh <mfasheh@suse.com>
    Cc: Xuejiufei <xuejiufei@huawei.com>
    Cc: alex chen <alex.chen@huawei.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'akpm' (patches from Andrew) (Linus Torvalds, 2015-02-10, 1 file, -0/+17)
|\

    Merge misc updates from Andrew Morton:
     "Bite-sized chunks this time, to avoid the MTA ratelimiting woes.

       - fs/notify updates

       - ocfs2

       - some of MM"

    That laconic "some MM" is mainly the removal of remap_file_pages(),
    which is a big simplification of the VM, and which gets rid of a *lot*
    of random cruft and special cases because we no longer support the
    non-linear mappings that it used.

    From a user interface perspective, nothing has changed, because the
    remap_file_pages() syscall still exists, it's just done by emulating
    the old behavior by creating a lot of individual small mappings instead
    of one non-linear one.

    The emulation is slower than the old "native" non-linear mappings, but
    nobody really uses or cares about remap_file_pages(), and simplifying
    the VM is a big advantage.

    * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (78 commits)
      memcg: zap memcg_slab_caches and memcg_slab_mutex
      memcg: zap memcg_name argument of memcg_create_kmem_cache
      memcg: zap __memcg_{charge,uncharge}_slab
      mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check
      mm: hugetlb: fix type of hugetlb_treat_as_movable variable
      mm, hugetlb: remove unnecessary lower bound on sysctl handlers"?
      mm: memory: merge shared-writable dirtying branches in do_wp_page()
      mm: memory: remove ->vm_file check on shared writable vmas
      xtensa: drop _PAGE_FILE and pte_file()-related helpers
      x86: drop _PAGE_FILE and pte_file()-related helpers
      unicore32: drop pte_file()-related helpers
      um: drop _PAGE_FILE and pte_file()-related helpers
      tile: drop pte_file()-related helpers
      sparc: drop pte_file()-related helpers
      sh: drop _PAGE_FILE and pte_file()-related helpers
      score: drop _PAGE_FILE and pte_file()-related helpers
      s390: drop pte_file()-related helpers
      parisc: drop _PAGE_FILE and pte_file()-related helpers
      openrisc: drop _PAGE_FILE and pte_file()-related helpers
      nios2: drop _PAGE_FILE and pte_file()-related helpers
      ...
| * ocfs2: add a mount option journal_async_commit on ocfs2 filesystemalex chen2015-02-101-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a mount option to support the JBD2 feature JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT. When this feature is enabled, the journal commit block can be written to disk without waiting for the descriptor blocks, which can improve journal commit performance. This option enables 'journal_checksum' internally. With the fs_mark benchmark, journal_async_commit shows a roughly 50% improvement: files per second go up from 215.2 to 317.5. Test script: fs_mark -d /mnt/ocfs2/ -s 10240 -n 1000. Default: FSUse% 0, Count 1000, Size 10240, Files/sec 215.2, App Overhead 17878. With the journal_async_commit option: FSUse% 0, Count 1000, Size 10240, Files/sec 317.5, App Overhead 17881. Signed-off-by: Alex Chen <alex.chen@huawei.com> Signed-off-by: Weiwei Wang <wangww631@huawei.comm> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: Use generic helpers for quotaon and quotaoffJan Kara2015-01-301-31/+1
|/ | | | | | | | | | | Ocfs2 can just use the generic helpers provided by quota code for turning quotas on and off when quota files are stored as system inodes. The only difference is the feature test in ocfs2_quota_on() and that is covered by dquot_quota_enable() checking whether usage tracking is enabled (which can happen only if the filesystem has the quota feature set). Signed-off-by: Jan Kara <jack@suse.cz>
* Merge branch 'akpm' (patchbomb from Andrew)Linus Torvalds2014-12-101-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge first patchbomb from Andrew Morton: - a few minor cifs fixes - dma-debug updates - ocfs2 - slab - about half of MM - procfs - kernel/exit.c - panic.c tweaks - printk updates - lib/ updates - checkpatch updates - fs/binfmt updates - the drivers/rtc tree - nilfs - kmod fixes - more kernel/exit.c - various other misc tweaks and fixes * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (190 commits) exit: pidns: fix/update the comments in zap_pid_ns_processes() exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting exit: exit_notify: re-use "dead" list to autoreap current exit: reparent: call forget_original_parent() under tasklist_lock exit: reparent: avoid find_new_reaper() if no children exit: reparent: introduce find_alive_thread() exit: reparent: introduce find_child_reaper() exit: reparent: document the ->has_child_subreaper checks exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper() exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparenting exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparenting exit: proc: don't try to flush /proc/tgid/task/tgid exit: release_task: fix the comment about group leader accounting exit: wait: drop tasklist_lock before psig->c* accounting exit: wait: don't use zombie->real_parent exit: wait: cleanup the ptrace_reparented() checks usermodehelper: kill the kmod_thread_locker logic usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper() fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmp nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races ...
| * ocfs2: fix error handling when creating debugfs root in ocfs2_init()Jan Kara2014-12-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Error handling if creation of root of debugfs in ocfs2_init() fails is broken. Although error code is set we fail to exit ocfs2_init() with error and thus initialization ends with success. Later when mounting a filesystem, ocfs2 debugfs entries end up being created in the root of debugfs filesystem which is confusing. Fix the error handling to bail out. Coverity id: 1227009. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: Convert to private i_dquot fieldJan Kara2014-11-101-0/+8
|/ | | | | | | | CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* Merge branch 'for_linus' of ↵Linus Torvalds2014-10-111-13/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF and quota updates from Jan Kara: "A few UDF fixes and also a few patches which are preparing filesystems for support of project quotas in VFS" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix loading of special inodes ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr() udf: remove redundant sys_tz declaration ocfs2: Don't use MAXQUOTAS value reiserfs: Don't use MAXQUOTAS value ext3: Don't use MAXQUOTAS value udf: Fix race between write(2) and close(2)
| * ocfs2: Don't use MAXQUOTAS valueJan Kara2014-09-171-13/+17
| | | | | | | | | | | | | | | | | | | | | | | | MAXQUOTAS value defines maximum number of quota types VFS supports. This isn't necessarily the number of types ocfs2 supports and with addition of project quotas these two numbers stop matching. So make ocfs2 use its private definition. CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Signed-off-by: Jan Kara <jack@suse.cz>
* | ocfs2: free vol_label in ocfs2_delete_osb()Joseph Qi2014-09-261-0/+1
|/ | | | | | | | | | | | osb->vol_label is malloced in ocfs2_initialize_super but not freed if error occurs or during umount, thus causing a memory leak. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: revert "ocfs2: fix NULL pointer dereference when dismount and ↵Xue jiufei2014-06-231-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | ocfs2rec simultaneously" 75f82eaa502c ("ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneously") may cause umount to hang while shutting down the truncate log. The situation is as follows: ocfs2_dismount_volume -> ocfs2_recovery_exit -> free osb->recovery_map; then ocfs2_dismount_volume -> ocfs2_truncate_shutdown -> lock global bitmap inode -> ocfs2_wait_for_recovery -> check whether osb->recovery_map->rm_used is zero. Because osb->recovery_map is already freed, rm_used can hold an arbitrary value, so umount may hang. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ocfs2/super.c: use OCFS2_MAX_VOL_LABEL_LEN and strlcpyFabian Frederick2014-06-041-2/+2
| | | | | | | | | | Replace strncpy(size 63) by defined value. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove NULL assignments on staticFabian Frederick2014-06-041-2/+2
| | | | | | | | | | Static values are automatically initialized to NULL. Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'ext4_for_linus' of ↵Linus Torvalds2014-04-041-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
| * fs: push sync_filesystem() down to the file system's remount_fs()Theodore Ts'o2014-03-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the no-op "mount -o remount /dev/xxx" operation, when the file system is already mounted read-write, caused an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly not documented or guaranteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org
* | ocfs2: avoid system inode ref confusion by adding mutex lockjiangyiwen2014-04-031-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following case can leave a system inode with an extra reference. Thread A: ocfs2_get_system_file_inode -> get_local_system_inode -> _ocfs2_get_system_file_inode, because *arr == NULL. Thread B: takes the same path (ocfs2_get_system_file_inode -> get_local_system_inode -> _ocfs2_get_system_file_inode) before thread A has set *arr. Thread A: gets the first ref through _ocfs2_get_system_file_inode, gets the second ref through igrab, and sets *arr = inode. Thread B: at the same moment also gets two refs, which leaves one extra inode ref. So add a mutex lock to keep multiple threads from each taking the two inode refs at the same time. Signed-off-by: jiangyiwen <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: revert iput deferring code in ocfs2_drop_dentry_lockGoldwyn Rodrigues2014-04-031-29/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following patches are reverted in this patch because they caused a performance regression in remote unlink() calls: ea455f8ab683 - ocfs2: Push out dropping of dentry lock to ocfs2_wq; f7b1aa69be13 - ocfs2: Fix deadlock on umount; 5fd131893793 - ocfs2: Don't oops in ocfs2_kill_sb on a failed mount. Previous patches in this series removed the possible deadlocks from the downconvert thread, so the above patches shouldn't be needed anymore. The regression is caused because these patches delay the iput() in case of dentry unlocks. This also delays the unlocking of the open lockres. The open lock resource is required to test whether the inode can be wiped from disk or not. When the deleting node does not get the open lock, it marks the inode as orphan (even though it is not in use by another node/process) and causes a journal checkpoint. This delays operations following the inode eviction. This also moves the inode to the orphaned inode which further causes more I/O and a lot of unnecessary orphans. 
The following script can be used to generate the load causing issues:

declare -a create
declare -a remove
declare -a iterations=(1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384)
unique="`mktemp -u XXXXX`"
script="/tmp/idontknow-${unique}.sh"
cat <<EOF > "${script}"
for n in {1..8}; do
  mkdir -p test/dir\${n}
  eval touch test/dir\${n}/foo{1.."\$1"}
done
EOF
chmod 700 "${script}"

function fcreate ()
{
  exec 2>&1 /usr/bin/time --format=%E "${script}" "$1"
}

function fremove ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node2 "cd `pwd`; rm -Rf test*"
}

function fcp ()
{
  exec 2>&1 /usr/bin/time --format=%E ssh node3 "cd `pwd`; cp -R test test.new"
}

echo -------------------------------------------------
echo "| # files | create #s | copy #s | remove #s |"
echo -------------------------------------------------
for ((x=0; x < ${#iterations[*]} ; x++)) do
  create[$x]="`fcreate ${iterations[$x]}`"
  copy[$x]="`fcp ${iterations[$x]}`"
  remove[$x]="`fremove`"
  printf "| %8d | %9s | %9s | %9s |\n" ${iterations[$x]} ${create[$x]} ${copy[$x]} ${remove[$x]}
done
rm "${script}"
echo "------------------------"

Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: implement delayed dropping of last dquot referenceJan Kara2014-04-031-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We cannot drop the last dquot reference from the downconvert thread, as that creates the following deadlock. NODE 1: holds the dentry lock for 'foo' and the inode lock for GLOBAL_BITMAP_SYSTEM_INODE. NODE 2: dquot_initialize(bar) -> ocfs2_dquot_acquire() -> ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE) ... NODE 1, downconvert thread (triggered from another node or a different process from NODE 2): ocfs2_dentry_post_unlock() ... iput(foo) -> ocfs2_evict_inode(foo) -> ocfs2_clear_inode(foo) -> dquot_drop(inode) ... ocfs2_dquot_release() -> ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE), which blocks. NODE 2: finds we need more space in the quota file ... ocfs2_extend_no_holes() -> ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE), which deadlocks waiting for the downconvert thread. We solve the problem by postponing dropping of the last dquot reference to a workqueue if it happens from the downconvert thread. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_fileDarrick J. Wong2014-04-031-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ocfs2_sync_file grabs i_mutex and forces the current journal transaction to complete. This isn't terribly efficient, since sync_file really only needs to wait for the last transaction involving that inode to complete, and this doesn't require i_mutex. Therefore, implement the necessary bits to track the newest tid associated with an inode, and teach sync_file to wait for that instead of waiting for everything in the journal to commit. Furthermore, only issue the flush request to the drive if jbd2 hasn't already done so. This also eliminates the deadlock between ocfs2_file_aio_write() and ocfs2_sync_file(). aio_write takes i_mutex then calls ocfs2_aiodio_wait() to wait for unaligned dio writes to finish. However, if that dio completion involves calling fsync, then we can get into trouble when some ocfs2_sync_file tries to take i_mutex. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: remove unused variable uuid_net_key in ocfs2_initialize_superjoyce.xue2014-04-031-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | Variable uuid_net_key in ocfs2_initialize_super() is not used. Clean it up. Signed-off-by: joyce.xue <xuejiufei@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Acked-by: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: change ip_unaligned_aio to of type mutex from atomit_tWengang Wang2014-04-031-7/+2
|/ | | | | | | | | | | | | | There is a problem that waitqueue_active() may check stale data and thus miss a wakeup of threads waiting on ip_unaligned_aio. The only valid values of ip_unaligned_aio are 0 and 1, so we can change it to a mutex, which avoids the above problem. Another benefit is that a mutex, which works as FIFO, is fairer than wake_up_all(). Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix NULL pointer dereference when dismount and ocfs2rec simultaneouslyYiwen Jiang2014-01-211-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In a 2-node cluster, say Node A and Node B mount the same ocfs2 volume and create a file 1. Node A: opens 1 and gets the open lock. Node B: rm 1, which adds 1 to orphan_dir. Node A: storage link goes down, o2hb_write_timeout -> o2quo_disk_timeout -> emergency_restart. Node B: at that moment, dismounts and runs ocfs2rec simultaneously: 1) ocfs2_dismount_volume -> ocfs2_recovery_exit -> wait_event(osb->recovery_event) -> flush_workqueue(ocfs2_wq); 2) ocfs2rec -> queue_work(&journal->j_recovery_work) -> ocfs2_recover_orphans -> ocfs2_commit_truncate -> queue_delayed_work(&osb->osb_truncate_log_wq). ocfs2_recovery_exit flushes the workqueue and then releases the system inodes. ocfs2rec then calls ocfs2_flush_truncate_log, which tries to get sys_root_inode, and a NULL pointer dereference occurs. Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: joyce <xuejiufei@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>