summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()Geliang Tang2016-03-101-1/+0
| | | | | | | | BUFFER_TRACE info "call ext4_handle_dirty_metadata" doesn't match the code, so drop it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix misspellings in comments.Adam Buchbinder2016-03-095-5/+5
| | | | | Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount pathOGAWA Hirofumi2016-03-091-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On umount path, jbd2_journal_destroy() writes latest transaction ID (->j_tail_sequence) to be used at next mount. The bug is that ->j_tail_sequence is not holding latest transaction ID in some cases. So, at next mount, there is chance to conflict with remaining (not overwritten yet) transactions. mount (id=10) write transaction (id=11) write transaction (id=12) umount (id=10) <= the bug doesn't write latest ID mount (id=10) write transaction (id=11) crash mount [recovery process] transaction (id=11) transaction (id=12) <= valid transaction ID, but old commit must not replay Like above, this bug become the cause of recovery failure, or FS corruption. So why ->j_tail_sequence doesn't point latest ID? Because if checkpoint transactions was reclaimed by memory pressure (i.e. bdev_try_to_free_page()), then ->j_tail_sequence is not updated. (And another case is, __jbd2_journal_clean_checkpoint_list() is called with empty transaction.) So in above cases, ->j_tail_sequence is not pointing latest transaction ID at umount path. Plus, REQ_FLUSH for checkpoint is not done too. So, to fix this problem with minimum changes, this patch updates ->j_tail_sequence, and issue REQ_FLUSH. (With more complex changes, some optimizations would be possible to avoid unnecessary REQ_FLUSH for example though.) BTW, journal->j_tail_sequence = ++journal->j_transaction_sequence; Increment of ->j_transaction_sequence seems to be unnecessary, but ext3 does this. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: more efficient SEEK_DATA implementationJan Kara2016-03-093-61/+106
| | | | | | | | | | | | | | | | | Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as ext4_seek_data() iterates hole block-by-block. Fix the problem by using returned hole size from ext4_map_blocks() and thus skip the hole in one go. Update also SEEK_HOLE implementation to follow the same pattern as SEEK_DATA to make future maintenance easier. Furthermore we add cond_resched() to both ext4_seek_data() and ext4_seek_hole() to avoid softlockups in case evil user creates huge fragmented file and we have to go through lots of extents. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: cleanup handling of bh->b_state in DAX mmapJan Kara2016-03-091-3/+2
| | | | | | | | | | ext4_dax_mmap_get_block() updates bh->b_state directly instead of using ext4_update_bh_state(). This is mostly a cosmetic issue since DAX code always passes on-stack buffer_head but clean this up to make code more uniform. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: return hole from ext4_map_blocks()Jan Kara2016-03-093-9/+36
| | | | | | | | | | | | | Currently, ext4_map_blocks() just returns 0 when it finds a hole and allocation is not requested. However we have all the information available to tell how large the hole actually is and there are callers of ext4_map_blocks() which would save some block-by-block hole iteration if they knew this information. So fill in struct ext4_map_blocks even for holes with the information we have. We keep returning 0 for holes to maintain backward compatibility of the function. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: factor out determining of hole sizeJan Kara2016-03-091-33/+47
| | | | | | | | | | | ext4_ext_put_gap_in_cache() determines hole size in the extent tree, then trims this with possible delayed allocated blocks, and inserts the result into the extent status tree. Factor out determination of the size of the hole in the extent tree as we will need this information in ext4_ext_map_blocks() as well. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix setting of referenced bit in ext4_es_lookup_extent()Jan Kara2016-03-091-2/+2
| | | | | | | | | | We were setting referenced bit on the extent structure we return from ext4_es_lookup_extent() which is just a private structure on stack. Thus setting had no effect. Set the bit in the structure in the status tree instead. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: remove i_ioend_countJan Kara2016-03-084-14/+1
| | | | | | | Remove counter of pending io ends as it is unused. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: simplify io_end handling for AIO DIOJan Kara2016-03-083-104/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | When mapping blocks for direct IO, we allocate io_end structure before mapping blocks and store pointer to it in the inode. This creates a requirement that any AIO DIO using io_end must be protected by i_mutex. This created problems in the past with dioread_nolock mode which was corrupting io_end pointers. Also io_end is allocated unnecessarily in case where we don't need to convert any extents (which is a common case for example when overwriting file). We fix the problem by allocating io_end only once we return unwritten extent from block mapping function for AIO DIO (so we can save some pointless io_end allocations) and we pass pointer to it in bh->b_private which generic DIO code later passes to our end IO callback. That way we remove any need for global pointer to io_end structure and thus fix the races. The downside of this change is that the checking for unwritten IO in flight in ext4_extents_can_be_merged() is more racy since we now increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after dropping i_data_sem. However the check has been racy already before because ext4_writepages() already increment i_unwritten after dropping i_data_sem and reserved blocks save us from hitting ENOSPC in the worst case. Signed-off-by: Jan Kara <jack@suse.cz>
* ext4: move trans handling and completion deferal out of _ext4_get_blockJan Kara2016-03-081-32/+59
| | | | | | | | | | There is no need to handle starting of a transaction and deferal of DIO completion in _ext4_get_block() function. We can move this out to get block functions for direct IO that need it. That way we can add stricter checks verifying things work as we expect. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: rename and split get blocks functionsJan Kara2016-03-084-47/+74
| | | | | | | | | | Rename ext4_get_blocks_write() to ext4_get_blocks_unwritten() to better describe what it does. Also split out get blocks functions for direct IO. Later we move functionality from _ext4_get_blocks() there. There's no functional change in this patch. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: use i_mutex to serialize unaligned AIO DIOJan Kara2016-03-083-26/+14
| | | | | | | | | | | | | | | | | Currently we've used hashed aio_mutex to serialize unaligned AIO DIO. However the code cleanups that happened after 2011 when the lock was introduced made aio_mutex acquired at almost the same places where we already have exclusion using i_mutex. So just use i_mutex for the exclusion of unaligned AIO DIO. The change moves waiting for pending unwritten extent conversion under i_mutex. That makes special handling of O_APPEND writes unnecessary and also avoids possible livelocking of unaligned AIO DIO with aligned one (nothing was preventing contiguous stream of aligned AIO DIOs to let unaligned AIO DIO wait forever). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: pack ioend structure betterJan Kara2016-03-081-1/+1
| | | | | | | | | On 64-bit architectures we have two 4-byte holes in struct ext4_io_end. Order entries better to avoid this and thus make the structure occupy 64 instead of 72 bytes for 64-bit architectures. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: save some atomic ops in __JI_COMMIT_RUNNING handlingJan Kara2016-02-222-7/+7
| | | | | | | | | | Currently we used atomic bit operations to manipulate __JI_COMMIT_RUNNING bit. However this is unnecessary as i_flags are always written and read under j_list_lock. So just change the operations to standard bit operations. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: unify revoke and tag block checksum handlingJan Kara2016-02-225-66/+25
| | | | | | | | | | Revoke and tag descriptor blocks are just different kinds of descriptor blocks and thus have checksum in the same place. Unify computation and checking of checksums for these. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: factor out common descriptor block initializationJan Kara2016-02-224-19/+16
| | | | | | | | Descriptor block header is initialized in several places. Factor out the common code into jbd2_journal_get_descriptor_buffer(). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* jbd2: remove unnecessary arguments of jbd2_journal_write_revoke_recordsJan Kara2016-02-223-24/+18
| | | | | | | | | jbd2_journal_write_revoke_records() takes journal pointer and write_op, although journal can be obtained from the passed transaction and write_op is always WRITE_SYNC. Remove these superfluous arguments. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: trim unused parameter from convert_initialized_extent()Eric Whitney2016-02-221-2/+2
| | | | | | | The flags parameter is also unused. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache: add reusable flag to cache entriesAndreas Gruenbacher2016-02-224-30/+81
| | | | | | | | | | | | | | | | | | | | | | | | | | | | To reduce amount of damage caused by single bad block, we limit number of inodes sharing an xattr block to 1024. Thus there can be more xattr blocks with the same contents when there are lots of files with the same extended attributes. These xattr blocks naturally result in hash collisions and can form long hash chains and we unnecessarily check each such block only to find out we cannot use it because it is already shared by too many inodes. Add a reusable flag to cache entries which is cleared when a cache entry has reached its maximum refcount. Cache entries which are not marked reusable are skipped by mb_cache_entry_find_{first,next}. This significantly speeds up mbcache when there are many same xattr blocks. For example for xattr-bench with 5 values and each process handling 20000 files, the run for 64 processes is 25x faster with this patch. Even for 8 processes the speedup is almost 3x. We have also verified that for situations where there is only one xattr block of each kind, the patch doesn't have a measurable cost. [JK: Remove handling of setting the same value since it is not needed anymore, check for races in e_reusable setting, improve changelog, add measurements] Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: shortcut setting of xattr to the same valueJan Kara2016-02-221-0/+18
| | | | | | | | | | | | When someone tried to set xattr to the same value (i.e., not changing anything) we did all the work of removing original xattr, possibly breaking references to shared xattr block, inserting new xattr, and merging xattr blocks again. Since this is not so rare operation and it is relatively cheap for us to detect this case, check for this and shortcut xattr setting in that case. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache: get rid of _e_hash_list_headAndreas Gruenbacher2016-02-222-37/+12
| | | | | | | | | Get rid of field _e_hash_list_head in cache entries and add bit field e_referenced instead. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: kill ext4_mballoc_readyAndreas Gruenbacher2016-02-221-10/+4
| | | | | | | | | | This variable, introduced in commit 9c191f70, is unnecessary: it is set once the module has been initialized correctly, and ext4_fill_super cannot run unless the module has been initialized correctly. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache2: rename to mbcacheJan Kara2016-02-2210-204/+204
| | | | | | | | Since old mbcache code is gone, let's rename new code to mbcache since number 2 is now meaningless. This is just a mechanical replacement. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache2: Use referenced bit instead of LRUJan Kara2016-02-222-35/+63
| | | | | | | | | | | | | | | | | | Currently we maintain perfect LRU list by moving entry to the tail of the list when it gets used. However these operations on cache-global list are relatively expensive. In this patch we switch to lazy updates of LRU list. Whenever entry gets used, we set a referenced bit in it. When reclaiming entries, we give referenced entries another round in the LRU. Since the list is not a real LRU anymore, rename it to just 'list'. In my testing this logic gives about 30% boost to workloads with mostly unique xattr blocks (e.g. xattr-bench with 10 files and 10000 unique xattr values). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache2: limit cache sizeJan Kara2016-02-221-5/+45
| | | | | | | | | | | | | | | | So far number of entries in mbcache is limited only by the pressure from the shrinker. Since too many entries degrade the hash table and generally we expect that caching more entries has diminishing returns, limit number of entries the same way as in the old mbcache to 16 * hash table size. Once we exceed the desired maximum number of entries, we schedule a backround work to reclaim entries. If the background work cannot keep up and the number of entries exceeds two times the desired maximum, we reclaim some entries directly when allocating a new entry. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache: remove mbcacheJan Kara2016-02-223-914/+1
| | | | | | | | Both ext2 and ext4 are now converted to mbcache2. Remove the old mbcache code. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext2: convert to mbcache2Jan Kara2016-02-224-100/+92
| | | | | | | | | | | The conversion is generally straightforward. We convert filesystem from a global cache to per-fs one. Similarly to ext4 the tricky part is that xattr block corresponding to found mbcache entry can get freed before we get buffer lock for that block. So we have to check whether the entry is still valid after getting the buffer lock. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: convert to mbcache2Jan Kara2016-02-224-75/+75
| | | | | | | | | | The conversion is generally straightforward. The only tricky part is that xattr block corresponding to found mbcache entry can get freed before we get buffer lock for that block. So we have to check whether the entry is still valid after getting buffer lock. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* mbcache2: reimplement mbcacheJan Kara2016-02-223-1/+410
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Original mbcache was designed to have more features than what ext? filesystems ended up using. It supported entry being in more hashes, it had a home-grown rwlocking of each entry, and one cache could cache entries from multiple filesystems. This genericity also resulted in more complex locking, larger cache entries, and generally more code complexity. This is reimplementation of the mbcache functionality to exactly fit the purpose ext? filesystems use it for. Cache entries are now considerably smaller (7 instead of 13 longs), the code is considerably smaller as well (414 vs 913 lines of code), and IMO also simpler. The new code is also much more lightweight. I have measured the speed using artificial xattr-bench benchmark, which spawns P processes, each process sets xattr for F different files, and the value of xattr is randomly chosen from a pool of V values. Averages of runtimes for 5 runs for various combinations of parameters are below. The first value in each cell is old mbache, the second value is the new mbcache. V=10 F\P 1 2 4 8 16 32 64 10 0.158,0.157 0.208,0.196 0.500,0.277 0.798,0.400 3.258,0.584 13.807,1.047 61.339,2.803 100 0.172,0.167 0.279,0.222 0.520,0.275 0.825,0.341 2.981,0.505 12.022,1.202 44.641,2.943 1000 0.185,0.174 0.297,0.239 0.445,0.283 0.767,0.340 2.329,0.480 6.342,1.198 16.440,3.888 V=100 F\P 1 2 4 8 16 32 64 10 0.162,0.153 0.200,0.186 0.362,0.257 0.671,0.496 1.433,0.943 3.801,1.345 7.938,2.501 100 0.153,0.160 0.221,0.199 0.404,0.264 0.945,0.379 1.556,0.485 3.761,1.156 7.901,2.484 1000 0.215,0.191 0.303,0.246 0.471,0.288 0.960,0.347 1.647,0.479 3.916,1.176 8.058,3.160 V=1000 F\P 1 2 4 8 16 32 64 10 0.151,0.129 0.210,0.163 0.326,0.245 0.685,0.521 1.284,0.859 3.087,2.251 6.451,4.801 100 0.154,0.153 0.211,0.191 0.276,0.282 0.687,0.506 1.202,0.877 3.259,1.954 8.738,2.887 1000 0.145,0.179 0.202,0.222 0.449,0.319 0.899,0.333 1.577,0.524 4.221,1.240 9.782,3.579 V=10000 F\P 1 2 4 8 16 32 64 10 0.161,0.154 0.198,0.190 0.296,0.256 0.662,0.480 1.192,0.818 2.989,2.200 6.362,4.746 100 0.176,0.174 0.236,0.203 0.326,0.255 0.696,0.511 1.183,0.855 4.205,3.444 19.510,17.760 1000 0.199,0.183 0.240,0.227 1.159,1.014 2.286,2.154 6.023,6.039 ---,10.933 ---,36.620 V=100000 F\P 1 2 4 8 16 32 64 10 0.171,0.162 0.204,0.198 0.285,0.230 0.692,0.500 1.225,0.881 2.990,2.243 6.379,4.771 100 0.151,0.171 0.220,0.210 0.295,0.255 0.720,0.518 1.226,0.844 3.423,2.831 19.234,17.544 1000 0.192,0.189 0.249,0.225 1.162,1.043 2.257,2.093 5.853,4.997 ---,10.399 ---,32.198 We see that the new code is faster in pretty much all the cases and starting from 4 processes there are significant gains with the new code resulting in upto 20-times shorter runtimes. Also for large numbers of cached entries all values for the old code could not be measured as the kernel started hitting softlockups and died before the test completed. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: iterate over buffer heads correctly in move_extent_per_page()Eryu Guan2016-02-211-0/+1
| | | | | | | | | | | | In commit bcff24887d00 ("ext4: don't read blocks from disk after extents being swapped") bh is not updated correctly in the for loop and wrong data has been written to disk. generic/324 catches this on sub-page block size ext4. Fixes: bcff24887d00 ("ext4: don't read blocks from disk after extentsbeing swapped") Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
* ext4: make sure to revoke all the freeable blocks in ext4_free_blocksDaeho Jeong2016-02-211-19/+13
| | | | | | | | | | | | | | | Now, ext4_free_blocks() doesn't revoke data blocks of per-file data journalled inode and it can cause file data inconsistency problems. Even though data blocks of per-file data journalled inode are already forgotten by jbd2_journal_invalidatepage() in advance of invoking ext4_free_blocks(), we still need to revoke the data blocks here. Moreover some of the metadata blocks, which are not found by sb_find_get_block(), are still needed to be revoked, but this is also missing here. Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
* ext4: fix crashes in dioread_nolock modeJan Kara2016-02-191-20/+20
| | | | | | | | | | | | | | | | | | Competing overwrite DIO in dioread_nolock mode will just overwrite pointer to io_end in the inode. This may result in data corruption or extent conversion happening from IO completion interrupt because we don't properly set buffer_defer_completion() when unlocked DIO races with locked DIO to unwritten extent. Since unlocked DIO doesn't need io_end for anything, just avoid allocating it and corrupting pointer from inode for locked DIO. A cleaner fix would be to avoid these games with io_end pointer from the inode but that requires more intrusive changes so we leave that for later. Cc: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix bh->b_state corruptionJan Kara2016-02-191-2/+30
| | | | | | | | | | | | | | | | | | | | | | | | ext4 can update bh->b_state non-atomically in _ext4_get_block() and ext4_da_get_block_prep(). Usually this is fine since bh is just a temporary storage for mapping information on stack but in some cases it can be fully living bh attached to a page. In such case non-atomic update of bh->b_state can race with an atomic update which then gets lost. Usually when we are mapping bh and thus updating bh->b_state non-atomically, nobody else touches the bh and so things work out fine but there is one case to especially worry about: ext4_finish_bio() uses BH_Uptodate_Lock on the first bh in the page to synchronize handling of PageWriteback state. So when blocksize < pagesize, we can be atomically modifying bh->b_state of a buffer that actually isn't under IO and thus can race e.g. with delalloc trying to map that buffer. The result is that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock. Fix the problem by always updating bh->b_state bits atomically. CC: stable@vger.kernel.org Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix memleak in ext4_readdir()Kirill Tkhai2016-02-161-2/+5
| | | | | | | | | When ext4_bread() fails, fname_crypto_str remains allocated after return. Fix that. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> CC: Dmitry Monakhov <dmonakhov@virtuozzo.com>
* ext4: remove unused parameter "newblock" in convert_initialized_extent()Eryu Guan2016-02-121-2/+2
| | | | | | | | The "newblock" parameter is not used in convert_initialized_extent(), remove it. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: don't read blocks from disk after extents being swappedEryu Guan2016-02-121-3/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I notice ext4/307 fails occasionally on ppc64 host, reporting md5 checksum mismatch after moving data from original file to donor file. The reason is that move_extent_per_page() calls __block_write_begin() and block_commit_write() to write saved data from original inode blocks to donor inode blocks, but __block_write_begin() not only maps buffer heads but also reads block content from disk if the size is not block size aligned. At this time the physical block number in mapped buffer head is pointing to the donor file not the original file, and that results in reading wrong data to page, which get written to disk in following block_commit_write call. This also can be reproduced by the following script on 1k block size ext4 on x86_64 host: mnt=/mnt/ext4 donorfile=$mnt/donor testfile=$mnt/testfile e4compact=~/xfstests/src/e4compact rm -f $donorfile $testfile # reserve space for donor file, written by 0xaa and sync to disk to # avoid EBUSY on EXT4_IOC_MOVE_EXT xfs_io -fc "pwrite -S 0xaa 0 1m" -c "fsync" $donorfile # create test file written by 0xbb xfs_io -fc "pwrite -S 0xbb 0 1023" -c "fsync" $testfile # compute initial md5sum md5sum $testfile | tee md5sum.txt # drop cache, force e4compact to read data from disk echo 3 > /proc/sys/vm/drop_caches # test defrag echo "$testfile" | $e4compact -i -v -f $donorfile # check md5sum md5sum -c md5sum.txt Fix it by creating & mapping buffer heads only but not reading blocks from disk, because all the data in page is guaranteed to be up-to-date in mext_page_mkuptodate(). Cc: stable@vger.kernel.org Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix potential integer overflowInsu Yun2016-02-121-1/+1
| | | | | | | | | | Since sizeof(ext_new_group_data) > sizeof(ext_new_flex_group_data), integer overflow could be happened. Therefore, need to fix integer overflow sanitization. Cc: stable@vger.kernel.org Signed-off-by: Insu Yun <wuninsu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: add a line break for proc mb_groups displayHuaitong Han2016-02-121-1/+1
| | | | | | | | This patch adds a line break for proc mb_groups display. Signed-off-by: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
* ext4: ioctl: fix erroneous return valueAnton Protopopov2016-02-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ext4_ioctl_setflags() function which is used in the ioctls EXT4_IOC_SETFLAGS and EXT4_IOC_FSSETXATTR may return the positive value EPERM instead of -EPERM in case of error. This bug was introduced by a recent commit 9b7365fc. The following program can be used to illustrate the wrong behavior: #include <sys/types.h> #include <sys/ioctl.h> #include <sys/stat.h> #include <fcntl.h> #include <err.h> #define FS_IOC_GETFLAGS _IOR('f', 1, long) #define FS_IOC_SETFLAGS _IOW('f', 2, long) #define FS_IMMUTABLE_FL 0x00000010 int main(void) { int fd; long flags; fd = open("file", O_RDWR|O_CREAT, 0600); if (fd < 0) err(1, "open"); if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) err(1, "ioctl: FS_IOC_GETFLAGS"); flags |= FS_IMMUTABLE_FL; if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) err(1, "ioctl: FS_IOC_SETFLAGS"); warnx("ioctl returned no error"); return 0; } Running it gives the following result: $ strace -e ioctl ./test ioctl(3, FS_IOC_GETFLAGS, 0x7ffdbd8bfd38) = 0 ioctl(3, FS_IOC_SETFLAGS, 0x7ffdbd8bfd38) = 1 test: ioctl returned no error +++ exited with 0 +++ Running the program on a kernel with the bug fixed gives the proper result: $ strace -e ioctl ./test ioctl(3, FS_IOC_GETFLAGS, 0x7ffdd2768258) = 0 ioctl(3, FS_IOC_SETFLAGS, 0x7ffdd2768258) = -1 EPERM (Operation not permitted) test: ioctl: FS_IOC_SETFLAGS: Operation not permitted +++ exited with 1 +++ Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4: fix scheduling in atomic on group checksum failureJan Kara2016-02-112-5/+8
| | | | | | | | | | | | | When block group checksum is wrong, we call ext4_error() while holding group spinlock from ext4_init_block_bitmap() or ext4_init_inode_bitmap() which results in scheduling while in atomic. Fix the issue by calling ext4_error() later after dropping the spinlock. CC: stable@vger.kernel.org Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
* ext4 crypto: move context consistency check to ext4_file_open()Theodore Ts'o2016-02-082-2/+15
| | | | | | | | | | In the case where the per-file key for the directory is cached, but root does not have access to the key needed to derive the per-file key for the files in the directory, we allow the lookup to succeed, so that lstat(2) and unlink(2) can suceed. However, if a program tries to open the file, it will get an ENOKEY error. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* ext4 crypto: revalidate dentry after adding or removing the keyTheodore Ts'o2016-02-074-0/+81
| | | | | | | | | | | Add a validation check for dentries for encrypted directory to make sure we're not caching stale data after a key has been added or removed. Also check to make sure that status of the encryption key is updated when readdir(2) is executed. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* Linux 4.5-rc2v4.5-rc2Linus Torvalds2016-01-311-1/+1
|
* Merge tag 'usb-4.5-rc2' of ↵Linus Torvalds2016-01-319-7/+68
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB driver fixes from Greg KH: "Here are some small USB fixes and new device ids for 4.5-rc2. Nothing major here, full details are in the shortlog, and all of these have been in linux-next successfully" * tag 'usb-4.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: USB: option: fix Cinterion AHxx enumeration USB: mxu11x0: fix memory leak on usb_serial private data USB: serial: ftdi_sio: add support for Yaesu SCU-18 cable USB: serial: option: Adding support for Telit LE922 USB: serial: visor: fix crash on detecting device without write_urbs USB: visor: fix null-deref at probe USB: cp210x: add ID for IAI USB to RS485 adaptor usb: hub: do not clear BOS field during reset device cdc-acm:exclude Samsung phone 04e8:685d usb: cdc-acm: send zero packet for intel 7260 modem usb: cdc-acm: handle unlinked urb in acm read callback
| * Merge tag 'usb-serial-4.5-rc2' of ↵Greg Kroah-Hartman2016-01-286-3/+49
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/johan/usb-serial into usb-linus Johan writes: USB-serial fixes for v4.5-rc2 Here are two fixes of crashes in the visor driver that could be triggered using bad (malicious) descriptors, a fix for two memory leaks in the new mxu11x0 driver, and an interface-blacklist fix for the option driver. Included are also some new device ids. Signed-off-by: Johan Hovold <johan@kernel.org>
| | * USB: option: fix Cinterion AHxx enumerationJohn Ernberg2016-01-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In certain kernel configurations where the cdc_ether and option drivers are compiled as modules there can occur a race condition in enumeration. This causes the option driver to enumerate the ethernet(wwan) interface as usb-serial interfaces. usb-devices output for the modem: T: Bus=01 Lev=01 Prnt=01 Port=00 Cnt=01 Dev#= 5 Spd=480 MxCh= 0 D: Ver= 2.00 Cls=ef(misc ) Sub=02 Prot=01 MxPS=64 #Cfgs= 1 P: Vendor=1e2d ProdID=0055 Rev=00.00 S: Manufacturer=Cinterion S: Product=AHx C: #Ifs= 6 Cfg#= 1 Atr=e0 MxPwr=10mA I: If#= 0 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option I: If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option I: If#= 2 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=ff Driver=option I: If#= 3 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=ff Driver=option I: If#= 4 Alt= 0 #EPs= 1 Cls=02(commc) Sub=06 Prot=00 Driver=cdc_ether I: If#= 5 Alt= 1 #EPs= 2 Cls=0a(data ) Sub=00 Prot=00 Driver=cdc_ether Signed-off-by: John Ernberg <john.ernberg@actia.se> Fixes: 1941138e1c02 ("USB: added support for Cinterion's products...") Cc: stable <stable@vger.kernel.org> # v3.9: 8ff10bdb14a52 Signed-off-by: Johan Hovold <johan@kernel.org>
| | * USB: mxu11x0: fix memory leak on usb_serial private dataMathieu OTHACEHE2016-01-251-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | On nominal execution, private data allocated on port_probe and attach are never freed. Add port_remove and release callbacks to free them respectively. Signed-off-by: Mathieu OTHACEHE <m.othacehe@gmail.com> Signed-off-by: Johan Hovold <johan@kernel.org>
| | * USB: serial: ftdi_sio: add support for Yaesu SCU-18 cableGreg Kroah-Hartman2016-01-252-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Harald Linden reports that the ftdi_sio driver works properly for the Yaesu SCU-18 cable if the device ids are added to the driver. So let's add them. Reported-by: Harald Linden <harald.linden@7183.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Johan Hovold <johan@kernel.org>
| | * USB: serial: option: Adding support for Telit LE922Daniele Palmas2016-01-251-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | This patch adds support for two PIDs of LE922. Signed-off-by: Daniele Palmas <dnlplm@gmail.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Johan Hovold <johan@kernel.org>
OpenPOWER on IntegriCloud