summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* crypto: algif_skcipher - Fixed overflow when sndbuf is page alignedHerbert Xu2010-11-301-21/+11
| | | | | | | | | | | | | | | | | | When sk_sndbuf is not a multiple of PAGE_SIZE, the limit tests in sendmsg fail as the limit variable becomes negative and we're using an unsigned comparison. The same thing can happen if sk_sndbuf is lowered after a sendmsg call. This patch fixes this by always taking the signed maximum of limit and 0 before we perform the comparison. It also rounds the value of sk_sndbuf down to a multiple of PAGE_SIZE so that we don't end up allocating a page only to use a small number of bytes in it because we're bound by sk_sndbuf. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: af_alg - Add dependency on NETHerbert Xu2010-11-291-0/+2
| | | | | | | | | | | | Add missing dependency on NET since we require sockets for our interface. Should really be a select but kconfig doesn't like that: net/Kconfig:6:error: found recursive dependency: NET -> NETWORK_FILESYSTEMS -> AFS_FS -> AF_RXRPC -> CRYPTO -> CRYPTO_USER_API_HASH -> CRYPTO_USER_API -> NET Reported-by: Zimny Lech <napohybelskurwysynom2010@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: aesni-intel - Fixed build error on x86-32Mathias Krause2010-11-292-14/+17
| | | | | | | | | | Exclude AES-GCM code for x86-32 due to heavy usage of 64-bit registers not available on x86-32. While at it, fixed unregister order in aesni_exit(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: algif_skcipher - Pass on error from af_alg_make_sgHerbert Xu2010-11-281-1/+2
| | | | | | | The error returned from af_alg_make_sg is currently lost and we always pass on -EINVAL. This patch pases on the underlying error. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - zero-copy scatterlist handlingDmitry Kasatkin2010-11-271-26/+61
| | | | | | | | | | | | | If scatterlist have more than one entry, current driver uses aligned buffer to copy data to to accelerator to tackle possible issues with DMA and SHA buffer alignment. This commit adds more intelligence to verify SG alignment and possibility to use DMA directly on the data without using copy buffer. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - FLAGS_FIRST is redundant and removedDmitry Kasatkin2010-11-271-7/+1
| | | | | | | | bufcnt is 0 if it was no update requests before, which is exact meaning of FLAGS_FIRST. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - hash-in-progress is stored in hw formatDmitry Kasatkin2010-11-271-14/+24
| | | | | | | | | Hash-in-progress is now stored in hw format. Only on final call, hash is converted to correct format. Speedup copy procedure and will allow to use OMAP burst mode. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - crypto_ahash_final() now not need to be called.Dmitry Kasatkin2010-11-271-86/+82
| | | | | | | | | | | | | | | | | | | | | According to the Herbert Xu, client may not always call crypto_ahash_final(). In the case of error in hash calculation resources will be automatically cleaned up. But if no hash calculation error happens and client will not call crypto_ahash_final() at all, then internal buffer will not be freed, and clocks will not be disabled. This patch provides support for atomic crypto_ahash_update() call. Clocks are now enabled and disabled per update request. Data buffer is now allocated as a part of request context. Client is obligated to free it with crypto_free_ahash(). Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - removed redundunt lockingDmitry Kasatkin2010-11-271-26/+21
| | | | | | | | Locking for queuing and dequeuing is combined. test_and_set_bit() is also replaced with checking under dd->lock. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - error handling improvedDmitry Kasatkin2010-11-271-23/+44
| | | | | | | | | | | | | Introduces DMA error handling. DMA error is returned as a result code of the hash request. Clients needs to handle error codes and may repeat hash calculation attempt. Also in the case of DMA error, SHAM module is set to be re-initialized again. It significantly improves stability against possible HW failures. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - DMA initialization fixes for off modeDmitry Kasatkin2010-11-271-9/+10
| | | | | | | | | | | | DMA parameters for constant data were initialized during driver probe(). It seems that those settings sometimes are lost when devices goes to off mode. This patch makes DMA initialization just before use. It solves off mode problems. Fixes: NB#202786 - Aegis & SHA1 block off mode changes Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: omap-sham - uses digest buffer in request contextDmitry Kasatkin2010-11-271-3/+8
| | | | | | | | | | Currently driver storred digest results in req->results provided by the client. But some clients do not set it until final() call. It leads to crash. Changed to use internal buffer to store temporary digest results. Signed-off-by: Dmitry Kasatkin <dmitry.kasatkin@nokia.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: aesni-intel - Ported implementation to x86-32Mathias Krause2010-11-273-40/+191
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The AES-NI instructions are also available in legacy mode so the 32-bit architecture may profit from those, too. To illustrate the performance gain here's a short summary of a dm-crypt speed test on a Core i7 M620 running at 2.67GHz comparing both assembler implementations: x86: i568 aes-ni delta ECB, 256 bit: 93.8 MB/s 123.3 MB/s +31.4% CBC, 256 bit: 84.8 MB/s 262.3 MB/s +209.3% LRW, 256 bit: 108.6 MB/s 222.1 MB/s +104.5% XTS, 256 bit: 105.0 MB/s 205.5 MB/s +95.7% Additionally, due to some minor optimizations, the 64-bit version also got a minor performance gain as seen below: x86-64: old impl. new impl. delta ECB, 256 bit: 121.1 MB/s 123.0 MB/s +1.5% CBC, 256 bit: 285.3 MB/s 290.8 MB/s +1.9% LRW, 256 bit: 263.7 MB/s 265.3 MB/s +0.6% XTS, 256 bit: 251.1 MB/s 255.3 MB/s +1.7% Signed-off-by: Mathias Krause <minipli@googlemail.com> Reviewed-by: Huang Ying <ying.huang@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: Makefile clean upTracey Dent2010-11-271-7/+7
| | | | | | | Changed Makefile to use <modules>-y instead of <modules>-objs. Signed-off-by: Tracey Dent <tdent48227@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: Use vzallocJoe Perches2010-11-272-4/+2
| | | | | Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: algif_skcipher - User-space interface for skcipher operationsHerbert Xu2010-11-263-0/+649
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the af_alg plugin for symmetric key ciphers, corresponding to the ablkcipher kernel operation type. Keys can optionally be set through the setsockopt interface. Once a sendmsg call occurs without MSG_MORE no further writes may be made to the socket until all previous data has been read. IVs and and whether encryption/decryption is performed can be set through the setsockopt interface or as a control message to sendmsg. The interface is completely synchronous, all operations are carried out in recvmsg(2) and will complete prior to the system call returning. The splice(2) interface support reading the user-space data directly without copying (except that the Crypto API itself may copy the data if alignment is off). The recvmsg(2) interface supports directly writing to user-space without additional copying, i.e., the kernel crypto interface will receive the user-space address as its output SG list. Thakns to Miloslav Trmac for reviewing this and contributing fixes and improvements. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net>
* crypto: algif_hash - User-space interface for hash operationsHerbert Xu2010-11-193-0/+328
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the af_alg plugin for hash, corresponding to the ahash kernel operation type. Keys can optionally be set through the setsockopt interface. Each sendmsg call will finalise the hash unless sent with a MSG_MORE flag. Partial hash states can be cloned using accept(2). The interface is completely synchronous, all operations will complete prior to the system call returning. Both sendmsg(2) and splice(2) support reading the user-space data directly without copying (except that the Crypto API itself may copy the data if alignment is off). For now only the splice(2) interface supports performing digest instead of init/update/final. In future the sendmsg(2) interface will also be modified to use digest/finup where possible so that hardware that cannot return a partial hash state can still benefit from this interface. Thakns to Miloslav Trmac for reviewing this and contributing fixes and improvements. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Martin Willi <martin@strongswan.org>
* crypto: af_alg - User-space interface for Crypto APIHerbert Xu2010-11-195-0/+618
| | | | | | | | | | | | | | | | | | | | | | | | | This patch creates the backbone of the user-space interface for the Crypto API, through a new socket family AF_ALG. Each session corresponds to one or more connections obtained from that socket. The number depends on the number of inputs/outputs of that particular type of operation. For most types there will be a s ingle connection/file descriptor that is used for both input and output. AEAD is one of the few that require two inputs. Each algorithm type will provide its own implementation that plugs into af_alg. They're keyed using a string such as "skcipher" or "hash". IOW this patch only contains the boring bits that is required to hold everything together. Thakns to Miloslav Trmac for reviewing this and contributing fixes and improvements. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net> Tested-by: Martin Willi <martin@strongswan.org>
* net - Add AF_ALG macrosHerbert Xu2010-11-191-1/+4
| | | | | | | | | This patch adds the socket family/level macros for the yet-to-be-born AF_ALG family. The AF_ALG family provides the user-space interface for the kernel crypto API. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net>
* crypto: rfc4106 - Extending the RC4106 AES-GCM test vectorsAdrian Hoban2010-11-133-0/+396
| | | | | | | | | | | | Updated RFC4106 AES-GCM testing. Some test vectors were taken from http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ gcm/gcm-test-vectors.tar.gz Signed-off-by: Adrian Hoban <adrian.hoban@intel.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: aesni-intel - RFC4106 AES-GCM Driver Using Intel New InstructionsTadeusz Struk2010-11-132-2/+1708
| | | | | | | | | | | | | | | | | This patch adds an optimized RFC4106 AES-GCM implementation for 64-bit kernels. It supports 128-bit AES key size. This leverages the crypto AEAD interface type to facilitate a combined AES & GCM operation to be implemented in assembly code. The assembly code leverages Intel(R) AES New Instructions and the PCLMULQDQ instruction. Signed-off-by: Adrian Hoban <adrian.hoban@intel.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@intel.com> Signed-off-by: Aidan O'Mahony <aidan.o.mahony@intel.com> Signed-off-by: Erdinc Ozturk <erdinc.ozturk@intel.com> Signed-off-by: James Guilford <james.guilford@intel.com> Signed-off-by: Wajdi Feghali <wajdi.k.feghali@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: cast5 - simplify if-statementsNicolas Kaiser2010-11-131-50/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed that by factoring out common rounds from the branches of the if-statements in the encryption and decryption functions, the executable file size goes down significantly, for crypto/cast5.ko from 26688 bytes to 24336 bytes (amd64). On my test system, I saw a slight speedup. This is the first time I'm doing such a benchmark - I found a similar one on the crypto mailing list, and I hope I did it right? Before: # cryptsetup create dm-test /dev/hda2 -c cast5-cbc-plain -s 128 Passsatz eingeben: # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,43484 s, 21,5 MB/s # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,4089 s, 21,8 MB/s # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,41091 s, 21,7 MB/s After: # cryptsetup create dm-test /dev/hda2 -c cast5-cbc-plain -s 128 Passsatz eingeben: # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,38128 s, 22,0 MB/s # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,29486 s, 22,8 MB/s # dd if=/dev/zero of=/dev/mapper/dm-test bs=1M count=50 52428800 Bytes (52 MB) kopiert, 2,37162 s, 22,1 MB/s Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* crypto: hash - Fix async import on shash algorithmHerbert Xu2010-11-041-1/+7
| | | | | | | | | | The function shash_async_import did not initialise the descriptor correctly prior to calling the underlying shash import function. This patch adds the required initialisation. Reported-by: Miloslav Trmac <mitr@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* Merge branch 'upstream-merge' of ↵Linus Torvalds2010-10-2733-1131/+2510
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'upstream-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (50 commits) ext4,jbd2: convert tracepoints to use major/minor numbers ext4: optimize orphan_list handling for ext4_setattr ext4: fix unbalanced mutex unlock in error path of ext4_li_request_new ext4: fix compile error in ext4_fallocate() ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them static ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end() ext4: move flush_completed_IO to fs/ext4/fsync.c and make it static ext4: rename {ext,idx}_pblock and inline small extent functions ext4: make various ext4 functions be static ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*() ext4: fix kernel oops if the journal superblock has a non-zero j_errno ext4: update writeback_index based on last page scanned ext4: implement writeback livelock avoidance using page tagging ext4: tidy up a void argument in inode.c ext4: add batched_discard into ext4 feature list ext4: Add batched discard support for ext4 fs: Add FITRIM ioctl ext4: Use return value from sb_issue_discard() ext4: Check return value of sb_getblk() and friends ext4: use bio layer instead of buffer layer in mpage_da_submit_io ...
| * Merge branch 'next' into upstream-mergeTheodore Ts'o2010-10-2733-1131/+2510
| |\ | | | | | | | | | | | | | | | | | | Conflicts: fs/ext4/inode.c fs/ext4/mballoc.c include/trace/events/ext4.h
| | * ext4,jbd2: convert tracepoints to use major/minor numbersTheodore Ts'o2010-10-272-156/+251
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unfortunately perf can't deal with anything other than direct structure accesses in the TP_printk() section. It will drop dead when it sees jbd2_dev_to_name() in the "print fmt" section of the tracepoint. Addresses-Google-Bug: 3138508 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: optimize orphan_list handling for ext4_setattrDmitry Monakhov2010-10-271-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Surprisingly chown() on ext4 is not SMP scalable operation. Due to unconditional orphan_del(NULL, inode) in ext4_setattr() result in significant performance overhead because of global orphan mutex, especially in no-journal mode (where orphan_add() is noop). It is possible to skip explicit orphan_del if possible. Results of fchown() micro-benchmark in no-journal mode while (1) { iteration++; fchown(fd, uid, gid); fchown(fd, uid + 1, gid + 1) } measured: iterations per millisecond | nr_tasks | w/o patch | with patch | | 1 | 142 | 185 | | 4 | 109 | 642 | Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: fix unbalanced mutex unlock in error path of ext4_li_request_newNicolas Kaiser2010-10-271-14/+8
| | | | | | | | | | | | | | | Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: fix compile error in ext4_fallocate()Kazuya Mio2010-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I compiled 2.6.36-rc3 kernel with EXT4FS_DEBUG definition, I got the following compile error. CC [M] fs/ext4/extents.o fs/ext4/extents.c: In function 'ext4_fallocate': fs/ext4/extents.c:3772: error: 'block' undeclared (first use in this function) fs/ext4/extents.c:3772: error: (Each undeclared identifier is reported only once fs/ext4/extents.c:3772: error: for each function it appears in.) make[2]: *** [fs/ext4/extents.o] Error 1 The patch fixes this problem. Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: move ext4_mb_{get,put}_buddy_cache_lock and make them staticEric Sandeen2010-10-272-81/+79
| | | | | | | | | | | | | | | | | | | | | | | | These functions are only used within fs/ext4/mballoc.c, so move them so they are used after they are defined, and then make them be static. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: rename mark_bitmap_end() to ext4_mark_bitmap_end()Theodore Ts'o2010-10-274-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | Fix a namespace leak from fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: move flush_completed_IO to fs/ext4/fsync.c and make it staticTheodore Ts'o2010-10-273-84/+83
| | | | | | | | | | | | | | | | | | | | | Fix a namespace leak by moving the function to the file where it is used and making it static. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: rename {ext,idx}_pblock and inline small extent functionsTheodore Ts'o2010-10-274-118/+121
| | | | | | | | | | | | | | | | | | | | | | | | | | | Cleanup namespace leaks from fs/ext4 and the inline trivial functions ext4_{ext,idx}_pblock() and ext4_{ext,idx}_store_pblock() since the code size actually shrinks when we make these functions inline, they're so trivial. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: make various ext4 functions be staticTheodore Ts'o2010-10-277-44/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions have no need to be exported beyond file context. No functions needed to be moved for this commit; just some function declarations changed to be static and removed from header files. (A similar patch was submitted by Eric Sandeen, but I wanted to handle code movement in separate patches to make sure code changes didn't accidentally get dropped.) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: rename {exit,init}_ext4_*() to ext4_{exit,init}_*()Theodore Ts'o2010-10-277-34/+34
| | | | | | | | | | | | | | | | | | This is a cleanup to avoid namespace leaks out of fs/ext4 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: fix kernel oops if the journal superblock has a non-zero j_errnoTheodore Ts'o2010-10-272-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 84061e0 fixed an accounting bug only to introduce the possibility of a kernel OOPS if the journal has a non-zero j_errno field indicating that the file system had detected a fs inconsistency. After the journal replay, if the journal superblock indicates that the file system has an error, this indication is transfered to the file system and then ext4_commit_super() is called to write this to the disk. But since the percpu counters are now initialized after the journal replay, the call to ext4_commit_super() will cause a kernel oops since it needs to use the percpu counters the ext4 superblock structure. The fix is to skip setting the ext4 free block and free inode fields if the percpu counter has not been set. Thanks to Ken Sumrall for reporting and analyzing the root causes of this bug. Addresses-Google-Bug: #3054080 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: update writeback_index based on last page scannedEric Sandeen2010-10-271-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As pointed out in a prior patch, updating the mapping's writeback_index based on pages written isn't quite right; what the writeback index is really supposed to reflect is the next page which should be scanned for writeback during periodic flush. As in write_cache_pages(), write_cache_pages_da() does this scanning for us as we assemble the mpd for later writeout. If we keep track of the next page after the current scan, we can easily update writeback_index without worrying about pages written vs. pages skipped, etc. Without this, an fsync will reset writeback_index to 0 (its starting index) + however many pages it wrote, which can mess up the progress of periodic flush. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: implement writeback livelock avoidance using page taggingEric Sandeen2010-10-272-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is analogous to Jan Kara's commit, f446daaea9d4a420d16c606f755f3689dcb2d0ce mm: implement writeback livelock avoidance using page tagging but since we forked write_cache_pages, we need to reimplement it there (and in ext4_da_writepages, since range_cyclic handling was moved to there) If you start a large buffered IO to a file, and then set fsync after it, you'll find that fsync does not complete until the other IO stops. If you continue re-dirtying the file (say, putting dd with conv=notrunc in a loop), when fsync finally completes (after all IO is done), it reports via tracing that it has written many more pages than the file contains; in other words it has synced and re-synced pages in the file multiple times. This then leads to problems with our writeback_index update, since it advances it by pages written, and essentially sets writeback_index off the end of the file... With the following patch, we only sync as much as was dirty at the time of the sync. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: tidy up a void argument in inode.cEric Sandeen2010-10-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This doesn't fix anything at all, it just removes a vestige of prior use from __mpage_da_writepage() __mpage_da_writepage() had a *void argument leftover from its previous life as a callback; make it reflect the actual type. Fixing this up makes it slightly more obvious to read, and enables proper typechecking. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: add batched_discard into ext4 feature listLukas Czerner2010-10-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Should be applied on the top of "lazy inode table initialization" and "batched discard support" patch-sets. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: Add batched discard support for ext4Lukas Czerner2010-10-273-0/+188
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Walk through allocation groups and trim all free extents. It can be invoked through FITRIM ioctl on the file system. The main idea is to provide a way to trim the whole file system if needed, since some SSD's may suffer from performance loss after the whole device was filled (it does not mean that fs is full!). It search for free extents in allocation groups specified by Byte range start -> start+len. When the free extent is within this range, blocks are marked as used and then trimmed. Afterwards these blocks are marked as free in per-group bitmap. Since fstrim is a long operation it is good to have an ability to interrupt it by a signal. This was added by Dmitry Monakhov. Thanks Dimitry. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * fs: Add FITRIM ioctlLukas Czerner2010-10-272-0/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds an filesystem independent ioctl to allow implementation of file system batched discard support. I takes fstrim_range structure as an argument. fstrim_range is definec in the include/fs.h and its definition is as follows. struct fstrim_range { start; len; minlen; } start - first Byte to trim len - number of Bytes to trim from start minlen - minimum extent length to trim, free extents shorter than this number of Bytes will be ignored. This will be rounded up to fs block size. It is also possible to specify NULL as an argument. In this case the arguments will set itself as follows: start = 0; len = ULLONG_MAX; minlen = 0; So it will trim the whole file system at one run. After the FITRIM is done, the number of actually discarded Bytes is stored in fstrim_range.len to give the user better insight on how much storage space has been really released for wear-leveling. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: Use return value from sb_issue_discard()Lukas Czerner2010-10-271-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use return value from sb_issue_discard() as return value in ext4_issue_discard(). Since sb_issue_discard() may result in more serious errors than just -EOPNOTSUPP it is worth to inform user of this function about them to handle error cases properly. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: Check return value of sb_getblk() and friendsNamhyung Kim2010-10-272-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fail block allocation if sb_getblk() returns NULL. In that case, sb_find_get_block() also likely to fail so that it should skip calling ext4_forget(). Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: use bio layer instead of buffer layer in mpage_da_submit_ioTheodore Ts'o2010-10-276-109/+489
| | | | | | | | | | | | | | | | | | | | | | | | | | | Call the block I/O layer directly instad of going through the buffer layer. This should give us much better performance and scalability, as well as lowering our CPU utilization when doing buffered writeback. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: move mpage_put_bnr_to_bhs()'s functionality to mpage_da_submit_io()Theodore Ts'o2010-10-271-101/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This massively simplifies the ext4_da_writepages() code path by completely removing mpage_put_bnr_bhs(), which is almost 100 lines of code iterating over a set of pages using pagevec_lookup(), and folds that functionality into mpage_da_submit_io()'s existing pagevec_lookup() loop. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: inline walk_page_buffers() into mpage_da_submit_ioTheodore Ts'o2010-10-271-11/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Expand the call: if (walk_page_buffers(NULL, page_bufs, 0, len, NULL, ext4_bh_delay_or_unwritten)) goto redirty_page into mpage_da_submit_io(). This will allow us to merge in mpage_put_bnr_to_bhs() in the next patch. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: inline ext4_writepage() into mpage_da_submit_io()Theodore Ts'o2010-10-271-11/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As a prepratory step to switching to bio_submit, inline ext4_writepage() into mpage_da_submit() and then simplify things a bit. This makes it clearer what mpage_da_submit needs to do. Also, move the ClearPageChecked(page) call into __ext4_journalled_writepage(), as a minor bit of cleanup refactoring. This also allows us to pull i_size_read() and ext4_should_journal_data() out of the loop, which should be a very minor CPU savings. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: simplify ext4_writepage()Theodore Ts'o2010-10-271-48/+25
| | | | | | | | | | | | | | | | | | | | | | | | The actual code in ext4_writepage() is unnecessarily convoluted. Simplify it so it is easier to understand, but otherwise logically equivalent. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
| | * ext4: call mpage_da_submit_io() from mpage_da_map_blocks()Theodore Ts'o2010-10-271-33/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Eventually we need to completely reorganize the ext4 writepage callpath, but for now, we simplify things a little by calling mpage_da_submit_io() from mpage_da_map_blocks(), since all of the places where we call mpage_da_map_blocks() it is followed up by a call to mpage_da_submit_io(). We're also a wee bit better with respect to error handling, but there are still a number of issues where it's not clear what the right thing is to do with ext4 functions deep in the writeback codepath fails. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
OpenPOWER on IntegriCloud