summaryrefslogtreecommitdiffstats
path: root/block.c
Commit message (Collapse)AuthorAgeFilesLines
* Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into stagingPeter Maydell2015-06-151-40/+197
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block layer core and image format patches # gpg: Signature made Fri Jun 12 16:08:53 2015 BST using RSA key ID C88F2FD6 # gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" * remotes/kevin/tags/for-upstream: (25 commits) block: Fix reopen flag inheritance block: Add BlockDriverState.inherits_from block: Add list of children to BlockDriverState queue.h: Add QLIST_FIX_HEAD_PTR() block: Drain requests before swapping nodes in bdrv_swap() block: Move flag inheritance to bdrv_open_inherit() block: Use QemuOpts in bdrv_open_common() block: Use macro for cache option names vmdk: Use bdrv_open_image() quorum: Use bdrv_open_image() check-qdict: Test cases for new functions qdict: Add qdict_{set,copy}_default() qdict: Add qdict_array_entries() iotests: Add tests for overriding BDRV_O_PROTOCOL block: driver should override flags in bdrv_open() block: Change bitmap truncate conditional to assertion block: record new size in bdrv_dirty_bitmap_truncate raw-posix: Fix .bdrv_co_get_block_status() for unaligned image size vmdk: Use vmdk_find_index_in_cluster everywhere vmdk: Fix index_in_cluster calculation in vmdk_co_get_block_status ... Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
| * block: Fix reopen flag inheritanceKevin Wolf2015-06-121-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reopening an image, the block layer already takes care to reopen bs->file as well with recalculated inherited flags. The same must happen for any other child (most notably missing before this patch: backing files). If bs->file (or any other child) didn't originally inherit from bs, e.g. because it was created separately and then only referenced, it must not inherit flags on reopen either, so check the inherited_from field before propagation the reopen down. VMDK already reopened its extents manually; this code can now be dropped. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
| * block: Add BlockDriverState.inherits_fromKevin Wolf2015-06-121-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the block layer assumes that any block node can have only one parent, and if it has a parent, that it inherits some options/flags from this parent. This is not true any more: With references used in block device creation, a single node can be used by multiple parents, or it can be created separately and not inherit flags from any parent. To handle reopens correctly, a node must know from which parent it inherited options. This patch adds the information to BlockDriverState. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
| * block: Add list of children to BlockDriverStateKevin Wolf2015-06-121-0/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows iterating over all children of a given BDS, not only including bs->file and bs->backing_hd, but also driver-specific ones like VMDK extents or Quorum children. For bdrv_swap(), the list of children of the swapped BDS stays at that BDS (because that's where the pointers stay as well). The list head moves and pointers to it must be fixed up therefore. The list of children in the parent of the swapped BDS is not affected by the swap. The contents of the BDS objects is swapped, so the existing pointer in the parent automatically points to the newly swapped in BDS. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
| * block: Drain requests before swapping nodes in bdrv_swap()Kevin Wolf2015-06-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bdrv_swap() requires that there are no requests in flight on either of the two devices. The request coroutine would work on the wrong BlockDriverState object (with bs->opaque even being interpreted as a different type potentially) and all sorts of bad things would result from this. The currently existing callers mostly ensure that there is no I/O pending on nodes that are swapped. In detail, this is: 1. Live snapshots. This goes through qmp_transaction(), which calls bdrv_drain_all() before doing anything. The command is executed synchronously, so no new I/O can be issued concurrently. 2. snapshot=on in bdrv_open(). We're in the middle of opening the image (both the original image and its temporary overlay), so there can't be any I/O in flight yet. 3. Mirroring. bdrv_drain() is already used on the source device so that the mirror doesn't miss anything. However, the main loop runs between that and the bdrv_swap() (which is actually a bug, being addressed in another series), so there is a small window in which new I/O might be issued that would be in flight during bdrv_swap(). It is safer to just drain the request queue of both devices in bdrv_swap() instead of relying on callers to do the right thing. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
| * block: Move flag inheritance to bdrv_open_inherit()Kevin Wolf2015-06-121-13/+61
| | | | | | | | | | | | | | | | | | | | | | Instead of letting every caller of bdrv_open() determine the right flags for its child node manually and pass them to the function, pass the parent node and the role of the newly opened child (like backing file, protocol layer, etc.). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
| * block: Use QemuOpts in bdrv_open_common()Kevin Wolf2015-06-121-5/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | Instead of manually parsing options and then deleting them from the options QDict, just use QemuOpts like most other places that deal with block device options. More options will be added there and then QemuOpts is a lot more manageable than open-coding everything. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
| * block: driver should override flags in bdrv_open()Max Reitz2015-06-121-18/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The BDRV_O_PROTOCOL flag should have an impact only if no driver is specified explicitly. Therefore, if bdrv_open() is called with an explicit block driver argument (either through the options QDict or through the drv parameter) and that block driver is a protocol block driver, BDRV_O_PROTOCOL should be set; if it is a format block driver, BDRV_O_PROTOCOL should be unset. While there was code to unset the flag in case a format block driver has been selected, it only followed the bdrv_fill_options() function call whereas the flag in fact needs to be adjusted before it is used there. With that change, BDRV_O_PROTOCOL will always be set if the BDS should be a protocol driver; if the driver has been specified explicitly, the new code will set it; and bdrv_fill_options() will only "probe" a protocol driver if BDRV_O_PROTOCOL is set. The probing after bdrv_fill_options() cannot select a protocol driver. Thus, bdrv_open_image() to open BDS.file is never called if a protocol BDS is about to be created. With that change in turn it is impossible to call bdrv_open_common() with a protocol drv and file != NULL, which allows us to remove the bdrv_swap() call. This change breaks a test case in qemu-iotest 051: "-drive file=t.qcow2,file.driver=qcow2" now works because the explicitly specified "qcow2" overrides the BDRV_O_PROTOCOL which is automatically set for the "file" BDS (and the filename is just passed down). Therefore, this patch removes that test case. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block: Change bitmap truncate conditional to assertionJohn Snow2015-06-121-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an artifact of an older version that had both all-bitmap and single-bitmap truncate functions, and some info got lost in the shuffle. Bitmaps can only be frozen during a backup operation, and a backup operation should prevent a resize operation, so just assert that this cannot happen. Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
| * block: record new size in bdrv_dirty_bitmap_truncateJohn Snow2015-06-121-0/+1
| | | | | | | | | | | | | | | | | | | | ce1ffea8 neglected to update the BdrvDirtyBitmap structure itself for internal consistency. It's currently not an issue, but for migration and persistence series this will cause headaches. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* | throttle: acquire the ThrottleGroup lock in bdrv_swap()Alberto Garcia2015-06-121-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | bdrv_swap() touches the fields of a BlockDriverState that are protected by the ThrottleGroup lock. Although those fields end up in their original place, they are temporarily swapped in the process, so there's a chance that an operation on a member of the same group happening on a different thread can try to use them. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: d92dc40d7c4f1fc5cda5cbbf4ffb7a4670b79d17.1433779731.git.berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* | throttle: Add throttle group supportAlberto Garcia2015-06-121-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The throttle group support use a cooperative round robin scheduling algorithm. The principles of the algorithm are simple: - Each BDS of the group is used as a token in a circular way. - The active BDS computes if a wait must be done and arms the right timer. - If a wait must be done the token timer will be armed so the token will become the next active BDS. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: f0082a86f3ac01c46170f7eafe2101a92e8fde39.1433779731.git.berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* | throttle: Extract timers from ThrottleState into a separate structureBenoƮt Canet2015-06-121-4/+7
|/ | | | | | | | | | | | | | | | | | Group throttling will share ThrottleState between multiple bs. As a consequence the ThrottleState will be accessed by multiple aio context. Timers are tied to their aio context so they must go out of the ThrottleState structure. This commit paves the way for each bs of a common ThrottleState to have its own timer. Signed-off-by: Benoit Canet <benoit.canet@nodalink.com> Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 6cf9ea96d8b32ae2f8769cead38f68a6a0c8c909.1433779731.git.berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* qapi: add dirty bitmap statusJohn Snow2015-05-291-1/+12
| | | | | | | | | | | | | | | | | | | | | | | Bitmaps can be in a handful of different states with potentially more to come as we tool around with migration and persistence patches. Management applications may need to know why certain bitmaps are unavailable for various commands, e.g. busy in another operation, busy being migrated, etc. Right now, all we offer is BlockDirtyInfo's boolean member 'frozen'. Instead of adding more booleans, replace it by an enumeration member 'status' with values 'active' and 'frozen'. Then add new value 'disabled'. Incompatible change. Fine because the changed part hasn't been released so far. Suggested-by: Eric Blake <eblake@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> [Commit message tweaked] Signed-off-by: Markus Armbruster <armbru@redhat.com>
* block: Detect multiplication overflow in bdrv_getlengthFam Zheng2015-05-221-0/+1
| | | | | | | | | | | | Bogus image may have a large total_sectors that will overflow the multiplication. For cleanness, fix the return code so the error message will be meaningful. Reported-by: Richard W.M. Jones <rjones@redhat.com> Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: align bounce buffers to pageDenis V. Lunev2015-05-221-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following sequence int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644); for (i = 0; i < 100000; i++) write(fd, buf, 4096); performs 5% better if buf is aligned to 4096 bytes. The difference is quite reliable. On the other hand we do not want at the moment to enforce bounce buffering if guest request is aligned to 512 bytes. The patch changes default bounce buffer optimal alignment to MAX(page size, 4k). 4k is chosen as maximal known sector size on real HDD. The justification of the performance improve is quite interesting. From the kernel point of view each request to the disk was split by two. This could be seen by blktrace like this: 9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img] 9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img] 9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img] 9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img] 9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img] 9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img] 9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img] 9,0 5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img] After the patch the pattern becomes normal: 9,0 6 1 0.000000000 12422 Q WS 314834944 + 1024 [qemu-img] 9,0 6 2 0.000038527 12422 Q WS 314835968 + 1024 [qemu-img] 9,0 6 3 0.000072849 12422 Q WS 314836992 + 1024 [qemu-img] 9,0 6 4 0.000106276 12422 Q WS 314838016 + 1024 [qemu-img] and the amount of requests sent to disk (could be calculated counting number of lines in the output of blktrace) is reduced about 2 times. Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest does his job well and real requests comes properly aligned (to page). Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1431441056-26198-3-git-send-email-den@openvz.org CC: Paolo Bonzini <pbonzini@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: minimal bounce buffer alignmentDenis V. Lunev2015-05-221-0/+11
| | | | | | | | | | | | | | | | | | | | The patch introduces new concept: minimal memory alignment for bounce buffers. Original so called "optimal" value is actually minimal required value for aligment. It should be used for validation that the IOVec is properly aligned and bounce buffer is not required. Though, from the performance point of view, it would be better if bounce buffer or IOVec allocated by QEMU will be aligned stricter. The patch does not change any alignment value yet. Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1431441056-26198-2-git-send-email-den@openvz.org CC: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: move I/O request processing to block/io.cStefan Hajnoczi2015-04-281-2512/+0
| | | | | | | | | | | | | | | | | | | | | | | | The block.c file has grown to over 6000 lines. It is time to split this file so there are fewer conflicts and the code is easier to maintain. Extract I/O request processing code: * Read * Write * Zero writes and making the image empty * Flush * Discard * ioctl * Tracked requests and queuing * Throttling and copy-on-read * Block status and allocated functions * Refreshing block limits * Reading/writing vmstate * qemu_blockalign() and friends The patch simply moves code from block.c into block/io.c. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: extract bdrv_setup_io_funcs()Stefan Hajnoczi2015-04-281-1/+6
| | | | | | | | Move the code to install coroutine and aio emulation function pointers in a BlockDriver to its own function. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: add bdrv_set_dirty()/bdrv_reset_dirty() to block_int.hStefan Hajnoczi2015-04-281-8/+4
| | | | | | | | | The dirty bitmap functions are called from the block I/O processing code. Make them visible to block_int.h users so they can be used outside block.c. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: replace bdrv_states iteration with bdrv_next()Stefan Hajnoczi2015-04-281-6/+8
| | | | | | | | | | | | | | The bdrv_states list is a static variable in block.c. bdrv_drain_all() and bdrv_flush_all() use this variable to iterate over all drives. The next patch will move bdrv_drain_all() and bdrv_flush_all() out of block.c so it's necessary to switch to the public bdrv_next() interface. Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Resize bitmaps on bdrv_truncateJohn Snow2015-04-281-0/+18
| | | | | | | | | Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-16-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Ensure consistent bitmap function prototypesJohn Snow2015-04-281-7/+6
| | | | | | | | | | | | We often don't need the BlockDriverState for functions that operate on bitmaps. Remove it. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-15-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: add BdrvDirtyBitmap documentationJohn Snow2015-04-281-5/+5
| | | | | | | | | Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-14-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: Add dirty bitmap status field in query-blockJohn Snow2015-04-281-0/+1
| | | | | | | | | | | | | | Add the "frozen" status booleans, to inform clients when a bitmap is occupied doing a task. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 1429314609-29776-13-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: add block-dirty-bitmap-clearJohn Snow2015-04-281-0/+8
| | | | | | | | | | | | | | | | | Add bdrv_clear_dirty_bitmap and a matching QMP command, qmp_block_dirty_bitmap_clear that enables a user to reset the bitmap attached to a drive. This allows us to reset a bitmap in the event of a full drive backup. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 1429314609-29776-12-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: Add support of "dirty-bitmap" sync mode for drive-backupJohn Snow2015-04-281-0/+9
| | | | | | | | | | | | | | | | For "dirty-bitmap" sync mode, the block job will iterate through the given dirty bitmap to decide if a sector needs backup (backup all the dirty clusters and skip clean ones), just as allocation conditions of "top" sync mode. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-11-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Add bitmap successorsJohn Snow2015-04-281-1/+103
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bitmap successor is an anonymous BdrvDirtyBitmap that is intended to be created just prior to a sensitive operation (e.g. Incremental Backup) that can either succeed or fail, but during the course of which we still want a bitmap tracking writes. On creating a successor, we "freeze" the parent bitmap which prevents its deletion, enabling, anonymization, or creating a bitmap with the same name. On success, the parent bitmap can "abdicate" responsibility to the successor, which will inherit its name. The successor will have been tracking writes during the course of the backup operation. The parent will be safely deleted. On failure, we can "reclaim" the successor from the parent, unifying them such that the resulting bitmap describes all writes occurring since the last successful backup, for instance. Reclamation will thaw the parent, but not explicitly re-enable it. BdrvDirtyBitmap operations that target a single bitmap are protected by assertions that the bitmap is not frozen and/or disabled. BdrvDirtyBitmap operations that target a group of bitmaps, such as bdrv_{set,reset}_dirty will ignore frozen/disabled drives with a conditional instead. Internal functions that enable/disable dirty bitmaps have assertions added to them to prevent modifying frozen bitmaps. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 1429314609-29776-10-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Add bitmap disabled statusJohn Snow2015-04-281-0/+25
| | | | | | | | | | | | | | | | | | | | | Add a status indicating the enabled/disabled state of the bitmap. A bitmap is by default enabled, but you can lock the bitmap into a read-only state by setting disabled = true. A previous version of this patch added a QMP interface for changing the state of the bitmap, but it has since been removed for now until a use case emerges where this state must be revealed to the user. The disabled state WILL be used internally for bitmap migration and bitmap persistence. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-9-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Introduce bdrv_dirty_bitmap_granularity()John Snow2015-04-281-2/+6
| | | | | | | | | | | | | | This returns the granularity (in bytes) of dirty bitmap, which matches the QMP interface and the existing query interface. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-6-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: Add block-dirty-bitmap-add and block-dirty-bitmap-removeJohn Snow2015-04-281-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new command pair is added to manage a user created dirty bitmap. The dirty bitmap's name is mandatory and must be unique for the same device, but different devices can have bitmaps with the same names. The granularity is an optional field. If it is not specified, we will choose a default granularity based on the cluster size if available, clamped to between 4K and 64K to mirror how the 'mirror' code was already choosing granularity. If we do not have cluster size info available, we choose 64K. This code has been factored out into a helper shared with block/mirror. This patch also introduces the 'block_dirty_bitmap_lookup' helper, which takes a device name and a dirty bitmap name and validates the lookup, returning NULL and setting errp if there is a problem with either field. This helper will be re-used in future patches in this series. The types added to block-core.json will be re-used in future patches in this series, see: 'qapi: Add transaction support to block-dirty-bitmap-{add, enable, disable}' Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 1429314609-29776-5-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: Ensure consistent granularity typeJohn Snow2015-04-281-5/+6
| | | | | | | | | | | | | We treat this field with a variety of different types everywhere in the code. Now it's just uint32_t. Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-4-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qapi: Add optional field "name" to block dirty bitmapFam Zheng2015-04-281-1/+31
| | | | | | | | | | | | | | | | | | | | | | This field will be set for user created dirty bitmap. Also pass in an error pointer to bdrv_create_dirty_bitmap, so when a name is already taken on this BDS, it can report an error message. This is not global check, two BDSes can have dirty bitmap with a common name. Implemented bdrv_find_dirty_bitmap to find a dirty bitmap by name, will be used later when other QMP commands want to reference dirty bitmap by name. Add bdrv_dirty_bitmap_make_anon. This unsets the name of dirty bitmap. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: John Snow <jsnow@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1429314609-29776-3-git-send-email-jsnow@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qmp: fill in the image field in BlockDeviceInfoAlberto Garcia2015-04-281-2/+7
| | | | | | | | | | | | | | | | | | | The image field in BlockDeviceInfo is supposed to contain an ImageInfo object. However that is being filled in by bdrv_query_info(), not by bdrv_block_device_info(), which is where BlockDeviceInfo is actually created. Anyone calling bdrv_block_device_info() directly will get a null image field. As a consequence of this, the HMP command 'info block -n -v' crashes QEMU. This patch moves the code that fills in that field from bdrv_query_info() to bdrv_block_device_info(). Signed-off-by: Alberto Garcia <berto@igalia.com> Message-id: 1429271563-3765-1-git-send-email-berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: use bdrv_get_device_or_node_name() in error messagesAlberto Garcia2015-04-281-12/+12
| | | | | | | | | | | | | | | | | | There are several error messages that identify a BlockDriverState by its device name. However those errors can be produced in nodes that don't have a device name associated. In those cases we should use bdrv_get_device_or_node_name() to fall back to the node name and produce a more meaningful message. The messages are also updated to use the more generic term 'node' instead of 'device'. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 9823a1f0514fdb0692e92868661c38a9e00a12d6.1428485266.git.berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: add bdrv_get_device_or_node_name()Alberto Garcia2015-04-281-0/+9
| | | | | | | | | | | | This function gets the device name associated with a BlockDriverState, or its node name if the device name is empty. Signed-off-by: Alberto Garcia <berto@igalia.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 4fa30aa8d61d9052ce266fd5429a59a14e941255.1428485266.git.berto@igalia.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: avoid unnecessary bottom halvesPaolo Bonzini2015-04-281-9/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | bdrv_aio_* APIs can use coroutines to achieve asynchronicity. However, the coroutine may terminate without having yielded back to the caller (for example because of something that invokes a nested event loop, or because the coroutine is doing nothing at all). In this case, the bdrv_aio_* API must delay the completion to the next iteration of the main loop, because bdrv_aio_* will never invoke the callback before returning. This can be done with a bottom half, and indeed bdrv_aio_* is always using one for simplicity. It is possible to gain some performance (~3%) by avoiding this in the common case. A new field in the BlockAIOCBCoroutine struct is set to true until the first time the corotine has yielded to its creator, and completion goes through a new function bdrv_co_complete. If the flag is false, bdrv_co_complete invokes the callback immediately. If it is true, the caller will notice that the coroutine has completed and schedule the bottom half itself. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1427524638-28157-1-git-send-email-pbonzini@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Pause block jobs in bdrv_drain_allFam Zheng2015-04-281-0/+20
| | | | | | | | | | | | | This is necessary to suppress more IO requests from being generated from block job coroutines. Signed-off-by: Fam Zheng <famz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Message-id: 1428069921-2957-3-git-send-email-famz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Switch to host monotonic clock for IO throttlingFam Zheng2015-04-281-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, throttle timers won't make any progress when VCPU is not running, which would stall the request queue in utils, qtest, vm suspending, and live migration, without special handling. Block jobs are confusingly inconsistent between with and without throttling: if user sets a bps limit, stops the vm, then start a block job, the block job will not make any progress; in contrary, if user unsets the bps limit, or if it's not set, the block job will run normally. After this patch, with the host clock, even if the VCPUs are stopped, the throttle queues will be processed. This patch also enables potential to add throttle to bdrv_drain_all. Currently all requests are drained immediately. In other words whenever it is called, IO throttling goes ineffective (examples: system reset, migration and many block job operations.). This is a loophole that guest could exploit. If we use the host clock, we can later just trust the nested poll. This could be done on top. Note that for qemu-iotests case 093, which uses qtest, we still keep vm clock so the script can control the clock stepping in order to be deterministic. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 1427268446-6426-1-git-send-email-famz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* Convert (ffs(val) - 1) to ctz32(val)Stefan Hajnoczi2015-04-281-1/+1
| | | | | | | | | | | | | | | | | | | This commit was generated mechanically by coccinelle from the following semantic patch: @@ expression val; @@ - (ffs(val) - 1) + ctz32(val) The call sites have been audited to ensure the ffs(0) - 1 == -1 case never occurs (due to input validation, asserts, etc). Therefore we don't need to worry about the fact that ctz32(0) == 32. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1427124571-28598-5-git-send-email-stefanha@redhat.com Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: Fix unaligned zero writeFam Zheng2015-03-271-6/+39
| | | | | | | | | | | | | | | If the zero write is not aligned, bdrv_co_do_pwritev will segfault because of accessing to the NULL qiov passed in by bdrv_co_write_zeroes. Fix this by allocating a local qiov in bdrv_co_do_pwritev if the request is not aligned. (In this case the padding iovs are necessary anyway, so it doesn't hurt.) Also add a check at the end of bdrv_co_do_pwritev to clear the zero flag if padding is involved. Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 1427160230-4489-2-git-send-email-famz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: Drop bdrv_findFam Zheng2015-03-161-9/+0
| | | | | | | | | All callers are converted, so drop it. Signed-off-by: Fam Zheng <famz@redhat.com> Message-id: 1425296209-1476-5-git-send-email-famz@redhat.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
* block: Deprecate QCOW/QCOW2 encryptionMarkus Armbruster2015-03-161-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've steered users away from QCOW/QCOW2 encryption for a while, because it's a flawed design (commit 136cd19 Describe flaws in qcow/qcow2 encryption in the docs). In addition to flawed crypto, we have comically bad usability, and plain old bugs. Let me show you. = Example images = I'm going to use a raw image as backing file, and two QCOW2 images, one encrypted, and one not: $ qemu-img create -f raw backing.img 4m Formatting 'backing.img', fmt=raw size=4194304 $ qemu-img create -f qcow2 -o encryption,backing_file=backing.img,backing_fmt=raw geheim.qcow2 4m Formatting 'geheim.qcow2', fmt=qcow2 size=4194304 backing_file='backing.img' backing_fmt='raw' encryption=on cluster_size=65536 lazy_refcounts=off $ qemu-img create -f qcow2 -o backing_file=backing.img,backing_fmt=raw normal.qcow2 4m Formatting 'normal.qcow2', fmt=qcow2 size=4194304 backing_file='backing.img' backing_fmt='raw' encryption=off cluster_size=65536 lazy_refcounts=off = Usability issues = == Confusing startup == When no image is encrypted, and you don't give -S, QEMU starts the guest immediately: $ qemu-system-x86_64 -nodefaults -display none -monitor stdio normal.qcow2 QEMU 2.2.50 monitor - type 'help' for more information (qemu) info status VM status: running But as soon as there's an encrypted image in play, the guest is *not* started, with no notification whatsoever: $ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2 QEMU 2.2.50 monitor - type 'help' for more information (qemu) info status VM status: paused (prelaunch) If the user figured out that he needs to type "cont" to enter his keys, the confusion enters the next level: "cont" asks for at most *one* key. If more are needed, it then silently does nothing. The user has to type "cont" once per encrypted image: $ qemu-system-x86_64 -nodefaults -display none -monitor stdio -drive if=none,file=geheim.qcow2 -drive if=none,file=geheim.qcow2 QEMU 2.2.50 monitor - type 'help' for more information (qemu) info status VM status: paused (prelaunch) (qemu) c none0 (geheim.qcow2) is encrypted. Password: ****** (qemu) info status VM status: paused (prelaunch) (qemu) c none1 (geheim.qcow2) is encrypted. Password: ****** (qemu) info status VM status: running == Incorrect passwords not caught == All existing encryption schemes give you the GIGO treatment: garbage password in, garbage data out. Guests usually refuse to mount garbage, but other usage is prone to data loss. == Need to stop the guest to add an encrypted image == $ qemu-system-x86_64 -nodefaults -display none -monitor stdio QEMU 2.2.50 monitor - type 'help' for more information (qemu) info status VM status: running (qemu) drive_add "" if=none,file=geheim.qcow2 Guest must be stopped for opening of encrypted image (qemu) stop (qemu) drive_add "" if=none,file=geheim.qcow2 OK Commit c3adb58 added this restriction. Before, we could expose images lacking an encryption key to guests, with potentially catastrophic results. See also "Use without key is not always caught". = Bugs = == Use without key is not always caught == Encrypted images can be in an intermediate state "opened, but no key". The weird startup behavior and the need to stop the guest are there to ensure the guest isn't exposed to that state. But other things still are! * drive_backup $ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2 QEMU 2.2.50 monitor - type 'help' for more information (qemu) drive_backup -f ide0-hd0 out.img raw Formatting 'out.img', fmt=raw size=4194304 I guess this writes encrypted data to raw image out.img. Good luck with figuring out how to decrypt that again. * commit $ qemu-system-x86_64 -nodefaults -display none -monitor stdio geheim.qcow2 QEMU 2.2.50 monitor - type 'help' for more information (qemu) commit ide0-hd0 I guess this writes encrypted data into the unencrypted raw backing image, effectively destroying it. == QMP device_add of usb-storage fails when it shouldn't == When the image is encrypted, device_add creates the device, defers actually attaching it to when the key becomes available, then fails. This is wrong. device_add must either create the device and succeed, or do nothing and fail. $ qemu-system-x86_64 -nodefaults -display none -usb -qmp stdio -drive if=none,id=foo,file=geheim.qcow2 {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 2}, "package": ""}, "capabilities": []}} { "execute": "qmp_capabilities" } {"return": {}} { "execute": "device_add", "arguments": { "driver": "usb-storage", "id": "bar", "drive": "foo" } } {"error": {"class": "DeviceEncrypted", "desc": "'foo' (geheim.qcow2) is encrypted"}} {"execute":"device_del","arguments": { "id": "bar" } } {"timestamp": {"seconds": 1426003440, "microseconds": 237181}, "event": "DEVICE_DELETED", "data": {"path": "/machine/peripheral/bar/bar.0/legacy[0]"}} {"timestamp": {"seconds": 1426003440, "microseconds": 238231}, "event": "DEVICE_DELETED", "data": {"device": "bar", "path": "/machine/peripheral/bar"}} {"return": {}} This stuff is worse than useless, it's a trap for users. If people become sufficiently interested in encrypted images to contribute a cryptographically sane implementation for QCOW2 (or whatever other format), then rewriting the necessary support around it from scratch will likely be easier and yield better results than fixing up the existing mess. Let's deprecate the mess now, drop it after a grace period, and move on. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* block: add bdrv functions for geometry and blocksizeEkaterina Tumanova2015-03-101-0/+34
| | | | | | | | | | | | Add driver functions for geometry and blocksize detection Signed-off-by: Ekaterina Tumanova <tumanova@linux.vnet.ibm.com> Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1424087278-49393-2-git-send-email-tumanova@linux.vnet.ibm.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* qemu-img: Suppress unhelpful extra errors in convert, amendMarkus Armbruster2015-02-261-1/+4
| | | | | | | | | | | | | | | | | | | | img_convert() and img_amend() use qemu_opts_do_parse(), which reports errors with qerror_report_err(). Its error messages aren't helpful here, the caller reports one that actually makes sense. Reproducer: $ qemu-img convert -o backing_format=raw in.img out.img qemu-img: Invalid parameter 'backing_format' qemu-img: Invalid options for file format 'raw' To fix, propagate errors through qemu_opts_do_parse(). This lifts the error reporting into callers. Drop it from img_convert() and img_amend(), keep it in qemu_chr_parse_compat(), bdrv_img_create(). Since I'm touching qemu_opts_do_parse() anyway, write a function comment for it. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
* QemuOpts: Drop qemu_opt_set(), rename qemu_opt_set_err(), fix useMarkus Armbruster2015-02-261-3/+2
| | | | | | | | | | | | | | | | qemu_opt_set() is a wrapper around qemu_opt_set() that reports the error with qerror_report_err(). Most of its users assume the function can't fail. Make them use qemu_opt_set_err() with &error_abort, so that should the assumption ever break, it'll break noisily. Just two users remain, in util/qemu-config.c. Switch them to qemu_opt_set_err() as well, then rename qemu_opt_set_err() to qemu_opt_set(). Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
* block: Suppress unhelpful extra errors in bdrv_img_create()Markus Armbruster2015-02-261-2/+5
| | | | | | | | | | | | bdrv_img_create() uses qemu_opt_set(), which reports errors with qerror_report_err(). Its error messages aren't helpful here, the caller reports one that actually makes sense. I don't know how to trigger the error conditions, though. Switch to qemu_opt_set_err() to get rid of the unwanted messages. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
* QemuOpts: Convert qemu_opt_set_number() to Error, fix its useMarkus Armbruster2015-02-261-3/+3
| | | | | | | | | | | | | | | | Return the Error object instead of reporting it with qerror_report_err(). Change callers that assume the function can't fail to pass &error_abort, so that should the assumption ever break, it'll break noisily. Turns out all callers outside its unit test assume that. We could drop the Error ** argument, but that would make the interface less regular, so don't. Signed-off-by: Markus Armbruster <armbru@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
* block: Keep bdrv_check*_request()'s return valueMax Reitz2015-02-161-11/+24
| | | | | | | | | | | | | Do not throw away the value returned by bdrv_check_request() and bdrv_check_byte_request(). Fix up some coding style issues in the proximity of the affected hunks. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1423162705-32065-17-git-send-email-mreitz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
* block: Remove "growable" from BDSMax Reitz2015-02-161-17/+7
| | | | | | | | | | | | Now that request clamping is done in the BlockBackend, the "growable" field can be removed from the BlockDriverState. All BDSs are now treated as being "growable" (that is, they are allowed to grow; they are not necessarily actually able to). Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Message-id: 1423162705-32065-16-git-send-email-mreitz@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
OpenPOWER on IntegriCloud