summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* drbd: fix potential data corruption and protocol errorLars Ellenberg2012-05-093-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We assumed only bios with bi_idx == 0 would end up in drbd_make_request(). That is wrong. At least device mapper, in __clone_and_map(), may submit clones only covering a partial bio, but sharing the original bvec, by adjusting bi_idx and relevant other bio members of the clone. We used __bio_for_each_segment() in various places, even though that is documented as * drivers should not use the __ version unless they _really_ want to * run through the entire bio and not just pending pieces Impact: we would send the full bio bvec, even for the clone with bi_idx > 0, which will cause data corruption on the peer (because we submit wrong data at the clone offset), and will cause a DRBD protocol error, disconnect/reconnect and resync (thus fixing the corruption), because the next package header would be expected right in the middle of the sent data, causing DRBD magic mismatch. Fix: drop the assert, and use bio_for_each_segment() instead of the __ version. Conflicts: drbd/drbd_tracing.c Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fix a potential write ordering issue on SyncTarget nodesPhilipp Reisner2012-05-091-0/+21
| | | | | | | | | | | | | | | If a SyncTarget node gets a P_RS_DATA_REPLY before a P_DATA packet for the same sector, it simply submits these two IO requests. This is be possible because on the SyncSource node, the data of the P_RS_DATA_REPLY packet was read from disk. Immediately after that a write request from upper layers came in. The disk scheduler or even the "hardware" queues on the disk drive might reorder these writes. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fix a potential race that could case data inconsistencyPhilipp Reisner2012-05-091-2/+4
| | | | | | | | | | | | | | | | | | | When we have a write request and a state change C_WF_BITMAP_S -> C_SYNC_SOURCE at the same time, and it happens that the line remote = remote && drbd_should_do_remote(s); stills sees C_WF_BITMAP_S, and send_oos = rw == WRITE && drbd_should_send_oos(s); already sees C_SYNC_SOURCE both are 0. This causes the write to not be mirrored, but marked as out-of-sync on the Sync_Source node. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: add missing part_round_stats to _drbd_start_io_acctLars Ellenberg2012-05-091-0/+1
| | | | | | | Without this, iostat frequently sees bogus svctime and >= 100% "utilization". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fix module refcount leak in drbd_accept()Lars Ellenberg2012-05-091-0/+1
| | | | | | | | | | | | | | | | | | drbd_accept was modelled after kernel_accept with drbd commit 53eb779 in July 2008. Only, kernel_accept was then broken, and only fixed later with kernel commit 1b08534e in Dec 2008: net: Fix module refcount leak in kernel_accept() Impact: protocol families provided as modules, e.g. ipv6 or ib_sdp, would soon have their reference count become negative, preventing them from being unloaded (likely), or worse, hit zero without actually being unused, allowing them to be unloaded while still in use (unlikely, but if triggered, causing a kernel crash). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Consider the disk-timeout also for meta-data IO operationsPhilipp Reisner2012-05-094-39/+45
| | | | | | | | | | | | | | | | | | If the backing device is already frozen during attach, we failed to recognize that. The current disk-timeout code works on top of the drbd_request objects. During attach we do not allow IO and therefore never generate a drbd_request object but block before that in drbd_make_request(). This patch adds the timeout to all drbd_md_sync_page_io(). Before this patch we used to go from D_ATTACHING directly to D_DISKLESS if IO failed during attach. We can no longer do this since we have to stay in D_FAILED until all IO ops issued to the backing device returned. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Do not send state packets while lower than C_CONNECTED cstatePhilipp Reisner2012-05-091-5/+6
| | | | | | | | | | | I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION. Sending may already work in these cstates, but the peer still expects the HandShake / ConnectionFeatures packet. Actually triggered by the Testuite on kugel. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: fix race between disconnect and receive_stateLars Ellenberg2012-05-092-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the asender thread, or request_timer_fn(), or some other part of the code, decided to drop the connection (because of timeout or other), but the receiver just now was processing a P_STATE packet, there was a chance that receive_state() would do a hard state change "re-establishing" an already failed connection without additional handshake. Log excerpt: Remote failed to finish a request within ko-count * timeout peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) asender terminated ... peer( Unknown -> Secondary ) conn( Timeout -> Connected ) pdsk( DUnknown -> UpToDate ) peer_isp( 0 -> 1 ) ... Connection closed peer( Secondary -> Unknown ) conn( Connected -> Unconnected ) pdsk( UpToDate -> DUnknown ) peer_isp( 1 -> 0 ) receiver terminated Impact: while the connection state is erroneously "Connected", requests may be queued and even sent, which would never be acknowledged, and may have been missed by the cleanup. These requests would never be completed. The next drbd_suspend_io() will then lock up, waiting forever for these requests to complete. Fixed in several code paths: Make sure the connection state is NetworkFailure or worse before starting the cleanup in drbd_disconnect(). This should make sure the cleanup won't miss any requests. Disallow receive_state() to "upgrade" the connection state from an error state. This will make sure the "illegal" state transition won't happen. For all connection failure states, relax the safe-guard in sanitize_state() again to silently mask out those state changes (e.g. Timeout -> Connected becomes Timeout -> Timeout). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: fix potential spinlock deadlockLars Ellenberg2012-05-091-8/+12
| | | | | | | | | | | | | | | | drbd_try_clear_on_disk_bm() has a sanity check for the number of blocks left to be resynced (rs_left) in the current resync extent. If it detects a mismatch, it complains, and forces a disconnect using drbd_force_state(mdev, NS(conn, C_DISCONNECTING)); Unfortunately, this may be called while holding the req_lock, and drbd_force_state() want's to aquire that lock itself. Deadlock. Don't force a disconnect, but fix up rs_left by recounting and reassigning the number of dirty blocks in that extent. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fixed an obvious copy-n-paste mistakePhilipp Reisner2012-05-092-2/+2
| | | | | | | | This bug might have caused troubles if disk-barriers and the ahead-behind more are enabled at the same time. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: send intermediate state change results to the peerLars Ellenberg2012-05-094-16/+47
| | | | | | | | | | | | | | | | | | DRBD state changes schedule after_state_ch() actions to a worker thread, which decides on the old and new states of that change, whether to send an informational state update packet (P_STATE) to the peer. If it decides to drbd_send_state(), it would however always send the _curent_ state, which, if a second state change happens before the after_state_ch() of the first ran, may "fast-forward" the peer's view about this node. In most cases that is harmless, but sometimes this can confuse DRBD, for example into not actually starting a necessary resync if you do a very tight detach/attach loop on a Connected Secondary. Fix this by always sending the "new" state of the respective state transition which scheduled this after_state_ch() work. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: fix spurious meta data IO "error"Lars Ellenberg2012-05-091-0/+2
| | | | | | | | | | | When detaching, even cleanly detaching due to administrator request, we always go through D_FAILED before we become D_DISKLESS. Don't let that state change race with an in-flight meta data IO, or that one might think it actually experienced an IO error. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fixed a race condition between detach and start of resyncPhilipp Reisner2012-05-091-3/+3
| | | | | | | | | | | | | | drbd_state_lock() is only there to serialize cluster wide state changes. Testing the local disk state needs to happen while holding the global_state_lock. Otherwise you might see something like this (Oct 6 on kugel) 14:20:24 drbd0: conn( WFSyncUUID -> Connected ) disk( Inconsistent -> Failed ) 14:20:24 drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) 14:20:24 drbd0: conn( Connected -> SyncTarget ) disk( Failed -> Inconsistent ) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: fix harmless race to not trigger an ASSERTLars Ellenberg2012-05-091-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | We have one pre-allocated page to do certain synchronous meta data IO with, using it is serialized like so: drbd_md_get_buffer(); drbd_md_sync_page_io(); drbd_md_sync_page_io(); ... drbd_md_put_buffer(); In drbd_md_sync_page_io() there is an ASSERT(atomic_read(&mdev->md_io_in_use) == 1); We want to be able to timeout on unresponsive lower level devices, so we can "detach" in that case. Inside drbd_md_sync_page_io() we grab an extra reference, to not have a dangling pointer in case a delayed IO eventually does still complete, even after we "detached" already. We need to put the extra reference before we signal completion from the completion handler, or the second drbd_md_sync_page_io() above may trigger the assert (reference count still 2). Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Derive sync-UUIDs only from the bitmap-uuid if it is non-zeroPhilipp Reisner2012-05-091-1/+5
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: drbd_nl_resize(): Fix missing put_ldev() on error pathAndreas Gruenbacher2012-05-091-1/+5
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: fix "stalled" empty resyncLars Ellenberg2012-05-091-3/+8
| | | | | | | | | | | | | With sync-after dependencies, given "lucky" timing of pause/unpause events, and the end of an empty (0 bits set) resync was sometimes not detected on the SyncTarget, leading to a "stalled" SyncSource state. Fixed this by expecting not only "Inconsistent -> UpToDate" but also "Consistent -> UpToDate" transitions for the peer disk state to end a resync. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Bugfix for the connection behaviorPhilipp Reisner2012-05-092-6/+6
| | | | | | | | | | | | | | | | | | | | | If we get into the C_BROKEN_PIPE cstate once, the state engine set the thi->t_state of the receiver thread to restarting. But with the while loop in drbdd_init() a new connection gets established. After the call into drbdd() returns immediately since the thi->t_state is not RUNNING. The restart of drbd_init() then resets thi->t_state to RUNNING. I.e. after entering C_BROKEN_PIPE once, the next successful established connection gets wasted. The two parts of the fix: * Do not cause the thread to restart if we detect the issue with the sockets while we are in C_WF_CONNECTION. * Make sure that all actions that would have set us to C_BROKEN_PIPE happen before the state change to C_WF_REPORT_PARAMS. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Cleanup all epoch objects upon connection lossPhilipp Reisner2012-05-091-2/+3
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: detach must not try to abort non-local requests from drbd-8.4Philipp Reisner2012-05-091-9/+34
| | | | | | | Cherry picked form 8.4 Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Consider that the no-data-condition could be in connected statePhilipp Reisner2012-05-091-1/+2
| | | | | | | | ...when the peer has inconsistent data. In that case we failed to clear the susp_nod flag. When the local disk was attached again Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Fixed current UUID generationPhilipp Reisner2012-05-091-2/+2
| | | | | | | | | | Now, the new edition of the clause only fires if a diskless peer gets promoted. This is a fixup for "drbd: Delayed creation of current-UUID". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: change some GFP_KERNEL to GFP_NOIOLars Ellenberg2012-05-091-3/+3
| | | | | | | | Bitmap IO may happend in the context of an application write, in the generic block IO path. We need to use GFP_NOIO. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Implemented the disk-timeout optionPhilipp Reisner2012-05-095-14/+31
| | | | | | | | | When the disk-timeout is active, and it expires for a single request, we consider the local disk as D_FAILED. Note: With this change, I made both timeout based state transitions HARD state transitions. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Force flag for the detach operationPhilipp Reisner2012-05-093-2/+19
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Allow new IOs while the local disk in in FAILED statePhilipp Reisner2012-05-091-1/+1
| | | | | | | | | The last bunch of commits prepared the 'detach from tar pit' feature. With that we can be for long time in disk state FAILED. We need to accept new IO requests during that time. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Bitmap IO functions can now return prematurely if the disk breaksPhilipp Reisner2012-05-091-10/+30
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Added a kref to bm_aio_ctxPhilipp Reisner2012-05-091-25/+59
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Hold a reference to ldev while doing meta-data IOPhilipp Reisner2012-05-092-0/+7
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Keep a reference to the bio until the completion handler finishedPhilipp Reisner2012-05-092-0/+2
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Implemented wait_until_done_or_disk_failure()Philipp Reisner2012-05-093-5/+18
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Replaced md_io_mutex by an atomic: md_io_in_usePhilipp Reisner2012-05-094-19/+51
| | | | | | | | The new function drbd_md_get_buffer() aborts waiting for the buffer in case the disk failes in the meantime. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: moved md_io into mdevPhilipp Reisner2012-05-093-9/+10
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Immediately allow completion of IOs, that wait for IO completions on a ↵Philipp Reisner2012-05-093-11/+37
| | | | | | | failed disk Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Keep a reference to barrier acked requestsPhilipp Reisner2012-05-093-2/+23
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Improve compatibility with drbd's older than 8.3.7Philipp Reisner2012-05-092-3/+8
| | | | | | | | | | | | | Regression introduced with 8.3.11 commit: drbd: Take a more conservative approach when deciding max_bio_size Never ever tell an older drbd, that we support more than 32KiB in a single data request (packet). Never believe an older drbd, that is supports more than 32KiB in a single data request (packet) Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Only print sanitize state's warnings, if the state change happensPhilipp Reisner2012-05-091-15/+40
| | | | | | | | | | The reason for this change is that, with when doing 'drbdadm invalidate' on a disconnected resource caused an "implicitly set pdsk from UpToDate to DUnknown" message, which was missleading. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: downgraded error printk to infoLars Ellenberg2012-05-091-4/+2
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: allow ping-timeout of up to 30 secondsLars Ellenberg2012-05-091-1/+1
| | | | | | | | | | Allow up to 300 centi-seconds to be configured for the "ping timeout". There may be setups where heavy congestion, huge buffers, and asymmetric bandwidth limitations may need a "huge" ping-timeout as work-around for "spurious connection loss" problems. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* DRBD: Fix comparison always false warning due to long/long long compareDavid Howells2012-05-091-1/+1
| | | | | | | | | | | | | | | Fix warnings of the following nature in the drbd header: In file included from drivers/block/drbd/drbd_bitmap.c:32: drivers/block/drbd/drbd_int.h: In function 'drbd_get_syncer_progress': drivers/block/drbd/drbd_int.h:2234: warning: comparison is always false due to limited range of data where mdev->rs_total (an unsigned long) is being compared to 1ULL << 32, which is always false on a 32-bit machine. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
* drbd: spelling fix: too smallLars Ellenberg2012-05-092-6/+6
| | | | | | | It is not "to small", but "too small". Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: cosmetic: fix accidental division instead of modulo when pretty printingLars Ellenberg2012-05-091-1/+1
| | | | | | | | For large resync rates, seq_printf_with_thousands_grouping() accidentally only produced Y,000,00Y, instead of the real numbers. Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd: Lower log priority for an event that is definitely not an errorPhilipp Reisner2012-05-091-1/+1
| | | | | Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* Merge tag 'v3.4-rc5' into for-3.5/coreJens Axboe2012-05-011355-8950/+16167
|\ | | | | | | | | | | | | | | | | | | | | | | The core branch is behind driver commits that we want to build on for 3.5, hence I'm pulling in a later -rc. Linux 3.4-rc5 Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Linux 3.4-rc5v3.4-rc5Linus Torvalds2012-04-291-1/+1
| |
| * Merge tag 'pm-for-3.4-rc5' of ↵Linus Torvalds2012-04-292-24/+41
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael J. Wysocki: "Fix for an issue causing hibernation to hang on systems with highmem (that practically means i386) due to broken memory management (bug introduced in 3.2, so -stable material) and PM documentation update making the freezer documentation follow the code again after some recent updates." * tag 'pm-for-3.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: PM / Freezer / Docs: Update documentation about freezing of tasks PM / Hibernate: fix the number of pages used for hibernate/thaw buffering
| | * PM / Freezer / Docs: Update documentation about freezing of tasksMarcos Paulo de Souza2012-04-291-18/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The file Documentation/power/freezing-of-tasks.txt was still referencing the TIF_FREEZE flag, that was removed by the commit d88e4cb67197d007fb778d62fe17360e970d5bfa(freezer: remove now unused TIF_FREEZE). This patch removes all the references of TIF_FREEZE that were left behind. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
| | * PM / Hibernate: fix the number of pages used for hibernate/thaw bufferingBojan Smojver2012-04-241-6/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hibernation regression fix, since 3.2. Calculate the number of required free pages based on non-high memory pages only, because that is where the buffers will come from. Commit 081a9d043c983f161b78fdc4671324d1342b86bc introduced a new buffer page allocation logic during hibernation, in order to improve the performance. The amount of pages allocated was calculated based on total amount of pages available, although only non-high memory pages are usable for this purpose. This caused hibernation code to attempt to over allocate pages on platforms that have high memory, which led to hangs. Signed-off-by: Bojan Smojver <bojan@rexursive.com> Signed-off-by: Rafael J. Wysocki <rjw@suse.de>
| * | autofs: make the autofsv5 packet file descriptor use a packetized pipeLinus Torvalds2012-04-293-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The autofs packet size has had a very unfortunate size problem on x86: because the alignment of 'u64' differs in 32-bit and 64-bit modes, and because the packet data was not 8-byte aligned, the size of the autofsv5 packet structure differed between 32-bit and 64-bit modes despite looking otherwise identical (300 vs 304 bytes respectively). We first fixed that up by making the 64-bit compat mode know about this problem in commit a32744d4abae ("autofs: work around unhappy compat problem on x86-64"), and that made a 32-bit 'systemd' work happily on a 64-bit kernel because everything then worked the same way as on a 32-bit kernel. But it turned out that 'automount' had actually known and worked around this problem in user space, so fixing the kernel to do the proper 32-bit compatibility handling actually *broke* 32-bit automount on a 64-bit kernel, because it knew that the packet sizes were wrong and expected those incorrect sizes. As a result, we ended up reverting that compatibility mode fix, and thus breaking systemd again, in commit fcbf94b9dedd. With both automount and systemd doing a single read() system call, and verifying that they get *exactly* the size they expect but using different sizes, it seemed that fixing one of them inevitably seemed to break the other. At one point, a patch I seriously considered applying from Michael Tokarev did a "strcmp()" to see if it was automount that was doing the operation. Ugly, ugly. However, a prettier solution exists now thanks to the packetized pipe mode. By marking the communication pipe as being packetized (by simply setting the O_DIRECT flag), we can always just write the bigger packet size, and if user-space does a smaller read, it will just get that partial end result and the extra alignment padding will simply be thrown away. This makes both automount and systemd happy, since they now get the size they asked for, and the kernel side of autofs simply no longer needs to care - it could pad out the packet arbitrarily. Of course, if there is some *other* user of autofs (please, please, please tell me it ain't so - and we haven't heard of any) that tries to read the packets with multiple writes, that other user will now be broken - the whole point of the packetized mode is that one system call gets exactly one packet, and you cannot read a packet in pieces. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | pipes: add a "packetized pipe" mode for writingLinus Torvalds2012-04-292-2/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The actual internal pipe implementation is already really about individual packets (called "pipe buffers"), and this simply exposes that as a special packetized mode. When we are in the packetized mode (marked by O_DIRECT as suggested by Alan Cox), a write() on a pipe will not merge the new data with previous writes, so each write will get a pipe buffer of its own. The pipe buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn will tell the reader side to break the read at that boundary (and throw away any partial packet contents that do not fit in the read buffer). End result: as long as you do writes less than PIPE_BUF in size (so that the pipe doesn't have to split them up), you can now treat the pipe as a packet interface, where each read() system call will read one packet at a time. You can just use a sufficiently big read buffer (PIPE_BUF is sufficient, since bigger than that doesn't guarantee atomicity anyway), and the return value of the read() will naturally give you the size of the packet. NOTE! We do not support zero-sized packets, and zero-sized reads and writes to a pipe continue to be no-ops. Also note that big packets will currently be split at write time, but that the size at which that happens is not really specified (except that it's bigger than PIPE_BUF). Currently that limit is the system page size, but we might want to explicitly support bigger packets some day. The main user for this is going to be the autofs packet interface, allowing us to stop having to care so deeply about exact packet sizes (which have had bugs with 32/64-bit compatibility modes). But user space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will fail with an EINVAL on kernels that do not support this interface. Tested-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: David Miller <davem@davemloft.net> Cc: Ian Kent <raven@themaw.net> Cc: Thomas Meyer <thomas@m3y3r.de> Cc: stable@kernel.org # needed for systemd/autofs interaction fix Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
OpenPOWER on IntegriCloud