drbd_device_work is a work item that has a reference to a device,
while drbd_work is a more generic work item that does not carry
a reference to a device.
All callbacks get a pointer to a drbd_work instance, those callbacks
that expect a drbd_device_work use the container_of macro to get it.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
The implicit dependency on a variable inside the macro is problematic.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
With the polymorphic drbd_() macros, we no longer need the connection
specific variants.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
DRBD was using dev_err() and similar all over the code; instead of having to
write dev_err(disk_to_dev(device->vdisk), ...) to convert a drbd_device into a
kernel device, a DEV macro was used which implicitly references the device
variable. This is terrible; introduce separate drbd_err() and similar macros
with an explicit device parameter instead.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Let connection->peer_devices point to peer devices; connection->volumes was
pointing to devices.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
In a setup where a device (aka volume) can replicate to multiple peers and one
connection can be shared between multiple devices, we need separate objects to
represent devices on peer nodes and network connections.
As a first step to introduce multiple connections per device, give each
drbd_device object a single drbd_peer_device object which connects it to a
drbd_connection object.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
sed -i -e 's:all_tconn:connections:g' -e 's:tconn:connection:g'
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
sed -i -e 's:\<drbd_conf\>:drbd_device:g'
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Keep the protocol definitions separate from the kernel code; they are useful in
their own right.
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Move the prototype declaration of function tl_abort_disk_io() from
drbd/drbd_state.c to appropriate header file drbd/drbd_int.h because it
is used by more than 2 files.
This eliminates the following warnings in drbd/drbd_main.c:
drivers/block/drbd/drbd_main.c:310:6: warning: no previous prototype for ‘tl_abort_disk_io’ [-Wmissing-prototypes]
Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
In case the connection was established and lost again before
the a fence-peer handler returns, ignore the exit code of this
instance. (And use the exit code of the later started instance)
Signed-off-by: Andreas Gruenbacher <agruen@linbit.com>
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The issue was that if the connection broke while we did the
gracefull state change to C_DISCONNECTING (C_TEARDOWN), then
we returned a success code from the state engine. (SS_CW_NO_NEED)
The result of that is that we missed to call the fence-peer
script in such a case.
Fixed that by introducing a new error code (SS_OUTDATE_WO_CONN).
This one should never reach back into user space.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit
drbd: Disallow the peer_disk_state to be D_OUTDATED while connected
trying to invalidate a disconnected Primary returned an error code
that did not really match the situation:
"Refusing to be Outdated while Connected"
Insert two more specific conditions into is_valid_state(),
changing that to "Need access to UpToDate data",
respectively "Need a connection to start verify or resync".
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
To avoid other state change requests, after passing through
sanitize_state(), to be mistaken for an invalidate,
move the "set all bits as out-of-sync" into the invalidate path.
Make invalidate and invalidate-remote behave consistently wrt.
current connection state (need either an established replication link,
or really be disconnected). Also mention that in the documentation.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When we notice a disk failure on the receiving side,
we stop sending it new incoming writes.
Depending on exact timing of various events, the same transfer log epoch
could end up containing both replicated (before we noticed the failure)
and local-only requests (after we noticed the failure).
The sanity checks in tl_release(), called when receiving a
P_BARRIER_ACK, check that the ack'ed transfer log epoch matches
the expected epoch, and the number of contained writes matches
the number of ack'ed writes.
In this case, they counted both replicated and local-only writes,
but the peer only acknowledges those it has seen. We get a mismatch,
resulting in a protocol error and disconnect/reconnect cycle.
Messages logged are
"BAD! BarrierAck #%u received with n_writes=%u, expected n_writes=%u!\n"
A similar issue can also be triggered when starting a resync while
having a healthy replication link, by invalidating one side, forcing a
full sync, or attaching to a diskless node.
Fix this by closing the current epoch if the state changes in a way
that would cause the replication intent of the next write.
Epochs now contain either only non-replicated,
or only replicated writes.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Compiling drbd yields:
drivers/block/drbd/drbd_state.c: In function ‘_conn_request_state’:
drivers/block/drbd/drbd_state.c:1804:5: error: macro "wait_event_lock_irq" passed 4 arguments, but takes just 3
drivers/block/drbd/drbd_state.c:1801:3: error: ‘wait_event_lock_irq’ undeclared (first use in this function)
drivers/block/drbd/drbd_state.c:1801:3: note: each undeclared identifier is reported only once for each function it appears in
drivers/block/drbd/drbd_state.c: At top level:
drivers/block/drbd/drbd_state.c:1734:1: warning: ‘_conn_rq_cond’ defined but not used [-Wunused-function]
Due to drbd having copied the MD definition for wait_event_lock_irq()
as well. Kill them.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
is_valid_transition() might return SS_NOTHING_TO_DO.
The condition function _req_st_cond() returned SS_NOTHING_TO_DO, which
caused the wait_event to abort too early. Therefore drbd_req_state()
did not consume the next CL_ST_CHG_SUCCESS or SS_CW_FAILED_BY_PEER
causing serve disruption of the state machine logic...
Detaching from a single volue was one way to trigger this bug.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Without this, the meta-data gets updates after 5 seconds by the
md_sync_timer. Better to do it immeditaly after a state change.
If the asender detects a network failure, it may take a bit until
the worker processes the according after-conn-state-change work item.
The worker might be blocked in sending something, i.e. it
takes until it gets into its timeout. That is 6 seconds by
default which is longer than the 5 seconds of the md_sync_timer.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
With merging the commit
'drbd: Delay/reject other state changes while establishing a connection'
the condition check for clearing the flag was wrong.
Move the bit clearing to the __drbd_set_state() function
in order to have it already cleared for the other parts of
the function. I.e. clearing the susp_fen in the after_state_ch() function.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
When _conn_requests_state() is used to change other parts of the state
than the connection, do not check for a valid connection transition.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
The previous way of doing the state change was also okay since the
state change on the susp flag gets propagated from the mdev
to the tconn.
Fortunately all this goes away in drbd-9.0
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Documentation of mutex_unlock says
we must not use it in interrupt context.
So do not call it while holding the spin_lock_irq,
but give up the spinlock temporarily.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the preconditions for a state change change after the wait_event() we
might hit the BUG() statement in conn_set_state().
With holding the spin_lock while evaluating the condition AND until the
actual state change we ensure the the preconditions can not change anymore.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Disconnecting is a cluster wide state change. In case the peer node agrees
to the state transition, it sends back the fact on the meta-data connection
and closes both sockets.
In case the node node that initiated the state transfer sees the closing
action on the data-socket, before the P_STATE_CHG_REPLY packet, it was
going into one of the network failure states.
At least with the fencing option set to something else thatn "dont-care",
the unclean shutdown of the connection causes a short IO freeze or
a fence operation.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Requests of an acked epoch are stored on the barrier_acked_requests list. In
case the private bio of such a request completes while IO on the drbd device
is suspended [req_mod(completed_ok)] then the request stays there.
When thawing IO because the fence_peer handler returned, then we use
tl_clear() to apply the connection_lost_while_pending event to all requests
on the transfer-log and the barrier_acked_requests list.
Up to now the connection_lost_while_pending event was not applied
on requests on the barrier_acked_requests list. Fixed that.
I.e. now the connection_lost_while_pending and resend events are
applied to requests on the barrier_acked_requests list. For that
it is necessary that the resend event finishes (local only)
READS correctly.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
We now can schedule only a specific range of sectors for online verify,
or interrupt a running verify without interrupting the connection.
Had to bump the protocol version differently, we are now 101.
Added verify_can_do_stop_sector() { protocol >= 97 && protocol != 100; }
Also, the return value convention for worker callbacks has changed,
we returned "true/false" for "keep the connection up" in 8.3,
we return 0 for success and <= for failure in 8.4.
Affected: receive_state()
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
In case we want to hard-reset from the local-io-error handler,
we need to call it before notifying the peer or aborting local IO.
Otherwise the peer will advance its data generation UUIDs even
if secondary.
This way, local io error looks like a "regular" node crash,
which reduces the number of different failure cases.
This may be useful in a bigger picture where crashed or otherwise
"misbehaving" nodes are automatically re-deployed.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Fix asserts like
block drbd0: in got_BlockAck:4634: rs_pending_cnt = -35 < 0 !
We reset the resync lru cache and related information (rs_pending_cnt),
once we successfully finished a resync or online verify, or if the
replication connection is lost.
We also need to reset it if a resync or online verify is aborted
because a lower level disk failed.
In that case the replication link is still established,
and we may still have packets queued in the network buffers
which want to touch rs_pending_cnt.
We do not have any synchronization mechanism to know for sure when all
such pending resync related packets have been drained.
To avoid this counter to go negative (and violate the ASSERT that it
will always be >= 0), just do not reset it when we lose a disk.
It is good enough to make sure it is re-initialized before the next
resync can start: reset it when we re-attach a disk.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Aborting local requests (not waiting for completion from the lower level
disk) is dangerous: if the master bio has been completed to upper
layers, data pages may be re-used for other things already.
If local IO is still pending and later completes,
this may cause crashes or corrupt unrelated data.
Only abort local IO if explicitly requested.
Intended use case is a lower level device that turned into a tarpit,
not completing io requests, not even doing error completion.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
cherry-picked and adapted from drbd 9 devel branch
In 8.4, we don't distinguish between "resource work" and "connection
work" yet, we have one worker for both, as we still have only one connection.
We only ever used the "data.work",
no need to keep the "meta.work" around.
Move tconn->data.work to tconn->sender_work.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Symptom: messages similar to
"FIXME asender in bm_change_bits_to,
bitmap locked for 'write from resync_finished' by worker"
If a resync or verify is finished (or aborted), a full bitmap writeout
is triggered. If we have ongoing local IO, the bitmap may still change
during that writeout, pending and not yet processed acks may cause bits
to be cleared, while new writes may cause bits to be to be set.
To fix this, introduce the drbd_bm_write_copy_pages() variant.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
DRBD can freeze IO, due to fencing policy (fencing resource-and-stonith),
or because we lost access to data (on-no-data-accessible suspend-io).
Resuming from there (re-connect, or re-attach, or explicit admin
intervention) should "just work".
Unfortunately, if the re-attach/re-connect did not happen within
the timeout, since the commit
drbd: Implemented real timeout checking for request processing time
if so configured, the request_timer_fn() would timeout and
detach/disconnect virtually immediately.
This change tracks the most recent attach and connect, and does not
timeout within <configured timeout interval> after attach/connect.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Changes to the role and disk state should be delayed or rejected
while we establish a connection.
This is necessary, since the peer will base its resync decision
on the UUIDs and the state we sent in the drbd_connect() function.
The most prominent example for this race is becoming primary after
sending state and UUIDs and before the state changes to C_WF_CONNECTION.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
If the backing device is already frozen during attach, we failed
to recognize that. The current disk-timeout code works on top
of the drbd_request objects. During attach we do not allow IO
and therefore never generate a drbd_request object but block
before that in drbd_make_request().
This patch adds the timeout to all drbd_md_sync_page_io().
Before this patch we used to go from D_ATTACHING directly
to D_DISKLESS if IO failed during attach. We can no longer
do this since we have to stay in D_FAILED until all IO
ops issued to the backing device returned.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Commit d0ef827e (drbd: switch configuration interface from connector to
genetlink) introduced a regression by removing the ability to set all
bits in the out of sync bitmap and to suspend updates to the activity log
of a disconnected device via the invalidate-remote management call.
Credits for reporting the issue are going to Arne Redlich.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
For compatibility reasons 8.4 has to send P_STATE_CHG_REQ (instead
of P_CONN_ST_CHG_REQ) when disconnecting.
In the receiving code path we missed to convert the old
answer (P_STATE_CHG_REPLY) back to 8.4 logic. Therefore
the CL_ST_CHG_SUCCESS or CL_ST_CHG_FAIL bit in the flags word
of mdev got set, while the state code was waiting for
the CONN_WD_ST_CHG_OKAY or CONN_WD_ST_CHG_FAIL bits in tconn.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
I.e. in C_WF_REPORT_PARAMS or in C_WF_CONNECTION.
Sending may already work in these cstates, but the peer still expects
the HandShake / ConnectionFeatures packet.
Actually triggered by the Testuite on kugel.
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
* drbd-8.3:
drbd: O_SYNC gives EIO on ramdisks for some kernels (eg. RHEL6).
drbd: send intermediate state change results to the peer
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>