Chapter 7. Troubleshooting and error recovery

Table of Contents

7.1. Dealing with hard drive failure
7.1.1. Manually detaching DRBD from your hard drive
7.1.2. Automatic detach on I/O error
7.1.3. Replacing a failed disk when using internal meta data
7.1.4. Replacing a failed disk when using external meta data
7.2. Dealing with node failure
7.2.1. Dealing with temporary secondary node failure
7.2.2. Dealing with temporary primary node failure
7.2.3. Dealing with permanent node failure
7.3. Manual split brain recovery

This chapter describes tasks to be performed in the event of hardware or system failures.

7.1. Dealing with hard drive failure

How to deal with hard drive failure depends on the way DRBD is configured to handle disk I/O errors (see Section 2.11, “Disk error handling strategies”), and on the type of meta data configured (see Section 17.1, “DRBD meta data”).
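For reference, the error handling policy is selected with the on-io-error option in a resource's disk section. The following is a minimal sketch; the resource name r0 is illustrative:

resource r0 {
  disk {
    on-io-error detach;      # recommended: detach from the backing device on I/O error
    # on-io-error pass_on;   # not recommended: pass I/O errors on to the upper layers
  }
}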

Note

For the most part, the steps described here apply only if you run DRBD directly on top of physical hard drives. They generally do not apply if you are running DRBD layered on top of

  • an MD software RAID set (in this case, use mdadm to manage drive replacement; see the sketch after this list),
  • device-mapper RAID (use dmraid),
  • a hardware RAID appliance (follow the vendor’s instructions on how to deal with failed drives),
  • some non-standard device-mapper virtual block devices (see the device mapper documentation).
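For the MD software RAID case, drive replacement is handled entirely below DRBD with mdadm. A sketch follows, with hypothetical array and partition names:

mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
# after physically replacing the drive and re-creating the partition:
mdadm --manage /dev/md0 --add /dev/sdb1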

7.1.1. Manually detaching DRBD from your hard drive

If DRBD is configured to pass on I/O errors (not recommended), you must first detach the DRBD resource, that is, disassociate it from its backing storage:

drbdadm detach <resource>

By running the drbdadm dstate command, you can verify that the resource is now running in diskless mode:

drbdadm dstate <resource>
Diskless/UpToDate

If the disk failure has occurred on your primary node, you may combine this step with a switch-over operation.
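A sketch of such a combined operation follows, assuming a resource named r0 whose primary role is moved by hand; in a cluster-manager-controlled setup, migrate the service instead:

# on the node with the failed disk (currently primary):
drbdadm detach r0
# after moving all services that use the device to the peer:
drbdadm secondary r0
# on the peer node:
drbdadm primary r0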

7.1.2. Automatic detach on I/O error

If DRBD is configured to automatically detach upon I/O error (the recommended option), DRBD should have automatically detached the resource from its backing storage already, without manual intervention. You may still use the drbdadm dstate command to verify that the resource is in fact running in diskless mode.
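For example, on the node with the failed disk (the resource name r0 is illustrative):

drbdadm dstate r0
Diskless/UpToDate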

7.1.3. Replacing a failed disk when using internal meta data

If using internal meta data, it is sufficient to bind the DRBD device to the new hard disk. If the new hard disk must be addressed by a different Linux device name than the defective one, update the DRBD configuration file accordingly.
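For example, if the replacement disk appears under a new device name, point the disk entry in the corresponding on <host> section at it. Host, resource, and device names in this sketch are illustrative:

resource r0 {
  on alice {
    device    /dev/drbd1;
    disk      /dev/sdc1;    # was /dev/sdb1 before the replacement
    address   10.1.1.31:7789;
    meta-disk internal;
  }
}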

This process involves creating a new meta data set, then re-attaching the resource:

drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initializing activity log
NOT initializing bitmap
New drbd meta data block successfully created.

drbdadm attach <resource>

Full synchronization of the new hard disk starts immediately and automatically. You will be able to monitor the synchronization’s progress via /proc/drbd, as with any background synchronization.
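For example, to follow the resynchronization continuously, refreshing once per second:

watch -n1 cat /proc/drbd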

7.1.4. Replacing a failed disk when using external meta data

When using external meta data, the procedure is basically the same. However, DRBD cannot detect on its own that the hard drive was swapped, so an additional step is required:

drbdadm create-md <resource>
v08 Magic number not found
Writing meta data...
initializing activity log
NOT initializing bitmap
New drbd meta data block successfully created.

drbdadm attach <resource>
drbdadm invalidate <resource>

Here, the drbdadm invalidate command marks the local data as out-of-sync and thereby triggers a full synchronization from the peer. Again, sync progress may be observed via /proc/drbd.
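Once the synchronization has completed, both disk states should report UpToDate (the resource name r0 is illustrative):

drbdadm dstate r0
UpToDate/UpToDate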