Recovering from Disk Failure
Most software failures can be remedied by running the fsck utility, which scans the storage pool that the disk belongs to and reports errors.
For hardware failures, remove the failed disk and replace it according to the procedure in Removing and Replacing Disks.
The following table lists types of failures and recommended courses of action:
Error | Failure Reason | Recommended Course of Action |
---|---|---|
I/OTimeOut Error | The default value of the mfs.disk.io.timeout parameter is 60 seconds. An I/O is declared stuck after 3 times the value of this parameter (3 x mfs.disk.io.timeout, or 180 seconds by default). The disk is taken offline if even a single I/O has not completed within that time. | |
No such device | The $INSTALL_DIR/conf/disktab file contains “/MissingDisk” or references a disk path that is not found in the /proc/partitions file. | Run mrdisk <device path> to determine whether a disk is formatted for MapR-FS. Also check the device paths in the $INSTALL_DIR/conf/disktab file, which contains the disk path and disk GUID used to load the disks in MFS. If the disk paths have been renamed, fix them or run the disksetup -X command to regenerate disktab from /proc/partitions. Alternatively, restart MFS to resolve disk name changes. If the problem persists, contact MapR support. |
ENODEV: MissingDisk# Error: disktab file contains a /MissingDisk# entry | The disk corresponding to a GUID is missing, and the disk path recorded for it in the disktab file now belongs to another disk. When an attempt is made to automatically fix the disktab file, the entry is replaced with a /MissingDisk# path. | If the disk corresponding to a GUID is permanently lost, remove its line from the disktab file. Alternatively, run the maprcli disk remove _MissingDisk# command, where # is the disk number, and then restart MFS. |
EIO Error | I/O error. This could be due to a bad block or a bad disk. The system takes the storage pool (SP) offline after one final attempt to complete the I/O. | Check /var/log/messages for errors from the disk drivers. |
CRC error | This could be due to a bad block or a bit flip on the disk. The SP is taken offline immediately. | Run fsck -n <sp> -d to perform a cyclic redundancy check (CRC) on the data blocks in the storage pool, then bring it back online. To load all the SPs to the list of SPs, run: To bring all SPs back online, run: To bring specific SPs back online, run: |
SlowDisk Error | The default value of the mfs.disk.io.timeout parameter is 60 seconds. An I/O is declared slow once it has taken as long as the value of this parameter (1 x mfs.disk.io.timeout). Thirty or more slow I/O completions on the same disk within a short span of time (5 seconds) are recorded as one slow event. The SP is taken offline if 3 such events are recorded within an hour. NOTE: After an hour, MapR-FS resets the tracking count to 0. | |
GUID of disk mismatches with the one in $INSTALL_DIR/conf/disktab | It's possible that disk names have changed. | After a node restart, the operating system can reassign the drive labels (for example, /sda), so that the drive labels no longer match the entries in the disktab file. The disktab file contains the disk path and disk GUID used to load the disks in MFS. Run $INSTALL_DIR/server/disksetup -X to update the disktab file: the command looks up the disks in /proc/partitions and makes the disk paths match the GUIDs. |
Unknown error | | Contact MapR support. |
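
The thresholds in the I/OTimeOut and SlowDisk rows can be easier to reason about as a small model. The following Python sketch is illustrative only and is not MapR-FS code: the names `classify_io` and `SlowDiskTracker` are hypothetical, and the exact boundary handling (for example, whether an I/O lasting exactly 60 seconds counts as slow, and whether the one-hour window starts at the first slow event) is an assumption, since the table does not specify it.

```python
from collections import deque

IO_TIMEOUT_S = 60          # default mfs.disk.io.timeout, per the table
STUCK_FACTOR = 3           # an I/O is stuck at 3 x the timeout
SLOW_IOS_PER_EVENT = 30    # slow completions that make up one slow event
SLOW_WINDOW_S = 5          # span in which those completions must occur
EVENTS_TO_OFFLINE = 3      # slow events that take the SP offline
EVENT_WINDOW_S = 3600      # slow events are counted within one hour


def classify_io(duration_s, timeout_s=IO_TIMEOUT_S):
    """Return 'stuck', 'slow', or 'ok' for one I/O duration in seconds."""
    if duration_s >= STUCK_FACTOR * timeout_s:
        return "stuck"
    if duration_s >= timeout_s:
        return "slow"
    return "ok"


class SlowDiskTracker:
    """Hypothetical per-disk model of the SlowDisk rule."""

    def __init__(self):
        self.slow_completions = deque()  # timestamps of recent slow I/Os
        self.events = 0                  # slow events in the current hour
        self.window_start = None         # time of the first counted event

    def record_slow_completion(self, now_s):
        """Record one slow I/O; return True if the SP would go offline."""
        # Assumption: tracking resets one hour after the first slow event.
        if (self.window_start is not None
                and now_s - self.window_start >= EVENT_WINDOW_S):
            self.events = 0
            self.window_start = None

        # Keep only the slow completions from the last 5 seconds.
        self.slow_completions.append(now_s)
        while now_s - self.slow_completions[0] > SLOW_WINDOW_S:
            self.slow_completions.popleft()

        # Thirty or more slow completions in the window form one event.
        if len(self.slow_completions) >= SLOW_IOS_PER_EVENT:
            self.slow_completions.clear()
            if self.events == 0:
                self.window_start = now_s
            self.events += 1

        return self.events >= EVENTS_TO_OFFLINE
```

With the default timeout, `classify_io(59)` returns `'ok'`, `classify_io(60)` returns `'slow'`, and `classify_io(180)` returns `'stuck'`. Feeding the tracker three bursts of 30 slow completions within the same hour makes the final call return `True`.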
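
Before editing disktab or running disksetup -X for the "No such device" and MissingDisk cases, it can help to see which entries no longer match the kernel's view of the disks. The following Python sketch is a read-only check under stated assumptions: it assumes each non-empty disktab line starts with the disk path followed by the GUID, separated by whitespace, and that lines starting with `#` are comments. The table only says that the file contains the disk path and GUID, so verify the layout on your node. The function names are hypothetical.

```python
def partition_names(proc_partitions_text):
    """Return the device names listed in /proc/partitions content."""
    names = set()
    for line in proc_partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        if len(fields) == 4 and fields[0].isdigit():
            names.add(fields[3])
    return names


def check_disktab(disktab_text, proc_partitions_text):
    """Return (path, reason) pairs for suspicious disktab entries."""
    known = partition_names(proc_partitions_text)
    problems = []
    for line in disktab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        path = line.split()[0]  # assumed: the path is the first field
        if path.startswith("/MissingDisk"):
            problems.append((path, "placeholder for a missing disk"))
        elif path.rsplit("/", 1)[-1] not in known:
            # Compares only the final path component, e.g. "sdb".
            problems.append((path, "not listed in /proc/partitions"))
    return problems
```

To use it on a node, read both files and print the result, for example `check_disktab(open(disktab_path).read(), open("/proc/partitions").read())`, where `disktab_path` is the expanded $INSTALL_DIR/conf/disktab path. Because only the final path component is compared, paths that the kernel lists under a different name (such as device-mapper aliases) may be flagged even though the disk is present.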