Recovering from Disk Failure

Most software failures can be remedied by running the fsck utility, which scans the storage pool that the disk belongs to and reports errors. For hardware failures, remove the failed disk and replace it according to the procedure in Removing and Replacing Disks.

The following table lists types of failures and recommended courses of action:

Error	Failure Reason	Recommended Course of Action
I/OTimeOut Error	The default value of the `mfs.io.disk.timeout` parameter is 60 seconds. An I/O is declared stuck after 3 times the value of this parameter (3 x `mfs.io.disk.timeout`, or 180 seconds by default). The disk is taken offline if even a single I/O remains incomplete for that long.	Check whether the disks are healthy and still reliable. If they are, increase the value of the `mfs.io.disk.timeout` parameter in the /opt/mapr/conf/mfs.conf file. Otherwise, replace the disks.
No such device	The $INSTALL_DIR/conf/disktab file contains `/MissingDisk` or references a disk path not found in the /proc/partitions file.	Run mrdisk <device path> to determine whether a disk is formatted for MapR-FS. Also check the device paths in the $INSTALL_DIR/conf/disktab file. The disktab file contains the disk paths and disk GUIDs that are used to load the disks in MFS. If the disk paths have been renamed, fix them or run the disksetup -X command to regenerate disktab from /proc/partitions. Alternatively, restart MFS to resolve disk name changes. If the problem persists, contact MapR support.
ENODEV: MissingDisk# Error: disktab file contains a /MissingDisk# entry	A disk corresponding to a GUID is missing, and the corresponding disk path in the `disktab` file belongs to another disk. When an attempt is made to automatically fix the `disktab` file, this entry is replaced with a /MissingDisk# path.	If the disk corresponding to a GUID is permanently lost, remove its line from the `disktab` file. Alternatively, run the `maprcli disk remove _MissingDisk#` command, where # is the disk number, and then restart MFS.
EIO Error	I/O error. This could be due to a bad block or a bad disk. The system takes the storage pool (SP) offline after one final attempt to complete the I/O.	Check /var/log/messages for errors from the disk drivers.
CRC error	This could be due to a bad block or a bit flip on the disk. The SP is taken offline immediately.	Run `fsck -n <sp> -d` to perform a CRC (cyclic redundancy check) on the data blocks in the storage pool, then bring it back online. To load all SPs into the list of SPs, run `mrconfig disk load` or `mrconfig sp load`. To bring all SPs back online, run `mrconfig sp refresh`. To bring specific SPs back online, run `mrconfig sp online <sp path>`.
SlowDisk Error	The default value of the `mfs.io.disk.timeout` parameter is 60 seconds. An I/O is declared slow when it takes longer than the value of this parameter (1 x `mfs.io.disk.timeout`). Thirty or more slow I/O completions on the same disk within 5 seconds are recorded as one slow event. The SP is taken offline if 3 such events are recorded within an hour. NOTE: After an hour, MapR-FS resets the event count to 0.	Check whether the disks are healthy and still reliable. If they are, increase the value of the `mfs.io.disk.timeout` parameter in the /opt/mapr/conf/mfs.conf file. Otherwise, replace the disks.
GUID of disk mismatches with the one in `$INSTALL_DIR/conf/disktab`	The disk names may have changed. After a node restart, the operating system can reassign the drive labels (for example, `/sda`), so that they no longer match the entries in the `disktab` file.	The `disktab` file contains the disk paths and disk GUIDs that are used to load the disks in MFS. Run $INSTALL_DIR/server/disksetup -X to update the disktab file: the command looks up the disks in /proc/partitions and makes the disk paths match the GUIDs.
Unknown error		Contact MapR support.
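
The slow and stuck thresholds in the I/OTimeOut and SlowDisk rows both derive from one parameter, so it can help to compute them from the configuration file before changing it. The following is a minimal sketch, not a MapR tool: `io_thresholds` is a hypothetical helper, and it assumes the parameter is stored as a `key=value` line named `mfs.io.disk.timeout`. If the file is unreadable or the line is absent, it falls back to the 60-second default given in the table.

```shell
#!/bin/sh
# Sketch: derive the slow (1x) and stuck (3x) I/O thresholds from an
# mfs.conf-style file. io_thresholds is a hypothetical helper name.
# Assumption: the parameter is written as "mfs.io.disk.timeout=<seconds>".

io_thresholds() {
    conf_file=$1
    timeout=""
    if [ -r "$conf_file" ]; then
        # Use the last uncommented assignment; tolerate spaces around "=".
        timeout=$(sed -n 's/^[[:space:]]*mfs\.io\.disk\.timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$conf_file" | tail -n 1)
    fi
    # Fall back to the documented default of 60 seconds.
    timeout=${timeout:-60}
    echo "timeout=${timeout}s slow=$((timeout * 1))s stuck=$((timeout * 3))s"
}
```

For example, `io_thresholds /opt/mapr/conf/mfs.conf` prints the configured timeout together with the time after which an I/O counts as slow and the time after which it counts as stuck.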
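
The recovery steps in the CRC error row can be chained so that each step runs only if the previous one succeeded. The sketch below is illustrative only: `recover_sp`, `run_step`, and `DRY_RUN` are hypothetical names, and the tool locations under /opt/mapr/server are an assumption based on the disksetup path shown in the table. Full paths are used so that the operating system's own `fsck` is not invoked by mistake. By default the function only prints the commands.

```shell
#!/bin/sh
# Sketch of the CRC error recovery sequence from the table.
# Assumption: the MapR fsck and mrconfig tools live under /opt/mapr/server.
MAPR_FSCK=${MAPR_FSCK:-/opt/mapr/server/fsck}
MRCONFIG=${MRCONFIG:-/opt/mapr/server/mrconfig}

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_sp SP SP_PATH
#   SP       value for the <sp> placeholder of fsck
#   SP_PATH  value for the <sp path> placeholder of "mrconfig sp online"
recover_sp() {
    run_step "$MAPR_FSCK" -n "$1" -d &&
    run_step "$MRCONFIG" sp load &&
    run_step "$MRCONFIG" sp online "$2"
}
```

Calling `recover_sp SP1 /dev/sdb` (both values are placeholders) lists the three commands for review. Setting `DRY_RUN=0` executes them, stopping at the first failure.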