Example: Using MapR Metrics to Diagnose a Faulty Network Interface Card (NIC)

In this example, a node in your cluster has a NIC that is intermittently failing. This condition is leading to abnormally long task completion times due to that node being occasionally unreachable. In the Metrics interface, you can display a job's average and maximum task attempt durations for both map and reduce attempts. A high variance between the average and maximum attempt durations suggests that some task attempts are taking an unusually long time. You can sort the list of jobs by maximum map task attempt duration to find jobs with such an unusually high variance.

Click the name of a job name to display information about the job's tasks, then sort the task attempt list by duration to find the outliers. Because the list of tasks includes information about the node the task is running on, you can see that several of these unusually long-running task attempts are assigned to the same node. This information suggests that there may be an issue with that specific node that is causing task attempts to take longer than usual.

When you display summary information for that node, you can see that the Network I/O speeds are lower than the speeds for other similarly configured nodes in the cluster. You can use that information to examine the node's network I/O configuration and hardware and diagnose the specific cause.