Managing Third-Party Libraries for MapReduce

Any third-party library that a MapReduce program requires must be accessible to the data node that processes the job or application. A data node is a node in the cluster that includes the TaskTracker role, the NodeManager role, or both. You can either provide the third-party libraries when you submit the program or install them on each node that processes the job or application.

Include the third-party libraries with each program

Including the third-party libraries with each program is the preferred method.

Perform one of the following operations to include the third-party libraries when you submit the program:

  • Package the third-party libraries with the MapReduce jar file. The benefit of this method is that neither the node from which you submit the program nor the node that runs the program is required to have the library files installed.

  • Use the -libjars parameter to specify the third-party libraries on the command line. With this option, the library files are submitted to the data node along with the program. The benefit of this method is that the node that runs the program does not need to have the library files installed. However, the node that submits the program must have the library files installed.
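The -libjars option can be sketched as follows. This is a minimal illustration rather than a definitive command: the jar paths, the my-job.jar file, the com.example.MyJob class, and the /input and /output arguments are all hypothetical, and join_jars is a helper defined here only for the example. Note that -libjars is a generic Hadoop option, so it takes effect only when the program's driver parses generic options (for example, by running through ToolRunner), and it takes a comma-separated list of jar files.

```shell
#!/usr/bin/env bash
# Join jar paths with commas, the separator that -libjars expects.
join_jars() {
  local IFS=,
  echo "$*"
}

# Hypothetical library paths, used only for illustration.
LIBJARS=$(join_jars /home/user/libs/parser.jar /home/user/libs/codec.jar)

# Print the command instead of running it, so the sketch works without a cluster.
# Generic options such as -libjars go after the main class and before
# the program's own arguments.
echo hadoop jar my-job.jar com.example.MyJob -libjars "$LIBJARS" /input /output
```

Remove the echo to submit the program for real. Keeping the list in a variable makes it easy to reuse the same set of libraries across several submissions.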

Install the third-party libraries on each node that runs the program

You can also install the third-party libraries on each data node. However, this method is not preferred because it can lead to conflicts between library versions or library files.

Perform one of the following operations to install the third-party libraries on each data node:

  • Install the third-party libraries in the Hadoop library directory that corresponds to the framework that will be used to run the program:

    • For classic mode, install the third-party libraries in the following directory on each TaskTracker node: /opt/mapr/hadoop/hadoop-0.20.2/lib

    • For YARN mode, install the third-party libraries in the following directory on each NodeManager node: /opt/mapr/hadoop/hadoop-2.x/share/hadoop/common

  • On each node with the TaskTracker or NodeManager role, install the required third-party libraries and then specify their locations with the HADOOP_CLASSPATH environment variable in the env.sh file. The env.sh file is located in the following directory: /opt/mapr/conf
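For the HADOOP_CLASSPATH approach, the entry in env.sh might look like the sketch below. The /opt/thirdparty/lib directory is a hypothetical install location chosen for illustration; substitute the directory where you actually installed the libraries.

```shell
# Fragment for /opt/mapr/conf/env.sh (sketch).
# /opt/thirdparty/lib is a hypothetical install location; substitute your own.
# Entries are colon-separated. The quoted trailing /* is a Java classpath
# wildcard that picks up every jar file in the directory.
# The ${VAR:+...} expansion keeps any value that is already set and avoids
# a leading colon when the variable is empty.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}/opt/thirdparty/lib/*"
```

Appending to the existing value rather than overwriting it preserves any classpath entries that were set earlier in the file or in the environment.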