Managing Third-Party Libraries

Any third-party library that is required by a MapReduce program must be accessible to the data node that processes the application.

A data node is a node in the cluster that includes the NodeManager role. You can provide the third-party libraries when you submit the program or you can install the third-party libraries on each node that processes the application.

Include the third-party libraries with each program

Including the third-party libraries with each program is the preferred method.

Perform one of the following operations to include the third-party libraries when you submit the program:

  • Package the third-party libraries with the MapReduce jar file. The benefit of this method is that neither the node from which you submit the program nor the node that runs the program needs to have the library files installed.

  • Use the -libjars parameter to specify the third-party libraries on the command line. With this option, the library files are submitted to the data node along with the program. The benefit of this method is that the node that runs the program does not need to have the library files installed. However, the node that submits the program must have the library files installed.
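The -libjars option above can be sketched as follows. This is a minimal example under stated assumptions: the jar paths, myapp.jar, com.example.MyDriver, and the /input and /output directories are hypothetical placeholders, and join_libjars is a helper introduced only for this illustration. The option takes a comma-separated list of jar files. It is a Hadoop generic option, so it is normally honored only when the driver parses generic options (for example, by running through ToolRunner), and it should appear before the application's own arguments.

```shell
#!/bin/sh
# Hypothetical helper: join its arguments into the comma-separated
# list that -libjars expects.
join_libjars() {
  ( IFS=,; echo "$*" )
}

# Hypothetical library locations on the submitting node.
LIBJARS=$(join_libjars /home/user/lib/first-lib.jar /home/user/lib/second-lib.jar)

# Print the submit command for review. Remove the leading 'echo'
# to actually submit the job (requires the hadoop client).
echo hadoop jar myapp.jar com.example.MyDriver -libjars "$LIBJARS" /input /output
```

Printing the command first makes it easy to check the jar list before the job is submitted.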

Install the third-party libraries on each node that runs the program

You can also install the third-party libraries on each data node. However, this method is less preferred because it can lead to conflicts between library versions or library files.

Perform one of the following operations to install the third-party libraries on each data node:

  • Install the third-party libraries in the following directory on each node with the NodeManager role: /opt/mapr/hadoop/hadoop-2.x/share/hadoop/common

  • On each node with the NodeManager role, install the required third-party libraries and then specify the locations of the third-party libraries with the HADOOP_CLASSPATH environment variable in the env.sh file. The env.sh file is located in the following directory: /opt/mapr/conf
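For the HADOOP_CLASSPATH option, the addition to env.sh might look like the sketch below. The directory /opt/thirdparty/lib and the jar name example-lib.jar are assumptions made for illustration; substitute the location where the libraries are actually installed on each node.

```shell
# Hypothetical install location of a third-party library on this node.
THIRD_PARTY_JAR=/opt/thirdparty/lib/example-lib.jar

# Append the jar to HADOOP_CLASSPATH, keeping any existing value.
# The ${VAR:+...} form avoids a leading ':' when the variable is unset.
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$THIRD_PARTY_JAR"
```

Appending rather than overwriting keeps any entries that other tools have already placed in HADOOP_CLASSPATH.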