How to Include Third-Party Libraries in Your Map-Reduce Job

Shared from http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem, this blog post is for you.

Java requires third-party and user-defined classes to be listed in the “-classpath” option on the command line when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in the /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce your job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?

Map-Reduce jobs are executed in separate JVMs on TaskTrackers, and sometimes you need to use third-party libraries in the map/reduce task attempts. For example, you might want to access HBase from within your map tasks. One way to do this is to package every class you use into the submittable JAR. You would have to unpack the original hbase-.jar and repackage all of its classes into your submittable Hadoop JAR. Not good. Don’t do this: the version compatibility issues are going to bite you sooner or later.

There are better ways to achieve the same result: put your JAR in the distributed cache, bundle it inside the submittable JAR, or install the whole JAR on the Hadoop nodes and tell the TaskTrackers where it is.

1. Include the JAR in the “-libjars” command line option of the `hadoop jar …` command. The JAR will be placed in the distributed cache and made available to all of the job’s task attempts. More specifically, you will find the JAR in one of the ${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/… subdirectories on the local nodes. The advantage of the distributed cache is that your JAR might still be there on your next program run. At least in theory: files should be kicked out of the distributed cache only when they exceed the soft limit defined by the local.cache.size configuration variable, which defaults to 10GB, but your actual mileage can vary, particularly with the newest security enhancements. Hadoop keeps track of changes to the distributed cache files by examining their modification timestamps.

2. Include the referenced JAR in the lib subdirectory of the submittable JAR: a MapReduce job will unpack the JAR from this subdirectory into ${mapred.local.dir}/taskTracker/${user.name}/jobcache/$jobid/jars on the TaskTracker nodes and point your tasks to this directory to make the JAR available to your code. If the JARs are small, change often, and are job-specific, this is the preferred method.

3. Finally, you can install the JAR on the cluster nodes. The easiest way is to place the JAR into the $HADOOP_HOME/lib directory, as everything in this directory is included when a Hadoop daemon starts. However, since you know that only the TaskTrackers need the new JAR, a better way is to modify the HADOOP_TASKTRACKER_OPTS option in the hadoop-env.sh configuration file. This method is preferred if the JAR is tied to the code running on the nodes, like HBase.

HADOOP_TASKTRACKER_OPTS="-classpath <colon-separated-paths-to-your-jars>"

Restart the TaskTrackers when you are done. Do not forget to update the JAR when the underlying software changes.
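To make option 1 more concrete, here is a minimal sketch of how the “-libjars” argument could be assembled. It joins a set of JAR paths with commas, which is the separator “-libjars” expects, and prints the resulting `hadoop jar` command instead of running it. The helper `join_by`, the JAR paths, the job JAR name, and the class `com.example.MyJob` are all made up for illustration. The sketch also assumes that the job's main class parses Hadoop's generic options (for example by running through ToolRunner); otherwise “-libjars” is not interpreted.

```shell
#!/bin/sh
# join_by is a hypothetical helper (not part of Hadoop): it joins all
# remaining arguments using the first argument as the separator.
join_by() {
    sep=$1
    shift
    out=""
    first=1
    for item in "$@"; do
        if [ "$first" -eq 1 ]; then
            out=$item
            first=0
        else
            out="$out$sep$item"
        fi
    done
    printf '%s\n' "$out"
}

# Hypothetical JAR locations on the machine that submits the job.
LIBJARS=$(join_by , /opt/libs/hbase.jar /opt/libs/zookeeper.jar)

# Generic options such as -libjars go after the main class and before the
# job's own arguments. The command is only printed, so the sketch can be
# tried without a cluster.
echo "hadoop jar my-job.jar com.example.MyJob -libjars $LIBJARS input output"
```

Once the printed command looks right, it can be run by hand against a real cluster.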

All of the above options affect only the code running on the distributed nodes. If your code that launches the Hadoop job uses the same library, you need to include the JAR in the HADOOP_CLASSPATH environment variable as well:

HADOOP_CLASSPATH="<colon-separated-paths-to-your-jars>"

Note that starting with Java 1.6, the classpath can contain wildcard entries such as “/path/to/your/jars/*”, which pick up all JARs from the given directory.
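As a rough sketch of the two ways to fill HADOOP_CLASSPATH, the snippet below builds an explicit colon-separated list from every JAR in a directory and also shows the wildcard form as a commented-out alternative. The helper name `build_classpath` is hypothetical and the directory is the placeholder path used above; adjust both to your setup.

```shell
#!/bin/sh
# build_classpath is a hypothetical helper: it prints a colon-separated
# list of every *.jar file found directly inside the given directory.
build_classpath() {
    dir=$1
    cp=""
    for jar in "$dir"/*.jar; do
        # Skip the unexpanded pattern when the directory holds no JARs.
        [ -e "$jar" ] || continue
        # Append with a leading colon only if cp is already non-empty.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}

# Explicit list of JARs (placeholder directory, empty result if it does not exist).
HADOOP_CLASSPATH=$(build_classpath /path/to/your/jars)
export HADOOP_CLASSPATH

# Wildcard alternative: keep the quotes so that the shell passes the "*"
# through unchanged and the JVM expands it.
# export HADOOP_CLASSPATH="/path/to/your/jars/*"
```

The explicit list makes it easy to see exactly which JARs end up on the classpath, while the wildcard form stays short and picks up JARs added to the directory later.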

The same guiding principles apply to native code libraries that need to be run on the nodes (JNI or C++ pipes). You can put them into the distributed cache with the “-files” option, include them in archive files specified with the “-archives” option, or install them on the cluster nodes. If the dynamic library linker is configured properly, the native code should be made available to your task attempts. You can also modify the environment of the job’s running task attempts explicitly by specifying the JAVA_LIBRARY_PATH or LD_LIBRARY_PATH variables:

hadoop jar <your jar> [main class] \
      -D mapred.child.env="LD_LIBRARY_PATH=/path/to/your/libs" ...