Integrate Pig and Apache HBase

About this task

To configure Pig to work with Apache HBase tables, perform the following steps:

Procedure

  1. On the client node where Pig is installed, add the following line to /opt/mapr/conf/env.sh, replacing /location-to-hbase-jar with the path described in step 2 or step 3:
    export PIG_CLASSPATH=$PIG_CLASSPATH:/location-to-hbase-jar
  2. If the client node where Pig is installed also has either the mapr-hbase-regionserver or the mapr-hbase-master package installed, use the location of the hbase-<version>.jar file in the PIG_CLASSPATH variable from the previous step:
    export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-<version>/hbase-<version>.jar"
  3. If the client node where Pig is installed does not have any HBase packages installed, copy the HBase JAR from a node that does have HBase installed to a location on the Pig client node. Use that location in the PIG_CLASSPATH definition from step 1:
    export PIG_CLASSPATH=$PIG_CLASSPATH:/opt/mapr/lib/hbase-<version>.jar
  4. List the cluster's ZooKeeper nodes:
    maprcli node listzookeepers
  5. Add the following variable to the /opt/mapr/conf/env.sh file:
    export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=<comma-separated list of ZooKeeper IP addresses>"
  6. Launch a Pig job and verify that Pig can access HBase tables by using the HBase table name directly. Do not use the hbase:// prefix.
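
Steps 4 and 5 can be sketched as a pair of small shell helpers. This is a minimal sketch, not part of the MapR tooling: the function names zk_quorum_from_list and pig_opts_line are hypothetical, and the sketch assumes that the ZooKeeper list is available as comma-separated host:port entries. Check the actual output of maprcli node listzookeepers on your cluster before relying on it.

```shell
# Hypothetical helper: turn a "host:port,host:port" list into the
# comma-separated host list used for hbase.zookeeper.quorum.
# Entries without a port are passed through unchanged.
zk_quorum_from_list() {
    printf '%s\n' "$1" | sed -E 's/:[0-9]+(,|$)/\1/g'
}

# Hypothetical helper: print the PIG_OPTS line for env.sh.
# Arguments: <quorum> [client port, defaults to 5181 as in step 5]
pig_opts_line() {
    printf 'export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=%s -Dhbase.zookeeper.quorum=%s"\n' \
        "${2:-5181}" "$1"
}
```

For example, pig_opts_line "$(zk_quorum_from_list '10.10.80.61:5181,10.10.80.62:5181')" prints a single export line that can be reviewed and then added to /opt/mapr/conf/env.sh.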

Example

Sample env.sh file for HBase and Pig integration

[root@nmk-centos-60-3 ~]# cat /opt/mapr/conf/env.sh 
#!/bin/bash
# Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved
# Please set all environment variable you want to be used during MapR cluster
# runtime here.
# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS

export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63"
export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-<version>/conf:/usr/java/default/lib/tools.jar:/opt/mapr/hbase/hbase-<version>:/opt/mapr/hbase/hbase-<version>/hbase-<version>.jar"
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$PIG_CLASSPATH"
export CLASSPATH="$CLASSPATH:$HADOOP_CLASSPATH"
#export JAVA_HOME=
#export MAPR_SUBNETS=
#export MAPR_HOME=
#export MAPR_ULIMIT_U=
#export MAPR_ULIMIT_N=
#export MAPR_SYSCTL_SOMAXCONN=
#export PIG_CLASSPATH=:$PIG_CLASSPATH
[root@nmk-centos-60-3 ~]# 
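
A classpath like the one in the sample file is easy to get wrong, for example when the <version> placeholder is left unreplaced. The following is a minimal sketch of a check that lists classpath entries that do not exist on the local node. The function name missing_classpath_entries is hypothetical and not part of Pig or MapR.

```shell
# Hypothetical helper: print every entry of a colon-separated classpath
# that does not exist on disk. Prints nothing if all entries exist.
# The function body runs in a subshell so that the IFS and globbing
# changes do not leak into the calling shell.
missing_classpath_entries() (
    IFS=:
    set -f   # disable globbing so entries such as /lib/* are not expanded
    for entry in $1; do
        [ -n "$entry" ] || continue            # skip empty entries (leading or doubled ':')
        [ -e "$entry" ] || printf '%s\n' "$entry"
    done
)
```

After sourcing /opt/mapr/conf/env.sh, running missing_classpath_entries "$PIG_CLASSPATH" should print nothing when every JAR and directory is in place.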

Sample HBase insertion script

[root@nmk-centos-60-3 nabeel]# cat hbase_pig.pig 
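-- Read the comma-separated input file into three string fields.
raw_data = LOAD '/user/mapr/input2.csv' USING PigStorage(',') AS (
listing_id: chararray,
fname: chararray,
lname: chararray );

-- The first field (listing_id) is used as the HBase row key; the remaining
-- fields are written to the listed columns. The table name has no hbase:// prefix.
STORE raw_data INTO 'sample_names' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage (
'info:fname info:lname');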
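
To verify the insertion, the rows can be read back through the same storage class. The following is a minimal sketch: the helper name write_readback_script and the file name readback.pig are hypothetical, and the sketch assumes that the sample_names table already exists in HBase with an info column family. The -loadKey true option asks HBaseStorage to return the row key as the first field.

```shell
# Hypothetical helper: write a Pig script that reads a table back
# through HBaseStorage and dumps the rows to the console.
# Arguments: <output file> <HBase table name>
write_readback_script() {
    cat > "$1" <<EOF
-- Load the row key plus the two columns written by the insertion script.
names = LOAD '$2' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'info:fname info:lname', '-loadKey true') AS (
listing_id: chararray,
fname: chararray,
lname: chararray );

DUMP names;
EOF
}
```

For example, write_readback_script readback.pig sample_names followed by pig readback.pig runs the generated script with the settings from /opt/mapr/conf/env.sh.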