Getting Started with Hive and MapR-DB Integration
In this tutorial we will:
- Create a Hive table
- Populate the Hive table with data from a text file
- Query the Hive table
- Create a Hive-MapR-DB table
- Introspect the Hive-MapR-DB table from the HBase shell
- Populate the Hive-MapR-DB table with data from the Hive table
- Query the Hive-MapR-DB table from Hive
- Convert an existing MapR-DB table into a Hive-MapR table
Be sure that you have successfully completed all of the steps in Installing Hive and review the MapR-DB topics before beginning this Getting Started tutorial.
This Getting Started tutorial is based on the Hive-HBase Integration section of the Apache Hive Wiki. However, please note that there are some significant differences.
Create a Hive table with two columns
Change to your Hive installation directory if you're not already there and start Hive:
$ cd $HIVE_HOME
$ bin/hive
Execute the CREATE TABLE command to create the Hive
pokes
table
hive> CREATE TABLE pokes (foo INT, bar STRING);
To see if the
pokes
table has been created successfully, execute the
SHOW TABLES
command
hive> SHOW TABLES;
OK
pokes
Time taken: 0.74 seconds
The pokes
table appears in the list of tables. Populate the Hive
pokes
table with data:
The kv1.txt
file is provided in the
$HIVE_HOME/examples/files
directory. Execute the LOAD DATA LOCAL INPATH
command to populate the Hive pokes
table with data from the
kv1.txt
file.
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
A message appears confirming that the table was created successfully, and the Hive prompt reappears:
Copying data from file:
...
OK
Time taken: 0.278 seconds
hive>
Execute a SELECT query on the Hive
pokes
table
hive> SELECT * FROM pokes WHERE foo = 98;
The SELECT statement executes, runs a MapReduce job, and prints the job output:
OK
98 val_98
98 val_98
Time taken: 18.059 seconds
The output of the SELECT command displays two identical rows because there are two
identical rows in the Hive pokes
table with a key of 98.
Hive tables can have multiple identical keys. As we will see shortly, MapR-DB tables cannot have multiple identical keys, only unique keys.
Create a Hive-MapR-DB table
Enter these four lines of code at the Hive prompt:
hive> CREATE TABLE mapr_table_1(key int, value string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
> TBLPROPERTIES ("hbase.table.name" = "/user/mapr/xyz");
After a brief delay, a message appears confirming that the table was created successfully:
OK
Time taken: 5.195 seconds
Note: The TBLPROPERTIES command is not required, but those new to Hive-MapR-DB integration may find it easier to understand what's going on if Hive and MapR-DB use different names for the same table.
In this example, Hive will recognize this table as "mapr_table_1" and MapR-DB will recognize this table as "xyz".
Start the HBase shell
Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell:
$ cd $HBASE_HOME
$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, rUnknown, Wed Nov 9 17:35:00 PST 2011
hbase(main):001:0>
Execute the
list
command to see a list of HBase tables
hbase(main):001:0> list
TABLE
/user/mapr/xyz
1 row(s) in 0.8260 seconds
HBase recognizes the Hive-MapR-DB table named xyz
in directory
/user/mapr
. This is the same table known to Hive as
mapr_table_1
.
Display the description of the
/user/mapr/xyz
table in the HBase shell
hbase(main):004:0> describe "/user/mapr/xyz"
DESCRIPTION ENABLED
{NAME => '/user/mapr/xyz', FAMILIES => [{NAME => 'cf1', DATA_B true
LOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REP
LICATION_SCOPE => '0', VERSIONS => '3', MIN_VERSION
S => '0', TTL => '2147483647', KEEP_DELETED_CELLS =
> 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
e', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'}
]}
1 row(s) in 0.0240 seconds
From the Hive prompt, insert data from the Hive table
pokes
into the Hive-MapR-DB table
mapr_table_1
hive> INSERT OVERWRITE TABLE mapr_table_1 SELECT * FROM pokes WHERE foo=98;
...
2 Rows loaded to mapr_table_1
OK
Time taken: 13.384 seconds
Query
mapr_table_1
to see the data we have inserted into the Hive-MapR-DB table
hive> SELECT * FROM mapr_table_1;
OK
98 val_98
Time taken: 0.56 seconds
Even though we loaded two rows from the Hive pokes
table that had the
same key of 98, only one row was actually inserted into mapr_table_1
. This
is because mapr_table_1
is a MapR-DB table, and although Hive tables
support duplicate keys, MapR-DB tables only support unique keys. MapR-DB tables arbitrarily
retain only one key, and silently discard all of the data associated with duplicate keys.
Convert a pre-existing MapR-DB table to a Hive-MapR-DB table
To convert a pre-existing MapR-DB table to a Hive-MapR-DB table, enter the following four commands at the Hive prompt.
Note that in this example the existing MapR-DB table is my_mapr_table
in
directory /user/mapr
.
hive> CREATE EXTERNAL TABLE mapr_table_2(key int, value string)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
> TBLPROPERTIES("hbase.table.name" = "/user/mapr/my_mapr_table");
Now we can run a Hive query against the pre-existing MapR-DB table
/user/mapr/my_mapr_table
that Hive sees as
mapr_table_2
:
hive> SELECT * FROM mapr_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
OK
401 val_401
402 val_402
403 val_403
404 val_404
406 val_406
407 val_407
409 val_409
Time taken: 9.452 seconds