A connection to Apache Hive can be created in exactly the same way as any other DB connection. Below are a few Hive-specific remarks you may find useful.
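For illustration, a Hive 0.8.1 (HiveServer1) connection typically uses the following driver class and JDBC URL pattern; the host name, port, and database name below are placeholders, and 10000 is only the usual default port:

```
Driver class: org.apache.hadoop.hive.jdbc.HiveDriver
JDBC URL:     jdbc:hive://hostname:10000/default
```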
The JDBC driver is part of the Hive release, but the driver library and its dependencies are scattered among the other Hive libraries. Moreover, the driver depends on one additional library from the Hadoop distribution; which one depends on your Hadoop version, but there is always exactly one.
For Hive version 0.8.1, this is a minimal list of libraries you need for the Hive DB connection JDBC driver:
You can put all of the Hive distribution libraries, plus the one Hadoop library, on the JDBC driver classpath. However, some of the Hive distribution libraries may already be included in CloverETL, which can result in class-loading conflicts. Typically, no log4j* libraries should be included; otherwise (harmless) warnings will appear in the graph run log.
Remember that Hive is not an ordinary SQL relational database; it has its own SQL-like query language, called QL. A great resource on Hive QL, and on Hive in general, is the Apache Hive Wiki.
One consequence is that it makes no sense to use the DBOutputTable component, because the INSERT INTO statement can insert only the results of a query. Even though it is still possible to work around this, each CloverETL data record inserted with such an INSERT statement results in a heavyweight MapReduce job, which makes the component extremely slow. Use the LOAD DATA Hive QL statement instead.
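For illustration, a graph can first write its output to a file and then load that file into the target table with a single statement; the table and path names below are only examples:

```sql
-- Load a file already present on HDFS into the target table.
LOAD DATA INPATH '/user/clover/output/customers.csv'
INTO TABLE customers;

-- Or load a file from the local file system of the Hive server.
LOAD DATA LOCAL INPATH '/tmp/customers.csv'
INTO TABLE customers;
```

Unlike a per-record INSERT, each LOAD DATA statement moves the whole file into the table's storage location in one operation, so no MapReduce job is launched per record.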
In the DBExecute component, always set the Transaction set attribute to One statement. The reason is that the Hive JDBC driver does not support transactions, and an attempt to use them results in an error saying that the AutoCommit mode cannot be disabled.
Note that the file append operation is fully supported only since HDFS version 0.21.0. Consequently, if you run Hive on top of an older HDFS, you cannot append data to existing tables, and use of the OVERWRITE keyword becomes mandatory.
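On such older HDFS versions, a load therefore has to replace the table contents rather than extend them; the table and path names below are again only illustrative:

```sql
-- Append is unavailable on HDFS older than 0.21.0:
-- replace the existing table contents instead.
LOAD DATA INPATH '/user/clover/output/customers.csv'
OVERWRITE INTO TABLE customers;
```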