Using Clojure to query AWS EMR via Hive JDBC

Background

In January, I launched an Amazon Elastic Map/Reduce (EMR) cluster, for a migration project.  Since then, I’ve been slowing adjusting and configuring it (with help) to:

  • Use an external Hive metastore from a data lake
  • Enable querying the cluster via Hive JDBC
  • Enable querying the cluster via Simba JDBC (Spark Thrift)

It took a while to achieve those bullets in a security-restricted environment.  In the few weeks since it’s been copasetic, I’ve begun testing the Hive JDBC access.

The first successful test involved JRuby 9.1.15.0 on Rails 4.2 using a non-ActiveRecord model (PORO). This article illustrates my second test, using Clojure.

Prerequisites

This assumes you already have your AWS EMR cluster running with Hive et. al. enabled.

Download the Hive JDBC Drivers

The first step is to download the Hive JDBC drivers Amazon provides.  This was tricky to find since links have changed over time.  This 2014 blog post, from Simba Technologies, on the AWS Big Data Blog, contain a link to the drivers.  Despite my Google-fu, I couldn’t find a more recent version of the drivers.

I’m using the JDBC 4.1 drivers found in the AmazonHiveJDBC/AmazonHiveJDBC41-1.0.9.1060 directory.

JDBC 4.1 JARs for Hive

Create a New Project

Use Leiningen to create a new Clojure project:

lein new app emr-hive-jdbc-example

Add the Dependencies

Edit the project.clj file and add dependency entries for Logging and Java JDBC.

:dependencies [[org.clojure/clojure "1.8.0"]
               [org.clojure/tools.logging "0.4.1"]
               [org.clojure/java.jdbc "0.7.8"]]

Add the JAR files

I didn’t care to mess with Maven for this example code.  I took the approach of placing the JAR files in the resources directory.  And added the corresponding entries to the project.clj file.

:resource-paths ["resources/HiveJDBC41.jar"
                 "resources/commons-codec-1.3.jar"
                 "resources/commons-logging-1.1.1.jar"
                 "resources/hive_metastore.jar"
                 "resources/hive_service.jar"
                 "resources/httpclient-4.1.3.jar"
                 "resources/httpcore-4.1.3.jar"
                 "resources/libfb303-0.9.0.jar"
                 "resources/libthrift-0.9.0.jar"
                 "resources/log4j-1.2.14.jar"
                 "resources/ql.jar"
                 "resources/slf4j-api-1.5.11.jar"
                 "resources/slf4j-log4j12-1.5.11.jar"
                 "resources/TCLIServiceClient.jar"
                 "resources/zookeeper-3.4.6.jar"]

Here’s what the full project.clj file should look like:

(defproject emr-hive-jdbc-example "0.1.0-SNAPSHOT"
  :description "Example of how to connect to AWS EMR via Hive JDBC"
  :url "https://gitlab.com/ejstembler/emr-hive-jdbc-example"
  :license {:name "MIT"
            :url "https://mit-license.org"}
  :dependencies [[org.clojure/clojure "1.8.0"]
                 [org.clojure/tools.logging "0.4.1"]
                 [org.clojure/java.jdbc "0.7.8"]]
  :resource-paths ["resources/HiveJDBC41.jar"
                   "resources/commons-codec-1.3.jar"
                   "resources/commons-logging-1.1.1.jar"
                   "resources/hive_metastore.jar"
                   "resources/hive_service.jar"
                   "resources/httpclient-4.1.3.jar"
                   "resources/httpcore-4.1.3.jar"
                   "resources/libfb303-0.9.0.jar"
                   "resources/libthrift-0.9.0.jar"
                   "resources/log4j-1.2.14.jar"
                   "resources/ql.jar"
                   "resources/slf4j-api-1.5.11.jar"
                   "resources/slf4j-log4j12-1.5.11.jar"
                   "resources/TCLIServiceClient.jar"
                   "resources/zookeeper-3.4.6.jar"]
  :main ^:skip-aot emr-hive-jdbc-example.core
  :target-path "target/%s"
  :profiles {:uberjar {:aot :all}})

Add a log4j.properties file

Create a new log4j.properties file in the src directory.  Add some basic logging configuration for log4j to pick up.

log4j.rootLogger=INFO, console
log4j.logger.example=DEBUG
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p %c: %m%n

Edit the core source file

The core.clj file is the main and only source file for this example.

Add requires to namespace

At the top of the file, we need to add requires statements for logging and JDBC.  I like to use succinct but logical aliases.

(ns emr-hive-jdbc-example.core
  (:require [clojure.tools.logging :as log])
  (:require [clojure.java.jdbc :as jdbc])
  (:gen-class))

Some examples out there use j as an alias for JDBC.  That’s too short and not descriptive!

Read sensitive values from environment variables

Next, I’m reading some sensitive values from environment variables.  You never want to hard-code sensitive values in your source code, or even in your config files!  See also:  The Twelve-Factor App:  III. Config.

;; Define the user/password/subname from ENV variables
(def user (System/getenv "EMR_HIVE_JDBC_USER"))
(def password (System/getenv "EMR_HIVE_JDBC_PASSWORD"))
(def subname (System/getenv "EMR_HIVE_JDBC_SUBNAME"))

When dealing with JDBC in other JVM languages I use a url.  Yet, Clojure JDBC seems to prefer using a subname instead.  It likely builds the url from all the parts you provide, including the subname.  The subname is a part of the url anyway.

Preconditions: Confirm the required values

I’m a fan of Betrand Meyer‘s concept of Preconditions.  I try to use them in most all languages I write code in.  Some languages have built-in support or 3rd-party libraries.  Though sometimes it’s better to hand-code them for simplicity’s sake. See also: Design by contract.

;; Preconditions
(when (nil? user)
  (throw (RuntimeException. "user is nil")))
(when (nil? password)
  (throw (RuntimeException. "password is nil")))
(when (nil? subname)
  (throw (RuntimeException. "subname is nil")))

Since the code cannot continue without valid values for the user / password / subname it makes sense to raise errors if they are missing.  This way you can control the message that’s displayed to the end-user.

Define the connection specification

Defining the connection specification is pretty straightforward.  It’s only a Map.

;; Define the connection specification
(def hive-spec {:classname "com.amazon.hive.jdbc41.HS2Driver"
               :subprotocol "hive2"
               :subname subname
               :user user
               :password password})

I’m using the JDBC 4.1 HS2 driver with the hive2 subprotocol.

Define the SQL statement

This is pretty straightforward and simple too.  Though, I was  perplexed about why the Clojure JDBC query function requires the SQL statement in a Vector.  Why not use a String?

;; Define the SQL statement to execute
(def sql ["SELECT
               COUNT(*) AS row_count
           FROM
               my_table"]) ; TODO: Change to your desired SQL statement

The main entry-point

Here’s the main entry-point of this example, which does the following:

  1. Logs the SQL to execute
  2. Executes the SQL
  3. Returns the row_count value of the first row
;; The main entry-point
;; 1. Logs the SQL to execute
;; 2. Executes the SQL
;; 3. Returns the row_count value of the first row
(defn -main
  [& args]
  (log/info "Executing SQL:\n" (first sql))
  (println (jdbc/query hive-spec sql {:row-fn :row_count :result-set-fn first})))

I’m ignoring any command-line arguments here.  I log the SQL statement so I can see it.  And I call the query function, passing the spec, the sql, and an options Map which specifies:

  • a result-set function which returns only the 1st row
  • a row function which only returns the row_count column I aliased in my SQL statement

I noticed that if I used COUNT(*) without the column alias, Hive defines it as something obscure like _c0.  Not user friendly at all.

Here’s the complete core.clj file:

(ns emr-hive-jdbc-example.core
  (:require [clojure.tools.logging :as log])
  (:require [clojure.java.jdbc :as jdbc])
  (:gen-class))

;; Define the user/password/subname from ENV variables
(def user (System/getenv "EMR_HIVE_JDBC_USER"))
(def password (System/getenv "EMR_HIVE_JDBC_PASSWORD"))
(def subname (System/getenv "EMR_HIVE_JDBC_SUBNAME"))

;; Preconditions
(when (nil? user)
  (throw (RuntimeException. "user is nil")))
(when (nil? password)
  (throw (RuntimeException. "password is nil")))
(when (nil? subname)
  (throw (RuntimeException. "subname is nil")))

;; Define the connection specification
(def hive-spec {:classname "com.amazon.hive.jdbc41.HS2Driver"
               :subprotocol "hive2"
               :subname subname
               :user user
               :password password})

;; Define the SQL statement to execute
(def sql ["SELECT
               COUNT(*) AS row_count
           FROM
               my_table"]) ; TODO: Change to your desired SQL statement

;; The main entry-point
;; 1. Logs the SQL to execute
;; 2. Executes the SQL
;; 3. Returns the row_count value of the first row
(defn -main
  [& args]
  (log/info "Executing SQL:\n" (first sql))
  (println (jdbc/query hive-spec sql {:row-fn :row_count :result-set-fn first})))

Running the project

When I was testing this project, I used lein to run it.

lein run

Later, after you’ve compiled the JAR, you can run it via Java:

java -jar emr-hive-jdbc-example-0.1.0-standalone.jar

Conclusion

In any case, I’m glad I got this working!  I like the functional way Clojure + JDBC code turns out.

Hopefully someone may find this useful too.

Full Source Code

ejstembler/emr-hive-jdbc-example

Sliding Sidebar