This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, such as melt and aggregate. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I took another look at the RHadoop packages for interacting with Hadoop from R. They are, in fact, excellent, offering a surprisingly seamless experience. Motivated by this, I decided to rewrite my shell scripts, and the code using them, in R.
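To give a feel for the kind of query involved, here is a minimal sketch of how a melt-style reshape might be generated from R as a Hive query string. The function name `melt_query`, the table name `scores`, and the column names are all made up for illustration; this is not the code from my scripts or from any package, just one plausible way to express the idea using a UNION ALL over the measure columns.

```r
# Hypothetical helper (not part of any package): build a Hive query that
# "melts" a wide table into long (id, variable, value) rows.
melt_query <- function(table, id_col, measure_cols) {
  # One SELECT per measure column; the column name becomes a string literal.
  # Values are cast to STRING so every branch of the union has the same type.
  selects <- sprintf(
    "SELECT %s, '%s' AS variable, CAST(%s AS STRING) AS value FROM %s",
    id_col, measure_cols, measure_cols, table
  )
  # Stack the per-column results; the union is wrapped in a subquery
  # because older Hive versions only accept UNION ALL inside one.
  paste0(
    "SELECT * FROM (\n",
    paste(selects, collapse = "\nUNION ALL\n"),
    "\n) melted"
  )
}

# Example with made-up table and column names.
cat(melt_query("scores", "student", c("math", "physics")), "\n")
```

Generating the SQL from R rather than from a shell script means the column lists can come straight from R vectors, which is a large part of why the shell approach became uncomfortable.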
There are a couple of existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is NexR’s RHive. It seems to have lots of great features, but, among other things, it didn’t work for me: it uses a Thrift connection, and when I tried it, it failed to properly construct results as R objects. There were a couple of other things I wasn’t thrilled with about RHive. Its API has a lot of great features, but it requires users to explicitly export any variables and functions that are required by map or reduce functions. This leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.
The rhive package
These experiences led me to write the rhive package, which is an attempt to bring the ease of use of RHadoop to Hive for R. The principal features of rhive are:
- Creation and querying of Hive tables from R.
- Importing data into and exporting data from Hive tables.
- Manipulation of Hive tables, just like data frames in R.
- Applying map and reduce functions defined in R to Hive tables and storing the results in Hive tables.
For information on the current state of the project, check it out on GitHub.