To complement the excellent RHadoop tools, I developed several R packages for use with Hive databases. I found it desirable to be able to express SQL in R, particularly WHERE and SELECT clauses. To that end, I developed the rsql package.
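To give a flavor of the idea, here is a minimal sketch of translating an unevaluated R expression into a SQL WHERE clause using base R metaprogramming. This is an illustration of the general technique, not rsql's actual API; the function name and operator table are hypothetical.

```r
# Hypothetical sketch (not rsql's actual interface): walk an R call tree
# and emit an equivalent SQL expression. Handles binary operators only.
r_to_sql <- function(expr) {
  # Map R operators to their SQL spellings
  ops <- c("==" = "=", "!=" = "<>", "&" = "AND", "|" = "OR")
  translate <- function(e) {
    if (is.call(e)) {
      op <- as.character(e[[1]])
      op <- if (op %in% names(ops)) ops[[op]] else op
      paste(translate(e[[2]]), op, translate(e[[3]]))
    } else if (is.character(e)) {
      paste0("'", e, "'")   # quote string literals for SQL
    } else {
      deparse(e)            # column names and numbers pass through
    }
  }
  translate(expr)
}

r_to_sql(quote(age > 30 & state == "CA"))
# "age > 30 AND state = 'CA'"
```

The appeal is that the user writes an ordinary R expression and the package handles quoting and operator translation, rather than pasting SQL strings together by hand.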
I found the existing packages for R and Hive unsatisfactory: they required too much explicit importing and exporting of variables. I decided to build my own package on the RHadoop framework, which seamlessly exports the current R environment, transfers it to a shared location on HDFS, and copies it locally onto each task node, where it is loaded into an R session before evaluating whatever expression the user provided. There were several impediments to enabling that workflow with Hive. For the initial version, I chose to focus on implementing the Hive TRANSFORM statement, which streams each row through a user-provided executable. This required implementing a SerDe for the typedbytes data format preferred by RHadoop. For more information, check out the rhive package on GitHub or read my blog articles about it.
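For readers unfamiliar with TRANSFORM: by default Hive streams rows to the user's executable as tab-delimited text on stdin and reads results back from stdout (the typedbytes SerDe mentioned above replaces that text encoding). Below is a minimal sketch of such a streaming script in R; the script name, columns, and query are illustrative, not taken from rhive.

```r
#!/usr/bin/env Rscript
# Minimal Hive TRANSFORM streaming script (illustrative). Hive might
# invoke it with a query along these lines:
#
#   SELECT TRANSFORM (id, price)
#   USING 'add_tax.R'
#   AS (id, price_with_tax)
#   FROM sales;
#
# Each input row arrives on stdin as one tab-delimited line; each output
# row is written to stdout in the same format.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  id    <- fields[1]
  price <- as.numeric(fields[2])
  # Emit the transformed row: same id, price with a hypothetical 7% tax
  cat(id, sprintf("%.2f", price * 1.07), sep = "\t")
  cat("\n")
}
close(con)
```

The RHadoop-style workflow described above wraps this pattern so that the user supplies an R expression instead of writing the stdin/stdout plumbing themselves.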
I also implemented a very limited subset of the fantastic caret package for Hive tables as represented in rhive; specifically, several functions related to sampling and partitioning.
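For context, this is what the in-memory caret originals that this subset models look like; the example below uses plain caret on a local data frame, not the rhive counterparts.

```r
library(caret)

# Stratified 80/20 train/test split: createDataPartition samples within
# each level of the outcome, so both partitions preserve class proportions.
data(iris)
set.seed(42)
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[train_idx, ]
testing  <- iris[-train_idx, ]
table(training$Species)  # roughly 40 rows of each species
```

Reproducing even this small slice of caret against Hive tables means pushing the sampling down into queries rather than pulling all rows into R first.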