{ "feed": { "id": "urn:uuid:b48fb060-c28d-5798-9996-f30c6f8d947a", "link": [ { "@attributes": { "href": "\"http://jfolson.com#{current_page.url}\"", "rel": "self" } }, { "@attributes": { "href": "http://jfolson.com/" } } ], "title" : "Problem Solving", "subtitle" : "Algorithms, Programming, Math and Food", "updated" : "2013-04-11T10:36:49-04:00", "author": { "name": "Jamie F Olson" }, "rights" : "Copyright (c) 2013, Jamie F Olson", "entries" : [ { "summary" : "
This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but
\n", "content" : "This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I started to take another look at the RHadoop packages for interacting with Hadoop in R. They are, in fact, excellent, offering a surprisingly seemless experience. Motivated by this experience, I decided to rewrite my shell scripts and the code using them in R.
\nThere are a couple of existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is nexr\u2019s RHive. It has lots of great features but, among other things, it didn\u2019t work for me: it connects over Thrift and failed to properly construct results as R objects when I tried it. There were other things I wasn\u2019t thrilled with, too. RHive\u2019s API requires users to explicitly export any variables and functions needed by map or reduce functions, which leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.
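\nTo make that difference concrete, here\u2019s a rough sketch of the RHadoop side using rmr2 (assuming a working Hadoop setup; the names here are my own invention). The map function simply closes over a local variable, and rmr ships the closure to the workers along with its environment, with no explicit export step:
\nlibrary(rmr2)\n# 'threshold' is an ordinary local variable captured by the map closure\nthreshold <- 5\nout <- mapreduce(input = to.dfs(1:10),\n                 map = function(k, v) keyval(v, v > threshold))\nfrom.dfs(out)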
\nThese experiences led me to write the rhive package, an attempt to bring the ease of use of RHadoop to Hive for R.
\nFor the current state of the project, including its principal features, check it out on GitHub.
\n", "title" : "Introducing rhive", "updated" : "2013-03-28T00:00:00+00:00", "id" : "urn:uuid:beeba428-8c02-5c70-9b9c-efdbbb1685a8", "link" : "http://jfolson.com/blog/2013/03/28/introducing-rhive/" }, { "summary" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are
\n", "content" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are retrieved using the \u2018@\u2019 operator, freeing the \u2018$\u2019 operator for fun.
\nSpecifically, I decided to use the \u2018$\u2019 operator to access references to a table\u2019s columns:
\n> tab$x\ntab.x
\nThis makes a variety of things much cleaner and more pleasant to use.
\nI hadn\u2019t had much experience writing S4 classes, so I was surprised to learn that it is not possible for S4 methods to have unevaluated arguments.
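\nA quick way to see the problem (a throwaway demonstration; grab is a made-up generic): dispatch has to evaluate the arguments to choose a method, so substitute never gets a chance to see the caller\u2019s expression:
\nsetGeneric("grab", function(x, expr) standardGeneric("grab"))\nsetMethod("grab", "data.frame", function(x, expr) substitute(expr))\ndat = data.frame(x=rnorm(10), y=1:10)\ngrab(dat, x > 0)\n# Error: object 'x' not found -- 'expr' was evaluated during dispatch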
\nThis is problematic, since a wide variety of frequently used (S3) functions delay evaluation of their arguments in order to evaluate them in a specific context (usually a data frame). The most notable example of this is subset, but there are plenty of others.
> dat = data.frame(x=rnorm(10),y=1:10)\n> dat.sub = subset(dat,subset=(x>0))
\nSince arguments to S4 generics are evaluated while selecting an appropriate method, this is not possible. Instead, it becomes necessary to immediately capture the unevaluated expression using something like plyr\u2019s ., as in this example from the documentation for reshape2::acast:
#Air quality example\nnames(airquality) <- tolower(names(airquality))\naqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)\n\nacast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
\nSo that\u2019s what I did, too.
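\nFor the curious, the capture-then-evaluate pattern looks roughly like this (a sketch of the idea, not rsql\u2019s actual code; sub is a stand-in name). plyr\u2019s .() quotes the expression immediately, and the receiving function evaluates it later in the context of the data:
\nlibrary(plyr)\nsub <- function(x, subset) {\n  # 'subset' is a quoted object from .(); evaluate its expression against the data\n  keep <- eval(subset[[1]], envir = x, enclos = parent.frame())\n  x[keep, , drop = FALSE]\n}\ndat = data.frame(x=rnorm(10), y=1:10)\nsub(dat, .(x > 0))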
\nThis story is really uninteresting unless you realize how much work I put into capturing and passing those unevaluated expressions around before I needed to transition to S4 and it all stopped working.
\n", "title" : "Designing rsql: S3, S4 and expression evaluation in R", "updated" : "2013-03-27T00:00:00+00:00", "id" : "urn:uuid:eca7a615-29c7-5602-96eb-99a0a71debd9", "link" : "http://jfolson.com/blog/2013/03/27/rsql-s3-s4/" }, { "summary" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R
\n", "content" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R packages for interacting with databases and everything goes smoothly.
\nEventually, you find yourself wanting to do a little more of the SQL from R, since R functions tend to be easier to document, generalize and reuse than SQL scripts. Now everything is great! You\u2019ve got R functions that fetch or manipulate data in database tables before bringing the data into R. Everything works, unless you make a mistake. Unfortunately, the arguments to those functions are just strings, so it\u2019s rather cumbersome to combine them programmatically. Seemingly small generalizations become more and more difficult because, ultimately, R is not SQL.
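\nThe string-building style looks something like this (a sketch using DBI; get_positive and con are my own placeholder names):
\nlibrary(DBI)\n# con is assumed to be an open DBI connection\nget_positive <- function(con, tab, col) {\n  dbGetQuery(con, paste0("SELECT * FROM ", tab, " WHERE ", col, " > 0"))\n}\n# composing further conditions means more paste0(), quoting and parenthesizing by hand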
\nWhen you operate on a data frame, you write this:
\nx.sub = subset(x,subset=(y>0))
\nNot this:
\nx.sub = subset(x,subset=("y > 0"))
\nSo why would you want to do that just because the data is in a database?
\nWith rsql, things are unfortunately just barely less awesome:
\nx.sub = subset(x,subset=.(y>0))
\nCheck it out on GitHub.
\n", "title" : "Introducting rsql: programming SQL in R", "updated" : "2013-03-26T00:00:00+00:00", "id" : "urn:uuid:59bfe5f3-31bc-5719-8a32-a70ba1a81dbd", "link" : "http://jfolson.com/blog/2013/03/26/introducing-rsql/" }, { "summary" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to
\n", "content" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to typedbytes. The format is by no means perfect, though. Among other things, typedbytes 8 and 9 seem largely redundant, there\u2019s no distinction between a homogeneous collection and a heterogeneous collecetion, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
\nThere are a great many strange things about Hive\u2019s support for typedbytes. First, Hive contains a duplicate, nearly identical copy of the Hadoop streaming typedbytes implementation. To be fair, this is somewhat understandable, since the Hadoop implementation is (in my opinion) excessively protective about access to its variables while also using an Enum for the type. Taken together, this makes it quite difficult to extend Hadoop\u2019s implementation without duplicating the whole thing, i.e.\u00a0subclassing won\u2019t let you do much.
\nThe larger issue is that files created by Hive\u2019s typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the typedbytes objects Hive serializes into and deserializes from are actually generic Binary objects (TypedBytes objects), which means that Hadoop serializes them as Objects, not as the actual typedbytes byte sequence contained in the object.
\nAnother problem is that Hive really doesn\u2019t support much of the typedbytes spec. In particular, complex objects (list, vector and map) can only be serialized and deserialized as their JSON representations, and Hive\u2019s typedbytes SerDe will fail on encountering any non-primitive in a typedbytes sequence.
\nThe typedbytes spec also allows for application-specific type codes (anything between 50 and 200). Hive is unable to support these type codes and will simply fail.
\nNeedless to say, these are problems for getting RHadoop-style functionality for Hive tables from R. I\u2019ve solved these issues, and a few more, in the current version of rhive. It\u2019s currently focused primarily on compatibility with RHadoop\u2019s rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools.
\nIn particular, the SerDe for rhive supports application-specific type codes. These types are converted into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary types are assumed to be an entire typedbytes sequence and are serialized as such, unless they cannot be (an invalid type code or incorrect length).
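\nIn other words, a custom-typed record is carried through opaquely, roughly like this (a sketch of the behavior, not the actual Java implementation; read_tb_record is a made-up name):
\nread_tb_record <- function(con) {\n  code <- as.integer(readBin(con, "raw", n = 1L))\n  if (code >= 50 && code <= 200) {\n    # application-specific: keep code + length + payload intact as one opaque blob\n    len <- readBin(con, "integer", size = 4L, endian = "big")\n    payload <- readBin(con, "raw", n = len)\n    return(c(as.raw(code), writeBin(len, raw(), size = 4L, endian = "big"), payload))\n  }\n  stop("standard type code ", code, ": decode per the typedbytes spec")\n}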
\n", "title" : "Hive, Hadoop and TypedBytes", "updated" : "2013-03-24T00:00:00+00:00", "id" : "urn:uuid:59373f76-f9d6-5e86-ba73-44b52f95bb54", "link" : "http://jfolson.com/blog/2013/03/24/hive-typedbytes/" }, ] } }