{ "feed": { "id": "urn:uuid:b48fb060-c28d-5798-9996-f30c6f8d947a", "link": [ { "@attributes": { "href": "\"http://jfolson.com#{current_page.url}\"", "rel": "self" } }, { "@attributes": { "href": "http://jfolson.com/" } } ], "title" : "Problem Solving", "subtitle" : "Algorithms, Programming, Math and Food", "updated" : "2013-04-11T10:36:49-04:00", "author": { "name": "Jamie F Olson" }, "rights" : "Copyright (c) 2013, Jamie F Olson", "entries" : [ { "summary" : "
This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but
\n", "content" : "This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I started to take another look at the RHadoop packages for interacting with Hadoop in R. They are, in fact, excellent, offering a surprisingly seemless experience. Motivated by this experience, I decided to rewrite my shell scripts and the code using them in R.
\nThere are a couple of existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is nexr\u2019s RHive. It has lots of great features but, among other things, it didn\u2019t work for me: it connects over Thrift and failed to properly construct results as R objects when I tried it. There were other things I wasn\u2019t thrilled with, too. RHive\u2019s API requires users to explicitly export any variables and functions needed by map or reduce functions, which leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.
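\nTo make that difference concrete, here\u2019s a rough sketch of the RHadoop side using rmr2 (assuming a working Hadoop setup; the names here are my own invention). The map function simply closes over a local variable, and rmr ships the closure to the workers along with its environment, with no explicit export step:
\nlibrary(rmr2)\n# 'threshold' is an ordinary local variable captured by the map closure\nthreshold <- 5\nout <- mapreduce(input = to.dfs(1:10),\n                 map = function(k, v) keyval(v, v > threshold))\nfrom.dfs(out)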
\nThese experiences led me to write the rhive package, an attempt to bring the ease of use of RHadoop to Hive for R.
\nFor the current state of the project, including its principal features, check it out on GitHub.
\n", "title" : "Introducing rhive", "updated" : "2013-03-28T00:00:00+00:00", "id" : "urn:uuid:beeba428-8c02-5c70-9b9c-efdbbb1685a8", "link" : "http://jfolson.com/blog/2013/03/28/introducing-rhive/" }, { "summary" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are
\n", "content" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are retrieved using the \u2018@\u2019 operator, freeing the \u2018$\u2019 operator for fun.
\nSpecifically, I decided to use the \u2018$\u2019 operator to access references to a table\u2019s columns:
\n> tab$x\ntab.x
\nThis makes a variety of things much cleaner and more pleasant to use.
\nI hadn\u2019t had much experience writing S4 classes, so I was surprised to learn that it is not possible for S4 methods to have unevaluated arguments.
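\nA quick way to see the problem (a throwaway demonstration; grab is a made-up generic): dispatch has to evaluate the arguments to choose a method, so substitute never gets a chance to see the caller\u2019s expression:
\nsetGeneric("grab", function(x, expr) standardGeneric("grab"))\nsetMethod("grab", "data.frame", function(x, expr) substitute(expr))\ndat = data.frame(x=rnorm(10), y=1:10)\ngrab(dat, x > 0)\n# Error: object 'x' not found -- 'expr' was evaluated during dispatch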
\nThis is problematic, since a wide variety of frequently used (S3) functions delay evaluation of their arguments in order to evaluate them in a specific context (usually a data frame). The most notable example of this is subset, but there are plenty of others.
> dat = data.frame(x=rnorm(10),y=1:10)\n> dat.sub = subset(dat,subset=(x>0))
\nSince arguments to S4 generics are evaluated while selecting an appropriate method, this is not possible. Instead, it becomes necessary to immediately capture the unevaluated expression using something like plyr\u2019s ., as in this example from the documentation for reshape2::acast:
#Air quality example\nnames(airquality) <- tolower(names(airquality))\naqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)\n\nacast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
\nSo that\u2019s what I did, too.
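\nFor the curious, the capture-then-evaluate pattern looks roughly like this (a sketch of the idea, not rsql\u2019s actual code; sub is a stand-in name). plyr\u2019s .() quotes the expression immediately, and the receiving function evaluates it later in the context of the data:
\nlibrary(plyr)\nsub <- function(x, subset) {\n  # 'subset' is a quoted object from .(); evaluate its expression against the data\n  keep <- eval(subset[[1]], envir = x, enclos = parent.frame())\n  x[keep, , drop = FALSE]\n}\ndat = data.frame(x=rnorm(10), y=1:10)\nsub(dat, .(x > 0))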
\nThis story is really uninteresting unless you realize how much work I put into capturing and passing those unevaluated expressions around before I needed to transition to S4 and it all stopped working.
\n", "title" : "Designing rsql: S3, S4 and expression evaluation in R", "updated" : "2013-03-27T00:00:00+00:00", "id" : "urn:uuid:eca7a615-29c7-5602-96eb-99a0a71debd9", "link" : "http://jfolson.com/blog/2013/03/27/rsql-s3-s4/" }, { "summary" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R
\n", "content" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R packages for interacting with databases and everything goes smoothly.
\nEventually, you find yourself wanting to do a little more of the SQL from R, since R functions tend to be easier to document, generalize and reuse than SQL scripts. Now everything is great! You\u2019ve got R functions that fetch or manipulate data in database tables before bringing the data into R. Everything works, unless you make a mistake. Unfortunately, the arguments to those functions are just strings, so it\u2019s rather cumbersome to combine them programmatically. Seemingly small generalizations become more and more difficult because, ultimately, R is not SQL.
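\nThe string-building style looks something like this (a sketch using DBI; get_positive and con are my own placeholder names):
\nlibrary(DBI)\n# con is assumed to be an open DBI connection\nget_positive <- function(con, tab, col) {\n  dbGetQuery(con, paste0("SELECT * FROM ", tab, " WHERE ", col, " > 0"))\n}\n# composing further conditions means more paste0(), quoting and parenthesizing by hand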
\nWhen you operate on a data frame, you write this:
\nx.sub = subset(x,subset=(y>0))
\nNot this:
\nx.sub = subset(x,subset=("y > 0"))
\nSo why would you want to do that just because the data is in a database?
\nWith rsql, things are unfortunately just barely less awesome:
\nx.sub = subset(x,subset=.(y>0))
\nCheck it out on GitHub.
\n", "title" : "Introducting rsql: programming SQL in R", "updated" : "2013-03-26T00:00:00+00:00", "id" : "urn:uuid:59bfe5f3-31bc-5719-8a32-a70ba1a81dbd", "link" : "http://jfolson.com/blog/2013/03/26/introducing-rsql/" }, { "summary" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to
\n", "content" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to typedbytes. The format is by no means perfect, though. Among other things, typedbytes 8 and 9 seem largely redundant, there\u2019s no distinction between a homogeneous collection and a heterogeneous collecetion, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
\nThere are a great many strange things about Hive\u2019s support for typedbytes. First, Hive contains a duplicate, nearly identical copy of the Hadoop streaming typedbytes implementation. To be fair, this is somewhat understandable, since the Hadoop implementation is (in my opinion) excessively protective about access to its variables while also using an Enum for the type. Taken together, this makes it quite difficult to extend Hadoop\u2019s implementation without duplicating the whole thing, i.e.\u00a0subclassing won\u2019t let you do much.
\nThe larger issue is that files created by Hive\u2019s typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the typedbytes objects Hive serializes into and deserializes from are actually generic Binary objects (TypedBytes objects), which means that Hadoop serializes them as Objects, not as the actual typedbytes byte sequence contained in the object.
\nAnother problem is that Hive really doesn\u2019t support much of the typedbytes spec. In particular, complex objects (list, vector and map) can only be serialized and deserialized as their JSON representations, and Hive\u2019s typedbytes SerDe will fail on encountering any non-primitive in a typedbytes sequence.
\nThe typedbytes spec also allows for application-specific type codes (anything between 50 and 200). Hive is unable to support these type codes and will simply fail.
\nNeedless to say, these are problems for getting RHadoop-style functionality for Hive tables from R. I\u2019ve solved these issues, and a few more, in the current version of rhive. It\u2019s currently focused primarily on compatibility with RHadoop\u2019s rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools.
\nIn particular, the SerDe for rhive supports application-specific type codes. These types are converted into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary types are assumed to be an entire typedbytes sequence and are serialized as such, unless they cannot be (an invalid type code or incorrect length).
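\nIn other words, a custom-typed record is carried through opaquely, roughly like this (a sketch of the behavior, not the actual Java implementation; read_tb_record is a made-up name):
\nread_tb_record <- function(con) {\n  code <- as.integer(readBin(con, "raw", n = 1L))\n  if (code >= 50 && code <= 200) {\n    # application-specific: keep code + length + payload intact as one opaque blob\n    len <- readBin(con, "integer", size = 4L, endian = "big")\n    payload <- readBin(con, "raw", n = len)\n    return(c(as.raw(code), writeBin(len, raw(), size = 4L, endian = "big"), payload))\n  }\n  stop("standard type code ", code, ": decode per the typedbytes spec")\n}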
\n", "title" : "Hive, Hadoop and TypedBytes", "updated" : "2013-03-24T00:00:00+00:00", "id" : "urn:uuid:59373f76-f9d6-5e86-ba73-44b52f95bb54", "link" : "http://jfolson.com/blog/2013/03/24/hive-typedbytes/" }, ] } }