{ "feed": { "id": "urn:uuid:b48fb060-c28d-5798-9996-f30c6f8d947a", "link": [ { "@attributes": { "href": "\"http://jfolson.com#{current_page.url}\"", "rel": "self" } }, { "@attributes": { "href": "http://jfolson.com/" } } ], "title" : "Problem Solving", "subtitle" : "Algorithms, Programming, Math and Food", "updated" : "2013-04-11T10:36:46-04:00", "author": { "name": "Jamie F Olson" }, "rights" : "Copyright (c) 2013, Jamie F Olson", "entries" : [ { "summary" : "
I had never actually used any of the modern functional programming languages until a couple of weeks ago. However, I\u2019ve worked quite a lot with the functional features of more multi-paradigm languages (python, R, ruby, javascript, etc.), so I didn\u2019t expect to be caught off guard by Haskell.
\n", "content" : "I had never actually used any of the modern functional programming languages until a couple of weeks ago. However, I\u2019ve worked quite a lot with the functional features of more multi-paradigm languages (python, R, ruby, javascript, etc.), so I didn\u2019t expect to be caught off guard by Haskell.
\nI decided to learn Haskell because I wanted to extend the wonderful Pandoc, which is written in Haskell. This turned out to be a bit more ambitious than I expected.
\nI was almost entirely unprepared for the way Haskell thinks about types. The more functional languages I\u2019ve used have tended to have somewhat weaker type concepts, from the duck-typing of python to the not-really-a-type S3 classes in R.
In contrast, Haskell is strongly typed, and I had not appreciated just how strongly typed it is. As an uninitiated outsider, I would not have expected types to be a big part of understanding the Pandoc source code. However, in Haskell, types are \u201cmerely\u201d data containers, meaning that any and all containers/structs are new types; there is no real concept of \u201cmethod\u201d in the object-oriented sense.
\nMaking it worse was the fact that I really did not understand Haskell types. Haskell has completely different concepts for type constructors and data constructors. Types do not actually exist as values you can refer to in Haskell.
\nf :: FromType -> ToType\ndata FromType = FromTypeConstructor Value
\nHere, you can create an object of type FromType with FromTypeConstructor value if value has type Value. However, FromType itself has no value.
This gets particularly complicated since types and constructors have completely separate namespaces! As an example, in Text.JSON, JSObject is both a type and a constructor, but the JSObject constructor does not create objects of type JSObject. Instead, JSONObject must be used to construct objects of type JSObject, while the constructor JSObject creates objects of type JSValue.
This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but
\n", "content" : "This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I started to take another look at the RHadoop packages for interacting with Hadoop in R. They are, in fact, excellent, offering a surprisingly seemless experience. Motivated by this experience, I decided to rewrite my shell scripts and the code using them in R.
\nThere are a couple of different existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is nexr\u2019s RHive. It seems to have lots of great features, but, among other things, it didn\u2019t work for me: it uses a Thrift connection, and it failed to properly construct results as R objects when I tried it. There were a couple of other things I wasn\u2019t thrilled with about RHive. Its API has a lot of great features, but it requires users to explicitly export any variables and functions that are required by map or reduce functions. This leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.
\nThese experiences led me to write the rhive package, which is an attempt to bring the ease of use of RHadoop to Hive for R. The principal features of rhive are:
\nFor information on the current state of the project, check it out on GitHub.
\n", "title" : "Introducing rhive", "updated" : "2013-03-28T00:00:00+00:00", "id" : "urn:uuid:beeba428-8c02-5c70-9b9c-efdbbb1685a8", "link" : "http://jfolson.com/blog/2013/03/28/introducing-rhive/" }, { "summary" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are
\n", "content" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are retrieved using the \u2018@\u2019 operator, freeing the \u2018$\u2019 operator for fun.
\nSpecifically, I decided to use the \u2018$\u2019 operator to access references to a table\u2019s columns:
\n> tab$x\ntab.x
\nThis makes a variety of things much cleaner and more pleasant to use.
\nI hadn\u2019t had a lot of experience in writing S4 classes, so I was surprised to learn that it is not possible for S4 methods to have unevaluated arguments.
\nThis is problematic since a wide variety of frequently used (S3) functions delay evaluation of their arguments for a specific context (usually a data frame). The most notable example of this is subset, but there are plenty of others.
> dat = data.frame(x=rnorm(10),y=1:10)\n> dat.sub = subset(dat,subset=(x>0))
\nSince arguments to S4 methods are evaluated while selecting an appropriate method, this is not possible. Instead, it becomes necessary to immediately capture the unevaluated expression using something like plyr\u2019s .() function, as in this example from the documentation for reshape2::dcast:
#Air quality example\nnames(airquality) <- tolower(names(airquality))\naqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)\n\nacast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
\nSo that\u2019s what I did, too.
\nThis story is really uninteresting unless you realize how much work I put into capturing and passing those unevaluated expressions around before I needed to transition to S4, when it all stopped working.
\n", "title" : "Designing rsql: S3, S4 and expression evaluation in R", "updated" : "2013-03-27T00:00:00+00:00", "id" : "urn:uuid:eca7a615-29c7-5602-96eb-99a0a71debd9", "link" : "http://jfolson.com/blog/2013/03/27/rsql-s3-s4/" }, { "summary" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R
\n", "content" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R packages for interacting with databases and everything goes smoothly.
\nEventually, you find yourself wanting to start doing a little bit more of the SQL in R, since R functions tend to be easier to document, generalize and reuse than SQL scripts. Now, everything is great! You\u2019ve got R functions that get or manipulate data in database tables before bringing the data into R. Everything works, unless you make a mistake. Unfortunately, the arguments for those functions are just strings, so it\u2019s rather cumbersome to combine all those things programmatically. Seemingly small generalizations become more and more difficult because, ultimately, R is not SQL.
\nWhen you operate on a data frame, you write this:
\nx.sub = subset(x,subset=(y>0))
\nNot this:
\nx.sub = subset(x,subset=("y > 0"))
\nSo why would you want to do that just because the data is in a database?
\nUnfortunately, things are just barely less awesome:
\nx.sub = subset(x,subset=.(y>0))
\nCheck it out on github.
\n", "title" : "Introducting rsql: programming SQL in R", "updated" : "2013-03-26T00:00:00+00:00", "id" : "urn:uuid:59bfe5f3-31bc-5719-8a32-a70ba1a81dbd", "link" : "http://jfolson.com/blog/2013/03/26/introducing-rsql/" }, { "summary" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to
\n", "content" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to typedbytes. The format is by no means perfect, though. Among other things, typedbytes 8 and 9 seem largely redundant, there\u2019s no distinction between a homogeneous collection and a heterogeneous collecetion, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
\nThere are a great many things that are strange about Hive\u2019s support for typedbytes. First, Hive contains a duplicate but nearly identical version of the Hadoop streaming typedbytes implementation. To be fair, this is somewhat understandable since the Hadoop implementation is (in my opinion) excessively protective about access to variables, while also using an Enum for the type codes. Taken together, this makes it quite difficult to extend Hadoop\u2019s implementation without duplicating the whole thing (i.e.\u00a0subclassing won\u2019t let you do much).
\nThe larger issue is that files created by Hive\u2019s typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the typedbytes objects Hive serializes into and deserializes from are actually generic Binary objects (TypedBytes objects), which means that they are serialized by Hadoop as Objects, not as the actual typedbytes bit sequence contained in the object.
\nAnother problem is that Hive really doesn\u2019t support much of the typedbytes spec. Particularly, complex objects (list, vector and map) can only be serialized and deserialized as their JSON representations. Hive\u2019s typedbytes SerDe will fail on encountering any non-primitives in the typedbytes sequence.
\nThe typedbytes spec allows for application-specific type-codes (anything between 50 and 200). Hive is unable to support these type-codes and will simply fail.
\nNeedless to say, these are problems for getting RHadoop-styled functionality for Hive tables with R. I\u2019ve solved these issues and a few more in the current version of rhive. It\u2019s currently focused primarily on compatibility with RHadoop\u2019s rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools.
\nIn particular, the SerDe for rhive supports application-specific type-codes. These types are converted into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary types are assumed to be an entire typedbytes sequence, and serialized as such unless they cannot be (invalid type or incorrect length).
\n", "title" : "Hive, Hadoop and TypedBytes", "updated" : "2013-03-24T00:00:00+00:00", "id" : "urn:uuid:59373f76-f9d6-5e86-ba73-44b52f95bb54", "link" : "http://jfolson.com/blog/2013/03/24/hive-typedbytes/" }, { "summary" : "One of the things I\u2019ve liked about git is how much it encourages responsible and clean commits. Personally, I\u2019m not there yet. I commit when I think to, usually when I\u2019m at a reasonable stopping point, which means I just end up doing git commit -a
rather than adding and committing clean groups of related
One of the things I\u2019ve liked about git is how much it encourages responsible and clean commits. Personally, I\u2019m not there yet. I commit when I think to, usually when I\u2019m at a reasonable stopping point, which means I just end up doing git commit -a
rather than adding and committing clean groups of related changes.
Because of this, when I submitted a pull request to the awesome Pandoc, my commit history was a mess. The owner of the project, John MacFarlane, reasonably asked me to separate things into nice clean commits. Embarrassingly, I\u2019d never done this before, but everything went more smoothly than expected.
\nThe git rebase
command allows you to rewrite the commit history of your project. There are lots of ways to do this, but I found the easiest to be:
git rebase -i HEAD~10
\nThis allows you to edit the ten most recent commits.
\nThe git documentation describes how to use this, but I had trouble using the squash option and just ended up editing them all.
As mentioned there, git reset HEAD^ resets the staged commit, and then you\u2019re free to commit as you wish before using git rebase --continue to move forward.
Implicit conversion with ==
0=="0"
\nBut no unboxing with ===
0!==new Number(0)
\nEquality is not transitive
\n"" == 0 \n0 == "0" \n"" != "0"
\nparseInt is not base ten!
\nparseInt("8"); //8\nparseInt("08"); //0
\n",
"content" : "Implicit conversion with ==
0=="0"
\nBut no unboxing with ===
0!==new Number(0)
\nEquality is not transitive
\n"" == 0 \n0 == "0" \n"" != "0"
\nparseInt is not base ten!
\nparseInt("8"); //8\nparseInt("08"); //0
\nTypes and Objects don\u2019t know about each other
\ntypeof "hello" === "string"; //true\ntypeof new String("hello") === "string"; //false\n"hello" instanceof String; //false\nnew String("hello") instanceof String; //true
\nPeople use object literals {} to construct map-like objects, but they\u2019re not real maps! Property access actually keys on the string conversion of whatever you index with. Look what happens when you try to use objects as keys:
> var a = {x:"a"}\nundefined\n> var b = {x:"b"}\nundefined\n> var obj = {}\nundefined\n> obj[a] = "a"\n"a"\n> obj[b] = "b"\n"b"\n> obj[a]\n"b"
\n",
"title" : "Javascript's Nearly Unforgiveable Sins",
"updated" : "2013-03-18T00:00:00+00:00",
"id" : "urn:uuid:458b19f8-85a8-56c5-aad0-c5939daeb593",
"link" : "http://jfolson.com/blog/2013/03/18/javascript-sins/"
},
{
"summary" : "First things first, so install rvm and all your dependencies.
\n\\curl -L https://get.rvm.io | bash -s stable\nsource $HOME/.rvm/scripts/rvm\nrvm install 1.9.3\nrvm use 1.9.3\necho "source $HOME/.rvm/scripts/rvm" >> ~/.bash_profile
\n",
"content" : "First things first, so install rvm and all your dependencies.
\n\\curl -L https://get.rvm.io | bash -s stable\nsource $HOME/.rvm/scripts/rvm\nrvm install 1.9.3\nrvm use 1.9.3\necho "source $HOME/.rvm/scripts/rvm" >> ~/.bash_profile\nbundle install
\nFor some reason, this failed when installing v8, but it worked when I just explicitly installed it:
\ngem install
\n",
"title" : "Deploying a site with Middleman and Git",
"updated" : "2013-03-07T00:00:00+00:00",
"id" : "urn:uuid:d2e86e62-6f90-5977-a159-525c45ecc9dc",
"link" : "http://jfolson.com/blog/2013/03/07/git-middleman-deploy/"
},
{
"summary" : "For my first post, I thought I\u2019d document how I built this site.
\nMiddleman is a tool for dynamically building a static website. There are a variety of advantages to building a static website. For me, I wanted a blog a
\n", "content" : "For my first post, I thought I\u2019d document how I built this site.
\nMiddleman is a tool for dynamically building a static website. There are a variety of advantages to building a static website. For me, I wanted a blog I could control without the bother of setting up, maintaining and eventually moving a database. Plus, even with my current webhost, Nearly Free Speech, it costs a bit more to add a database instance. Mostly, though, it was thinking about the trouble of dealing with a database.
\nUnfortunately, the version of less.rb that\u2019s available doesn\u2019t support LESS version 1.3.3, which is required by the current version of Twitter Bootstrap. Fortunately, someone already patched in support for v1.3.3 and submitted a pull request. In order to use this branch, just use this line in your Gemfile: gem "less", :git => "git://github.com/populr/less.rb.git", :branch => "v2.2.2-less1.3.3", submodules: true. You need to add submodules: true in order to actually grab the javascript.
Middleman-blog is an extension for middleman that makes it nice and easy to write and format blog articles using middleman. The documentation is pretty good for getting the basics up and running, but there are a few things you might want to tweak.
\n* You can move the root for the blog by setting the blog.prefix variable.
* If you choose to use pretty URLs with activate :directory_indexes, you may want to move the source for \u2018blog/index.html\u2019 to \u2018blog.html\u2019.
* You can do a lot more than the default layout for the blog! In particular, you may want to do something different when rendering a blog article. The following simply adds a level-two header containing the title:
<% if is_blog_article? %>\n <article>\n <h2><%= current_article.title %></h2>\n <%= yield %>\n </article>\n <% else %>\n <%= yield %>\n <% end %>
\n",
"title" : "Building this site: Middleman, Bootswatch and More",
"updated" : "2013-03-01T00:00:00+00:00",
"id" : "urn:uuid:8fce5982-ad6a-50ae-a4c9-4c0d97dd376c",
"link" : "http://jfolson.com/blog/2013/03/01/building-this-site/"
}
]
}
}