Jamie F Olson

Problem Solver.

Problem Solving Algorithms, Programming, Math and Food
Copyright (c) 2013, Jamie F Olson

Haskell: Initial reactions (2013-03-29)

Background

I had never actually used any of the modern functional programming languages until a couple of weeks ago. However, I’ve worked quite a lot with the functional features of more multi-paradigm languages (python, R, ruby, javascript, etc.), so I didn’t expect to be caught off guard by Haskell.

I decided to learn Haskell because I wanted to extend the wonderful Pandoc, which is written in Haskell. This turned out to be a bit more ambitious than I expected.

Surprises

I was almost entirely unprepared for the way Haskell thinks about types. The more functional languages I’ve used have tended to have somewhat weaker type concepts, from the duck-typing of python to the not-really-a-type S3 classes in R.

The types are strong with this one

In contrast, Haskell is strongly typed, and I had not appreciated just how strongly typed it is. As an uninitiated outsider, I did not expect types to be a big part of understanding the Pandoc source code. However, in Haskell, types are “merely” data containers, meaning that any and all containers/structs are new types. This follows from the fact that there is no real concept of “method” in the object-oriented sense.

Making it worse was the fact that I really did not understand Haskell types. Haskell has completely different concepts for type constructors and data constructors, and types do not actually exist as anything you can refer to in Haskell code.

f :: FromType -> ToType
data FromType = FromTypeConstructor Value

Here, you can create an object of type FromType with FromTypeConstructor value if value has type Value. However, FromType itself has no value.

When names collide

This gets particularly complicated since types and constructors have completely separate namespaces! As an example, in Text.JSON, JSObject is both a type and a constructor, but the JSObject constructor does not create JSObject type objects. Instead, JSONObject must be used to construct objects of type JSObject, while the constructor JSObject creates objects of type JSValue.
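To make the collision concrete, here is a stripped-down sketch of the declarations involved (simplified from the actual Text.JSON source, which also uses record syntax and has more variants):

```haskell
-- The *type* JSObject is built by the data constructor JSONObject ...
newtype JSObject e = JSONObject [(String, e)]

-- ... while the *data constructor* JSObject builds a JSValue.
data JSValue = JSNull
             | JSString String
             | JSObject (JSObject JSValue)

main :: IO ()
main = case JSObject (JSONObject []) of
         JSObject _ -> putStrLn "the JSObject constructor builds a JSValue"
         _          -> putStrLn "unreachable"
```

Same name, two namespaces: in a type signature JSObject means the newtype; in an expression it means the JSValue constructor.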

Introducing rhive (2013-03-28)

Motivation

This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g. melt, aggregate, etc. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I started to take another look at the RHadoop packages for interacting with Hadoop in R. They are, in fact, excellent, offering a surprisingly seamless experience. Motivated by this experience, I decided to rewrite my shell scripts and the code using them in R.

There are a couple of different existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is nexr’s RHive. It seems to have lots of great features, but, among other things, it didn’t work for me: it uses a Thrift connection and failed to properly construct results as R objects when I tried it. There were a couple of other things I wasn’t thrilled with about RHive. Its API has a lot of great features, but it requires users to explicitly export any variables and functions that are required by map or reduce functions. This leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.

The rhive package

These experiences led me to write the rhive package, which is an attempt to bring the ease of use of RHadoop to Hive for R. The principal features of rhive are:

  • Creation and querying of Hive tables from R
  • Importing and exporting data from Hive tables
  • Manipulation of Hive tables, just like data frames in R
  • Applying map and reduce functions defined in R to Hive tables and storing the results in Hive tables

For information on the current state of the project, check it out on github.

Designing rsql: S3, S4 and expression evaluation in R (2013-03-27)

S4 for rsql

For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, though totally superficial, is the fact that the contents (i.e. slots) of an S4 object are retrieved using the ‘@’ operator, freeing the ‘$’ operator for fun.

Specifically, I decided to use the ‘$’ operator to access references to a table’s columns:

> tab$x
tab.x

This makes a variety of things much cleaner and more pleasant to use.

Problem: S4 evaluates arguments

I hadn’t had a lot of experience in writing S4 classes, so I was surprised to learn that it is not possible for S4 methods to have unevaluated arguments.

This is problematic since a wide variety of frequently used (S3) functions delay evaluation of their arguments for a specific context (usually a data frame). The most notable example of this is subset, but there are plenty of others.

> dat = data.frame(x=rnorm(10),y=1:10)
> dat.sub = subset(dat,subset=(x>0))

Since arguments to S4 methods are evaluated while selecting an appropriate method, this is not possible. Instead, it becomes necessary to immediately capture the unevaluated expression using something like plyr::., as in this example from the documentation for plyr::dcast:

#Air quality example
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)

acast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))

The end

So that’s what I did, too.

Addendum

This story is really uninteresting if you don’t realize how much work I put into capturing and passing those unevaluated expressions around before I transitioned to S4 and it all stopped working.

Introducing rsql: programming SQL in R (2013-03-26)

SQL is great, but it’s not R

SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R packages for interacting with databases and everything goes smoothly.

Eventually, you find yourself wanting to start doing a little bit more of the SQL in R, since R functions tend to be easier to document, generalize and reuse than SQL scripts. Now, everything is great! You’ve got R functions that get or manipulate data in database tables before bringing the data into R. Everything works, unless you make a mistake. Unfortunately, the arguments to those functions are just strings, so it’s rather cumbersome to combine all those pieces programmatically. Seemingly small generalizations become more and more difficult because, ultimately, R is not SQL.

Why rsql is better

When you operate on a data frame, you write this:

x.sub = subset(x,subset=(y>0))

Not this:

x.sub = subset(x,subset=("y > 0"))

So why would you want to do that just because the data is in a database?

Caveats

Unfortunately, things are just barely less awesome:

x.sub = subset(x,subset=.(y>0))

Check it out on github.

Hive, Hadoop and TypedBytes (2013-03-24)

Hadoop Streaming and TypedBytes

Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g. csv) to typedbytes. The format is by no means perfect, though. Among other things, typedbytes 8 and 9 seem largely redundant, there’s no distinction between a homogeneous collection and a heterogeneous collection, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
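To make the format concrete: per the typedbytes spec, every value is a one-byte type code followed by a big-endian payload. A minimal sketch of two primitive encodings (in Python for brevity; `tb_int` and `tb_string` are hypothetical helper names, not part of any library):

```python
import struct

# Type code 3 is a 32-bit signed int; type code 7 is a UTF-8 string
# prefixed with its byte length as a 32-bit int. Both are big-endian.
def tb_int(n):
    return struct.pack(">Bi", 3, n)

def tb_string(s):
    data = s.encode("utf-8")
    return struct.pack(">Bi", 7, len(data)) + data

print(tb_int(1))        # b'\x03\x00\x00\x00\x01'
print(tb_string("hi"))  # b'\x07\x00\x00\x00\x02hi'
```

A stream of such records, keys alternating with values, is what Hadoop Streaming expects on stdin/stdout when typedbytes is enabled.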

Hive typedbytes are not Hadoop typedbytes

There are a great many things that are strange about Hive’s support for typedbytes. First, Hive contains a duplicate, nearly identical version of the Hadoop streaming typedbytes implementation. To be fair, this is somewhat understandable, since the Hadoop implementation is (in my opinion) excessively protective about access to variables, while also using an Enum for the type codes. Taken together, this makes it quite difficult to extend Hadoop’s implementation without duplicating the whole thing; subclassing won’t let you do much.

The larger issue is that files created by Hive’s typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the objects Hive serializes into and deserializes from are generic Binary objects rather than TypedBytes objects, which means that they are serialized by Hadoop as Objects, not as the actual typedbytes bit sequence contained in the object.

Another problem is that Hive really doesn’t support much of the typedbytes spec. Particularly, complex objects (list, vector and map) can only be serialized and deserialized as their JSON representations. Hive’s typedbytes SerDe will fail on encountering any non-primitives in the typedbytes sequence.

Hive destroys application-specific type codes

The typedbytes spec allows for application-specific type-codes (anything between 50 and 200). Hive is unable to support these type-codes and will simply fail.

Fixing this

Needless to say, these are problems for getting RHadoop-style functionality for Hive tables with R. I’ve solved these issues and a few more in the current version of rhive. It’s currently focused primarily on compatibility with RHadoop’s rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools.

In particular, the serde for rhive supports application-specific type-codes. These types are converted into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary values are assumed to be an entire typedbytes sequence and are serialized as such unless they cannot be (invalid type code or incorrect length).

Rewriting commits with git is easier than you think (2013-03-21)

Trying to help

One of the things I’ve liked about git is how much it encourages responsible and clean commits. Personally, I’m not there yet. I commit when I think to, usually when I’m at a reasonable stopping point, which means I just end up doing git commit -a rather than adding and committing clean groups of related changes.

Because of this, when I submitted a pull request to the awesome Pandoc, my commit history was a mess. The owner of the project, John MacFarlane, reasonably asked me to separate things into nice clean commits. Embarrassingly, I’d never done this before, but everything went a bit more smoothly than expected.

The git rebase command allows you to rewrite the commit history of your project. There are lots of ways to do this, but I found the easiest to be:

git rebase -i HEAD~10

This will allow you to edit the ten most recent commits.

The git documentation describes how to use this, but I had trouble using the squash option and just ended up editing them all.

As mentioned there, git reset HEAD^ resets the staged commit and then you’re free to commit as you wish before using git rebase --continue to move forward.

Javascript's Nearly Unforgivable Sins (2013-03-18)

Logic

Implicit conversion with ==

0=="0"

But no unboxing with ===

0!==new Number(0)

Equality is not transitive

"" == 0 
0 == "0" 
"" != "0"

parseInt is not base ten!

parseInt("8"); //8
parseInt("08"); //0
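The usual defence is to always pass the radix explicitly, which sidesteps the legacy octal interpretation:

```javascript
// An explicit radix makes the result predictable:
console.log(parseInt("08", 10)); // 8
console.log(parseInt("ff", 16)); // 255
```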

Types and Objects don’t know about each other

typeof "hello" === "string"; //true
typeof new String("hello") === "string"; //false
"hello" instanceof String; //false
new String("hello") instanceof String; //true

No Actual Map

People use object literals {} to construct map-like objects, but they’re not real maps! They actually key on the string conversion of the object. Look what happens when you try to use objects as keys:

> var a = {x:"a"}
undefined
> var b = {x:"b"}
undefined
> var obj = {}
undefined
> obj[a] = "a"
"a"
> obj[b] = "b"
"b"
> obj[a]
"b"
Deploying a site with Middleman and Git (2013-03-07)

Installing Middleman

First things first, so install rvm and all your dependencies.

\curl -L https://get.rvm.io | bash -s stable
source $HOME/.rvm/scripts/rvm
rvm install 1.9.3
rvm use 1.9.3
echo "source $HOME/.rvm/scripts/rvm" >> ~/.bash_profile
bundle install

For some reason, this failed when installing v8, but it worked when I just explicitly installed it:

gem install
Building this site: Middleman, Bootswatch and More (2013-03-01)

For my first post, I thought I’d document how I built this site.

Middleman

Middleman is a tool for dynamically building a static website. There are a variety of advantages to building a static website. For me, I wanted a blog I could control without the bother of setting up, maintaining and eventually moving a database. Plus, even with my current webhost, nearly free speech, it costs a bit more to add a database instance. Mostly, though, it was the thought of dealing with a database.

Bootstrap and LESS

Unfortunately, the version of less.rb that’s available doesn’t support LESS version 1.3.3, which is required by the current version of Twitter Bootstrap. Fortunately, someone already patched in support for v1.3.3 and submitted a pull request. In order to use this branch, just use this line in your Gemfile:

gem "less", :git => 'git://github.com/populr/less.rb.git', :branch => "v2.2.2-less1.3.3", submodules: true

You need to add submodules: true in order to actually grab the javascript.

Middleman-blog

Middleman-blog is an extension for middleman that makes it nice and easy to write and format blog articles using middleman. The documentation is pretty good for getting the basics up and running, but there are a few things you might want to tweak.

  • You can move the root for the blog by setting the blog.prefix variable.
  • If you choose to use pretty urls with activate :directory_indexes, you may want to move the source for ‘blog/index.html’ to ‘blog.html’.
  • You can do a lot more than the default layout for the blog! In particular, you may want to do something different when rendering a blog article. The following simply adds a level-two header containing the title:

    <% if is_blog_article? %>
        <article>
            <h2><%= current_article.title %></h2>
            <%= yield %>
        </article>
    <% else %>
        <%= yield %>
    <% end %>