{ "feed": { "id": "urn:uuid:b48fb060-c28d-5798-9996-f30c6f8d947a", "link": [ { "@attributes": { "href": "\"http://jfolson.com#{current_page.url}\"", "rel": "self" } }, { "@attributes": { "href": "http://jfolson.com/" } } ], "title" : "Problem Solving", "subtitle" : "Algorithms, Programming, Math and Food", "updated" : "2013-04-11T10:36:46-04:00", "author": { "name": "Jamie F Olson" }, "rights" : "Copyright (c) 2013, Jamie F Olson", "entries" : [ { "summary" : "
I had never actually used any of the modern functional programming languages until a couple of weeks ago. However, I\u2019ve worked quite a lot with the functional features of more multi-paradigm languages (python, R, ruby, javascript, etc.), so I didn\u2019t expect to be caught off guard by Haskell.
\n", "content" : "I had never actually used any of the modern functional programming languages until a couple of weeks ago. However, I\u2019ve worked quite a lot with the functional features of more multi-paradigm languages (python, R, ruby, javascript, etc.), so I didn\u2019t expect to be caught off guard by Haskell.
\nI decided to learn Haskell because I wanted to extend the wonderful Pandoc, which is written in Haskell. This turned out to be a bit more ambitious than I expected.
\nI was almost entirely unprepared for the way Haskell thinks about types. The more functional languages I\u2019ve used have tended to have somewhat weaker type concepts, from the duck-typing of python to the not-really-a-type S3 classes in R.
In contrast, Haskell is strongly typed, and I had not appreciated just how strongly typed it is. As an uninitiated outsider, I would not have expected types to be a big part of understanding the Pandoc source code. However, in Haskell, types are \u201cmerely\u201d data containers, meaning that any and all containers/structs are new types; there is no real concept of \u201cmethod\u201d in the object-oriented sense.
\nMaking it worse was the fact that I really did not understand Haskell types. Haskell has completely different concepts for type constructors and data constructors. Types do not actually exist as values you can refer to in Haskell.
\nf :: FromType -> ToType\ndata FromType = FromTypeConstructor Value
\nHere, you can create an object of type FromType with FromTypeConstructor value if value has type Value. However, FromType itself has no value.
This gets particularly complicated since types and constructors have completely separate namespaces! As an example, in Text.JSON, JSObject is both a type and a constructor, but the JSObject constructor does not create objects of type JSObject. Instead, JSONObject must be used to construct objects of type JSObject, while the constructor JSObject creates objects of type JSValue.
This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but
\n", "content" : "This fall, I found myself writing a number of Hive SQL queries, which was fun. The problem was that I was trying to do some not entirely simple things. In particular, I was trying to implement some of the R functions for data frames, e.g.\u00a0melt, aggregate, etc. I started out writing these as shell scripts, but that quickly became uncomfortable. Around that time I started to take another look at the RHadoop packages for interacting with Hadoop in R. They are, in fact, excellent, offering a surprisingly seemless experience. Motivated by this experience, I decided to rewrite my shell scripts and the code using them in R.
\nThere are a couple of different existing R packages for interacting with Hive, but I found them unsatisfactory. The most complete of these is nexr\u2019s RHive. It seems to have lots of great features, but, among other things, it didn\u2019t work for me: it uses a Thrift connection, and it failed to properly construct results as R objects when I tried it. There were a couple of other things I wasn\u2019t thrilled with about RHive. Its API has a lot of great features, but it requires users to explicitly export any variables and functions that are required by map or reduce functions. This leads to a substantially different experience from RHadoop, where the goal is to make the boundary between R and Hadoop invisible.
\nThese experiences led me to write the rhive package, which is an attempt to bring the ease of use of RHadoop to Hive for R. The principal features of rhive are:
\nFor information on the current state of the project, check it out on GitHub.
\n", "title" : "Introducing rhive", "updated" : "2013-03-28T00:00:00+00:00", "id" : "urn:uuid:beeba428-8c02-5c70-9b9c-efdbbb1685a8", "link" : "http://jfolson.com/blog/2013/03/28/introducing-rhive/" }, { "summary" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are
\n", "content" : "For a variety of reasons, some good and some bad, I wanted to use S4 classes for the table objects in rsql. Among other things, S4 gives you inheritance, prototypes, and some other neat stuff. Just as important to me, but totally superficial is the fact that the contents (ie slots) of an S4 object are retrieved using the \u2018@\u2019 operator, freeing the \u2018$\u2019 operator for fun.
\nSpecifically, I decided to use the \u2018$\u2019 operator to access references to a table\u2019s columns:
\n> tab$x\ntab.x
\nThis makes a variety of things much cleaner and more pleasant to use.
\nI hadn\u2019t had a lot of experience in writing S4 classes, so I was surprised to learn that it is not possible for S4 methods to have unevaluated arguments.
\nThis is problematic since a wide variety of frequently used (S3) functions delay evaluation of their arguments for a specific context (usually a data frame). The most notable example of this is subset, but there are plenty of others.
> dat = data.frame(x=rnorm(10),y=1:10)\n> dat.sub = subset(dat,subset=(x>0))
\nSince arguments to S4 methods are evaluated while selecting an appropriate method, this is not possible. Instead, it becomes necessary to immediately capture the unevaluated expression using something like plyr\u2019s .() function, as in this example from the documentation for reshape2::dcast:
#Air quality example\nnames(airquality) <- tolower(names(airquality))\naqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)\n\nacast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
\nSo that\u2019s what I did, too.
\nThis story is really uninteresting unless you realize how much work I put into capturing and passing those unevaluated expressions around before I needed to transition to S4, when it all stopped working.
\n", "title" : "Designing rsql: S3, S4 and expression evaluation in R", "updated" : "2013-03-27T00:00:00+00:00", "id" : "urn:uuid:eca7a615-29c7-5602-96eb-99a0a71debd9", "link" : "http://jfolson.com/blog/2013/03/27/rsql-s3-s4/" }, { "summary" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R
\n", "content" : "SQL is a powerful language for manipulating data. R is a powerful language for manipulating data. Frequently, data to be analyzed in R actually comes from a database using a SQL query. Fortunately, there are a variety of great R packages for interacting with databases and everything goes smoothly.
\nEventually, you find yourself wanting to start doing a little bit more of the SQL in R, since R functions tend to be easier to document, generalize and reuse than SQL scripts. Now, everything is great! You\u2019ve got R functions that get or manipulate data in database tables before bringing the data into R. Everything works, unless you make a mistake. Unfortunately, the arguments for those functions are just strings, so it\u2019s rather cumbersome to combine all those things programmatically. Seemingly small generalizations become more and more difficult because, ultimately, R is not SQL.
\nWhen you operate on a data frame, you write this:
\nx.sub = subset(x,subset=(y>0))
\nNot this:
\nx.sub = subset(x,subset=("y > 0"))
\nSo why would you want to do that just because the data is in a database?
\nUnfortunately, things are just barely less awesome:
\nx.sub = subset(x,subset=.(y>0))
\nCheck it out on github.
\n", "title" : "Introducting rsql: programming SQL in R", "updated" : "2013-03-26T00:00:00+00:00", "id" : "urn:uuid:59bfe5f3-31bc-5719-8a32-a70ba1a81dbd", "link" : "http://jfolson.com/blog/2013/03/26/introducing-rsql/" }, { "summary" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to
\n", "content" : "Typedbytes is a binary format for serializing data that is supported by Hadoop streaming. Several different Hadoop applications have found dramatic performance improvements by transitioning from text formats (e.g.\u00a0csv) to typedbytes. The format is by no means perfect, though. Among other things, typedbytes 8 and 9 seem largely redundant, there\u2019s no distinction between a homogeneous collection and a heterogeneous collecetion, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
\nThere are a great many things that are strange about Hive\u2019s support for typedbytes. First, Hive contains a duplicate but nearly identical version of the Hadoop streaming typedbytes implementation. To be fair, this is somewhat understandable since the Hadoop implementation is (in my opinion) excessively protective about access to variables, while also using an Enum for the type codes. Taken together, this makes it quite difficult to extend Hadoop\u2019s implementation without duplicating the whole thing (i.e.\u00a0subclassing won\u2019t let you do much).
\nThe larger issue is that files created by Hive\u2019s typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the typedbytes objects Hive serializes into and deserializes from are actually generic Binary objects (TypedBytes objects), which means that they are serialized by Hadoop as Objects, not as the actual typedbytes bit sequence contained in the object.
\nAnother problem is that Hive really doesn\u2019t support much of the typedbytes spec. Particularly, complex objects (list, vector and map) can only be serialized and deserialized as their JSON representations. Hive\u2019s typedbytes SerDe will fail on encountering any non-primitives in the typedbytes sequence.
\nThe typedbytes spec allows for application-specific type-codes (anything between 50 and 200). Hive is unable to support these type-codes and will simply fail.
\nNeedless to say, these are problems for getting RHadoop-styled functionality for Hive tables with R. I\u2019ve solved these issues and a few more in the current version of rhive. It\u2019s currently focused primarily on compatibility with RHadoop\u2019s rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools.
\nIn particular, the SerDe for rhive supports application-specific type-codes. These types are converted into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary types are assumed to be an entire typedbytes sequence, and serialized as such unless they cannot be (invalid type or incorrect length).
\n", "title" : "Hive, Hadoop and TypedBytes", "updated" : "2013-03-24T00:00:00+00:00", "id" : "urn:uuid:59373f76-f9d6-5e86-ba73-44b52f95bb54", "link" : "http://jfolson.com/blog/2013/03/24/hive-typedbytes/" }, { "summary" : "One of the things I\u2019ve liked about git is how much it encourages responsible and clean commits. Personally, I\u2019m not there yet. I commit when I think to, usually when I\u2019m at a reasonable stopping point, which means I just end up doing git commit -a
rather than adding and committing clean groups of related
One of the things I\u2019ve liked about git is how much it encourages responsible and clean commits. Personally, I\u2019m not there yet. I commit when I think to, usually when I\u2019m at a reasonable stopping point, which means I just end up doing git commit -a
rather than adding and committing clean groups of related changes.
Because of this, when I submitted a pull request to the awesome Pandoc, my commit history was a mess. The owner of the project, John MacFarlane, reasonably asked me to separate things into nice clean commits. Embarrassingly, I\u2019d never done this before, but everything went more smoothly than expected.
\nThe git rebase
command allows you to rewrite the commit history of your project. There are lots of ways to do this, but I found the easiest to be:
git rebase -i HEAD~10
\nThis allows you to edit the ten most recent commits.
\nThe git documentation describes how to use this, but I had trouble using the squash option and just ended up editing them all.
As mentioned there, git reset HEAD^ resets the staged commit, and then you\u2019re free to commit as you wish before using git rebase --continue to move forward.
Implicit conversion with ==
0=="0"
\nBut no unboxing with ===
0!==new Number(0)
\nEquality is not transitive
\n"" == 0 \n0 == "0" \n"" != "0"
\nparseInt is not base ten!
\nparseInt("8"); //8\nparseInt("08"); //0
\n",
"content" : "Implicit conversion with ==
0=="0"
\nBut no unboxing with ===
0!==new Number(0)
\nEquality is not transitive
\n"" == 0 \n0 == "0" \n"" != "0"
\nparseInt is not base ten!
\nparseInt("8"); //8\nparseInt("08"); //0
\nTypes and Objects don\u2019t know about each other
\ntypeof "hello" === "string"; //true\ntypeof new String("hello") === "string"; //false\n"hello" instanceof String; //false\nnew String("hello") instanceof String; //true
\nPeople use object literals {} to construct map-like objects, but they\u2019re not real maps! Property access actually keys on the string conversion of whatever you index with. Look what happens when you try to use objects as keys:
> var a = {x:"a"}\nundefined\n> var b = {x:"b"}\nundefined\n> var obj = {}\nundefined\n> obj[a] = "a"\n"a"\n> obj[b] = "b"\n"b"\n> obj[a]\n"b"
\n",
"title" : "Javascript's Nearly Unforgiveable Sins",
"updated" : "2013-03-18T00:00:00+00:00",
"id" : "urn:uuid:458b19f8-85a8-56c5-aad0-c5939daeb593",
"link" : "http://jfolson.com/blog/2013/03/18/javascript-sins/"
},
{
"summary" : "First things first, so install rvm and all your dependencies.
\n\\curl -L https://get.rvm.io | bash -s stable\nsource $HOME/.rvm/scripts/rvm\nrvm install 1.9.3\nrvm use 1.9.3\necho "source $HOME/.rvm/scripts/rvm" >> ~/.bash_profile
\n",
"content" : "First things first, so install rvm and all your dependencies.
\n\\curl -L https://get.rvm.io | bash -s stable\nsource $HOME/.rvm/scripts/rvm\nrvm install 1.9.3\nrvm use 1.9.3\necho "source $HOME/.rvm/scripts/rvm" >> ~/.bash_profile\nbundle install
\nFor some reason, this failed when installing v8, but it worked when I just explicitly installed it:
\ngem install
\n",
"title" : "Deploying a site with Middleman and Git",
"updated" : "2013-03-07T00:00:00+00:00",
"id" : "urn:uuid:d2e86e62-6f90-5977-a159-525c45ecc9dc",
"link" : "http://jfolson.com/blog/2013/03/07/git-middleman-deploy/"
},
{
"summary" : "For my first post, I thought I\u2019d document how I built this site.
\nMiddleman is a tool for dynamically building a static website. There are a variety of advantages to building a static website. For me, I wanted a blog a
\n", "content" : "For my first post, I thought I\u2019d document how I built this site.
\nMiddleman is a tool for dynamically building a static website. There are a variety of advantages to building a static website. For me, I wanted a blog I could control without the bother of setting up, maintaining and eventually moving a database. Plus, even with my current webhost, Nearly Free Speech, it costs a bit more to add a database instance. Mostly, though, it was thinking about the trouble of dealing with a database.
\nUnfortunately, the version of less.rb that\u2019s available doesn\u2019t support LESS version 1.3.3, which is required by the current version of Twitter Bootstrap. Fortunately, someone already patched in support for v1.3.3 and submitted a pull request. In order to use this branch, just use this line in your Gemfile: gem "less", :git => "git://github.com/populr/less.rb.git", :branch => "v2.2.2-less1.3.3", submodules: true. You need to add submodules: true in order to actually grab the javascript.
Middleman-blog is an extension for middleman that makes it nice and easy to write and format blog articles using middleman. The documentation is pretty good for getting the basics up and running, but there are a few things you might want to tweak.
\n* You can move the root for the blog by setting the blog.prefix variable.
* If you choose to use pretty URLs with activate :directory_indexes, you may want to move the source for \u2018blog/index.html\u2019 to \u2018blog.html\u2019.
* You can do a lot more than the default layout for the blog! In particular, you may want to do something different when rendering a blog article. The following simply adds a level-two header containing the title:
<% if is_blog_article? %>\n <article>\n <h2><%= current_article.title %></h2>\n <%= yield %>\n </article>\n <% else %>\n <%= yield %>\n <% end %>
\n",
"title" : "Building this site: Middleman, Bootswatch and More",
"updated" : "2013-03-01T00:00:00+00:00",
"id" : "urn:uuid:8fce5982-ad6a-50ae-a4c9-4c0d97dd376c",
"link" : "http://jfolson.com/blog/2013/03/01/building-this-site/"
}
]
}
}