Hive, Hadoop and TypedBytes

Hadoop Streaming and TypedBytes

Typedbytes is a binary format for serializing data that is supported by Hadoop Streaming. Several Hadoop applications have seen dramatic performance improvements by transitioning from text formats (e.g. CSV) to typedbytes. The format is by no means perfect, though. Among other things, type codes 8 (vector) and 9 (list) seem largely redundant, there's no distinction between a homogeneous collection and a heterogeneous collection, the initial version lacks support for a null type, etc. Despite these limitations, the real-world performance gains make it a desirable format to support.
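
For the unfamiliar, the wire format itself is simple: every value is a one-byte type code followed by a big-endian payload. Here's a minimal sketch in plain Java (no Hadoop dependency; the class and method names are mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Minimal sketch of the typedbytes wire format: a one-byte type code
// followed by a big-endian payload.
public class TypedBytesSketch {
    public static byte[] encodeInt(int value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeByte(3);     // type code 3: 32-bit integer
        out.writeInt(value);  // 4-byte big-endian payload
        return buf.toByteArray();
    }

    public static byte[] encodeString(String value) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.writeByte(7);          // type code 7: string
        out.writeInt(utf8.length); // 4-byte length prefix
        out.write(utf8);           // UTF-8 payload
        return buf.toByteArray();
    }
}
```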

Hive typedbytes are not Hadoop typedbytes

There are a great many strange things about Hive's support for typedbytes. First, Hive contains a duplicate but nearly identical version of the Hadoop Streaming typedbytes implementation. To be fair, this is somewhat understandable, since the Hadoop implementation is (in my opinion) excessively protective about access to its variables while also using an enum for the type codes. Taken together, this makes it quite difficult to extend Hadoop's implementation without duplicating the whole thing, i.e. subclassing won't let you do much.

The larger issue is that files created by Hive's typedbytes SerDe are still a heck of a long way from the format required by Hadoop Streaming. For example, the typedbytes objects Hive serializes into and deserializes from are actually generic Binary objects, which means that they are serialized by Hadoop as Objects, not as the actual typedbytes bit sequence contained in the object.
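
To see why this matters, consider what happens when a raw typedbytes sequence is stored in a Writable and that Writable is itself serialized: the output picks up its own length header and is no longer a bare typedbytes stream. A small illustration of the general wrapping problem (not a trace of Hive's exact code path):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;

// Serializing a Writable that *contains* a typedbytes sequence is not
// the same as emitting the sequence itself.
public class WrappingSketch {
    public static void main(String[] args) throws IOException {
        byte[] raw = {3, 0, 0, 0, 42}; // typedbytes for int 42: code 3 + payload

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new BytesWritable(raw).write(new DataOutputStream(buf));

        // buf now holds a 4-byte length header followed by the 5 raw bytes,
        // so a consumer expecting bare typedbytes will misparse it.
        System.out.println("raw: " + raw.length
            + " bytes, serialized: " + buf.size() + " bytes"); // 5 vs 9
    }
}
```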

Another problem is that Hive doesn't support much of the typedbytes spec. In particular, complex objects (list, vector, and map) can only be serialized and deserialized as their JSON representations; Hive's typedbytes SerDe will fail on encountering any non-primitive in the typedbytes sequence.
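
For contrast, here is roughly what complex-type support looks like under the spec: vectors are count-prefixed, lists run until a marker byte (255), and maps are count-prefixed key/value pairs. A sketch of a recursive reader (primitive cases abbreviated; names are mine):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Recursive typedbytes reader covering the complex types Hive rejects.
public class ComplexReaderSketch {
    public static Object read(DataInputStream in) throws IOException {
        return read(in.readUnsignedByte(), in);
    }

    private static Object read(int code, DataInputStream in) throws IOException {
        switch (code) {
            case 3: return in.readInt();     // int (other primitives elided)
            case 6: return in.readDouble();  // double
            case 8: {                        // vector: 4-byte count, then elements
                int n = in.readInt();
                List<Object> v = new ArrayList<>(n);
                for (int i = 0; i < n; i++) v.add(read(in));
                return v;
            }
            case 9: {                        // list: elements until marker (255)
                List<Object> l = new ArrayList<>();
                int c;
                while ((c = in.readUnsignedByte()) != 255) l.add(read(c, in));
                return l;
            }
            case 10: {                       // map: 4-byte count, then key/value pairs
                int n = in.readInt();
                Map<Object, Object> m = new HashMap<>(n);
                for (int i = 0; i < n; i++) m.put(read(in), read(in));
                return m;
            }
            default: throw new IOException("unsupported type code: " + code);
        }
    }
}
```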

Hive destroys application-specific type codes

The typedbytes spec allows for application-specific type codes (anything between 50 and 200). Hive is unable to support these type codes and will simply fail on encountering one.
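
A tolerant reader can instead pass these codes through as opaque blobs. The sketch below assumes the payload after a custom code carries a 4-byte length prefix (which, as I understand it, is how Hadoop's own raw reader treats this range); the helper name is mine:

```java
import java.io.DataInputStream;
import java.io.IOException;

// Pass an application-specific type code (50-200) through as an opaque
// blob, preserving the full sequence (code included) for round-tripping.
public class CustomCodeSketch {
    public static byte[] readCustom(int code, DataInputStream in) throws IOException {
        if (code < 50 || code > 200) {
            throw new IOException("not an application-specific code: " + code);
        }
        int len = in.readInt();         // assumed 4-byte length prefix
        byte[] seq = new byte[5 + len]; // code + length + payload, kept whole
        seq[0] = (byte) code;
        seq[1] = (byte) (len >>> 24);
        seq[2] = (byte) (len >>> 16);
        seq[3] = (byte) (len >>> 8);
        seq[4] = (byte) len;
        in.readFully(seq, 5, len);
        return seq;
    }
}
```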

Fixing this

Needless to say, these are problems for getting RHadoop-style functionality for Hive tables with R. I've solved these issues and a few more in the current version of rhive. It's currently focused primarily on compatibility with RHadoop's rmr package, but it should be helpful in getting Hive to work with other Hadoop Streaming-based tools as well.

In particular, the SerDe for rhive supports application-specific type codes. These types are deserialized into Binary objects containing the entire typedbytes sequence, including the type code. When serializing, Binary values are assumed to be complete typedbytes sequences and are serialized as such unless they cannot be (an invalid type code or an inconsistent length).
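
In sketch form, that serialization-side check might look like the following (simplified; the helper names are illustrative, and the fallback to plain bytes, type code 0, is just one reasonable choice):

```java
import java.io.DataOutputStream;
import java.io.IOException;

// Emit a binary value verbatim if it plausibly *is* a complete typedbytes
// sequence (valid custom code, self-consistent length); otherwise fall
// back to plain bytes (type code 0).
public class BinaryPassThroughSketch {
    static boolean looksLikeTypedBytes(byte[] b) {
        if (b.length < 5) return false;
        int code = b[0] & 0xFF;
        if (code < 50 || code > 200) return false; // only custom codes here
        int len = ((b[1] & 0xFF) << 24) | ((b[2] & 0xFF) << 16)
                | ((b[3] & 0xFF) << 8) | (b[4] & 0xFF);
        return b.length == 5 + len;                // declared vs actual length
    }

    static void writeBinary(byte[] b, DataOutputStream out) throws IOException {
        if (looksLikeTypedBytes(b)) {
            out.write(b);          // already a complete sequence: pass through
        } else {
            out.writeByte(0);      // fallback: type code 0 (bytes)
            out.writeInt(b.length);
            out.write(b);
        }
    }
}
```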
