Hello, Manta: Bringing Unix to Big Data

I came to Joyent about two and a half years ago after being thoroughly convinced that Joyent was a place where "full stack engineering" actually happened; nothing was off-limits, systems could be built with the best abstractions, and we could take a fresh approach to tackling cloud computing problems using the right technology abstraction for each task. In particular, one of the products Joyent has long needed to build was object storage, but we didn't want to build something that was only marginally better than the storage offerings already on the market. If we were going to tackle large-scale storage, we needed to do something truly better than anything else out there. As Bryan Cantrill points out, storage is always a trail of tears, and not something you undertake lightly (having built a great enterprise storage appliance and long walked that trail of tears himself, he has some experience here).

Joyent created and maintains SmartOS, which offers some very compelling technologies for systems infrastructure: DTrace, ZFS and Zones. For a while we kicked around how ZFS could give us something better, and were fixated on how to build a better object storage service using ZFS. However, somewhere along the way, we realized the differentiating technology wasn't ZFS; it was Zones.

I've been working on the cloud for a while now (almost since "the beginning"), and one of the things I've always loathed is that it's just too hard to leverage the cloud to perform basic data processing tasks. There's cluster setup and management, data movement (ETL: extract, transform, load), and high availability (HA) management. And that's all before you write any code to actually work with your data. The realization eventually came that we could deliver a truly amazing product if we elevated compute to a first-class citizen in the product; specifically, by bringing arbitrary compute to data.

One afternoon in October of 2011, Bryan and I talked this over as a first pass, and it became clear Zones were the answer. A very short time later (a few days, if I recall), Dave Pacheco was on board and we started kicking around what this would really look like. For the first while we were also trying to integrate with Node, such that users phrased work in terms of JavaScript code, and it actually took quite a while of kicking ideas around before we realized the interface we really wanted was simply Unix. Every Unix user is familiar with data processing using pipes (I'm looking at you, find | grep | sort), and if we could truly run arbitrary compute on stored data while managing the distributed problems, large-scale data processing would become significantly easier. We were all busy with other commitments, so we really didn't start development in earnest until the spring of 2012, when we all got together in San Francisco for a "summit" and created a whiteboard architecture of how this would work.

There were obviously a lot of other problems to solve, and as I said earlier, Joyent is a "full stack" company. With some of the initial design laid out, Jerry Jelinek actually wrote the first lines of code for Manta, in the form of hyperlofs, without which Manta would not exist. Yunong Xiao wrote an HA metadata stack built on Postgres and ZooKeeper. Keith Wesolowski created custom hardware systems for storage and applications in Manta. Bill Pijewski wrote a custom deployment management system for Manta. Nate Fitch wrote garbage collection, and Fred Kuo wrote usage aggregation. And Manta couldn't actually exist without all the engineering effort Joyent has put into SmartDataCenter.

As Manta has evolved from something that barely worked in our lab, to a private beta, to public availability now, we've continually striven to make it easier and easier to use. Nobody disagrees that the simplest abstractions are best, and the further along we got, the more we realized we just wanted to make it easy to carry over existing "one-liners" to Manta, and have Manta manage the distributed coordination and "muck." In essence, we've brought the Unix philosophy to big data. From what we've seen, everyone who tries Manta (including us) walks in thinking it's object storage with a twist, but walks away realizing that it's a paradigm shift. Manta dramatically reduces the barrier to entry for big data processing, and enables completely new use cases that weren't possible before; while Unix one-liners are the "hello world" for Manta, really you can do anything you would be able to do in a full OS.
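To make the "one-liner" idea concrete, here is a minimal local sketch of the shape of such a job: the same Unix command runs once per stored object, and a second command combines the results. This is only an illustration on a single machine, not Manta's API; the names `run_job`, `map_cmd`, and `reduce_cmd` and the sample log objects are made up for this example, and it assumes a Unix-like system with `sh`, `grep`, `sort`, and `uniq` available.

```python
import subprocess


def run(cmd: str, data: str) -> str:
    """Run a shell one-liner with `data` on stdin and return its stdout."""
    result = subprocess.run(
        ["sh", "-c", cmd], input=data, capture_output=True, text=True
    )
    return result.stdout


def run_job(objects: dict[str, str], map_cmd: str, reduce_cmd: str) -> str:
    """Hypothetical local stand-in for a two-phase job over stored objects."""
    # Map phase: the same command runs once per object, independently.
    map_outputs = [run(map_cmd, body) for body in objects.values()]
    # Reduce phase: all map outputs are concatenated and fed to one command.
    return run(reduce_cmd, "".join(map_outputs))


# Made-up sample "objects": object name -> contents.
objects = {
    "logs/a.log": "GET /index\nPOST /login\nGET /about\n",
    "logs/b.log": "GET /index\nGET /index\nPOST /login\n",
}

# The familiar pipeline, split into a per-object step and a combining step.
result = run_job(objects, "grep GET", "sort | uniq -c | sort -rn")

# Parse "<count> <line>" rows, ignoring the padding that uniq -c adds.
counts = [tuple(line.split(None, 1)) for line in result.splitlines()]
print(counts)
```

In this sketch everything runs sequentially in one process; the point is only that the commands themselves are the ones you would already type at a shell, while something else decides where and how many times they run.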

It is profoundly satisfying to see Manta hit the market today. As with all revolutionary technologies, we don't really know what all the possibilities are for users; we just know that we've unlocked them. We're tremendously excited to see all the applications that will be built on top of Manta, and as we go, we'll be adding more features as we hear about various use cases. We're already planning on rich access control, triggers, SQL, real-time analytics and more.

To get started, there's a tutorial and a screencast that will walk you through installing an SDK, managing objects, and running some jobs. Give it a spin, and let us know what you think.

Post written by Mark Cavage