Infobright, MonetDB and HadoopDB: Open source DBMS solutions
I’ve recently been taking a look at Vertica, a relatively new player in the analytics DBMS world (of which there are now several contenders, including some known names like Netezza and Greenplum). Vertica’s flagship product is an MPP database solution that apparently has some huge scalability potential (we keep reading about handfuls of nodes for now, but the research suggests it can scale much further, and that doing so is rather straightforward).
The most significant "sell" Vertica (and folks like Exasol) have over other solutions is that their product is a column-store (or column-oriented) database, rather than the better known row-store. Column-store databases (where "tables" are separated into one object per column) have a significant advantage in the analytics world. First, a typical BI query tends to touch only a select set of columns from each table (we typically don’t require all dimension columns from a star query), which means less data (less I/O) to sift through. Second, a column-store database will typically get much higher compression, since each column object is consistent in type and, for fixed-width types, length. And vendors like Vertica have the ability to perform functions right off the compressed data set (vs. having to decompress first). According to their benchmarks, the end result is a blazing fast DB for typical analytic queries (note that there is plenty of discussion on when a column-store is not the right choice for an enterprise data warehouse, but that’s for a later blog post). All exciting stuff …
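To make the column-store argument a bit more concrete, here is a minimal Python sketch. It is only a toy illustration, not how Vertica or any other product is implemented: the table, the column names and the helper functions `rle_encode` and `count_by_value` are all made up for this example, and run-length encoding is just one simple compression scheme picked for illustration.

```python
from itertools import groupby

# Hypothetical toy fact table, stored row-wise: (order_id, region, amount).
rows = [
    (1, "east", 10),
    (2, "east", 20),
    (3, "east", 5),
    (4, "west", 7),
    (5, "west", 3),
]

# The same data stored column-wise: one list ("object") per column.
columns = {
    "order_id": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def rle_encode(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def count_by_value(encoded):
    """Count rows per value directly on the encoded runs, without decompressing."""
    counts = {}
    for value, run_length in encoded:
        counts[value] = counts.get(value, 0) + run_length
    return counts


# A query such as SUM(amount) only has to read one of the three columns.
total_amount = sum(columns["amount"])

# Repeated values in a column collapse into a few runs.
region_rle = rle_encode(columns["region"])

# COUNT(*) GROUP BY region, answered from the compressed form.
region_counts = count_by_value(region_rle)

print(total_amount)   # 45
print(region_rle)     # [('east', 3), ('west', 2)]
print(region_counts)  # {'east': 3, 'west': 2}
```

The sum never touches `order_id` or `region`, and the count is computed from two runs instead of five values. Real engines use more sophisticated encodings, but the intuition is the same.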
So that led to checking out what open source alternatives might be out there. Two of the more talked about column-store databases are Infobright and MonetDB. Both seem to be pretty active projects, and both seem to be backed by some bigger companies that offer commercial support. Neither, however, supports an MPP architecture at this point. Infobright has a blog post that seems to indicate they are considering it, but as of now, scalability is upward more than outward. Not that SMP is necessarily a bad thing; I know it doesn’t have the marketing feel MPP does nowadays, but it really is a simpler approach, and we don’t typically see a huge need for MPP in the data mart world (ODS, maybe, but a client with average volumes won’t see a need for that much processing on summarized data). Ultimately, it would be nice to have both options and I’d love to see a truly open source MPP database.
That’s where HadoopDB comes in. Recently released, this project can be the glue that brings it all together. Here is the description from Daniel Abadi’s release blog post:
"It’s an open source stack that includes PostgreSQL, Hadoop, and Hive, along with some glue between PostgreSQL and Hadoop, a catalog, a data loader, and an interface that accepts queries in MapReduce or SQL and generates query plans that are processed partly in Hadoop and partly in different PostgreSQL instances spread across many nodes in a shared-nothing cluster of machines. In essence it is a hybrid of MapReduce and parallel DBMS technologies. But unlike Aster Data, Greenplum, Pig, and Hive, it is not a hybrid simply at the language/interface level. It is a hybrid at a deeper, systems implementation level. Also unlike Aster Data and Greenplum, it is free and open source."
This is great stuff, especially if "deeper, systems implementation level" means a solution to some of the fault tolerance vs. performance issues Daniel discusses in the article. Better yet, he claims they can switch out the underlying database (PostgreSQL) to a column-store like MonetDB or Infobright. Now that would be a huge win!
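The split described in the quote, with part of a query running inside the per-node databases and part in a MapReduce layer, can be sketched roughly like this. To be clear, this is not HadoopDB’s code or API: SQLite (from the Python standard library) stands in for the PostgreSQL instances, plain Python functions stand in for Hadoop, and the table, data and function names are all assumptions made up for illustration.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create one stand-in 'node': an independent in-memory SQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


# Hypothetical data, partitioned across two nodes that share nothing.
nodes = [
    make_node([("east", 10), ("west", 7)]),
    make_node([("east", 20), ("west", 3), ("north", 5)]),
]

# Step 1 (database side): each node computes a partial aggregate in SQL.
PARTIAL_SQL = "SELECT region, SUM(amount) FROM sales GROUP BY region"


def map_phase(nodes):
    """Run the pushed-down SQL on every node and stream back the partial rows."""
    for conn in nodes:
        yield from conn.execute(PARTIAL_SQL)


# Step 2 (MapReduce side): merge the partial results by key.
def reduce_phase(partials):
    """Combine per-node partial sums into one total per region."""
    totals = defaultdict(int)
    for region, partial_sum in partials:
        totals[region] += partial_sum
    return dict(totals)


result = reduce_phase(map_phase(nodes))
print(sorted(result.items()))  # [('east', 30), ('north', 5), ('west', 10)]
```

In this picture, swapping the underlying database for a column-store would only change what sits behind `make_node`; the merge step stays the same.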
It feels like we’re going to see a lot of convergence in technologies in the upcoming year, and that’s probably a good thing, at least in the open source world – the more collaboration of projects, the better. I’d love to see HadoopDB get more coverage in the Hadoop/Apache community – would be great to see this progress along with Hive and Pig.
Overall, it’s exciting to see these (hopefully) lower-TCO solutions maturing. It’s getting harder and harder for some of you to shell out 100k/TB for some of those commercial appliance offerings, so it’s nice to see we have some open source options out there. They may not be the perfect option just yet, but I see a bright future here, and that’s good for everyone, even for the commercial offerings (competition will drive innovation, and innovation and solid support are what open up wallets).
I haven’t had time yet to do some POCs on the two DB solutions or HadoopDB – I’ll be posting some findings when I do. But send us your thoughts if you’ve tried these out. What advantages/disadvantages do you see when comparing to Vertica and other commercial offerings?

