drevell's posterous

Notes from Hadoop World 2011 day 2

Yesterday I posted some thoughts and notes from day 1 of Hadoop World 2011 (http://drevell.posterous.com/hadoop-world-2011-thoughts-and-notes), here's more from day 2.

It looks like all the slide decks have been posted at http://www.cloudera.com/resources/Hadoop+World/ .

Also, high fives to all the Hadoop & HBase folks I met (and re-met) in NYC. It was a real pleasure partying with you.

 

Doug Cutting's keynote

He thinks that Hadoop arose as an effective way of exploiting the exponential growth of hardware performance. Networks, disks, and CPUs have all gotten faster but software hasn't improved similarly. He sees Hadoop as the "kernel" of a distributed OS for big data. He envisions Hadoop working like a Linux distro where the pieces evolve independently and are packaged into distributions.

He says the next CDH3u3 will include Mahout and will emphasize Avro.

We can expect Hadoop 0.23 early next year, and there's a potential alpha release being voted on now.

He thinks crunch, giraph, S4, and blur are interesting new projects in the Hadoop ecosystem that could be important.

He says that Bigtop, the integration testing tool, will serve as a "basis for future CDH releases" by evaluating compatiblity between Hadoop subprojects (e.g. how well does Hive version X work with HBase version Y running on HDFS version Z). It seems to me Bigtop could play an important role in defining what a Hadoop distro means and how a distro's components are chosen.

 

Hadoop Performance

Todd Lipcon of Cloudera talked about new performance tweaks that are coming in Hadoop 0.23. Mapreduce has changes to the built-in sort that save CPU and IO by 

  • using kernel features (fadvise, sync) more cleverly
  • changing some in-memory data structures for better cache locality
  • usings low-level Sun JVM features and native assembly

Small clusters will send heartbeats more frequently to optimize time-to-completion for workloads with lots of small jobs.

Upgrading Linux kernels can give a big performance boost due to recent writeback and readahead improvements. Kernel 2.6.32 was suggested.

There were some big wins in the HDFS layer. Half of datanodes' CPU usage was spent checksumming, so new code was added to use Intel's native SSE checksum instructions. In the benchmark presented, this change cut latency in half and cut CPU usage by 3x. Another change added persistent TCP connections from clients to datanodes.

The HDFS changes gave a ~40% YCSB throughput improvement for HBase, which is huge. This will be a strong incentive for HBase shops to upgrade to 0.23.

Yanpei Chen presented a more scientific benchmark scheme for MapReduce that uses samples from real cluster workloads. Briefly, he presented a scheme for distilling large traces into a representative benchmark, and showed that the scheme accurately approximates a real workload. One key idea is that MR clusters need a realistic workload in order to quantify their performance. The definition of realistic varies between clusters, as does the performance goal (time-to-completion, throughput, efficiency, etc.).

Berkeley has a repository of MapReduce workloads at www.eecs.berkeley.edu/~ychen2/SWIM.html for simulating different workloads.

 

HDFS Federation

Suresh Srinivas of Hortonworks talked about the some interrelated changes to the HDFS namenode that are happening under the umbrella of "HDFS federation." The biggest change is that the namenode is broken into two separate services, which will eventually be separate daemons. The two services are:

  • namespace: dealing with files and directories. File creation, modification, and deletion go here.
  • block management: block location, allocation, and deletion. Also manages replication and datanode membership. Receives block reports.

So there could be potentially many namespaces services sharing the same block storage service. It's operationally much nicer to have a single cluster with multiple namespaces rather than having multiple clusters. The block storage service can scale independently from the namespace service, which is harder to scale.

The "block pool" is a new abstraction offered by the storage service. Each pool might have different configurations, which would allow the storage service to handle different namespaces differently. For example, different namespaces would be backed by separate block pools. You could say that block pools are the unit of multitenancy for the block storage layer.

 

HBase Roadmap

Jonathan Gray of Facebook gave a rough sketch of upcoming HBase development. Version 0.92 will be coming out within "weeks." It includes:

  • coprocessors: hooks in the regionserver to run arbitrary code
  • ACL-based security
  • HFile v2: file format changes that make very large regions possible. Metadata blocks are scattered through the files forming a B-tree-like structure.
  • Log splitting on regionservers instead of the master
  • Compaction improvements: multi-threaded compaction, smarter input file selection
  • Operational improvements: better hbck, slow query log.

Version 0.94, currently trunk, will have a focus and stability, usability, and simplicity (again). Features will probably include:

  • Fine grained operational metrics (e.g. latency) at the table and column family level
  • Backups with point-in-time recovery
  • Multi-slave replication
  • Embedded thrift server (instead of the current proxy server)

Beyond 0.94, we're likely to see:

  • A new RPC protocol
  • Dynamic configuration (rather than file-based), I wasn't clear exactly what this would look like
  • More fine-grained monitoring

He said that Facebook uses pre-split tables and does not use automatic region splitting.

 

Hadoop 0.23

Arun Murthy of HortonWorks talked about Hadoop 0.23 and its new features. It's going into alpha any day, and we should expect the full release early next year.

  • HDFS federation, covered above
  • Next-generation MapReduce, covered yesterday
  • HDFS/Mapreduce performance improvements, covered above
  • HDFS HA changes, covered yesterday
  • Maven build system for all of Hadoop

We should expect to be able to run 6,000 node clusters with 0.23.

 

Hadoop World 2011 thoughts and notes

Hi!

These are my notes from day 1 of Hadoop World 2011 along with some simple observations. They're HBase-centric since that's what I work on the most. Comments are welcome if you have corrections/opinions. Sorry for the verbosity.

 

HBase in General

HBase has become very popular. It's really a first class use case for Hadoop alongside MapReduce. The HBase meetup last night was full to capacity, and most of the people there seemed to be using HBase pretty heavily. In my small breakout session of about a dozen people, there were multiple people running 20+ node clusters and handling 100K+ writes per second. Tumblr handles 1MM writes/sec, and they use a sharding scheme on top of multiple HBase clusters, with a 24GB heap per node.

HBase has problems with throughput and latency during compactions. The canonical solution is to turn off automatic splits and compactions and compact manually during off-hours when the performance hit is more tolerable. The other option is to buy more capacity.

According to Todd Lipcon, who knows these things, region size should be <=4GB and there should be 20-500 regions per regionserver if you're running HBASE 0.90. Facebook and Stumbleupon use a patch that makes bigger regions workable by avoiding recompaction of large store files, but these patches aren't in the stable release yet. According to J.D. Cryans, Facebook uses a fixed number of regions per node, without any splitting.

Multitenancy on HBase is problematic since different tables will want different settings for things like cache sizes and log/memstore flushes. Future changes will make this easier, though I wasn't clear on how.

Support for HBase snapshots is reportedly coming out of Facebook soon.

 

eBay

eBay is running an 18-month program called Cassini to build a new search engine using Hadoop and HBase. They've had some HBase stability issues, but it's not clear that they're using HBase correctly (it's easy to crash regionservers with poor schema choices). They're building a custom search engine with their own indexing and ranking code.

Facebook

Jonathan Gray from Facebook gave a great talk about their use cases for HBase. Their custom-sharded MySQL+memcache database is still alive and well, but they're finding that HBase is operationally nicer and cheaper for some new use cases. More specific reasons why they're using HBase:

  • Re-sharding the MySQL database is difficult, since load is unpredictable and uneven. HBase regions are comparatively easier. 
  • HBase wins at write latency/throughput due to the append-only merge tree design. Reads aren't great but merely tolerable
  • Tight Mapreduce integration is useful, especially the ability to write HFiles from MR jobs and import them.
  • Memcache is crazy fast but the the API is not as convenient as the lists+dictionaries offered by the HBase API

Facebook uses Tokyo Cabinet on flash storage for datasets that are memcache-like (key-value) but need persistence.

Their messaging product is built on one gigantic HBase column family.

Their real-time analytics service is built on Flume+HDFS+HBase. Event logs are streamed from clients into HDFS. The files in HDFS are tailed by other processes that do several increments/updates to OLAPpy rollups in HBase. High five to jgray for the use of "OLAPpy."

They do 75 billion reads/writes per day, 55% reads and 45% writes. Their nodes have dozens of TB each on 7200RPM drives.

They have a separate time-series database for operational metrics and alerting. This is ongoing work that's not fully in production.

 

Hadoop future developments

Eli Collins from Cloudera talked about the history of Hadoop and upcoming changes in the ecosystem. He emphasized enterprise features. Authentication and security are important and will be coming soon. Multitenancy will start to be a priority, where resources will be shared in a configurable way between multiple apps on the same infrastructure. Stream and graph processing are probably coming. Connectors with other enterprise systems will continue to happen.

 

HDFS HA

Aaron Myers (Cloudera) and Suresh Srinivas (Hortonworks) covered ongoing work on removing the namenode as a SPOF (HDFS-1623). The "backup namenode" work in 0.21 and the Avatar node design are not being used. The new design has not been finalized; we can expect the namenode failover in 0.23.1. The new design will be active-standby: one node is a master and one is waiting to become the master. The standby may either be "cold" (no state), "warm" (receives edits but no datanode block reports) or "hot" (has almost all state). So in the hot standby case, datanodes send their block reports to both masters. \

The edit log may either be shared from the master to the standby via a filer (e.g. NFS) or BookKeeper+Zookeeper. In the shared filer case, fancy schemes are needed to prevent double-master scenarios where both namenodes might update the shared state.

Client failover might be achieved using Zookeeper. 

 

Lily

Lily is an indexing system built on HBase and Solr. It's open source with an "enterprise version." It stores schema'd records with configurable indexes. There's significant engineering complexity spent on maintaining strict consistency between the records and the indexes.

 

Next Generation Mapreduce

Mahadev Konar from Hortonworks presented the "next generation mapreduce" design. It will be in 0.23.0 in an unstable form. The new design addresses a few problems with existing mapreduce:

  1. The jobtracker does a lot of work and limits MR cluster size to about 4000 nodes. The new design moves some of the work elsewhere.
  2. Mapreduce is not a general-purpose paradigm, stream/graph processing can't be done naturally
  3. Upgrading mapreduce versions requires downtime and synchronized upgrades of several components.
  4. Allocating resources in terms of "slots" tends to underutilize resources
  5. Jobtracker failure causes all running and queued jobs to fail

Under the new system, the jobtracker is replaced with a "resource manager" that does less work. The resource manager keeps track of what nodes are available and what resources they have available. The resource manager is responsible for allocating these resources when requested. Interestingly, the resource manager knows nothing about mapreduce and contains no mapreduce code. Mapreduce is just another application that requests resources on the cluster. Instead of resources being allocated in "slots," resources are allocated in "containers." A container is a certain quota of resources, e.g. one core and 2GB RAM.

Each application running on the cluster has an "application master" process. The application master processes run in a container on one of the workers, and not on the central node that has the resource manager. A mapreduce job would have an application master that starts up map and reduce tasks by requesting containers from the resource manager. The application master would contain the mapreduce code as a library. This solves the "synchronized upgrade" issue by allowing different jobs to use different mapreduce library versions.

Fault tolerance is handled by saving the central resource manager state to zookeeper.

Support for other processing paradigms (e.g. stream, graph) is possible because these can just be app masters running inside containers. The core system for allocating containers has been separated from mapreduce and is reusable. This is similar to Apache Mesos (http://www.mesosproject.org/).>

The problem with master/slave failover in data storage

Everyone's building a so-called "NoSQL" database (non-relational scalable data storage of some kind) these days. The quality of the conversation comparing these projects is sometimes disappointing because there are some subtle and complicated trade-offs that are not widely appreciated, and debates often devolve into unconstructive (and intensely annoying) fanboyism. I hope to elevate this conversation just a little bit by describing one of the trade-offs you'll need to make if you're choosing a distributed data storage system.

Specifically, I'm going to tell you why a system based on sharding with master/slave failover might not be the right solution for some of your problems. By "master/slave failover" I mean any scheme where all writes for a database (or shard) are sent to a single node. I'm looking in the general direction of mongo, membase, redis and similar systems.

This post is intended for all developers. I'm not going to assume you know the CAP theorem, but you really should glance at it. Smarter people than me have thought hard and written a lot about these topics. In CAP terms, this entire post can be summarized as "the master has total consistency, therefore we have to choose one of availability or partition tolerance."

What's good about master/slave failover

  • It's simple to understand. Since the master makes the decisions, there are no complicated protocols for maintaining agreement among multiple hosts.
  • It's simple to implement. You can just bolt on an asynchronous replication protocol to get slaves. Persistent hash table + replication = NoSQL win. Let's have an IPO.
  • It's simple to get ACID transaction semantics on the master, and transactions are extremely useful. Operations that depend on a consistent view of the data (like counting) are easy. Statistics are easy to collect.
  • It's fast. If you have low latency to the master, operations are fast and simple, as long as it's not overloaded.
  • Some reads can go to slaves, allowing distribution across high-latency links in many cases where stale data is OK.

Since simplicity is a hugely desirable characteristic of distributed systems, there are many good things to be said about master/slave failover.

The Problem:

As Prof. David Cheriton told my class one day, it doesn't matter whether your distributed system works on a sunny day going downhill with a tailwind, but how things work on a rainy day going uphill with a headwind and wolves attacking. I didn't paraphrase that well; I think there was something in there about bin Laden. The point is that any good distributed system gracefully handles component failures that are reasonably likely to occur, and as things scale up, failures become a certainty. So let's talk about fault tolerance.

If one of these things happens:

  • Node failure
  • Network failure

One of these two bad effects will result, depending on your system and the type of failure:

  • Loss of availability: clients cannot write
  • Loss of consistency: there will be divergent or stale copies of your database, and writes that occurred right before the failure may be lost.

Presumably there is some node (or group of nodes) that is monitoring the master and waiting to elect a new master if the old master fails. This would be fine if it were actually possible. The unfortunate truth is that it's impossible to tell the difference between a node failure and a network failure. The slaves only know that they are not receiving packets from the master.

A decision must be made: should a new master be promoted?

  • If no, you're screwed, because the original master may have actually crashed, so there's no master, and no one can write. Users get angry and write tweets about how much you suck.
  • If yes, you're screwed, because the original master may still actually be running and accepting writes and there's just a network issue. Now you have two masters, and two divergent copies of the database. Chaos ensues. Also, writes that recently committed on the old failed master may not have replicated to the new master before the failure. [Footnote 1]

In brief: you're screwed unless you can do perfect failure detection, which you can't. In even more brief: you're screwed.

More problems:

If clients are not geographically near the master, writes and high-consistency reads will be slow.

Scaling a database or shard to multiple data centers is problematic. There are some interesting tricks that are possible here. For example, Yahoo's PNUTS system migrates a user's data between data centers to follow them around geographically. You can think of a user's data as a special case of shard (since it can be a horizontal partition of the data).

Is there anything we can do?

If you're willing to give up on total consistency and you're willing to deal with a bit more complexity, you can get great performance, latency and fault tolerance. Amazon's Dynamo system has some really interesting ideas, some of which have been implemented in the open source projects Cassandra and Riak. 

Conclusion:

Go forth and make smart trade-offs when choosing a database for your next cool scalable backend. Good luck!

 

Footnotes:

1: You could design a database system that allows multiple masters to be cleanly merged into a single master. This is fine, but it makes ACID transactions impossible and takes away the nice consistency properties that made master/slave a good idea to start with.