Notes from Hadoop World 2011 day 2
Yesterday I posted some thoughts and notes from day 1 of Hadoop World 2011 (http://drevell.posterous.com/hadoop-world-2011-thoughts-and-notes), here's more from day 2.
It looks like all the slide decks have been posted at http://www.cloudera.com/resources/Hadoop+World/ .
Also, high fives to all the Hadoop & HBase folks I met (and re-met) in NYC. It was a real pleasure partying with you.
Doug Cutting's keynote
He thinks that Hadoop arose as an effective way of exploiting the exponential growth of hardware performance. Networks, disks, and CPUs have all gotten faster but software hasn't improved similarly. He sees Hadoop as the "kernel" of a distributed OS for big data. He envisions Hadoop working like a Linux distro where the pieces evolve independently and are packaged into distributions.
He says the next CDH3u3 will include Mahout and will emphasize Avro.
We can expect Hadoop 0.23 early next year, and there's a potential alpha release being voted on now.
He thinks crunch, giraph, S4, and blur are interesting new projects in the Hadoop ecosystem that could be important.
He says that Bigtop, the integration testing tool, will serve as a "basis for future CDH releases" by evaluating compatiblity between Hadoop subprojects (e.g. how well does Hive version X work with HBase version Y running on HDFS version Z). It seems to me Bigtop could play an important role in defining what a Hadoop distro means and how a distro's components are chosen.
Hadoop Performance
Todd Lipcon of Cloudera talked about new performance tweaks that are coming in Hadoop 0.23. Mapreduce has changes to the built-in sort that save CPU and IO by
- using kernel features (fadvise, sync) more cleverly
- changing some in-memory data structures for better cache locality
- usings low-level Sun JVM features and native assembly
Small clusters will send heartbeats more frequently to optimize time-to-completion for workloads with lots of small jobs.
Upgrading Linux kernels can give a big performance boost due to recent writeback and readahead improvements. Kernel 2.6.32 was suggested.
There were some big wins in the HDFS layer. Half of datanodes' CPU usage was spent checksumming, so new code was added to use Intel's native SSE checksum instructions. In the benchmark presented, this change cut latency in half and cut CPU usage by 3x. Another change added persistent TCP connections from clients to datanodes.
The HDFS changes gave a ~40% YCSB throughput improvement for HBase, which is huge. This will be a strong incentive for HBase shops to upgrade to 0.23.
Yanpei Chen presented a more scientific benchmark scheme for MapReduce that uses samples from real cluster workloads. Briefly, he presented a scheme for distilling large traces into a representative benchmark, and showed that the scheme accurately approximates a real workload. One key idea is that MR clusters need a realistic workload in order to quantify their performance. The definition of realistic varies between clusters, as does the performance goal (time-to-completion, throughput, efficiency, etc.).
Berkeley has a repository of MapReduce workloads at www.eecs.berkeley.edu/~ychen2/SWIM.html for simulating different workloads.
HDFS Federation
Suresh Srinivas of Hortonworks talked about the some interrelated changes to the HDFS namenode that are happening under the umbrella of "HDFS federation." The biggest change is that the namenode is broken into two separate services, which will eventually be separate daemons. The two services are:
- namespace: dealing with files and directories. File creation, modification, and deletion go here.
- block management: block location, allocation, and deletion. Also manages replication and datanode membership. Receives block reports.
So there could be potentially many namespaces services sharing the same block storage service. It's operationally much nicer to have a single cluster with multiple namespaces rather than having multiple clusters. The block storage service can scale independently from the namespace service, which is harder to scale.
The "block pool" is a new abstraction offered by the storage service. Each pool might have different configurations, which would allow the storage service to handle different namespaces differently. For example, different namespaces would be backed by separate block pools. You could say that block pools are the unit of multitenancy for the block storage layer.
HBase Roadmap
Jonathan Gray of Facebook gave a rough sketch of upcoming HBase development. Version 0.92 will be coming out within "weeks." It includes:
- coprocessors: hooks in the regionserver to run arbitrary code
- ACL-based security
- HFile v2: file format changes that make very large regions possible. Metadata blocks are scattered through the files forming a B-tree-like structure.
- Log splitting on regionservers instead of the master
- Compaction improvements: multi-threaded compaction, smarter input file selection
- Operational improvements: better hbck, slow query log.
Version 0.94, currently trunk, will have a focus and stability, usability, and simplicity (again). Features will probably include:
- Fine grained operational metrics (e.g. latency) at the table and column family level
- Backups with point-in-time recovery
- Multi-slave replication
- Embedded thrift server (instead of the current proxy server)
Beyond 0.94, we're likely to see:
- A new RPC protocol
- Dynamic configuration (rather than file-based), I wasn't clear exactly what this would look like
- More fine-grained monitoring
He said that Facebook uses pre-split tables and does not use automatic region splitting.
Hadoop 0.23
Arun Murthy of HortonWorks talked about Hadoop 0.23 and its new features. It's going into alpha any day, and we should expect the full release early next year.
- HDFS federation, covered above
- Next-generation MapReduce, covered yesterday
- HDFS/Mapreduce performance improvements, covered above
- HDFS HA changes, covered yesterday
- Maven build system for all of Hadoop
We should expect to be able to run 6,000 node clusters with 0.23.
