Using program analysis to tame big data configuration


One of the major problems with big data systems is that they can be devilishly hard to manage and configure. One approach we’re exploring in the AMP lab is to use program analysis as a tool to automatically reason about programs and infer what configuration options they have. The analysis takes a Java program as input and spits out a list of all the configuration options, where in the code they are read, and what type they likely have.
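To make this concrete, here is a minimal sketch of the kind of Java code the analysis examines. The class and option names below are illustrative, not taken from the tool's actual output; the pattern, though, is the common one in Hadoop-style programs, where options are read through calls to the Configuration API. The analysis can locate those call sites, recover the option name from the string argument, and guess a type from which accessor was used.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative example: a class that reads configuration options through
// Hadoop's Configuration API. A static analysis can find these call sites,
// extract the option names, and infer likely types from the accessors used.
public class ExampleService {
    private final int retries;
    private final boolean compress;
    private final String logDir;

    public ExampleService(Configuration conf) {
        // getInt(...)     -> "example.retries" is likely numeric
        this.retries  = conf.getInt("example.retries", 3);
        // getBoolean(...) -> "example.compress" is likely boolean
        this.compress = conf.getBoolean("example.compress", false);
        // get(...)        -> "example.log.dir" is likely a string (here, a path)
        this.logDir   = conf.get("example.log.dir", "/tmp/example");
    }
}

// A tool along these lines would report, roughly:
//   example.retries   read in ExampleService.<init>   type: int
//   example.compress  read in ExampleService.<init>   type: boolean
//   example.log.dir   read in ExampleService.<init>   type: String
```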

I’ve been spending my summer at Cloudera [a lab affiliate] applying this work in the real world, and it’s been successful: aspects of it are being used in several ways. Perhaps most importantly, we use it to analyze customer configurations for problems. We’ve also used it to find configuration-related bugs in Hadoop, such as undocumented and wrongly documented options (see the sketch below). It has also provided a central listing of options to guide developers of Cloudera’s add-on management tools.
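As one illustration of how such a report can surface documentation bugs, here is a hedged sketch of a post-processing step; the file names and format are assumptions made up for this example. Once the analysis emits the set of options the code actually reads, comparing it against the set of options listed in the documentation flags both undocumented options and documented options that no code ever reads.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical post-processing step: compare the options extracted by the
// analysis with the options listed in the documentation. Both input files
// are assumed to contain one option name per line.
public class OptionDiff {
    public static void main(String[] args) throws IOException {
        Set<String> inCode = new HashSet<>(Files.readAllLines(Paths.get("options-in-code.txt")));
        Set<String> inDocs = new HashSet<>(Files.readAllLines(Paths.get("options-in-docs.txt")));

        Set<String> undocumented = new HashSet<>(inCode);
        undocumented.removeAll(inDocs);   // read by the code, missing from the docs

        Set<String> unread = new HashSet<>(inDocs);
        unread.removeAll(inCode);         // documented, but never read by the code

        System.out.println("Undocumented options: " + undocumented);
        System.out.println("Documented but unread options: " + unread);
    }
}
```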

A full description is available on the Cloudera blog. See also the paper “Static Extraction of Program Configuration Options,” presented at ICSE 2011 this past May.

About Ariel Rabkin

Ari Rabkin is a fifth (and final) year PhD candidate in the RAD Lab at UC Berkeley, working with Randy Katz. He is formerly from Cornell University (AB 2006, MEng 2007). He is interested in software quality and software intelligibility. He expects to graduate in May 2012. His dissertation is about applying program analysis to system management, including automatically describing program configuration options and diagnosing configuration errors. He is a contributor to several open source projects, including Hadoop, the Chukwa log collection framework, and the JChord program analysis toolset.