Using program analysis to tame big data configuration


One of the major problems with big data systems is that they can be devilishly hard to manage and configure. One approach we’re exploring in the AMP lab is to use program analysis as a tool to automatically reason about programs and infer what configuration options they have. The analysis takes a Java program as input and spits out a list of all the configuration options, where in the code they are read, and what type they likely have.
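To give a rough sense of the kind of pattern the analysis looks for (this is an illustrative sketch, not the tool's actual output, and the option names below are made up): Hadoop-style programs read options by passing string constants to Configuration accessors, and tracking those constants statically is what lets the analysis recover each option's name, where it is read, and its likely type.

```java
import org.apache.hadoop.conf.Configuration;

// Hypothetical Hadoop-style code illustrating the call sites the analysis finds.
// Option names here are invented for the example.
public class TaskLauncher {
  public void start(Configuration conf) {
    // Read via getInt -> the analysis would infer an integer-valued option.
    int retries = conf.getInt("example.task.max.retries", 4);

    // Read via getBoolean -> inferred as a boolean option.
    boolean speculative = conf.getBoolean("example.speculative.execution", true);

    // Read via get -> left as a plain string unless later parsing narrows the type.
    String scratchDir = conf.get("example.temp.dir");
  }
}
```

For each such read site, the output is roughly a triple of (option name, source location, inferred type), aggregated over the whole program.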

I’ve been spending my summer at Cloudera [a lab affiliate] applying this work in the real world, and it’s been successful: aspects of it are being used in several ways. Perhaps most importantly, we use it to analyze customer configurations for problems. We’ve also used it to find configuration-related bugs in Hadoop, such as undocumented and wrongly documented options. It has also been used to provide a central listing of options to guide developers of Cloudera’s add-on management tools.
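As a minimal sketch of one such use (this is not Cloudera’s actual tooling, just an assumption about how an extracted option list could be applied): once the analysis produces the set of option names the code actually reads, a customer configuration can be checked against it, and keys that no code path ever reads are flagged as likely typos or stale settings.

```java
import java.util.*;

// Sketch only: compares a customer's configuration keys against the set of
// option names recovered by the analysis; names below are invented.
public class ConfigChecker {
  public static List<String> unknownKeys(Set<String> knownOptions,
                                         Map<String, String> customerConfig) {
    List<String> suspicious = new ArrayList<>();
    for (String key : customerConfig.keySet()) {
      if (!knownOptions.contains(key)) {
        suspicious.add(key);  // never read by the analyzed program
      }
    }
    return suspicious;
  }

  public static void main(String[] args) {
    Set<String> known = new HashSet<>(Arrays.asList(
        "example.temp.dir", "example.speculative.execution"));
    Map<String, String> config = new HashMap<>();
    config.put("example.temp.dir", "/tmp");
    config.put("example.speculatve.execution", "true");  // note the typo
    System.out.println(unknownKeys(known, config));      // [example.speculatve.execution]
  }
}
```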

A full description is available on the Cloudera blog. See also the paper “Static Extraction of Program Configuration Options”, presented at ICSE 2011 this past May.

About Ariel Rabkin

Ari Rabkin is a fifth (and final) year PhD candidate in the RAD Lab at UC Berkeley, working with Randy Katz. He is formerly from Cornell University (AB 2006, MEng 2007). He is interested in software quality and software intelligibility. He expects to graduate in May 2012. His dissertation is about applying program analysis to system management, including automatically describing program configuration options and diagnosing configuration errors. He is a contributor to several open source projects, including Hadoop, the Chukwa log collection framework, and the JChord program analysis toolset.