Wisteria: Nurturing Scalable Data Cleaning Infrastructure

Analysts report spending upwards of 80% of their time on problems in data cleaning. The data cleaning process is inherently iterative, with evolving cleaning workflows that start with basic exploratory data analysis on small samples of dirty data, then refine the analysis with more sophisticated and expensive cleaning operators (e.g., crowdsourcing), and finally apply the insights to the full dataset. While an analyst often knows at a logical level what operations need to be done, they often have to manage a large search space of physical operators and parameters. We present Wisteria, a system designed to support the iterative development and optimization of data cleaning workflows, especially ones that utilize the crowd. Wisteria separates logical operations from physical implementations and, driven by analyst feedback, suggests optimizations of or replacements for the analyst's choice of physical implementation. We highlight research challenges in sampling, in-flight operator replacement, and crowdsourcing. We give an overview of the system architecture and these techniques, then provide a demonstration designed to showcase how Wisteria can improve iterative data analysis and cleaning. The code is available at http://www.sampleclean.org.
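
To make the separation of logical operations from physical implementations concrete, the following is a minimal sketch in Python. It is not Wisteria's API: `LogicalOp`, `exact_dedup`, `fuzzy_dedup`, and the 0.85 similarity threshold are hypothetical names and values chosen only for illustration. The sketch shows one logical operation (deduplication) backed by two interchangeable physical operators, with the chosen one swapped without changing the workflow that calls it.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# All names below are hypothetical; none are taken from the Wisteria code base.
PhysicalOp = Callable[[List[str]], List[str]]


def _normalize(record: str) -> str:
    return record.strip().lower()


def exact_dedup(records: List[str]) -> List[str]:
    """Cheap physical operator: drop records identical after normalization."""
    seen = set()
    kept = []
    for record in records:
        key = _normalize(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def fuzzy_dedup(records: List[str], threshold: float = 0.85) -> List[str]:
    """Costlier physical operator: drop records similar to one already kept.

    The threshold is an assumed value, not one reported in the source.
    """
    kept: List[str] = []
    for record in records:
        candidate = _normalize(record)
        is_duplicate = any(
            SequenceMatcher(None, candidate, _normalize(k)).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept


@dataclass
class LogicalOp:
    """A logical operation with several candidate physical implementations."""
    name: str
    implementations: Dict[str, PhysicalOp]
    chosen: str

    def run(self, records: List[str]) -> List[str]:
        return self.implementations[self.chosen](records)

    def replace(self, impl_name: str) -> None:
        """Swap the physical operator; the logical workflow is unchanged."""
        if impl_name not in self.implementations:
            raise KeyError(f"unknown implementation: {impl_name}")
        self.chosen = impl_name


dedup = LogicalOp(
    name="deduplicate",
    implementations={"exact": exact_dedup, "fuzzy": fuzzy_dedup},
    chosen="exact",
)
```

An analyst could call `dedup.run(sample)` on a small sample, inspect the result, and then call `dedup.replace("fuzzy")` if near-duplicates remain. In this sketch the replacement is triggered by hand; the abstract describes a system that suggests such replacements based on analyst feedback.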