Talk Title: Towards Self-Generating Data Management Systems
Speaker: Alvin Cheung, U Washington
Location: 373 Soda Hall
Time and Date: 12 Noon, Thursday, Nov. 10
Abstract: Advances in parallel data processing systems and libraries such as Spark have enabled large-scale data analytics in many disciplines. To use such systems effectively, however, programmers often need to rewrite their applications to exploit the optimizations these systems provide, and to implement various specialized storage and query processing components to further improve application performance.
In this talk, we give an overview of our research on building computer-aided tools that automatically find optimal implementations for a given data analytics application. We will start with verified lifting, which automatically discovers a clean functional specification from existing code that might contain optimizations targeting other (potentially legacy) systems. The discovered spec, written in a DSL, can then be compiled to new platforms. We have applied verified lifting in various domains. As an illustration, we will focus on Casper, a new compiler built using verified lifting that translates sequential Java code fragments into Spark and Hadoop, improving performance by up to 5x.
Being able to execute code across different data processing systems solves only part of the problem, as the data to be processed can reside on different systems, each using a different format. In the second part of the talk, we will describe Pipegen, a new tool that automatically synthesizes data pipes for transferring data efficiently between systems. Pipegen analyzes system implementations and generates data pipes given user queries. Furthermore, it leverages existing system test cases to verify the correctness of the generated pipes. We have used Pipegen to automatically generate pipes among 5 different systems, and each generated pipe speeds up data transfer by up to 3.8x compared to the traditional approach of transferring data through files on disk.
Presenter Bio: Alvin Cheung is an Assistant Professor of Computer Science at the University of Washington, affiliated with the database research group and the programming languages and software engineering research group. His research focuses on designing new programming systems techniques, and applying existing ones, to solve systems problems. His group has applied such techniques to database applications, compiler construction, and programmable network hardware. He has also worked on software replay and query authoring tools.