Increasingly, data-science applications require the creation, manipulation, and analysis of large graphs ranging from social networks to language models. Existing graph systems (e.g., GraphBuilder, Titan, and Giraph) each address a specific stage of a typical graph-analytics pipeline (e.g., graph construction, querying, or computation), but none addresses the entire pipeline. Users are therefore forced to deal with multiple systems, complex and brittle file interfaces, and inefficient data movement and duplication.
The GraphX project unifies graphs and tables, enabling users to express an entire graph-analytics pipeline within a single system. The GraphX interactive API makes it easy to build, query, and compute on large distributed graphs. In addition, GraphX includes a growing repository of graph algorithms for a range of analytics tasks. By casting recent advances in graph-processing systems as distributed join optimizations, GraphX achieves performance comparable to specialized graph-processing systems while exposing a more flexible API. By building on recent advances in data-parallel systems, GraphX achieves fault tolerance while retaining in-memory performance, without the need for explicit checkpoint recovery.
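To make the "graphs as tables" idea concrete, the sketch below runs a PageRank-style computation using only table operations: a join of the edge table with the vertex table, a group-by on the destination vertex, and a join back to the vertex table. It is a single-machine illustration in plain Python, not the GraphX API. The function name `pagerank_as_joins`, its parameters, and the unnormalized update rule (0.15 + 0.85 × sum of contributions) are assumptions chosen for this example.

```python
from collections import defaultdict


def pagerank_as_joins(vertices, edges, num_iters=20, damping=0.85):
    """Illustrative PageRank over a vertex table and an edge table.

    vertices: iterable of vertex ids (the "vertex table" keys)
    edges:    list of (src, dst) pairs (the "edge table")
    Hypothetical helper for illustration; not part of any library.
    """
    vertices = list(vertices)

    # Vertex table: vertex id -> current rank.
    ranks = {v: 1.0 for v in vertices}

    # Out-degree per source vertex, computed by grouping the edge table on src.
    out_degree = defaultdict(int)
    for src, _dst in edges:
        out_degree[src] += 1

    for _ in range(num_iters):
        # Step 1: join the edge table with the vertex table on src,
        # producing one (dst, contribution) message per edge.
        messages = [(dst, ranks[src] / out_degree[src]) for src, dst in edges]

        # Step 2: group the messages by destination and sum them.
        sums = defaultdict(float)
        for dst, contribution in messages:
            sums[dst] += contribution

        # Step 3: join the aggregated messages back to the vertex table.
        # Vertices with no incoming edges receive a sum of 0.0.
        ranks = {v: (1.0 - damping) + damping * sums.get(v, 0.0) for v in vertices}

    return ranks


if __name__ == "__main__":
    # Small example: two vertices pointing at a third.
    example = pagerank_as_joins(["a", "b", "c"], [("b", "a"), ("c", "a")])
    for vertex, rank in sorted(example.items()):
        print(vertex, round(rank, 3))
```

In a distributed setting each of the three steps becomes a distributed join or aggregation, which is where join optimizations of the kind described above would apply. This sketch ignores partitioning, fault tolerance, and convergence checks entirely.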
People: Joseph E. Gonzalez, Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael J. Franklin, Ion Stoica
- Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica. GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI’14.
- Reynold S. Xin, Daniel Crankshaw, Ankur Dave, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica. GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. arXiv, 2014.
- Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, Ion Stoica. GraphX: A Resilient Distributed Graph System on Spark. GRADES (SIGMOD workshop), 2013.