GraphX: Large-Scale Graph Analytics




Increasingly, data-science applications require the creation, manipulation, and analysis of large graphs ranging from social networks to language models. Existing graph systems (e.g., GraphBuilder, Titan, and Giraph) each address specific stages of a typical graph-analytics pipeline (e.g., graph construction, querying, or computation), but none addresses the entire pipeline. Users are therefore forced to deal with multiple systems, complex and brittle file interfaces, and inefficient data movement and duplication.

The GraphX project unifies graphs and tables, enabling users to express an entire graph-analytics pipeline within a single system. The GraphX interactive API makes it easy to build, query, and compute on large distributed graphs. In addition, GraphX includes a growing repository of graph algorithms for a range of analytics tasks. By casting recent advances in graph-processing systems as distributed join optimizations, GraphX achieves performance comparable to specialized graph-processing systems while exposing a more flexible API. By building on recent advances in data-parallel systems, GraphX achieves fault tolerance while retaining in-memory performance, without the need for explicit checkpoint recovery.
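To make the "graphs as tables" idea concrete, here is a minimal single-machine sketch in plain Python. It is not the GraphX API: the names `vertices`, `edges`, `out_degrees`, and `pagerank_step` are hypothetical, and the example only illustrates how one step of a graph computation (here, a PageRank-style update with an assumed damping factor of 0.85) can be phrased as a join of the edge table with the vertex table, followed by a group-by on the destination vertex.

```python
from collections import defaultdict

# Vertex table: vertex_id -> current rank (hypothetical toy data).
vertices = {1: 1.0, 2: 1.0, 3: 1.0}
# Edge table: (source_id, destination_id) rows.
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]


def out_degrees(edges):
    """Group the edge table by source to count outgoing edges per vertex."""
    deg = defaultdict(int)
    for src, _dst in edges:
        deg[src] += 1
    return dict(deg)


def pagerank_step(vertices, edges, damping=0.85):
    """One PageRank-style update expressed as join + aggregate."""
    deg = out_degrees(edges)
    # Join: attach each edge's source rank, producing one message per edge.
    messages = [(dst, vertices[src] / deg[src]) for src, dst in edges]
    # Group by destination and sum the incoming contributions.
    sums = defaultdict(float)
    for dst, contribution in messages:
        sums[dst] += contribution
    # Join the aggregates back onto the vertex table to get the new ranks.
    return {v: (1 - damping) + damping * sums.get(v, 0.0) for v in vertices}


ranks = pagerank_step(vertices, edges)
print(ranks)
```

In a distributed setting the two joins and the group-by are where the data movement happens, which is why treating graph computation as a join-optimization problem matters for performance.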

GraphX is available as part of the Spark Apache Incubator project as of version 0.9.0, and the active research version of GraphX can be obtained from the GitHub project page.

People: Joseph E. Gonzalez, Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael J. Franklin, Ion Stoica