Announcing KeystoneML

Error: Unable to create directory uploads/2024/12. Is its parent directory writable by the server?

We’ve written about machine learning pipelines in this space in the past. At the AMPLab Retreat this week, we released (live, on stage!) KeystoneML, a software framework designed to simplify the construction of large scale, end-to-end, machine learning pipelines in Apache Spark. KeystoneML is alpha software, but we’re releasing it now to get feedback from users and to collect more use cases.

Included in the package is a type-safe API for building robust pipelines and example operators used to construct them in the domains of natural language processing, computer vision, and speech. Additionally, we’ve included and linked to several scalable and robust statistical operators and machine learning algorithms which can be reused by many workflows.

Also included in the code are several example pipelines that demonstrate how to use the software to reproduce recent academic results in computer vision, natural language processing, and speech processing. Here’s an example:

Sample SIFT and Fisher Vector based Image Classification pipeline, included in KeystoneML.

Sample SIFT and Fisher Vector based Image Classification pipeline, included in KeystoneML.

This pipeline for image classification reproduces a recent academic result in image classification from Chatfield, et. al. when applied to the “VOC 2007” dataset, and is automatically distributed to run on a cluster.

Users familiar with the new spark.ml package may recognize several similarities in the API concepts and interfaces presented between these two projects. This is no coincidence since we contributed to the design and initial implementation of spark.ml. However, KeystoneML provides both a richer set of built-in operators – for example, image featurizers and large-scale linear solvers – and modifies the spark.ml interface to provide type-safety and ensure further robustness.

To try out KeystoneML, head over the the project home page and check out the quick start and programming guides, and check out the code on GitHub. We’d love to hear feedback from early users — please feel free to join the mailing list.