Many machine learning (ML) algorithms iteratively transform some global state (e.g., model parameters or variable assignment) giving the illusion of serial dependencies between each operation. However, due to sparsity, exchangeability, and other symmetries, it is often the case that many, but not all, of the state-transforming operations can be computed concurrently while still preserving serializability: the equivalence to some serial execution where individual operations have been reordered.
This opportunity for serializable concurrency forms the foundation of distributed database systems. In this project, we implement updates in ML algorithms as concurrent transactions in a distributed database. As a result, we achieve high scalability while maintaining the semantics and theoretical properties of original serial algorithm.
ML algorithms that have been implemented using concurrency control include clustering, feature modeling, online facility location, and submodular maximization.
Optimistic Concurrency Control for Distributed Unsupervised Learning (NIPS 2013)
Parallel Double Greedy Submodular Maximization (NIPS 2014) [ code ]
Parallel Correlation Clustering on Big Graphs (arXiv pre-print)