Web applications and services today collect, store, and analyze an immense amount of data. As data sizes continue to grow, the bottlenecks in big data analytics systems have undergone a fundamental change. In particular, with memory bandwidth and CPU performance growing much faster than the bandwidth between the CPU and slower storage devices (SSDs, disks, etc.), existing big data systems are increasingly bottlenecked by I/O. These I/O bottlenecks are getting worse, and will continue to do so.
A fundamental approach to alleviating I/O bottlenecks is data compression. Traditional compression techniques have led to significant gains in storage costs, energy costs, and performance for a wide variety of batch-processing jobs. These techniques have also been used to reduce I/O bottlenecks in columnar stores, with significant performance improvements for OLAP workloads that typically require scanning the entire dataset (see Daniel Abadi’s thesis, this paper, and the references therein).
However, the aforementioned compression and query execution techniques are unsuitable for a wide variety of workloads that do not necessarily require data scans (e.g., point queries). One example is search, a fundamental primitive supported by many web applications and services. Examples include Facebook search, Twitter search, LinkedIn search, airline and hotel search, and services built specifically around search (Google, Bing, and Yelp, to name a few). Another example is random access, typically performed via the get() interface in key-value stores, NoSQL stores, document stores, etc. Queries in such workloads are often short-lived (ideally sub-millisecond), so data scans and decompression are of little use. Given the large number of applications that run such workloads, we at AMPLab decided to take a stab at this problem and asked the following fundamental question:
Is it possible to execute point queries (e.g., search and random access) directly on compressed data without performing data scans?
Exploring this question led to the Succinct project! At a high level, Succinct enables a wide range of queries, including search, range, and wildcard queries over arbitrary strings, as well as random access into the input data, directly on a compressed representation of the input. What differentiates Succinct from previous systems that support point queries is that Succinct supports these queries without storing any indexes, without data scans, and without data decompression: all the required information is embedded within the compressed representation, and queries are executed directly on it.
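To make the contrast with scan-based execution concrete, here is a toy sketch of the general idea behind suffix-array-based search: once the suffix positions are sorted, any substring query becomes two binary searches rather than a scan over the data. This is an illustrative simplification, not Succinct's actual implementation; Succinct's key contribution is that the information such a structure provides is embedded implicitly inside the compressed representation itself, rather than stored as a separate (and space-hungry) index.

```python
def build_suffix_array(text):
    # Sort all suffix start positions by the suffix that begins there.
    # (A toy O(n^2 log n) construction; real systems use linear-time methods.)
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, query):
    """Return sorted offsets of every occurrence of `query` in `text`,
    using binary search over the suffix array -- no scan of the data."""
    m = len(query)
    # Lower bound: first suffix whose length-m prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-m prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

sa = build_suffix_array("banana")
print(search("banana", sa, "ana"))  # -> [1, 3]
```

Each query touches only O(log n) positions of the structure, which is why this style of execution stays sub-millisecond regardless of how large the underlying data is.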
On real-world and benchmark datasets, Succinct executes sub-millisecond search queries while keeping as much as an order of magnitude more input data in faster storage than state-of-the-art systems that provide similar functionality using indexes. For example, on a server with 128GB of RAM, Succinct can push as much as 163–250GB of raw data, depending on the dataset, while executing search queries within a millisecond. Thus, Succinct serves more queries from faster storage, leading to lower query latency than existing systems over a much larger range of input sizes.
For more information on Succinct (techniques, tradeoffs, and benchmark results), see the Succinct webpage. A good place to start experimenting with Succinct is Succinct on Apache Spark, an Apache Spark package that enables queries directly on compressed RDDs. There are a number of interesting follow-up projects on Succinct at AMPLab, exploring the fundamental limits of querying compressed data, adding new applications on top of Succinct, and improving the performance of existing applications. We will write a lot more about these very exciting projects on the Succinct webpage.