Fast and Flexible Query Analysis at MapD with Apache Calcite
Back when we started the current incarnation of the MapD Core database, we wrote our own parser (written using flex and GNU bison), semantic analysis and optimizer. This approach offers the most control since everything in the pipeline can be adjusted to the unique needs of our system, but we've realized that our main strength lies in the actual execution of the queries. In the context of the limited resources of a startup, we have to pick our battles.
We soon faced a dilemma: are there any fully-featured SQL frontends which still allow control over our destiny? Several options, both commercial and open source were available, ranging from just the parser to comprehensive frameworks doing everything, from parsing to query optimization. Some of them aren't standalone projects, but it's feasible to decouple them from the full system.
A quick word about partial operator push down
Some of the available frameworks provide their own implementation for the SQL operations. The system using the framework can handle some of the operators (push down) and use the framework to fill the gaps in functionality. Note that Calcite offers its own version of this (allowing filter and projection push down) and a well-structured tutorial on using it. It was very tempting to go that route, but meaningful off-loading to the framework involves slower execution (often by much, in our case) and intermediate buffers. Being fast on our workload is all about eliding buffers (or making them as small as possible) and avoiding memory copies. We've therefore decided for an all-or-nothing approach and reject altogether queries we don't fully support on our end.
For example, suppose we didn't support the LIKE operation (for the sake of example; we actually do) and we want to avoid implementing it. A query like SELECT COUNT(*) FROM test WHERE str LIKE 'foo%' AND x > 5 would require us to return all the rows which meet the x > 5 part of the filter to the collaborative execution framework, so that it can apply the additional str LIKE 'foo%'filter and finish the query. On the other hand, an implementation which supports the LIKE operator can evaluate the entire filter and compute COUNT(*) in place, without using any intermediate buffers. For tables with billions of rows, the additional cost of the intermediate buffers is prohibitive, both from a memory usage perspective and the cost of writing to it.
Choosing the right SQL framework
After evaluating a few other options, we decided for Apache Calcite, an incubation stage project at the time. It takes SQL queries and generates extended relational algebra, using a highly configurable cost-based optimizer. Several projects use Calcite already for SQL parsing and query optimization.
One of the main strengths of Calcite is its highly modular structure, which allows for multiple integration points and creative uses. It offers a relational algebra builder, which makes moving to a different SQL parser (or adding a non-SQL frontend) feasible.
In our product, we need runtime functions which are not recognized by Calcite by default. For example, trigonometric functions are necessary for on-the-fly geo projections used for point map rendering. Fortunately, Calcite allows specifying such functions and they become first-class citizens, with proper type checking in place.
Calcite also includes a highly capable and flexible cost-based optimizer, which can apply high-level transformations to the relational algebra based on query patterns and statistics. For example, it can push part of a filter through a join in order to reduce the size of the input, like the following figure shows:
You can find this example and more about the cost-based optimizer in Calcite in this presentation on using it in the Apache Phoenix project. Such optimizations complement the low-level optimizations we do ourselves to achieve great speed improvements.
Relational algebra example
Let's take a simple query: SELECT A.x, COUNT(*) FROM test JOIN B ON A.x = B.x WHERE A.y > 41 GROUP BY A.x; and analyze the relational algebra generated for it.
In Calcite relational algebra, there are a few main node types, corresponding to the theoretical extended relational algebra model: Scan, Filter, Project, Aggregate and Join. Each type of node, except Scan, has one or more (in the case of Join) inputs and its output can become the input of another node. The graph of nodes connected by data flow relationships is a
directed acyclic graph (abbreviated as "DAG"). For our query, Calcite outputs the following DAG:
The Scan nodes have no inputs and output all the rows and the columns in tables A and B, respectively. The Join node specifies the join condition (in our case A.x = B.x) and its output contains the columns in A and B concatenated. The Filter node only allows the rows which pass the specified condition and its output preserves all columns of input. The Project node only preserves the specified expressions as columns in the output. Finally, the Aggregate specifies the group by expressions and aggregates.
The physical implementation of the nodes is up to the system using Calcite as a frontend. Nothing in the Join node mandates a certain implementation of the join operation (equijoin in our case). Indeed, using a condition which can't be implemented as a hash join, like A.x < B.x, would only be reflected by the condition in the Filter node.
We realized we can reuse a lot of our in-house frontend as a foundation for integrating Calcite. Where direct reuse failed, we were able to extract common abstractions idiomatic to both the legacy and the new Calcite frontend. Our in-house frontend didn't use canonical relational algebra nodes, but the building blocks were similar enough to make migration possible.
Calcite is a Java project, so integration with C++ isn't entirely trivial. Fortunately, we can serialize the relational algebra it outputs to JSON and use this string as a starting point for our C++ query execution. We had to extend the existing JSON serialization in Calcite in order to preserve more details about literal types and subqueries, which proved to be a straightforward task. This approach simplifies the JNI interface since we don't need to move complex objects across the boundary.
We set out to quickly validate or discard the idea of using Calcite. We decided to go for a shallow integration as a first stage, faithfully converting relational algebra generated by Calcite to our existing in-memory representation of SQL queries. In a few weeks, we had this integration done and the generated LLVM IR code was completely identical to the one generated using the legacy frontend. We committed to Calcite at that point and then gradually converted our system to work with relational algebra instead.
Regarding performance, Apache Calcite needs several milliseconds to parse and convert from SQL to serialized relational algebra and the JNI marshalling is completely negligible. This is fast enough for now, but we can bring this additional time down if necessary. For example, the cross-filtered nature of our Immerse dashboards leads to several queries being generated simultaneously. We currently run the Calcite frontend on a single thread (still allowing Calcite to process the queries asynchronously with respect to the execuction engine), but we could easily multi-thread it if needed. Also, we could parametrize the queries to bypass the parse and analyze phase entirely.
Relational algebra operator fusion
Calcite generates canonical relational algebra. Sometimes, executing operations as they come would involve redundant intermediate buffers and, as we've already said, we must avoid them. Therefore we walk the DAG looking for patterns to be coalesced into a synthetic node to be executed without intermediate buffers while preserving the observable effects. For example, we coalesce the Filter, Project, Aggregate chain into a single synthetic node, called Compound which evaluates the filter and the aggregate on the fly and avoids the intermediate buffers for Filter and Project outputs. Let's take, for example, the previous example and see how this optimization works (nodes before optimization drawn with dashed lines):
The Compound node contains all information needed to evaluate the filter and (potentially grouped) aggregates using just the memory buffer required for the final result.
Current state and future work
Integrating Calcite enabled us to quickly get basic subqueries and outer joins working. We're continually expanding the range of SQL queries we can execute. The real challenge, as usual, is recognizing the practically interesting patterns and discovering fast ways to execute them. The wide range of queries supported by Calcite allows us to plan ahead since we can see the relational algebra before we do any work on the execution side.
To conclude we'd like to thank the Apache Calcite team for their work on building a great foundation for databases.