MapD 2.0 - Under the hood with the MapD Core GPU database and Iris Rendering Engine
Continuing where we left off in our earlier post on MapD 2.0’s Immerse visualization client, today we want to walk you through some of version 2.0’s major improvements to our GPU-accelerated Core database and Iris Rendering Engine.
Before we delve into the details, main themes for this release are: speed, robustness, and further visual analytics power. Our system is able to steadily deliver extremely fast query speeds across a larger set of SQL queries and when analyzing datasets of daunting size. On the visual analytics front, our Iris Rendering Engine can now express a deeper set of visually compelling, data-driven insights.
In particular, here are some of our advances, starting with the MapD Core database:
Robust query execution model for subqueries. Your queries may be complex, but the execution of your queries doesn’t always have to be. When evaluating subqueries, we found that our SQL parser (the result of the excellent folks at the open source Apache Calcite project) was translating subqueries into JOINs. Joins can be harder to optimize because they may create more complex relational algebra patterns which bear little relation to the original query. To avoid this, we took advantage of a newly added Calcite goody, RexSubQuery, which preserves more information about the original structure of the query as it is passed down the chain into relational algebra. This allows each subquery to be optimized as a separate entity, laying out queries as a simple “tree” rather than as a complex “graph” of dependencies.
Moving subquery planning from a graph (left) to a tree (right).
Due to this work, multi-table joins, IN and NOT IN subqueries are now more robust and have a broader set of supported patterns. Queries will perform optimally more often without the need for tinkering with the query.
More compact and faster baseline hash. A hash dictionary is a data structure used to represent the output items of a GROUP BY query. Version 2 introduces a new “baseline hash” which is used for GROUP BY queries where a perfect hash is not feasible. Benefits of this work include: 1) query speeds which are 1-2 orders of magnitude faster for queries which group by dictionary-encoded strings 2) the ability to have tens of millions of groups in a query output, as compared with roughly 100K in previous versions.
With this work, MapD now uses system RAM more efficiently, meaning users now benefit from lightning fast queries even on result sets with a huge number of groups. For example visualizing large combinations of groups, like every twitter sender_id by every operating system, is no problem.
Furthermore, the addition of fast multi-GPU radix sort means that those groups can be sorted "at MapD speed." Even for tens of millions of groups, we can now sort in a fraction of a second.
Pre-flight size estimator. For the baseline hash data structure we spoke about above, in order to run a query efficiently we need to know the optimal number of slots which should be allocated to handle the data at hand. To do this, we have developed the ability to quickly estimate the cardinality of a query’s output even before the query is run. This is done by using a probabilistic counter data structure, allowing cardinality of large outputs to be estimated in a memory- and time-efficient way.
This stingier use of RAM means that our users can handle larger datasets on smaller machines, or can extract additional speed from existing resources.
The pre-flight estimator also is being used to allow large projection queries to run where they had previously been restricted. Before the actual query runs, a simple COUNT query is run to determine the filtered number of rows in the output. The projection query is allowed to proceed provided that number falls below a large (and ever growing) threshold. Users get to run more projection queries, more often.
Up to 30% faster data import times through better memory management, and support for importing multiple small files more quickly. We’ve also improved memory management for query execution, enabling us to have long-running servers without performance degradation over time.
Each of these features provides additional speed or scalability to our already record-breaking numbers, and allows us to address a broader set of data analytics use cases for our customers.
Moving beyond the Core database, we also have major new capabilities in the Iris Rendering Engine, which is the part of our system which takes the result of database queries and turns them into beautiful and insightful visualizations.
Multi-layered and grouped query rendering. Often geographic information can be more carefully understood by considering both point-level information for maximum detail and aggregate-level information for a sense of regional trends. In version 2.0, we’ve introduced the ability to overlay several layers of points and shapes, adding both information density and better context for analysis. Each layer is the result of a separate MapD Core SQL query, which are then composited to form the layers. By harnessing our Vega API, our users can leverage this functionality to aid their GIS analytics, but also to perform non-geographic visual analysis, such as layered scatterplots.
A layer of points on a layer of zipcodes
Density accumulation rendering. Often it’s the case that many pieces of data reside at the exact same location on a map or in a chart. But since they are at the same location, how do you represent the density of information underneath that one point? This is where density accumulation comes in, which debuted in the Iris Rendering Engine in version 2.0. Density accumulation allows for areas of high density to be colored in a more intense or “hot” color, and vice versa for areas of low density, letting the user understand the depth of data at any one point.
Density accumulate all the things
In addition to expressing density at a point, users can now make use of categorical accumulation rendering, where each category is assigned a color, and the mix of colors in any one area represents the population of categories in the area. Similar to how one might mix blue paint with yellow to make green paint, a yellow category and a blue category would cause a region to appear different shades of green based on each category’s weight. We do hope our users mix colors tastefully.
Categorical accumulation of a red category and a blue category for Massachusetts
More color spaces, and packed 32-bit integer storage of colors. In the past, the Iris Rendering Engine has supported the broadly used RGB color space. Version 2.0 introduces support for three additional color spaces which our users may specify when rendering data-driven graphics: HSL, CIE-LAB and HCL. HSL and HCL, are cylindrical representations of color which allow a larger gamut and make it easier to specify color via alternate properties such as hue, saturation, and lightness. CIE-LAB is a preferred color space for data visualization, since it tends to better match the human eye’s perception of color differences, allowing data distribution to align better with color perception.
Version 2.0’s added support for packing RGB and CIE-LIB colors into 32-bit integers means that colors are stored in a database-friendly format where they can be manipulated directly through SQL, and subsequently rendered at no conversion cost. We expect that version 2.0’s suite of color tools, on top of a more powerful MapD Core database, will let our customers achieve deeper understanding of their data and reach fast, intuitive visual insights.
Well, from front to back, that is an overview MapD version 2.0, and we hope you agree that it represents a seriously powerful package of tools for analytical discovery. And there's still much more to come! Try out a demo to see it for yourself, or reach out to us at email@example.com for an in-person demo.