HEAVY.AI Team
Apr 26, 2017

MapD 3.0 - Bringing distributed scale-out to GPU analytics

We’re very happy to announce that with today’s release of version 3.0 of the MapD Analytics Platform, we’re bringing GPU-accelerated analytics to distributed clusters!

We’ve been hard at work for months extending the unique advantages of our SQL-compliant GPU database from a single server to multiple servers, allowing our customers to take on even larger datasets while maintaining the fluid, instant data exploration experience we’ve become known for. Check out our distributed demo to see how quickly 11 billion rows of shipping location data can be visualized on a four-node MapD cluster.

Version 3.0 of the MapD Core database introduces support for a High Availability configuration, ensuring our enterprise customers are able to rely on MapD for robust and redundant coverage of their analytics needs.

While going distributed was a significant piece of work, the process was eased by the scalability we had already built into our single-node setup. Within one node, MapD uses a shared-nothing architecture across GPUs: when a query is launched, each GPU processes its slice of the data fully independently of the others, and the partial results are then aggregated on the CPU. Even though the GPUs reside within a single machine, you can think of this architecture as a miniature distributed setup, since work is fanned out from the CPU to multiple GPUs and the results are then gathered back onto the CPU.
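
To make that fan-out/fan-in pattern concrete, here is a minimal Python sketch of the idea. It is purely illustrative rather than actual MapD code (MapD Core itself is C++/CUDA): each worker stands in for a GPU computing a partial aggregate over its own slice, and the final step plays the role of the CPU combining the partials.

```python
# Illustrative sketch of the shared-nothing pattern described above (not MapD
# code): each "GPU" worker aggregates its own slice of the data independently,
# and the partial results are combined on the "CPU" at the end.
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(rows_slice):
    """Runs independently per device: no state shared with other slices."""
    return sum(rows_slice)              # stand-in for a per-GPU SUM kernel

def scatter_gather(rows, num_devices=4):
    # Fan out: one slice of the data per device.
    slices = [rows[i::num_devices] for i in range(num_devices)]
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        partials = list(pool.map(partial_aggregate, slices))
    # Gather back on the CPU: combine the per-device partial results.
    return sum(partials)

print(scatter_gather(list(range(1_000_000))))   # matches sum(range(1_000_000))
```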

To run this distributed architecture within a single node, previous versions of MapD introduced data structures (flatmaps) that efficiently package and transport data between GPU and host while preserving its meaning. In the same way, MapD 3.0’s distributed setup now efficiently transports data between the leaf nodes and the aggregator node of a cluster. It’s only slightly oversimplifying to say that going distributed required adding one new layer of aggregation: first from GPU to CPU, and now from leaf node to aggregator.
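
Extending the same illustration, the standalone sketch below adds that extra layer: each leaf reduces across its own GPUs, and the aggregator combines the per-leaf results. Again, this is a toy model of the pattern, not the flatmap machinery itself.

```python
# Toy model of the two extra reduction levels described above (not MapD code):
# level 1 reduces per GPU, level 2 combines a leaf's GPUs on its CPU, and
# level 3 combines the per-leaf results on the aggregator node.
def per_gpu_partials(leaf_rows, num_gpus=2):
    # Level 1: each GPU reduces its slice independently (shared-nothing).
    return [sum(leaf_rows[i::num_gpus]) for i in range(num_gpus)]

def leaf_result(leaf_rows):
    # Level 2: the leaf's CPU combines its GPUs' partial results.
    return sum(per_gpu_partials(leaf_rows))

def aggregator_result(all_rows, num_leaves=4):
    # Level 3: the aggregator combines the results shipped back by each leaf.
    leaf_slices = [all_rows[i::num_leaves] for i in range(num_leaves)]
    return sum(leaf_result(s) for s in leaf_slices)

print(aggregator_result(list(range(1_000_000))))   # same answer, one extra layer
```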

We’ve gone to great lengths to maintain MapD’s characteristic extreme performance by minimizing the overhead of sending data over the wire between nodes. Rather than fully converting data into idiomatic Thrift (our binary protocol of choice), we encapsulate our flatmap blob as a Thrift binary field, virtually eliminating conversion cost. In addition to transporting result data efficiently, we’ve also thought carefully about how to transport image data between nodes, an important consideration since a key advantage of our GPU-accelerated database is the generation of visual representations of result sets. We maintain good performance by compressing the images’ color-channel information before it is sent over the wire for final compositing on the aggregator node.
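
As a rough illustration of both ideas (this is not MapD’s actual wire format), the sketch below wraps an already-serialized buffer in a simple length-prefixed envelope, the moral equivalent of carrying the flatmap blob as a single opaque binary field, and uses zlib as a stand-in for compressing a rendered tile’s color channel before compositing.

```python
# Illustrative only: ship an already-serialized buffer as one opaque binary
# field instead of converting it value by value, and compress a color channel
# before sending it to the aggregator for compositing.
import struct
import zlib

def wrap_result_blob(flat_buffer: bytes) -> bytes:
    """Length-prefixed envelope: stand-in for a single Thrift binary field."""
    return struct.pack(">I", len(flat_buffer)) + flat_buffer

def unwrap_result_blob(message: bytes) -> bytes:
    (length,) = struct.unpack(">I", message[:4])
    return message[4:4 + length]

def compress_channel(channel_bytes: bytes) -> bytes:
    """Compress one color channel of a rendered tile before shipping it."""
    return zlib.compress(channel_bytes)

payload = b"\x01\x02\x03\x04" * 1000     # pretend this is a flatmap result buffer
assert unwrap_result_blob(wrap_result_blob(payload)) == payload

red_channel = bytes(64 * 64)             # pretend this is one channel of a 64x64 tile
print(len(red_channel), "->", len(compress_channel(red_channel)), "bytes after compression")
```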

A final advantage of 3.0’s distributed setup is faster data loading. Import times scale linearly with the number of nodes, since our database has no centralized index to maintain and hence no need for leaves to communicate during import. In an independent benchmark, a 1.2-billion-row, 477GB taxi dataset imported in 26 minutes across two servers, down from 48 minutes on a single server.

Beyond distributed operation, MapD 3.0’s other major addition is support for High Availability configurations. Due to MapD’s astounding performance, many customers are keen to use our services in more demanding environments where higher data volumes and stricter uptime requirements are the norm. A High Availability configuration keeps a set of databases running together in a High Availability Group synchronized in a guaranteed way. By load balancing requests across the servers in a High Availability Group, customers are assured that service remains available if a server fails. Beyond this added robustness, response times can improve substantially while multiple servers in the group are active, since query load is distributed across the members.
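
The sketch below illustrates the client-facing behavior described above: a round-robin balancer that skips failed members of an HA group. It is a toy model rather than a MapD component, and the member addresses and dummy transport are made up for the example.

```python
# Toy model of load balancing across an HA group with failover (not a MapD
# component): requests rotate round-robin across members, and a member that
# fails is skipped so service stays available.
import itertools

class HAGroupBalancer:
    def __init__(self, members):
        self.members = members                    # e.g. ["mapd-1:9091", "mapd-2:9091"]
        self._cycle = itertools.cycle(members)

    def execute(self, query, send):
        """Try each member in round-robin order until one answers."""
        last_error = None
        for _ in range(len(self.members)):
            member = next(self._cycle)
            try:
                return send(member, query)        # dispatch to this member
            except ConnectionError as err:        # member down: fail over to the next
                last_error = err
        raise RuntimeError("no HA group member available") from last_error

# Usage with a dummy transport: the second member answers when the first is down.
def fake_send(member, query):
    if member == "mapd-1:9091":
        raise ConnectionError("node offline")
    return f"{member} ran: {query}"

balancer = HAGroupBalancer(["mapd-1:9091", "mapd-2:9091"])
print(balancer.execute("SELECT COUNT(*) FROM tweets;", fake_send))
```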

There are a few other goodies worth mentioning as well, such as a native ODBC driver for connecting clients like Tableau, Excel, and other applications. Previously, customers could use a JDBC-to-ODBC bridge, but the native ODBC driver is easier to install and offers better performance. Also notable are additional SQL capabilities and query performance enhancements: for example, COUNT(DISTINCT) queries have been moved to the GPU for significantly better performance, and IN subqueries can now handle large input result sets.
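
For a flavor of the query shapes that benefit, here is a short example run over ODBC from Python with pyodbc. The DSN name and the flights table and its columns are hypothetical placeholders for your own setup, not part of the MapD release.

```python
# Example queries over the native ODBC driver via pyodbc. "MapD" is a
# hypothetical DSN name, and the flights table/columns are placeholders.
import pyodbc

con = pyodbc.connect("DSN=MapD", autocommit=True)
cur = con.cursor()

# COUNT(DISTINCT ...) now executes on the GPU.
cur.execute("SELECT COUNT(DISTINCT carrier_name) FROM flights")
print(cur.fetchone())

# IN subqueries can now handle large input result sets.
cur.execute("""
    SELECT carrier_name, COUNT(*) AS n
    FROM flights
    WHERE origin IN (SELECT origin FROM flights WHERE dep_delay > 60)
    GROUP BY carrier_name
    ORDER BY n DESC
""")
for row in cur.fetchall():
    print(row)
```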

These improvements, together with our new distributed and High Availability capabilities, mark a major step forward for our product, and we’re very excited to bring an even more scalable and powerful analytics platform to our customers. As always, please feel free to reach out to us to see how MapD can help you gain better insights from your own data.

