GPU-powered visual analytics to enable exploration of and interaction with big data for Machine Learning
H2O World 2017 was held at the beginning of December at the Computer History Museum in Mountain View, CA. The sold-out conference featured more than fifty speakers from across industries, who presented on a wide variety of topics. Data scientists, data engineers, and business analysts all gathered to learn how deep learning, data science, and artificial intelligence are transforming businesses today.
It was a privilege to present at the conference with Mateusz Dymczyk, Software Engineer from H2O.ai. Together, we shared how data scientists can leverage GPUs to explore and interact with data sets containing billions of rows, determine features, train and validate models, and interrogate and visualize those models. Our presentation was titled “MapD & H2O.ai: GPU-powered Visualization, Data Analysis and Machine Learning”. In case you missed it, here are three key takeaways.
MapD Delivers Visual Analytics at the Speed of Thought
MapD is built on two components: 1) MapD Core, our analytical SQL engine, which allows users to query billions of records in milliseconds, and 2) MapD Immerse, our self-service visualization system, which leverages the analytical processing speed and rendering power of GPGPUs (General Purpose GPUs). MapD allows analysts to explore big data at the speed of thought because of its extremely parallelized analytical processing, in-situ rendering, and intuitive self-service interface.
“MapD queries are orders of magnitude faster” - CPU vs. GPU
Industry blogger Mark Litwintschik benchmarked MapD on a billion-row taxi dataset and found it to be orders of magnitude faster than the fastest CPU databases. The dataset includes not only taxi trips but also receipts from ride-sharing companies such as Uber and Lyft, from 2009 to 2016. During our demo, we showed how easy it is to look at the taxi dashboard, form a hypothesis, interact with the data by applying cross-filters, and immediately get results from which to draw insights. For example, we zoomed directly into Times Square and discovered that the most popular pickup times were weekday mornings, whereas pickups in the Meatpacking District peaked in the wee hours of the morning on weekends. We also found that credit cards are now the primary form of payment, whereas cash was the primary form in 2009-2010. All of this was done at interactive speeds powered by GPUs.
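The cross-filter described above (restrict to a map region, then break pickups down by hour) boils down to a filtered aggregate query. The following is a minimal sketch only: it uses Python's built-in SQLite as a stand-in for a GPU SQL engine, and the `trips` table, its columns, the sample rows, and the bounding-box coordinates are all hypothetical, not the schema of the actual taxi dataset.

```python
import sqlite3

# SQLite stands in for the SQL engine; table, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (pickup_ts TEXT, pickup_lon REAL, pickup_lat REAL)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2015-03-02 08:15:00", -73.9855, 40.7580),  # Monday, near Times Square
        ("2015-03-03 08:40:00", -73.9851, 40.7577),  # Tuesday, near Times Square
        ("2015-03-04 18:05:00", -73.9860, 40.7583),  # Wednesday, near Times Square
        ("2015-03-07 02:30:00", -74.0060, 40.7410),  # Saturday, near Meatpacking
    ],
)


def pickups_by_hour(conn, lon_min, lon_max, lat_min, lat_max):
    """Count pickups per hour of day inside a bounding box (the 'cross-filter')."""
    sql = """
        SELECT CAST(strftime('%H', pickup_ts) AS INTEGER) AS pickup_hour,
               COUNT(*) AS n
        FROM trips
        WHERE pickup_lon BETWEEN ? AND ?
          AND pickup_lat BETWEEN ? AND ?
        GROUP BY pickup_hour
        ORDER BY n DESC, pickup_hour
    """
    return conn.execute(sql, (lon_min, lon_max, lat_min, lat_max)).fetchall()


# Approximate, illustrative bounding boxes for the two neighborhoods.
times_square = pickups_by_hour(conn, -73.99, -73.98, 40.755, 40.760)
meatpacking = pickups_by_hour(conn, -74.01, -74.00, 40.735, 40.745)
print(times_square)
print(meatpacking)
```

Zooming or panning the map in a dashboard corresponds to changing the four bounding-box parameters, which is why each interaction needs a fresh aggregate over the full table and benefits from fast query execution.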
Integration of MapD and H2O.ai (GOAI) Accelerates the Data Scientist’s Workflow
While GPUs can accelerate both data analytics and machine learning workloads, the systems and platforms that run these workloads are unable to fully harness the performance gains because they remain isolated from each other. The GPU Open Analytics Initiative (GOAI) was founded by MapD, H2O.ai, and Anaconda in May 2017 to break down these silos. The GPU data frame (GDF), based on Apache Arrow, allows data to be passed seamlessly between processes without needing to serialize and deserialize it over the PCIe bus.
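To make the serialization point concrete, here is a CPU-only analogy using only the Python standard library. It is not the GDF or Apache Arrow API, and the `fares` column is a hypothetical example; it simply contrasts handing a consumer a serialized copy of a column with handing it a view of the same memory.

```python
import pickle
from array import array

# One hypothetical column of numeric values held in a contiguous buffer.
fares = array("d", [12.5, 7.0, 30.25])

# Serialization path: the column is encoded to bytes, then decoded
# into a separate copy owned by the consumer.
copied = pickle.loads(pickle.dumps(fares))

# Shared-buffer path: a memoryview exposes the same memory without copying.
shared = memoryview(fares)

# The producer updates the column in place.
fares[0] = 99.0

print(copied[0])  # the serialized copy still holds the old value
print(shared[0])  # the view reflects the update, since no copy was made
```

The copy costs time and memory proportional to the size of the column, while the view costs neither, which is the motivation for sharing a common in-memory format between tools.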
The seamless integration between MapD and H2O.ai allows a data scientist to use MapD to explore data and determine relevant features. These rich, interactive descriptive analytics help the data scientist pick both business-driven and data-driven features that can predict the objective function. The data scientist can then choose the machine learning algorithm and hyperparameters, and score and validate the model using H2O.ai. H2O.ai makes machine learning accessible and allows business users to extract insights from data without needing expertise in deploying or tuning machine learning models. Finally, MapD can be used to improve the explainability of the trained model, making it more transparent and interpretable.
Thanks to everyone who attended our presentation at H2O World. In case you missed it, watch the video here.