MapD is a founding and active member of GOAi (the GPU Open Analytics Initiative). One of the primary goals of GOAi is to enable end-to-end analytics on GPUs. While each technology in the pipeline leverages GPUs well on its own, moving data off the GPU to reach the next system in the pipeline can add significant latency. Keeping the data in a GPU buffer through exploration, extraction, preprocessing, model training, validation, and prediction makes the whole workflow much faster and simpler.
MapD and Anaconda, another GOAi founding member, are involved in the development of Pythonic clients such as pymapd (an interface to MapD's SQL engine supporting DBAPI 2.0) and pygdf (a Python interface to access and manipulate the GPU Dataframe), along with our core platform modules: the MapD Core SQL engine and MapD Immerse, our visual analytics tool.
With the help of Apache Arrow, an efficient data interchange is created between MapD and pygdf to leverage various machine learning tools like h2o.ai, PyTorch, and others. In this post, we will explore analytics on GPUs with the help of the aforementioned MapD and Python tools along with XGBoost.
The example used in this post, the Customer Automotive Churn dataset (which focuses on the real-world problem of customer vehicle churn), was obtained from Volkswagen as a result of our joint collaboration on implementing an analytics workflow on GPUs. We invite you to this session at GTC San Jose to learn more: How the Auto Industry Accelerates Machine Learning with Visual Analytics.
Some details, such as column names and some column values, have been masked for obvious reasons. The dataset contains 21 feature columns including details like car model, production year, etc. The target variable has two classes, 0 and 1, which indicate whether or not the next servicing of a vehicle took place in a VW garage. The dataset has already been divided into training and test sets. In the next few steps, we will perform feature engineering in MapD, extract data from MapD, preprocess it in pygdf, train a model, make predictions with XGBoost, and store the results back in MapD. Do not worry, a notebook has been provided so that you do not have to copy and paste all the steps in this post.
The first step is to get the MapD Community Edition (which includes the MapD Core SQL database and a front-end visualization tool called Immerse). Then install pygdf (from GOAi), pymapd (from conda-forge), and XGBoost, which can be installed directly from source.
conda install -c conda-forge pymapd
conda install -c gpuopenanalytics/label/dev pygdf
Setting up MapD Connection
A connection to MapD can be established by passing the basic parameters. We will use the returned connection object to perform all transactions with the database.
A dataset can be loaded into MapD using the load_table method with a pandas dataframe as the input. Being a high-level method, load_table automatically chooses pyarrow. Here's an example of loading the Iris dataset on the fly:
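A minimal sketch of this step, assuming a local server that still uses the default credentials MapD ships with (user mapd, password HyperInteractive, database mapd). Adjust these for your own deployment. The helper name get_connection is ours and is not part of pymapd.

```python
# Connection parameters for a local MapD server. These are the assumed
# out-of-the-box defaults; replace them with your own credentials.
MAPD_PARAMS = {
    "user": "mapd",
    "password": "HyperInteractive",
    "host": "localhost",
    "dbname": "mapd",
}


def get_connection(params=None):
    """Open a pymapd connection (hypothetical helper, not a pymapd API)."""
    # Imported here so the module can be loaded without pymapd installed.
    from pymapd import connect

    return connect(**(params or MAPD_PARAMS))
```

Calling con = get_connection() returns the connection object that the later steps pass around.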
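As a rough sketch, the load could look like the following. The table name "iris" and both helper names are assumptions made for illustration, and only four rows of Iris are included to keep the example short. Depending on your pymapd version, the target table may need to exist before load_table is called.

```python
# Four rows of the Iris dataset, enough to show the shape of the call.
IRIS_COLUMNS = ["sepal_length", "sepal_width", "petal_length",
                "petal_width", "species"]
IRIS_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]


def iris_sample_frame():
    """Build a small pandas dataframe from the rows above."""
    import pandas as pd  # imported lazily; only needed to build the frame

    return pd.DataFrame(IRIS_ROWS, columns=IRIS_COLUMNS)


def load_frame(con, table_name, frame):
    """Send a dataframe to MapD; load_table picks the transfer method."""
    con.load_table(table_name, frame)
    return table_name
```

With a live connection, load_frame(con, "iris", iris_sample_frame()) performs the upload.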
Feature Engineering in MapD
Assuming that you have loaded the churn dataset into MapD, let’s start building some charts in MapD Immerse, which by default starts on https://localhost:9092. The capability used in this post to display charts from different tables in one dashboard is limited to the MapD Immerse Enterprise edition, but you can use the Community Edition to create a separate dashboard for each chart. We can find a few insights in the example charts below:
In this dataset, we see 1,424,232 records for training and 357,019 for testing. It can be observed that car models produced in early years, especially models 8, 10, and 11, are more prone to churn.
Next, we extract the train and test data into pygdf dataframes using two separate queries. While extracting the test data, we also extract MapD’s native “rowid” column, which contains a virtual ID generated for each row. We will discuss the use and importance of rowid later.
Each of the two queries returned 21 feature columns. Together they covered about 1.7 million data points, and extraction with the two queries combined took just 0.45 seconds. The reason is that, with the help of Arrow, pointers to the memory buffers holding the data on the GPU are passed from MapD to the Python client, which gives us back a pygdf dataframe.
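The two queries might be put together as sketched below. The table names churn_train and churn_test, the target column name, and the col_N feature names are hypothetical stand-ins for the masked dataset. select_ipc_gpu is the pymapd call that returns a GPU dataframe.

```python
def build_select(table, columns, include_rowid=False):
    """Return a SELECT statement over the given columns."""
    selected = list(columns)
    if include_rowid:
        # rowid is MapD's virtual per-row identifier, not a stored column.
        selected.append("rowid")
    return "SELECT {} FROM {}".format(", ".join(selected), table)


def extract_to_gpu(con, query):
    """Run the query and return the result as a GPU dataframe."""
    # select_ipc_gpu hands back a dataframe whose buffers stay on the GPU.
    return con.select_ipc_gpu(query)


# Hypothetical table and column names standing in for the masked dataset.
FEATURES = ["col_{}".format(i) for i in range(1, 22)]
TRAIN_QUERY = build_select("churn_train", FEATURES + ["target"])
TEST_QUERY = build_select("churn_test", FEATURES + ["target"],
                          include_rowid=True)
```

Only the test query asks for rowid, since that is the set whose predictions are written back later.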
Preprocessing in pygdf
Since we will be using decision trees with XGBoost to build the model, we don't have to apply the usual transformations such as normalizing, encoding, etc. Instead, we will just fill the null values.
One more step is to copy the rowid column to a new dataframe. We will use rowid as a unique identifier to map the predicted values back later. Rowid comes in handy when mapping predicted values for datasets that do not have a unique identifier or primary key column.
The last step is to separate the target variable from the train and test sets.
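The three preprocessing steps can be sketched in one helper. It is written against the pandas-style dataframe interface (pop, column assignment, fillna); whether your pygdf version exposes exactly these methods is an assumption, so you may need to adapt the calls. The fill value of -1 and the column names in the usage note are also placeholders.

```python
def preprocess(frame, feature_cols, target_col, rowid_col=None, fill_value=-1):
    """Split off rowid and target, then fill nulls in the feature columns.

    `frame` is any dataframe-like object that supports pop(), column
    assignment, and a per-column fillna(). pandas does; pygdf is assumed to.
    """
    # Keep rowid aside so predictions can be mapped back to MapD rows later.
    rowids = frame.pop(rowid_col) if rowid_col is not None else None
    # Separate the target variable from the features.
    target = frame.pop(target_col)
    # Tree models need no scaling or encoding here, only null filling.
    for col in feature_cols:
        frame[col] = frame[col].fillna(fill_value)
    return frame, target, rowids
```

For the test set, call it with rowid_col="rowid"; for the training set, leave rowid_col unset.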
Training the Model
Now here’s the fun part. Training is an iterative process that requires continuous tuning of hyperparameters to reach the optimum. Hyperparameter tuning varies for each dataset. Thankfully, we have already done the hard part of this tutorial for you. Below are the parameters for XGBoost to train our model on GPUs:
Let's begin training the model. Our data is on the GPU and, unfortunately, XGBoost's API doesn't support pointers. Therefore, we have to copy the data to the CPU in the form of a pandas dataframe and pass it to the algorithm, which will perform the training on GPUs. We will also validate the model using 5-fold cross-validation.
So, it took roughly 1.3 seconds to train the model on approximately 1.7 million data points, 14 seconds to cross-validate, and just 0.3 seconds to copy the data into a pandas dataframe. The minimum test Area Under the Curve (AUC) obtained across all cross-validations is 0.80, which is not bad for this specific dataset.
After training completes, let's store the fscores of each feature in MapD so that we can save them to the dashboard in Immerse.
Our model shows that col_8 has the highest importance of all the features, followed by col_20, col_10, col_9, and col_4. So let's now go ahead and calculate the partial dependencies of our top 5 features.
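The tuned values from the notebook are not reproduced here. As an illustration only, a parameter set for a binary classifier trained on a GPU might look like the following; every value is an assumption. The gpu_hist tree method applies to older XGBoost releases, while XGBoost 2.0 and later use tree_method "hist" together with device "cuda".

```python
# Illustrative XGBoost settings for a binary churn classifier on a GPU.
# Assumed values, not the tuned parameters from the original notebook.
XGB_PARAMS = {
    "objective": "binary:logistic",  # two classes: 0 and 1
    "eval_metric": "auc",            # the post reports AUC
    "tree_method": "gpu_hist",       # GPU histogram algorithm (older XGBoost)
    "max_depth": 6,
    "eta": 0.1,
}
NUM_BOOST_ROUNDS = 100
```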
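A sketch of the training and validation step follows. It assumes the pygdf dataframe and series expose to_pandas() for the copy to host memory, and the helper name and default round count are ours. DMatrix, train, and cv are standard XGBoost calls.

```python
def train_and_validate(train_gdf, labels_gdf, params, num_rounds=100, nfold=5):
    """Copy GPU data to pandas, train a booster, and cross-validate it."""
    import xgboost as xgb  # imported lazily so the sketch loads without it

    # Copy from GPU to host memory; to_pandas() is assumed to exist on the
    # pygdf dataframe and series.
    train_pdf = train_gdf.to_pandas()
    labels = labels_gdf.to_pandas()

    dtrain = xgb.DMatrix(train_pdf, label=labels)
    booster = xgb.train(params, dtrain, num_boost_round=num_rounds)

    # k-fold cross-validation (5 folds by default), scored with AUC.
    cv_results = xgb.cv(params, dtrain, num_boost_round=num_rounds,
                        nfold=nfold, metrics="auc", seed=0)
    return booster, cv_results
```

When pandas is installed, xgb.cv returns a dataframe, so cv_results["test-auc-mean"].min() gives the minimum test AUC.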
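One way to do this, sketched with a hypothetical table name and helper names: get_fscore() on a trained booster returns a dictionary of feature names to scores, which can be sorted and uploaded with load_table.

```python
def fscores_to_rows(fscores):
    """Turn {feature: score} into rows sorted by descending score."""
    return sorted(fscores.items(), key=lambda item: item[1], reverse=True)


def store_fscores(con, booster, table_name="churn_feature_importance"):
    """Write the booster's fscores to a MapD table (hypothetical name)."""
    import pandas as pd  # imported lazily; needed only for the upload

    rows = fscores_to_rows(booster.get_fscore())
    frame = pd.DataFrame(rows, columns=["feature", "fscore"])
    con.load_table(table_name, frame)
    return rows
```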
Partial Dependence Plots
Partial dependence is a measure of how dependent the target variable is on a certain feature. A good explanation can be found in Ron Pearson’s article on interpreting partial dependence plots. The next step is to plot partial dependence plots of our top 5 features and visualize all of them in one chart in Immerse. A small helper method is defined to calculate the dependence on GPUs.
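The notebook's helper is not shown here, so the following is a generic sketch of the usual calculation rather than the authors' implementation: force the feature to each value in a grid for every row, predict, and average the predictions. Where the prediction runs (CPU or GPU) depends on how the model behind predict was configured.

```python
def partial_dependence(predict, frame, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`.

    `predict` maps a dataframe-like object to a sequence of predictions;
    `frame` only needs copy() and column assignment (pandas qualifies).
    """
    curve = []
    for value in grid:
        modified = frame.copy()
        modified[feature] = value  # same value for every row
        preds = predict(modified)
        curve.append((value, float(sum(preds)) / len(preds)))
    return curve
```

With XGBoost, predict could be lambda f: booster.predict(xgb.DMatrix(f)), and grid could be the sorted distinct values of the feature. The resulting (value, mean prediction) pairs can be loaded into MapD and charted in Immerse.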
Finally, let's proceed to the predictions on the test set and store the results back in MapD.
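A sketch of this last step, with hypothetical table and column names and an assumed 0.5 threshold for turning probabilities into 0/1 labels. The rowid values saved during preprocessing tie each prediction back to its source row; they are stored under the name test_rowid to avoid clashing with MapD's own virtual rowid.

```python
def prediction_rows(rowids, probabilities, threshold=0.5):
    """Pair each rowid with its predicted probability and a 0/1 label."""
    return [
        (int(rowid), float(prob), int(prob >= threshold))
        for rowid, prob in zip(rowids, probabilities)
    ]


def store_predictions(con, booster, test_pdf, rowids,
                      table_name="churn_predictions"):
    """Predict on the test set and load the results into MapD."""
    import pandas as pd      # imported lazily; needed only at call time
    import xgboost as xgb

    probabilities = booster.predict(xgb.DMatrix(test_pdf))
    rows = prediction_rows(rowids, probabilities)
    frame = pd.DataFrame(rows,
                         columns=["test_rowid", "probability", "predicted"])
    con.load_table(table_name, frame)
    return frame
```

Joining this table to the test table on rowid brings the predictions next to the original columns in Immerse.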
Great! Now the metrics of the model, the original data, and the predicted values are all available in the same dashboard, awaiting your observations to turn them into valuable insights. Having everything at hand in one place makes analysis very easy. We can repeat this process multiple times, and the best part is that we get the results within a few minutes, compared to the hours spent assessing and assembling data with traditional means.
Try It Out
Liked what you saw? You can download the Docker version of the Jupyter notebook demo here. Let us know what you think on our community forums or on GitHub. You can also download a fully featured Community Edition of MapD, which includes the open source MapD Core SQL engine and our MapD Immerse data exploration UI.