Using Python for data visualization with MapD Core
Creating charts in the MapD framework can happen in a number of different ways. All of them involve our hyper-fast GPU-powered, SQL-compliant, relational database, MapD Core.
One can also use MapD’s published API to create entirely custom front ends on top of MapD Core.
Finally, using ODBC, JDBC or Thrift, users can connect to other visual analytics interfaces like Tableau, Qlik or Birst (just to name a few).
Within this world of connectors is a pretty standard implementation: Python over JDBC.
We have a number of clients using this implementation, so we wanted to document it in a blog post. There is a range of use cases, but very often it is part of a data science workflow that involves R.
I have made a few assumptions for this post and wanted to share those upfront.
- MapD Core will be installed on a remote server
- Python will be installed on a server where MapD is not installed. This is so that we replicate the behavior of an application server accessing a remote database server.
- The reader is able to install all the components listed below and follow the steps required to verify that each install went through correctly.
To get started, we need to get our environment up and running. To do that, we need to complete the following steps:
- Install MapD on a remote server as outlined in the installation documentation
- Install Python on your application server
- Install the pandas and matplotlib Python libraries to be able to display charts.
- Install sshpass. This is optional; I installed it because I didn’t want to enter the remote server’s password every time I stream data into MapD Core.
- Follow the instructions listed here for using the code snippet provided
- MapD Core supports Python via JayDeBeApi. mapd_jdbc.py in the sample code directory is a wrapper around jaydebeapi that returns a standard Python Connection object. The code assumes that the MapD JDBC driver (mapdjdbc-1.0-SNAPSHOT-jar-with-dependencies.jar) is available in the same directory. Users may create a cursor object using the returned connection object. Please be sure to close the connection at the end of your Python script.
- Before using, ensure that jaydebeapi is installed on the application server by running:
- pip install jaydebeapi
- The jar is available at $MAPD_PATH/bin/mapdjdbc-1.0-SNAPSHOT-jar-with-dependencies.jar
- The code snippet uses the mapd_jdbc wrapper to query MapD Core and plot the results with pyplot. The code is available in $MAPD_PATH/SampleCode as mapd_jdbc_example.py:
- Ensure that the following two files are available in the same directory as your Python script. You will have to physically copy them from the MapD server to the application server directory from which the Python code will be executed:
- mapdjdbc-1.0-SNAPSHOT-jar-with-dependencies.jar, located at $MAPD_PATH/bin on the MapD server
- mapd_jdbc.py which is located in the $MAPD_PATH/SampleCode directory
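Before running the example, it is worth confirming that both files actually made it over. A minimal sketch, assuming the script and the two files live in the same directory (the helper name is my own):

```python
import os.path

# The two files named above that must sit next to the Python script.
REQUIRED_FILES = [
    "mapd_jdbc.py",
    "mapdjdbc-1.0-SNAPSHOT-jar-with-dependencies.jar",
]

def check_prerequisites(directory):
    """Return the required files that are missing from `directory`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(directory, name))]

if __name__ == "__main__":
    missing = check_prerequisites(".")
    if missing:
        print("Copy these files before continuing: " + ", ".join(missing))
```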
Once this environment is set up, we are ready to execute an example.
- Open a file with thousands of rows (this scales to the billions)
- Iterate through the dataset, inserting 100 rows at a time into MapD Core
- Create a scatterplot graph outside of MapD Immerse using pyplot
- After five seconds the graph window closes automatically and the program goes back to the beginning, executing steps one through four again
- The code runs in an infinite loop and can be exited with Ctrl+C
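The 100-row batching in the steps above can be sketched as a small generator, independent of any database calls (the function name is my own):

```python
def batches(rows, size=100):
    """Yield successive lists of at most `size` rows from any iterable."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch
```

Each yielded batch would then be inserted into MapD Core before the next one is read.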
Note that as the number of rows in the database increases, you will see the chart change to plot the newer values. You can also verify the number of rows in MapD Core by running a count query against it from the command line.
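The same row-count check can be done from Python with the cursor object described earlier. A sketch, assuming the table is named hist as in the CREATE TABLE statement below (the helper name is my own):

```python
def row_count(cursor, table):
    """Return the number of rows in `table` using an open MapD cursor."""
    cursor.execute("SELECT COUNT(*) FROM %s" % table)
    return cursor.fetchall()[0][0]

# Usage with an open connection:
#   print(row_count(mapd_cursor, "hist"))
```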
Disclaimer: any code that goes into production needs proper error handling and graceful exits; the goal here is not to showcase production-ready code. The goal is to showcase the important steps in connecting, inserting, extracting, and graphing data using Python and MapD Core.
MapD Core is a relational, columnar database that gets much of its speed from denormalized tables. If you are pulling data from a third normal form schema or a star schema, optimal performance can be gained by denormalizing it: pull the data from the source with all the joins applied and create a single table in MapD Core.
I have used the Freddie Mac Loan Level dataset (available free with registration). Needless to say, any CSV file can be used after making the appropriate replacements in the code that is provided.
Python Code Snippet
The following code snippet will execute the steps listed above in the code summary provided you have set up your environment correctly.
Explanation of the Code
So let’s talk through what is happening at each step. Again, this is not a Python tutorial, so I will focus only on the code blocks and what they are doing, not anything underlying them. Keep in mind you will need to replace server names, user IDs, file paths, etc. as necessary.
#!/usr/bin/env python
This shebang line tells the shell to locate the Python interpreter on your PATH and use it to run the script.
import mapd_jdbc
import pandas
import matplotlib.pyplot as plt
import os
This segment imports the libraries required to instantiate a JDBC connection, along with the pandas and matplotlib libraries required to plot our graphs. Finally, it imports the os module, required to run OS commands from within a Python script.
dbname = 'database name'
user = 'user name on database'
host = 'host name:port'
password = 'password'
This segment is user specific and defines the variables for the connection string required to instantiate the connection into MapD Core.
mapd_con = mapd_jdbc.connect(dbname=dbname, user=user, host=host, password=password)
mapd_cursor = mapd_con.cursor()
This segment opens a JDBC connection into MapD Core and then defines a cursor to pull the rows.
query = """CREATE TABLE hist (
    CREDIT_SCORE SMALLINT,
    FIRST_PAYMENT_DATE INTEGER,
    FIRST_TIME_HOMEBUYER_FLAG TEXT ENCODING DICT(32),
    MATURITY_DATE INTEGER,
    MSA INTEGER,
    MIP SMALLINT,
    NUMBER_OF_UNITS SMALLINT,
    OCCUPANCY_STATUS TEXT ENCODING DICT(32),
    ORIGINAL_COMBINED_LOAN SMALLINT,
    ORIGINAL_DEBT SMALLINT,
    ORIGINAL_UPB INTEGER,
    ORIGINAL_LOAN SMALLINT,
    ORIGINAL_INTEREST_RATE FLOAT,
    CHANNEL TEXT ENCODING DICT(32),
    PPM_FLAG TEXT ENCODING DICT(32),
    PRODUCT_TYPE TEXT ENCODING DICT(32),
    PROPERTY_STATE TEXT ENCODING DICT(32),
    PROPERTY_TYPE TEXT ENCODING DICT(32),
    POSTAL_CODE INTEGER,
    LOAN_SEQUENCE_NUMBER TEXT ENCODING NONE,
    LOAN_PURPOSE TEXT ENCODING DICT(32),
    ORIGINAL_LOAN_TERM SMALLINT,
    NUMBER_OF_BORROWERS SMALLINT,
    SELLER_NAME TEXT ENCODING DICT(32),
    SERVICER_NAME TEXT ENCODING DICT(32))"""
This block defines a variable with a query in order to create a table in MapD. Note that when you run this code multiple times you must be sure to drop the table each time before running it again.
This executes the query, thereby creating the table on MapD Core.
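A sketch of this execution step that also handles the re-run note above by dropping any previous copy of the table first (assuming the server accepts DROP TABLE IF EXISTS; the helper name is my own):

```python
def create_hist_table(cursor, ddl):
    """Drop any earlier copy of hist, then create it from the DDL string."""
    cursor.execute("DROP TABLE IF EXISTS hist")
    cursor.execute(ddl)

# Usage: create_hist_table(mapd_cursor, query)
```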
This block opens the CSV file hist.csv for reading. This file should already exist; it is your input file.
The code will then iterate through hist.csv until it reaches the end of the file, writing rows out to a new file called streamfile.txt in the directory. If you specified 10,000 records, it will close the file upon reaching that threshold and SCP it over to the MapD Core server. This is needed because the COPY command requires the data to be local to the server.
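A sketch of this write-and-ship step. The remote host and path are placeholders, and sshpass -e (which reads the password from the SSHPASS environment variable) stands in for typing the password by hand:

```python
import os

def ship_batch(rows, remote="user@mapd-server:/tmp/"):
    """Write one batch of CSV lines to streamfile.txt and return the
    scp command that copies it to the MapD Core server."""
    with open("streamfile.txt", "w") as out:
        out.writelines(rows)
    return "sshpass -e scp streamfile.txt " + remote

# The caller would then run the returned command, e.g.:
#   os.system(ship_batch(batch))
```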
Now the code issues a COPY..FROM command to insert the 10,000 rows into MapD Core. Next comes the query that gets the data for plotting, with the results rendered as a scatterplot graph. After 5 seconds, the chart window closes.
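A sketch of this load-and-plot step. The server-side file path and the two plotted columns (taken from the CREATE TABLE statement above) are assumptions, and the helper names are my own:

```python
def copy_batch(cursor, path="/tmp/streamfile.txt"):
    """Issue the COPY..FROM that loads the shipped file into hist."""
    cursor.execute("COPY hist FROM '%s'" % path)

def plot_batch(cursor):
    """Query two columns and show them as a scatter plot for five seconds."""
    import matplotlib.pyplot as plt  # local import keeps copy_batch free of plotting dependencies
    cursor.execute("SELECT CREDIT_SCORE, ORIGINAL_INTEREST_RATE FROM hist")
    rows = cursor.fetchall()
    plt.scatter([r[0] for r in rows], [r[1] for r in rows])
    plt.pause(5)   # keep the window open for five seconds
    plt.close()    # then close it and return to the loop
```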
While we have chosen a scatter plot, it should be noted that several other chart types are available in the pandas and matplotlib libraries.
Please note that Python is very sensitive to indentation, so if you are copying and pasting the code, recheck the indentation against the explanation I have provided above. The code above does not handle errors, end-of-file conditions, etc.
This simple example underscores a powerful piece of functionality in our software: the ability to call different applications to move data in and out of MapD Core, taking advantage of its exceptional capabilities to operate on the data.
This is particularly valuable in the context of data science, where Python reigns supreme and the ability to interact with data from applications such as R and MATLAB is important to the overall workflow.
For additional resources to help build production-ready code, see the link below.
Using the Thrift interface, data can be streamed into MapD continuously, instead of the SCP approach shown in the code above.
Examples are available in the SampleCode folder of your MapD installation.