Veda Shankar
Aug 2, 2018

How to Deploy MapD Community Edition on OpenShift Origin


OpenShift is Red Hat’s enterprise-grade container platform based on Docker, Kubernetes and Red Hat Enterprise Linux. Origin is the upstream, open source, community version of OpenShift, so pairing it with the MapD Community Edition feels like a natural fit!

In this blog post, I will show you how to deploy MapD on top of OpenShift Origin. The MapD Community Edition is available as Docker images for two hardware configurations: one for CPU-only nodes and the other for GPU-enabled nodes. For this example, I leveraged shared storage so that either the CPU or GPU node, with its corresponding Docker image, operates on the same backend MapD database. The ability to switch between a CPU and a GPU node lets us use the CPU image for tasks that do not demand much performance, and the GPU image for running SQL queries and rendering complicated charts in milliseconds.

Setting up the infrastructure

For testing MapD on OpenShift Origin, I used Amazon EC2 cloud, but the instructions are also applicable to an on-premise deployment of OpenShift. I followed the instructions in Sysdig’s tutorial How to deploy OpenShift on AWS to set up a minimalistic OpenShift Origin cluster. I modified the CloudFormation script provided in this tutorial to launch the following configuration in AWS:

  • 1 t2.medium CentOS 7 based master node
  • 1 p2.xlarge CentOS 7 based worker node that has a GPU (Tesla-K80)
  • 1 t2.large CentOS 7 based worker node that has CPU only
  • A separate VPC and subnets to provide logical network isolation for the OpenShift cluster
  • Security groups that open the following public ports (see the sketch after this list):
      • 22 (SSH) for all hosts
      • 8443 (OpenShift web console) for the master node
      • 10250 (master proxy to node hosts) for the master node
      • 9090 through 9092 for MapD
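
For reference, here is a hedged sketch of equivalent ingress rules via the AWS CLI; the security-group ID is a placeholder, and in my setup the CloudFormation template creates these rules automatically:

$ SG=sg-0123456789abcdef0   # hypothetical security-group ID
$ aws ec2 authorize-security-group-ingress --group-id $SG --protocol tcp --port 22 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-id $SG --protocol tcp --port 8443 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-id $SG --protocol tcp --port 10250 --cidr 0.0.0.0/0
$ aws ec2 authorize-security-group-ingress --group-id $SG --protocol tcp --port 9090-9092 --cidr 0.0.0.0/0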

In a production environment, the amount of data you can process with MapD Core depends primarily on the amount of GPU RAM and CPU RAM available on a MapD worker node. The MapD Hardware Configuration Reference Guide will help you size the MapD node for both on-premise as well as cloud deployment.

Because I wanted to access the same database from both the CPU- and GPU-based worker nodes, I decided to set up shared storage based on Amazon EFS (Elastic File System). EFS allows the nodes to mount it using the NFSv4 protocol. I created the EFS file system mount target in the same VPC as the OpenShift cluster.
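
Before wiring EFS into the pod definitions later in this post, it is worth mounting the file system by hand from a worker node to confirm connectivity. A minimal check, using the standard NFSv4.1 mount options from the EFS documentation:

$ sudo mkdir -p /mnt/efs-test
$ sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 fs-2bc69063.efs.us-east-1.amazonaws.com:/ /mnt/efs-test
$ df -h /mnt/efs-test
$ sudo umount /mnt/efs-test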


Deploying OpenShift Using Ansible

The OpenShift Origin project uses Ansible playbooks to automate the installation. I used the OpenShift master node as the Ansible server, which needs to be set up with several required packages.

Log in to the master node and install Ansible:

$ sudo yum install epel-release -y
$ sudo yum update -y
$ sudo yum install ansible -y
$ sudo yum install git -y

On the master node, I downloaded the OpenShift Ansible GitHub project and checked out the release-3.9 branch, which contains the Ansible roles and playbooks to install and manage OpenShift clusters.

$ git clone https://github.com/openshift/openshift-ansible.git
$ cd openshift-ansible
$ git checkout release-3.9
$ cd ..


I ran the Ansible playbook prepare.yml to perform the pre-configuration of the CentOS hosts. Ansible uses the hosts inventory file, which assigns the master and worker node roles. In this setup, the master node runs all of the components of the master controller and also hosts the etcd key-value store.

As you can see, the hosts file has the same server for both the master and etcd:
...
[masters]
ip-10-0-0-6.ec2.internal
[etcd]
ip-10-0-0-6.ec2.internal
[nodes]
ip-10-0-0-6.ec2.internal openshift_node_labels="{'region':'infra','zone':'east'}" openshift_schedulable=true
ip-10-0-0-4.ec2.internal openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
ip-10-0-0-10.ec2.internal openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
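
Before running the playbooks, a quick ad-hoc ping against the inventory confirms that Ansible can reach every node (this assumes the default centos login user on these CentOS AMIs):

$ ansible -i ./hosts nodes -m ping -u centos --key-file mapd-east1.pem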

I ran the pre-configuration Ansible playbook:
$ ansible-playbook prepare.yml -i ./hosts --key-file mapd-east1.pem

Then I ran the OpenShift prerequisites and cluster deployment playbooks against the configured nodes:

$ ansible-playbook -i ./hosts openshift-ansible/playbooks/prerequisites.yml  --key-file mapd-east1.pem
$ ansible-playbook -i ./hosts openshift-ansible/playbooks/deploy_cluster.yml --key-file mapd-east1.pem

The OpenShift cluster became available once the playbook completed execution successfully, and I could then access the web console.

The final step in setting up the cluster was to create a user account and password:

$ sudo htpasswd -b /etc/openshift/openshift-passwd admin MapD1@

With a browser, I accessed the OpenShift Console at the master node's public IP address using port 8443. You can log in as admin with the password MapD1@.

OpenShift Console

On the master node (ip-10-0-0-6), I logged in as system:admin to the default project using the oc command-line utility, and confirmed that the master and worker nodes were in a ready state:

$ oc login -u system:admin
$ oc project default
$ oc get nodes
NAME                        STATUS    ROLES     AGE       VERSION
ip-10-0-0-10.ec2.internal   Ready     compute   13d       v1.9.1+a0ce1bc657
ip-10-0-0-4.ec2.internal    Ready     compute   13d       v1.9.1+a0ce1bc657
ip-10-0-0-6.ec2.internal    Ready     master    13d       v1.9.1+a0ce1bc657

In my setup, ip-10-0-0-10 is the GPU node and ip-10-0-0-4 is the CPU-only node, with the nodes labeled accordingly:

$ oc label node ip-10-0-0-4.ec2.internal  nodetype=cpu
$ oc label node ip-10-0-0-10.ec2.internal nodetype=gpu
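
To double-check the labels before relying on them in a pod's nodeSelector, oc can print the custom label as a separate column:

$ oc get nodes -L nodetype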

GPU Setup - NVIDIA Driver Installation

To use the Tesla-K80 GPU on the OpenShift node, I had to install the NVIDIA drivers and NVIDIA docker container runtime modules. The operating system on the node with the GPU is CentOS Linux release 7.3.1611 (Core). I installed the Extra Packages for Enterprise Linux (EPEL) repository, because RHEL-based distributions require Dynamic Kernel Module Support (DKMS) to build the GPU driver kernel modules.

I performed the following commands on the GPU node:

$ sudo yum install epel-release

Then I installed the latest kernel, for which matching headers are available, and rebooted the node:

$ sudo yum install kernel
$ sudo reboot

After reboot, I installed the kernel headers and CUDA drivers. The CUDA platform gives direct access to the GPU virtual instruction set and parallel computation elements.

$ sudo yum install kernel-devel-$(uname -r)
$ curl -O http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-9.1.85-1.x86_64.rpm
$ sudo rpm --install cuda-repo-rhel7-9.1.85-1.x86_64.rpm
$ sudo yum clean expire-cache
$ sudo yum install dkms
$ sudo yum install cuda-drivers
$ curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
$ sudo yum install  nvidia-container-runtime
$ sudo modprobe -r nouveau
$ sudo nvidia-modprobe
$ sudo nvidia-modprobe -u
$ sudo vi  /usr/libexec/oci/hooks.d/oci-nvidia-hook
        #!/bin/bash
        /usr/bin/nvidia-container-runtime-hook $1
$ sudo chmod +x /usr/libexec/oci/hooks.d/oci-nvidia-hook
$ sudo systemctl restart docker
$ sudo systemctl status docker

I labeled the installed NVIDIA files with the correct SELinux label and rebooted the system to ensure that all changes were active.

$ sudo chcon -t container_file_t /dev/nvidia*    # this alone did not work
$ sudo reboot

After reboot, I confirmed that the NVIDIA drivers were loaded and that the GPU was recognized.

$ lsmod | grep nvidia
$ nvidia-smi
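
As an optional sanity check before involving OpenShift, the container runtime hook can be exercised directly through Docker. The CUDA image tag below is my assumption; any CUDA base image should behave the same:

$ sudo docker run --rm nvidia/cuda:9.1-base nvidia-smi   # should list the Tesla-K80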

Deploying the GPU Version of MapD Community Edition

These are the steps for installing MapD Community Edition as a Docker container on the OpenShift node with the Tesla-K80 GPU. The image mapd/mapd-ce-cuda is optimized to run on the CUDA platform and is available on Docker Hub. I created the YAML file deploy_mapd_gpu.yml for launching the pod with the MapD Docker image:

apiVersion: v1
kind: Pod
metadata:
  name: mapd
  labels:
    app: mapd
spec:
  nodeSelector:
    # Node with GPU
    nodetype: gpu
  containers:
  - name: mapd
    # MapD Docker GPU image
    image: mapd/mapd-ce-cuda
    volumeMounts:
    # EFS mount point
    - mountPath: "/mapd-storage"
      name: mystor
    # MapD ports
    ports:
    - name: mapd-port0
      containerPort: 9090
    - name: mapd-port1
      containerPort: 9091
    - name: mapd-port2
      containerPort: 9092
  volumes:
  - name: mystor
    nfs:
      # EFS access
      server: fs-2bc69063.efs.us-east-1.amazonaws.com
      path: /


After I created the MapD pod, I checked the status to make sure it was running:

$ oc create -f deploy_mapd_gpu.yml 
$ oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-26c5j    1/1       Running   2          13d
mapd                       1/1       Running   0          27s
registry-console-1-lpldp   1/1       Running   0          2d
router-1-c7hds             1/1       Running   2          13d

I then got details of the launched MapD pod and ran the nvidia-smi management and monitoring command line utility to make sure that the MapD process was running on the GPU:

$ oc describe pod mapd
Name:         mapd
Namespace:    default
Node:         ip-10-0-0-10.ec2.internal/10.0.0.10
...
$ ssh -i mapd-east1.pem ip-10-0-0-10.ec2.internal "nvidia-smi"
Wed Jul 25 06:37:50 2018      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   57C    P0    57W / 149W |    699MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+                         
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11462    C+G   ./bin/mapd_server                            686MiB |
+-----------------------------------------------------------------------------+

I accessed the command line in the MapD Docker container to examine the processes and run MapD utilities. Notice that the MapD server and MapD web server were launched automatically, and that the database is stored on the NFS shared storage mounted at /mapd-storage.

$ oc rsh mapd 
# ps -ef | grep mapd
root         1     0  0 01:15 ?        00:00:00 /bin/sh -c /mapd/startmapd --non-interactive --data /mapd-storage/data --config /mapd-storage/mapd.conf
root         8     7  0 01:15 ?        00:00:08 ./bin/mapd_server /mapd-storage/data --port 9091 --http-port 9090 --calcite-port 9093
root         9     7  0 01:15 ?        00:00:00 ./bin/mapd_web_server --port 9092 --backend-url http://localhost:9090 --data /mapd-storage/data
root        25     8  0 01:15 ?        00:00:13 -Xmx1024m -jar /mapd/bin/calcite-1.0-SNAPSHOT-jar-with-dependencies.jar -e /mapd/QueryEngine/ -d /mapd-storage/data -p 9093 -m 9091
# ls /mapd-storage 
data
# ls /mapd-storage/data
mapd_catalogs  mapd_data  mapd_export  mapd_import  mapd_log  mapd_server_pid.lck

To verify that the system was working, I loaded some sample data, ran a SQL query using the mapdql command-line utility, and finally generated a scatter plot using MapD Immerse. MapD ships with two sample datasets of airline flight information collected in 2008, and one dataset of New York City tree census information collected in 2015. To install the sample data, I ran the insert_sample_data command and chose option 2 to insert 10k rows of flights data.

# /mapd/insert_sample_data
        ...
Enter dataset number to download, or 'q' to quit:
 #     Dataset                   Rows    Table Name             File Name
 1)    Flights (2008)            7M      flights_2008_7M        flights_2008_7M.tar.gz
 2)    Flights (2008)            10k     flights_2008_10k       flights_2008_10k.tar.gz
 3)    NYC Tree Census (2015)    683k    nyc_trees_2015_683k    nyc_trees_2015_683k.tar.gz
2
/mapd/sample_datasets /mapd
- downloading and extracting flights_2008_10k.tar.gz
...
- inserting file: /mapd/sample_datasets/flights_2008_10k/flights_2008_10k.csv
User mapd connected to database mapd
Result
Loaded: 10000 recs, Rejected: 0 recs in 15.960000 secs
User mapd disconnected from database mapd

I connected to MapD Core with mapdql, using the default password "HyperInteractive":

# /mapd/bin/mapdql -p HyperInteractive
User mapd connected to database mapd

Then I listed the tables in the database:

mapdql> \t
mapd_states
mapd_counties
mapd_countries
flights_2008_10k

I printed the schema for the flights table. The fields used in the SQL query are listed below:

mapdql> \d flights_2008_10k
CREATE TABLE flights_2008_10k (
flight_year SMALLINT,
flight_month SMALLINT,
flight_dayofmonth SMALLINT,
flight_dayofweek SMALLINT,
deptime SMALLINT,
origin_city TEXT ENCODING DICT(32),
dest_city TEXT ENCODING DICT(32),
airtime SMALLINT,
distance SMALLINT,
 ...
dest_merc_x FLOAT,
dest_merc_y FLOAT)
WITH (FRAGMENT_SIZE = 2000000)

I then ran a SQL query on the table to find flights where the distance between the origin and destination cities was less than 175 miles:

mapdql> SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime" FROM flights_2008_10k WHERE distance < 175 GROUP BY origin_city, dest_city;
Origin|Destination|Average Airtime
West Palm Beach|Tampa|33.81818181818182
Norfolk|Baltimore|36.07142857142857
Ft. Myers|Orlando|28.66666666666667
Indianapolis|Chicago|39.53846153846154
Tampa|West Palm Beach|33.25
Orlando|Ft. Myers|32.58333333333334
Austin|Houston|33.05555555555556
Chicago|Indianapolis|32.7
Baltimore|Norfolk|31.71428571428572
Houston|Austin|29.61111111111111
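
The query is easy to refine interactively in mapdql; for example, sorting by the computed average to keep only the five city pairs with the longest average airtime:

mapdql> SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime" FROM flights_2008_10k WHERE distance < 175 GROUP BY origin_city, dest_city ORDER BY AVG(airtime) DESC LIMIT 5;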

In order to access MapD Immerse through the web interface, I started an OpenShift Service with a NodePort attached to the MapD pod’s port 9092 (Immerse). OpenShift will transparently route incoming traffic on the NodePort to the service irrespective of which node you connect to, even if the pod is not running on that node. The service is defined in mapd_service.yml:

apiVersion: v1
kind: Service
metadata:
  name: mapd-service
spec:
  ports:
  - port: 80
    # open port on every node
    nodePort: 30092
    targetPort: 9092
    protocol: TCP
  selector:
    app: mapd
  type: NodePort

I created the MapD service and confirmed that it was available:

$ oc create -f mapd_service.yml 
service "mapd-service" created
$ oc get service
NAME               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
docker-registry    ClusterIP   172.30.165.174   <none>        5000/TCP                  14d
kubernetes         ClusterIP   172.30.0.1       <none>        443/TCP,53/UDP,53/TCP     14d
mapd-service       NodePort    172.30.89.245    <none>        80:30092/TCP              8s
registry-console   ClusterIP   172.30.163.247   <none>        9000/TCP                  14d
router             ClusterIP   172.30.171.246   <none>        80/TCP,443/TCP,1936/TCP   14d
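
Because the NodePort is open on every node, the service can be smoke-tested from the master with curl before opening a browser (10.0.0.4 is one of this cluster's private node IPs):

$ curl -I http://10.0.0.4:30092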

With a web browser connected to Immerse, using one of the nodes' external (public) IP addresses on port 30092 (http://54.164.113.161:30092), I created a dashboard from the newly ingested flights_2008_10k table by following these steps to build a scatterplot:

Clicked DASHBOARDS -> New Dashboard -> Add Chart -> SCATTER
Clicked SOURCES, chose flights_2008_10k table as the data source
Clicked MEASURES X Axis -> + Add Measure, chose depdelay
Clicked MEASURES Y Axis -> + Add Measure, chose arrdelay
The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay. Finally, by clicking Apply and Save, I created the dashboard with the name Flights Dashboard.


Before I switched over to the CPU-only version of the MapD Docker image, I deleted the MapD service and pod.

$ oc delete service mapd-service
service "mapd-service" deleted
$ oc delete pod mapd
pod "mapd" deleted

Deploying the CPU Version of MapD Community Edition

To deploy the CPU-only version of the MapD Community Edition, I used the pod definition YAML file deploy_mapd_cpu.yml, which installs MapD Community Edition as a Docker container on the CPU-only OpenShift node. Notice that the main differences from the GPU version are the Docker image and the nodetype. The CPU version of the MapD pod uses the same EFS shared storage, so it comes up pointing to the same database that I worked on with the GPU pod.

apiVersion: v1
kind: Pod
metadata:
  name: mapd
  labels:
    app: mapd
spec:
  nodeSelector:
    # Node with CPU
    nodetype: cpu
  containers:
  - name: mapd
    # MapD Docker CPU image
    image: mapd/mapd-ce-cpu
    volumeMounts:
    - mountPath: "/mapd-storage"
      name: mystor
    ports:
    - name: mapd-port0
      containerPort: 9090
    - name: mapd-port1
      containerPort: 9091
    - name: mapd-port2
      containerPort: 9092
  volumes:
  - name: mystor
    nfs:
      server: fs-2bc69063.efs.us-east-1.amazonaws.com
      path: /

The service YAML file is the same one I used with the GPU pod. I then launched the MapD pod and service:

$ oc create -f deploy_mapd_cpu.yml
$ oc create -f mapd_service.yml
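
A quick check that the scheduler honored the nodeSelector and placed the pod on the CPU-only node (the NODE column should show ip-10-0-0-4):

$ oc get pod mapd -o wide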

After confirming that the pod was running, I opened a browser and connected to Immerse using one of the nodes' external (public) IP addresses on port 30092. Because the CPU pod accesses the same database that the GPU pod created, I was able to open the same "Flights Dashboard" that I built with the GPU pod. Note that in the CPU-only configuration, MapD disables backend rendering, which limits access to compute-intensive charts.

Immerse with backend rendering disabled

Conclusion

OpenShift is the leading open source container management platform, with advanced features and a large community of developers and users. MapD is emerging as an indispensable open source analytics tool for data scientists, providing unprecedented interactivity with massive datasets. MapD on OpenShift makes deployment, management, and scaling easy across different infrastructures, whether physical, virtual, public, private, or hybrid cloud.

Veda Shankar

Veda Shankar is a Developer Advocate at HEAVY.AI working actively to assist the user community to take advantage of HEAVY.AI's open source analytics platform. He is a customer-oriented IT specialist with a unique combination of experience in product development, marketing, and sales engineering. Prior to HEAVY.AI, Veda worked on various open source software defined data center products at Red Hat.