Using Python Libraries with OmniSci for Mac
Download OmniSci Free, a full-featured version available for use at no cost.
Source code and notebooks for this article can be found at https://github.com/mikehinchey/article-python-omnisci-mac/blob/master/Using%20Python%20Libraries%20with%20OmniSci%20for%20Mac.ipynb
OmniSci for Mac
OmniSci has created an open source database platform capable of running analytic queries over billions of rows in milliseconds by leveraging both CPU and GPU hardware. SQL queries are compiled with LLVM to hardware instructions for minimal overhead. Data is packed in columnar format to save space and locality, enabling data to be loaded from disk into memory quickly. The compiled queries are then executed with as much parallelism as the CPU or GPU supports, providing results in milliseconds.
OmniSci Immerse is a webapp for interactive data visualization. It provides a BI interface, to intuitively build charts and dashboards, but it was built from scratch to handle complex visualizations of billions of rows, including geospatial maps. Charts based on the same table will automatically cross-filter, allowing ad hoc, interactive exploration of even the largest datasets.
Currently in preview release, OmniSci has packaged both the database and Immerse UI into a Mac app for analyzing data on a laptop. To get started, go to OmniSci for Mac, enter your email and download the app. Run the downloaded file to install OmniSci. Then, run OmniSci - you'll need to keep the UI window open while working in Jupyter because the database processes are contained in the same app.
The code in this article will also work with OmniSci Enterprise or Open Source editions, though connection details will differ.
Overview of the Python Libraries
OmniSci for Mac Preview contains the database processes and the Immerse UI, which is a web client displayed in a packaged browser. The OmniSci DB stores data on disk and provides access through an SQL interface.
Jupyter is a web-based "notebook" interface for Python and other languages.
Pymapd is OmniSci's python client library that provides a standard DB access API for executing SQL and other functions. We won't use this directly in this article because the next library, Ibis, hides all of these details. Pymapd will soon be renamed to pyomnisci.
Ibis is a python library which provides an API similar to Pandas, but allows various backends to store data and execute queries remotely. Some of the supported backends are: Pandas, SQLite, PostgreSQL, and OmniSciDB. This let's python programmers write concise code rather than constructing SQL strings.
Altair is a Python API to declaratively specify charts for data visualization. It is built on top of Vega-Lite, a declarative visualization grammar. The API is simple and consistent, the charts are powerful and look great, and the abstracted implementation allows us to tie in Ibis and OmniSciDB as the data source.
OmniSci's Data Science Jupyter package
While Immerse provides a powerful UI for constructing and interacting with charts and dashboards, and also has features for loading data, these features can not provide the full range of what is possible by writing code. Data scientists need the power and automation capabilities provided by Python, Jupyter, and related libraries.
Quansight has packaged JupyterLab together with a few Python libraries for interacting with OmniSci DB. (This is an early stage project, so installation is not yet seamlessly integrated with the OmniSci for Mac app.)
If you already have conda or your own jupyter and wish to install the libraries manually, the instructions at omnisci-datascience-installer provide options and details.
To install the entire package, go to omnisci-data science-installer/releases to download the sh installer. If you have an old installation, archive or delete the ~/omnisci dir. Then, in a shell console:
Next, change to the directory where you'll access and save your notebooks.
conda activate $HOME/omnisci
This will run Jupyter Lab, and remain in that console. It should open your default web browser to show JupyterLab (otherwise, browse to http://localhost:8888/lab).
Connect to the Database
Now that we have the database and Jupyter running, we can connect the two.
We'll use Ibis as the interface to the database. With the Mac app, the username, password and other connection parameters are given.
Download Some Data - NYC Parking Tickets
To demonstrate the capabilities of OmniSci, Ibis and Altair, we'll use this dataset from Kaggle. It's not small with tens of millions (though that's not large), and has many numeric features for visualization.
You'll need an account on Kaggle to download the file. Then, unzip the file to access the csv data files.
Load Data into OmniSci DB
Like most data projects, the data must be cleaned and transformed. In this case, the only data preparation needed is to define the table schema. This is often performed using Pandas, perhaps on a subset of data. (For brevity, the full CREATE TABLE statement is included in the source code, but not printed in this article.)
There are multiple ways to load data into OmniSci. If the data needs to be cleaned and transformed before loading, we would likely load into a Pandas DataFrame, work on it, then load directly from Pandas into OmniSci. This file does not need transformation. OmniSci can load CSV files from disk and will do so making use of all CPU cores if the file is large.
Basic Exploration of the Table with Ibis
Ibis represents each table and each query as an "expression" object. That is, starting from the connection, get a reference to a table by name. Then, running the function count on the table creates another expression object. The expression can be compiled, which depends on the backend in use. In the case of OmniSci, it will compile to SQL. Instead of compiling, you would typically execute the expression, which does compile and then execute the query.
Each column can also be referenced as an expression, and supports functions which can be executed. This gets the set of unique values for a particular column.
Basic Visualization with Altair
To begin to understand the data, let’s choose one field of interest and find out the distribution of values with a histogram chart. This field has too many distinct values to show on a single chart at once, so the code below limits it to the top 20 most common values.
Interactive Timeline Charts
Multiple Altair charts can be combined and made to work together based on interactive input. Here, we'll create 2 timeline charts. The range chart on the bottom shows the timeline on the X axis and the record count on the Y axis. The larger timeline chart on top, uses the same timeline for X, but Y has multiple lines, grouped and colored by another column, county. When the user brushes on the bottom range chart, the top chart will redraw with the more narrow time range to show more detail.
The Altair library features many other chart types and display customization options including Area, Scatter, and Geo Maps.
We're able to use these three technologies together because they each handle a specific domain and are flexible enough to integrate and complement each other. OmniSciDB stores and queries data efficiently while scaling up to multi-GPU servers with hundreds of GB of CPU and GPU memory, or scaled down to a laptop. Ibis provides a concise and powerful data query abstraction API that works with Pandas or a number of databases. Altair provides a declarative API for building charts. When tied together, the chart can push down filters. groupings and projection into the data layer, removing such redundancy from the programmer.
Related Articles about Ibis and OmniSci
OmniSci has partnered with Quansight, experts in data and open source, to build the Ibis adapter for OmniSciDB and other python libraries to complete the stack.
Quansight's Tony Fast has published Ibis: an idiomatic flavor of SQL for Python programmers.
Quansight's Troy Bailey recently published an explanation for those new to Python in Ibis and OmniSci; A powerful combination for easy access to fast big data analytics with Python.
In this article, we'll demonstrate the ease of using Jupyter notebooks with open source Python libraries for visual charts to interact with data in OmniSci Database on a Mac.