To Jupyter and Beyond: Interactive Data Science at Scale with OmniSci
At OmniSci, we have always seen it as part of the original vision of the company to unify the worlds of Data Science and what is considered ‘traditional’ analytics. Our pioneering innovation was to leverage modern hardware and software to allow humans to ask questions of data at the speed of curiosity. It is clear though that we now live in a world where AI is rapidly extending human perception and intuition. So, we naturally see AI as part of the continuum of ways for people to understand the world through data, that spans the gamut from traditional Analytics/Dataviz to Machine Learning to Deep Learning.
It is useful to think of a natural ‘loop’ in how this understanding develops: from data exploration with visual analytics tools (since there is no match for human visual perception in quickly understanding trends in data), to experimentation, where a data scientist dives deeper to build models using AI/ML methods, and finally to explanation, where visual analytics tools and ML methods combine to surface key insights in a seamless workflow.
The open data science stack of today is clearly a de-facto platform for experimentation, founded on interlocking open source innovation across multiple ecosystems: the PyData stack (NumPy/SciPy, Pandas and Matplotlib among several others, and recently, Dask and Numba), the R language and its thriving ecosystem, and also the Julia language. Alongside these great projects, the Jupyter project has been a significant revolution driving the idea of interactive computing in general, and data-driven storytelling in particular. These tools, by lowering the cost of curiosity, form the foundation of the age of data that we live in today. Here’s the first ever photograph of a black hole, coming to life inside a Jupyter notebook, powered by these tools.
Back in 2017, when OmniSci was known as MapD, we recognized the need to be part of this world and contribute to its growth. We were founding members of the GoAI project, which took the first real steps in bringing data science to the world of GPUs. GoAI’s primary outcome, the GPU DataFrame, has evolved into cuDF in the NVIDIA-owned RAPIDS toolkit, and is now a notable open source effort in its own right. In the meantime, the GPU has become the de-facto technology infrastructure for Deep Learning and fueled the AI revolution by providing users a combination of accessibility and power, something we strive to do at OmniSci as well.
With this goal in mind, we began a collaboration last year with Quansight, founded by Travis Oliphant (a Converge keynote speaker). Travis is like Methuselah in the data science community, having created NumPy (on which Wes McKinney built Pandas) and SciPy, and subsequently founded Anaconda Inc., among several other defining contributions to the PyData ecosystem. Travis and the Quansight team have been amazing collaborators, and instrumental in helping us both shape and deliver on this vision.
We started off with some basic foundational goals: First, how could we bring the proven performance at scale of OmniSci as transparently as possible into a data scientist’s workflow, by leveraging the familiarity of known APIs? Next, since we’ve always believed (and proved before anyone else) that interactive visual analytics could be transformative when coupled with massive scale, how could we harness the power of OmniSciDB with data visualization tools already in the PyData stack? Last but not least, how could we do all this while working within the existing open source ecosystem and contribute back to it along the way?
With OmniSci 4.8, we announced the availability of the OmniSci data science foundation, with significant contributions led by Quansight. This effort has multiple key components to address the above questions. Let’s look at each in turn.
OmniSci + JupyterLab: Tell your Big(gest) Data-Driven Stories
We believe OmniSci Immerse is the best platform for visual data exploration at scale today, in how it provides a fluid, interactive user experience over tens or even hundreds of billions of data points through close integration with the OmniSciDB engine. At the same time, we recognized that our users (data scientists in particular) typically used visual analytics tools such as Immerse as a starting point for deeper experimentation, for which they prefer interactive computing environments like Jupyter.
So we first focused on really deep workflow integration with the Jupyter ecosystem. As a first step, with OmniSci 4.8, a data scientist working with OmniSci can now access OmniSciDB from JupyterLab, the latest version of the notebook IDE, in multiple ways.
The first is via a single click of the familiar Jupyter icon in Immerse to open up a fully configured instance of JupyterLab, connected to OmniSciDB underneath. Here’s what this looks like.
By leveraging our enterprise-grade security model, we ensured that only specific users can access this capability. The above icon is only visible when a specific role (omnisci_jupyter) is assigned to a user, allowing for administrators to provision it in a controlled way.
Going deeper, you can also get to JupyterLab from the SQL Editor in Immerse via a single click from the results pane after a successful query execution. The great thing about this integration is that it also carries the query through into JupyterLab, and wraps it as an Ibis expression for further exploration within the PyData workflow. Here’s an example of this feature, running against an airplane telemetry dataset.
We’re working on launching JupyterLab from within any Immerse chart - after all, every OmniSci chart is backed by a query against OmniSciDB. Soon, data scientists will be able to use Immerse to quickly explore a large dataset visually, and then use the data backing a specific chart as the basis of a deeper exploration in JupyterLab, all within a few seconds. We engineered this whole handoff to use OmniSciDB’s enterprise-ready authentication and authorization by integrating with JupyterHub, the multi-user version of the Jupyter infrastructure, via a custom OmniSci Authenticator for JupyterHub.
Besides this integration, we also provide a connection command within our set of Python utilities that works entirely from the Jupyter side, so you don’t need Immerse as the starting point.
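To give a feel for what a notebook-side connection involves, here is a minimal sketch built on pymapd.connect. It is not the utility command itself. The OMNISCI_* environment variable names and both helper functions are hypothetical, and the fallback values are the commonly documented defaults for a fresh OmniSciDB install, so treat them as assumptions to adjust for your own server.

```python
import os


def omnisci_connection_params(env=None):
    """Collect keyword arguments for pymapd.connect() from environment variables.

    The OMNISCI_* variable names are hypothetical; the fallbacks are assumed
    defaults for a fresh OmniSciDB install.
    """
    env = os.environ if env is None else env
    return {
        "user": env.get("OMNISCI_USER", "admin"),
        "password": env.get("OMNISCI_PASSWORD", "HyperInteractive"),
        "host": env.get("OMNISCI_HOST", "localhost"),
        "port": int(env.get("OMNISCI_PORT", "6274")),
        "dbname": env.get("OMNISCI_DB", "omnisci"),
    }


def connect_to_omnisci(env=None):
    """Open a pymapd connection using the collected parameters.

    pymapd is imported lazily, so this module loads even where the
    package is not installed or no server is reachable.
    """
    import pymapd

    return pymapd.connect(**omnisci_connection_params(env))


# Inspect the parameters without touching the network (password omitted).
params = omnisci_connection_params({})
print({key: value for key, value in params.items() if key != "password"})
```

In a notebook, calling connect_to_omnisci() would then return a connection object that the rest of the PyData workflow can share.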
We also created a lab extension, jupyterlab_omnisci, which provides the scaffolding for a seamless OmniSci experience in JupyterLab. This extension lets you do everything from managing connection parameters and state to firing up a SQL Editor within JupyterLab for analyzing data with raw queries. There is a whole set of additional tools in here as well, from SQL magics to a mirror of our omnisql CLI inside JupyterLab.
Ibis: Like Pandas, but with 'Omni' Speed and Scale
While the JupyterLab tools and integration are a start, a large part of why we built them was to allow for deeper data exploration in a Pythonic way. Pandas is the de-facto answer within the PyData ecosystem, and we decided it would be counterproductive to replicate the whole API surface of Pandas (besides, RAPIDS is already doing this). Instead, we feel that a great deal of Pandas’ value lies in its powerful, expressive way to evaluate analytic expressions on a dataframe in (either CPU or GPU) memory. This is how OmniSciDB query execution works already, so the problem became one of finding (or building) an API that was familiar and Pythonic, but could be extended to leverage OmniSciDB underneath.
As it turned out, Wes McKinney, the creator of Pandas (and another keynote speaker at Converge!) had already started down this road with Ibis, whose stated goal is to take the productivity that Pandas provides, and, in Wes’ own words, adapt it to ‘scalable computational idioms like SQL’. Ibis, as we are discovering, is an absolute delight to use, especially when paired with OmniSci’s speed and scale. The deferred expression model makes it really easy to perform complex analytics on any backend that is accessible via SQL (and in a nice twist, Pandas itself).
Here’s Ibis at work on a telematics dataset with 1.45 billion rows where we are setting up an aggregate expression and evaluating it, producing a Pandas dataframe as a result. Notice how the expression is lazily evaluated - Ibis compiles it into a SQL query and then executes it only when needed - also notice that the response is close to instantaneous even running against an OmniSci server in our data center over a VPN.
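The deferred model is easier to appreciate with a toy version in hand. The sketch below is not the Ibis API: the Aggregate class, the telemetry table and its columns are all hypothetical, and SQLite stands in for OmniSciDB so that the example runs anywhere. What it shares with Ibis is the shape of the idea, where building an expression costs nothing, compiling it only produces SQL text, and the backend is contacted only on execute.

```python
import sqlite3


class Aggregate:
    """A toy deferred expression (hypothetical, NOT the Ibis API).

    It records what to compute and touches no database until execute().
    Table and column names are assumed to be trusted identifiers.
    """

    def __init__(self, table, group_by, measure, func="AVG"):
        self.table = table
        self.group_by = group_by
        self.measure = measure
        self.func = func

    def compile(self):
        # Build the SQL text only; no database access happens here.
        return (
            f"SELECT {self.group_by}, {self.func}({self.measure}) AS value "
            f"FROM {self.table} GROUP BY {self.group_by} ORDER BY {self.group_by}"
        )

    def execute(self, con):
        # The query reaches the backend only at this point.
        return con.execute(self.compile()).fetchall()


# SQLite stands in for the real backend so the sketch runs offline.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE telemetry (vehicle TEXT, speed REAL)")
con.executemany(
    "INSERT INTO telemetry VALUES (?, ?)",
    [("a", 10.0), ("a", 20.0), ("b", 30.0)],
)

expr = Aggregate("telemetry", group_by="vehicle", measure="speed")
print(expr.compile())     # inspect the generated SQL first
print(expr.execute(con))  # then run it on the backend
```

Because the expression is only a description until it is executed, the same object could in principle be compiled for a different SQL backend, which is the property Ibis builds on.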
Thanks to our pymapd API, we can already direct Ibis output to Arrow-based data frames either in CPU memory (via Pandas) or GPU memory (via cuDF). Work is underway to make this a zero overhead interface with our internal result set format.
Our Quansight collaborators helped us build a fully-featured Ibis OmniSci backend with an API surface covering almost all of our SQL features, including ongoing work to support our recently announced Window Functions. With their help, we are also taking the lead in making Ibis become geospatial-aware (not just for OmniSci but other Ibis backends like Postgres/PostGIS), and targeting GeoPandas as the default output container in such cases.
We’re also working on deep Ibis support for User Defined Functions (UDFs) in OmniSci, building on our proven, high performance JIT/LLVM query compilation infrastructure and adding a Numba-based API on top, that is usable directly from Ibis. In other words, you will soon be able to author UDFs in Python, and have them be compiled to optimized, low-level code and executed transparently in OmniSciDB, leveraging either the CPU or GPU.
A pretty cool aside: Ibis already supports several backends (including Pandas), meaning you can do analysis across data sources inside a single JupyterLab notebook with one API. Simply create a connection over any of those backends, and you can run Ibis expressions against them. Not only that, since the returned results for remote backends default to Pandas dataframes, you can wrap an Ibis connection around that Pandas dataframe - the possibilities are endless!
Altair + Ibis + OmniSci - Open, Scalable Data Viz for Everyone
But wait, there’s even more! As I pointed out earlier, OmniSciDB’s signature feature has always been performance, with its ultimate proving ground in Immerse, our data visualization solution that combines interactivity and scale in a way not seen before. While Ibis support by itself is already powerful, we took it to the next level, quite literally, by making Ibis work with Altair, a Pythonic wrapper around the popular Vega and Vega-Lite ecosystems (thanks a ton, Jake VanderPlas!).
These projects are really modern, declarative approaches to data visualization, and we have been using Vega already within Immerse since 2016. We worked with the community building Vega and Altair to integrate Ibis’ deferred expression model deeply into Altair using Vega transforms. From a user perspective, this allows anyone using Altair to use Ibis expressions where Altair expects a Pandas DataFrame. The end result is that Altair’s already excellent charting capabilities, powered by Vega-Lite, can now be used to build charts on large datasets, with the interactivity that OmniSci alone can provide.
Since a (moving) picture is worth several thousand words, here’s how quick it is to produce a dual-measure line chart with Ibis and Altair on the same telematics dataset from above. Notice how powerful and composable Altair itself is, in developing an interactive chart in 4 lines of Python code, and also how seamlessly Ibis expressions can now be used as data sources for Altair.
We then took it a couple of steps further. Immerse is best known for its interactivity, particularly in how charts can crossfilter one another. We were wondering if we could give OmniSci users, particularly data scientists, something approaching the fluidity of Immerse, within the OmniSci+Altair+Ibis workflow.
First, Altair’s built-in selector support provides a natural way to parametrize charts. Here’s an example of how you can build the same chart as above, but drive the chart with a selection filter - the major difference is that each time you pick a value from a list, Altair generates a new query, and the chart is refreshed. The whole experience is near-instantaneous, even while running against 1.45 billion rows.
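The selection-driven refresh can be pictured as a function that runs once per pick. This is a rough sketch of the pattern only, not the Altair or Ibis integration: the flights table, its columns and the chart_data helper are hypothetical, and SQLite stands in for OmniSciDB.

```python
import sqlite3

# SQLite stands in for the real backend; table and columns are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (carrier TEXT, month INTEGER, delay REAL)")
con.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("AA", 1, 5.0), ("AA", 2, 7.0), ("UA", 1, 3.0), ("UA", 2, 9.0)],
)


def chart_data(con, carrier):
    """Return the rows a line chart would plot for the selected carrier.

    Called once per selection change, so every pick issues a fresh query
    with the selected value bound as a parameter.
    """
    sql = (
        "SELECT month, AVG(delay) FROM flights "
        "WHERE carrier = ? GROUP BY month ORDER BY month"
    )
    return con.execute(sql, (carrier,)).fetchall()


# Simulate picking two values from the selection list, one after the other.
for selected in ("AA", "UA"):
    print(selected, chart_data(con, selected))
```

The chart never holds the full dataset; it only ever receives the small aggregated result for the current selection, which is why a fast backend keeps the interaction fluid.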
Next, here’s a preview of an upcoming feature - Altair+Ibis+OmniSci inside JupyterLab, doing crossfiltered charting, powered by Vega transforms and signals infrastructure. Note how the chart expression here is only moderately more complex in order to set up the crossfilter between 2 distinct Altair charts. Under the hood, our collaborators at Quansight added a new set of transforms to Altair, allowing the selection/interactivity hooks to drive crossfilter via Ibis.
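Crossfiltering has one rule worth spelling out: each chart is filtered by the selections made on every other chart, but never by its own. The sketch below illustrates that rule with plain SQL generation. It is an assumption-laden toy, not the Vega transform machinery described above; the trips table, the DIMENSIONS tuple and the crossfiltered_counts helper are hypothetical, and SQLite stands in for OmniSciDB.

```python
import sqlite3

# SQLite stands in for the real backend; table and columns are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (region TEXT, vehicle_type TEXT)")
con.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("north", "truck"), ("north", "van"), ("south", "truck"), ("south", "truck")],
)

DIMENSIONS = ("region", "vehicle_type")  # one chart per dimension


def crossfiltered_counts(con, dimension, selections):
    """Rows for the chart on `dimension`, filtered by the other charts' selections."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    # A chart is never filtered by its own selection, only by its siblings'.
    active = [
        (name, value)
        for name, value in selections.items()
        if name in DIMENSIONS and name != dimension and value is not None
    ]
    sql = f"SELECT {dimension}, COUNT(*) FROM trips"
    if active:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name, _ in active)
    sql += f" GROUP BY {dimension} ORDER BY {dimension}"
    return con.execute(sql, [value for _, value in active]).fetchall()


# The user has selected "south" on the region chart and nothing on the other.
selections = {"region": "south", "vehicle_type": None}
print(crossfiltered_counts(con, "vehicle_type", selections))  # filtered by region
print(crossfiltered_counts(con, "region", selections))        # unfiltered
```

Leaving a chart's own selection out of its query is what keeps the unselected bars visible, so the user can move the selection without first clearing it.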
Since Ibis itself works with a number of other backends, and Altair is now ‘Ibis-capable’, this new infrastructure can be used to drive crossfiltered charts across data sources - think about how you can now use charts built on an OmniSci datasource to crossfilter against charts built on BigQuery!
A huge shout out to Dominik Moritz in particular for his enthusiastic support of OmniSci within the Vega ecosystem! Dominik’s awesome Falcon project points to what’s possible when you combine the power of Vega with OmniSciDB - and an area of continued focus for us in the near future.
There and Back Again
Finally, we’ve worked on pymapd too, keeping up with the breathless pace of change on underlying projects (particularly Arrow and RAPIDS). A data scientist can now use all these tools to produce a dataframe and, as always, load it back into OmniSci via the load_table APIs in pymapd (and Ibis). The icing on the cake is OmniSci Immerse’s VDF (Visual Data Fusion) feature, announced in OmniSci 4.7, which lets a user set up charts (combo and multi-layer geo charts for now) across multiple tables/sources.
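As a minimal sketch of that round trip, the helper below pushes rows back through a pymapd-style connection using load_table(table_name, data). The write_back helper, the scored_trips table name and the RecordingConnection stand-in are hypothetical; the stand-in exists only so the example runs without a server. The sketch passes rows as tuples and assumes the target table already exists.

```python
def write_back(con, table_name, rows):
    """Load rows into an existing table through a pymapd-style connection.

    `con` is expected to expose load_table(table_name, data), as pymapd
    connections do. `rows` is an iterable of tuples; the target table is
    assumed to exist already.
    """
    rows = list(rows)
    if not rows:
        return 0  # nothing to load, so skip the call entirely
    con.load_table(table_name, rows)
    return len(rows)


class RecordingConnection:
    """Hypothetical stand-in for a live connection, so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def load_table(self, table_name, data, **kwargs):
        # Record the call instead of sending anything to a server.
        self.calls.append((table_name, data))


con = RecordingConnection()
loaded = write_back(con, "scored_trips", [(1, 0.25), (2, 0.75)])
print(loaded, con.calls)
```

With a real connection in place of the stand-in, the loaded table becomes available to Immerse like any other table.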
Putting it All Together
A key point bears repeating - we have developed every one of the above capabilities within the respective open source project communities rather than simply forking them to develop our own. Our work on Ibis and Altair is available to everyone involved with those communities, in addition to OmniSciDB itself being open source, as it has been for the last 2+ years. Further, we invested deeply in the packaging/installation aspects so that users can simply and frictionlessly add OmniSci into their data science workflows, in multiple ways.
First, we took care to package everything with Docker - our JupyterLab image includes all these tools and others such as Facebook’s Prophet, Uber’s kepler.gl and the RAPIDS toolkit from NVIDIA, as well as the really nice workflow tool Prefect. As a matter of fact, you can download and try the whole setup including OmniSciDB on your Mac or Linux laptop running Docker, by following these instructions. Note that you won’t have access to GPU capabilities unless you’re on Linux and have an NVIDIA GPU-equipped machine.
Next, we’re working on far more seamless packaging within Python - we already allow native conda installs of both the OmniSciDB engine on Mac/Linux, and also the above PyData tools. You can simply set up a conda environment, install and run the OmniSciDB Open Source edition, and all the above tools with ‘conda install’ commands (we don’t ship OmniSci Immerse or Render with the OmniSci Open Source edition, but you can always try the fully featured Enterprise Edition for a 30-day trial period, if you want these tools).
Looking Back, Looking Forward
First, credit where it is due. None of these new capabilities would have seen the light of day without the work and guidance of the entire Quansight team. The OmniSci product, cloud, solutions and engineering teams, especially on the Immerse side, have been fabulous as well. Thanks to everyone involved for helping get this done.
While this was all happening, Randy Zwitch, OmniSci’s Senior Director of Developer Relations, added early support for OmniSci in Julia—yet another testament to how committed we are to working within the Open Source ecosystem for Data Science.
Looking ahead, we’ve really only scratched the surface of where we’re headed as a platform in truly scaling data science, building on our 5+ years of transforming visual analytics. We’re going to be driving further and deeper into how to integrate AI and Machine Learning in a seamless way into OmniSci, particularly focused on the ‘explanation’ part of the above loop. Ultimately, we believe cutting-edge methods in data science, or the plumbing for that matter, are in the service of the user, not the other way around - as far as a consumer of insights is concerned, we think it’s better to have the entire assembly of tools become invisible, but ensure that the insights and their explanations become obvious.
We look forward to sharing a lot more at Converge, our inaugural user conference in October - hope to see you there!
We’re just getting started. Over the next few weeks and months, we’ll be publishing a series of even deeper dives into the whole OmniSci data science experience, including getting started guides and topical notebooks with datasets you can download and try out. Stay tuned, and please feel free to reach out to us on social media and let us know your thoughts and feedback, which are and will always be invaluable to us.