Vega makes visualizing BIG data easy
We’re delighted to announce the availability of Vega, the JSON specification for creating custom visualizations of large datasets. Using Vega, you can create server-rendered visualizations in both the community and enterprise versions of MapD.
MapD Vega is based on the open-source Vega specification developed by Jeffrey Heer and his group at the University of Washington. We’ve adapted the original specification to the MapD platform so you can use the power of SQL to investigate your data and quickly render it as a custom visualization. MapD uses Vega to drive the rendering engine directly on the result set of a SQL query without ever requiring the data to leave the GPU, enabling users to visualize granular data at scale in a way not possible with purely frontend visualization tools. Here’s a Vega-rendered image of tweets in Europe, color-coded by language, which you can find in our Vega documentation example:
As a developer or data analyst you want to be able to easily understand your data with the help of visualization, but without being burdened by the complexities of the geometric visualization details. Vega lets you work at a higher level than is possible with most visualization tools and readily supports custom algorithms and advanced visualization techniques. We’ll give you a simple example that demonstrates how easy it is to use Vega, which you can use as a springboard for exploring your own data.
The Vega specification is structured as JSON, so it is easy to create, understand, and operate on programmatically. You can even create common chart types. A Vega specification consists of:
- a data source selection, which can be SQL statements or in-line data
- options for representing the data on your chart or plot as a:
  - geometric shape symbol
- options for scaling rendered data and quantifying data attributes:
  - quantitative - linear, log, power, square root, and quantize scales
  - discrete - ordinal and threshold scales
The MapD Connector API makes it easy to send the Vega JSON to the backend, which renders the visualization and returns a base64-encoded PNG image to the client. You can make a render request in a few steps, using either the API or Apache Thrift directly:
- Create the Vega specification.
- Use Thrift or the renderVega() API function to make a render request. The API depends on node-connector.js or browser-connector.js.
- Asynchronously, receive the rendered image and display it in your application.
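The steps above can be sketched roughly as follows. This is a minimal sketch rather than the connector's definitive usage: it assumes a connection object whose renderVega() method takes a widget id, the Vega specification as a JSON string, an options object, and a callback, and that the result carries the base64-encoded PNG in an image field. The helper name renderToDataUrl and the stub connection are hypothetical; check the connector documentation for the exact signature.

```javascript
// Request a server-side render and hand back a data URL for an <img> element.
// ASSUMPTION: con.renderVega(widgetId, vegaJson, options, callback) and the
// result.image field mirror the connector API; confirm against its documentation.
function renderToDataUrl(con, vegaSpec, done) {
  const widgetId = 1; // arbitrary id for this render request
  con.renderVega(widgetId, JSON.stringify(vegaSpec), {}, (error, result) => {
    if (error) {
      done(error);
      return;
    }
    // The backend returns the PNG as a base64 string.
    done(null, "data:image/png;base64," + result.image);
  });
}

// Hypothetical stand-in for a real connection so the sketch runs offline.
// A real connection would invoke the callback asynchronously.
const stubConnection = {
  lastSpec: null,
  renderVega(widgetId, vegaJson, options, callback) {
    this.lastSpec = JSON.parse(vegaJson);
    callback(null, { image: "iVBORw0KGgo=" }); // placeholder bytes, not a full image
  },
};

renderToDataUrl(stubConnection, { width: 400, height: 600 }, (error, url) => {
  if (error) throw error;
  console.log(url);
});
```

In a browser, the returned data URL could be assigned to the src of an image element to display the rendered chart.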
We encourage you to explore your data using Vega. Our Vega documentation includes tutorials, a Vega reference, and example source code for using Vega in your client browser. Discover how easy it is to find what your data is telling you.
The Fastest Animal on the Planet Meets the Fastest Big Data Exploration Platform on the Planet
Like the MapD platform, the peregrine falcon is built for speed: it has been recorded in a pursuit dive, called a stoop, at 242 miles per hour, making it the fastest animal on the planet. Using Vega to pursue a better understanding of your data is faster and more intuitive than using most other visualization tools on the planet. Let’s take a closer look at what’s going on with the peregrine.
In our simple investigation, we look at bird migration along the Pacific flyway. Specifically, we focus on the San Francisco Bay Area, where birds are counted as they are funneled between the Pacific Ocean and San Francisco Bay on their way south. In addition to reports by amateur bird enthusiasts, birds are counted by the Golden Gate Raptor Observatory on Hawk Hill in Marin County, and by the Point Blue organization on Pt. Reyes and the Farallon Islands.
If we use the MapD Immerse SQL Editor with which you’re already familiar, we get a list chart visualization that shows the latitude and longitude coordinates of peregrine falcon sightings in a particular month in 2015:
From a list alone, it can be difficult to gain much insight into the peregrine population. If we want to see how the number of peregrine falcons differs each month and possibly make some assessment about the cause of any change, a more powerful visualization can help. Each added degree of visibility into the data typically suggests other data fields that we might want to factor into the visualization, and adding more data and fields to our analysis is easy with Vega. In this example, we might want to know the migratory characteristics of birds the peregrine falcon preys upon or the weather in certain months. Does the peregrine really wait for fair winds before attempting to cross the Golden Gate as the docents on Hawk Hill will tell you?
Let’s look at a Vega visualization that uses longitude and latitude as x and y coordinates on a chart that shows our peregrine data, including the number of sightings in a month, superimposed on a map of the Bay Area:
Each plotted point is defined by a Vega marks specification, sized by the number of sightings and color-coded by the month of the sighting.
We can see that the largest peregrine population occurs in October, as might be expected, and was observed at Hawk Hill, just north of the Golden Gate Bridge, and in the Farallon Islands. The large June peregrine count in the northwest area of Pt. Reyes is unexpected and suggests the need for more data or a different visualization to better understand peregrine migratory influences. We can also see that sightings are concentrated along coastal areas, presumably because sea birds are a favorite prey of the peregrine, and in Golden Gate Park, an open space with a large bird population.
Let’s see how we created this visualization from the millions of recorded bird sightings in our ebird database.
A Vega specification includes:
- a data property that specifies and filters data source(s).
- a marks property that defines the basic visualization graphic of a data item.
- a scales property that defines geometry or applies additional attributes to the data item visualization.
- viewing area dimensions.
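As a rough outline, the parts listed above fit together as top-level keys of a single JSON object. The sketch below assumes the key names width, height, data, scales, and marks; the pixel dimensions are placeholders, not the values used for the peregrine chart.

```javascript
// Skeleton of a Vega specification as a plain JavaScript object.
// ASSUMPTION: the top-level key names follow the Vega convention;
// all values here are placeholders.
const vegaSkeleton = {
  width: 400,  // viewing area width in pixels (placeholder)
  height: 600, // viewing area height in pixels (placeholder)
  data: [],    // named data sources, for example SQL queries
  scales: [],  // mappings from data domains to visual ranges
  marks: [],   // graphic primitives drawn for each data item
};

// The specification travels to the backend as a JSON string.
const vegaSkeletonJson = JSON.stringify(vegaSkeleton, null, 2);
console.log(vegaSkeletonJson);
```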
Vega supports SQL statements, which we use to extract latitude, longitude, month, and number of sightings from the ebird data set. Further, we limit the data to the peregrine falcon species in two Bay Area counties, in 2015.
By assigning a name, ptable, to our data set, other specification properties can reference the data source.
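A data property along these lines might look like the sketch below. The table name ebird_obs and the column names are hypothetical, the sql key is assumed, and the two counties are assumed to be San Francisco and Marin, the counties named later in this post.

```javascript
// Data property: one named SQL source that other properties can reference.
// ASSUMPTION: table and column names are illustrative, not the real schema.
const peregrineData = [
  {
    name: "ptable",
    sql: [
      "SELECT lon, lat, obs_month, obs_count",
      "FROM ebird_obs",
      "WHERE species = 'Peregrine Falcon'",
      "AND county IN ('San Francisco', 'Marin')",
      "AND obs_year = 2015",
    ].join(" "),
  },
];

console.log(peregrineData[0].sql);
```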
Using ptable data, the marks property defines longitude and latitude as the x and y coordinates of the plot. Each rendered data point is sized according to a count scale, which corresponds to the number of peregrines observed, and colored using a color scale, which is associated with the month of the observation. These scales also need to be specified.
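A marks property of this kind could be sketched as follows. The mark type points, the property names, and the field names lon, lat, obs_count, and obs_month are assumptions for illustration. The small helper, which is also hypothetical, lists the scale names a marks property refers to, which is a handy check that each one has a definition.

```javascript
// Marks property: one point per row of ptable.
// ASSUMPTION: mark type, property names, and field names are illustrative.
const peregrineMarks = [
  {
    type: "points",
    from: { data: "ptable" },
    properties: {
      x: { scale: "x", field: "lon" },
      y: { scale: "y", field: "lat" },
      size: { scale: "count", field: "obs_count" },
      fillColor: { scale: "color", field: "obs_month" },
    },
  },
];

// Collect the names of every scale the marks refer to, sorted alphabetically.
function referencedScales(marks) {
  const names = new Set();
  for (const mark of marks) {
    for (const prop of Object.values(mark.properties)) {
      if (prop.scale) names.add(prop.scale);
    }
  }
  return [...names].sort();
}

console.log(referencedScales(peregrineMarks));
```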
The scales property maps the input data domain to an output visualization range. Vega supports both quantitative and discrete scaling to match the inherent data continuity.
Here, longitude and latitude define the x- and y-coordinate axes and essentially map the boundaries of the two counties to the visualization area.
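To make the mapping concrete, here is a sketch of two linear position scales together with the arithmetic a linear scale performs. The bounding-box numbers are rough illustrative values, not the domain used for the chart, and the range keywords width and height are assumed. The linearMap helper is hypothetical and only illustrates the idea.

```javascript
// ASSUMPTION: approximate bounding box for illustration only.
const bayAreaBounds = { lonMin: -123.1, lonMax: -122.3, latMin: 37.7, latMax: 38.3 };

// Position scales: longitude drives x, latitude drives y.
const positionScales = [
  { name: "x", type: "linear", domain: [bayAreaBounds.lonMin, bayAreaBounds.lonMax], range: "width" },
  { name: "y", type: "linear", domain: [bayAreaBounds.latMin, bayAreaBounds.latMax], range: "height" },
];

// A linear scale interpolates a value from the input domain to the output range.
function linearMap(value, domain, range) {
  const [d0, d1] = domain;
  const [r0, r1] = range;
  return r0 + ((value - d0) / (d1 - d0)) * (r1 - r0);
}

// A longitude halfway across a one-degree domain lands halfway across a 400-pixel width.
console.log(linearMap(-122.5, [-123, -122], [0, 400]));
```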
The count scale defines the rendered point size, which we have quantized to a range of 10 values. This covers the range of observation counts: when we queried the data using the SQL Editor, the maximum number of peregrines per observation was 10.
The color scale assigns a discrete color to each month. If the month is not specified in the recorded data, we assign the color of the peregrine falcon’s middle coverts as a default value. A null data entry is represented in light gray, although the SQL statement already removes any null data entries.
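A quantize scale with 10 output sizes might be sketched as below. The domain of 1 to 10 follows the maximum count noted above, while the pixel sizes are placeholders. The quantize helper is hypothetical; it follows the usual definition of a quantize scale (equal-width bins over the domain), and the backend's exact rounding is not confirmed here.

```javascript
// ASSUMPTION: ten placeholder point sizes, 2 through 20 pixels.
const countScale = {
  name: "count",
  type: "quantize",
  domain: [1, 10],
  range: Array.from({ length: 10 }, (_, i) => 2 * (i + 1)),
};

// Split the domain into range.length equal bins and return the bin's output value.
function quantize(value, scale) {
  const [d0, d1] = scale.domain;
  const n = scale.range.length;
  const bin = Math.floor(((value - d0) * n) / (d1 - d0));
  const clamped = Math.min(n - 1, Math.max(0, bin)); // keep out-of-domain values in range
  return scale.range[clamped];
}

for (const observed of [1, 5, 10]) {
  console.log(observed, quantize(observed, countScale));
}
```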
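An ordinal color scale with a default and a null color could be sketched like this. The hex colors, the integer month encoding (1 through 12), and the key names default and nullValue are assumptions for illustration. The ordinalLookup helper is hypothetical and simply shows the lookup order described above.

```javascript
// ASSUMPTION: months are stored as integers 1-12; all colors are placeholders.
const monthColorScale = {
  name: "color",
  type: "ordinal",
  domain: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  range: [
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", "#98df8a",
    "#d62728", "#ff9896", "#9467bd", "#c5b0d5", "#8c564b", "#c49c94",
  ],
  default: "#5b6b7a",   // fallback for a month outside the domain (placeholder color)
  nullValue: "#cacaca", // light gray for null entries
};

// Null entries get nullValue, unknown values get default, known values get their color.
function ordinalLookup(value, scale) {
  if (value === null || value === undefined) return scale.nullValue;
  const index = scale.domain.indexOf(value);
  return index === -1 ? scale.default : scale.range[index];
}

console.log(ordinalLookup(10, monthColorScale));
```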
Visualization Area Properties
Finally, we used the following viewing area dimensions, which conveniently matched the decimal GPS coordinates of our data:
Here is the complete JSON structure used in this example:
Vega makes it easy to quickly discover data relationships and dependencies. Each visualization gives you more insight into your data and drives you to want to extract even more meaning from it. By spending a few more minutes with our sample data and making a small change to the data property SQL statement, we can get a very different view of the peregrine. Let’s see the bigger picture by looking at total peregrine observations each month over a 10-year interval:
And, instead of latitude and longitude, we use month and year to scale our data points. Using this specification, first with county = ‘San Francisco’ then with county = ‘Marin’, we render two charts, which we’ve combined here and annotated:
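The change to the data property described above might look something like the following sketch. The table and column names are hypothetical, as is the helper monthlyCountsData; the year bounds are passed in as parameters, and the values in the usage line are only an assumed 10-year window.

```javascript
// Build a data property that totals observations per month and year for one county.
// ASSUMPTION: table and column names are illustrative, not the real schema.
function monthlyCountsData(county, firstYear, lastYear) {
  // Only accept known county names so user input never reaches the SQL string unchecked.
  const allowedCounties = ["San Francisco", "Marin"];
  if (!allowedCounties.includes(county)) {
    throw new RangeError("Unsupported county: " + county);
  }
  if (!Number.isInteger(firstYear) || !Number.isInteger(lastYear)) {
    throw new TypeError("Years must be integers");
  }
  return [
    {
      name: "ptable",
      sql: [
        "SELECT obs_year, obs_month, SUM(obs_count) AS total",
        "FROM ebird_obs",
        "WHERE species = 'Peregrine Falcon'",
        "AND county = '" + county + "'",
        "AND obs_year BETWEEN " + firstYear + " AND " + lastYear,
        "GROUP BY obs_year, obs_month",
      ].join(" "),
    },
  ];
}

// Usage with an assumed 10-year window.
console.log(monthlyCountsData("Marin", 2006, 2015)[0].sql);
```

Rendering once per county then only requires swapping the first argument; the x and y scales would take month and year as their domains instead of longitude and latitude.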
Over a 10-year interval, we can see that the peregrine population has increased, with sightings concentrated in the fall months, confirming the map-based visualization. Over the last two years, however, the peregrine seems to be lingering year-round in Marin County. This could, in part, explain the unexpected number of sightings we saw in June.
Maybe we need yet another visualization … or more data.