Simon Eves

MapD 4.0 Goes Big with Geospatial!

One of the headline new features of MapD 4.0 is our geospatial implementation. Earlier versions of MapD had some basic geospatial data import and rendering features, but geospatial is now a first class feature, along with the new back-end choropleth rendering mode in Immerse, which offers a huge increase in performance and scalability over the front-end equivalent.

Here’s an example of the new choropleth driven from a join query of Los Angeles assessor data with parcel border geo shapes:

choropleth - parcel border geo shapes


SELECT la_parcels_polys.rowid, Assessor_Parcels_Data_2017.AIN as key0, 
avg(EffectiveYearBuilt) as color FROM Assessor_Parcels_Data_2017, 
la_parcels_polys WHERE (Assessor_Parcels_Data_2017.AIN = 
la_parcels_polys.ain) GROUP BY la_parcels_polys.rowid, key0;

The raw geo query mechanism is not yet exposed in Immerse, but available from all the other connectors.

What geo data do we support?

We currently support 2D geo in the form of the new column types: POINT (a single point), LINESTRING (multiple points forming an arbitrary line), POLYGON (multiple points forming an outer polygon ring, with optional other points forming hole rings), and MULTIPOLYGON (like POLYGON but allowing for multiple outer polygon rings).

Here’s a basic geo table creation

mapdql> CREATE TABLE geo_table (
  name TEXT, 
  pt POINT,
  line LINESTRING,
  poly POLYGON
);

The geo data is stored in the table in an efficient binary form, and some attributes of multi-point data (such as the bounding box) are pre-computed and cached to speed up subsequent query and rendering operations.

How do I get geo data into MapD?

Geo data can be inserted as literals, imported from either CSV/TSV delimited files with WKT (Well Known Text) strings, or from standard geo file formats such as ESRI Shapefile, GeoJSON, and KML. When importing from CSV/TSV, a geo POINT column can be populated from two consecutive lon/lat or lat/lon scalar columns.

Client-local or server-local files can be imported using command-line tools or using the Immerse web front-end, or files can be imported directly from a remote location in an Amazon S3 bucket or on a regular HTTP/HTTPS website.

Importing from geo files is now faster (taking advantage of multi-threading) and more robust (more tolerant of errant geometry, particularly polygons, which no longer need to be pre-triangulated on import).

Here’s how to insert geo literals into the table. Let’s add a name, a single point, a four-point line, and a square polygon with a triangular hole. Note that polygon rings can be provided as open or closed (the OGC standard is closed) and winding order is retained (outer rings should be specified in counter-clockwise order, and inner rings - holes - in clockwise order, again per the OGC standard).

mapdql> INSERT INTO geo_table VALUES (
  ‘Billy’,
  ‘POINT(0.0 1.0)’,
  ‘LINESTRING(2.0 2.0, 3.0 3.0, 4.0 4.0, 5.0 5.0)’,
  ‘POLYGON((0.0 0.0, 3.0 0.0, 3.0 3.0, 0.0 3.0), (1.0 1.0, 1.5 2.0, 2.0 1.0))’
};

Or maybe you have a CSV like this (some geometry contents omitted for clarity):

“name”, “pt”, “line”, “poly”, “mpoly”
“Fred”, “POINT(1.0 1.0)”, “LINESTRING(...)”, “POLYGON(...)”
“Jim”, “POINT(2.0 2.0)”, “LINESTRING(...)”, “POLYGON(...)”
“Sheila”, “POINT(3.0 3.0)”, “LINESTRING(...)”, “POLYGON(...)”

Import it like any other CSV file. In this case, we append it to the existing table:

mapdql> COPY geo_table FROM ‘geo.csv’ WITH (quoted=’true’);

Or you have some geo data as an ESRI Shapefile, GeoJSON, or KML. For these, you must use the geo=’true’ option on COPY to indicate that it’s a special geo file. In this case, just as with the old (now deprecated) \copygeo command, the table must not already exist and will be created and populated automatically.

mapdql> COPY geo_table_2 FROM ‘shapefile.shp’ WITH (geo=’true’);

We also now offer support for loading geo files directly from individual compressed files, or from inside an archive file, again either local or remote. This is especially useful for importing large public published geo data without having to download and unpack it first, particularly when using a cloud instance of MapD without direct file system access.

For example, say there’s some geo data in a neatly-packaged zip archive file on a government website. You could download it manually and unpack it, in which case it would look something like this, and then you could import the .shp file (and the required secondary files) directly:

$ unzip -l cb_2014_us_state_20m.zip
Archive:  cb_2014_us_state_20m.zip
   Length      Date    Time    Name
---------  ---------- -----   ----
    19204  2015-04-21 09:58   cb_2014_us_state_20m.shp.ea.iso.xml
    30219  2015-05-12 09:42   cb_2014_us_state_20m.shp.iso.xml
    18599  2015-05-12 09:42   cb_2014_us_state_20m.shp.xml
   446116  2015-04-03 13:02   cb_2014_us_state_20m.shp
      516  2015-04-03 13:02   cb_2014_us_state_20m.shx
     8434  2015-04-03 13:02   cb_2014_us_state_20m.dbf
      165  2015-04-03 13:02   cb_2014_us_state_20m.prj
        9  2015-04-03 13:02   cb_2014_us_state_20m.cpg
---------                     -------
   523262                     8 files

But there’s no need to download it and unpack it. You can import it directly from the website, and the primary file (and required secondary files) inside the archive will be found automatically:

mapdql> COPY cb_2014_us_state_20m FROM 
‘http://www2.census.gov/geo/tiger/GENZ2014/shp/cb_2014_us_state_20m.zip’
WITH (geo=’true’);

How about a local gzip’d GeoJSON file?:

mapdql> COPY geo_table_4 FROM ‘data.geojson.gz’ WITH (geo=’true’);

Or some KML files in Amazon S3 buckets. For private buckets, you can provide the access credentials on the command line or pull them from the standard AWS_* environment variables in the MapDQL shell environment. Note that for public buckets, you can use either the s3:// or https:// URL syntax:

mapdql> COPY geo_table_5a FROM ‘https://s3-us-west-1.amazonaws.com/public_bucket/geo.kml’ WITH (geo=’true’);
mapdql> COPY geo_table_5b FROM ‘s3://public_bucket/geo.kml’ WITH (geo=’true’, s3_region=’us-west-1’);
mapdql> COPY geo_table_5c FROM ‘s3://private_bucket/geo.kml’ WITH (geo=’true’, s3_region=’us-west-1’, s3_access_key=’*****’, s3_secret_key=’***********’);

What control do I have over how the geo data is stored?

Various options exist for compressing geo data to reduce the memory footprint and increase query bandwidth, and for converting incoming data to standard coordinate systems. The default behavior for storing WGS84 EPSG:4326 lon/lat geo data is to use 4326-specific 50% compression (giving a worst-case resolution of ~4" at the equator). Compression can be disabled to retain full 64-bit floating-point accuracy.

Here’s that original table creation, but this time explicitly specifying the geo as type GEOMETRY, with spatial reference 4326, and selective compression (no compression on the point, 50% compression on the others):

mapdql> CREATE TABLE geo_table_6 (
pt GEOMETRY(POINT, 4326) ENCODING NONE,
line GEOMETRY(LINESTRING, 4326) ENCODING COMPRESSED(32),
poly GEOMETRY(POLYGON, 4326) ENCODING COMPRESSED(32),
mpoly GEOMETRY(MULTIPOLYGON, 4326)
);

What can I do with the geo data once it’s in MapD?

The raw data can be extracted with a regular SQL projection which returns a WKT string, through any of the connector interfaces, and easily parsed by third-party layers.

mapdql> SELECT name, pt FROM geo_table WHERE name LIKE ‘J%’;
name|pt
Jim|POINT(2.0 2.0)

The Query Engine currently implements a subset of the well-known OGC “ST” functions as used by PostGIS, and more will be added in future releases. For now we support the cartesian GEOMETRY flavors of ST_Distance and ST_Contains for all combinations of geo types (e.g. distance from POINT to LINESTRING, or whether a POINT is inside a POLYGON or MULTIPOLYGON, whether from a column or expressed as a WKT string literal in the query).

Here’s an example of ST_Contains(POLYGON, POINT) showing two candidate points and a polygon with a hole. The upper point would return TRUE. The lower point is in a hole and would return FALSE.

ST_contains(point,polygon)

(image courtesy of PostGIS)

All these geometric operations run on the GPU at the blazing speed you have come to expect from MapD.

Here’s a simple geometric query using a POINT literal to count how many rows define points within a radius of 5.0 from the origin:

mapdql> SELECT COUNT(p) FROM geo_table WHERE ST_Distance(ST_GeomFromText('POINT(0.0 0.0)'), p) < 5.0;

Here’s a more complex JOIN query matching taxi pick-ups with New York City borough boundaries:

mapdql> SELECT a.pickup_boroname, b.boroname FROM sample_taxi_10k a, nyct2010 b WHERE ST_Contains(b.mapd_geo, a.pickup_point) LIMIT 10;

We also support functions for extracting the X or Y component of a 2D point, extracting individual points from a LINESTRING, accessing the bounding box of the geo for a row, and transforming geo into a different coordinate system.

The back-end rendering engine can interpret all the new geo types, providing various new rendering modes in Immerse, most notably the new back-end-rendered Choropleth mode, and the ability to render a path defined by a LINESTRING in a single row. Polygon rendering uses a more robust multi-pass technique which does not require pre-triangulation, and now correctly renders holes. For those using our Vega API directly, we have added support for coordinate system transformations.

What’s next?

These new features in MapD 4.0 are just the tip of the iceberg for GPU-accelerated geospatial analytics. Upcoming functionality includes:

  • Support for import and rendering of geo data from other source file types and in other coordinate systems
  • Support for OGC standard GEOGRAPHY types and their associated alternative “ST” functions and conversions
  • Support for geo data in multi-fragment tables on multi-GPU and distributed systems
  • Improvements to polygon rendering scalability in general, such as level-of-detail control

About the Author

Simon is a Senior Software Engineer in the Back-End Graphics and Rendering team at OmniSci in San Francisco. He joined the company in October 2017, after six years working on mobile GPUs and ray-tracing, and fifteen years as an R&D Engineer and Technical Artist in movie and TV visual effects in the Bay Area and London, including time at Industrial Light & Magic, Digital Domain, Double Negative, The Orphanage, and Cinesite. He can be seen in the background of a scene of “Star Wars Episode II: Attack of the Clones”, but even his own mother didn’t recognize him.