April 3, 2024

The Need for a Multimodal Cloud Native Database for Geospatial Data

With the need for and usage of geospatial data growing at a rapid pace, a way to efficiently and effectively store this data is badly needed.

Contributed by Norman Barker, VP of Geospatial, TileDB

Geospatial data is widely used in many applications, including earth observation, location-based services, defense, biomedical research and more. However, it’s vital to understand that geospatial data is not just one thing. The data is diverse and comes in various forms, including point clouds (e.g., lidar and sonar), polygons (e.g., buildings and areas of interest) and rasters.

The costs and complexities of managing all of this data can be quite high. For example, consider collaborative initiatives designed to bring together data from many different partners around the world. These initiatives demand a highly efficient data storage mechanism, because different parties may be modifying copies of the data and sharing them with others. The resulting duplication can make it difficult to keep track of the original source and creates data governance headaches. In addition, foundation data may exist in open and proprietary formats that aren’t well supported, and existing APIs often don’t work well with cloud object stores.

Other technical challenges include:

Managing clusters at scale so that GIS and geospatial data can be processed in parallel. It can be tricky to allocate resources efficiently in a cluster, partition datasets effectively and determine the right level of resources for each job. If resources fall short, jobs are likely to run too slowly or even crash. If you over-allocate, you waste resources.
Operating PostGIS. PostGIS stores and retrieves spatial data efficiently through spatial indexing, which organizes data by its geometric properties to speed up query processing. However, installing and configuring it properly, as well as maintaining it over time, is a big challenge, and many organizations do not have the resources to do so.
Integrating disparate geospatial pipelines, and making sure the data is reliable, complete and consistent. Real-world geospatial data is noisy, often contains errors, and rarely arrives in an analysis-ready format. In fact, it’s estimated that 80-90 percent of data scientists’ time is spent cleaning their data and preparing it for merging. You cannot expect your analysis to be accurate unless your data is correct; this is a fundamental aspect of the machine learning cycle.
Supporting diverse data modalities. Today’s data teams often purchase or deploy different databases or open source formats and in-house solutions. It can take months to cobble these systems together in an effort to derive meaningful insights. Support for diverse data modalities is crucial for overlays, or the process of superimposing layers of geographic data that cover the same area, to study the relationships between them. An example of this is reinsurance in farming, where real-time weather dashboards can be overlaid on top of drought indexes (satellite-based soil moisture indexes) in order to allow for more accurate risk monitoring.
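
The overlay idea described above can be illustrated with a minimal Python sketch. It assumes two small raster layers that already share the same grid and extent; the layer names, values and thresholds are hypothetical and chosen only for illustration, not taken from any real drought or weather product.

```python
def overlay(layer_a, layer_b, combine):
    """Combine two rasters covering the same area, cell by cell."""
    same_shape = len(layer_a) == len(layer_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(layer_a, layer_b)
    )
    if not same_shape:
        raise ValueError("layers must cover the same grid")
    return [
        [combine(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]


# Hypothetical 2x2 layers over the same farmland area.
soil_moisture = [[0.10, 0.35], [0.22, 0.05]]  # satellite-derived index, 0..1
rainfall_mm = [[2.0, 14.0], [0.5, 1.0]]       # recent rainfall per cell


def at_risk(moisture, rain):
    # Assumed thresholds, for illustration only.
    return moisture < 0.15 and rain < 5.0


risk = overlay(soil_moisture, rainfall_mm, at_risk)
```

Here `risk` marks the cells where both layers indicate dry conditions. In practice the layers usually arrive at different resolutions and projections, so they must be resampled onto a common grid before an overlay like this is meaningful.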

Geospatial Data Management Reimagined

Clearly, there needs to be an easier way to manage all of this data - a single, unified solution that manages the geospatial data objects along with the raw original data (e.g., images, text files, etc.), the machine learning embedding models, and all other data modalities in an application (tables, rasters, point clouds, etc.). Moreover, this unified database need not be only a geospatial database - it can be one place to store, manage and analyze geospatial, tabular and ML data, as well as files.

However, when it comes to geospatial data specifically, providing an efficient mechanism for storing and querying geometries at very large scale (hundreds of billions of points and polygons) is particularly important. “Geometries” are the geospatial data that represent geographic objects in the real world, and supporting them gives users simpler workflows for processing all of their geospatial data.
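
To make the idea of efficient geometry querying concrete, here is a minimal Python sketch of a uniform grid index over points. It is a deliberately simplified stand-in: the `GridIndex` class and its methods are hypothetical names, and this is not how any particular database implements spatial indexing. The point is only that bucketing data by location lets a query inspect a few cells instead of every point.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Toy spatial index: buckets points into square cells of a fixed size."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Map a coordinate to the integer (column, row) of its cell.
        return (floor(x / self.cell_size), floor(y / self.cell_size))

    def insert(self, point_id, x, y):
        self.cells[self._cell(x, y)].append((point_id, x, y))

    def query_bbox(self, min_x, min_y, max_x, max_y):
        """Return sorted ids of points inside the box (bounds inclusive)."""
        cx0, cy0 = self._cell(min_x, min_y)
        cx1, cy1 = self._cell(max_x, max_y)
        hits = []
        # Only visit cells that overlap the query box.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for point_id, x, y in self.cells.get((cx, cy), ()):
                    if min_x <= x <= max_x and min_y <= y <= max_y:
                        hits.append(point_id)
        return sorted(hits)


index = GridIndex(cell_size=10.0)
index.insert("a", 1.0, 1.0)
index.insert("b", 15.0, 15.0)
index.insert("c", 25.0, 5.0)
found = index.query_bbox(0.0, 0.0, 20.0, 20.0)
```

A fixed grid works well when points are evenly spread. Skewed data, and geometries with extent such as polygons, are typically better served by tree-based indexes.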

In addition to supporting all data modalities, a unified solution also includes code and compute, so processing happens right next to the data, reducing egress and downloads. This is especially critical for geometries, which often number in the billions and require fast query response times. Advances in serverless computing make it possible to ingest huge geospatial datasets in parallel, in just a few minutes and at a reduced cost. Coupling serverless compute and code with the actual data increases performance while reducing cost and time to insight.
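
The parallel ingestion pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's implementation: a local thread pool stands in for serverless workers, the function names are hypothetical, and each "ingest" merely summarizes a tile rather than writing it to storage.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_tile(points, tile_size):
    """Group (x, y, z) points by the square tile they fall into."""
    tiles = defaultdict(list)
    for x, y, z in points:
        tiles[(int(x // tile_size), int(y // tile_size))].append((x, y, z))
    return dict(tiles)


def ingest_tile(item):
    """Stand-in for one worker: summarize a single tile of points."""
    tile, pts = item
    heights = [z for _, _, z in pts]
    return tile, {"count": len(pts), "min_z": min(heights), "max_z": max(heights)}


def ingest_parallel(points, tile_size, max_workers=4):
    """Partition the dataset, then process the tiles concurrently."""
    tiles = partition_by_tile(points, tile_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(ingest_tile, tiles.items()))


# Hypothetical lidar-like points: (x, y, elevation).
sample_points = [(1, 1, 10), (2, 3, 12), (101, 5, 7)]
summary = ingest_parallel(sample_points, tile_size=100)
```

Because each tile is independent, the same partition-then-map structure carries over to workers running next to the data, which is where the savings in egress and wall-clock time come from.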

A multimodal cloud-native database for geospatial data, such as geometries, offers many benefits. These include performance; serverless options that reduce time and cost; and a cloud-native design that uses object stores to scale vector data on cloud storage. Geospatial analysis is no longer a separate domain in today's ever-expanding digital landscape for organizations in all verticals. Analysts, enterprise architects and data scientists can now combine geometry support with other data modalities to pose completely new questions and pursue new use cases, gaining a "full picture" that was never possible before. For this world to become a reality, a multimodal geospatial data management foundation will be critical.

Norman Barker is the VP of Geospatial at TileDB. Prior to joining TileDB, Norman focused on spatial indexing and image processing, and held engineering positions at Cloudant, IBM and Mapbox.