Why would a normal person build its own spatiotemporal database?

"Storing and querying datasets that contain objects in a geometric space have always required special treatment. The choice of data structures and query algorithms can easily make the different between a query that runs in seconds or in days.“ (Werner Vogels, Amazon CTO, https://www.allthingsdistributed.com/2013/08/spatial-databases.html)

Mireo has been in the vehicle tracking business since 2001, and during that time we have learned many valuable lessons. One of the most important is – these systems usually don't scale. What does this mean? As the number of records exceeds a few billion mark, processing of data requires exponentially more time. This is true for all systems that process data produced by moving objects, but which are not designed as natively horizontally scalable.

In the late summer of 2014, we got an interesting and challenging business proposal: to design and develop a system capable of live tracking 2 million trucks and keeping their log history for a period of 5 years. Can you imagine how much spatiotemporal data we would need to store and process? In order to correctly perform map-matching process (i.e. detect the road where the vehicle is currently located, based on its imprecise GPS position), a vehicle needs to send its GPS positions relatively frequently. For the sake of simplicity, let's approximate this with „1 position every 3 seconds“ while a vehicle is driving. Now, let's do the math:

2.000.000 vehicles
x 1.200 GPS positions per vehicle per hour
x 5 years
= positions

Yes, that's more than 105 trillion. Needless to say, our old vehicle tracking system that relies on the combination of relational and NoSQL databases would fall apart under this amount of data. Additionally, the new system needs to answer queries such as „fetch the total number of truck visits to the specified factory“.

There are several hard problems to be solved in order to fulfill this business proposal, such as load distribution, map-matching, networking (tracking 2 million vehicles means 2 million permanently open TCP connections). However, the hardest is definitely choosing the right storage. Where shall we store the data? To a file system? To a relational database? To a NoSQL database? To a general purpose object storage like Hadoop? Each of the mentioned options have problems of its own: traditional relational SQL databases are not designed as distributed and horizontally scalable, NoSQL key-value databases have records with the same index dispersed around the disk (performance killer!), with clustered file systems (Ceph, GlusterFS, HDFS..) it is not possible to use maximal disk bandwidth with the level of abstraction which those file systems bring etc.

Storing of data needs to be efficient, but the most important is to have fast searching. For example, the execution time of the query „fetch all positions and timestamps when specified 4 trucks met each other“ must not exceed few seconds in the worst case scenario. Additionally, we wanted to have the possibility to make OLAP-like queries, grouping, joining.. in other words, some special kind of relational-like database would be the best choice. But which? We had to do some serious brainstorming.
  • we need a relational-like database with an excellent spatiotemporal index and appropriate query language
  • we need to store n-dimensional data sets – how would one sort such data?
  • distribution of data from „real world“ is highly non-uniform – we will get a lot of data from the crowded area with many vehicles, and small amount of data from rural area

Mantra in the software engineering world is „don't reinvent the wheel“. Naturally, we started to analyze current state-of-the-art solutions for processing massive spatiotemporal data. Besides the other obvious properties of such system, we’ve quickly figured out that the most critical part of the system is the way how system indexes multidimensional data. It turned out that there were three major multidimensional index types used in major spatial systems: R-tree family of trees (PostgreSQL/PostGIS, Oracle Spatial, MemSQL, ...), kd-tree family (Oracle Spatial Quad tree index) and Geohash (GeoMesa and various variants of products found on Google Cloud). Guess what – none of these solutions is suitable for indexing huge amount of real-world spatiotemporal data! Let's see why.

Due to the aforementioned non-uniform data distribution, R-tree family of indexes is practically useless on a large scale because index traversal over high-density areas effectively becomes linear. Similar, but much worse effect happens with fixed-grid space/time partition indexes like Geohash. The index traversal on high density areas with Geohash will again become very linear with no possibility for more granular partitioning. To make things worse, Geohash-based indexes can in theory yield good record filtering only if range queries are finitely bounded by all dimensions. For example, in the context of vehicle tracking system, it is natural to store and index triplets having id of the vehicle, timestamp and location. Then, trying to execute simple query of the form “fetch full history for one single vehicle for the period of one year“ using Geohash would effectively linearly scan the complete database because the query is not bounded by location!

kd-tree family of indexes are the most efficient type of indexing n-dimensional interval data, but with one huge drawback: they are built on the top of static data sets. It is possible to efficiently index even a large interval data set using kd-tree type of index, but only if the underlying recordset does not change afterwards. There have been attempts in scientific literature to relax this requirement, but to the best of our knowledge there is no commercially available system which uses some of those approaches. In our scenario, new data keeps arriving all the time, so static data sets are not an option.

To wrap up, the system capable of storing and analyzing large amount of spatiotemporal real-time data should contain at least the following:

1. it must be designed from the ground up as clustered, distributed system with data processing modules deployed as close as possible to the physical data storage (in other words, traditional RDMBS are useless for this purpose)

2. it must be highly available, fault tolerant and natively horizontally scalable

3. the index on spatiotemporal records is by far the most important ingredient of an efficient spatiotemporal database processing system; as we saw, none of the current state-of-the-art spatiotemporal indexes is suitable for dealing with large amount of records coming from moving objects

4. partitioning scheme must exploit spatiotemporal properties of the underlying data - records that are close (based on vehicle id, position and timestamp) need to be in the same or adjacent shards

We had to come up with something new, something absurdly fast and massively scalable. After 5 years of hard work by the group of extremely talented people, we are proud to present Mireo SpaceTime Cluster. For the details about SpaceTime, please read our next blog post :-)