Introduction
In this post, we will take some time to look at the architectural differences between Machbase and MongoDB and compare their performance in processing time-series sensor data. When developing a sensor data processing system, there is a strong tendency to choose a DBMS that the developer is familiar with. However, I would also like to describe the fact that there are unknown difficulties in data processing if developed using familiar tools without considering the characteristics of time series sensor data.
Machbase
Machbase is a DBMS developed for fast input, search, and statistics of time-series sensor data. It supports edge devices such as Raspberry Pi, as well as general single-server and multi-server clusters. It has features and architecture specialized for processing time-series sensor data and machine log data. Details regarding the architecture of Machbase continue below.
MongoDB
MongoDB is a document-based database engine that can store and retrieve schema-less data in the form of JSON, so it is very conveniently used in the web-related field. It is not very suitable for time-series sensor data applications, but since semiconductor production process data is created in the form of XML in production facilities, it seems to be used in many cases due to the convenience of use in this case.
Comparison of features and architectures of time-series sensor data
Time-series sensor data is a type of data in which the amount of unit data is small and continuously increases over time without being updated. It is a form in which only the value of the data per sensor is continuously changed in chronological order. As the number of installed sensors increases and the data collection cycle tends to shorten, the amount of data generated per unit time increases enormously.
Due to the mass generation of time series sensor data, it is challenging to develop systems for data storage and processing in situations where data needs to be kept for a long time. Retrieval of such data is often performed using time and sensor identifiers as keys, and input performance is critical for real-time processing.
In the next section, we will take a closer look at the comparison of different systems in this situation.
Schema vs schema-less
Machbase requires a schema definition, which is defined in SQL DDL by a relational model, while MongoDB does not require a schema.
It is a very common technique to define user-related configuration data for a website in the form of JSON and process it in a web application. MongoDB is very convenient for entering this data into MongoDB without modification, and for quickly retrieving this data by user ID. MongoDB stores data in the form of binary JSON. Take a look at the example below.
< from Wikiped >
Unstructured data is very simplified compared to textual JSON data, but for each attribute, it expresses the type, name, and length of the attribute. Because there is no schema, columns that were present in previous data may not be present in this data, and the length of the data is always in flux. To find the attribute you’re looking for, you’d have to read the entire BSON document, and it might not exist, or the attribute might be of a different type.
Schema data, on the other hand, always has constant column names and the same column type. You know right away where the column data you want is without having to parse the entire record.
Real-time sensor data is often treated as structured data. This data has pre-set column values and can be represented and processed in a fixed schema, which makes it efficient to use a structured data model.
Of course, MongoDB is also possible.
Here’s a quick comparison between structured and unstructured data:
To convert unstructured data into structured data, you need to design a data model using an entity-relationship (ER) model, convert the raw data into appropriate tables, and enter it. However, this conversion process can make it difficult to accurately represent the data in certain cases.
On the other hand, unstructured databases like MongoDB are convenient because the data can be used as it is without any transformation. However, when searching unstructured data or performing statistics, real-time parsing of the data is required to obtain the column values of the desired data. This can cause performance degradation, and techniques such as distributed processing may need to be applied to address these performance issues. This can lead to additional server and cost issues.
When dealing with structured data, schema models provide a significant performance boost for processing large amounts of data because they can read directly from the data without parsing. When the data structure is structured, such as real-time time series data, the structured data model is much more efficient than MongoDB’s unstructured data model.
SQL vs No-sql
Machbase supports SQL as a data query language, and MongoDB has its own query methods. This section examines the differences between the two.
Typically, schema data is queried using a structured query language (SQL). No-SQL is also widely used these days, and each product handles it differently, so I’ll only describe MongoDB’s query language rather than provide a comprehensive description.
As mentioned earlier, MongoDB is optimized for applications like web services. In a web service, data is represented as JSON, and a set of this data is mapped to a collection object (a table in the relational data model). The main operation is to find the specific data you want by key value — a very simple query and most web services can cover pretty much what you want by entering data and finding the data by key value.
You can extend the search condition clause further by listing the desired conditions in the Find method. This looks very natural in the object model.
What about time series data? For time series data, the most common query is to get data with a specific sensor ID over a specific time period. Of course, MongoDB can do this, but when you compare the two queries, you start to see differences.
Machbase Query (SQL)
It’s very declarative and easy to understand: Get a day’s worth of data with a specific sensor identifier.
MongoDB Query
Has the query language become a bit verbose and complex? Is it still relatively easy to use?
Statistics
Due to the large volume of time-series sensor data, it is often necessary to compute statistics on an hourly basis and visualize those values. Machbase automatically generates hourly statistics for this purpose, a feature referred to as “automatic Rollup execution,” which I will explain later. Using this statistical data, it is easy to obtain the desired hourly statistics, and it is also possible to retrieve statistical data using SQL.
Using the ROLLUP hint, we’ve shown a precomputed statistic without actually performing the statistic.
MongoDB can execute a similar query. Now query becomes very complex.
Ease of use is a matter of personal opinion, but when expressing statistics or complex conditional statements, MongoDB tends to make queries very verbose.
In time-series sensor data processing, queries involving hourly statistics are heavily used. Therefore, as queries become more complex, it can be expected that creating queries in MongoDB may become significantly more challenging than anticipated.
Sensor data index vs B+Tree
A B+Tree index, which is also commonly used in RDBMSs, looks like this.
Each key is entered into a leaf node, where the key values are sequentially accessible as a linked list.
The algorithm is designed in such a way that when data is inserted and there is no more space in the leaf nodes, upper-level nodes are added. This design ensures that the tree does not become skewed to one side regardless of the distribution of input key values. For this reason, this structure is reasonably good for single-condition search performance on disk-based RDBMSs.
However, can it also solve well for two-dimensional problems, such as time-series sensor data, where time and sensor identifiers must be considered simultaneously?
Meanwhile, the sensor data index in Machbase has a special structure, sorted by time and sensor identifier.
For time series sensor data, most queries are based on time range and sensor ID, so we have implemented an index structure that is optimized for these queries as much as possible. Here’s a comparison of the features of these two indexes
B+Tree indexes can lead to poor input performance when entering large amounts of data.
B+Tree indexing involves frequent Structure Modification Operations (SMOs) when data is inserted rapidly. To maintain the consistency between data insertion and read operations (index search), locking occurs during SMOs, which can affect either a portion or the entirety of the tree. Of course, during this time, ongoing input operations from other sessions may be temporarily halted.
B+Tree indexes are optimized for search in environments with tens to hundreds of millions of inputs and infrequent, partial updates. It performs well in web-related applications where MongoDB is often used, but it cannot perform well enough in time-series sensor data environments where large amounts of data are input at high rates.
When searching using a B+Tree index, the index is stored in the form of <key value, record identifier>. In some cases, you may want to record data by key value, which is known as a clustered index. MongoDB does not provide clustered indexes. Once you have traversed the key values to get the record identifiers, you read the desired records from the data file by their record identifiers. At this point, a random read operation to disk occurs.
In the example below, if we indexed by time, there would be a lot of sensor data between the two-time ranges. Assuming the desired data is red squares, the red squares within the two-time ranges are not clustered, so we sequentially traverse the leaf nodes of the B+Tree index, obtain each RID, and perform a random read each time. As the number of sensors increases and the data collection interval becomes shorter, the number of index records that need to be visited by the index sequential search increases, and more random read operations occur. This is why even if you create a B+Tree index, you won’t get the data retrieval speed you want.
On the other hand, Machbase’s sensor data index creates index files by partitioning them by time. Each index file stores data sorted by <sensor ID, time, sensor value>, and the index partition keeps information about the minimum and maximum time values for that partition in memory. Therefore, index partitions that do not contain the time values you want to search for are excluded from the search.
In addition to improved search performance due to index partitioning, partitioned indexes can achieve the same performance even with very large data inputs. This is because, unlike B+Tree, newly entered data is only reflected in the last recorded index partition and does not change the index partitions created in the past. For this reason, Machbase is implemented to work well in parallel, even if input, indexing, and searching are performed in parallel.
Returning to the topic of searching, once the key value is located within the selected index partition, filtering based on the target time conditions can be done quickly. The data with the same key values are sorted in chronological order, allowing for efficient filtering. Since all the desired results, such as specific sensor data values by time, are recorded within the same index, the search can be performed with short-range sequential reads.
Comparing the two indexing architectures, we can see that the time series sensor data index solves all the problems that B+Tree has when applied to time series data processing.
Compare input and simple query performance
It’s easy to predict that Machbase will perform better with time series sensor data due to the formal schema structure and sensor data-specific indexes described earlier.
Tests were run in the following environments
I imported the previously prepared CSV file (107GB) using dedicated tools for each product, such as CSV loader and Mongo import.
Indexing on tag name and timestamp
4Core Xeon CPU, 32GB RAM, SSD
You can see that MongoDB was slow to enter data, consuming more resources, while the database entered data at more than eight times the speed.
Next, the schema structure and index differences produced the following results for simple search performance.
Performance of simple search queries
A query which retrieves all 1-day data for a given day.
The following results were obtained from the performance of the above two queries.
The result is a performance difference of over 200x between the two DBs. You can see that the time series sensor data index in the machbase has made a difference.
Generating automatic rollups of sensor data vs. real-time statistics
Because sensor data stores so much data, visualizing long-term statistics (overall trends over 10 years of data) requires a large amount of data reading and statistical computation.
In visualization applications, fast statistical extraction is essential because if the data takes too long to read, it cannot be displayed interactively (zooming in and out, moving through periods, etc.). Even if you can quickly retrieve the data you want to visualize, if you have hundreds of millions of data points, displaying them over time will again require massive computation.
If you can set up statistics on a time basis in advance, you can run statistical queries based on time values very quickly. Machbase provides the ability to automatically generate statistics on time series sensor data, store them, and quickly retrieve them through queries.
The results of running the same statistical query on Machbase and MongoDB are as follows. It should be noted that MongoDB does not support SQL, which makes the query language very complicated.
The following query retrieves the maximum value per minute for a particular sensor over a day. Machbase uses the following tips to help you retrieve the statistical data you have entered.
MongoDB runs by passing the following JSON to the query language.
The performance results are as follows.
You can see the performance difference of almost 400x.
Conclusion
We looked at how Machbase and MongoDB differ from an architectural point of view, and the performance impact We found that while MongoDB is very convenient for environments like web applications, and we were able to get good performance out of it, it has an architecture that provides indexes that are not ideal for time series sensor data applications.
Smart factories and smart cities, which have become a hot topic in recent years, not only generate large amounts of sensor data but also a lot of concerns and attempts to process it. I have written this article in the hope that Machbase can be a good solution at the heart of the problem.
I’ll leave it at that, with the promise of a comparison with a larger database of time-series sensor data when I get the chance.
Thank you.
Grey Shim, CRO of Machbase