Starting
One of the biggest roles expected of a time series database is to perform various forms of statistical computation, data extraction, and visualization on big data collected over a long period because it is tough to achieve the desired performance with the data processing techniques expected of other types of databases (otherwise, why would you need a time series database?). However, unless you have unlimited resources to perform parallel data operations on big data collected in time series, there is a limit to data processing performance depending on the data size.
Here's an example.
Let's say you've been collecting sensor data for a specific purpose for 12 months, with 100 billion data points and 700 GB of compressed storage. If you have 50 sensors (or tags), you're keeping at least 2 billion data points per sensor in series.
Let's say you want to perform any statistical operation on a random tag over 6 months, such as averaging. Such a requirement would require you to visit all the tags from January to December and perform a count divide operation on the sum. What if you changed the conditions and asked What about the maximum value? Statistical operations on the whole range or a large range, as opposed to operations looking for a specific value, are very expensive operations.
However, in the real world, we need to meet the following requirements.
l What has been the average temperature on this machine over the last 3 months?
l What was the peak pressure on this appliance in the last year?
l What was the minimum current in this instrument over the last 6 months and when was it?
l Plot and report a graph of the average temperature every second for the last 12 hours of data on this device... no, not every second, but every 15 seconds from tomorrow.
l In real-time, plot and update the average pressure in seconds for the last hour.
l Alert me if the 15-minute average temperature is below 33 degrees for any of the last 6 hours.
l Determine when in the last year of data the 1 hour average temperature is above 98 degrees and email me a CSV of 30 minutes of raw data for the 15 minutes around that time.
It's very difficult to respond quickly to the above requirements, even if all the data is entered into the database.
Existing methods
To obtain statistical data without having to visit the entire database, the industry creates 'statistical tables' for pre-defined periods, so that when the above requirements come in, the data can be quickly extracted from the pre-defined user 'statistical tables' to achieve the objective.
However, this approach has some limitations as follows.
1. You can't utilize the “Statistics table” if the data start time changes.
If the statistics table is based on 0:00 minutes and 0:00 seconds every hour, you can't change it to 2:30 minutes and 30 seconds.
2. You can't utilize a “statistical table” if the statistical time unit changes.
For example, if you created a “statistical data” with 1-minute increments, it is impossible to get data with 30-second increments.
3. “Statistical tables” are very difficult to manage.
Depending on your requirements, you can create a lot of statistical tables with different tags, supported statistical functions, and time units, and you have to manage them one by one.
4. It is difficult to delete “statistical tables” and add data continuously.
In other words, utilizing “statistical tables” is a technique that can be utilized in a very narrow range of customer requirements.
What is a rollup?
In this document, the term rollup refers to the automatic generation of the above 'statistics table' using the Tag table in Machbase as the raw data source. The process is called a rollup and the resulting table is called a rollup table. So what is the difference between the 'statistics table' and the 'rollup table' briefly mentioned above?
The following table shows the differences between the two, which are designed to overcome the shortcomings of the traditional 'statistical table' and make it easier and more convenient for users.
Comparisons | Rollup tables | Statistics tables |
Changing the statistics start point | Freely change the start point | Cannot be changed |
Specify rollup type | Can be specified as seconds, minutes, hours | Keep as fixed type |
Change statistical unit time | Can be changed freely | Cannot be changed |
How to create and manage | Create, add, and delete via SQL | Create manually |
How to create rollup data | Automatically generated internally | Requires an external application to perform and manage |
Supported statistics types | Min, Max, Number of Records, Grand Total, Average, Sum of Squares Extended options (first value, first time, last value, last time) | Depends on your requirements |
In a nutshell, this means that if I have a year's worth of data from thousands of different sensors in a tag table, and I have hundreds of billions of records, and I create a rollup for this table, I can instantly get any given statistic, i.e. min, max, number of records, sum, average, sum of squares, for any statistical unit of time (15 seconds, 3 seconds, 100 minutes, 12 hours, 3 weeks) for any sensor over any time range (up to years).
Limits of rollups
Nothing in this world is all things to all people. Rollups are no different, and it's much more helpful to consider the following basic limitations before using them.
When you need new statistics
If you need something other than the 10 statistics listed above, rollups are useless. Sure, we've collected what most industries need, but the world is a big place and the needs are endless.
In the presence of corrupted raw data
Due to the nature of time series databases, if you enter incorrect data, the values will naturally be reflected in the rollup table.
Therefore, try to avoid making unnecessary noise.
When resources are scarce due to excessive raw data entry
Of course, the rollup process involves reading data from the tag table, calculating it, and re-entering it into the rollup table.
As a result, the rollup table is slow to reflect data, ranging from a few milliseconds to tens of seconds.
In particular, if the amount of data input is very large and the system's resources are not freely available for rollups, the gap between the last time of the raw data and the last time of the rollup table can become larger and larger.
This gap can be seen with the ‘show rol lup gap’ command in SHELL in Markbase Neo, where the closer the value in the ‘GAP’ column is to zero, the closer to real-time the data is generated.
If this GAP is consistently increasing, you may want to free up resources to improve CPU, memory, and disk performance for rollups.
How to use rollup-based long-term statistical data
1. install neo and get neo-apps source code
This installation is the same as the previous blog. For the actual installation method, you only need to refer to step 1 (install neo) and step 2 (download neo-apps) from the links in the previous blog and come back here. (For step 2, just follow the steps below again)
When installing Machbase Neo, try to use version 8.0.19 or later if possible (week, month, and grammar improvements)
2. check the fast-rollup directory
Once you have neo-apps, you will find the fast-rollup directory. We will run the tests in the numbered order.
Once you have neo-apps, you will find the fast-rollup directory. We will run the tests in the numbered order.
3. Verify schema creation and data entry
In the fast-rollup demo, we will create the schema and enter the data in the order after opening 0-Schema and Data Population.wrk. The data input has been pre-prepared for input via TQL.
You can use rollups in two ways, as shown below.
1. Create your own via the CREATE ROLLUP syntax (link to manual)
This creates the desired rollups, optionally SEC, MIN, and HOUR.
It would be cumbersome to have to adjust the START/STOP for each ROLLUP.
Instead, you can save disk and resources by eliminating the unnecessary.
2. Create a default default ROLLUP on the CREATE tag table (link to manual)
It will automatically create and drive SEC, MIN, and HOUR internally.
This is the most convenient and easiest way.
Below is an example of how to create a rollup in the DEFAULT form (WITH ROLLUP).
You can create the schema by pressing button 1 as shown above.
In the second case, it takes about 2-3 minutes to enter a year's worth of data at a rate of one tag per second, so you should be patient because the nature of rollup is to check the results over a long period, so we chose one year.
4. Checking and closing rollup gaps
When all the data is entered, you can see the progress of the current rollup through the ‘show rollup gap’ command through SHELL. In the figure below, you can see the value of the GAP that was checked immediately after input, and you can see that the GAP occurred because you entered a large amount of data in an instant. Then, you can check the GAP for a rollup of HOUR after about 10 seconds by performing it at intervals of about 2-3 seconds. (This is because the rollup thread is currently sleeping because a rollup of HOUR size has an execution cycle of 1 hour.)
Note that the GAP generated below does not occur under normal circumstances, but is an anomaly that only occurs when a large input is being processed using all resources, as in this demo. In general, if there are enough resources, the GAPs will all converge to zero in less than a second, ensuring real-time performance.
So, if we force the thread responsible for HOUR to wake up with the command below to create a ROLLUP, we can see that all the GAPs have been reduced to zero, as shown below.
You're now all set to run a roll-up test.
As shown in the figure on the left, a year's worth of data at 1 Hz for one sensor has a total of 31,536,000 entries.
5. Query to extract rollup data
To get this rollup data, Machbase Neo provides SQL with a very simple syntax as shown below. Also, this syntax is new from 8.0.19, and the old syntax is deprecated, so please refer to this manual link.
Below is an example of a typical ROLLUP query that is supported starting with 8.0.19.
select ROLLUP('sec', 15, time) as time, AVG(value)
FROM mybigdata
WHERE name = 'pressure' and
time BETWEEN TO_DATE('2023-05-15') AND TO_DATE('2023-05-16')
GROUP BY time
ORDER BY time;
The query above is to get the 15-second average value for the pressure sensor over a given time range.
The first argument can be 'sec', 'min', 'hour', 'day', 'week', 'month', or 'year' and the second argument, a number, can be one of the minimum to maximum values in that unit. The third argument is the name of the time column, given as the base time. The fourth argument is presented as an option to specify the start time of the rollup, which will be explained later in the example.
6. perform a rollup of seconds, minutes, hours, and days
Let's look at the prepared Tuesday 1-Rollup-min-day.wrk. (The to_char() function is for converting to a string in TQL, and is not needed to execute)
Get a 15-second average over an hour
select to_char(rollup('sec', 15, time, '2023-01-01')) as time, avg(value), max(value) from mybigdata where name = 'pressure' and time between TO_DATE('2023-05-15 0:0:0') AND TO_DATE('2023-05-15 23:59:59') group by time order by time;
Get a 30-minute minimum for a day
select to_char(rollup('min', 30, time, '2023-01-01')) as time, min(value), sumsq(value) from mybigdata where name = 'pressure' and time between TO_DATE('2023-09-11') AND TO_DATE('2023-09-11 23:59:59') group by time order by time;
Get the average value per day for half a year
select to_char(rollup('day', 1, time, '2023-01-01')) as time, avg(value) from mybigdata where name = 'pressure' and time betw
The dashboard that visualizes the above results, 2-DASHBOARD-MIN-HOUR.dsh, looks like this.
The third chart above, a 12-hour average of statistics over 6 months, also shows the amazing ability to extract and visualize data very quickly.
7. Perform a weekly and monthly rollup
Let's take a look at the prepared Tuesday 3-Rollup-week-month.wrk.
Not only weeks, but months are also a difficult task to generalize as the length of each month varies and leap months exist. Markbase Neo absorbs all these tricky parts, allowing users to get the desired result by simply specifying ‘week’ or ‘month’. However, in the case of ‘week’, the starting day of the week must be specified to get accurate statistics, so we use the fourth argument of ROLLUP, the start time.
Get a weekly average over 6 months
Since 2 January 2023 falls on a Monday, we use 2 days for the Monday to Sunday statistics.
select to_char(rollup('week', 1, time, '2023-01-02'), 'YYYY-MM-DD') as time, avg(value) from mybigdata where name = 'pressure' and time between TO_DATE('2023-01-2 0:0:0') AND TO_DATE('2023-06-30 23:59:59') group by time order by time;
Get a two-week average over 6 months
The 3rd of July below is also a Monday.
select to_char(rollup('week', 2, time, '2023-07-03'), 'YYYY-MM-DD') as time, avg(value) from mybigdata where name = 'pressure' and time between TO_DATE('2023-07-03 0:0:0') AND TO_DATE('2023-12-31 23:59:59') group by time order by time;
Get monthly average values for a full-year range
select to_char(rollup('month', 1, time, '2023-01-01'), 'YYYY-MM') as time, avg(value) from mybigdata where name = 'pressure' and time between TO_DATE('2023-01-01') AND TO_DATE('2023-12-31 23:59:59') group by time order by time
4-DASHBOARD-WEEK-MONTH.dsh, a dashboard visualizing the above results, looks like this
Final thoughts
The features of ROLLUP presented here are the best way to quickly calculate and visualize statistics over long periods of data. Of course, there are exceptions where you may need to visit and calculate the entire data, but we believe this is a basic data processing technique to support decision-making in general business areas.
Although we haven't mentioned it in this article, there are also more statistical methods for deleting ROLLUP data and using EXTENSION mode.
Comments