The Internet of Things (IoT) is a network of physical devices (peripherals, appliances, vehicles, wearables, and other items) embedded with electronics, software, and sensors that enable these objects to connect and exchange data over the internet or, sometimes, a private network.
These devices often include sensors, digitally controlled actuators, and communication terminals specially engineered for physical, moving objects. The ability to collect, process, and analyze data from these devices and derive valuable insights is the catalyst behind Industry 4.0, the set of trends driving much of the modern tech industry.
Along with these new trends, however, come new technical challenges. The operation of and interactions between thousands of physical objects generate huge volumes of data at a mind-boggling pace. As technology continues to improve and integration becomes increasingly common, more data is generated, collected, analyzed, and turned into intelligent insights.
This article will describe the types of IoT data generated, unique characteristics of IoT data, and how to overcome the immense challenge of processing it all.
IoT Data Types And Characteristics
In the early days of the Internet of Things, RFID was the only source of data. Nowadays, advances in communication and computing technology offer a wider variety of data sources and types: modern sensors and actuators can record almost anything imaginable. Here are the different types of IoT data available, and their characteristics.
RFID (Radio Frequency Identification)
Radio-frequency identification or RFID technology uses electromagnetic fields to automatically identify and track tagged objects or equipment.
A typical RFID tag consists of an IC chip that stores electronic information and an antenna that transmits and receives the data wirelessly via radio waves. Active RFID tags have a battery or other local power source that allows them to interact with an RFID reader hundreds of meters away. Unlike barcodes, RFID chips don’t need to be lined up and scanned directly by their readers, allowing easy interaction and data processing.
Barcodes may still be the method of choice in distribution, but since RFID technology is fairly cheap, it’s beating out barcodes nearly everywhere else. RFID technology has a wide variety of uses including travel, smart devices, factory operations, warehousing, farming, and healthcare.
RFID allows the tracking and recording of positional information through time series data, providing invaluable information for logistical applications.
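As a minimal sketch of how RFID reads can become positional time series data, the Python below groups read events by tag and orders them by time. The event fields (timestamp, tag ID, reader location) and the sample values are assumptions made for illustration, not the output format of any real reader.

```python
from collections import defaultdict
from typing import NamedTuple


class RfidRead(NamedTuple):
    timestamp: int   # seconds since epoch (hypothetical sample values)
    tag_id: str      # identifier stored on the tag's chip
    reader: str      # location of the reader that saw the tag


def build_tracks(reads):
    """Group reads by tag and order each group by time."""
    tracks = defaultdict(list)
    for read in sorted(reads, key=lambda r: r.timestamp):
        tracks[read.tag_id].append((read.timestamp, read.reader))
    return dict(tracks)


reads = [
    RfidRead(1700000120, "TAG-A", "loading-dock"),
    RfidRead(1700000000, "TAG-A", "warehouse-gate"),
    RfidRead(1700000060, "TAG-B", "warehouse-gate"),
]

tracks = build_tracks(reads)
# The most recent reader for each tag is its last known position.
last_seen = {tag: path[-1][1] for tag, path in tracks.items()}
print(last_seen)
```

Each tag's track is a small time series of positions, which is the shape of data a logistics application would query.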
The problem with RFID technology is security. RFID data can be read from your devices at a distance, often without you knowing. When that data can be personal or confidential in nature, that can be a huge problem.
Log Data
Log data is information automatically generated by software and hardware, and it plays a very important role in managing IoT devices.
At a base level, log data records the time each log entry is generated. It often also includes environmental information, such as identifying information (IP or MAC address), system usage and load, and temperature and humidity readings from the input records. Since each log has its own format, a conversion process, such as parsing the log messages, is required before the information can be represented in a relational DBMS schema.
Typically, log data is generated in a text form and is automatically deleted when a certain capacity is reached. Obviously, that won’t do. Since ongoing data collection is important for IoT, a different approach is required in order to collect and analyze data long-term.
Another challenge that can make log data difficult to process is that it can be recorded in various formats depending on which programs generate it.
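The conversion step can be sketched as follows. This Python example parses lines from two different text layouts into one fixed schema. Both layouts, the field names, and the sample lines are hypothetical; real devices will use formats of their own.

```python
import re
from datetime import datetime

# Two hypothetical log layouts, for illustration only.
PATTERNS = [
    # e.g. "2024-01-15 10:30:00 ip=10.0.0.5 load=0.42 temp=21.5"
    re.compile(
        r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
        r"ip=(?P<ip>\S+) load=(?P<load>\S+) temp=(?P<temp>\S+)"
    ),
    # e.g. "2024-01-15T10:31:00|10.0.0.6|0.77|22.0"
    re.compile(
        r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\|"
        r"(?P<ip>[^|]+)\|(?P<load>[^|]+)\|(?P<temp>[^|]+)"
    ),
]


def parse_line(line):
    """Return a row matching a fixed schema, or None if no layout matches."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(line.strip())
        if match:
            return {
                "ts": datetime.fromisoformat(match["ts"]),
                "ip": match["ip"],
                "load": float(match["load"]),
                "temp_c": float(match["temp"]),
            }
    return None


lines = [
    "2024-01-15 10:30:00 ip=10.0.0.5 load=0.42 temp=21.5",
    "2024-01-15T10:31:00|10.0.0.6|0.77|22.0",
    "unrecognized line",
]
rows = [parse_line(line) for line in lines]
```

Lines that match neither layout come back as None, so they can be counted or stored separately rather than silently dropped.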
Location And Environmental Data
As we mentioned regarding RFID, location information from moving objects is extremely useful. Another important type of environmental information is weather data. In general, positional information is obtained by using data from global positioning systems (GPS). Since GPS information is obtained through satellites, it is difficult to obtain highly accurate positional data using that technology alone. However, using GPS in conjunction with local positioning technology makes it possible to obtain more accurate and detailed information.
The location data of non-moving equipment can also be treated as very important information. For example, the combination of environmental and location information, such as temperature, humidity, and air pressure from floating sensors in the ocean can be useful for weather forecasts and disaster alarms.
Currently, location and environmental data are being studied in combination with other technologies such as geographic information systems and mobile computing in order to discover additional use cases.
Sensor Data And Time Series Data
Nowadays, every mobile phone is equipped with numerous sensors such as cameras, GPS, and accelerometers. There are also many sensors in equipment used in factories and public services, such as roads, railways, seaports, and airports. As more and more smart devices are connected, the opportunity to collect and analyze useful IoT data also continues to rise.
Each device and sensor has a unique identifier, allowing data from many sources to be recorded and measured simultaneously. This data, recorded in “Timestamp, Sensor identifier, Sensor value” format, can be stored sequentially by input time for later analysis. It is called time series sensor data.
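A minimal sketch of that record format in Python, assuming an in-memory list kept in timestamp order. The class and method names are hypothetical, and a real DBMS would persist and index the data rather than hold it in a list.

```python
import bisect
from typing import NamedTuple


class Reading(NamedTuple):
    timestamp: float   # input time, seconds since epoch
    sensor_id: str
    value: float


class SensorSeries:
    """Keeps readings ordered by timestamp."""

    def __init__(self):
        self._rows = []

    def insert(self, reading):
        # insort keeps the list sorted even if readings arrive out of order.
        bisect.insort(self._rows, reading)

    def between(self, start, end):
        """Return readings with start <= timestamp < end."""
        lo = bisect.bisect_left(self._rows, (start,))
        hi = bisect.bisect_left(self._rows, (end,))
        return self._rows[lo:hi]


series = SensorSeries()
series.insert(Reading(20.0, "temp-1", 22.5))
series.insert(Reading(10.0, "temp-1", 21.0))
series.insert(Reading(15.0, "temp-2", 19.0))
window = series.between(10.0, 20.0)
```

Because the rows stay sorted by time, a range lookup is two binary searches and a slice instead of a full scan.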
Through analysis, time series sensor data can offer a wealth of real-world information from IoT device sensors that can be used to solve a multitude of problems that were previously impossible to address.
Sensor And Control Data
Sensor data is collected in real time as a time series, while control signal data records the commands that drive the actuators. Since this data keeps changing in real time, a large amount of it is generated, and it can be difficult to store and analyze.
This type of data can be used in accident analysis, defective product prediction, quality improvement and production control.
When sensor data includes the time it is collected, it can be categorized as historical data. Since it is time-based, the volume of historical data can increase very quickly depending on the length of the data collection period.
That said, when data is being collected for detailed analysis, it’s possible to selectively shorten the data collection period in order to avoid accumulating larger amounts of data, which can be an issue unless your DBMS is able to handle them.
DBMS For Processing IoT Data
The Problems With Handling Large Amounts Of Data In Real Time
As previously mentioned, more devices and more sensors mean you’ll start to generate incredible amounts of data. To compound the problem even further, continued storage of data for analysis and historical record means you’ll only ever have more data to deal with.
Typical database management systems (DBMS) are poorly suited to managing petabytes of information within a single system. Also, while conventional Big Data platforms are optimized for batch processing, distributed storage, and massive data retrieval, they do not necessarily make it possible to analyze large amounts of data in real time.
Since the utility of time series data is based on continuing to collect, store, index, and retrieve it without data loss, your DBMS must be scalable, and able to handle large amounts of sensor data in real time.
Specialized IoT DBMS make time series data management much easier by generating search indexes and processing statistical data for visualization very quickly.
Query Language And Interface
IoT sensor data can come as either regular structured data or semi-structured data. While SQL is the most common query language for structured data, there is no single standard query language for semi-structured data. NoSQL query languages have emerged with the introduction of Big Data systems, but product-specific query languages (e.g. MongoDB’s) are not supported outside their own products. As a result, SQL-on-Hadoop products such as Spark and Impala have become popular and encouraged increased usage of SQL. DBMS supporting the SQL language typically also support traditional interfaces, such as ODBC and JDBC.
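To illustrate SQL over structured sensor data, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any SQL-capable DBMS. The table name, column names, and sample rows are assumptions made for this example.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_data (ts INTEGER, sensor_id TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO sensor_data VALUES (?, ?, ?)",
    [
        (1700000000, "temp-1", 21.0),
        (1700000030, "temp-1", 23.0),
        (1700000030, "temp-2", 19.5),
        (1700000090, "temp-1", 25.0),
    ],
)

# Count and average per sensor within a time window.
rows = conn.execute(
    """
    SELECT sensor_id, COUNT(*), AVG(value)
    FROM sensor_data
    WHERE ts >= ? AND ts < ?
    GROUP BY sensor_id
    ORDER BY sensor_id
    """,
    (1700000000, 1700000060),
).fetchall()
print(rows)  # [('temp-1', 2, 22.0), ('temp-2', 1, 19.5)]
```

The same query text would work largely unchanged through an ODBC or JDBC connection, which is part of the appeal of SQL as a common interface.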
Operational historian products typically expose a JSON-based query interface over HTTP through a REST API. Since REST APIs are very easy to use, nearly every environment has adopted them, and REST has become the most popular interface to support.
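A JSON query over HTTP might look like the sketch below, which only builds the request. The URL and every field in the query body are hypothetical; an actual historian product defines its own endpoint and schema.

```python
import json
import urllib.request

# Hypothetical endpoint and query fields, for illustration only.
URL = "http://historian.example.com/api/query"

query = {
    "sensor_id": "temp-1",
    "start": "2024-01-15T10:00:00Z",
    "end": "2024-01-15T11:00:00Z",
    "aggregate": "avg",
}

request = urllib.request.Request(
    URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out because the endpoint does not exist:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```

Only the standard library is needed on the client side, which illustrates why REST interfaces are easy to support in almost any environment.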
In a distributed data system that needs to process more than 100 billion data points per second in real time, it’s difficult to perform transaction processing. Transaction processing is usually characterized by ACID guarantees, commonly enforced with techniques such as two-phase locking.
Traditional relational DBMS built around ACID-based transaction processing are a poor fit for time series data, which is typically inserted once and never updated before it is deleted. On Big Data platforms, this data is typically processed with eventual consistency techniques based on the CAP theorem.
In order to process massive amounts of IoT data in real time, you need a DBMS with more efficient data processing techniques that reflect the characteristics of time series data, rather than those used in traditional ACID-based transactions.
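The insert-then-delete access pattern described above can be sketched as a small append-only store. This is only an illustration of the pattern under simplified assumptions (single process, in memory, hypothetical class name), not a description of how any particular DBMS implements it.

```python
class AppendOnlySeries:
    """Insert-and-delete-only store: rows are never updated in place."""

    def __init__(self):
        self._rows = []  # (timestamp, sensor_id, value), in arrival order

    def append(self, timestamp, sensor_id, value):
        if self._rows and timestamp < self._rows[-1][0]:
            raise ValueError("timestamps must not go backwards")
        self._rows.append((timestamp, sensor_id, value))

    def delete_before(self, cutoff):
        """Drop every row older than cutoff and return how many were removed."""
        keep = [row for row in self._rows if row[0] >= cutoff]
        removed = len(self._rows) - len(keep)
        self._rows = keep
        return removed

    def __len__(self):
        return len(self._rows)


store = AppendOnlySeries()
store.append(100, "temp-1", 21.0)
store.append(160, "temp-1", 21.4)
store.append(220, "temp-1", 21.9)
removed = store.delete_before(150)
```

With no update operation, existing rows never change between being written and being deleted, which is the property that lets a time series store avoid much of the locking that general-purpose transactions require.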
Statistical Processing For Time Series Data
Time-series data requires time-based statistical processing in order for it to be used for visualization or statistical analysis.
It can be difficult to generate statistical data in an environment where a large amount of data is input in real time. In fact, even general sum, count, and average functions in time series statistical processing require a special sampling function. Since it is difficult to process time series statistical data with a traditional relational database, stream databases are being studied as a possible alternative.
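One common form of time-based statistical processing is to roll readings up into fixed time buckets. The Python below is a minimal sketch of that idea; the function name, the tuple layout, and the sample values (timestamps as second offsets) are assumptions for illustration.

```python
from collections import defaultdict


def rollup(readings, bucket_seconds):
    """Compute count, sum, and average per (sensor_id, bucket start)."""
    acc = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for timestamp, sensor_id, value in readings:
        # Align the timestamp down to the start of its bucket.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        entry = acc[(sensor_id, bucket_start)]
        entry[0] += 1
        entry[1] += value
    return {
        key: {"count": count, "sum": total, "avg": total / count}
        for key, (count, total) in acc.items()
    }


readings = [
    (0, "temp-1", 20.0),
    (30, "temp-1", 22.0),
    (70, "temp-1", 30.0),
    (10, "temp-2", 5.0),
]
stats = rollup(readings, bucket_seconds=60)
```

Because each reading only touches one running count and sum, the rollup can be maintained incrementally as data arrives instead of being recomputed over the full history.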
Machbase DBMS Optimized For IoT Data
The DBMS for IoT data should be able to:
- Process large amounts of data in real time,
- Support a convenient and efficient query language,
- Process transactions effectively,
- Process time series data statistics.
Machbase is the only available solution that meets all of these requirements.
Real-Time Processing Of Large Data Volumes
- With its distributed data storage and query structure, Machbase can input and index 200 million data records from a single device. With the addition of equipment, performance improves, and more than 10 million sensor data records can be processed per second.
- It has a dedicated API for high-speed data input and an index structure for high-speed index generation.
- As time series data grows over time, Machbase allows users to add equipment to the cluster for better performance and more storage space.
Efficient Query Language And Interface
- Machbase supports an optimized SQL dialect for data processing. Even NoSQL products are starting to offer SQL again.
- An inverted index and related syntax are provided, making it easy to search and process semi-structured data efficiently.
- It also provides REST APIs as well as the SQL standard interface, ODBC/JDBC.
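For readers unfamiliar with the term, an inverted index maps each token to the records that contain it. The sketch below shows the general technique in Python; it is a generic illustration with hypothetical names and sample text, not Machbase's implementation or syntax.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each token to the set of record IDs that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._records = {}

    def add(self, record_id, text):
        self._records[record_id] = text
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[token].add(record_id)

    def search(self, *terms):
        """Return IDs of records containing every term, in sorted order."""
        sets = [self._postings.get(term.lower(), set()) for term in terms]
        if not sets:
            return []
        return sorted(set.intersection(*sets))


index = InvertedIndex()
index.add(1, "sensor temp-1 overheat warning")
index.add(2, "sensor temp-2 normal")
index.add(3, "pump overheat warning")
hits = index.search("overheat", "warning")
```

A keyword search becomes a set intersection over precomputed postings, so the raw text does not need to be scanned at query time.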
Efficient Transaction Processing
- It provides a transaction technique optimized for time series data.
- It does not support updates, but inserts and deletes are possible. Even when a node fails and restarts, a recovery process runs and the consistency of data and indexes is maintained.
- Machbase Enterprise Edition prevents data loss caused by node failures with a distributed data storage technique.
Time Series Statistics Processing
- Automatic statistics function for time series sensor data: it automatically generates statistics for each sensor identifier per time unit (second, minute, hour).
- It supports extended query conditional clauses optimized for time series data.
Machbase was designed with all of the functional and performance requirements for processing time series data in mind, which makes it well suited to processing IoT data.