Time-series database storage format, point arrays, and downsampling
-
I'm looking into reading over 2,000 new datapoints, each reporting once per minute, and storing them either in Mango or in another time-series database. Most values are 16 bits wide, but many may be only 8 bits wide, and all could be stored with the same timestamp each minute. This is a lot of data, so I'm looking to store it efficiently, and then probably analyze and visualize it with the Anaconda Python toolset.
I currently have almost 1,300 enabled points in a Mango installation. I recommend the command line tool `jfq` to run JSONata queries that quickly answer such questions:

    $ jfq '$count(dataPoints[enabled=true].[name])' Mango-Configuration-Aug-01-2019_000500.json
    1289

The `.[name]` is not needed, but I like to list the point names as a sanity check before invoking the count. This brings me to Mango's NoSQL time-series database:
- How does Mango's time-series database store binary or numerical values?
- Are all point values stored as 64 bits? Even binary values?
- Is each TSDB entry 128 bits (a 64-bit timestamp plus a 64-bit value)?
Each set of four 16-bit readings would be transmitted packed into a 64-bit word, with the same timestamp from each data source, either over MQTT-SN into a broker and on to the Mango MQTT client subscriptions, or over CoAP and REST PUTs. Those four stored numbers can then be used to generate the additional two numbers I need for each timestamp; incidentally, if you ever integrate an MQTT broker into Mango, please be sure it supports MQTT-SN, like the Mosquitto broker. Such values can therefore be calculated upon retrieval; storing them would be redundant and bloat storage needs by another 50%.
To summarize, I could pack four 16-bit values into a 64-bit point value, and then generate the remaining two points when I read and unpack it, thus producing my desired six 16-bit values. Unpacking could almost be done with metadata functions, but metadata functions are triggered when the data enters Mango rather than when the data is retrieved from Mango's TSDB.
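For concreteness, here is a minimal Python sketch of the packing and unpacking I have in mind; the example readings are arbitrary, and the real channel assignments and derived values would be mine to define:

```python
import struct

def pack4(a, b, c, d):
    """Pack four unsigned 16-bit readings into a single 64-bit integer."""
    return struct.unpack('<Q', struct.pack('<4H', a, b, c, d))[0]

def unpack4(word):
    """Recover the four 16-bit readings from a 64-bit integer."""
    return struct.unpack('<4H', struct.pack('<Q', word))

packed = pack4(1200, 230, 498, 61)   # four raw readings packed for transmission/storage
readings = unpack4(packed)           # (1200, 230, 498, 61) recovered on retrieval
# The two derived values would be computed here, from the unpacked readings,
# rather than being stored alongside them.
```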
- Have you considered implementing ad-hoc metadata functions, where the data is generated when it is retrieved rather than when it is stored? This would reduce the need to store data that can simply be calculated from other stored points. Depending on the ad-hoc metadata point's Logging properties, calculated points could either be stored or recalculated each time they're retrieved. Metadata point value calculations are performed on retrieved stored data now by clicking the metadata point source's Generate history function, but they're not ad-hoc in that they are not automatically generated when the metadata point is queried for data that it does not have.
The most efficient storage in my case would be one where each 2,000-element row begins with a single timestamp and each column holds a different data value. Since the timestamp will be the same for all the points, that timestamp could be stored only once. Rare missing values could be stored as a special value such as NaN. This avoids the need to write 2,000 identical timestamps to disk. Calculating the additional points (2 of the 6 in my case) upon value retrieval from disk would further reduce the data storage demands by 1/3.
- Is there any way to handle point arrays in Mango, where each array index is a different point? e.g. voltage[0], voltage[1], ... voltage[1000]?
- As the data ages, it could be down-sampled (decimated) from 1-minute intervals to 10-minute intervals. Most time-series databases support automatic downsampling. Are there any plans to add automated downsampling to Mango? Purging should not be the only option for old point data.
I've considered using InfluxDB. Although it has downsampling policies, it seems that it does not support arrays; the closest data element it supports is a 64-bit number. Consequently, I would still have to pack and unpack my 16-bit values, which makes it unlikely that the visualization tools would render the stored data directly. This forces me to choose between direct visualization and storage efficiency.
Thank you for your input.
-
Hi Pedro, quite a lot going on there! Let me know if I miss a question...
How does Mango's time-series database store binary or numerical values?
Are all point values stored as 64 bits? Even binary values?
Is each TSDB entry 128 bits (a 64-bit timestamp plus a 64-bit value)?
Numeric data points store the value as a 64-bit double. Binary points store a whole byte for their one bit of information, so not all values are stored as 64 bits. As to the size of each entry, it can vary even within a data type, because Mango keeps the annotations for the point values stored alongside the point values themselves. The best way to estimate the size of a record is to hexdump the shard file after the point has had another sample written to it, and compare to before. Typically a record will be somewhere from 13 to 25 bytes, but it can certainly vary. The maximum record size is somewhere around 65550 bytes, with 65535 of that being the user's largest encodable record.
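If it helps, a rough sketch of that before/after comparison; the shard file path below is only a placeholder, and you would need to locate the shard that holds the point you are measuring:

```python
import time
from pathlib import Path

# Placeholder path: substitute the TSDB shard file for the point being measured.
shard = Path("/path/to/mangoTSDB/shard.data")

before = shard.read_bytes()
time.sleep(120)  # wait long enough for at least one more sample to be logged
after = shard.read_bytes()

print(f"appended {len(after) - len(before)} bytes")
print(after[len(before):].hex())  # hex view of just the newly written record(s)
```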
Have you considered implementing ad-hoc metadata functions, where the data is generated when it is retrieved rather than when it is stored?
Yes, though I'm not aware of anyone actually using it that way currently. The `POST /rest/v2/script/run` endpoint allows you to submit a script and get the results of its run. For longer running scripts it would be nice if we provided temporary resources for the output, but that hasn't been asked for yet. The `PointValueQuery` script tool should be used for streaming large amounts of data in scripts.
Metadata point value calculations are performed on retrieved stored data now by clicking on the metadata point source Generate history function, but they're not ad-hoc in that they are not automatically generated when the metadata point is queried for data that it does not have.
There certainly has been discussion on that, but no one has offered to support the development so it hasn't gotten to the front of the queue.
Is there any way to handle point arrays in Mango, where each array index is a different point? e.g. voltage[0], voltage[1], ... voltage[1000]?
It's not entirely clear to me what you mean, but the answer is likely no. You can download data in this format from the watch lists and the API, but there is no data type that is "array of data type: numeric" and handled in Mango as such. You could encode the data into an alphanumeric point to get closer to achieving that (or even an image point if the data were enormous), but the entirety of the handling would be in your implementation.
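If you did go the alphanumeric-point route, the encoding would be entirely up to you; as a rough sketch only, one compact option would be to pack the channel readings and base64 them (the channel count, the 16-bit width, and the base64 choice here are all assumptions):

```python
import base64
import struct

CHANNELS = 1000  # assumed channel count

def encode_row(values):
    """Pack one row of unsigned 16-bit channel readings into a base64 string
    that could be stored as a single alphanumeric point value."""
    return base64.b64encode(struct.pack(f'<{len(values)}H', *values)).decode('ascii')

def decode_row(text):
    """Recover the tuple of 16-bit readings from the stored string."""
    raw = base64.b64decode(text)
    return struct.unpack(f'<{len(raw) // 2}H', raw)

row = encode_row(range(CHANNELS))   # roughly 2.7 kB of text per timestamp
assert decode_row(row)[42] == 42
```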
As the data ages, it could be down-sampled (decimated) from 1-minute intervals to 10-minute intervals. Most time-series databases support automatic downsampling. Are there any plans to add automated downsampling to Mango? Purging should not be the only option for old point data.
It has been discussed, but no one has desired it in earnest. Our record sizes could be improved upon as well, so there has been some exploratory work on different compression tactics for old data, but whatever tools exist were never offered to the community, for a wide variety of reasons.
One could easily downsample their data via script, which I can provide a simple or efficient example of if desired.
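This is not the Mango script I had in mind, but as an illustration of the decimation step itself once the raw history has been exported, something like the following would do it with pandas (the file and column names are assumptions about the export format):

```python
import pandas as pd

# Assumed: a CSV export of one point's 1-minute history with 'timestamp' and 'value' columns.
raw = pd.read_csv("point_history.csv", parse_dates=["timestamp"], index_col="timestamp")

# Decimate from 1-minute to 10-minute intervals; mean() could equally be first(),
# min(), or max(), depending on what the downsampled archive should preserve.
downsampled = raw.resample("10min").mean()

downsampled.to_csv("point_history_10min.csv")
```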
-
@phildunlap said in Time-series database storage format, point arrays, and downsampling:
quite a lot going on there!
Yes, I was considering whether to split my post into different topics. Thank you for all your answers.
The POST /rest/v2/script/run allows you to submit a script and get the results of its run.
That sounds very interesting. I just completed a Python script using an XSRF token. Nice: I like how it eliminated the need to log in.
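In case it's useful to others, the skeleton of my call looks roughly like this; the host, the token handling, and the request body field are specific to my setup and only illustrative, since I haven't confirmed the exact schema that /rest/v2/script/run expects:

```python
import requests

MANGO = "https://mango.example.com"   # placeholder host
TOKEN = "..."                         # token issued by the Mango instance (elided here)

# Illustrative only: the body field name 'script' is an assumption, not a confirmed schema,
# and the Bearer header is just one of the ways to authenticate against Mango.
resp = requests.post(
    f"{MANGO}/rest/v2/script/run",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"script": "return 1 + 1;"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```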
One could easily downsample their data via script, which I can provide a simple or efficient example of if desired.
If you already have such a script, it would be nice if you could post it under its own forum topic, as I'm sure many of us would like to downsample old data. In addition to downsampling old data, I have numerous points that were logged much too often because I initially set a log tolerance threshold that was too small. However, I won't have time to run this script right away due to my other project.
there is no data type that is "array of data type: numeric" and handled in Mango as such.
You understood me correctly. I'm looking to store an array of readings (as in multiple channels for each timestamp): basically, a 2D numerical array where the rows are different timestamps and the columns are the same data type but from different sources (channels). If it were stored in a CSV or spreadsheet, it would look like this:
    timestamp t+0, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
    timestamp t+1, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
    timestamp t+2, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
    timestamp t+3, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
    timestamp t+4, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
    ...
It seems to me that in order to reduce data redundancy (by not storing the same timestamp multiple times) I could store the data in HDF5 format. HDF5 includes the metadata for the stored information, so the data can be retrieved into a meaningful format using generic tools, even without the source code that stored it. Additionally, it can efficiently compress and decompress binary data such as numerical arrays. My array elements could be any number of bytes. HDF5 is also extremely fast.
Summary Points - Benefits of HDF5
- Self-Describing: The datasets within an HDF5 file are self-describing. This allows us to efficiently extract metadata without needing an additional metadata document.
- Supports Heterogeneous Data: Different types of datasets can be contained within one HDF5 file.
- Supports Large, Complex Data: HDF5 is a compressed format that is designed to support large, heterogeneous, and complex datasets.
- Supports Data Slicing: "Data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be completely read into the computer's memory or RAM.
- Open Format - Wide Tool Support: Because the HDF5 format is open, it is supported by a host of programming languages and tools, including open source languages like R and Python and open GIS tools like QGIS.
I also found TsTables, a PyTables wrapper that enables storing time-stamped arrays in HDF5 files as daily shards and seamlessly stitching them together during queries. Appends are also efficient. The HDF5 tools will also help with debugging whether any inconsistency occurs during the read or write operation.
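To make the intended layout concrete, here is a minimal sketch with plain h5py (TsTables/PyTables would add its own schema and daily sharding on top; the sizes, chunking, and file name below are arbitrary):

```python
import numpy as np
import h5py

CHANNELS = 1000
MINUTES = 60 * 24   # one day of 1-minute samples

timestamps = np.arange(MINUTES, dtype=np.int64)  # placeholder epoch-minute timestamps
readings = np.random.randint(0, 2**16, size=(MINUTES, CHANNELS), dtype=np.uint16)

with h5py.File("day.h5", "w") as f:
    # One timestamp per row, 1000 16-bit channels per row, stored compressed.
    f.create_dataset("timestamp", data=timestamps, compression="gzip")
    f.create_dataset("channels", data=readings, compression="gzip",
                     chunks=(60, CHANNELS))

# Data slicing: read back only channel 42 for the first hour, without loading the whole file.
with h5py.File("day.h5", "r") as f:
    first_hour_ch42 = f["channels"][:60, 42]
```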