> @phildunlap said in Times series database storage format, point arrays, and downsampling:
>
> quite a lot going on there!
Yes, I was considering whether to split my post into different topics. Thank you for all your answers.
> The POST /rest/v2/script/run allows you to submit a script and get the results of its run.
That sounds very interesting. I just completed a Python script using an XSRF token. Nice: I like how it eliminated the need to log in.
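In case it's useful to anyone, here is roughly the shape of my script. It's only a sketch: I'm assuming the common cookie-to-header XSRF pattern, and the host, credentials, login endpoint, and request body are placeholders to check against the Swagger UI.

```python
import requests

MANGO = "https://mango.example.com"      # placeholder host
LOGIN = MANGO + "/rest/v2/login"         # assumed login endpoint; verify in Swagger
RUN = MANGO + "/rest/v2/script/run"      # endpoint from phildunlap's reply

session = requests.Session()

# Log in once; the server sets session and XSRF-TOKEN cookies
# (assumption: Mango uses the common cookie-to-header XSRF pattern).
resp = session.post(LOGIN, json={"username": "admin", "password": "secret"})
resp.raise_for_status()

# Echo the XSRF cookie back as a header on state-changing requests.
headers = {"X-XSRF-TOKEN": session.cookies.get("XSRF-TOKEN")}

# Submit a script and read the results of its run; the body shape here
# is a guess -- the Swagger UI lists the actual model.
payload = {"script": "return 1 + 1;"}
result = session.post(RUN, json=payload, headers=headers)
result.raise_for_status()
print(result.json())
```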
> One could easily downsample their data via script, which I can provide a simple or efficient example of if desired.
If you already have a script, it would be nice if you could post it under its own forum topic, as I'm sure many of us would like to downsample old data. Besides old data, I also have numerous points that were logged far too often because I initially set a log tolerance threshold that was too small. However, I won't have time to run such a script right away due to my other project.
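In the meantime, here is the rough kind of thing I have in mind: a minimal pandas sketch that averages over-logged values into 15-minute buckets (the file name, column names, and interval are arbitrary).

```python
import pandas as pd

# Load over-logged point values exported as CSV (hypothetical file/columns).
df = pd.read_csv("point_history.csv", parse_dates=["timestamp"],
                 index_col="timestamp")

# Downsample: keep one averaged value per 15-minute bucket.
downsampled = df.resample("15min").mean().dropna()

downsampled.to_csv("point_history_15min.csv")
```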
> there is no data type that is "array of data type: numeric" and handled in Mango as such.
You understood me correctly. I'm looking to store an array of readings (multiple channels for each timestamp): essentially a 2D numerical array where the rows are different timestamps and the columns are values of the same data type from different sources (channels). If it were stored in a CSV or spreadsheet, it would look like this:
```
timestamp t+0, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
timestamp t+1, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
timestamp t+2, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
timestamp t+3, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
timestamp t+4, channel[1],channel[2],channel[3],channel[4],channel[5],...,channel[1000]
...
```
It seems to me that, to reduce data redundancy (by not storing the same timestamp multiple times), I could store the data in HDF5 format. HDF5 keeps the metadata alongside the stored data, so the data can be retrieved in a meaningful format using generic tools, even without the source code that wrote it. Additionally, it can efficiently compress and decompress binary data such as numerical arrays, and my array elements could be any number of bytes. HDF5 is also extremely fast.
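As a concrete sketch of what I mean, this models exactly the CSV layout above with h5py; the dataset names, chunk shape, and gzip level are just illustrative choices.

```python
import h5py
import numpy as np

n_rows, n_channels = 10_000, 1000

# Illustrative data: one epoch-millisecond timestamp per row, plus a
# row of 1000 channel readings sharing that timestamp.
timestamps = np.arange(n_rows, dtype=np.int64)
readings = np.random.rand(n_rows, n_channels).astype(np.float32)

with h5py.File("channels.h5", "w") as f:
    # Each timestamp is stored once per row, not once per channel.
    f.create_dataset("timestamps", data=timestamps)
    # 2D array: rows are timestamps, columns are channels.
    dset = f.create_dataset("readings", data=readings,
                            chunks=(1024, n_channels),
                            compression="gzip", compression_opts=4)
    # Self-describing: the metadata travels with the file.
    dset.attrs["units"] = "volts"
    dset.attrs["columns"] = "channel[1]..channel[1000]"
```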
Summary Points - Benefits of HDF5
- Self-Describing: The datasets within an HDF5 file are self-describing, which allows us to efficiently extract metadata without needing an additional metadata document.
- Supports Heterogeneous Data: Different types of datasets can be contained within one HDF5 file.
- Supports Large, Complex Data: HDF5 is a compressed format that is designed to support large, heterogeneous, and complex datasets.
- Supports Data Slicing: "Data slicing", or extracting portions of the dataset as needed for analysis, means large files don't need to be completely read into the computer's memory or RAM (see the read sketch after this list).
- Open Format, with wide tool support: Because the HDF5 format is open, it is supported by a host of programming languages and tools, including open-source languages like R and Python and open GIS tools like QGIS.
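The data-slicing point matters most for my channel arrays. Here is a small sketch of reading a window from the file written above, without pulling the whole array into RAM:

```python
import h5py

with h5py.File("channels.h5", "r") as f:
    # Read only rows 5000-5999 of channels 10-19; h5py fetches just
    # this hyperslab from disk instead of the full 10,000 x 1000 array.
    window = f["readings"][5000:6000, 10:20]
    stamps = f["timestamps"][5000:6000]

print(window.shape)  # (1000, 10)
```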
I also found TsTables, a PyTables wrapper that stores timestamped arrays in HDF5 files as daily shards and seamlessly stitches them together during queries. Appends are efficient too. The HDF5 tools will also help when debugging whether any inconsistency occurs during read or write operations.
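If I remember the TsTables README correctly, usage looks roughly like this; the column definitions are mine, and a real table would need one column per channel (presumably generated programmatically):

```python
from datetime import datetime

import pandas as pd
import pytz
import tables
import tstables  # monkey-patches create_ts() onto PyTables files

# Two channels for brevity; a real description would define 1000 columns.
class Reading(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    ch1 = tables.Float64Col(pos=1)
    ch2 = tables.Float64Col(pos=2)

f = tables.open_file("readings.h5", mode="a")
f.create_ts("/", "channels", Reading)
ts = f.root.channels._f_get_timeseries()

# Append a DataFrame with a tz-aware DatetimeIndex; TsTables shards
# the rows into one group per day behind the scenes.
idx = pd.date_range("2017-06-01", periods=3, freq="s", tz=pytz.utc)
ts.append(pd.DataFrame({"ch1": [1.0, 2.0, 3.0],
                        "ch2": [4.0, 5.0, 6.0]}, index=idx))

# Queries spanning shard boundaries are stitched together seamlessly.
rows = ts.read_range(datetime(2017, 6, 1, tzinfo=pytz.utc),
                     datetime(2017, 6, 2, tzinfo=pytz.utc))
print(rows)

f.close()
```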