Tips to duplicate massive time series data on Postgresql

A few days ago, I was testing the lttb function from the timescaledb_toolkit extension to downsample data into fewer points.

Using the weather dataset from Timescale, we got 20k points per device, but I need more data from the same device_id to downsample. Utilizing random or generate_series doesn’t bring a good quality of data for this case.

This blog post aims to show how to insert massive data by reusing the values from a dataset and shifting the time column to another period.

Let’s use the conditions table as an example:

playground=# \d conditions
                         Table "public.conditions"
┌─────────────┬──────────────────────────┬───────────┬──────────┬─────────┐
│   Column    │           Type           │ Collation │ Nullable │ Default │
├─────────────┼──────────────────────────┼───────────┼──────────┼─────────┤
│ time        │ timestamp with time zone │           │ not null │         │
│ device_id   │ text                     │           │          │         │
│ temperature │ numeric                  │           │          │         │
│ humidity    │ numeric                  │           │          │         │
└─────────────┴──────────────────────────┴───────────┴──────────┴─────────┘

We’re going to just duplicate the data for a target device_id. So, to start, you just need to know the size of the interval between your data.

SELECT MAX(time) - MIN(time)
FROM conditions
WHERE device_id = 'weather-pro-000000'

Now, you can reuse the interval and specify the order of the columns correctly to reinsert the data shifting the dataset backward or forward. In this case, I’m prepending data backward. This means inserting more data from the past instead of the future.

INSERT INTO conditions
  SELECT time - INTERVAL '108 days' as time,
    device_id, temperature, humidity
  FROM conditions
  WHERE device_id = 'weather-pro-000000';

The trick is the time - INTERVAL '108 days', which will reset the shift time from the dataset, and the rest of the data will be reused. To remove the hardcoded 108 days from the example, we’ll need to create a materialized CTE.

WITH previous AS materialized
(SELECT MAX(time) - MIN(time) AS period
FROM conditions
WHERE device_id = 'weather-pro-000001' )
TABLE previous;
┌──────────────────┐
│      period      │
├──────────────────┤
│ 27 days 18:38:00 │
└──────────────────┘
(1 row)

Now, enhancing the example to preview all the data joining the conditions table:

WITH previous AS materialized (
  SELECT device_id, MAX(time) - MIN(time) as period
  FROM conditions
  WHERE device_id = 'weather-pro-000001'
  GROUP BY 1
)
SELECT cond.time + previous.period,
  previous.device_id,
  cond.temperature,
  cond.humidity
FROM previous
LEFT JOIN LATERAL
  (SELECT * from conditions)
AS cond ON cond.device_id = previous.device_id;

Now, it’s more about prepending the previous statement with the INSERT INTO conditions.

INSERT INTO conditions
WITH previous AS materialized (
  SELECT device_id, MAX(time) - MIN(time) as period
  FROM conditions
  WHERE device_id = 'weather-pro-000001'
  GROUP BY 1
)
SELECT cond.time + previous.period,
  previous.device_id,
  cond.temperature,
  cond.humidity
FROM previous
LEFT JOIN LATERAL
  (SELECT * from conditions)
AS cond ON cond.device_id = previous.device_id;

Checking the performance with \timing in the psql can give you some idea of how fast it is:

\timing
INSERT 0 40000 -- Time: 326.921 ms
INSERT 0 80000 -- Time: 582.390 ms
INSERT 0 160000 -- Time: 1086.917 ms (00:01.087)

160k rows per second!

This post was inspired by a pairing session with David K. He is an SQL expert and came up with this fascinating idea to just shift the time and reinsert the same data.

If you like Postgres and SQL, David is also creating a very instructive video series’ named the Postgresql foundations on Youtube.

Jônatas Davi Paganini

Four Years at Timescale: My Journey as Dev 🥑

Share this article

Jônatas Davi Paganini

Related Articles

Four Years at Timescale: My Journey as Dev 🥑