Point in Time Queries With Redshift

Oct. 9, 2020

I’ve been moving one of our entity changelogs to CRUD-style events. At the same time, analytics was considering how to deal with point in time queries. One thing led to another, and I’m staring down a monstrous five-hundred-page book about temporal databases.

Point in time - running queries as if you had gone backward or forward to a particular moment in time. For example, who were the active employees as of January 1, 2020?

Having never thought too deeply about temporal databases before, I found Developing Time-Oriented Database Applications in SQL invaluable for getting the lay of the land. In particular, chapters 1-4 were great for concepts, and chapters 8 and 9 for techniques for tackling this type of question.

For the problem at hand, let’s build a simple historical table to answer our question about employees. Examples are geared towards Redshift (I believe they also work with Postgres).

CREATE TABLE IF NOT EXISTS employee_events(
  event_id varchar(4),
  id int,
  name varchar(255),
  status varchar(100),
  date date
);

INSERT INTO employee_events VALUES
('3zx7', 1, 'Zane_Lee', 'OFFER', '2018-06-10'),
('z0qu', 1, 'Zane_Lee', 'ACTIVE', '2018-07-01'),
('6yre', 2, 'Julia_Qwerty', 'OFFER', '2019-09-20'),
('o4a1', 2, 'Julia_Qwerty', 'ACTIVE', '2019-10-01'),
('8vgg', 1, 'Zane_Lee', 'INACTIVE', '2019-11-16'),
('uolk', 3, 'Luis_Dale', 'OFFER', '2019-12-20');

After running those statements, our table should look something like this.

event_id  id  name          status    date
3zx7      1   Zane_Lee      OFFER     2018-06-10
z0qu      1   Zane_Lee      ACTIVE    2018-07-01
6yre      2   Julia_Qwerty  OFFER     2019-09-20
o4a1      2   Julia_Qwerty  ACTIVE    2019-10-01
8vgg      1   Zane_Lee      INACTIVE  2019-11-16
uolk      3   Luis_Dale     OFFER     2019-12-20

If we wanted to answer the question of “who were the active employees on January 1, 2020?”, our query might look like this:

SELECT *
FROM employee_events e
WHERE e.status = 'ACTIVE'
  AND e.date IN (
    SELECT MAX(date)
    FROM employee_events se
    WHERE se.date <= '2020-01-01' AND se.id = e.id
  );

This works, but there was a concern that performance could end up O(n^2), since the correlated subquery has to run for each row. We could optimize this a bit with a common table expression, which computes each employee’s latest date in a single pass and brings us down to roughly O(2n), i.e. linear.

WITH latest AS (
  SELECT se.id, MAX(date) AS max_date
  FROM employee_events se
  WHERE se.date <= '2020-01-01'
  GROUP BY se.id
)
SELECT *
FROM employee_events e
JOIN latest ON latest.id = e.id AND latest.max_date = e.date
WHERE e.status = 'ACTIVE';

Either of those two methods is workable for a one-off query, but the SELECT MAX() subquery still feels clunky to me.

Taking a step back, the only records we care about are the ones valid at that point in time. If we stored one more piece of information, we could eliminate this part of the query: the time at which each record becomes invalid.

event_id  id  name          status    date        date_end
3zx7      1   Zane_Lee      OFFER     2018-06-10  2018-07-01
z0qu      1   Zane_Lee      ACTIVE    2018-07-01  2019-11-16
6yre      2   Julia_Qwerty  OFFER     2019-09-20  2019-10-01
o4a1      2   Julia_Qwerty  ACTIVE    2019-10-01  null
8vgg      1   Zane_Lee      INACTIVE  2019-11-16  null
uolk      3   Luis_Dale     OFFER     2019-12-20  null

Then we could instead do:

SELECT *
FROM employee_events e
WHERE e.status = 'ACTIVE'
  AND e.date <= '2020-01-01'
  AND (e.date_end > '2020-01-01' OR e.date_end IS NULL);

Of course, this assumes that date windows can’t overlap! Clearly our events can’t know the future when they’re emitted, but we can do some post-processing to pretend they did. Luckily, people who actually know Redshift instantly pointed out that the LEAD function can pull in this information.

LEAD - window function that returns the value from the row at a given offset after the current row in the partition (the offset defaults to 1)
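
To make that concrete, here’s a quick sketch of what LEAD pulls in for Zane’s rows (id = 1) in our sample data; within his partition, ordered by date, each row picks up the next row’s date:

SELECT event_id, date,
  LEAD(date) OVER (PARTITION BY id ORDER BY date) AS date_end
FROM employee_events
WHERE id = 1
ORDER BY date;

event_id  date        date_end
3zx7      2018-06-10  2018-07-01
z0qu      2018-07-01  2019-11-16
8vgg      2019-11-16  null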

One way to handle this is to generate a view and populate it with this end date.

CREATE OR REPLACE VIEW employee_events_history AS
  SELECT *,
    LEAD(e.date) OVER (PARTITION BY e.id ORDER BY e.date) AS date_end
  FROM employee_events e;

So instead of querying the employee_events table, you could query our employee_events_history view. Since we already generate views for other reasons, this technique worked for us.
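
With the view in place, our original question needs only the simple range check from before, with no MAX() subquery:

SELECT *
FROM employee_events_history e
WHERE e.status = 'ACTIVE'
  AND e.date <= '2020-01-01'
  AND (e.date_end > '2020-01-01' OR e.date_end IS NULL);

On the sample data this returns just Julia_Qwerty’s active record, since Zane’s ACTIVE row was closed out on 2019-11-16.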

Other articles I went through:

* Working with Time-Series Data on Redshift - Some context for working with time data in Redshift, but honestly not that helpful for the point in time question.
* Historical records with PostgreSQL, temporal tables and SQL:2011 - An explicit example of adding support for temporal tables/point in time queries in Postgres.
* Database Design: A Point in Time Architecture - Not sure about this architecture overall, but the selection part was particularly relevant.
* Stack Overflow - Hat tip for the first reference (and direct link) I found for Snodgrass’ book.