How to Aggregate an Archive Log’s Deltas into a Snapshot with SQL

A customer of my popular SQL training (which you should book!) has recently challenged me to optimise a hierarchical query that merges an archive log’s deltas in order to obtain a snapshot of some record at a given point in time. In this article, I will reproduce their problem statement in a simplified version and show how this can be done with SQL Server, using a few cool SQL features:
JSON (because that’s how they stored their deltas)
Window functions, which are so useful
CROSS APPLY, a cool alternative way to join dependent data
Aggregating CASE expressions
All of these are topics covered in the training, and all of them were immediately applicable to this problem statement.
The problem statement
This was their archive design. They designed for uncertainty, meaning that for some entities in their system, they did not know what kinds of attributes would be part of the entity in the future. Given their application design, users could even add their own custom attributes to an entity.
This kind of thing is typically solved with the EAV (Entity Attribute Value) model, a “workaround” to denormalise data sets in SQL databases in the event of such schema uncertainty.
EAV can be implemented in several ways:
Through classic SQL tables only
An example implementation is this:

CREATE TABLE eav_classic (
  entity_type     VARCHAR(100) NOT NULL,
  entity_id       BIGINT       NOT NULL,
  attribute_name  VARCHAR(100) NOT NULL,
  attribute_type  VARCHAR(100) NOT NULL,
  attribute_value VARCHAR(100) NULL,

  CONSTRAINT eav_classic_pk
    PRIMARY KEY (entity_type, entity_id, attribute_name)
);

The drawbacks of this non-normalised design are immediately obvious. Most specifically, there is no simple way to establish referential integrity. But this may be totally OK, especially for archive logs, and for smaller databases (Datomic does something similar).
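To illustrate how entities end up in such a table, here is what the Person example used later in this article might look like in the classic EAV encoding (a hypothetical sample, not from the customer's actual data):

INSERT INTO eav_classic (entity_type, entity_id,
  attribute_name, attribute_type, attribute_value)
VALUES ('Person', 1, 'first_name', 'string', 'John'),
       ('Person', 1, 'last_name',  'string', 'Doe'),
       ('Person', 1, 'age',        'number', '37');

Notice how every attribute value is squeezed into a VARCHAR column, with attribute_type hinting at how to interpret it.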
Through tables containing JSON or XML data
Whenever you have schema-on-read data, JSON or XML data types may be appropriate, so this is a perfectly valid alternative:

CREATE TABLE eav_json (
  entity_type VARCHAR(100)   NOT NULL,
  entity_id   BIGINT         NOT NULL,
  attributes  VARCHAR(10000) NOT NULL
    CHECK (ISJSON(attributes) = 1),

  CONSTRAINT eav_json_pk
    PRIMARY KEY (entity_type, entity_id)
);

If your database supports a JSON data type, you will obviously prefer that over the above emulation. For the rest of this article, I will use the JSON representation.
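For example, in PostgreSQL, the same table could be declared with a native JSONB column, which makes the ISJSON check unnecessary (a sketch for illustration only; the examples in this article stick to SQL Server):

CREATE TABLE eav_json (
  entity_type VARCHAR(100) NOT NULL,
  entity_id   BIGINT       NOT NULL,
  attributes  JSONB        NOT NULL,

  CONSTRAINT eav_json_pk
    PRIMARY KEY (entity_type, entity_id)
);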
Versioning the EAV table
Versioning data in an EAV model is much easier than in a normalised schema. We can just add a version number and/or timestamp to each record. In their case, something like this may make sense:

CREATE TABLE history (
  id          BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
  ts          DATETIME      NOT NULL,
  entity_type VARCHAR(100)  NOT NULL,
  entity_id   BIGINT        NOT NULL,
  delta       VARCHAR(8000) NOT NULL
    CHECK (ISJSON(delta) = 1)
);

INSERT INTO history (entity_type, entity_id, ts, delta)
VALUES ('Person', 1, '2000-01-01 00:00:00', '{"first_name": "John", "last_name": "Doe"}'),
       ('Person', 1, '2000-01-01 01:00:00', '{"age": 37}'),
       ('Person', 1, '2000-01-01 02:00:00', '{"age": 38}'),
       ('Person', 1, '2000-01-01 03:00:00', '{"city": "New York"}'),
       ('Person', 1, '2000-01-01 04:00:00', '{"city": "Zurich", "age": null}');

This table now contains a set of deltas applied to the Person entity with ID = 1. It corresponds to the following sequence of SQL statements on an ordinary entity:

INSERT INTO person (id, first_name, last_name)
VALUES (1, 'John', 'Doe');
UPDATE person SET age = 37 WHERE id = 1;
UPDATE person SET age = 38 WHERE id = 1;
UPDATE person SET city = 'New York' WHERE id = 1;
UPDATE person SET city = 'Zurich', age = NULL WHERE id = 1;

You could even see their hand-written log as a transaction log of the database system, kind of like what you can extract using products like Golden Gate or Debezium. If you think of the transaction log as an event stream, the RDBMS's current data representation is like a snapshot that you can get by applying any number of deltas to your tables.
Sometimes, you don't want to completely change your architecture and go full "event sourcing"; you just need this kind of log for a specific set of auditable entities. And for reasons like still supporting very old SQL Server versions, or supporting other databases as well, you may also choose not to use the SQL:2011 temporal table feature, which has been implemented in SQL Server 2016 and more recent versions.
With that out of the way…
How to access any arbitrary snapshot version?
When we visually process our HISTORY table, we can see that Person ID = 1 had the following values at any given time:

TIME      FIRST_NAME  LAST_NAME  AGE  CITY
------------------------------------------
00:00:00  John        Doe
01:00:00  John        Doe        37
02:00:00  John        Doe        38
03:00:00  John        Doe        38   New York
04:00:00  John        Doe             Zurich

Remember, this is always the same record of Person ID = 1; these are its snapshots at different points on the time axis. The goal here is to be able to find the record of John Doe at any given time.
Again, if we had been using the SQL:2011 temporal table feature, we could write

-- SQL Server
SELECT *
FROM person
FOR SYSTEM_TIME AS OF '2000-01-01 02:00:00.0000000';

-- Oracle (flashback query)
SELECT *
FROM person
AS OF TIMESTAMP TIMESTAMP '2000-01-01 02:00:00';

Side note: Do note that Oracle’s flashback query needs to be properly configured:
Not all data is “flashbackable”
DDL tends to destroy the archive
Proper grants are needed to access the flashback archive
Similar limitations may apply in SQL Server.
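For reference, if the feature is available to you, a system-versioned table in SQL Server 2016+ is declared roughly like this (a sketch only; the period column and history table names are illustrative, not from the original problem statement):

CREATE TABLE person (
  id         BIGINT NOT NULL PRIMARY KEY,
  first_name VARCHAR(100),
  last_name  VARCHAR(100),
  age        INT,
  city       VARCHAR(100),

  -- System-maintained period columns
  valid_from DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
  valid_to   DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
  PERIOD FOR SYSTEM_TIME (valid_from, valid_to)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.person_history));

SQL Server then maintains the history table automatically, and the FOR SYSTEM_TIME query from above works out of the box.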
What if the RDBMS can’t help us?
If, again for some reason, we cannot use the RDBMS's temporal table features, we'll roll our own as we've seen. So, our SQL Server query to access the snapshot at any given time may look like this:

SELECT
  '{'
+ string_agg(
    CASE type WHEN 0 THEN NULL ELSE
      '"' + [key] + '": ' +
      CASE type WHEN 1 THEN '"' + value + '"' ELSE value END
    END, ', ')
+ '}'
FROM (
  SELECT *, row_number() OVER (
    PARTITION BY [key] ORDER BY ts DESC) rn
  FROM history
  OUTER APPLY openjson(delta)

  -- Apply all deltas prior to any given snapshot
  WHERE ts <= '2000-01-01 02:00:00'
) t

-- Keep only the latest delta per attribute key
WHERE rn = 1;
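To see what the derived table produces before aggregation, the OUTER APPLY step can be run on its own (a minimal sketch; OPENJSON's type column encodes 0 for null, 1 for string, and 2 for number):

SELECT ts, [key], value, type
FROM history
OUTER APPLY openjson(delta)

-- Same cut-off as in the snapshot query
WHERE ts <= '2000-01-01 02:00:00'
ORDER BY ts, [key];

This yields one (ts, key, value, type) row per attribute per delta: first_name and last_name from the first delta, then age = 37, then age = 38. The ROW_NUMBER() window function ranks each key's rows newest-first, the outer WHERE rn = 1 keeps only the most recent delta per key, and the aggregating CASE re-serialises the survivors into a JSON document, skipping keys whose latest value is JSON null (type 0). For the 02:00:00 snapshot, the query should therefore return something like the following (attribute order is not guaranteed by STRING_AGG without a WITHIN GROUP clause):

{"first_name": "John", "last_name": "Doe", "age": 38}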
