-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Open
Labels
Description
Real life customer issues:
- The
is set
andis not set
) filters in the events explorer don't work, as those filters are translated intoproperties.thing is not null
, but properties are at most an ambiguous empty string''
(see below) - Using HogQL is complicated, as null access is not intuitive when dealing with properties, and you end up writing
properties.$browser != '' and properties.$browser is not null
just in case. No joy is sparked.
The problem:
- We want
null
orundefined
properties to be matched asproperties.bla IS NULL
in HogQL queries, notproperties.bla = ''
like now. - Our property extraction code is
replaceRegexpAll(JSONExtractRaw(event.properties, 'bla'), '^\"|\"$', '')
- It gets the raw JSON like
"bla"
or{}
and strips the quotes if some are present. - With this, all empty strings equal
""
, and all unset properties equal""
, and allnull
properties equal the string"null"
. - Numbers and booleans are returned as strings, and get converted later on.
The suggestion:
- Replace our property extraction code with
replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.properties, '$browser'), ''), 'null'), '^\"|\"$', '')
- This returns
NULL
if the property is actuallynull
orundefined
in the original JSON object. - This changed query does not introduce any extra property access that would slow down queries (e.g.
properties.bla != '' ? properties.bla : null
adds one more property access)
The new problem:
- This is already implemented for HogQL, with a fix possible for all old places that access properties. However...
- Once a property is materialized, it reverts back to the old behaviour, as we swap it out in queries.
- What's more, we have limited visibility and control into materialized properties. We can launch materialization only after SSH-ing into a running container and running a management task or python code in a shell.
- Materialisation comes with a backfill, that usually only backfills 90 days worth of data. After materialization, all history behind that point is lost to queries... if I'm reading the code correctly. Edit: Probably things get backfilled eventually when CH updates past partitions.
- To solve this, we need to rematerialize every column and offer some kind of migration path to make the system perform fast while rematerialization is happening.
Questions:
- Can we use Temporal for these long running jobs? Should we use async jobs?
- Do we want to build a "materialized columns" admin view with dials we can control? SSH-ing into a pod id dodgy.
- Could we use HogQL to select between the materialized and old property dynamically at query time?
- Do expressions like
if(timestamp > $when_started, mat_column, properties.column)
always extractproperties.column
, even if it's never used? Or do we need toUNION
select queries together to make this performant? - Does the system that adds new materialized columns every Saturday just lose old data in the process?
Suggestion:
- Build a new system to track and materialize columns using the new property access code.
- Add a postgres table to track the materialization start, backfill status, etc.
- Re-materialize all columns with this new system into non-conflicting property names (instead of
pmat_email
, call thempmat_new2_final_email
or something). - Swap out the systems in both HogQL and the old queries.
- Delete the old materialized columns