
RFC: HogQL null access and re-materialising all columns #15474

@mariusandra

Description


Real life customer issues:

  1. The "is set" and "is not set" filters in the events explorer don't work, as those filters are translated into properties.thing is not null, but an unset property is at best an ambiguous empty string '' (see below)
  2. Using HogQL is complicated, as null access is not intuitive when dealing with properties: you end up writing properties.$browser != '' and properties.$browser is not null just in case. No joy is sparked.

The problem:

  • We want null or undefined properties to be matched as properties.bla IS NULL in HogQL queries, not properties.bla = '' like now.
  • Our property extraction code is replaceRegexpAll(JSONExtractRaw(event.properties, 'bla'), '^\"|\"$', '')
  • It gets the raw JSON value, like "bla" or {}, and strips surrounding quotes if present.
  • With this, all empty strings become '', all unset properties also become '', and all null properties become the string 'null'.
  • Numbers and booleans are returned as strings, and get converted later on.
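To make the ambiguity concrete, here is a Python sketch (a toy model, not the production code) that mimics the current extraction expression:

```python
import json
import re

def extract_raw(properties_json: str, key: str) -> str:
    """Approximates ClickHouse's JSONExtractRaw: returns the raw JSON text
    of the value, or '' when the key is absent."""
    obj = json.loads(properties_json)
    return json.dumps(obj[key]) if key in obj else ''

def old_extract(properties_json: str, key: str) -> str:
    # replaceRegexpAll(JSONExtractRaw(event.properties, 'bla'), '^\"|\"$', '')
    return re.sub(r'^"|"$', '', extract_raw(properties_json, key))

old_extract('{"bla": ""}', 'bla')    # → '' (genuinely empty)
old_extract('{}', 'bla')             # → '' (unset: indistinguishable from empty)
old_extract('{"bla": null}', 'bla')  # → 'null' (the string, not SQL NULL)
```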

The suggestion:

  • Replace our property extraction code with replaceRegexpAll(nullIf(nullIf(JSONExtractRaw(events.properties, '$browser'), ''), 'null'), '^\"|\"$', '')
  • This returns NULL if the property is actually null or undefined in the original JSON object.
  • This changed query does not introduce any extra property access that would slow down queries (e.g. properties.bla != '' ? properties.bla : null adds one more property access)
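Under the same toy model, the proposed expression behaves like this sketch: only missing or JSON-null values become NULL (None here), while real empty strings survive as '':

```python
import json
import re

def new_extract(properties_json: str, key: str):
    """Approximates replaceRegexpAll(nullIf(nullIf(
    JSONExtractRaw(events.properties, '$browser'), ''), 'null'), '^\"|\"$', '')."""
    obj = json.loads(properties_json)
    raw = json.dumps(obj[key]) if key in obj else ''  # JSONExtractRaw: '' when unset
    if raw in ('', 'null'):  # nullIf(nullIf(raw, ''), 'null') -> NULL
        return None
    return re.sub(r'^"|"$', '', raw)

new_extract('{}', '$browser')                     # → None (unset)
new_extract('{"$browser": null}', '$browser')     # → None (JSON null)
new_extract('{"$browser": ""}', '$browser')       # → '' (real empty string)
new_extract('{"$browser": "Safari"}', '$browser') # → 'Safari'
```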

The new problem:

  • This is already implemented for HogQL, with a fix possible for all old places that access properties. However...
  • Once a property is materialized, it reverts to the old behaviour, as we swap the expression out for the materialized column in queries.
  • What's more, we have limited visibility into, and control over, materialized properties. We can launch materialization only by SSH-ing into a running container and running a management task or Python code in a shell.
  • Materialisation comes with a backfill that usually only backfills 90 days' worth of data. After materialization, all history before that point is lost to queries... if I'm reading the code correctly. Edit: things probably get backfilled eventually when ClickHouse updates past partitions.
  • To solve this, we need to rematerialize every column and offer some kind of migration path to make the system perform fast while rematerialization is happening.

Questions:

  • Can we use Temporal for these long running jobs? Should we use async jobs?
  • Do we want to build a "materialized columns" admin view with dials we can control? SSH-ing into a pod is dodgy.
  • Could we use HogQL to select between the materialized and old property dynamically at query time?
  • Do expressions like if(timestamp > $when_started, mat_column, properties.column) always extract properties.column, even if it's never used? Or do we need to UNION select queries together to make this performant?
  • Does the system that adds new materialized columns every Saturday just lose old data in the process?
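For the query-time fallback question, the intended row-level semantics might look like this sketch (when_started and the column names are hypothetical; whether ClickHouse short-circuits the unused branch per row is exactly the open question above):

```python
import json
from datetime import datetime

def select_property(row: dict, when_started: datetime):
    """Sketch of if(timestamp > $when_started, mat_column, properties.column):
    rows ingested after materialization began read the materialized column;
    older rows fall back to extracting from the raw properties JSON."""
    if row['timestamp'] > when_started:
        return row['mat_column']
    return json.loads(row['properties']).get('column')
```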

Suggestion:

  • Build a new system to track and materialize columns using the new property access code.
  • Add a postgres table to track the materialization start, backfill status, etc.
  • Re-materialize all columns with this new system into non-conflicting property names (instead of pmat_email, call them pmat_new2_final_email or something).
  • Swap out the systems in both HogQL and the old queries.
  • Delete the old materialized columns.
