feat(hogql): symbol resolution #14185

mariusandra · 2023-02-09T23:33:35Z

Problem

PostHog/meta#81

Changes

Implements symbol resolution (makes sure every part of a query resolves to some value)
Adds JOIN and alias support.

What now?

Resolver

This PR implements a separate "resolver" step, which runs after the "parser".

The flow is now:

HogQL string -> Parser -> AST -> Resolver -> AST with symbols -> Printer -> ClickHouse string

In the future we'll add more steps. I already know of two more: 1) adding joins for person properties if they live on a different table, and 2) adding the team_id guard (currently done in the printer, but it should be a separate step).
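To make the flow concrete, here is a minimal sketch of the pipeline as three composed steps. Everything in it is a toy: `parse`, `resolve`, `print_sql`, `translate` and the `Field` class are hypothetical stand-ins, not the actual functions in posthog/hogql, and the "parser" only understands `SELECT a, b`.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Field:
    """Toy AST node. `symbol` stays None until the resolver has run."""
    name: str
    symbol: Optional[str] = None


def parse(query: str) -> List[Field]:
    # Toy parser (hypothetical): turns "SELECT a, b" into [Field("a"), Field("b")].
    prefix = "SELECT "
    if not query.startswith(prefix):
        raise ValueError("toy parser only understands 'SELECT a, b'")
    return [Field(name=part.strip()) for part in query[len(prefix):].split(",")]


def resolve(
    fields: List[Field],
    table: str = "events",
    columns: Sequence[str] = ("event", "timestamp"),
) -> List[Field]:
    # Resolver step: attach a meaning to every field, or fail loudly.
    for field in fields:
        if field.name not in columns:
            raise ValueError(f"Unable to resolve field: {field.name}")
        field.symbol = f"{table}.{field.name}"
    return fields


def print_sql(fields: List[Field], table: str = "events") -> str:
    # Printer step: refuses to print anything the resolver has not seen.
    unresolved = [field.name for field in fields if field.symbol is None]
    if unresolved:
        raise ValueError(f"Unresolved fields: {unresolved}")
    return "SELECT " + ", ".join(str(field.symbol) for field in fields) + f" FROM {table}"


def translate(query: str) -> str:
    # string -> Parser -> AST -> Resolver -> AST with symbols -> Printer -> string
    return print_sql(resolve(parse(query)))
```

Adding a future step (for example a separate team_id guard pass) would then mean inserting one more function between `resolve` and `print_sql`.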

Symbols

Symbols are a parallel data structure to the AST nodes, which link to the actual meaning behind each field. Imagine the query:

SELECT distinct_id, e.timestamp, pdi.person_id 
FROM (select event, timestamp from events) e 
LEFT JOIN person_distinct_id pdi ON pdi.distinct_id = e.distinct_id

Symbol resolution lets us know that distinct_id at the beginning of this SELECT clause unambiguously refers to the distinct_id field on person_distinct_id, and not the one from the subquery on events, as the subquery doesn't export this field as a column.

That's what this PR implements. The above query is a valid HogQL query that works. If you ran it, we'd also add a team_id guard on each of these joined tables before printing out the ClickHouse SQL.

We assign a class that inherits from ast.Symbol to each field, property, alias, subquery or table node. Each SELECT query is its own lexical scope, and can't access values defined in other queries, except via JOINs.
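As a rough illustration of the scoping rule, the sketch below resolves an unqualified field name against the tables visible in one SELECT scope. It is not the resolver from this PR: `resolve_field`, `ResolverError` and the plain dict used as a scope are hypothetical, and raising on an ambiguous name is an assumption about the desired behaviour.

```python
from typing import Dict, Set


class ResolverError(Exception):
    """Raised when a name cannot be tied to exactly one table in scope."""


def resolve_field(name: str, scope: Dict[str, Set[str]]) -> str:
    # `scope` maps each table alias visible in this SELECT to the columns it exports.
    owners = sorted(alias for alias, columns in scope.items() if name in columns)
    if not owners:
        raise ResolverError(f"Unable to resolve field: {name}")
    if len(owners) > 1:
        raise ResolverError(f"Ambiguous field {name}, found in: {', '.join(owners)}")
    return f"{owners[0]}.{name}"


# Scope for the example query: the subquery `e` only exports event and timestamp,
# while `pdi` (person_distinct_id) exports distinct_id and person_id.
example_scope = {
    "e": {"event", "timestamp"},
    "pdi": {"distinct_id", "person_id"},
}
```

With this scope, `resolve_field("distinct_id", example_scope)` picks the `pdi` table, because the subquery does not export a column with that name.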

ClickHouse's scope semantics are hard and loose. I followed them to the best of my ability, and simplified down to the least ambiguity when possible. The biggest difference is that in HogQL aliases can't redefine other aliases with the same name when in the scope of the same select query. ClickHouse would consider select ((1 as a) as b) as a; valid code, but HogQL does not.
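The alias rule can be sketched as a per-query scope that refuses to register the same alias twice. This is a hypothetical illustration only; `Scope`, `declare_alias` and the error message are made up for the example and are not the names used in the resolver.

```python
from typing import Dict


class ResolverError(Exception):
    """Raised when a query breaks the scoping rules."""


class Scope:
    """Aliases declared inside a single SELECT query (hypothetical helper)."""

    def __init__(self) -> None:
        self.aliases: Dict[str, str] = {}

    def declare_alias(self, name: str, expr: str) -> None:
        # Unlike ClickHouse, reject a second alias with the same name in one scope.
        if name in self.aliases:
            raise ResolverError(f"Cannot redefine an alias with the name: {name}")
        self.aliases[name] = expr


def declare_all(scope: Scope, pairs) -> None:
    # Register aliases in the order the resolver would meet them, innermost first.
    for name, expr in pairs:
        scope.declare_alias(name, expr)
```

For `select ((1 as a) as b) as a`, the aliases arrive as `a`, `b`, `a`, so the third declaration fails.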

Other changes

Implements a representation of queryable database tables via a new set of classes in hogql/database.py
Implements joins and queries from these other tables
Implements symbol classes that chain each other (PropertySymbol -> FieldSymbol -> TableAliasSymbol -> TableSymbol -> Table -> Database)
Adds a team_id guard on each of those tables in the printer
Fixes JOIN parsing (some stuff was in reverse)
Splits the EverythingVisitor into TraversingVisitor and CloningVisitor. These are the classes to inherit from when designing your own simple visitors.
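To show the difference between the two visitor flavours, here is a toy reimplementation over a three-node AST. The class names mirror the ones above, but the node types, the `visit_*` dispatch and the two example subclasses are assumptions made for this sketch, not the code in posthog/hogql.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Constant:
    value: int


@dataclass
class Field:
    name: str


@dataclass
class Add:
    left: "Node"
    right: "Node"


Node = Union[Constant, Field, Add]


class TraversingVisitor:
    """Walks the tree in place and returns nothing; subclasses collect side effects."""

    def visit(self, node: Node):
        return getattr(self, f"visit_{type(node).__name__.lower()}")(node)

    def visit_constant(self, node: Constant) -> None:
        pass

    def visit_field(self, node: Field) -> None:
        pass

    def visit_add(self, node: Add) -> None:
        self.visit(node.left)
        self.visit(node.right)


class CloningVisitor:
    """Rebuilds the tree node by node; subclasses swap out the nodes they care about."""

    def visit(self, node: Node) -> Node:
        return getattr(self, f"visit_{type(node).__name__.lower()}")(node)

    def visit_constant(self, node: Constant) -> Node:
        return Constant(node.value)

    def visit_field(self, node: Field) -> Node:
        return Field(node.name)

    def visit_add(self, node: Add) -> Node:
        return Add(self.visit(node.left), self.visit(node.right))


class FieldCollector(TraversingVisitor):
    """Read-only pass: records every field name it encounters."""

    def __init__(self) -> None:
        self.names: List[str] = []

    def visit_field(self, node: Field) -> None:
        self.names.append(node.name)


class FieldRenamer(CloningVisitor):
    """Rewriting pass: returns a new tree with some fields renamed."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self.mapping = mapping

    def visit_field(self, node: Field) -> Node:
        return Field(self.mapping.get(node.name, node.name))
```

A read-only pass such as a resolver-style check fits the traversing flavour, while anything that rewrites the query fits the cloning one, since the original tree is left untouched.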

Coming up in next PRs (out of scope)

Array joins, window functions, union all, and the rest of ClickHouse SQL.
Automatic join for person or group property tables
UI where you can actually run these queries and see results. It's all in a test file now.
A lot of security and sanity checking
Fixing the "select countDistinct()" ANSI SQL discrepancy
HogQL & Data Exploration next steps meta#81

How did you test this code?

Wrote many tests.

posthog/hogql/ast.py

Twixes

Nothing stands out as off to me here. Although I don't feel entirely confident I've grasped everything here – I could definitely use a walkthrough. 😄

posthog/hogql/ast.py

posthog/hogql/test/test_resolver.py

posthog/hogql/printer.py

posthog/hogql/test/test_printer.py

neilkakkar · 2023-02-17T14:03:09Z

+1 on walkthrough (and not just this PR, but a few previous ones that this one builds on top of lol, if possible) - seems I blinked away and the entire backend changed 🙈 - having a hard time building a model of what's new and how this fits into existing queries 🤔.

Appreciate this can be a lot of effort, so I'll leave the decision to do one or not up to you. (maybe once an existing insight uses this, so I understand the entire flow?). Right now this feels too complicated, not sure if we can freely optimise queries and have this at the same time.

neilkakkar

ok I need a break, this is taking a while, will be back soon.

One blocking comment, but otherwise looking great! Nice work!

posthog/hogql/ast.py

neilkakkar · 2023-02-21T10:44:33Z

posthog/hogql/ast.py



 class Expr(AST):
-    pass
+    symbol: Optional[Symbol]


this parallel data structure seems very awkward to me.

Since we're dealing with steps in a pipeline, and we have a hard requirement to call Resolver() in the pipeline, and in the end everything would probably have its own symbol, is there a reason not to transform Expressions into Symbols or something similar, like:

class Symbol(Expr), vs what we have.

Basically, I'm saying we should go for inheritance here over composition. And then everything that follows in the pipeline should operate on Symbol, SelectQuerySymbol, etc. etc. vs the raw nodes. For now, it will probably be a union of the two, but in the end we can strongly type it to be one. (which is great for enforcing resolver has run on the entire tree).

This also makes it one uniform structure rather than two parallel ones.

Maybe a question that informs this decision:

class And(Expr) vs class And(Symbol). (with maybe better names).

Or, an Expr can be what is today's Symbol.

ok this might make the visitor too confusing 😅 .. .. maybe..

Well, this is definitely doable, but we'd likely be doing ourselves a disservice. A few steps down the development pipeline is "type resolution", for which every node will need to have a symbol. Every expr will have one. Every mathematical operation will have one, every function will return one, every constant will have one (already has), etc. Parsing and traversing that soup is going to be the stuff of nightmares 😅.

I didn't really invent this myself though. I'm somewhat borrowing from TypeScript's API, which also connects a symbol to each expression.

Those symbols are the window into the meaning behind the values, including their types, and where they were initialised.

Sweet, having prior art makes me a lot more comfortable with this :D

neilkakkar · 2023-02-21T10:59:28Z

posthog/hogql/ast.py

+
+class FieldSymbol(Symbol):
+    name: str
+    table: Union[TableSymbol, TableAliasSymbol, SelectQuerySymbol, SelectQueryAliasSymbol]


it seems very easy to generate symbols without the requisite data.

Think it's fine for now, but we should enforce symbol creation with all required components (i.e. have an init that disallows passing nones for all these attributes). Should simplify rest of the things where we right now have to do error handling to check whether they exist or not.

I think SelectQuerySymbol was where I found this to be worrisome - noticed the empty creation in the resolver I think.
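One way to get that guarantee, sketched with plain frozen dataclasses (an assumption; the real symbol classes may be built differently): fields without defaults make omission a TypeError, and a `__post_init__` check rejects explicit Nones. The simplified `TableSymbol` and `FieldSymbol` below only borrow the names from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSymbol:
    table_name: str

    def __post_init__(self) -> None:
        if not self.table_name:
            raise ValueError("TableSymbol requires a table name")


@dataclass(frozen=True)
class FieldSymbol:
    # No defaults: leaving either attribute out fails at construction time.
    name: str
    table: TableSymbol

    def __post_init__(self) -> None:
        # Explicit Nones are rejected too, so later steps never need to check.
        if self.name is None or self.table is None:
            raise ValueError("FieldSymbol requires both a name and a table")


def describe(symbol: FieldSymbol) -> str:
    # Downstream code can follow the chain without any None handling.
    return f"{symbol.table.table_name}.{symbol.name}"
```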

posthog/hogql/hogql.py

posthog/hogql/printer.py

neilkakkar · 2023-02-21T11:32:45Z

posthog/hogql/printer.py

        if node.op == ast.CompareOperationType.Eq:
            if isinstance(node.right, ast.Constant) and node.right.value is None:
-                response = f"isNull({left})"
+                return f"isNull({left})"


ooo tricky, special handling for nulls because the user did something that probably wouldn't work?

iirc it's not invalid ClickHouse SQL to have equals(NULL, 1)

Hmm... I have nothing against splitting this back to a separate "IsNull" comparison. This basically came out of a property filter. The logic is that to check if anything is or isn't null, you can't just == null or != null you always have to change it to is or is not. This effectively makes comparing with null work.

But, that might not be what's expected, so let's revisit. I'll add it to the list of checkboxes.
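For reference, the behaviour being discussed boils down to something like the sketch below. `print_constant` and `print_eq` are hypothetical helpers, not the printer's actual methods, and mapping `!= null` to `isNotNull` is an assumption based on the description above. `isNull`, `isNotNull`, `equals` and `notEquals` are ClickHouse functions.

```python
from typing import Union

Value = Union[str, int, None]


def print_constant(value: Value) -> str:
    # Minimal literal printer, for illustration only.
    if value is None:
        return "NULL"
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


def print_eq(left_sql: str, right: Value, negated: bool = False) -> str:
    # Comparing with a NULL constant is rewritten to an explicit null check,
    # because `x = NULL` evaluates to NULL rather than true or false.
    if right is None:
        return f"isNotNull({left_sql})" if negated else f"isNull({left_sql})"
    function = "notEquals" if negated else "equals"
    return f"{function}({left_sql}, {print_constant(right)})"
```

Splitting this out into a dedicated "IsNull" comparison, as suggested, would move the `right is None` branch out of the equality printer entirely.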

posthog/hogql/printer.py

neilkakkar · 2023-02-21T11:36:15Z

posthog/hogql/printer.py

            elif node.name == "countDistinct":
-                response = f"count(distinct {translated_args})"
+                return f"count(distinct {translated_args})"


not a big fan of these translations, why do we have them?

We're going to get rid of them, but outside of this PR 😅

PostHog/meta#81

[ ] Clean up the confusion with our bespoke countDistinct() function

They're leftovers from the Python AST days. I want to support native ANSI-compliant SQL here.

neilkakkar · 2023-02-21T11:39:11Z

posthog/hogql/printer.py

+
+            field_sql = self._print_identifier(resolved_field.name)
+
+            # :KLUDGE: Legacy person properties handling. Assume we're in a context where the tables have been joined,


I'm not very confident here, but will not block because we'll eventually test all of this with our existing clickhouse insight tests 😬 .

This feels pretty complicated / hard to get right, but don't have a better option on my mind yet, trying to use the existing ColumnOptimizer seems even harder here..

Well, we have a test everywhere you can use a HogQL property filter (trends, funnels, global filters on all insights, events list, session recordings list, etc) to make sure it works with both person.properties.email == 'something' and properties.something == 'bla'. Those tests are run with materialised properties on and off for those columns, and with person-on-events both on and off. So far they're green. 😅

This all is a big kludge, and I'm now working on a true fix in the context of https://github.com/PostHog/posthog/pull/14286/files

neilkakkar

🚀

mariusandra · 2023-02-21T15:55:33Z

Thank you @neilkakkar for taking the time to go through all of this!

Let's get it in and 🚀 on...

mariusandra and others added 16 commits February 8, 2023 21:46

feat(hogql): select statements

7e44195

visitor

a87115f

cleanup

c55f170

parse limit by

d47bc61

parse limit by

839d505

merge limit clauses

fef9463

Update snapshots

9cd96bc

fix placeholders

2d0d1e8

resolve symbols for events table

5064dad

resolve aliases

62dd464

refactor column and table aliases

6c04d20

column resolver

c550147

make sure some things error

2253f41

annotate

a006872

constants

b7b5521

simple sql query

60376dd

mariusandra changed the base branch from master to hogql-further-improvements February 9, 2023 23:33

github-actions bot and others added 7 commits February 9, 2023 23:39

Update snapshots

9683e92

introduce "print name"

ac3048e

visit_unknown

86ddae5

basic printer via a visitor

c48024c

completely redo printing

afe485d

Merge branch 'hogql-symbol-resolution' of github.com:PostHog/posthog …

761a398

…into hogql-symbol-resolution

Merge branch 'master' into hogql-further-improvements

fb35bc7

mariusandra mentioned this pull request Feb 13, 2023

feat(hogql): better error if placeholder in HogQL expression #14153

Merged

Merge branch 'hogql-further-improvements' into hogql-symbol-resolution

8af1c79

Base automatically changed from hogql-further-improvements to master February 13, 2023 14:37

mariusandra added 3 commits February 13, 2023 15:39

some sample queries

076916e

Merge branch 'master' into hogql-symbol-resolution

f1cbd94

query tests

fd271dd

Twixes reviewed Feb 16, 2023

posthog/hogql/ast.py (outdated)

Twixes reviewed Feb 17, 2023

posthog/hogql/ast.py (outdated)

posthog/hogql/test/test_resolver.py (outdated)

posthog/hogql/printer.py (outdated)

posthog/hogql/test/test_printer.py (outdated)

mariusandra and others added 9 commits February 20, 2023 23:05

Merge branch 'master' into hogql-symbol-resolution

2f7d3df

asterisk and obelisk

947e7bf

yeet

74bf09c

class is for internal use only

d8eb091

fix aliases

9f1ae92

not needed

c767159

fix a few fields

e496907

Update snapshots

f883dbf

Update snapshots

544c3e2

mariusandra mentioned this pull request Feb 21, 2023

feat(hogql): asterisk expander #14271

Merged

neilkakkar reviewed Feb 21, 2023

mariusandra added 9 commits February 21, 2023 14:42

join the 21st century

f8acbc3

typo

bfa4068

simplify translate_hogql function

57d8e31

unblock prewhere

66f8364

postwhere

562afbc

too many spaces

09e3732

rename

cd1ae6c

improve comment

b02ff65

let's get it right

e0d8e6e

neilkakkar approved these changes Feb 21, 2023

mariusandra merged commit 5345975 into master Feb 21, 2023

mariusandra deleted the hogql-symbol-resolution branch February 21, 2023 15:55

mariusandra mentioned this pull request Jun 7, 2023

Sprint June 12 to June 23 #15932

Closed

mariusandra mentioned this pull request Jun 21, 2023

Sprint June 26 to July 7 #16168

Closed

thmsobrmlr mentioned this pull request Jul 5, 2023

Sprint July 10 to July 21 #16382

Closed

