-
Notifications
You must be signed in to change notification settings - Fork 1.9k
feat(hogql): symbol resolution #14185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…into hogql-symbol-resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing stands out as off to me here. Although I don't feel entirely confident I've grasped everything here – I could definitely use a walkthrough. 😄
+1 on walkthrough (and not just this PR, but a few previous ones that this one builds on top of lol, if possible) - seems I blinked away and the entire backend changed 🙈 - having a hard time building a model of what's new and how this fits into existing queries 🤔. Appreciate this can be a lot of effort, so I'll leave the decision to do one or not up to you. (maybe once an existing insight uses this, so I understand the entire flow?). Right now this feels too complicated, not sure if we can freely optimise queries and have this at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I need a break, this is taking a while, will be back soon.
One blocking comment, but otherwise looking great! Nice work!
|
||
|
||
class Expr(AST): | ||
pass | ||
symbol: Optional[Symbol] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this parallel data structure seems very awkward to me.
Since we're dealing with steps in a pipeline, and we have a hard requirement to call Resolver()
in the pipeline, and in the end everything would probably have its own symbol, is there a reason not to transform Expressions into Symbol
s or something similar, like:
class Symbol(Expr)
, vs what we have.
Basically, I'm saying we should go for inheritance here over composition. And then everything that follows in the pipeline should operate on Symbol
, SelectQuerySymbol
, etc. etc. vs the raw nodes. For now, it will probably be a union of the two, but in the end we can strongly type it to be one. (which is great for enforcing resolver has run on the entire tree).
This also makes it one uniform structure rather than two parallel ones.
Maybe a question that informs this decision:
class And(Expr)
vs class And(Symbol)
. (with maybe better names).
Or, an Expr
can be what is today's Symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this might make the visitor too confusing 😅 .. .. maybe..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this is definitely doable, but we'd likely be doing ourselves a disservice. A few steps down the development pipeline is "type resolution", for which every node will need to have a symbol. Every expr
will have one. Every mathematical operation will have one, every function will return one, every constant will have one (already has), etc. Parsing and traversing that soup is going to be the stuff of nightmares 😅.
I didn't really invent this myself though. I'm somewhat borrowing from TypeScript's API, which also connects a symbol to each expression.
Those symbols are the window into the meaning behind the values, including their types, and where they were initialised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet, having prior art makes me a lot more comfortable with this :D
|
||
class FieldSymbol(Symbol): | ||
name: str | ||
table: Union[TableSymbol, TableAliasSymbol, SelectQuerySymbol, SelectQueryAliasSymbol] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems very easy to generate symbols without the requisite data.
Think it's fine for now, but we should enforce symbol creation with all required components (i.e. have an init that disallows passing nones for all these attributes). Should simplify rest of the things where we right now have to do error handling to check whether they exist or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think SelectQuerySymbol
was where I found this to be worrysome - noticed the empty creation in the resolver I think.
if node.op == ast.CompareOperationType.Eq: | ||
if isinstance(node.right, ast.Constant) and node.right.value is None: | ||
response = f"isNull({left})" | ||
return f"isNull({left})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ooo tricky, special handling for nulls because the user did something that probably wouldn't work?
iirc it's not invalid Clickhouse SQL to have equals(NULL, 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I have nothing against splitting this back to a separate "IsNull" comparison. This basically came out of a property filter. The logic is that to check if anything is or isn't null, you can't just == null
or != null
you always have to change it to is
or is not
. This effectively makes comparing with null work.
But, that might not be what's expected, so let's revisit. I'll add it to the list of checkboxes.
elif node.name == "countDistinct": | ||
response = f"count(distinct {translated_args})" | ||
return f"count(distinct {translated_args})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a big fan of these translations, why do we have them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going to get rid of them, but outside of this PR 😅
[ ] Clean up the confusion with our bespoke countDistinct() function
They're leftovers from the Python AST days. I want to support native ANSI-compliant SQL here.
|
||
field_sql = self._print_identifier(resolved_field.name) | ||
|
||
# :KLUDGE: Legacy person properties handling. Assume we're in a context where the tables have been joined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very confident here, but will not block because we'll eventually test all of this with our existing clickhouse insight tests 😬 .
This feels pretty complicated / hard to get right, but don't have a better option on my mind yet, trying to use the existing ColumnOptimizer seems even harder here..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we have a test everywhere you can use HogQL property filter (trends, funnels, global filters on all insights, events list, session recordings list, etc) to make sure it works with both person.properties.email == 'something'
and properties.something == 'bla'
. Those test are run with materialised properties on and off for those columns, and with person-on-events both on and off. So far they're green. 😅
This all is a big kludge, and I'm now working on a true fix in the context of https://github.com/PostHog/posthog/pull/14286/files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
Thank you @neilkakkar for taking the time to go through all of this! Let's get it in and 🚀 on... |
Problem
PostHog/meta#81
Changes
What now?
Resolver
This PR implements a separate "resolver" step, which run after the "parser".
The flow is now:
In the future we'll add more steps. I already know of two more: 1) adds joins for person properties if on a different table, 2) adds the team_id guard (now done in the printer, should be a separate step)
Symbols
Symbols are a parallel data structure to the AST nodes, which link to the actual meaning behind each field. Imagine the query:
Symbol resolution lets us know that
distinct_id
in the beginning of this SELECT clause unambiguously refers to thedistinct_id
field onperson_distinct_id
, and not the one from the subquery on events, as it doesn't export this field as a column.That's what this PR implements. The above query is a valid HogQL query that works. If you'd run it, we'd also add a
team_id
guards on each of these joined tables before printing out the ClickHouse SQL.We assign a class that inherits from
ast.Symbol
to each field, property, alias, subquery or table node. Each SELECT query is its own lexical scope, and can't access values defined in other queries, except via JOINs.ClickHouse's scope semantics are hard and loose. I followed them to the best of my ability, and simplified down to the least ambiguity when possible. The biggest difference is that in HogQL aliases can't redefine other aliases with the same name when in the scope of the same select query. ClickHouse would consider
select ((1 as a) as b) as a;
valid code, but HogQL does not.Other changes
hogql/database.py
EverythingVisitor
intoTraversingVisitor
andCloningVisitor
. These are the classes to inherit from when designing your own simple visitors.Coming up in next PRs (out of scope)
How did you test this code?
Wrote many tests.