Skip to content

Conversation

mariusandra
Copy link
Collaborator

@mariusandra mariusandra commented Feb 9, 2023

Problem

PostHog/meta#81

Changes

  • Implements symbol resolution (makes sure every part of a query resolves to some value)
  • Adds JOIN and alias support.

What now?

Resolver

This PR implements a separate "resolver" step, which run after the "parser".

The flow is now:

HogQL string -> Parser -> AST -> Resolver -> AST with symbols -> Printer -> ClickHouse string

In the future we'll add more steps. I already know of two more: 1) adds joins for person properties if on a different table, 2) adds the team_id guard (now done in the printer, should be a separate step)

Symbols

Symbols are a parallel data structure to the AST nodes, which link to the actual meaning behind each field. Imagine the query:

SELECT distinct_id, e.timestamp, pdi.person_id 
FROM (select event, timestamp from events) e 
LEFT JOIN person_distinct_id pdi ON pdi.distinct_id = e.distinct_id 

Symbol resolution lets us know that distinct_id in the beginning of this SELECT clause unambiguously refers to the distinct_id field on person_distinct_id, and not the one from the subquery on events, as it doesn't export this field as a column.

That's what this PR implements. The above query is a valid HogQL query that works. If you'd run it, we'd also add a
team_id guards on each of these joined tables before printing out the ClickHouse SQL.

We assign a class that inherits from ast.Symbol to each field, property, alias, subquery or table node. Each SELECT query is its own lexical scope, and can't access values defined in other queries, except via JOINs.

ClickHouse's scope semantics are hard and loose. I followed them to the best of my ability, and simplified down to the least ambiguity when possible. The biggest difference is that in HogQL aliases can't redefine other aliases with the same name when in the scope of the same select query. ClickHouse would consider select ((1 as a) as b) as a; valid code, but HogQL does not.

Other changes

  • Implements a representation of queriable database tables via a new set of classes in hogql/database.py
  • Implements joins and queries from these other tables
  • Implements symbol classes that chain each other (PropertySymbol -> FieldSymbol -> TableAliasSymbol -> TableSymbol -> Table -> Database)
  • Adds a team_id guard on each of those tables in the printer
  • Fixes JOIN parsing (some stuff was in reverse)
  • Splits the EverythingVisitor into TraversingVisitor and CloningVisitor. These are the classes to inherit from when designing your own simple visitors.

Coming up in next PRs (out of scope)

  • Array joins, window functions, union all, and the rest of ClickHouse SQL.
  • Automatic join for person or group property tables
  • UI where you can actually run these queries and see results. It's all in a test file now.
  • A lot of security and sanity checking
  • Fixing the "select countDistinct()" ANSI SQL discrepancy
  • HogQL & Data Exploration next steps meta#81

How did you test this code?

Wrote many tests.

@mariusandra mariusandra changed the base branch from master to hogql-further-improvements February 9, 2023 23:33
Base automatically changed from hogql-further-improvements to master February 13, 2023 14:37
Copy link
Member

@Twixes Twixes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing stands out as off to me here. Although I don't feel entirely confident I've grasped everything here – I could definitely use a walkthrough. 😄

@neilkakkar
Copy link
Contributor

+1 on walkthrough (and not just this PR, but a few previous ones that this one builds on top of lol, if possible) - seems I blinked away and the entire backend changed 🙈 - having a hard time building a model of what's new and how this fits into existing queries 🤔.

Appreciate this can be a lot of effort, so I'll leave the decision to do one or not up to you. (maybe once an existing insight uses this, so I understand the entire flow?). Right now this feels too complicated, not sure if we can freely optimise queries and have this at the same time.

Copy link
Contributor

@neilkakkar neilkakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I need a break, this is taking a while, will be back soon.

One blocking comment, but otherwise looking great! Nice work!



class Expr(AST):
pass
symbol: Optional[Symbol]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this parallel data structure seems very awkward to me.

Since we're dealing with steps in a pipeline, and we have a hard requirement to call Resolver() in the pipeline, and in the end everything would probably have its own symbol, is there a reason not to transform Expressions into Symbols or something similar, like:

class Symbol(Expr), vs what we have.

Basically, I'm saying we should go for inheritance here over composition. And then everything that follows in the pipeline should operate on Symbol, SelectQuerySymbol, etc. etc. vs the raw nodes. For now, it will probably be a union of the two, but in the end we can strongly type it to be one. (which is great for enforcing resolver has run on the entire tree).

This also makes it one uniform structure rather than two parallel ones.


Maybe a question that informs this decision:

class And(Expr) vs class And(Symbol). (with maybe better names).

Or, an Expr can be what is today's Symbol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this might make the visitor too confusing 😅 .. .. maybe..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is definitely doable, but we'd likely be doing ourselves a disservice. A few steps down the development pipeline is "type resolution", for which every node will need to have a symbol. Every expr will have one. Every mathematical operation will have one, every function will return one, every constant will have one (already has), etc. Parsing and traversing that soup is going to be the stuff of nightmares 😅.

I didn't really invent this myself though. I'm somewhat borrowing from TypeScript's API, which also connects a symbol to each expression.

image

Those symbols are the window into the meaning behind the values, including their types, and where they were initialised.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sweet, having prior art makes me a lot more comfortable with this :D


class FieldSymbol(Symbol):
name: str
table: Union[TableSymbol, TableAliasSymbol, SelectQuerySymbol, SelectQueryAliasSymbol]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems very easy to generate symbols without the requisite data.

Think it's fine for now, but we should enforce symbol creation with all required components (i.e. have an init that disallows passing nones for all these attributes). Should simplify rest of the things where we right now have to do error handling to check whether they exist or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SelectQuerySymbol was where I found this to be worrysome - noticed the empty creation in the resolver I think.

if node.op == ast.CompareOperationType.Eq:
if isinstance(node.right, ast.Constant) and node.right.value is None:
response = f"isNull({left})"
return f"isNull({left})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooo tricky, special handling for nulls because the user did something that probably wouldn't work?

iirc it's not invalid Clickhouse SQL to have equals(NULL, 1)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... I have nothing against splitting this back to a separate "IsNull" comparison. This basically came out of a property filter. The logic is that to check if anything is or isn't null, you can't just == null or != null you always have to change it to is or is not. This effectively makes comparing with null work.

But, that might not be what's expected, so let's revisit. I'll add it to the list of checkboxes.

elif node.name == "countDistinct":
response = f"count(distinct {translated_args})"
return f"count(distinct {translated_args})"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big fan of these translations, why do we have them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to get rid of them, but outside of this PR 😅

PostHog/meta#81

[ ] Clean up the confusion with our bespoke countDistinct() function

They're leftovers from the Python AST days. I want to support native ANSI-compliant SQL here.


field_sql = self._print_identifier(resolved_field.name)

# :KLUDGE: Legacy person properties handling. Assume we're in a context where the tables have been joined,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very confident here, but will not block because we'll eventually test all of this with our existing clickhouse insight tests 😬 .

This feels pretty complicated / hard to get right, but don't have a better option on my mind yet, trying to use the existing ColumnOptimizer seems even harder here..

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we have a test everywhere you can use HogQL property filter (trends, funnels, global filters on all insights, events list, session recordings list, etc) to make sure it works with both person.properties.email == 'something' and properties.something == 'bla'. Those test are run with materialised properties on and off for those columns, and with person-on-events both on and off. So far they're green. 😅

This all is a big kludge, and I'm now working on a true fix in the context of https://github.com/PostHog/posthog/pull/14286/files

Copy link
Contributor

@neilkakkar neilkakkar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@mariusandra
Copy link
Collaborator Author

Thank you @neilkakkar for taking the time to go through all of this!

Let's get it in and 🚀 on...

@mariusandra mariusandra merged commit 5345975 into master Feb 21, 2023
@mariusandra mariusandra deleted the hogql-symbol-resolution branch February 21, 2023 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants