-
-
Notifications
You must be signed in to change notification settings - Fork 873
New Rule ST11: Detect unused tables in join #5266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Coverage Results ✅
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a fan of this rule. I know I've run into queries that have had those unneeded joins.
My concern is it might trigger a bit too often which might lead to people disabling it entirely.
-- getting a count of two joined tables triggers this rule.
SELECT COUNT(*) AS row_count -- could also be COUNT(1) or COUNT(0)
FROM taba AS a
INNER JOIN tabb AS b
ON a.col1 = b.col1;
-- ANTI and SEMI joins will always trigger this as well, `b` cannot be referenced
SELECT a.col1, a.col2
FROM taba AS a
SEMI JOIN tabb AS b
ON a.col1 = b.col1;
SELECT a.col1, a.col2
FROM taba AS a
ANTI JOIN tabb AS b
ON a.col1 = b.col1;
-- an INNER JOIN might be used like an EXISTS/SEMI JOIN
SELECT
a.key_column,
a.col2,
a.col3,
a.col4
FROM all_records AS a
INNER JOIN targeted_records AS t
ON a.key_column = t.key_column;
It might make more sense to only apply this to LEFT
, RIGHT
, and FULL
joins and have an exception for joins that utilize COUNT
.
Overall, I'm excited to see this rule in place in the near future!
@keraion - great feedback 👍 . On the On the "inner join as filter" case I totally see your point, but I also think those are some of the cases which we're probably trying to detect too 🤔 . I think you're right to suggest we exclude them, which would also exclude an unqualified |
src/sqlfluff/rules/structure/ST10.py
Outdated
designed to work alongside :sqlfluff:ref:`references.qualification` | ||
(:sqlfluff:ref:`RF02`). | ||
|
||
This rule does not propose a fix, because it assumes that it an unused |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I could imagine unusual cases where someone joins with a table simply for filtering purposes, and there's nothing "wrong" with that. (Although one could argue in that case WHERE EXISTS
with a sub-select would be clearer about the intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To address this issue, I've added an allowlist
of specific keywords, which includes INNER
to prevent this situation. By making it configurable, people can make it more or less aggressive as they prefer.
src/sqlfluff/rules/structure/ST10.py
Outdated
In the (*very rare*) situations that it is logically necessary to include | ||
a table in a join clause, but not otherwise refer to it (likely for | ||
granularity reasons, or as a stepping stone to another table), we recommend | ||
ignoring this rule for that specific line by using ``-- noqa: ST10`` at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it's worth mentioning WHERE EXISTS
as another possible remedy?
@@ -0,0 +1,65 @@ | |||
rule: ST10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered the possibility that an "unused" table is actually referenced in a subquery? Below is a quick example of what I mean. I don't have a real-world situation in mind -- just threw together this query as an example.
SELECT a.col1
FROM a
JOIN b ON a.id = b.a_id
WHERE a.some_column IN (
SELECT c.some_column
FROM c
WHERE c.other_column = b.col
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@barrywhart - I've added your query (slightly adjusted) as a test case, and we're all good 👍 . Introspection into subqueries is covered.
Could one or more of the maintainers run this rule on a large set of real-world queries? I do like the rule, but we may see surprise warnings from it for cases we hadn't considered. |
I've run this on my current large project, and got a few flags from |
Update post maintainers call: I've removed the proposed config value, and now it only flags on explicit |
mismatches. | ||
|
||
NOTE: If *any* references aren't appropriately qualified, | ||
this rule will abort (because it won't know how to resolve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a nitpick, but assuming:
- At least some of the references are qualified
- The ones that are qualified reference all the tables
then should it be okay if some of the references are unqualified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well... if all the tables are referenced by some of the columns, then there's no error - so this rule wouldn't fire anyway.
You're correct that we could check if not all the columns are qualified, but if we'd still only kick up an error if we were sure that a table was unused then we need both a table to be unreferenced and all columns to be qualified. Given that second condition is the easiest to check, it seems smart to do that first.
@@ -21,6 +21,17 @@ def get_configs_info() -> Dict[str, Any]: | |||
"Defaults to ``earlier``." | |||
), | |||
}, | |||
"allowlist_join_keywords": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere? I don't see any other references to allowlist_join_keywords
in the PR. It looks like it's hardcoded above to {"FULL", "LEFT", "RIGHT"}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GOOD CATCH. Yeah I can remove that 👍
) | ||
|
||
test_fail_subquery: | ||
# "b" is not referenced outside it's join (despite subquery), so unused. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's -> its
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @barrywhart for the review. I'll get this merged now those parts are addressed 👍
@alanmcruickshank edits:
Based on the original bones of this PR by @danparizher, I've fleshed out the logic so that it's a working rule. Fixes #5251.
NOTE: I've decided not to enable a fix routine for this rule, because it's effectively ambiguous what the intent was - and just removing a join could change the granularity of a table if not done with care. I've also decided that this rule shouldn't return unless all references are qualified - meaning it's effectively dependent on RF02 to enforce that first.
Very open to feedback!