Skip to content

Conversation

danparizher
Copy link
Contributor

@danparizher danparizher commented Oct 2, 2023

@alanmcruickshank edits:

Based on the original bones of this PR by @danparizher, I've fleshed out the logic so that it's a working rule. Fixes #5251.

NOTE: I've decided not to enable a fix routine for this rule, because it's effectively ambiguous what the intent was - and just removing a join could change the granularity of a table if not done with care. I've also decided that this rule shouldn't return unless all references are qualified - meaning it's effectively dependent on RF02 to enforce that first.

Very open to feedback!

@danparizher danparizher changed the title Add new rule ST10: Remove unused tables in join [WIP] Add new rule ST10: Remove unused tables in join Oct 2, 2023
@alanmcruickshank alanmcruickshank changed the title [WIP] Add new rule ST10: Remove unused tables in join New rule ST10: Detect unused tables in join Oct 17, 2024
@alanmcruickshank alanmcruickshank marked this pull request as ready for review October 17, 2024 08:37
Copy link
Contributor

github-actions bot commented Oct 17, 2024

Coverage Results ✅

Name    Stmts   Miss  Cover   Missing
-------------------------------------
TOTAL   19084      0   100%

247 files skipped due to complete coverage.

@alanmcruickshank alanmcruickshank added the minor release A PR or issue queued for minor release. label Oct 17, 2024
@alanmcruickshank alanmcruickshank changed the title New rule ST10: Detect unused tables in join New Rule ST10: Detect unused tables in join Oct 17, 2024
Copy link
Contributor

@keraion keraion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a fan of this rule. I know I've run into queries that have had those unneeded joins.

My concern is it might trigger a bit too often which might lead to people disabling it entirely.

-- getting a count of two joined tables triggers this rule.
SELECT COUNT(*) AS row_count -- could also be COUNT(1) or COUNT(0)
FROM taba AS a
INNER JOIN tabb AS b
    ON a.col1 = b.col1;

-- ANTI and SEMI joins will always trigger this as well, `b` cannot be referenced
SELECT a.col1, a.col2
FROM taba AS a
SEMI JOIN tabb AS b
    ON a.col1 = b.col1;

SELECT a.col1, a.col2
FROM taba AS a
ANTI JOIN tabb AS b
    ON a.col1 = b.col1;

-- an INNER JOIN might be used like an EXISTS/SEMI JOIN
SELECT
    a.key_column,
    a.col2,
    a.col3,
    a.col4
FROM all_records AS a
INNER JOIN targeted_records AS t
    ON a.key_column = t.key_column;

It might make more sense to only apply this to LEFT, RIGHT, and FULL joins and have an exception for joins that utilize COUNT.

Overall, I'm excited to see this rule in place in the near future!

@alanmcruickshank
Copy link
Member

@keraion - great feedback 👍 . On the Count(*) example and with explicit ANTI and SEMI joins I totally agree. I'll include some of those in the test cases.

On the "inner join as filter" case I totally see your point, but I also think those are some of the cases which we're probably trying to detect too 🤔 . I think you're right to suggest we exclude them, which would also exclude an unqualified JOIN too, because I think most database engines default to an INNER join if one isn't specified. In my mind, ideally we would flag this as an issue because they may often be mistakes, but do you have a way we'd recommend to people to indicate when it's intentional (rather than just suggesting a noqa line, which is my current best suggestion)?

designed to work alongside :sqlfluff:ref:`references.qualification`
(:sqlfluff:ref:`RF02`).

This rule does not propose a fix, because it assumes that it an unused
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I could imagine unusual cases where someone joins with a table simply for filtering purposes, and there's nothing "wrong" with that. (Although one could argue in that case WHERE EXISTS with a sub-select would be clearer about the intent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To address this issue, I've added an allowlist of specific keywords, which includes INNER to prevent this situation. By making it configurable, people can make it more or less aggressive as they prefer.

In the (*very rare*) situations that it is logically necessary to include
a table in a join clause, but not otherwise refer to it (likely for
granularity reasons, or as a stepping stone to another table), we recommend
ignoring this rule for that specific line by using ``-- noqa: ST10`` at
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think it's worth mentioning WHERE EXISTS as another possible remedy?

@@ -0,0 +1,65 @@
rule: ST10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered the possibility that an "unused" table is actually referenced in a subquery? Below is a quick example of what I mean. I don't have a real-world situation in mind -- just threw together this query as an example.

SELECT a.col1
FROM a
JOIN b ON a.id = b.a_id
WHERE a.some_column IN (
    SELECT c.some_column
    FROM c
    WHERE c.other_column = b.col
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@barrywhart - I've added your query (slightly adjusted) as a test case, and we're all good 👍 . Introspection into subqueries is covered.

@barrywhart
Copy link
Member

Could one or more of the maintainers run this rule on a large set of real-world queries? I do like the rule, but we may see surprise warnings from it for cases we hadn't considered.

@alanmcruickshank alanmcruickshank changed the title New Rule ST10: Detect unused tables in join New Rule ST11: Detect unused tables in join Dec 3, 2024
@alanmcruickshank
Copy link
Member

Could one or more of the maintainers run this rule on a large set of real-world queries? I do like the rule, but we may see surprise warnings from it for cases we hadn't considered.

I've run this on my current large project, and got a few flags from INNER joins being used as filters - but since adding the allowlist config - not issues. Seems to be running ok 👍

@alanmcruickshank
Copy link
Member

Update post maintainers call: I've removed the proposed config value, and now it only flags on explicit LEFT, RIGHT and FULL joins. I think this is otherwise ready for a review. @WittierDinosaur ?

mismatches.

NOTE: If *any* references aren't appropriately qualified,
this rule will abort (because it won't know how to resolve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a nitpick, but assuming:

  • At least some of the references are qualified
  • The ones that are qualified reference all the tables

then should it be okay if some of the references are unqualified?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well... if all the tables are referenced by some of the columns, then there's no error - so this rule wouldn't fire anyway.

You're correct that we could check if not all the columns are qualified, but if we'd still only kick up an error if we were sure that a table was unused then we need both a table to be unreferenced and all columns to be qualified. Given that second condition is the easiest to check, it seems smart to do that first.

@@ -21,6 +21,17 @@ def get_configs_info() -> Dict[str, Any]:
"Defaults to ``earlier``."
),
},
"allowlist_join_keywords": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere? I don't see any other references to allowlist_join_keywords in the PR. It looks like it's hardcoded above to {"FULL", "LEFT", "RIGHT"}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GOOD CATCH. Yeah I can remove that 👍

)

test_fail_subquery:
# "b" is not referenced outside it's join (despite subquery), so unused.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's -> its

Copy link
Member

@alanmcruickshank alanmcruickshank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @barrywhart for the review. I'll get this merged now those parts are addressed 👍

@alanmcruickshank alanmcruickshank added this pull request to the merge queue Dec 6, 2024
Merged via the queue into sqlfluff:main with commit 25f1248 Dec 6, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
minor release A PR or issue queued for minor release.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rule: Unused table in join
4 participants