Skip to content

join log_action in segment sql queries instead of using idaction IN (...) #20379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 42 commits into from
Mar 27, 2023

Conversation

diosmosis
Copy link
Member

@diosmosis diosmosis commented Feb 17, 2023

Description:

Fixes #19897

This changes the SQL generated from segments to use joins instead of IN ( SELECT * FROM ... ) expressions. SQL that previously looked like:

SELECT ... FROM log_link_visit_action WHERE idaction_url IN (
    SELECT idaction FROM log_action WHERE name LIKE '%abc%'
) OR idaction_url IN (
    SELECT idaction FROM log_action WHERE name NOT LIKE '%def%'
)

will now look like:

SELECT ... FROM log_link_visit_action
LEFT JOIN log_action log_action_segment_log_link_visit_actionidaction_url ON log_action_segment_log_link_visit_actionidaction_url.idaction = log_link_visit_action.idaction_url
WHERE log_action_segment_log_link_visit_actionidaction_url.name LIKE '%abc%' AND log_action_segment_log_link_visit_actionidaction_url.name NOT LIKE '%def%'

Changes:

  • In Segment::getCleanedExpression() handle Dimension::getDbColumnJoin() metadata if the dimension has it. This is what adds joins to tables like log_action or goal.
  • Added a new Dimension::$sqlFilterMatch property that allows segments to customize SQL used to match segments. Used by idaction segments to add a condition on the hash column.
  • Modify result of Segment::getAvailableSegments() to index the result by the segment name.
  • Removed the segment subquery cache and related INI options.
  • Add GoalName segment that allows segmenting on goal name while IdGoal continues to accept int ID values.

Review

@diosmosis diosmosis marked this pull request as draft February 17, 2023 20:08
@diosmosis diosmosis added this to the 5.0.0 milestone Feb 25, 2023
@diosmosis diosmosis added the Needs Review PRs that need a code review label Feb 25, 2023
@diosmosis diosmosis marked this pull request as ready for review February 25, 2023 23:49
Copy link
Member

@tsteur tsteur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice @diosmosis

I've tested this and it improved some queries 20 times when there is an account with many actions. From > 40 seconds to 2 seconds for a simple select. 🎉

Can I suggest we add some more tests to ensure it behaves as expected and will also do this in the future? For example tests/PHPUnit/Integration/SegmentTest.php test cases for getSelectQuery to ensure the SQL and bind is as expected for:

  • Multiple action related segments. eg entryPageUrl, pageUrl, pageTitle, ... in one. Ideally sometimes even the same segment multiple times present
  • Multiple action related segments containing And / or segments.
  • Also ensuring that when we use a == that it speeds up the query correctly where it effectively does a where idaction = $id.

For example, I did create a test using 'entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fjobs%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fsupport%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fchangelog%25252F;entryPageUrl!@https%25253A%25252F%25252Fdeveloper.matomo.org%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Ftranslations%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fweb-analytics-training%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Ftag-manager-training%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Ffaq%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fhelp%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fcontact%25252F;entryPageUrl!@https%25253A%25252F%25252Fmatomo.org%25252Fdocs%25252F' and one thing I oddly noticed is that while in 4.x-dev it would bind matomo.orgtranslations with this patch it would bind https://matomo.org/translations as value

And ideally, we would maybe also add a system test for some complicated segment if it's not present yet to ensure we get the correct value.

@tsteur
Copy link
Member

tsteur commented Mar 2, 2023

FYI some tests are failing also

@diosmosis
Copy link
Member Author

@tsteur

Multiple action related segments. eg entryPageUrl, pageUrl, pageTitle, ... in one. Ideally sometimes even the same segment multiple times present
Multiple action related segments containing And / or segments.

These two already exist in: https://github.com/matomo-org/matomo/blob/4.x-dev/tests/PHPUnit/Integration/SegmentTest.php#L1427. But they only use pageUrl. I'll add one that does something similar w/ different action segments.

Also ensuring that when we use a == that it speeds up the query correctly where it effectively does a where idaction = $id.

Curently not doing that, but instead, LEFT JOIN ON log_action idaction_url_log_action WHERE idaction_url_log_action.name = ? AND idaction_url_log_action.hash = CRC32(?) AND idaction_url_log_action.type = ?. This should have the same effect but require less queries, correct?

This should also be covered by existing SegmentTest tests.

And ideally, we would maybe also add a system test for some complicated segment if it's not present yet to ensure we get the correct value.

Will add a system test.

@diosmosis
Copy link
Member Author

@tsteur updated

And in regards to the URL encode. Below segment is actually one from our production account as it is stored in the database. As mentioned in 4.x-dev the bind parameters don't include the HTTP. I see you added the urlencode to fix this. I do wonder though if this could have any implications on already existing stored segments? Could anything regress there?

Reverted this.

FYI trying a segment string like pageUrl=@foo.com;pageUrl!@baz;pageUrl=@test;pageUrl!@hello generates a query like below. Not sure if there's anything we can do to optimise it. I saw this today in production. It's basically about optimising "NOT IN" but maybe it can't be, not sure.

I was thinking the same thing, but it looks like a significant amount of work to look into so I would consider it out of scope here. I could work on it as a follow up PR though, if desired.

@tsteur
Copy link
Member

tsteur commented Mar 14, 2023

I was thinking the same thing, but it looks like a significant amount of work to look into so I would consider it out of scope here. I could work on it as a follow up PR though, if desired.

Agreed 👍

So far things are looking good. I've been doing more testing and more performance testing too and just seen one query that it improves by 80 times! That's because so far according to explain in the past all these subqueries where dependent subqueries and executed for each row very likely which is of course very ineffecient and slow.

The only issue I'm still seeing is that it used to behave differently for the segment entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ffaq%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fdocs%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fcontact%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fhelp%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ftag-manager-training%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fweb-analytics-training%252F,entryPageUrl=@https%253A%252F%252Fdeveloper.matomo.org%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ftranslations%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fchangelog%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fsupport%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fjobs%252F. In 4.x-dev it would have bind parameters like matomo.org vs now it has bind parameters like https://matomo.org. Somehow, it doesn't seem to remove the HTTP protocol prefix anymore. Not sure where or why.

I'm testing this with below test

    public function test_getSelectQuery_whenUrls()
    {
        $select = 'sum(log_visit.idvisit)';
        $from = 'log_visit';
        $where = 'log_visit.idsite = ?';
        $bind = array(1);

        $segment = 'entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ffaq%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fdocs%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fcontact%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fhelp%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ftag-manager-training%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fweb-analytics-training%252F,entryPageUrl=@https%253A%252F%252Fdeveloper.matomo.org%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Ftranslations%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fchangelog%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fsupport%252F,entryPageUrl=@https%253A%252F%252Fmatomo.org%252Fjobs%252F';
        $segment = new Segment($segment, $idSites = array());

        $query = $segment->getSelectQuery($select, $from, $where, $bind);

        $this->assertQueryDoesNotFail($query);

        $expected = array(
            "sql"  => "....",
            "bind" => array('matomo.org...'));

        $this->assertEquals($this->removeExtraWhiteSpaces($expected), $this->removeExtraWhiteSpaces($query));
    }

And getting these binds. This would not give us any results since the URLs have the protocol removed.
image

Can you reproduce this @diosmosis ?

@diosmosis
Copy link
Member Author

@tsteur

Can you reproduce this @diosmosis ?

Yes, I can reproduce. Before this logic was handled by the TableLogAction method, but that isn't invoked when joining the segment. I'll look into a fix.

@diosmosis
Copy link
Member Author

@tsteur should be fixed

@tsteur
Copy link
Member

tsteur commented Mar 15, 2023

Awesome! Thanks for this @diosmosis . I've just tested this and works nicely. Seeing partially some great performance gains!

This looks all good from my perspective. Not sure if @matomo-org/core-reviewers had any extra thoughts? We'll likely want to cherry pick this also into 4.x-dev as it doesn't seem to be breaking API and would bring us some great performance improvement. I'd create separate issues for this once merged.

Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments. Besides that the branch currently has conflicts. Would be good to merge in the latest changes and resolve the conflicts

@diosmosis
Copy link
Member Author

@sgiehl updated except for the one comment that still has a question

@mattab mattab added the 5.0.0 label Mar 23, 2023
Copy link
Member

@sgiehl sgiehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of suggestions for cs improvements and stuff that can be changed/removed.
Otherwise this looks fine to me.

@tsteur you had done some performance checks on the new queries, right? If needed I can also do some testing. Otherwise guess this could be merged, once the suggestions were applied.

@diosmosis
Copy link
Member Author

@sgiehl applied review feedback

@sgiehl sgiehl removed the Needs Review PRs that need a code review label Mar 24, 2023
@tsteur
Copy link
Member

tsteur commented Mar 26, 2023

@sgiehl yes I did performance checks here and they looked really good 👍 We're looking forward to this improvement a lot.

@tsteur
Copy link
Member

tsteur commented Mar 26, 2023

Once merged we can also run this on production for a while and keep a closer look on how it's impacting things overall in case certain things are causing issues.

@sgiehl sgiehl merged commit 7d8a6c4 into 5.x-dev Mar 27, 2023
@sgiehl sgiehl deleted the m19897 branch March 27, 2023 09:49
@bx80 bx80 added the not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org. label Apr 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
not-in-changelog For issues or pull requests that should not be included in our release changelog on matomo.org.
Development

Successfully merging this pull request may close these issues.

Improve performance of segment queries when multiple action segment parts are used
5 participants