Some archiving optimizations #14091
Conversation
- …reate new archives unless forced.
- … when checking to skip site in core:archive.
@diosmosis Also I think the PR will need to add similar logic in …
@tsteur made a comment in the original issue about the segment-related change (the code already exists in CronArchive): #13387 (comment)
Wouldn't the first created archive be the first archive of the day? That would defeat the purpose of making this change? This SQL was supposed to find the latest ts_archived value for an archive that can be used to query visits. The idea is that, from that point on until 'now', if there are no visits found, then we don't need to initiate archiving and can just use the existing archive. The archiving code is extremely dense, though, so I am not entirely sure whether this code has that effect... if you have any examples where it breaks, I would like to see them.
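The skip logic described here can be sketched roughly as follows. This is an illustrative Python/SQLite sketch, not Matomo's actual PHP code: the table layout, column names, and done-flag values (DONE_OK = 1, DONE_OK_TEMPORARY = 3) are simplified assumptions.

```python
import sqlite3

# assumed done-flag values, simplified from Matomo's ArchiveWriter
DONE_OK = 1
DONE_OK_TEMPORARY = 3

def should_skip_archiving(conn, id_site):
    """Return True when an existing archive can be reused: there is a
    usable archive and no visits were recorded after its ts_archived."""
    row = conn.execute(
        """SELECT MAX(ts_archived) FROM archive_numeric
           WHERE idsite = ? AND name LIKE 'done%' AND value IN (?, ?)""",
        (id_site, DONE_OK, DONE_OK_TEMPORARY),
    ).fetchone()
    last_archived = row[0]
    if last_archived is None:
        return False  # never archived; must launch archiving
    row = conn.execute(
        """SELECT COUNT(*) FROM log_visit
           WHERE idsite = ? AND visit_last_action_time > ?""",
        (id_site, last_archived),
    ).fetchone()
    return row[0] == 0  # no newer visits => reuse the existing archive

# tiny demo schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive_numeric (idsite INT, name TEXT, value INT, ts_archived TEXT)")
conn.execute("CREATE TABLE log_visit (idsite INT, visit_last_action_time TEXT)")
conn.execute("INSERT INTO archive_numeric VALUES (1, 'done', 1, '2019-02-05 23:26:06')")
conn.execute("INSERT INTO log_visit VALUES (1, '2019-02-05 22:00:00')")
print(should_skip_archiving(conn, 1))  # no visits after 23:26:06 -> True
conn.execute("INSERT INTO log_visit VALUES (1, '2019-02-05 23:30:00')")
print(should_skip_archiving(conn, 1))  # newer visit exists -> False
```

As the discussion below shows, comparing against ts_archived alone is subtle, since ts_archived is written only when the archive insert finishes.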
It wouldn't need to use the first archive of the day, I think (not 100% sure about everything though). Just the launch date of the last launch of the archiving for a given … The thinking is that archiving may take several minutes or hours. If it uses the latest date, which may be … I don't think we have that check in the code already, as https://github.com/matomo-org/matomo/blob/3.x-dev/core/CronArchive.php#L1261-L1292 is around midnight.
If the ts_archived time is 23:26:06, then wouldn't the archive include visits before that time (so, any visits between 23:24:41 => 23:26:06)? Ie, ts_archived is set when the archiving query finishes (AFAIK), so when we aggregate nb_visits and create that archive for today, the nb_visits would include visits from the start of the day => ts_archived, right? Of course, this doesn't apply to archiving dates in the past, just when running archiving for today.
Only for some archives. See above in the screenshot: they are all for the same archive. But some may include those visits, some not... as it takes several minutes to hours to finish.
To finish all archiving, or just the one archiving query? In this case we're only looking at archives for VisitsSummary / no segment. Though I guess if it gets all the plugins, that might take more time... actually, I'm not sure how ts_archived is even set... will take a quick look.
Looks like the ts_archived value is set when inserting individual idarchives, so the …
It should work in most cases. There may be edge cases where it also takes a few ms, seconds, or minutes to query and insert the …
So we add a …
That sounds right 👍
@tsteur Updated
Left some comments. @diosmosis have you looked into triggering the archiving of segments only when needed in the … Also wondering: have you tested what happens when receiving a tracking request for an older date, where this older date needs to be rearchived, maybe because the archive was invalidated? So far I've only looked at the code, so I haven't tested it yet. Might be quite tricky.
That's what this commit should achieve: 23b5a18. Don't think I've tested a lot, will have to do more...
I mean we may have archived some segments at a later time than the regular archive, in which case we wouldn't need to issue a request for that specific segment (but for other segments); or maybe the segment was last executed a longer time ago, and it needs to issue the segment request but not the regular archiving. It all sounds pretty tricky, and I'm not sure how easily things are doable.
This might require a larger refactoring of the cron archive process... maybe we can turn it into something a bit easier to manipulate in this way? I don't expect a change like this could be done by 3.9...
It won't be doable in 3.9. Maybe we could split it in two parts though: …
This might also make it easier to review, and I think most or all of the logic for …
Sorry, I might be confused. I see you added some code for … During one archive we may trigger heaps of requests for different periods and segments, like …
And if I see this correctly, then we need to check for each request when that information was last archived, before requesting the archive trigger URL. Because the … I would have basically expected to see some new code before each … Eg we might consider a …
I see, so do the check I added before each request, for the specific archive in question, correct?
Yep
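The per-request check agreed on above could look roughly like this. A hypothetical sketch only; the function and parameter names are illustrative, not Matomo's actual API.

```python
# Before issuing each archive-trigger URL, compare that (period, segment)
# combination's own last archive timestamp against the newest visit, and
# skip the request when nothing changed for that combination.
def requests_to_issue(combinations, last_archived, newest_visit_ts):
    """combinations: list of (period, segment) pairs to archive.
    last_archived: dict mapping (period, segment) -> ts of its last archive.
    Returns only the combinations that still need archiving."""
    todo = []
    for combo in combinations:
        ts = last_archived.get(combo)
        if ts is None or newest_visit_ts > ts:
            todo.append(combo)  # never archived, or new data since
    return todo

combos = [("day", ""), ("day", "browserCode==FF"), ("week", "")]
archived = {("day", ""): 100, ("day", "browserCode==FF"): 50}
print(requests_to_issue(combos, archived, newest_visit_ts=80))
# -> [('day', 'browserCode==FF'), ('week', '')]
```

The point of doing the check per combination is exactly the scenario tsteur raises: a segment archived later than the regular archive may be skippable while the regular archive is not, and vice versa.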
@diosmosis left a few comments. I reckon the main thing would be to change … as discussed.
@tsteur made the requested changes & merged w/ 3.x-dev. Not 100% sure whether the cronarchive changes will have side effects or not.
Did some basic testing and it worked. Although not sure re some edge cases, left some comments to discuss.
@@ -62,6 +62,7 @@ public function __construct(Parameters $params, $isTemporaryArchive, ArchiveWrit
$this->isTemporaryArchive = $isTemporaryArchive;
$this->archiveWriter = $archiveWriter ?: new ArchiveWriter($this->params, $this->isTemporaryArchive);
$this->archiveWriter->initNewArchive();
$this->archiveWriter->insertRecord(ArchiveWriter::ARCHIVE_START_RECORD_NAME, 1);
not 100% sure... for segments... should we append the segment hash to this, similar to what we do for the done flag?
it will have the idArchive for the specific done flag, so as long as we find the idarchive first, it wouldn't need the suffix. Which makes the code a bit simpler since we don't have to compute it when querying.
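The two-step lookup described here can be sketched as follows. This is an illustrative SQLite sketch with assumed names: "archive_start" stands in for the real value of ArchiveWriter::ARCHIVE_START_RECORD_NAME, and the segment hash is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive_numeric (idarchive INT, name TEXT, value INT, ts_archived TEXT)")
# a segmented archive: the done flag carries the segment hash, the start record does not
rows = [
    (7, "done3736b708e4d9cbf34092b53a0", 1, "2019-02-05 23:26:06"),  # hash is illustrative
    (7, "archive_start", 1, "2019-02-05 23:24:41"),
]
conn.executemany("INSERT INTO archive_numeric VALUES (?, ?, ?, ?)", rows)

# step 1: latest usable archive, found via its (segment-suffixed) done flag
idarchive = conn.execute(
    "SELECT MAX(idarchive) FROM archive_numeric WHERE name = ? AND value IN (1, 3)",
    ("done3736b708e4d9cbf34092b53a0",),
).fetchone()[0]

# step 2: the start record is keyed by idarchive, so it needs no segment suffix
start_ts = conn.execute(
    "SELECT ts_archived FROM archive_numeric WHERE idarchive = ? AND name = ?",
    (idarchive, "archive_start"),
).fetchone()[0]
print(start_ts)  # -> 2019-02-05 23:24:41
```

Because the idarchive is resolved through the done flag first, the start record never needs the hash computed at query time, which is the simplification diosmosis describes.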
$nameCondition = self::getNameCondition(['VisitsSummary'], $segment);
$sql = "SELECT MAX(idarchive)
Not too important, but I reckon if we were able to add the segment hash to the start flag, then we could maybe remove the query below, and in this query do something like select max(ts_archived) ... where ... and name = ArchiveWriter::ARCHIVE_START_RECORD_NAME . $segmentHash?
(see response in below comment)
$sql = "SELECT ts_archived
        FROM $table
        WHERE idarchive = ? AND name = ?";
$bind = [$idArchive, ArchiveWriter::ARCHIVE_START_RECORD_NAME];
Initially I was wondering if we should only look at archives where the archiving finished, in case the previous archiving had an error and was interrupted (the same idArchive should also have the done flag with status OK or temporary).
Then I realised we can likely not do this, because the current archiving might be in progress and that's the reason why the done flag is missing. => This should not be a problem though, because the archiving would then simply be skipped, as the logic should notice that for that period/date there is already an archiver running right now.
So what if the last archive stopped in the middle because of an error? Then we would maybe not start that archive again under some circumstances. Say the start flag is 00:15am, then the archiving broke in the middle... it would maybe not start the archiving again? Maybe, to play it safe, we should only look at ts_archived for archives that have both the start and the done flag.
Not sure I'm explaining it well.
My understanding was that's what we do; at least, that's what $nameCondition is for. Ie, it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY)). So we're only looking at done archives.
Incidentally, I think this is also why we shouldn't append the suffix to the start record, since we'd have to check the done flag value anyway.
I know we've all said it before, but the archiving process confuses the heck out of me.
I know we've all said it before, but the archiving process confuses the heck out of me.
me too :)
My understanding was that's what we do, at least that's what $nameCondition is for. Ie, it should be (name IN ... AND value IN (DONE_OK, DONE_OK_TEMPORARY). So we're only looking at done archives.
Makes sense. Just need to double check that we also look at that in the CronArchive.
CronArchive uses ArchiveSelector::getLatestArchiveStartTimestampForToday(), which uses the name condition, so I think we're good here.
@@ -863,6 +863,13 @@ protected function processArchiveDays($idSite, $lastTimestampWebsiteProcessedDay
return false;
}

if (!$this->hasThereBeenVisitsSinceLastArchiveTimestamp($idSite, 'day', $date)) {
Do I see this correctly: if a user has no new visit recorded since the last archive, but records some visits for dates in the past, we wouldn't trigger the archiving until there's actually a new visit with a more recent date? Do you know what I mean? Probably can't work around this, I suppose.
I see what you mean. I think we could take this into account by using idvisit somehow... Maybe find the maximum idvisit that is <= ts_archived, then check if there is an idvisit for that site > that maximum idvisit? Does that sound right?
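The idvisit idea proposed here can be sketched like this. Purely illustrative names; the key property is that idvisit grows with insertion order, not with the visit's date, so late-tracked historical visits are still detected.

```python
# Sketch of the idvisit-based check: compare against the highest idvisit
# seen at archive time rather than against visit dates, so a tracking
# request arriving later for an *old* date still triggers re-archiving.
def has_new_data(visits, max_idvisit_at_archive_time):
    """visits: list of (idvisit, visit_date) rows for the site."""
    return any(idvisit > max_idvisit_at_archive_time for idvisit, _ in visits)

visits = [(1, "2019-02-01"), (2, "2019-02-05")]
print(has_new_data(visits, max_idvisit_at_archive_time=2))  # -> False
# a tracking request arrives for an old date after archiving already ran:
visits.append((3, "2019-01-15"))
print(has_new_data(visits, max_idvisit_at_archive_time=2))  # -> True
```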
Something like this could work. Might be complicated though. Wonder if #14091 (comment) would maybe make a lot of things easier. In theory the archiving logic should all be simple; after all, it's not that complicated: we generate an archive and assume it's completed. Once a new tracking request comes in, we make sure to rearchive it. If no tracking request comes in again, we don't need to archive it again.
I'll see if I can add it w/ a test.
Your other comment makes sense to me; I'm actually not sure what temporary archives are currently used for. Removing them and making the process simpler might be difficult given how complicated the system currently is; might be good to wait until Matomo 4?
I reckon it might not be all that complicated? And maybe a lot easier than this logic? Not sure though. @mattab any thoughts on this idea maybe?
I'm just not too sure if it will break anything, and we also have to deal w/ the fact that existing archives may have DONE_TEMPORARY. Seems like there could be room for bugs that would be hard to track down... Maybe we can schedule it for 3.11 at least, so when it's deployed on cloud you'll be around.
Random thought @mattab @diosmosis: why don't we get rid of temporary archives and always mark them as completed once the archiving finishes? When a new tracking request then comes in, it will mark this archive automatically as invalid for the correct periods.
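The simplified model suggested here could be sketched as follows. An illustrative sketch only, with hypothetical class and method names; how Matomo actually maps a visit date to its containing periods is more involved.

```python
# Invalidation-based model: every finished archive is marked complete,
# and an incoming tracking request marks the periods containing that
# visit's date as invalid, so only those get re-archived.
from datetime import date

class ArchiveStore:
    def __init__(self):
        self.valid = {}  # (period, start_date) -> True when archive is usable

    def finish_archive(self, period, start):
        self.valid[(period, start)] = True

    def on_tracking_request(self, visit_date):
        # invalidate every period that contains the visit's date
        day = visit_date
        week = date.fromordinal(visit_date.toordinal() - visit_date.weekday())
        month = visit_date.replace(day=1)
        for key in [("day", day), ("week", week), ("month", month)]:
            if key in self.valid:
                self.valid[key] = False

    def needs_archiving(self, period, start):
        return not self.valid.get((period, start), False)

store = ArchiveStore()
store.finish_archive("day", date(2019, 1, 15))
print(store.needs_archiving("day", date(2019, 1, 15)))  # -> False
store.on_tracking_request(date(2019, 1, 15))            # a late visit arrives
print(store.needs_archiving("day", date(2019, 1, 15)))  # -> True
```

In this model there is no temporary/complete distinction at all: an archive is either valid or it needs re-archiving, which is what would make the DONE_TEMPORARY states removable.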
I can create an issue to get rid of the done_temporary archives; otherwise I can move this to 3.11.
I'll try to get rid of temporary tables in a new PR and get back to this one.
@diosmosis what's the status here?
@tsteur closing this; we can fix this in a better way since the invalidated-archive change was merged.
WIP
Fixes #13387