Description
Hi,
There seems to be a race condition when using multiple archiver jobs on the same node.
matomo/core/Filesystem.php
Lines 426 to 432 in 647ac56
if (!file_exists($pathToFile)) {
    return;
}

$filesize = filesize($pathToFile);
$factor = $units[$unit];
$converted = $filesize / $factor;
Notice that the file is checked for existence before the filesize is checked.
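Just to illustrate what I mean (a hypothetical variant of that snippet, not a patch I am proposing): the window only disappears if the return value of filesize() itself is checked instead of relying on the earlier file_exists() call.

// Hypothetical sketch: another archiver may delete the file between
// file_exists() and filesize(), so treat a failed filesize() as "file is gone"
// instead of trusting the earlier existence check.
$filesize = @filesize($pathToFile); // @ suppresses the warning if we lose the race
if ($filesize === false) {
    return; // the file disappeared between the two calls
}
$factor = $units[$unit];
$converted = $filesize / $factor;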
Two archive jobs are started in bash like so:
CONCURRENT_ARCHIVERS=2
for i in $(seq 1 $CONCURRENT_ARCHIVERS)
do
(/var/www/console core:archive -vvv --concurrent-archivers=$CONCURRENT_ARCHIVERS) &
pids+=($!)
done
A few times a day, I get this error message for one of the two started processes:
WARNING [2020-04-27 03:39:26] 2654 /var/www/core/Filesystem.php(430): Warning - filesize(): stat failed for /var/www/tmp/climulti/archive.sharedsiteids.pid - Matomo 3.13.4
I'm not able to reproduce it on demand, but we know it happens a few times a day because we get a notification whenever an archive process exits with a non-zero code.
We only see this in our test environment, which has many sites but no new stats being added to them. The run time for each archive job is under 3 seconds.
I looked through the code, and the only thing that I could spot as a potential source is this:
matomo/core/CronArchive/SharedSiteIds.php
Lines 119 to 150 in 1155273
/**
 * If there are multiple archiver running on the same node it makes sure only one of them performs an action and it
 * will wait until another one has finished. Any closure you pass here should be very fast as other processes wait
 * for this closure to finish otherwise. Currently only used for making multiple archivers at the same time work.
 * If a closure takes more than 5 seconds we assume it is dead and simply continue.
 *
 * @param \Closure $closure
 * @return mixed
 * @throws \Exception
 */
private function runExclusive($closure)
{
    $process = new Process('archive.sharedsiteids');

    while ($process->isRunning() && $process->getSecondsSinceCreation() < 5) {
        // wait max 5 seconds, such an operation should not take longer
        usleep(25 * 1000);
    }

    $process->startProcess();

    try {
        $result = $closure();
    } catch (Exception $e) {
        $process->finishProcess();
        throw $e;
    }

    $process->finishProcess();

    return $result;
}
Is there something here that could cause issues when multiple archivers each complete in less than 5 seconds?
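My current guess at the interleaving, written out as a hypothetical trace (this is only my reading of the code, not something I have confirmed; I'm assuming startProcess() creates the archive.sharedsiteids.pid file and finishProcess() deletes it again):

// Hypothetical trace of two archivers A and B sharing the 'archive.sharedsiteids' Process:
//
// A: runExclusive() -> isRunning() is false, startProcess() creates the pid file
// A: $closure() runs quickly, then finishProcess() deletes the pid file
// B: runExclusive() -> inside the isRunning() loop it inspects the same pid file:
//      file_exists($pathToFile) -> true (A has not deleted it yet)
//      ... A's finishProcess() removes the pid file right here ...
//      filesize($pathToFile)    -> stat failed: the WARNING seen in b.log
//
// If that is what happens, the file_exists()/filesize() pair in Filesystem.php
// is simply not atomic with respect to the other archiver's finishProcess().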
I don't have a suggestion for a fix right now. Attached is the output from the two archive processes, where one of them (b) produces the WARNING:
a.log
b.log