Add Amazonbot and simplify some bot regexes #6843

Merged
merged 1 commit into from
Sep 9, 2021
@MichaIng MichaIng commented Sep 8, 2021

Description:

Amazonbot is Amazon's web crawler: https://developer.amazon.com/support/amazonbot

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)

The third alternative of the Baiduspider regex supersedes the first, so the two are combined. Although the regexes are matched case-insensitively, the pattern now uses a capital B for consistency with the other patterns, which all use the correct casing.
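
A minimal Python sketch of why combining the alternatives is safe. The two patterns below are hypothetical stand-ins, not copied from the repository, and the sample strings are illustrative only. The sketch assumes unanchored, case-insensitive matching, as the description states.

```python
import re

# Hypothetical "before" pattern: the third alternative already matches
# every string that the first alternative matches.
before = re.compile(r"baiduspider(-image)?|baidu Transcoder|baidu.*spider", re.IGNORECASE)

# Hypothetical "after" pattern: first and third alternatives merged,
# written with a capital B purely for readability.
after = re.compile(r"Baidu.*spider|baidu Transcoder", re.IGNORECASE)

# Illustrative sample strings (assumptions, not taken from real logs).
samples = [
    "Mozilla/5.0 (compatible; Baiduspider/2.0)",
    "Baiduspider-image",
    "baidu Transcoder",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/90.0 Safari/537.36",
]


def same_verdict(text):
    """Return True if both patterns agree on whether text matches."""
    return bool(before.search(text)) == bool(after.search(text))


for ua in samples:
    print(ua, same_verdict(ua))
```

Because matching is case-insensitive, the capital B changes nothing about which strings match.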

All leading and trailing optional groups of the form (...)? have been removed, as they have no effect on whether a string matches. Optional characters only matter when they are surrounded by non-optional characters.
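
The point about optional groups can be checked with a short Python sketch. It assumes unanchored search (Python's `re.search`) and uses a made-up `ExampleBot` pattern, not one from the repository. Only the yes/no match result is compared; the matched text and capture groups can still differ.

```python
import re

FLAGS = re.IGNORECASE

# Hypothetical pattern with a leading and a trailing optional group.
with_optional = re.compile(r"(Mozilla/5\.0 )?ExampleBot(/\d+\.\d+)?", FLAGS)

# Same pattern with the leading and trailing optional groups removed.
without_optional = re.compile(r"ExampleBot", FLAGS)

# Optional group surrounded by required characters: this one matters.
inner_optional = re.compile(r"Example(-)?Bot", FLAGS)

# Illustrative sample strings (assumptions, not taken from real logs).
samples = [
    "Mozilla/5.0 ExampleBot/1.2",
    "ExampleBot",
    "foo examplebot bar",
    "Example-Bot/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/91.0",
]


def matches(pattern, text):
    """Unanchored match test: True if pattern occurs anywhere in text."""
    return pattern.search(text) is not None


# Leading/trailing optional groups never change the verdict.
agree = [matches(with_optional, ua) == matches(without_optional, ua) for ua in samples]
print(all(agree))

# An optional group between required parts does change the verdict.
print(matches(inner_optional, "Example-Bot/1.0"), matches(without_optional, "Example-Bot/1.0"))
```

If a pattern were anchored with `^` or `$`, a leading or trailing optional group would sit between required positions and could no longer be dropped.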

Review

@MichaIng MichaIng changed the title from "Fix Baiduspider regex and simplify some regexes" to "Simplify some bot regexes" Sep 8, 2021
sanchezzzhak previously approved these changes Sep 8, 2021
MichaIng commented Sep 8, 2021

Ah wait, sorry, now I found a missing user agent:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)

Sorry for the back and forth. The background is that two pages saw an increase of several thousand percent, clearly associated with bot activity, so I checked our access logs and compared user agents.

I'm not sure whether this one executes JavaScript and sends tracker requests, but it is extremely active on our website (~100,000 daily requests compared to ~5,000 by Googlebot). In the current access logs (covering not much more than one day), it does not send tracker requests, but it does massively access exactly the two pages that saw this large increase.

@MichaIng MichaIng changed the title from "Simplify some bot regexes" to "Add Amazonbot and simplify some bot regexes" Sep 8, 2021
Amazonbot is Amazon's web crawler: https://developer.amazon.com/support/amazonbot

The third alternative of the Baiduspider regex supersedes the first, so the two are combined. Although the regexes are matched case-insensitively, the pattern now uses a capital B for consistency with the other patterns, which all use the correct casing.

All leading and trailing optional groups of the form "(...)?" have been removed, as they have no effect on whether a string matches. Optional characters only matter when they are surrounded by non-optional characters.

Signed-off-by: MichaIng <micha@dietpi.com>
@sanchezzzhak sanchezzzhak merged commit a44cf51 into matomo-org:master Sep 9, 2021
@MichaIng MichaIng deleted the patch-1 branch September 9, 2021 11:53