Improve gzip detection (or allow override) for rotated files #249

@Krinkle

Hi, thanks for providing this great tool.

While using it to import a batch of Apache log files (common/ncsa_extended), I noticed that a very large number of lines were being ignored (using Matomo 3.10.0):

$ python import_logs.py --url="https://###" --idsite=5 --log-hostname='###' --enable-http-errors --enable-http-redirects --recorders=2 --log-format-name=ncsa_extended --strip-query-string ~/domains/###/logs/*2019.tar*

This would import the following files:

Aug-2019.tar.gz     Aug-2019.tar.gz.13  Aug-2019.tar.gz.2  Aug-2019.tar.gz.7  Sep-2019.tar.gz.10  Sep-2019.tar.gz.15  Sep-2019.tar.gz.2   Sep-2019.tar.gz.4  Sep-2019.tar.gz.9
Aug-2019.tar.gz.1   Aug-2019.tar.gz.14  Aug-2019.tar.gz.3  Aug-2019.tar.gz.8  Sep-2019.tar.gz.11  Sep-2019.tar.gz.16  Sep-2019.tar.gz.20  Sep-2019.tar.gz.5
Aug-2019.tar.gz.10  Aug-2019.tar.gz.15  Aug-2019.tar.gz.4  Aug-2019.tar.gz.9  Sep-2019.tar.gz.12  Sep-2019.tar.gz.17  Sep-2019.tar.gz.21  Sep-2019.tar.gz.6
Aug-2019.tar.gz.11  Aug-2019.tar.gz.16  Aug-2019.tar.gz.5  Sep-2019.tar.gz    Sep-2019.tar.gz.13  Sep-2019.tar.gz.18  Sep-2019.tar.gz.22  Sep-2019.tar.gz.7
Aug-2019.tar.gz.12  Aug-2019.tar.gz.17  Aug-2019.tar.gz.6  Sep-2019.tar.gz.1  Sep-2019.tar.gz.14  Sep-2019.tar.gz.19  Sep-2019.tar.gz.3   Sep-2019.tar.gz.8
Logs import summary
-------------------

    1114 requests imported successfully
    317 requests were downloads
    53042 requests ignored:
        0 HTTP errors
        0 HTTP redirects
        42658 invalid log lines
        0 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        9950 requests done by bots, search engines...
        434 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

After setting the --dry-run and -ddd options, I realised this was because the script was not decompressing the files. It took me a while to figure this out because some of the lines did make it through.

Based on the logic around import_logs.py#L2299, I assume this is because the file names don't cleanly end with .gz: the rotated files carry a numerical suffix after it (for example, Aug-2019.tar.gz.13).

It'd be great if these were supported as well, or if there was a way for me to specify that it should open them as gzip files.
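
To illustrate the kind of detection I have in mind, here is a minimal sketch that picks the decompressor from the file's leading bytes instead of its name. It assumes Python 3, and `open_log` is a hypothetical helper, not a function from import_logs.py; it only handles the gzip or bzip2 layer and does not unpack anything inside it.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
BZ2_MAGIC = b"BZh"        # first three bytes of every bzip2 stream


def open_log(path):
    """Open a log file as text, choosing the decompressor from the file's
    leading bytes rather than from its extension (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(GZIP_MAGIC):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    if head.startswith(BZ2_MAGIC):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    # Neither signature matched: treat the file as plain text.
    return open(path, "r", encoding="utf-8", errors="replace")
```

With something like this, a rotated name such as Aug-2019.tar.gz.13 would be opened the same way as Aug-2019.tar.gz, since the name is never consulted. An explicit command-line override would still be useful for cases where sniffing is not wanted.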
