Description
The bz2 implementation in Python 2.7 cannot decompress multi-stream bzip2 files. This is especially problematic because it does not raise an error or emit a warning; it simply behaves as if the file had been parsed completely [0].
When using the debug flag you will see something like this:
…
2019-03-28 12:54:57,119: [DEBUG] Invalid line detected (line did not match): xxx.xxx.xxx.xxx - - [03/Feb/2019:08:42:34 +0100] "GET /resources/some.pdf HTTP/2.0" 200 811576 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like Mac OS X)
…
After this, parsing of the file ends. This happens because the content of the uncompressed log file is split into streams by size, not by line, so the first stream (the only one the bz2 module actually reads) will most likely end in the middle of a long line.
In the files I tried, the first stream ended after exactly 900,000 bytes, so that could be an indication.
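The behaviour can be reproduced without a real log file. The following is a minimal sketch, not code from this project: it builds a two-stream payload by concatenating two compressed chunks, shows that a single `bz2.BZ2Decompressor` stops at the end of the first stream, and then reads every stream by restarting on `unused_data`. The function name `decompress_multistream` and the sample data are made up for illustration.

```python
import bz2


def decompress_multistream(data):
    """Decompress every bzip2 stream in `data`, not just the first one."""
    chunks = []
    while data:
        # A decompressor object handles exactly one stream.
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        # Bytes after the end of this stream belong to the next stream.
        data = decompressor.unused_data
    return b"".join(chunks)


# Two independently compressed chunks, concatenated: a multi-stream payload.
multi = bz2.compress(b"line one\n") + bz2.compress(b"line two\n")

# A single decompressor silently stops after the first stream.
first_only = bz2.BZ2Decompressor().decompress(multi)

# Restarting on the leftover bytes recovers the whole content.
everything = decompress_multistream(multi)
```

This sketch keeps the whole file in memory, so it is only meant to illustrate the mechanism. Trailing bytes that are not a valid bzip2 stream would raise an error instead of being ignored.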
To check your bz2 file, you can run bzip2 on the console (the file in this example is a multi-stream file):
~$ bzip2 -tvvv ../logs/access.log-20190203.bz2
../logs/access.log-20190203.bz2:
[1: huff+mtf rt+rld {0x0af23f00, 0x0af23f00}]
[2: huff+mtf rt+rld {0xb291b802, 0xb291b802}]
combined CRCs: stored = 0xa775c602, computed = 0xa775c602
[1: huff+mtf rt+rld {0x12f00a0d, 0x12f00a0d}]
[2: huff+mtf rt+rld {0xcbad012a, 0xcbad012a}]
…
combined CRCs: stored = 0xfcecf793, computed = 0xfcecf793
[1: huff+mtf rt+rld {0x84243689, 0x84243689}]
combined CRCs: stored = 0x84243689, computed = 0x84243689
ok
Possible solutions
- Check the file pointer after reading the compressed log file; if it is at exactly 900,000 bytes, output a warning or an error message
- Use an external dependency like bz2file [1] to read the compressed data
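As a variation on the first option, the check could look for leftover data after the first stream instead of relying on the 900,000-byte figure. The sketch below is an assumption about how such a check might look, not code from this project; the helper name `has_trailing_data` and the `chunk_size` parameter are hypothetical. It uses only `bz2.BZ2Decompressor`, so it should also work on Python 2.7.

```python
import bz2
import io


def has_trailing_data(fileobj, chunk_size=64 * 1024):
    """Return True if compressed data follows the first bzip2 stream.

    `fileobj` must be opened in binary mode on the *compressed* file.
    """
    decompressor = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # The file ended together with the first stream.
            return False
        try:
            decompressor.decompress(chunk)
        except EOFError:
            # The first stream ended exactly on the previous chunk boundary,
            # yet the file still had more data to offer.
            return True
        if decompressor.unused_data:
            # The first stream ended in the middle of this chunk.
            return True


# Hypothetical sample data standing in for real log files.
single = bz2.compress(b"a\n")
multi = single + bz2.compress(b"b\n")

single_result = has_trailing_data(io.BytesIO(single))
multi_result = has_trailing_data(io.BytesIO(multi), chunk_size=16)
```

A parser could call such a helper before reading a file and log a warning when it returns True. The helper decompresses the whole first stream, so it costs about as much as the read that currently happens anyway.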
[0] https://docs.python.org/2/library/bz2.html#bz2.BZ2File
[1] https://pypi.org/project/bz2file/