python2's bz2-module does not support multi-stream files #243

@derdritte

The bz2 implementation in Python 2.7 cannot decompress multi-stream bzip2 files. This is especially problematic because it does not raise an error or emit a warning; it simply acts as if the file had been parsed completely [0].

When using the debug flag you will see something like this:

…
2019-03-28 12:54:57,119: [DEBUG] Invalid line detected (line did not match): xxx.xxx.xxx.xxx - - [03/Feb/2019:08:42:34 +0100] "GET /resources/some.pdf HTTP/2.0" 200 811576 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_1_2 like Mac OS X)
…

After this, parsing of the file ends. This happens because the uncompressed log file is not split into streams by line (but by size), so the first and only stream the bz2 module actually reads will most likely end in the middle of a long line.
In the files I tried, the first stream ended after exactly 900,000 bytes, so that could be an indication.
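The single-stream behavior can be reproduced in a few lines with a bz2.BZ2Decompressor, which also stops at the end of the first stream. This is only a minimal sketch: the two-line sample data and the variable names are made up for illustration, and it uses the low-level decompressor rather than the BZ2File code path described above.

```python
import bz2

# Two independently compressed streams, concatenated into one
# "multi-stream" payload (sample data is hypothetical).
first = bz2.compress(b"line one\n")
second = bz2.compress(b"line two\n")
multi = first + second

decompressor = bz2.BZ2Decompressor()
partial = decompressor.decompress(multi)

# Only the first stream is decoded, and no error is raised.
# The bytes of the second stream are left over in unused_data.
leftover = decompressor.unused_data
print(repr(partial))
print(len(leftover), "compressed bytes were not decoded")
```

A non-empty unused_data after decompression is a sign that the input contained more than one stream.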

To check your bz2 file, you can run bzip2 on the console (this is a multi-stream file):

~$ bzip2 -tvvv ../logs/access.log-20190203.bz2 
  ../logs/access.log-20190203.bz2: 
    [1: huff+mtf rt+rld {0x0af23f00, 0x0af23f00}]
    [2: huff+mtf rt+rld {0xb291b802, 0xb291b802}]
    combined CRCs: stored = 0xa775c602, computed = 0xa775c602
    [1: huff+mtf rt+rld {0x12f00a0d, 0x12f00a0d}]
    [2: huff+mtf rt+rld {0xcbad012a, 0xcbad012a}]
…
    combined CRCs: stored = 0xfcecf793, computed = 0xfcecf793
    [1: huff+mtf rt+rld {0x84243689, 0x84243689}]
    combined CRCs: stored = 0x84243689, computed = 0x84243689
    ok

Possible solutions

  • Check the file pointer after reading the compressed log file; if it is at exactly 900,000 bytes, output a warning or an error message
  • Use an external dependency like bz2file [1] to read the compressed data
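A third option, if adding a dependency is not desirable, might be to read the streams one after another with the standard library's bz2.BZ2Decompressor. The following is a rough sketch under that assumption, not a tested patch for this project: the function name iter_multistream and the chunk_size default are hypothetical. It starts a new decompressor whenever the previous stream has ended.

```python
import bz2
import io


def iter_multistream(fileobj, chunk_size=64 * 1024):
    """Yield decompressed data from every bzip2 stream in fileobj.

    fileobj must be opened in binary mode. The name and signature
    are illustrative only.
    """
    decompressor = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        while chunk:
            try:
                data = decompressor.decompress(chunk)
            except EOFError:
                # The previous stream ended exactly at a chunk boundary;
                # retry the same chunk with a fresh decompressor.
                decompressor = bz2.BZ2Decompressor()
                continue
            if data:
                yield data
            # Anything left over belongs to the next stream.
            chunk = decompressor.unused_data
            if chunk:
                decompressor = bz2.BZ2Decompressor()


# Usage with an in-memory two-stream payload (sample data is made up).
payload = bz2.compress(b"first\n") + bz2.compress(b"second\n")
result = b"".join(iter_multistream(io.BytesIO(payload), chunk_size=16))
print(repr(result))
```

The same generator could be wrapped around a file opened with open(path, "rb"). Lines would still need to be reassembled across the yielded pieces, since a stream may end in the middle of a line.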

[0] https://docs.python.org/2/library/bz2.html#bz2.BZ2File
[1] https://pypi.org/project/bz2file/
