CDATA handling in HTML changed in lxml parser with libxml2 2.9.12 #220

@mgorny

After upgrading the system libxml2 to 2.9.12 (or 2.9.11; 2.9.10 is the last working version I have here), the following two tests fail when lxml is built against the system library:

FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2']...

The cause seems to be a different representation of CDATA:

        soup       = <html><body><div id="1">Testing that <span id="2">&lt;![CDATA[that]]&gt;</span>contains works.</div></body>
</html>

(i.e. &lt;![CDATA[... instead of <!--[CDATA[...)
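
To make the difference easier to spot in test output, here is a minimal sketch that classifies a serialized tree by how the CDATA section ended up in it. The helper name `cdata_representation` is hypothetical and not part of any library; it only does plain string matching on the two forms shown above.

```python
def cdata_representation(serialized: str) -> str:
    """Classify how a CDATA section appears in serialized HTML.

    Hypothetical helper for illustration; it relies on substring checks only.
    """
    if "<!--[CDATA[" in serialized:
        # Old behaviour: the section was turned into a comment node.
        return "comment"
    if "&lt;![CDATA[" in serialized:
        # New behaviour: the section was kept as escaped text.
        return "escaped-text"
    if "<![CDATA[" in serialized:
        # A literal CDATA section survived serialization.
        return "cdata"
    return "none"


# Fragment copied from the failing test output above.
new_output = (
    '<div id="1">Testing that <span id="2">&lt;![CDATA[that]]&gt;</span>'
    "contains works.</div>"
)
print(cdata_representation(new_output))  # escaped-text
```

If the section is now ordinary text rather than a comment, that would presumably explain why the span with id 2 is matched in addition to the div with id 1, although I have not confirmed this in the parser itself.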

Note that in order to reproduce you need to both upgrade libxml2 and build lxml against the new version. Binary wheels are statically linked to an old version of libxml2, so they do not reproduce the issue yet. For example, I have been able to reproduce it with tox after swapping the installed lxml version:

. .tox/py39/bin/activate
pip uninstall lxml
pip install lxml --no-binary lxml
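
After rebuilding, it may help to confirm which libxml2 the installed lxml actually uses. The following is a small sketch, assuming lxml is importable in the active environment; the function name `linked_libxml2_versions` is made up for illustration, while `LIBXML_VERSION` and `LIBXML_COMPILED_VERSION` are the version tuples exposed by `lxml.etree`.

```python
def linked_libxml2_versions():
    """Return the libxml2 versions lxml runs with and was compiled against.

    Returns None when lxml is not installed in this environment.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        # Version of the libxml2 library loaded at runtime.
        "runtime": etree.LIBXML_VERSION,
        # Version of the libxml2 headers lxml was built against.
        "compiled": etree.LIBXML_COMPILED_VERSION,
    }


versions = linked_libxml2_versions()
if versions is None:
    print("lxml is not installed")
else:
    print("runtime:", versions["runtime"], "compiled:", versions["compiled"])
```

With a binary wheel, both values should report the bundled libxml2 rather than the system one, which is why the issue does not show up there yet.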

I am also not sure whether this is actually a bug in libxml2 or lxml instead.
