Meta charset malformed - LookupError: unknown encoding #139

@nnick14

Description

Affected Goose3 version = 3.1.11

Some websites have malformed meta charset tags, with stray leading or trailing characters inside the attribute value. In the example article (URL in the first code example below), the tag is:

<meta charset="-->UTF-8">

Running this article through Goose raises the errors below, because "--" is extracted as the charset instead of "UTF-8":

Code example with url:

from goose3 import Goose
g = Goose()
article = g.extract(url="https://euroweeklynews.com/2022/05/09/alicante-provinces-orihuela-plans-to-make-the-most-of-its-rainwater/")

Error thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 59, in fromstring
    html = smart_str(html, encoding=encoding)
  File "/goose3/utils/encoding.py", line 116, in smart_str
    return string.encode(encoding, errors)
LookupError: unknown encoding: --

Code example with raw bytes:

from goose3 import Goose
g = Goose()
# article_content holds the raw bytes of the same page (the HTTP response body)
article = g.extract(raw_html=article_content)

Error thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 60, in fromstring
    parser = lxml.html.HTMLParser(encoding=encoding)
  File "/lxml/html/__init__.py", line 1912, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "src/lxml/parser.pxi", line 1725, in lxml.etree.HTMLParser.__init__
  File "src/lxml/parser.pxi", line 837, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'--''
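
The "--" in both tracebacks is what a charset pattern would capture if it accepts hyphens and stops at the first `>`. The snippet below is a minimal sketch of that behavior. The pattern is an assumption: it is the regex proposed under Solution with the extra `[^a-zA-Z0-9]*` guards removed, not a copy of the Goose3 source.

```python
import re

# Assumed shape of the current pattern (inferred, not copied from goose3).
assumed_find_charset = re.compile(
    br'<meta.*?charset=["\']*([a-z0-9\-_]+?) *?["\'>]', flags=re.I
).findall

malformed = b'<meta charset="-->UTF-8">'
well_formed = b'<meta charset="UTF-8">'

# The lazy group takes "--" and the ">" of "-->" ends the match early.
print(assumed_find_charset(malformed))    # [b'--']
print(assumed_find_charset(well_formed))  # [b'UTF-8']
```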

Solution

I'm happy to open a PR for this, but I'm not sure what the cleanest fix is. One option is to adjust the charset regex, as below, so that it skips leading/trailing characters that shouldn't appear in a charset value. The downside is that I'd rather not pile more onto an already complex regex.

import re

# [^a-zA-Z0-9]* on either side of the group skips stray non-alphanumeric characters
find_charset = re.compile(
    br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]', flags=re.I
).findall
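
A quick check of the pattern above against a few tags, with the character ranges written as `A-Z`. Only the first tag comes from the article; the others are made-up samples for illustration.

```python
import re

find_charset = re.compile(
    br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]',
    flags=re.I,
).findall

samples = {
    "leading junk": b'<meta charset="-->UTF-8">',
    "trailing junk": b'<meta charset="UTF-8--">',
    "well formed": b'<meta charset="UTF-8">',
    "http-equiv": b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">',
}

# Run every sample through the pattern and show what it captures.
results = {label: find_charset(tag) for label, tag in samples.items()}
for label, found in results.items():
    print(f"{label}: {found}")
```

In these four cases the group captures only the charset name, but this is not a proof that the pattern is safe on arbitrary markup.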

Alternatively, I could wrap the encoding calls in fromstring and smart_str in a try/except LookupError, so that a malformed encoding falls back to utf-8. That doesn't look particularly clean, though.
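
A minimal sketch of the fallback idea. `encode_with_fallback` is a hypothetical helper, not an existing Goose3 function, and the real smart_str may need to handle more cases than this.

```python
def encode_with_fallback(string, encoding, errors="strict", default="utf-8"):
    """Encode `string`, falling back to `default` if `encoding` is unknown.

    Hypothetical helper for illustration; not part of goose3.
    """
    try:
        return string.encode(encoding, errors)
    except LookupError:
        # str.encode raises LookupError for names such as "--".
        return string.encode(default, errors)


print(encode_with_fallback("café", "--"))       # falls back to utf-8
print(encode_with_fallback("café", "latin-1"))  # known codec, used as given
```

The same try/except would also be needed around the `lxml.html.HTMLParser(encoding=...)` call, since the second traceback shows the LookupError being raised there.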

Lastly, I could add a set of known HTML charset names and check the output of get_encodings_from_content against it, so that only valid charsets are used. Maintaining that set may be annoying, though.
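
One way to validate without maintaining a set by hand is to ask Python's codec registry. This swaps in a different technique from the one described above: `codecs.lookup` checks Python codec names, which are not identical to the list of HTML charset labels. `first_valid_encoding` is a hypothetical helper name.

```python
import codecs


def first_valid_encoding(candidates, default="utf-8"):
    """Return the first candidate that Python recognizes as a codec.

    Hypothetical helper; `candidates` is a list of str or bytes, such as
    the values extracted from the meta tags.
    """
    for candidate in candidates:
        if isinstance(candidate, bytes):
            candidate = candidate.decode("ascii", "ignore")
        try:
            # codecs.lookup raises LookupError for unknown names.
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return default


print(first_valid_encoding([b"--", b"UTF-8"]))  # skips "--", finds utf-8
print(first_valid_encoding([b"--"]))            # nothing valid, uses default
```

Note that the registry also contains non-text codecs (for example "hex"), so a stricter check may still be wanted.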

Let me know what solution works best or if there's some other approach that works better.
