Description
Affected Goose3 version = 3.1.11
Some websites have malformed meta charset tags with stray leading/trailing characters around the charset value. In the example article (URL in the code below), the tag is formatted as:
<meta charset="-->UTF-8">
Running this article through Goose throws the errors below, because "--" is extracted from the tag as the charset instead of "UTF-8":
Code example with url:
from goose3 import Goose
g = Goose()
article = g.extract(url="https://euroweeklynews.com/2022/05/09/alicante-provinces-orihuela-plans-to-make-the-most-of-its-rainwater/")
Error thrown:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 59, in fromstring
    html = smart_str(html, encoding=encoding)
  File "/goose3/utils/encoding.py", line 116, in smart_str
    return string.encode(encoding, errors)
LookupError: unknown encoding: --
Code example with raw bytes:
from goose3 import Goose
# article_content holds the raw HTML bytes of the same article
g = Goose()
article = g.extract(raw_html=article_content)
Error thrown:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 60, in fromstring
    parser = lxml.html.HTMLParser(encoding=encoding)
  File "/lxml/html/__init__.py", line 1912, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "src/lxml/parser.pxi", line 1725, in lxml.etree.HTMLParser.__init__
  File "src/lxml/parser.pxi", line 837, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'--''
Solution
I'm happy to make a PR for this, but I'm not sure what the cleanest fix is. I can adjust the charset regex to something like the pattern below, which skips leading/trailing characters that shouldn't exist in a charset tag, but I don't want to cause regex explosion by adding to an already complex regex.
find_charset = re.compile(
br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]', flags=re.I
).findall
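As a quick sanity check, the proposed pattern can be run against the malformed tag from the article. This is only a sketch: the two well-formed sample tags are made up for illustration, and the character ranges are written as `A-Z` here (an `A-z` range would also span a few punctuation characters).

```python
import re

# Proposed pattern, with the character ranges written as A-Z.
find_charset = re.compile(
    br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]',
    flags=re.I,
).findall

# The malformed tag from the article, plus two illustrative well-formed tags.
malformed = b'<meta charset="-->UTF-8">'
well_formed = b'<meta charset="utf-8">'
legacy = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'

for tag in (malformed, well_formed, legacy):
    print(tag, "->", find_charset(tag))
```

In this check the stray "-->" is skipped and only the charset name is captured, while the well-formed tags still match as before.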
Alternatively, I could catch LookupError in fromstring and in smart_str so that a malformed encoding falls back to utf-8, but that doesn't look particularly clean.
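A minimal sketch of what that fallback could look like, independent of the Goose code base. `encode_with_fallback` is a hypothetical helper name, not an existing goose3 function; it only mirrors the `string.encode(encoding, errors)` call seen in the traceback.

```python
def encode_with_fallback(string, encoding, errors="strict", fallback="utf-8"):
    """Encode `string`, using `fallback` when `encoding` is not a known codec."""
    try:
        return string.encode(encoding, errors)
    except LookupError:
        # Raised for unrecognised codec names such as "--".
        return string.encode(fallback, errors)


# The malformed charset "--" no longer raises; utf-8 is used instead.
print(encode_with_fallback("agua de lluvia", "--"))
# A valid charset is still honoured.
print(encode_with_fallback("agua de lluvia", "latin-1"))
```

The same try/except would have to be repeated around the lxml parser construction, which is part of why this option feels less clean.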
Lastly, I could add a set of valid charset names and compare the output of get_encodings_from_content against it to make sure the extracted charset is valid, but maintaining that set may be annoying.
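One way to sidestep maintaining such a set might be to ask Python's own codec registry whether a name is known. This swaps the hand-written set for `codecs.lookup`, so it is a variation on the idea rather than the idea itself. `filter_known_encodings` is a hypothetical helper, and the sketch assumes the candidates may arrive as either str or bytes.

```python
import codecs


def filter_known_encodings(candidates):
    """Keep only the candidate names that Python's codec registry recognises."""
    known = []
    for name in candidates:
        # Assumption: candidates may be bytes (raw_html path) or str (url path).
        if isinstance(name, bytes):
            name = name.decode("ascii", "ignore")
        try:
            codecs.lookup(name)
        except LookupError:
            # Unknown codec name such as "--"; drop it.
            continue
        known.append(name)
    return known


print(filter_known_encodings([b"--", b"UTF-8", "iso-8859-1"]))
```

A caller could then default to utf-8 whenever the filtered list comes back empty.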
Let me know what solution works best or if there's some other approach that works better.