Meta charset malformed - LookupError: unknown encoding #139

Affected Goose3 version = 3.1.11

Some websites serve malformed meta charset tags with stray leading/trailing characters ([example article url](https://euroweeklynews.com/2022/05/09/alicante-provinces-orihuela-plans-to-make-the-most-of-its-rainwater/)). In this case the tag is formatted as:
```
<meta charset="-->UTF-8">
```
When this article is run through goose, the `--` from the charset attribute is extracted as the encoding name, and the following errors are thrown:

### Code example with url:
```python
from goose3 import Goose
g = Goose()
article = g.extract(url="https://euroweeklynews.com/2022/05/09/alicante-provinces-orihuela-plans-to-make-the-most-of-its-rainwater/")
```
Error thrown:
```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 59, in fromstring
    html = smart_str(html, encoding=encoding)
  File "/goose3/utils/encoding.py", line 116, in smart_str
    return string.encode(encoding, errors)
LookupError: unknown encoding: --
```
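The last frame of that traceback can be reproduced without goose, which may help when testing a fix. A minimal sketch (the sample string is arbitrary; only the encoding name `--` comes from the report above):

```python
# Python has no codec registered under the name "--", so str.encode raises
# LookupError, the same exception that smart_str surfaces in the traceback.
try:
    "some article text".encode("--")
except LookupError as exc:
    message = str(exc)

print(message)
```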
### Code example with raw bytes:
```python
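import requests
from goose3 import Goose
# Assumption: article_content holds the raw bytes of the example article above
article_content = requests.get("https://euroweeklynews.com/2022/05/09/alicante-provinces-orihuela-plans-to-make-the-most-of-its-rainwater/").content
g = Goose()
article = g.extract(raw_html=article_content)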
```
Error thrown:
```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/goose3/__init__.py", line 113, in extract
    return self.__crawl(crawl_candidate)
  File "/goose3/__init__.py", line 140, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/goose3/__init__.py", line 128, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/goose3/crawler.py", line 140, in process
    doc = self.get_document(raw_html)
  File "/goose3/crawler.py", line 338, in get_document
    doc = self.parser.fromstring(raw_html)
  File "/goose3/parsers.py", line 60, in fromstring
    parser = lxml.html.HTMLParser(encoding=encoding)
  File "/lxml/html/__init__.py", line 1912, in __init__
    super(HTMLParser, self).__init__(**kwargs)
  File "src/lxml/parser.pxi", line 1725, in lxml.etree.HTMLParser.__init__
  File "src/lxml/parser.pxi", line 837, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'--''
```

### Solution
I'm happy to make a PR for this, but I'm not sure what the cleanest fix is. I can adjust the [charset regex](https://github.com/goose3/goose3/blob/master/goose3/text.py#L43) to something like the pattern below, so that it skips leading/trailing characters that shouldn't exist in a charset tag, but I don't want to cause regex explosion by adding to an already complex regex.
```python
find_charset = re.compile(
    br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]', flags=re.I
).findall
```
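As a quick check of the pattern above, here is a small sketch that runs it against the malformed tag from this report and against an ordinary `http-equiv` tag. The sample byte strings are made up for illustration, and the character classes are written as `a-zA-Z` here, since `A-z` also spans a few punctuation characters:

```python
import re

# Proposed pattern: skip non-alphanumeric junk on either side of the charset name.
find_charset = re.compile(
    br'<meta.*?charset=["\']*[^a-zA-Z0-9]*([a-zA-Z0-9\-_]+?)[^a-zA-Z0-9]* *?["\'>]',
    flags=re.I,
).findall

# Hypothetical inputs: the malformed tag from this issue and a well-formed one.
malformed = b'<html><head><meta charset="-->UTF-8"></head></html>'
well_formed = b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'

print(find_charset(malformed))    # [b'UTF-8']
print(find_charset(well_formed))  # [b'iso-8859-1']
```

This only shows that the two samples parse as intended; it says nothing about how the pattern behaves on other malformed tags in the wild.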

I could wrap the calls in [fromstring](https://github.com/goose3/goose3/blob/master/goose3/parsers.py#L57) and [smart_str](https://github.com/goose3/goose3/blob/master/goose3/utils/encoding.py#L104) in a `try`/`except LookupError` so that a malformed encoding falls back to utf-8, but that doesn't look particularly clean.
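For reference, the fallback idea could look roughly like this. `encode_with_fallback` is a hypothetical helper written for this issue, not an existing goose3 function; it mirrors the `string.encode(encoding, errors)` call shown in the traceback:

```python
def encode_with_fallback(text, encoding, errors="strict", default="utf-8"):
    """Hypothetical helper: encode text, retrying with a default codec
    when the requested encoding name is unknown to Python."""
    try:
        return text.encode(encoding, errors)
    except LookupError:
        # The charset name scraped from the page is not a registered codec.
        return text.encode(default, errors)


print(encode_with_fallback("Orihuela", "--"))       # falls back to utf-8
print(encode_with_fallback("Orihuela", "latin-1"))  # uses the requested codec
```

The same guard would presumably be needed around the `lxml.html.HTMLParser(encoding=encoding)` call, since the raw-bytes traceback fails there instead.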

Lastly, I could add a set of known charset names and compare the output of [get_encodings_from_content](https://github.com/goose3/goose3/blob/39962b1fc9fd62d8416139480f875d57ed208511/goose3/text.py#L35) against it to make sure the extracted value is valid, but maintenance of that set may be annoying.
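One possible variation on this option, sketched under assumptions: instead of a hand-maintained set, the extracted names could be checked against Python's own codec registry with `codecs.lookup`. `first_valid_encoding` is a hypothetical name, and the sketch accepts both `str` and `bytes` because the two tracebacks above show the name in both forms:

```python
import codecs


def first_valid_encoding(candidates, default="utf-8"):
    """Hypothetical filter: return the first candidate that Python has a
    codec for, or the default when none of them is recognised."""
    for name in candidates:
        if isinstance(name, bytes):
            name = name.decode("ascii", errors="ignore")
        try:
            # codecs.lookup raises LookupError for unknown names such as "--".
            return codecs.lookup(name).name
        except LookupError:
            continue
    return default


print(first_valid_encoding([b"--", b"UTF-8"]))
print(first_valid_encoding(["--"]))
```

The trade-off is that the check only accepts encodings Python can actually use, which is arguably what matters here, since that is what `encode` and lxml need.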

Let me know what solution works best or if there's some other approach that works better.
