Bug Fix: Key error in author extraction #177

Merged (1 commit), Jun 16, 2023

anchitshrivastava
Contributor

Please find the traceback attached below.

  article_t = g.extract(url=url)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 125, in extract
    return self.__crawl(crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 153, in __crawl
    return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/__init__.py", line 141, in crawler_wrapper
    article = crawler.crawl(crawl_candidate)
  File "/usr/local/lib/python3.10/dist-packages/goose3/crawler.py", line 135, in crawl
    return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)
  File "/usr/local/lib/python3.10/dist-packages/goose3/crawler.py", line 183, in process
    self.article._authors = self.authors_extractor.extract()
  File "/usr/local/lib/python3.10/dist-packages/goose3/extractors/authors.py", line 27, in extract
    authors_from_schema = self.__get_authors_from_schema()
  File "/usr/local/lib/python3.10/dist-packages/goose3/extractors/authors.py", line 73, in __get_authors_from_schema
    authors.append(author["name"])
KeyError: 'name'

@barrust barrust merged commit 92a2698 into goose3:master Jun 16, 2023
@erikvullings
Contributor

erikvullings commented Jun 26, 2023

Although this fixes the uncaught KeyError, it sets the author name to "". That leaves a non-empty list of schema authors, so extract() returns it and never falls back to the author name from meta:

    def extract(self):
        authors_from_schema = self.__get_authors_from_schema()
        authors_from_meta = self.__get_authors_from_meta()
        if authors_from_schema:
            return authors_from_schema
        return authors_from_meta
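
To illustrate the concern, here is a minimal standalone sketch. `pick_authors` is a hypothetical stand-in for the selection logic in `extract` above, not goose3 code, and the sample names are made up.

```python
def pick_authors(authors_from_schema, authors_from_meta):
    """Hypothetical stand-in mirroring the selection logic in extract()."""
    if authors_from_schema:
        return authors_from_schema
    return authors_from_meta


# A list holding only an empty string is still truthy, so it is returned.
with_empty_name = pick_authors([""], ["Jane Doe"])

# An empty list is falsy, so the meta authors are used as the fallback.
with_no_name = pick_authors([], ["Jane Doe"])

print(with_empty_name, with_no_name)
```

With an empty-string default, the first case applies and the meta authors are never reached.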

Instead, perhaps use something like the following:

    def __get_authors_from_schema(self):
        authors = []
        if self.article.schema and "author" in self.article.schema:
            schema_authors = self.article.schema["author"]
            # A single author object is wrapped so it can be handled like a list.
            if isinstance(schema_authors, dict):
                schema_authors = [schema_authors]
            for author in schema_authors:
                if isinstance(author, dict):
                    # Skip author objects without a usable "name" instead of adding "".
                    name = author.get("name")
                    if name:
                        authors.append(name)
                else:
                    authors.append(author)
        return authors

I received this error on a page prepared by Reuters, where the author in the schema was an object "author":{"@type":"Person","byline":"Nia Williams"}, which has no name key.
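
As a rough check of how the suggested logic would treat that object, here is a minimal standalone sketch. `authors_from_schema` is a hypothetical free-function version of the method above that takes the schema dict directly; it is not goose3's API, and the second sample input is made up.

```python
def authors_from_schema(schema):
    """Hypothetical standalone version of the suggested method."""
    authors = []
    if schema and "author" in schema:
        schema_authors = schema["author"]
        # A single author object is wrapped so it can be handled like a list.
        if isinstance(schema_authors, dict):
            schema_authors = [schema_authors]
        for author in schema_authors:
            if isinstance(author, dict):
                # Only keep author objects that carry a non-empty "name".
                name = author.get("name")
                if name:
                    authors.append(name)
            else:
                authors.append(author)
    return authors


# The Reuters-style object from the comment: a Person with "byline" but no "name".
reuters_like = {"author": {"@type": "Person", "byline": "Nia Williams"}}
no_name = authors_from_schema(reuters_like)

# Made-up input mixing an author object and a plain string entry.
named = authors_from_schema(
    {"author": [{"@type": "Person", "name": "Jane Doe"}, "John Roe"]}
)

print(no_name, named)
```

Because the first call yields an empty list, a caller like `extract` would fall through to the authors from meta rather than returning an empty name.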
