
return context instead of content when content is list #160


Merged: 1 commit merged into goose3:master on Mar 4, 2023

Conversation

catdingding (Contributor)

To my knowledge, "schema" should be a dictionary rather than a list.
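
For context, JSON-LD allows the top level of an application/ld+json block to be a list of objects (or an "@graph" array), so parsing it can yield a Python list rather than a dict. Below is a minimal sketch of the kind of normalization this change implies; the helper name pick_schema_dict and the "@type" filter are illustrative assumptions, not goose3's actual patch:

```python
import json

def pick_schema_dict(ld_json_text):
    """Return a single schema.org dict from an application/ld+json payload.

    JSON-LD allows the top level to be a list of objects, so json.loads can
    hand back a list; callers that rely on dict methods such as .get() need
    that case normalized first. Helper name and @type filter are illustrative
    only, not goose3's actual code.
    """
    data = json.loads(ld_json_text)
    if isinstance(data, list):
        # Prefer an article-like entry; otherwise fall back to the first dict.
        dicts = [item for item in data if isinstance(item, dict)]
        articles = [d for d in dicts if d.get("@type") in ("Article", "NewsArticle")]
        data = (articles or dicts or [None])[0]
    return data if isinstance(data, dict) else None


payload = '[{"@type": "NewsArticle", "headline": "Example", "publisher": {"name": "CNA"}}]'
schema = pick_schema_dict(payload)
print(schema["publisher"]["name"])  # -> CNA
```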

codecov bot commented Mar 4, 2023

Codecov Report

Merging #160 (a21e2e9) into master (5eb73c8) will not change coverage.
The diff coverage is 0.00%.


@@           Coverage Diff           @@
##           master     #160   +/-   ##
=======================================
  Coverage   91.07%   91.07%           
=======================================
  Files          30       30           
  Lines        2419     2419           
=======================================
  Hits         2203     2203           
  Misses        216      216           
Impacted Files                  Coverage Δ
goose3/extractors/schema.py     75.00% <0.00%> (ø)

barrust (Collaborator) commented Mar 4, 2023

Do you have an example of when this returned the wrong type? I would like to include a test if there is an example of this happening.

barrust (Collaborator) commented Mar 4, 2023

Even if it is just a URL that reproduces it, I can make a test to ensure that this doesn't regress!
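
A regression test along these lines could lock the behavior in; the raw_html snippet, test name, and assertion are assumptions for illustration, not taken from the merged test suite:

```python
from goose3 import Goose

# Minimal stand-in page whose application/ld+json payload is a top-level
# list, mirroring the failing CNA article; the markup itself is assumed.
RAW_HTML = """
<html>
  <head>
    <title>Example headline - Example Site</title>
    <script type="application/ld+json">
      [{"@type": "NewsArticle",
        "headline": "Example headline",
        "publisher": {"name": "Example Site"}}]
    </script>
  </head>
  <body><p>Some body text long enough to be extracted.</p></body>
</html>
"""


def test_list_valued_ld_json_does_not_break_extraction():
    g = Goose()
    # Before the fix this call raised
    # AttributeError: 'list' object has no attribute 'get'
    # inside TitleExtractor.clean_title.
    article = g.extract(raw_html=RAW_HTML)
    assert article is not None
```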

catdingding (Contributor, Author)

https://www.cna.com.tw/news/aipl/202302210388.aspx

This is the error that occurred when I was using the current version:

AttributeError                            Traceback (most recent call last)
Input In [52], in <cell line: 3>()
      1 url  = 'https://www.cna.com.tw/news/aloc/202303040120.aspx'
      2 g = Goose({'stopwords_class': StopWordsChinese})
----> 3 article = g.extract(url=url)
      4 print(article.cleaned_text)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:125, in Goose.extract(self, url, raw_html)
    122     raise ValueError("Either url or raw_html should be provided")
    124 crawl_candidate = CrawlCandidate(self.config, url, raw_html)
--> 125 return self.__crawl(crawl_candidate)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:153, in Goose.__crawl(self, crawl_candidate)
    151 parsers = list(self.config.available_parsers)
    152 parsers.remove(self.config.parser_class)
--> 153 return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:141, in Goose.__crawl.<locals>.crawler_wrapper(parser, parsers, crawl_candidate)
    139 try:
    140     crawler = Crawler(self.config, self.fetcher)
--> 141     article = crawler.crawl(crawl_candidate)
    142 except (UnicodeDecodeError, ValueError) as ex:
    143     logger.error("Parser %s failed to parse the content", parser)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/crawler.py:134, in Crawler.crawl(self, crawl_candidate)
    131     logger.warning("No raw_html is provided or could be fetched; continuing with an empty Article object")
    132     return self.article
--> 134 return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/crawler.py:185, in Crawler.process(self, raw_html, final_url, link_hash)
    182 self.article._authors = self.authors_extractor.extract()
    184 # title
--> 185 self.article._title = self.title_extractor.extract()
    187 # jump through some hoops on attempting to get a language if not found
    188 if self.article._meta_lang is None:

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:103, in TitleExtractor.extract(self)
    102 def extract(self):
--> 103     return self.get_title()

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:84, in TitleExtractor.get_title(self)
     82 # rely on opengraph in case we have the data
     83 if "title" in self.article.opengraph:
---> 84     return self.clean_title(self.article.opengraph["title"])
     85 if self.article.schema and "headline" in self.article.schema:
     86     return self.clean_title(self.article.schema["headline"])

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:44, in TitleExtractor.clean_title(self, title)
     42 if "site_name" in self.article.opengraph and self.article.opengraph["site_name"] != title:
     43     site_name = self.article.opengraph["site_name"]
---> 44 elif schema and schema.get("publisher") and schema["publisher"].get("name"):
     45     site_name = self.article.schema["publisher"]["name"]
     47 # if there is a sperator, speratate and check if site name is present

AttributeError: 'list' object has no attribute 'get'
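
The last frame shows the root cause: when a page's ld+json block is a top-level array, parsing it (presumably with json.loads) yields a plain Python list, and a list has no .get:

```python
import json

# A top-level JSON array parses to a Python list, not a dict.
schema = json.loads('[{"@type": "NewsArticle", "headline": "Example"}]')
print(type(schema))      # <class 'list'>
schema.get("publisher")  # raises AttributeError: 'list' object has no attribute 'get'
```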

@barrust barrust merged commit ba2b334 into goose3:master Mar 4, 2023
@catdingding catdingding deleted the patch-1 branch March 5, 2023 05:29