
return context instead of content when content is list #160


Merged: 1 commit merged into goose3:master on Mar 4, 2023

Conversation

catdingding (Contributor)

To my knowledge, "schema" should be a dictionary rather than a list.
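
For context, JSON-LD allows the top level of an application/ld+json block to be a list of objects (or an "@graph" array), so parsing it can yield a Python list rather than a dict. Below is a minimal sketch of the kind of normalization this change implies; the helper name pick_schema_dict and the "@type" filter are illustrative assumptions, not goose3's actual patch:

```python
import json

def pick_schema_dict(ld_json_text):
    """Return a single schema.org dict from an application/ld+json payload.

    JSON-LD allows the top level to be a list of objects, so json.loads can
    hand back a list; callers that rely on dict methods such as .get() need
    that case normalized first. Helper name and @type filter are illustrative
    only, not goose3's actual code.
    """
    data = json.loads(ld_json_text)
    if isinstance(data, list):
        # Prefer an article-like entry; otherwise fall back to the first dict.
        dicts = [item for item in data if isinstance(item, dict)]
        articles = [d for d in dicts if d.get("@type") in ("Article", "NewsArticle")]
        data = (articles or dicts or [None])[0]
    return data if isinstance(data, dict) else None


payload = '[{"@type": "NewsArticle", "headline": "Example", "publisher": {"name": "CNA"}}]'
schema = pick_schema_dict(payload)
print(schema["publisher"]["name"])  # -> CNA
```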

codecov bot commented Mar 4, 2023

Codecov Report

Merging #160 (a21e2e9) into master (5eb73c8) will not change coverage.
The diff coverage is 0.00%.


@@           Coverage Diff           @@
##           master     #160   +/-   ##
=======================================
  Coverage   91.07%   91.07%           
=======================================
  Files          30       30           
  Lines        2419     2419           
=======================================
  Hits         2203     2203           
  Misses        216      216           
Impacted Files                  Coverage Δ
goose3/extractors/schema.py     75.00% <0.00%> (ø)

barrust (Collaborator) commented Mar 4, 2023

Do you have an example of when this returned the wrong type? I would like to include a test if there is an example of this happening.

barrust (Collaborator) commented Mar 4, 2023

Even if it is just a URL that reproduces it, I can make a test to ensure that this doesn't regress!
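
A regression test along these lines could lock the behavior in; the raw_html snippet, test name, and assertion are assumptions for illustration, not taken from the merged test suite:

```python
from goose3 import Goose

# Minimal stand-in page whose application/ld+json payload is a top-level
# list, mirroring the failing CNA article; the markup itself is assumed.
RAW_HTML = """
<html>
  <head>
    <title>Example headline - Example Site</title>
    <script type="application/ld+json">
      [{"@type": "NewsArticle",
        "headline": "Example headline",
        "publisher": {"name": "Example Site"}}]
    </script>
  </head>
  <body><p>Some body text long enough to be extracted.</p></body>
</html>
"""


def test_list_valued_ld_json_does_not_break_extraction():
    g = Goose()
    # Before the fix this call raised
    # AttributeError: 'list' object has no attribute 'get'
    # inside TitleExtractor.clean_title.
    article = g.extract(raw_html=RAW_HTML)
    assert article is not None
```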

catdingding (Contributor, Author)

https://www.cna.com.tw/news/aipl/202302210388.aspx

This is the error that occurred when I was using the current version:

AttributeError                            Traceback (most recent call last)
Input In [52], in <cell line: 3>()
      1 url  = 'https://www.cna.com.tw/news/aloc/202303040120.aspx'
      2 g = Goose({'stopwords_class': StopWordsChinese})
----> 3 article = g.extract(url=url)
      4 print(article.cleaned_text)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:125, in Goose.extract(self, url, raw_html)
    122     raise ValueError("Either url or raw_html should be provided")
    124 crawl_candidate = CrawlCandidate(self.config, url, raw_html)
--> 125 return self.__crawl(crawl_candidate)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:153, in Goose.__crawl(self, crawl_candidate)
    151 parsers = list(self.config.available_parsers)
    152 parsers.remove(self.config.parser_class)
--> 153 return crawler_wrapper(self.config.parser_class, parsers, crawl_candidate)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/__init__.py:141, in Goose.__crawl.<locals>.crawler_wrapper(parser, parsers, crawl_candidate)
    139 try:
    140     crawler = Crawler(self.config, self.fetcher)
--> 141     article = crawler.crawl(crawl_candidate)
    142 except (UnicodeDecodeError, ValueError) as ex:
    143     logger.error("Parser %s failed to parse the content", parser)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/crawler.py:134, in Crawler.crawl(self, crawl_candidate)
    131     logger.warning("No raw_html is provided or could be fetched; continuing with an empty Article object")
    132     return self.article
--> 134 return self.process(raw_html, parse_candidate.url, parse_candidate.link_hash)

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/crawler.py:185, in Crawler.process(self, raw_html, final_url, link_hash)
    182 self.article._authors = self.authors_extractor.extract()
    184 # title
--> 185 self.article._title = self.title_extractor.extract()
    187 # jump through some hoops on attempting to get a language if not found
    188 if self.article._meta_lang is None:

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:103, in TitleExtractor.extract(self)
    102 def extract(self):
--> 103     return self.get_title()

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:84, in TitleExtractor.get_title(self)
     82 # rely on opengraph in case we have the data
     83 if "title" in self.article.opengraph:
---> 84     return self.clean_title(self.article.opengraph["title"])
     85 if self.article.schema and "headline" in self.article.schema:
     86     return self.clean_title(self.article.schema["headline"])

File ~/.pyenv/versions/3.10.4/lib/python3.10/site-packages/goose3/extractors/title.py:44, in TitleExtractor.clean_title(self, title)
     42 if "site_name" in self.article.opengraph and self.article.opengraph["site_name"] != title:
     43     site_name = self.article.opengraph["site_name"]
---> 44 elif schema and schema.get("publisher") and schema["publisher"].get("name"):
     45     site_name = self.article.schema["publisher"]["name"]
     47 # if there is a sperator, speratate and check if site name is present

AttributeError: 'list' object has no attribute 'get'
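
The last frame shows the root cause: when a page's ld+json block is a top-level array, parsing it (presumably with json.loads) yields a plain Python list, and a list has no .get:

```python
import json

# A top-level JSON array parses to a Python list, not a dict.
schema = json.loads('[{"@type": "NewsArticle", "headline": "Example"}]')
print(type(schema))      # <class 'list'>
schema.get("publisher")  # raises AttributeError: 'list' object has no attribute 'get'
```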

@barrust barrust merged commit ba2b334 into goose3:master Mar 4, 2023
@catdingding catdingding deleted the patch-1 branch March 5, 2023 05:29