Closed
Description
I am using goose3 to extract articles from news websites. I have noticed that letters or words that are set in bold or otherwise highlighted go missing after extraction. To reproduce, try the following:
from goose3 import Goose
g = Goose()
article = g.extract(url='https://www.economist.com/united-states/2023/10/04/the-sacking-of-kevin-mccarthy-will-make-supporting-ukraine-harder')
print(article.cleaned_text)
The expected output:
Kevin mccarthy’s stint as speaker of America’s House of Representatives ended the way it had begun.........
Actual output:
K stint as speaker of America’s House of Representatives ended the way it had begun...........
This happens because the text "evin mccarthy’s" is wrapped in a "small" tag.
I believe the problem stems from this line: Link
If I remove this function, things work fine. I am willing to fix this problem myself and would like some input from the maintainers first. Should I add a boolean to the config, such as remove_fewwords_paragraphs? If it is true, the function is executed; otherwise it is skipped.
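To make the proposal concrete, here is a rough, self-contained sketch of how such a flag could gate the cleanup step. It is not goose3's actual code: SketchConfig, drop_short_elements, cleaned_text, and the three-word threshold are all assumptions made for illustration, and it uses the standard library's ElementTree rather than goose3's parser.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical option named after the proposal in this issue.
    # Defaults to True so that existing behaviour would be unchanged.
    remove_fewwords_paragraphs: bool = True


def drop_short_elements(root, min_words=3):
    """Remove child elements holding fewer than min_words words.

    The removed element's tail text is kept by attaching it to the
    previous sibling (or to the parent's text). The threshold is an
    assumption for this sketch, not a value taken from goose3.
    """
    for parent in list(root.iter()):
        prev = None
        for child in list(parent):
            words = "".join(child.itertext()).split()
            if len(words) < min_words:
                tail = child.tail or ""
                if prev is not None:
                    prev.tail = (prev.tail or "") + tail
                else:
                    parent.text = (parent.text or "") + tail
                parent.remove(child)
            else:
                prev = child


def cleaned_text(html, config):
    """Return the text of the fragment, optionally running the cleanup."""
    root = ET.fromstring(html)
    if config.remove_fewwords_paragraphs:
        drop_short_elements(root)
    return "".join(root.itertext())


if __name__ == "__main__":
    html = "<p>K<small>evin mccarthy's</small> stint as speaker</p>"
    # Flag on: the short <small> element is dropped, as in the report.
    print(cleaned_text(html, SketchConfig(remove_fewwords_paragraphs=True)))
    # Flag off: the full sentence survives.
    print(cleaned_text(html, SketchConfig(remove_fewwords_paragraphs=False)))
```

With the flag on, the sketch reproduces the "K stint as speaker" symptom; with it off, the text inside the "small" tag is preserved. Keeping the default at True would leave current behaviour untouched for existing users.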