Few tokens missing after extraction. #190

@tusharg7797

I am using goose3 to extract articles from news websites. I have noticed that letters or words that are bold or highlighted go missing after extraction. To reproduce, try the following:

from goose3 import Goose

g = Goose()
article = g.extract(url='https://www.economist.com/united-states/2023/10/04/the-sacking-of-kevin-mccarthy-will-make-supporting-ukraine-harder')
print(article.cleaned_text)

The expected output:

Kevin mccarthy’s stint as speaker of America’s House of Representatives ended the way it had begun.........

Actual output:

K stint as speaker of America’s House of Representatives ended the way it had begun...........

This is because the characters "evin mccarthy’s" are inside a <small> tag.

I believe the problem stems from this line: Link

If I remove this function, extraction works fine. I am willing to fix this problem myself and would like some input from the maintainers: should I add a boolean to the config, such as remove_fewwords_paragraphs, so that the function is executed only when it is true?
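To make the proposal concrete, here is a minimal, self-contained sketch of how such a flag could gate the cleanup step. It does not use goose3's code: `SketchConfig`, `drop_short_elements`, and `extract_text` are hypothetical names, the stand-in cleanup is only my guess at the behaviour (drop elements containing fewer than three words), and the sample markup is a simplified assumption about how the article's first paragraph is structured.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class SketchConfig:
    # Hypothetical option proposed in this issue; not an existing goose3 setting.
    remove_fewwords_paragraphs: bool = True


def drop_short_elements(root, min_words=3):
    """Stand-in for the cleanup step: drop child elements with fewer than min_words words."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            words = "".join(child.itertext()).split()
            if len(words) >= min_words:
                previous = child
                continue
            # Keep the text that follows the element (its tail); only the
            # element's own text is lost, which is the symptom reported here.
            tail = child.tail or ""
            if previous is None:
                parent.text = (parent.text or "") + tail
            else:
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


def extract_text(markup, config):
    root = ET.fromstring(markup)
    if config.remove_fewwords_paragraphs:
        drop_short_elements(root)
    return "".join(root.itertext())


# Assumed, simplified markup for the opening of the article.
SAMPLE = "<p>K<small>evin mccarthy’s</small> stint as speaker</p>"

print(extract_text(SAMPLE, SketchConfig(remove_fewwords_paragraphs=True)))
print(extract_text(SAMPLE, SketchConfig(remove_fewwords_paragraphs=False)))
```

With the flag left at its default of true, the sketch reproduces the truncated text; with it set to false, the text inside the <small> tag is preserved. Defaulting to true would keep the current behaviour for existing users.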
