Skip to content

Fix a bug when int publish date lead to TypeError #157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 21, 2023
Merged

Fix a bug when int publish date lead to TypeError #157

merged 6 commits into from
Feb 21, 2023

Conversation

Amaimersion
Copy link
Contributor

See https://www.linkedin.com/pulse/you-getting-raise-year-cnbc/

self.publishdate_extractor.extract() extracts int publish date - 1676551869000. Because of that dateutil failes to parse it and raises exception - TypeError: Parser must be a string or character stream, not int. Which lead to exception raise by self._publish_date_to_utc(), which lead to entire parsing fail.

I don't sure if catching this TypeError is the best way to fix it. Perhaps it is best to fix self.publishdate_extractor.extract() so that it shouldn't return int result at all.

@barrust
Copy link
Collaborator

barrust commented Feb 20, 2023

I wonder if just forcing it to a string would "fix" the error, or at least make it not fail with a TypeError.

That would mean changing line 175 in the crawler.py file to:

self.article._publish_date = str(self.publishdate_extractor.extract())

@codecov
Copy link

codecov bot commented Feb 21, 2023

Codecov Report

Merging #157 (d0a20f0) into master (7be4dbc) will decrease coverage by 0.01%.
The diff coverage is 84.61%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #157      +/-   ##
==========================================
- Coverage   91.03%   91.03%   -0.01%     
==========================================
  Files          30       30              
  Lines        2409     2420      +11     
==========================================
+ Hits         2193     2203      +10     
- Misses        216      217       +1     
Impacted Files Coverage Δ
goose3/crawler.py 91.28% <0.00%> (ø)
goose3/extractors/publishdate.py 92.30% <91.66%> (-0.55%) ⬇️

@Amaimersion
Copy link
Contributor Author

Maybe let's try to parse this UNIX time into expected ISO 8601 format?

@Amaimersion
Copy link
Contributor Author

And keep this TypeError for further cases.

@Amaimersion
Copy link
Contributor Author

If you want you can move this entire code block, which translates UNIX time into ISO 8601 time, into separate function https://github.com/goose3/goose3/pull/157/files#diff-fe6e242d7cae1fa6f728979e6467729f9ad50d16afbb0e50386a40c2547669beR269

This way in further we will be able to add support for UNIX time not only for datePublished

@Amaimersion
Copy link
Contributor Author

I thought about forcing it to string. But translating UNIX time into ISO 8601 time looks like a more correct solution because it will allow to initialize publish_datetime_utc field

@barrust
Copy link
Collaborator

barrust commented Feb 21, 2023

This PR looks great! Thanks!

@barrust barrust merged commit 991df51 into goose3:master Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants