@adbar (Contributor) commented Jul 15, 2020

  • code copied from other extractors
  • trafilatura added to requirements.txt
  • JSON output works on my computer (added to the files)
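For readers who have not seen the other extractor scripts, here is a minimal sketch of what such a script might look like. The directory layout (`html/*.html.gz`, `output/trafilatura.json`), the `articleBody` key, and the names `run_extractor` and `main` are assumptions made for illustration and are not taken from this PR; only `trafilatura.extract` is the library's actual entry point.

```python
import gzip
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def run_extractor(
    html_dir: Path,
    output_path: Path,
    extract: Callable[[str], Optional[str]],
) -> Dict[str, dict]:
    """Run `extract` on every gzipped HTML page and save the results as JSON.

    The layout and the "articleBody" key are assumptions for this sketch.
    """
    results: Dict[str, dict] = {}
    for path in sorted(html_dir.glob("*.html.gz")):
        with gzip.open(path, "rt", encoding="utf8") as f:
            html = f.read()
        # Use the file name without the ".html.gz" suffix as the item id.
        item_id = path.name[: -len(".html.gz")]
        # An extractor may return None when it finds no article text.
        results[item_id] = {"articleBody": extract(html) or ""}
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(
        json.dumps(results, sort_keys=True, ensure_ascii=False, indent=4),
        encoding="utf8",
    )
    return results


def main() -> None:
    # Third-party dependency (pip install trafilatura); imported here so the
    # helper above stays usable without it.
    import trafilatura

    # Hypothetical paths, chosen for illustration only.
    run_extractor(Path("html"), Path("output/trafilatura.json"), trafilatura.extract)
```

Calling `main()` from the repository root would write one JSON file with an entry per page. Passing the extractor in as a function keeps the file handling identical across libraries, which is one way to read "code copied from other extractors".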

@lopuhin (Member) left a comment

Looks awesome, thanks @adbar!

I checked that I get the same output files locally, and ran evaluation:

AutoExtract          precision=0.984 ± 0.003  recall=0.956 ± 0.010  F1=0.970 ± 0.005 accuracy=0.470 ± 0.037 
Diffbot              precision=0.958 ± 0.009  recall=0.944 ± 0.013  F1=0.951 ± 0.010 accuracy=0.348 ± 0.035 
boilerpipe           precision=0.850 ± 0.017  recall=0.870 ± 0.020  F1=0.860 ± 0.017 accuracy=0.006 ± 0.006 
dragnet              precision=0.925 ± 0.012  recall=0.889 ± 0.018  F1=0.907 ± 0.013 accuracy=0.221 ± 0.032 
html-text            precision=0.500 ± 0.017  recall=0.994 ± 0.001  F1=0.665 ± 0.015 accuracy=0.000 ± 0.000 
newspaper            precision=0.917 ± 0.013  recall=0.906 ± 0.017  F1=0.912 ± 0.014 accuracy=0.260 ± 0.032 
readability          precision=0.913 ± 0.014  recall=0.931 ± 0.015  F1=0.922 ± 0.013 accuracy=0.315 ± 0.035 
trafilatura          precision=0.925 ± 0.011  recall=0.966 ± 0.009  F1=0.945 ± 0.009 accuracy=0.221 ± 0.031 
xpath-text           precision=0.246 ± 0.015  recall=0.992 ± 0.001  F1=0.394 ± 0.019 accuracy=0.000 ± 0.000 

so it's extremely impressive: trafilatura's F1 of 0.945 is behind only AutoExtract and Diffbot. I'll update the results in the README after the merge.
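To make the columns in the evaluation output easier to read, here is a rough sketch of how precision, recall, F1 and accuracy could be computed for a single page. It assumes a plain whitespace-token overlap and treats accuracy as an exact match after whitespace normalization; the benchmark's own evaluation script may tokenize and aggregate differently, and the names `token_prf` and `exact_match` are hypothetical.

```python
from collections import Counter
from typing import Tuple


def token_prf(predicted: str, expected: str) -> Tuple[float, float, float]:
    """Precision, recall and F1 over whitespace tokens (multiset overlap).

    Simplified illustration only, not the benchmark's evaluation code.
    """
    pred = Counter(predicted.split())
    gold = Counter(expected.split())
    # Tokens found in both texts, counted with multiplicity.
    true_positives = sum((pred & gold).values())
    precision = true_positives / sum(pred.values()) if pred else 0.0
    recall = true_positives / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(predicted: str, expected: str) -> bool:
    """Assumed meaning of 'accuracy': identical text after whitespace normalization."""
    return " ".join(predicted.split()) == " ".join(expected.split())


# An extraction that keeps two navigation words and drops the last word.
p, r, f1 = token_prf("Menu Home The quick brown fox", "The quick brown fox jumps")
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
print("exact match:", exact_match("Menu Home The quick brown fox", "The quick brown fox jumps"))
```

Under these assumptions, an extractor that returns all text on the page keeps recall near 1 while precision drops and exact matches become rare, which is the pattern in the html-text and xpath-text rows.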

@lopuhin lopuhin merged commit c69f560 into scrapinghub:master Jul 15, 2020