-
Notifications
You must be signed in to change notification settings - Fork 290
Adding text lengths measurement #44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I think it should be: >>> lengths.compute(texts=['hello']) Does that work? |
It gives another error:
|
The documentation is not available anymore as the PR was closed or merged. |
I think this is the culprit: features=datasets.Features({
'texts': datasets.Value('string'),
}) The feature names have to match the |
Oh I didn't know that! Thank you for pointing it out. Are you ok with returning both
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @sashavor!
I left a few comments on the script. In addition it would be great if you could add a README.md
, requirements.txt
, and app.py
. You can have a look at the template for that.
"""Returns the lengths""" | ||
lengths = [len(word_tokenize(text)) for text in texts] | ||
average_length = mean(lengths) | ||
return {"average_length": average_length, "all lengths":lengths} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in favour of keeping the outputs simple where possible and just return aggregated score. If somebody wanted the score per sample they could still do:
for text in texts:
score = measure.compute([text])
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so it would only work with 1 input string, not a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, what I mean is if you want the aggregate statistics you do:
score = measure.compute(all_texts)
If you really need the result per samples you could do:
for text in texts:
score = measure.compute([text])
...
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you propose only returning the average_length
?
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
changing TextLength to WordLenth
Changing to single output, instead of mean and list of lengths.
adding docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor comments. LGTM 🚀
changing tokenizer to Callable
.... but there's a bug that's driving me mad, @lvwerra can you help? it's probably something silly but I can't figure it out 🙈
When I call
compute()
, it says:But there is a single argument, why is it saying I'm giving 2?...