Adding text lengths measurement #44

sashavor · 2022-05-17T15:46:07Z

.... but there's a bug that's driving me mad, @lvwerra can you help? it's probably something silly but I can't figure it out 🙈

When I call compute(), it says:

>>> lengths.compute('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: compute() takes 1 positional argument but 2 were given

But there is a single argument, why is it saying I'm giving 2?...

lvwerra · 2022-05-17T15:50:39Z

I think it should be:

>>> lengths.compute(texts=['hello'])

Does that work?
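A minimal sketch of why the traceback counts two arguments, assuming `compute` is declared with keyword-only parameters (which the error message suggests). `ToyMeasurement` is a hypothetical stand-in, not the actual evaluate class:

```python
class ToyMeasurement:
    """Hypothetical stand-in for an evaluation module; not the evaluate API."""

    def compute(self, *, texts=None):
        # The bare `*` makes every parameter after `self` keyword-only.
        return {"n_texts": len(texts)}


lengths = ToyMeasurement()

try:
    # `self` is passed implicitly, so 'hello' counts as a second positional argument.
    lengths.compute("hello")
except TypeError as err:
    error_message = str(err)
    print(error_message)

# Passing the input by keyword satisfies the keyword-only signature.
result = lengths.compute(texts=["hello"])
print(result)
```

The bound instance is the "1 positional argument" the method accepts, so any value passed positionally is reported as one too many.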

sashavor · 2022-05-17T15:51:48Z

It gives another error:

>>> lengths.compute(texts=['hello'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sasha/Documents/HuggingFace/evaluate/src/evaluate/module.py", line 427, in compute
    self._finalize()
  File "/home/sasha/Documents/HuggingFace/evaluate/src/evaluate/module.py", line 379, in _finalize
    file_paths, filelocks = self._get_all_cache_files()
  File "/home/sasha/Documents/HuggingFace/evaluate/src/evaluate/module.py", line 296, in _get_all_cache_files
    raise ValueError(
ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.

HuggingFaceDocBuilderDev · 2022-05-17T15:58:12Z

The documentation is not available anymore as the PR was closed or merged.

lvwerra · 2022-05-17T15:58:15Z

I think this is the culprit:

features=datasets.Features({
    'texts': datasets.Value('string'),
})

The feature names have to match the _compute args.
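The naming rule can be illustrated with a small check. This is only a sketch: `check_feature_names` is a hypothetical helper, not part of evaluate, and the mismatched argument name `text` is an assumption, since the thread does not show the original `_compute` signature.

```python
import inspect


def check_feature_names(feature_names, compute_fn):
    """Return the feature names that compute_fn does not accept as arguments."""
    params = inspect.signature(compute_fn).parameters
    return sorted(name for name in feature_names if name not in params)


def _compute_mismatched(text):
    # Argument name differs from the declared feature 'texts'.
    return {"average_length": 0.0}


def _compute_matched(texts):
    # Argument name equals the declared feature 'texts'.
    return {"average_length": 0.0}


feature_names = ["texts"]
missing_before = check_feature_names(feature_names, _compute_mismatched)
missing_after = check_feature_names(feature_names, _compute_matched)
print(missing_before)  # ['texts']
print(missing_after)   # []
```

An empty list means every declared feature has a matching `_compute` argument.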

sashavor · 2022-05-17T16:01:14Z

Oh I didn't know that! Thank you for pointing it out.

Are you ok with returning both average_length and all the individual lengths? I think both would potentially be useful as measurements:

{'average_length': 273.66, 'all lengths': [335, 253, 121, 157, 424, 142, 129, 359, 600, 268, 331, 157, 152, 166, 436, 240, 89, 1078, 95, 214, 259, 182, 230, 393, 136, 241, 309, 135, 281, 479, 183, 179, 170, 244, 475, 141, 174, 344, 272, 140, 156, 148, 994, 253, 663, 163, 143, 136, 164, 150]}

lvwerra

Thanks for working on this @sashavor!

I left a few comments on the script. In addition, it would be great if you could add a README.md, requirements.txt, and app.py. You can have a look at the template for that.

measurements/textlengths/textlengths.py

lvwerra · 2022-05-18T07:10:52Z

+        """Returns the lengths"""
+        lengths = [len(word_tokenize(text)) for text in texts]
+        average_length = mean(lengths)
+        return {"average_length": average_length, "all lengths":lengths}


I am in favour of keeping the outputs simple where possible and just returning the aggregated score. If somebody wanted the score per sample they could still do:

for text in texts:
    score = measure.compute([text])
    ...

sashavor · 2022-05-18

OK, so it would only work with one input string, not a list?

lvwerra · 2022-05-18

No, what I mean is that if you want the aggregate statistics you do:

score = measure.compute(all_texts)

If you really need the result per sample, you could do:

for text in texts:
    score = measure.compute([text])
    ...

Does that make sense?

sashavor · 2022-05-18

So you propose only returning the average_length?
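The two usage patterns discussed in this thread can be sketched as follows. `ToyWordLength` is a hypothetical measurement that takes its input positionally, as in the snippets above, and uses whitespace splitting in place of a real tokenizer; it is not the module added in this PR.

```python
from statistics import mean


class ToyWordLength:
    """Hypothetical measurement; whitespace split stands in for a real tokenizer."""

    def compute(self, texts):
        # One length per text, aggregated into a single score.
        lengths = [len(text.split()) for text in texts]
        return {"average_length": mean(lengths)}


measure = ToyWordLength()
all_texts = ["hello world", "a slightly longer sentence here"]

# Aggregate statistic over the whole list.
aggregate = measure.compute(all_texts)

# Per-sample results: call compute on single-item lists.
per_sample = [measure.compute([text])["average_length"] for text in all_texts]

print(aggregate)
print(per_sample)
```

Returning only the aggregate keeps the output small, while the loop recovers the individual lengths at the cost of one call per text.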


lvwerra

Just a few minor comments. LGTM 🚀

measurements/word_length/README.md

added text lengths but there's a bug that's driving me mad

bbc6f0a

sashavor requested a review from lvwerra May 17, 2022 15:46

sashavor marked this pull request as draft May 17, 2022 15:46

lvwerra reviewed May 18, 2022

Sasha Luccioni and others added 13 commits May 18, 2022 10:01

Update measurements/textlengths/textlengths.py

7686f3d

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Update textlengths.py

98117b0

changing TextLength to WordLenth

Update and rename textlengths.py to word_length.py

4da82cf

Changing to single output, instead of mean and list of lengths.

Update word_length.py

569fe72

adding docstring

renaming text length to word length

74df48c

fixing merge conflict, I hope

9fcd107

oops, fixing doc string

eba3fc3

missing parenthesis

c41652f

adding tokenizer arg, updating docstring

506e5a3

updating docstring

307ac0d

adding README

6766264

adding app.py

2025aad

adding requirements.txt

1db39f5

lvwerra marked this pull request as ready for review May 18, 2022 16:25

lvwerra approved these changes May 18, 2022

Sasha Luccioni added 4 commits May 19, 2022 08:28

Update word_length.py

051757b

Update README.md

b24c56a

changing tokenizer to Callable

Update README.md

0470808

Update word_length.py

90dda1e

sashavor merged commit ac19337 into main May 19, 2022

lvwerra mentioned this pull request May 19, 2022

fix measurement tests #52

Merged

lvwerra mentioned this pull request May 23, 2022

Adding measurements directory for DMT and other data measurement work… #35

Closed

lvwerra deleted the textlengths branch July 24, 2022 12:30
