Closed
Hi, thanks for sharing the wonderful project and data. I am trying to use the released data for training my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example,
for paralex:
- wikianswers-para-splitforgeneval
- training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, this folder name also differs from the one specified under the conf folder)
for qqp:
- qqp-splitforgeneval
- training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/
I also found what look like overlaps between the train and test sets in the same folder. For example:
grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}
VS.
grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}
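To check this more systematically than with grep, something like the sketch below could be used. It assumes every line is a JSON object with the "tgt", "sem_input" and "paras" fields shown above, and it only catches exact string matches after stripping whitespace. The function names are my own and are not part of the repository.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def questions_in(records, fields=("tgt", "sem_input", "paras")):
    """Collect every question string found in the given fields."""
    seen = set()
    for record in records:
        for field in fields:
            value = record.get(field, [])
            # "paras" holds a list of strings; the other fields hold one string.
            values = [value] if isinstance(value, str) else value
            seen.update(v.strip() for v in values)
    return seen


def leaked_inputs(train_records, test_records):
    """Return test "sem_input" strings that also occur anywhere in train."""
    train_questions = questions_in(train_records)
    test_inputs = questions_in(test_records, fields=("sem_input",))
    return sorted(test_inputs & train_questions)
```

Calling `leaked_inputs(read_jsonl("qqp-splitforgeneval/train.jsonl"), read_jsonl("qqp-splitforgeneval/test.jsonl"))` would then list every test input that also shows up as a target, input, or paraphrase in the training file.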
My questions are:
- What is the relationship between qqp-splitforgeneval and training-triples?
- If I want to compare my results with the paper, which sets should I use, i.e. splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
- Is it safe to assume there are no overlaps among the train/dev/test sets under the same folder? (Also, is it possible for a test "sem_input" to appear in a train.jsonl under a different folder?)
Thanks and I appreciate your help.