Training/Dev/Test split: splitforgeneval vs. training-triples

Hi, thanks for sharing this wonderful project and data. I am trying to use the released data to train my own T5-based paraphrasing model. However, there are multiple sets of train/dev/test.jsonl files under different folders. For example:

for paralex:
1. wikianswers-para-splitforgeneval
2. training-triples/wikianswers-triples-chunk-extendstop-realexemplars-resample-drop30-N5-R100/ (BTW, this name also differs from the one specified under the conf folder)

for qqp:
1. qqp-splitforgeneval
2. training-triples/qqp-clusters-chunk-extendstop-realexemplars-resample-drop30-N5-R100/


I also found what look like overlaps between the train and test sets under the same folder. For example:

```
grep 'Do astrology really work' qqp-splitforgeneval/test.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Do astrology really work?", "paras": ["Dose astrology really work?"]}
```
VS.
```
grep 'Dose astrology really work?' qqp-splitforgeneval/train.jsonl
{"tgt": "Dose astrology really work?", "syn_input": "Dose astrology really work?", "sem_input": "Does Rashi prediction really work?", "paras": ["Dose astrology really work?", "Does astrology works?", "Do astrology really work?", "Does astrology really work, I mean the online astrology?"]}
```
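To check this more systematically than with grep, a minimal sketch like the following could be used. It assumes each line of the .jsonl files is a JSON object with the fields shown in the grep output above (`tgt`, `syn_input`, `sem_input`, `paras`). The helper names `load_jsonl`, `all_questions`, and `leaked_rows` are my own and not part of the repository, and the demo rows are the two rows quoted above.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def all_questions(rows):
    """Collect every question string that appears in any field of a split."""
    seen = set()
    for row in rows:
        for key in ("tgt", "syn_input", "sem_input"):
            if key in row:
                seen.add(row[key])
        seen.update(row.get("paras", []))
    return seen


def leaked_rows(train_rows, test_rows, key="sem_input"):
    """Return the test rows whose `key` value occurs anywhere in the train split."""
    train_questions = all_questions(train_rows)
    return [row for row in test_rows if row.get(key) in train_questions]


# Demo with the two rows from qqp-splitforgeneval quoted above.
demo_train = [{
    "tgt": "Dose astrology really work?",
    "syn_input": "Dose astrology really work?",
    "sem_input": "Does Rashi prediction really work?",
    "paras": [
        "Dose astrology really work?",
        "Does astrology works?",
        "Do astrology really work?",
        "Does astrology really work, I mean the online astrology?",
    ],
}]
demo_test = [{
    "tgt": "Dose astrology really work?",
    "syn_input": "Dose astrology really work?",
    "sem_input": "Do astrology really work?",
    "paras": ["Dose astrology really work?"],
}]
demo_leaks = leaked_rows(demo_train, demo_test)
print(len(demo_leaks), "of", len(demo_test), "test rows overlap with train")
```

On the real files, the same check would be `leaked_rows(load_jsonl("qqp-splitforgeneval/train.jsonl"), load_jsonl("qqp-splitforgeneval/test.jsonl"))`. This only catches exact string matches, so near-duplicates that differ in casing or punctuation would be missed.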

My questions are:
1) What is the relationship between qqp-splitforgeneval and training-triples?
2) If I want to compare my results with the paper, which sets should I use: splitforgeneval or training-triples? (I do not need the "syn_input" utterances.)
3) Is it safe to assume there are no overlaps among the train/dev/test sets under the same folder? (E.g., is it possible for a test "sem_input" to appear in train.jsonl under a different folder?)

Thanks and I appreciate your help. 

