Wrong filenames in dataset

Hi,
The filenames in the code-clippy dedup dataset are wrong. In repos with multiple files, all of the files share a single random filename, and that filename does not carry the correct file extension either. For the gpt-code-clippy training effort this might not be an issue, since only the content of the files may matter, but it would be really great if this could be fixed or, otherwise, clearly documented.

Sample code to reproduce the issue (prints the filenames in the first 100 rows of the JSONL file):
```
import json
import subprocess

import zstandard


def loadJsonL(fname):
    """Read a JSON Lines file and return its rows as a list of dicts."""
    data = []
    with open(fname, encoding="utf-8") as fp:
        for line in fp:
            data.append(json.loads(line))
    return data


def processZSTLink(url):
    """Download a .jsonl.zst shard, decompress it, and print repo/file names."""
    zstfile = url.split('/')[-1]
    print(url)
    # Requires wget on PATH; check=True raises if the download fails.
    subprocess.run(["wget", url], stdout=subprocess.DEVNULL, check=True)
    jsonlfile = zstfile[:-4]  # strip the trailing ".zst"
    with open(zstfile, 'rb') as compressed:
        decomp = zstandard.ZstdDecompressor()
        with open(jsonlfile, 'wb') as destination:
            decomp.copy_stream(compressed, destination)

    data = loadJsonL(jsonlfile)
    # Only the first 100 rows are needed to see the problem.
    for row in data[:100]:
        file_name = row['meta']['file_name']
        repo_name = row['meta']['repo_name']
        print(f"{repo_name}/{file_name}")


processZSTLink('https://the-eye.eu/public/AI/training_data/code_clippy_data//code_clippy_dedup_data/test/data_2814_time1626332048_default.jsonl.zst')
```
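
To quantify the problem rather than eyeball the printed list, the rows can be grouped by repo and the distinct filenames counted. This is only a minimal sketch: the helper `summarize_filenames` and the sample rows are hypothetical and made up for illustration; the only thing taken from the dataset is the `meta.repo_name` / `meta.file_name` layout used in the script above.

```python
from collections import defaultdict


def summarize_filenames(rows):
    """Map each repo_name to (row count, number of distinct file_name values)."""
    counts = defaultdict(int)
    names = defaultdict(set)
    for row in rows:
        meta = row["meta"]
        counts[meta["repo_name"]] += 1
        names[meta["repo_name"]].add(meta["file_name"])
    return {repo: (counts[repo], len(names[repo])) for repo in counts}


# Hypothetical rows, not real dataset content.
sample_rows = [
    {"text": "a", "meta": {"repo_name": "user/repo1", "file_name": "abc123"}},
    {"text": "b", "meta": {"repo_name": "user/repo1", "file_name": "abc123"}},
    {"text": "c", "meta": {"repo_name": "user/repo1", "file_name": "abc123"}},
    {"text": "d", "meta": {"repo_name": "user/repo2", "file_name": "main.py"}},
]

summary = summarize_filenames(sample_rows)
for repo, (n_rows, n_names) in summary.items():
    print(f"{repo}: {n_rows} rows, {n_names} distinct filename(s)")
```

A repo with many rows but a single distinct filename is a candidate for the issue described here. Passing `loadJsonL(jsonlfile)` instead of `sample_rows` would run the same check on a downloaded shard.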
