Closed
Labels: enhancement (Feature requests and improvements) · proposal (Proposal specs for new features) · training (Training and updating models)
Motivation
One of the biggest inconveniences and sources of frustration is spaCy's current JSON format for training. It's weirdly specific, annoying to create outside of the built-in converters and difficult to read. Training the model with incomplete information is pretty unintuitive and inconvenient as well.
To finally fix this, here's my proposal for a new and simplified training file format that is easier to read, generate and compose.
Example
```json
{
  "text": "Apple Inc. is an American multinational technology company headquartered in Cupertino, California. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
  "ents": [
    {"start": 0, "end": 10, "label": "ORG"},
    {"start": 17, "end": 25, "label": "NORP"},
    {"start": 76, "end": 85, "label": "GPE"},
    {"start": 87, "end": 97, "label": "GPE"},
    {"start": 117, "end": 127, "label": "PERSON"},
    {"start": 129, "end": 142, "label": "PERSON"},
    {"start": 148, "end": 160, "label": "PERSON"},
    {"start": 164, "end": 174, "label": "DATE"}
  ],
  "sents": [
    {"start": 0, "end": 98},
    {"start": 99, "end": 175}
  ],
  "cats": {
    "TECHNOLOGY": true,
    "FINANCE": false,
    "LEGAL": false
  },
  "tokens": [
    {"start": 0, "end": 5, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 1},
    {"start": 6, "end": 10, "pos": "PROPN", "tag": "NNP", "dep": "nsubj", "head": 2},
    {"start": 11, "end": 13, "pos": "VERB", "tag": "VBZ", "dep": "ROOT", "head": 2},
    {"start": 14, "end": 16, "pos": "DET", "tag": "DT", "dep": "det", "head": 7},
    {"start": 17, "end": 25, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 7},
    {"start": 26, "end": 39, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 6},
    {"start": 40, "end": 50, "pos": "NOUN", "tag": "NN", "dep": "compound", "head": 7},
    {"start": 51, "end": 58, "pos": "NOUN", "tag": "NN", "dep": "attr", "head": 2},
    {"start": 59, "end": 72, "pos": "VERB", "tag": "VBN", "dep": "acl", "head": 7},
    {"start": 73, "end": 75, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 8},
    {"start": 76, "end": 85, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 9},
    {"start": 85, "end": 86, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 10},
    {"start": 87, "end": 97, "pos": "PROPN", "tag": "NNP", "dep": "appos", "head": 10},
    {"start": 97, "end": 98, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 2},
    {"start": 99, "end": 101, "pos": "PRON", "tag": "PRP", "dep": "nsubjpass", "head": 16},
    {"start": 102, "end": 105, "pos": "VERB", "tag": "VBD", "dep": "auxpass", "head": 16},
    {"start": 106, "end": 113, "pos": "VERB", "tag": "VBN", "dep": "ROOT", "head": 16},
    {"start": 114, "end": 116, "pos": "ADP", "tag": "IN", "dep": "agent", "head": 16},
    {"start": 117, "end": 122, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 19},
    {"start": 123, "end": 127, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 17},
    {"start": 127, "end": 128, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 19},
    {"start": 129, "end": 134, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 22},
    {"start": 135, "end": 142, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 19},
    {"start": 142, "end": 143, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 22},
    {"start": 144, "end": 147, "pos": "CCONJ", "tag": "CC", "dep": "cc", "head": 22},
    {"start": 148, "end": 154, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 26},
    {"start": 155, "end": 160, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 22},
    {"start": 161, "end": 163, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 16},
    {"start": 164, "end": 169, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 27},
    {"start": 170, "end": 174, "pos": "NUM", "tag": "CD", "dep": "nummod", "head": 28},
    {"start": 174, "end": 175, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 16}
  ]
}
```
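Since every span type uses the same character offsets into `"text"`, a consumer can recover the annotated strings with plain slicing. A minimal sketch (the record below is a trimmed version of the example above, and `ent_texts` is just a hypothetical helper name, not part of any proposed API):

```python
# Recover entity strings from a training record via character offsets.
record = {
    "text": "Apple Inc. is an American multinational technology company "
            "headquartered in Cupertino, California.",
    "ents": [
        {"start": 0, "end": 10, "label": "ORG"},
        {"start": 17, "end": 25, "label": "NORP"},
        {"start": 76, "end": 85, "label": "GPE"},
        {"start": 87, "end": 97, "label": "GPE"},
    ],
}

def ent_texts(rec):
    """Pair each entity label with the text slice its offsets point at."""
    return [(rec["text"][e["start"]:e["end"]], e["label"]) for e in rec["ents"]]

print(ent_texts(record))
# [('Apple Inc.', 'ORG'), ('American', 'NORP'), ('Cupertino', 'GPE'), ('California', 'GPE')]
```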
Notes
- Each record contains a `"text"` and optional `"ents"` (named entity spans), `"sents"` (sentence spans), `"cats"` (text categories) and `"tokens"` (tokens with offsets into the text and optional attributes).
- Offsets into the text are standardised: `"start"` (start index) and `"end"` (end index). Other attributes match spaCy's API.
- The `"tokens"` don't have to include all attributes. If an attribute isn't present (e.g. a part-of-speech tag or dependency label), it's treated as a missing value.
- The token `"head"` is the index of the head token, i.e. `token.head.i`.
- The provided gold-standard tokenization can also be used to train the parser to split/merge tokens (coming in v2.1.x). This could be an argument / a flag to set during training.
- spaCy v2.1.x (nightly) already includes a `spacy.gold.docs2json` helper that generates the training format from `Doc` objects. It's intended to help keep the converters (`.conllu` etc.) in sync, since they can now all produce `Doc` objects and call into the same helper to convert to spaCy's format. This would also make the transition to a new format easy, because we'd only have to change the logic in `docs2json`.
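To illustrate the missing-value convention, here's a small sketch of consuming partially annotated tokens (the data is a hypothetical trimmed record, and `arcs` is an invented helper name; an absent key is simply skipped rather than treated as an error):

```python
text = "Apple Inc. is"
tokens = [
    {"start": 0, "end": 5, "pos": "PROPN", "dep": "compound", "head": 1},
    {"start": 6, "end": 10, "pos": "PROPN", "dep": "nsubj", "head": 2},
    {"start": 11, "end": 13},  # attributes omitted -> treated as missing values
]

def arcs(text, tokens):
    """Yield (child, dep, head) triples, skipping tokens with missing annotations."""
    out = []
    for tok in tokens:
        dep, head = tok.get("dep"), tok.get("head")
        if dep is None or head is None:
            continue  # missing value: nothing to update the parser with here
        h = tokens[head]
        out.append((text[tok["start"]:tok["end"]], dep, text[h["start"]:h["end"]]))
    return out

print(arcs(text, tokens))
# [('Apple', 'compound', 'Inc.'), ('Inc.', 'nsubj', 'is')]
```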
✅ Pros
- Easier to read and much closer to how the linguistic annotations are presented in spaCy's data structures.
- Easier to mix and match, and compose different types of data. With this format, you could easily omit the `"tokens"` and only train on the `"ents"`, or update the `"sents"` to improve the sentence boundary detection.
- Easier to generate from other sources and corpora, because there are fewer restrictions around the shape of the text. While the previous format enforced a strict separation of paragraphs and sentences, this format will let you use longer and shorter texts and define sentence boundaries within each example.
- Easier to extend. If there are ever new annotations to be trained from, they can be added in a backwards-compatible way: document-level annotations (spans like sentences or entities) go at the root, and token-level annotations (other predicted attributes) go within the tokens.
💡 Related ideas
- Use a JSON schema to validate the training data format (!!!) and provide helpful feedback if there are problems. For example, imagine an error like: "tokens -> 20 -> start has the wrong format: integer required, received string ("5")".
- Speaking of validation: We could also add more in-depth data debugging and warnings (e.g. via an optional flag or command the user can run). For example: "Your data contains a new entity type `ANIMAL` that currently isn't present in the model. 1) You only have 15 examples of `ANIMAL`. This likely isn't enough to teach the model anything meaningful about this type. 2) Your data doesn't contain any examples of texts that do not contain an entity. This will make it harder for the model to generalise and learn what's not an entity."
- Make `spacy train` accept data in both `.json` and `.jsonl` (newline-delimited JSON). JSONL allows reading the file in line by line and doesn't require parsing the entire document.
- `spacy train` should make it much easier to update existing models or, alternatively, we should provide an analogous command with the same / similar arguments that takes the name of an existing model package instead of just the language to initialize. (Basically, if you know Prodigy, we want to provide the same smooth batch training experience natively in spaCy!)
- Make it easy for custom components and third-party models to hook into the training format! spaCy already supports `begin_training` and `update` methods on components (if you call `nlp.update`, spaCy will iterate over the components and call their `update` methods if available – just like `nlp.from_disk`). So we could, for instance, allow an `_` space in the training data, just like in the custom extension attributes, that can contain additional data – think coreference annotations, entity links etc.! Those would then automatically be added to the gold-standard `Doc` and become available in the custom component's `update` method.
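To make the first idea concrete, the path-style error message could come straight out of a validator. A dependency-free sketch of just the offset check (a real implementation would more likely use a full schema with the `jsonschema` package; `check_tokens` and the exact message wording are illustrative, not a spec):

```python
def check_tokens(record):
    """Collect path-style error messages for non-integer token offsets."""
    errors = []
    for i, tok in enumerate(record.get("tokens", [])):
        for field in ("start", "end"):
            value = tok.get(field)
            if not isinstance(value, int):
                errors.append(
                    "tokens -> {} -> {} has the wrong format: "
                    "integer required, received {} ({!r})".format(
                        i, field, type(value).__name__, value
                    )
                )
    return errors

bad = {"tokens": [{"start": 0, "end": 5}, {"start": "5", "end": 10}]}
print(check_tokens(bad))
# ["tokens -> 1 -> start has the wrong format: integer required, received str ('5')"]
```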
What do you think? I'd love to hear your feedback in the comments!