Closed
Labels: enhancement (Feature requests and improvements) · proposal (Proposal specs for new features) · training (Training and updating models)
Motivation
One of the biggest inconveniences and sources of frustration is spaCy's current JSON format for training. It's weirdly specific, annoying to create outside of the built-in converters and difficult to read. Training the model with incomplete information is pretty unintuitive and inconvenient as well.
To finally fix this, here's my proposal for a new and simplified training file format that is easier to read, generate and compose.
Example
```json
{
  "text": "Apple Inc. is an American multinational technology company headquartered in Cupertino, California. It was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne in April 1976.",
  "ents": [
    {"start": 0, "end": 10, "label": "ORG"},
    {"start": 17, "end": 25, "label": "NORP"},
    {"start": 76, "end": 85, "label": "GPE"},
    {"start": 87, "end": 97, "label": "GPE"},
    {"start": 117, "end": 127, "label": "PERSON"},
    {"start": 129, "end": 142, "label": "PERSON"},
    {"start": 148, "end": 160, "label": "PERSON"},
    {"start": 164, "end": 174, "label": "DATE"}
  ],
  "sents": [
    {"start": 0, "end": 98},
    {"start": 99, "end": 175}
  ],
  "cats": {
    "TECHNOLOGY": true,
    "FINANCE": false,
    "LEGAL": false
  },
  "tokens": [
    {"start": 0, "end": 5, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 1},
    {"start": 6, "end": 10, "pos": "PROPN", "tag": "NNP", "dep": "nsubj", "head": 2},
    {"start": 11, "end": 13, "pos": "VERB", "tag": "VBZ", "dep": "ROOT", "head": 2},
    {"start": 14, "end": 16, "pos": "DET", "tag": "DT", "dep": "det", "head": 7},
    {"start": 17, "end": 25, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 7},
    {"start": 26, "end": 39, "pos": "ADJ", "tag": "JJ", "dep": "amod", "head": 6},
    {"start": 40, "end": 50, "pos": "NOUN", "tag": "NN", "dep": "compound", "head": 7},
    {"start": 51, "end": 58, "pos": "NOUN", "tag": "NN", "dep": "attr", "head": 2},
    {"start": 59, "end": 72, "pos": "VERB", "tag": "VBN", "dep": "acl", "head": 7},
    {"start": 73, "end": 75, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 8},
    {"start": 76, "end": 85, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 9},
    {"start": 85, "end": 86, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 10},
    {"start": 87, "end": 97, "pos": "PROPN", "tag": "NNP", "dep": "appos", "head": 10},
    {"start": 97, "end": 98, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 2},
    {"start": 99, "end": 101, "pos": "PRON", "tag": "PRP", "dep": "nsubjpass", "head": 16},
    {"start": 102, "end": 105, "pos": "VERB", "tag": "VBD", "dep": "auxpass", "head": 16},
    {"start": 106, "end": 113, "pos": "VERB", "tag": "VBN", "dep": "ROOT", "head": 16},
    {"start": 114, "end": 116, "pos": "ADP", "tag": "IN", "dep": "agent", "head": 16},
    {"start": 117, "end": 122, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 19},
    {"start": 123, "end": 127, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 17},
    {"start": 127, "end": 128, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 19},
    {"start": 129, "end": 134, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 22},
    {"start": 135, "end": 142, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 19},
    {"start": 142, "end": 143, "pos": "PUNCT", "tag": ",", "dep": "punct", "head": 22},
    {"start": 144, "end": 147, "pos": "CCONJ", "tag": "CC", "dep": "cc", "head": 22},
    {"start": 148, "end": 154, "pos": "PROPN", "tag": "NNP", "dep": "compound", "head": 26},
    {"start": 155, "end": 160, "pos": "PROPN", "tag": "NNP", "dep": "conj", "head": 22},
    {"start": 161, "end": 163, "pos": "ADP", "tag": "IN", "dep": "prep", "head": 16},
    {"start": 164, "end": 169, "pos": "PROPN", "tag": "NNP", "dep": "pobj", "head": 27},
    {"start": 170, "end": 174, "pos": "NUM", "tag": "CD", "dep": "nummod", "head": 28},
    {"start": 174, "end": 175, "pos": "PUNCT", "tag": ".", "dep": "punct", "head": 16}
  ]
}
```
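Since every span type uses the same character offsets into `"text"`, a consumer can recover the annotated strings with plain slicing. A minimal sketch (the record below is a trimmed version of the example above, and `ent_texts` is just a hypothetical helper name, not part of any proposed API):

```python
# Recover entity strings from a training record via character offsets.
record = {
    "text": "Apple Inc. is an American multinational technology company "
            "headquartered in Cupertino, California.",
    "ents": [
        {"start": 0, "end": 10, "label": "ORG"},
        {"start": 17, "end": 25, "label": "NORP"},
        {"start": 76, "end": 85, "label": "GPE"},
        {"start": 87, "end": 97, "label": "GPE"},
    ],
}

def ent_texts(rec):
    """Pair each entity label with the text slice its offsets point at."""
    return [(rec["text"][e["start"]:e["end"]], e["label"]) for e in rec["ents"]]

print(ent_texts(record))
# [('Apple Inc.', 'ORG'), ('American', 'NORP'), ('Cupertino', 'GPE'), ('California', 'GPE')]
```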
Notes
- Each record contains a `"text"` and optional `"ents"` (named entity spans), `"sents"` (sentence spans), `"cats"` (text categories) and `"tokens"` (tokens with offsets into the text and optional attributes).
- Offsets into the text are standardised: `"start"` (start index) and `"end"` (end index). Other attributes match spaCy's API.
- The `"tokens"` don't have to include all attributes. If an attribute isn't present (e.g. a part-of-speech tag or dependency label), it's treated as a missing value.
- The token `"head"` is the index of the head token, i.e. `token.head.i`.
- The provided gold-standard tokenization can also be used to train the parser to split/merge tokens (coming in v2.1.x). This could be an argument / a flag to set during training.
- spaCy v2.1.x (nightly) already includes a `spacy.gold.docs2json` helper that generates the training format from `Doc` objects. It's intended to help keep the converters (`.conllu` etc.) in sync, since they can now all produce `Doc` objects and call into the same helper to convert to spaCy's format. This would also make the transition to a new format easy, because we'd only have to change the logic in `docs2json`.
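To illustrate the missing-value convention, here's a small sketch of consuming partially annotated tokens (the data is a hypothetical trimmed record, and `arcs` is an invented helper name; an absent key is simply skipped rather than treated as an error):

```python
text = "Apple Inc. is"
tokens = [
    {"start": 0, "end": 5, "pos": "PROPN", "dep": "compound", "head": 1},
    {"start": 6, "end": 10, "pos": "PROPN", "dep": "nsubj", "head": 2},
    {"start": 11, "end": 13},  # attributes omitted -> treated as missing values
]

def arcs(text, tokens):
    """Yield (child, dep, head) triples, skipping tokens with missing annotations."""
    out = []
    for tok in tokens:
        dep, head = tok.get("dep"), tok.get("head")
        if dep is None or head is None:
            continue  # missing value: nothing to update the parser with here
        h = tokens[head]
        out.append((text[tok["start"]:tok["end"]], dep, text[h["start"]:h["end"]]))
    return out

print(arcs(text, tokens))
# [('Apple', 'compound', 'Inc.'), ('Inc.', 'nsubj', 'is')]
```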
✅ Pros
- Easier to read and much closer to how the linguistic annotations are presented in spaCy's data structures.
- Easier to mix and match, and compose different types of data. With this format, you could easily omit the `"tokens"` and only train on the `"ents"`, or update the `"sents"` to improve the sentence boundary detection.
- Easier to generate from other sources and corpora, because there are fewer restrictions around the shape of the text. While the previous format enforced a strict separation of paragraphs and sentences, this format will let you use longer and shorter texts and define sentence boundaries within each example.
- Easier to extend. If there are ever new annotations to be trained from, they can be added in a backwards-compatible way: document-level annotations (spans like sentences or entities) go at the root, and token-level annotations (other predicted attributes) go within the tokens.
💡 Related ideas
- Use a JSON schema to validate the training data format (!!!) and provide helpful feedback if there are problems. For example, imagine an error like: "tokens -> 20 -> start has the wrong format: integer required, received string ("5")".
- Speaking of validation: We could also add more in-depth data debugging and warnings (e.g. via an optional flag or command the user can run). For example: "Your data contains a new entity type `ANIMAL` that currently isn't present in the model. 1) You only have 15 examples of `ANIMAL`. This likely isn't enough to teach the model anything meaningful about this type. 2) Your data doesn't contain any examples of texts that do not contain an entity. This will make it harder for the model to generalise and learn what's not an entity."
- Make `spacy train` accept data in both `.json` and `.jsonl` (newline-delimited JSON). JSONL allows reading the file in line by line and doesn't require parsing the entire document.
- `spacy train` should make it much easier to update existing models or, alternatively, we should provide an analogous command with the same / similar arguments that takes the name of an existing model package instead of just the language to initialize. (Basically, if you know Prodigy, we want to provide the same smooth batch training experience natively in spaCy!)
- Make it easy for custom components and third-party models to hook into the training format! spaCy already supports `begin_training` and `update` methods on components (if you call `nlp.update`, spaCy will iterate over the components and call their `update` methods if available – just like `nlp.from_disk`). So we could, for instance, allow an `_` space in the training data, just like in the custom extension attributes, that can contain additional data – think coreference annotations, entity links etc.! Those would then automatically be added to the gold-standard `Doc` and become available in the custom component's `update` method.
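To make the first idea concrete, the path-style error message could come straight out of a validator. A dependency-free sketch of just the offset check (a real implementation would more likely use a full schema with the `jsonschema` package; `check_tokens` and the exact message wording are illustrative, not a spec):

```python
def check_tokens(record):
    """Collect path-style error messages for non-integer token offsets."""
    errors = []
    for i, tok in enumerate(record.get("tokens", [])):
        for field in ("start", "end"):
            value = tok.get(field)
            if not isinstance(value, int):
                errors.append(
                    "tokens -> {} -> {} has the wrong format: "
                    "integer required, received {} ({!r})".format(
                        i, field, type(value).__name__, value
                    )
                )
    return errors

bad = {"tokens": [{"start": 0, "end": 5}, {"start": "5", "end": 10}]}
print(check_tokens(bad))
# ["tokens -> 1 -> start has the wrong format: integer required, received str ('5')"]
```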
What do you think? I'd love to hear your feedback in the comments!