integrate transformer-smaller-training-vocab #3066
Just thanks @helpmefindaname. You (and the whole flair team) are making flair one of the best libraries for production NLP!
Would be awesome to test this feature with the upcoming XLM-V model (which has a vocab size of 901,629) 🤗
That sounds awesome! If I understand it correctly, XLM-V is like XLM-RoBERTa-base but with a ~1M vocab, yielding ~768M parameters. In standard float32 that would be ~3GB of memory, or ~12GB while training with Adam (params + grads + 1st momentum + 2nd momentum). If only ~7% of the vocab is needed, as is the case for XLM-RoBERTa on CoNLL, this would require ~11GB less training memory.

At the current state of this branch, it should already work to train a SequenceTagger with the default setup (only one transformer embedding, no checkpointing, `decoder_lr_factor == 1`, etc.). However, if that leads to any problem, you can always just check out my repo and use it "manually":

```python
from transformer_smaller_training_vocab import reduce_train_vocab

texts = [[t.text for t in sentence] for sentence in corpus.get_all_sentences()]
with reduce_train_vocab(model=embeddings.model, tokenizer=embeddings.tokenizer, texts=texts):
    trainer.fine_tune(path, ...)
tagger.save(path + "/model.pt")
```
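To make those back-of-envelope numbers concrete, here is a rough sketch of the arithmetic; the transformer-body parameter count is an assumption (roughly XLM-R-base sized), only the vocab size and the 7%/Adam figures come from the comment above:

```python
embedding_params = 901_629 * 768        # XLM-V embedding matrix: ~692M parameters
body_params = 86_000_000                # transformer body, XLM-R-base sized (assumption)
total = embedding_params + body_params  # ~778M parameters

weights_gb = total * 4 / 1e9            # float32 weights: ~3.1 GB
training_gb = total * 16 / 1e9          # + grads + two Adam moments: ~12.5 GB

kept = 0.07                             # ~7% of the vocab actually used, as on CoNLL
saved_gb = embedding_params * (1 - kept) * 16 / 1e9  # ~10-11 GB less while training

print(f"{weights_gb:.1f} GB weights, {training_gb:.1f} GB training, ~{saved_gb:.1f} GB saved")
```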
@helpmefindaname would this tweet about vocab size be something to consider?
This looks interesting, thank you for sharing it! However, my initial assumption would be that the speedup is in the last layer due to the huge matrix multiplication, while the lookup in the first layer likely won't gain much. But that is something I need to try out and verify.
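A rough illustration of that hypothesis (the sequence length and the reduced vocab size are assumptions): the first-layer embedding lookup is a gather whose cost does not depend on vocab size, while the output projection is a matmul that scales linearly with it.

```python
hidden, seq = 768, 128
vocab_full, vocab_reduced = 901_629, 65_000   # XLM-V vocab vs. a hypothetical reduced one

# First layer: embedding lookup is a gather -> cost ~ seq * hidden, vocab-independent.
lookup_ops = seq * hidden

# Last layer (e.g. an MLM head): logits = hidden_states @ W.T -> ~2 * seq * hidden * vocab FLOPs.
proj_full = 2 * seq * hidden * vocab_full
proj_reduced = 2 * seq * hidden * vocab_reduced

print(f"lookup:               {lookup_ops:>18,} ops (same either way)")
print(f"projection (full):    {proj_full:>18,} FLOPs")
print(f"projection (reduced): {proj_reduced:>18,} FLOPs (~{proj_full / proj_reduced:.0f}x fewer)")
```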
Thanks for adding this @helpmefindaname, this is a great feature!

As per the review comments, I worry about adding a new property (`supports_smaller_training_vocab`) and method (`get_used_tokens`) to the top classes (`Model` and `Classifier`) and all implementing classes, since this is needed by only one feature.

One idea (a rough sketch follows after this list):
- the `supports_smaller_training_vocab` property could be replaced with a check when running the trainer with `reduce_transformer_vocab=True`, similar to what happens in line 295 in trainer.py. If the model is not an instance of `Classifier`, issue a warning and set it to False.
- the `get_used_tokens` method could be refactored into a property, such that it does not require a `Corpus` as input. Instead of computing all tokens plus special characters, it would only return the special characters used by a model, i.e. for the `RelationClassifier`, it would only return the special chars used in the encoding strategy.

This would remove the `Corpus` dependency from the interface and make the logic cleaner. Then, in the trainer, you could call a generic `get_all_sentences` to get all tokens, and add the special chars returned by this property.
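A minimal sketch of what this refactor could look like (stand-in classes with hypothetical bodies, not the PR's actual code; only the `get_all_sentences` call is taken from the thread):

```python
from typing import List

class Classifier:  # stand-in for flair.nn.Classifier
    @property
    def special_tokens(self) -> List[str]:
        """Extra tokens a model injects beyond the corpus text; none by default."""
        return []

class RelationClassifier(Classifier):
    @property
    def special_tokens(self) -> List[str]:
        # only the markers of the encoding strategy (example values)
        return ["[HEAD]", "[TAIL]"]

# in the trainer, the Corpus dependency stays out of the model interface:
def collect_texts(corpus, model: Classifier) -> List[List[str]]:
    texts = [[token.text for token in sentence] for sentence in corpus.get_all_sentences()]
    texts.append(list(model.special_tokens))
    return texts
```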
Then a few questions:
- how is the new FLERT special context token handled by this, since it only exists after a sentence is embedded?
- what do you need the `should_embed_sentence` parameter for?
```diff
@@ -572,6 +589,7 @@ def __init__(
         decoder: Optional[torch.nn.Module] = None,
         inverse_model: bool = False,
         train_on_gold_pairs_only: bool = False,
+        should_embed_sentence: bool = True,
```
Why is this parameter added? Are there models that inherit from `DefaultClassifier` that do not embed?
This is part of the fix for the `TextPairClassifier`. While all other implementations of the `DefaultClassifier` need to embed the sentence first, the `TextPairClassifier` doesn't want the `TextPair` embedded as a whole, but rather decides afterwards how to embed it (both sentences separately vs. as one combined sentence).
Notice that this parameter is hard-coded in the `__init__` of the respective implementation. It won't be saved or loaded.
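A simplified sketch of that behaviour (stand-in classes, not the PR's actual code):

```python
class DefaultClassifier:  # stand-in for flair's DefaultClassifier
    def __init__(self, should_embed_sentence: bool = True):
        # controls whether the data point is embedded before decoding
        self.should_embed_sentence = should_embed_sentence

class TextPairClassifier(DefaultClassifier):
    def __init__(self):
        # hard-coded here, never saved or loaded with the model state;
        # this classifier decides later how to embed the two sentences
        super().__init__(should_embed_sentence=False)
```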
flair/nn/model.py

```python
@property
def supports_smaller_training_vocab(self) -> bool:
    # the smaller training vocab expects classification tasks, otherwise it won't work.
    return False
```
If I understand correctly, this is True for all models that inherit from `Classifier`, and False otherwise. If so, could this property be removed from the interface and replaced with an `isinstance` check in the `Trainer`?
> If I understand correctly, this is True for all models that inherit from Classifier, and False otherwise.

Actually, I forgot to add an implementation for the `TextRegressor` and will add that soon; then this statement would be false. I would say the correct rule is "everything that is not generative is true".
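A sketch of that rule as overrides (a hedged illustration with stand-in classes, not the PR's exact code):

```python
class Model:  # stand-in for flair.nn.Model
    @property
    def supports_smaller_training_vocab(self) -> bool:
        # conservative default: generative / unknown models opt out
        return False

class Classifier(Model):
    @property
    def supports_smaller_training_vocab(self) -> bool:
        return True

class TextRegressor(Model):  # not a Classifier, but also non-generative
    @property
    def supports_smaller_training_vocab(self) -> bool:
        return True
```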
```python
def get_used_tokens(self, corpus: Corpus) -> typing.Iterable[List[str]]:
    pass
```
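For context, one plausible implementation for a token-level model could look like the following (illustrative only; `get_all_sentences` is the corpus call mentioned elsewhere in this thread):

```python
import typing
from typing import List

def get_used_tokens(self, corpus) -> typing.Iterable[List[str]]:
    # yield the text of every token the model will actually see during training
    for sentence in corpus.get_all_sentences():
        yield [token.text for token in sentence]
    # a subclass that injects extra tokens (e.g. the RelationClassifier's
    # encoding markers) would additionally yield those
```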
Similar to above, this could either be moved into `Classifier` or replaced with a function. I worry a bit about "bloating" the top-level interfaces with two new functions that only one feature needs.
I feel like, since these are not abstract methods but default implementations, subclasses are not required to implement them, so the interfaces aren't bloated in terms of what needs to be implemented.

The FLERT context marker is added as a special token, and special tokens are always kept, see https://github.com/helpmefindaname/transformer-smaller-training-vocab/blob/main/transformer_smaller_training_vocab/token_stats.py#L12. However, I am not 100% sure that is fully right; I will check this.
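A small sketch of the "special tokens are always kept" idea (simplified; the linked token_stats.py is the authoritative version, and the marker string is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
always_keep = set(tok.all_special_ids)  # <s>, </s>, <pad>, <unk>, <mask>, ...

# FLERT's context marker only lands in this set if it was registered
# as an additional special token beforehand ("[FLERT]" is an assumed name):
tok.add_special_tokens({"additional_special_tokens": ["[FLERT]"]})
always_keep = set(tok.all_special_ids)  # now also contains the marker's id
```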
@helpmefindaname thanks a lot for integrating this! I tested on the cluster. FLERT trains about 15% faster with the reduced vocab :)
Integration of my neat transformer-smaller-training-vocab library, which helped me to train embeddings like xlm-roberta-large on my 6GB laptop.

This PR aims to reduce the memory overhead of unused tokens in the vocabulary by using `trainer.train(..., reduce_transformer_vocab=True)` or `trainer.fine_tune(..., reduce_transformer_vocab=True)` respectively, taking any additional model-specific tokens (TARS or RelationClassifier) into account. A usage sketch follows below.

Current todo list:
- ~~test with the same embedding being used multiple times~~ (not possible in flair)
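A minimal end-to-end sketch of the new flag (the dataset, model choice, and hyperparameters are assumptions; only `reduce_transformer_vocab=True` comes from this PR):

```python
from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()
label_dict = corpus.make_label_dictionary(label_type="ner")

embeddings = TransformerWordEmbeddings("xlm-roberta-large", fine_tune=True)
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

trainer = ModelTrainer(tagger, corpus)
# the flag from this PR: train with a vocab reduced to the tokens the corpus uses,
# then restore the full vocab before the final model is saved
trainer.fine_tune("resources/taggers/xlmr-small-vocab", reduce_transformer_vocab=True)
```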