
Conversation

helpmefindaname
Member

@helpmefindaname helpmefindaname commented Jan 23, 2023

Integration of my neat transformer-smaller-training-vocab library, which helped me to train embeddings like xlm-roberta-large on my 6GB laptop.
This PR aims to reduce the memory overhead of unused vocabulary tokens by calling trainer.train(..., reduce_transformer_vocab=True) or trainer.fine_tune(..., reduce_transformer_vocab=True) respectively, while taking into account any additional model-specific tokens (e.g. for TARS or the RelationClassifier).
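For illustration, a minimal usage sketch (assuming a standard flair ModelTrainer setup; model and corpus construction are elided, and the output path is just an example):

from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/example",    # example output path
    reduce_transformer_vocab=True,  # temporarily shrink the embedding vocab to the tokens actually used
)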

Current todolist:

  • basic implementation
  • test on tars
  • test on RelationClassifier
  • test on TextPairClassifier
  • test with "best-model.pt" and "pre-best-model.pt"
  • test with multiple transformers
  • test with the same embedding being used multiple times (not possible in flair)
  • lower the dependency requirements of the transformer-smaller-training-vocab library (requiring pytorch 1.13 is too strict)
  • add a good unittest to ensure integration is working

@bratao
Contributor

bratao commented Jan 25, 2023

Just a thank you, @helpmefindaname. You (and the whole flair team) are making flair one of the best libraries for production NLP.

@stefan-it
Member

stefan-it commented Feb 3, 2023

Would be awesome to test this feature with the upcoming XLM-V model (which has a vocab size of 901,629) 🤗

@helpmefindaname
Member Author

helpmefindaname commented Feb 4, 2023

Would be awesome to test this feature with the upcoming XLM-V model (which has a vocab size of 901,629) 🤗

That sounds awesome! If I understand it correctly, XLM-V is like XLM-RoBERTa base but with a ~1M vocab, yielding 768M parameters. In standard float32 that is ~3GB of memory, or ~12GB while training with Adam (params + grad + 1st momentum + 2nd momentum). If the vocab usage is ~7%, as it is for XLM-RoBERTa on CoNLL, this would require ~11GB less training memory.
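A quick back-of-the-envelope sketch of those numbers (rough estimates, assuming float32 and plain Adam):

params = 768e6                    # approximate XLM-V parameter count
bytes_per_param = 4               # float32
train_factor = 4                  # params + grad + 1st momentum + 2nd momentum

training_gb = params * bytes_per_param * train_factor / 1e9               # ~12.3 GB

embedding_params = 901_629 * 768                                          # vocab embedding matrix, ~692M params
saved_params = embedding_params * (1 - 0.07)                              # drop ~93% of the vocab rows
saved_training_gb = saved_params * bytes_per_param * train_factor / 1e9   # ~10-11 GB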

At the current state of this branch, it should already work to train a SequenceTagger with the default setup (only 1 transformer embedding, no checkpointing, decoder_lr_factor==1, etc.)

However, if that leads to any problems, you can always just check out my repo and use it "manually":

from transformer_smaller_training_vocab import reduce_train_vocab

# gather all texts occurring in the corpus
texts = [[t.text for t in sentence] for sentence in corpus.get_all_sentences()]

# temporarily shrink the embedding vocab to those texts while training
with reduce_train_vocab(model=embeddings.model, tokenizer=embeddings.tokenizer, texts=texts):
    trainer.fine_tune(path, ...)

# save after the context manager has restored the full vocabulary
tagger.save(path + "/model.pt")

@bratao
Contributor

bratao commented Feb 6, 2023

@helpmefindaname would this tweet about vocab size be something to consider?

The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

https://twitter.com/karpathy/status/1621578354024677377

@helpmefindaname helpmefindaname marked this pull request as ready for review February 6, 2023 11:17
@helpmefindaname
Member Author

@helpmefindaname would this tweet about vocab size be something to consider?

The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

https://twitter.com/karpathy/status/1621578354024677377

This looks interesting, thank you for sharing it!
I will note that down and experiment with that.

However, my initial assumption would be that the speedup comes from the last layer, due to the huge matrix multiplication, while the embedding lookup in the first layer likely won't see much gain. But that is something I need to try out and verify.
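For reference, padding a (reduced) vocab size up to the next multiple of 64 is straightforward; the open question is whether it helps when only the embedding lookup, and not a full output projection over the vocab, is involved:

import math

def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple, as suggested in the tweet
    return math.ceil(vocab_size / multiple) * multiple

print(pad_vocab_size(50257))  # -> 50304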

@helpmefindaname helpmefindaname changed the title Draft: integrate transformer-smaller-training-vocab integrate transformer-smaller-training-vocab Feb 7, 2023
@helpmefindaname helpmefindaname marked this pull request as draft February 9, 2023 10:03
@helpmefindaname helpmefindaname force-pushed the smaller-training-vocab branch 2 times, most recently from a61944d to 71c6ef1 Compare February 19, 2023 03:37
@helpmefindaname helpmefindaname marked this pull request as ready for review February 19, 2023 04:09
Collaborator

@alanakbik alanakbik left a comment


Thanks for adding this @helpmefindaname, this is a great feature!

As per the review comments, I worry about adding a new property (supports_smaller_training_vocab) and a new method get_used_tokens to the top-level classes (Model and Classifier) and all implementing classes, since they are needed by only one feature.

One idea:

  • the supports_smaller_training_vocab property could be replaced with a check when running the trainer with reduce_transformer_vocab=True, similar to what happens in line 295 in trainer.py. If the model is not an instance of Classifier, issue a warning and set it to False.
  • the get_used_tokens method could be refactored into a property, so that it does not require a Corpus as input. Instead of computing all tokens plus special characters, it would only return the special characters used by a model, i.e. for the RelationClassifier it would only return the special chars used in the encoding strategy.
    This would remove the Corpus dependency from the interface and make the logic cleaner. Then, in the trainer, you could call a generic get_all_sentences to get all tokens and add the special chars returned by this property (a rough sketch follows below).
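A rough sketch of what that could look like (the helper names are hypothetical; only Classifier and corpus.get_all_sentences() are existing flair API):

from typing import Iterable, List

from flair.nn import Classifier


def vocab_reduction_supported(model) -> bool:
    # proposed replacement for the supports_smaller_training_vocab property:
    # a plain instance check inside the trainer instead of a property on every model
    return isinstance(model, Classifier)


def gather_training_texts(model, corpus) -> Iterable[List[str]]:
    # hypothetical trainer-side helper: collect all corpus tokens generically ...
    texts = [[token.text for token in sentence] for sentence in corpus.get_all_sentences()]
    # ... and append whatever special markers the model exposes via a (proposed) property
    texts.append(list(getattr(model, "special_tokens", [])))
    return texts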

Then a few questions:

  • how is the new FLERT special context token handled by this, since it only exists after a sentence is embedded?
  • what do you need the should_embed_sentence parameter for?

@@ -572,6 +589,7 @@ def __init__(
decoder: Optional[torch.nn.Module] = None,
inverse_model: bool = False,
train_on_gold_pairs_only: bool = False,
should_embed_sentence: bool = True,
Collaborator


Why is this parameter added? Are there models that inherit from DefaultClassifier that do not embed?

Member Author


This is part of the fix for the TextPairClassifier. While all other implementations of the DefaultClassifier need to embed the sentence first, the TextPairClassifier doesn't want the "TextPair" embedded directly, but rather decides afterwards how to embed it (both sentences separately vs. as one combined sentence).

Notice that this parameter is hardcoded in the __init__ of the respective implementation; it won't be saved or loaded.
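A simplified sketch of that pattern (not the real flair TextPairClassifier signature, just the idea):

from flair.nn import DefaultClassifier


class MyPairClassifier(DefaultClassifier):
    def __init__(self, **kwargs):
        # opt out of the default embedding step; the model embeds the pair itself
        # later (both sentences separately or as one combined sentence)
        super().__init__(should_embed_sentence=False, **kwargs)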

Comment on lines 236 to 239
@property
def supports_smaller_training_vocab(self) -> bool:
# the smaller training vocab expects classification tasks, otherwise it won't work.
return False
Collaborator


If I understand correctly, this is True for all models that inherit from Classifier, and False otherwise. If so, could this property be removed from the interface and replaced with an isinstance check in the Trainer?

Member Author


If I understand correctly, this is True for all models that inherit from Classifier, and False otherwise.

Actually, I forgot to add an implementation for the TextRegressor and will add that soon. Then this statement would be false.
I would say the correct rule is "everything that is not generative is true".

Comment on lines +241 to +242
def get_used_tokens(self, corpus: Corpus) -> typing.Iterable[List[str]]:
pass
Collaborator


Similar to above, this could either be moved into Classifier or replaced with a function. I worry a bit about "bloating" the top level interfaces with two new functions that only one feature needs.

Member Author


I feel that since these are not abstract methods but default implementations, subclasses are not required to override them, so the top-level interfaces are not bloated in terms of what needs to be implemented.

@helpmefindaname
Member Author

  • how is the new FLERT special context token handled by this, since it only exists after a sentence is embedded?

The FLERT special token is added as a special token, and special tokens are always kept, see https://github.com/helpmefindaname/transformer-smaller-training-vocab/blob/main/transformer_smaller_training_vocab/token_stats.py#L12. However, I am not 100% sure that this is fully correct; I will check it.

@alanakbik
Collaborator

@helpmefindaname thanks a lot for integrating this! I tested it on the cluster. FLERT trains about 15% faster with the reduced vocab :)

@alanakbik alanakbik merged commit cb6b0e5 into master Mar 3, 2023
@alanakbik alanakbik deleted the smaller-training-vocab branch March 3, 2023 11:29