
Conversation

helpmefindaname
Member

@helpmefindaname helpmefindaname commented Jan 23, 2023

Integration of my neat transformer-smaller-training-vocab library, which helped me to train embeddings like xlm-roberta-large on my 6GB laptop.
This PR aims to reduce the memory overhead of unused vocabulary tokens by calling trainer.train(..., reduce_transformer_vocab=True) or trainer.fine_tune(..., reduce_transformer_vocab=True) respectively, while taking into account any additional model-specific tokens (e.g. for TARS or the RelationClassifier).
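For illustration, a minimal usage sketch (assuming a standard flair ModelTrainer setup; model and corpus construction are elided, and the output path is just an example):

from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/example",    # example output path
    reduce_transformer_vocab=True,  # temporarily shrink the embedding vocab to the tokens actually used
)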

Current todolist:

  • basic implementation
  • test on tars
  • test on RelationClassifier
  • test on TextPairClassifier
  • test with "best-model.pt" and "pre-best-model.pt"
  • test with multiple transformers
  • test with the same embedding being used multiple times (not possible in flair)
  • lower the dependency requirements of the transformer-smaller-training-vocab library (requiring pytorch 1.13 is too strict)
  • add a good unittest to ensure integration is working

@bratao
Contributor

bratao commented Jan 25, 2023

Just a thank you, @helpmefindaname. You (and the whole flair team) are making flair one of the best libraries for production NLP.

@stefan-it
Member

stefan-it commented Feb 3, 2023

Would be awesome to test this feature with the upcoming XLM-V model (which has a vocab size of 901,629) 🤗

@helpmefindaname
Member Author

helpmefindaname commented Feb 4, 2023

Would be awesome to test this feature with the upcoming XLM-V model (which has a vocab size of 901,629) 🤗

That sounds awesome! If I understand it correctly, XLM-V is like XLM-RoBERTa base but with a ~1M vocab, yielding 768M parameters. In standard float32 that is ~3GB of memory, or ~12GB while training with Adam (params + grad + 1st momentum + 2nd momentum). If the vocab usage is ~7%, as it is for XLM-RoBERTa on CoNLL, this would require ~11GB less training memory.
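A quick back-of-the-envelope sketch of those numbers (rough estimates, assuming float32 and plain Adam):

params = 768e6                    # approximate XLM-V parameter count
bytes_per_param = 4               # float32
train_factor = 4                  # params + grad + 1st momentum + 2nd momentum

training_gb = params * bytes_per_param * train_factor / 1e9               # ~12.3 GB

embedding_params = 901_629 * 768                                          # vocab embedding matrix, ~692M params
saved_params = embedding_params * (1 - 0.07)                              # drop ~93% of the vocab rows
saved_training_gb = saved_params * bytes_per_param * train_factor / 1e9   # ~10-11 GB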

At the current state of this branch, it should already work to train a SequenceTagger with the default setup (only 1 transformer embedding, no checkpointing, decoder_lr_factor==1, etc.)

However, if that leads to any problems, you can always just check out my repo and use it "manually":

from transformer_smaller_training_vocab import reduce_train_vocab

# gather all texts occurring in the corpus
texts = [[t.text for t in sentence] for sentence in corpus.get_all_sentences()]

# temporarily shrink the embedding vocab to those texts while training
with reduce_train_vocab(model=embeddings.model, tokenizer=embeddings.tokenizer, texts=texts):
    trainer.fine_tune(path, ...)

# save after the context manager has restored the full vocabulary
tagger.save(path + "/model.pt")

@bratao
Contributor

bratao commented Feb 6, 2023

@helpmefindaname would this tweet about vocab size be something to consider?

The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

https://twitter.com/karpathy/status/1621578354024677377

@helpmefindaname helpmefindaname marked this pull request as ready for review February 6, 2023 11:17
@helpmefindaname
Member Author

@helpmefindaname would this tweet about vocab size be something to consider?

The most dramatic optimization to nanoGPT so far (~25% speedup) is to simply increase vocab size from 50257 to 50304 (nearest multiple of 64). This calculates added useless dimensions but goes down a different kernel path with much higher occupancy. Careful with your Powers of 2.

https://twitter.com/karpathy/status/1621578354024677377

This looks interesting, thank you for sharing it!
I will note that down and experiment with that.

However, my initial assumption would be that the speedup comes from the last layer, due to the huge matrix multiplication, while the embedding lookup in the first layer likely won't see much gain. But that is something I need to try out and verify.
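For reference, padding a (reduced) vocab size up to the next multiple of 64 is straightforward; the open question is whether it helps when only the embedding lookup, and not a full output projection over the vocab, is involved:

import math

def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple, as suggested in the tweet
    return math.ceil(vocab_size / multiple) * multiple

print(pad_vocab_size(50257))  # -> 50304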

@helpmefindaname helpmefindaname changed the title Draft: integrate transformer-smaller-training-vocab integrate transformer-smaller-training-vocab Feb 7, 2023
@helpmefindaname helpmefindaname marked this pull request as draft February 9, 2023 10:03
@helpmefindaname helpmefindaname force-pushed the smaller-training-vocab branch 2 times, most recently from a61944d to 71c6ef1 Compare February 19, 2023 03:37
@helpmefindaname helpmefindaname marked this pull request as ready for review February 19, 2023 04:09
Collaborator

@alanakbik alanakbik left a comment


Thanks for adding this @helpmefindaname, this is a great feature!

As per the review comments, I worry about adding a new property (supports_smaller_training_vocab) and a new method get_used_tokens to the top-level classes (Model and Classifier) and all implementing classes, since they are needed by only one feature.

One idea:

  • the supports_smaller_training_vocab property could be replaced with a check when running the trainer with reduce_transformer_vocab=True, similar to what happens in line 295 in trainer.py. If the model is not an instance of Classifier, issue a warning and set it to False.
  • the get_used_tokens method could be refactored into a property, so that it does not require a Corpus as input. Instead of computing all tokens plus special characters, it would only return the special characters used by a model, i.e. for the RelationClassifier it would only return the special chars used in the encoding strategy.
    This would remove the Corpus dependency from the interface and make the logic cleaner. Then, in the trainer, you could call a generic get_all_sentences to get all tokens and add the special chars returned by this property (a rough sketch follows below).
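A rough sketch of what that could look like (the helper names are hypothetical; only Classifier and corpus.get_all_sentences() are existing flair API):

from typing import Iterable, List

from flair.nn import Classifier


def vocab_reduction_supported(model) -> bool:
    # proposed replacement for the supports_smaller_training_vocab property:
    # a plain instance check inside the trainer instead of a property on every model
    return isinstance(model, Classifier)


def gather_training_texts(model, corpus) -> Iterable[List[str]]:
    # hypothetical trainer-side helper: collect all corpus tokens generically ...
    texts = [[token.text for token in sentence] for sentence in corpus.get_all_sentences()]
    # ... and append whatever special markers the model exposes via a (proposed) property
    texts.append(list(getattr(model, "special_tokens", [])))
    return texts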

Then a few questions:

  • how is the new FLERT special context token handled by this, since it only exists after a sentence is embedded?
  • what do you need the should_embed_sentence parameter for?

@@ -572,6 +589,7 @@ def __init__(
decoder: Optional[torch.nn.Module] = None,
inverse_model: bool = False,
train_on_gold_pairs_only: bool = False,
should_embed_sentence: bool = True,
Collaborator


Why is this parameter added? Are there models that inherit from DefaultClassifier that do not embed?

Member Author


This is part of the fix for the TextPairClassifier. While all other implementations of the DefaultClassifier need to embed the sentence first, the TextPairClassifier doesn't want the "TextPair" embedded directly, but rather decides afterwards how to embed it (both sentences separately vs. as one combined sentence).

Notice that this parameter is hardcoded in the __init__ of the respective implementation; it won't be saved or loaded.
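A simplified sketch of that pattern (not the real flair TextPairClassifier signature, just the idea):

from flair.nn import DefaultClassifier


class MyPairClassifier(DefaultClassifier):
    def __init__(self, **kwargs):
        # opt out of the default embedding step; the model embeds the pair itself
        # later (both sentences separately or as one combined sentence)
        super().__init__(should_embed_sentence=False, **kwargs)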

Comment on lines 236 to 239
@property
def supports_smaller_training_vocab(self) -> bool:
# the smaller training vocab expects classification tasks, otherwise it won't work.
return False
Collaborator


If I understand correctly, this is True for all models that inherit from Classifier, and False otherwise. If so, could this property be removed from the interface and replaced with an isinstance check in the Trainer?

Member Author


If I understand correctly, this is True for all models that inherit from Classifier, and False otherwise.

Actually, I forgot to add an implementation for the TextRegressor and will add that soon. Then this statement would be false.
I would say the correct rule is "everything that is not generative is true".

Comment on lines +241 to +242
def get_used_tokens(self, corpus: Corpus) -> typing.Iterable[List[str]]:
pass
Collaborator


Similar to above, this could either be moved into Classifier or replaced with a function. I worry a bit about "bloating" the top level interfaces with two new functions that only one feature needs.

Member Author


I feel that since these are not abstract methods but default implementations, subclasses are not required to override them, so the top-level interfaces are not bloated in terms of what needs to be implemented.

@helpmefindaname
Member Author

  • how is the new FLERT special context token handled by this, since it only exists after a sentence is embedded?

The FLERT special token is added as a special token, and special tokens are always kept, see https://github.com/helpmefindaname/transformer-smaller-training-vocab/blob/main/transformer_smaller_training_vocab/token_stats.py#L12. However, I am not 100% sure that this is fully correct; I will check it.

@alanakbik
Collaborator

@helpmefindaname thanks a lot for integrating this! I tested it on the cluster. FLERT trains about 15% faster with the reduced vocab :)

@alanakbik alanakbik merged commit cb6b0e5 into master Mar 3, 2023
@alanakbik alanakbik deleted the smaller-training-vocab branch March 3, 2023 11:29