[Bug]: fasttext embeddings don't work #3291

@sinaahmadi

Describe the bug

Thanks for making it possible to use custom embeddings. Using FastText is particularly useful for less-resourced languages that are not supported in your own embeddings yet.

I have an issue with FastText embeddings (binary files). When using FastText, it is not possible to save the model even though it trains without any problem. The run fails with FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/gj/dvhshvn52d92wrk8r7cqgx9r0000gp/T/tmpglky2zrb/fasttext.model.vectors_vocab.npy'. According to the traceback below, the error is raised when the trainer reloads best-model.pt for the final test.

To rule out other problems in my code, I converted the FastText embeddings to Gensim format and could then train and save the model successfully.
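For reference, one way such a conversion could look is sketched below. It is an assumption about the approach, not the exact script used here: it presumes gensim 4.x, the function name `convert_fasttext_bin_to_keyed_vectors` is hypothetical, and the output file name is a placeholder. Note that it keeps only the full-word vectors, so the subword information that lets FastText embed out-of-vocabulary words is lost.

```python
from pathlib import Path


def convert_fasttext_bin_to_keyed_vectors(bin_path: str, out_path: str) -> Path:
    """Save the full-word vectors of a Facebook FastText .bin file as plain gensim KeyedVectors.

    Hypothetical helper for illustration; assumes gensim 4.x is installed.
    """
    # Imported lazily so that merely defining this helper does not require gensim.
    from gensim.models import KeyedVectors
    from gensim.models.fasttext import load_facebook_vectors

    # Load the native FastText binary (vocabulary vectors plus subword n-grams).
    fasttext_vectors = load_facebook_vectors(bin_path)

    # Copy only the per-word vectors into a plain KeyedVectors object.
    plain_vectors = KeyedVectors(vector_size=fasttext_vectors.vector_size)
    plain_vectors.add_vectors(fasttext_vectors.index_to_key, fasttext_vectors.vectors)

    target = Path(out_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    plain_vectors.save(str(target))
    return target
```

The saved file would then be passed to `WordEmbeddings` instead of `FastTextEmbeddings`, for example `WordEmbeddings('cc.ckb.300.kv')`, assuming Flair loads custom gensim vectors from a local path in this way.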

I am not sure what the problem with FastText is but there seems to be a bug either with saving the embeddings or pointing to the correct directory.

To Reproduce

from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from flair.embeddings import FastTextEmbeddings

columns = {0: 'text', 1: 'pos'}

# this is the folder in which train, test and dev files reside
data_folder = 'datasets'

# init a corpus using column format, data folder and the names of the train, dev and test files
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')

print(len(corpus.train))
print(corpus.train[1].to_tagged_string('pos'))

label_type = 'pos'
label_dict = corpus.make_label_dictionary(label_type=label_type)
print(label_dict)

embeddings_fasttext = FastTextEmbeddings('/Users/sina/Bucket/Embeddings/cc.ckb.300.bin')

model = SequenceTagger(hidden_size=256,
                        embeddings=embeddings_fasttext,
                        tag_dictionary=label_dict,
                        tag_type=label_type)

trainer = ModelTrainer(model, corpus)

trainer.train('models',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=20)

Expected behavior

I expect the model to be saved after training.

Logs and Stack traces

2023-08-05 21:56:32,546 Evaluating as a multi-label problem: False
2023-08-05 21:56:32,561 DEV : loss 2.1664223670959473 - f1-score (micro avg)  0.1777
2023-08-05 21:56:32,563 BAD EPOCHS (no improvement): 0
2023-08-05 21:56:32,563 saving best model
2023-08-05 21:56:39,026 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/sina/POS/train_pos.py", line 35, in <module>
    trainer.train('models',
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/trainers/trainer.py", line 893, in train
    final_score = self.final_test(
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/trainers/trainer.py", line 1015, in final_test
    self.model.load_state_dict(self.model.load(base_path / "best-model.pt").state_dict())
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/models/sequence_tagger_model.py", line 1035, in load
    return cast("SequenceTagger", super().load(model_path=model_path))
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/nn/model.py", line 559, in load
    return cast("Classifier", super().load(model_path=model_path))
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/nn/model.py", line 198, in load
    model = cls._init_model_with_state_dict(state)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/models/sequence_tagger_model.py", line 617, in _init_model_with_state_dict
    return super()._init_model_with_state_dict(
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/nn/model.py", line 86, in _init_model_with_state_dict
    embeddings = load_embeddings(embeddings)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/embeddings/base.py", line 227, in load_embeddings
    return cls.load_embedding(params)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/embeddings/base.py", line 97, in load_embedding
    embedding = cls.from_params(params)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/embeddings/token.py", line 1085, in from_params
    return cls(**params, embeddings=str(out_path), use_local=True)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/flair/embeddings/token.py", line 1040, in __init__
    self.precomputed_word_embeddings = FastTextKeyedVectors.load(str(embeddings_path))
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1001, in load
    return super(FastTextKeyedVectors, cls).load(fname_or_handle, **kwargs)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/gensim/utils.py", line 487, in load
    obj._load_specials(fname, mmap, compress, subname)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/gensim/models/fasttext.py", line 1005, in _load_specials
    super(FastTextKeyedVectors, self)._load_specials(*args, **kwargs)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/gensim/models/keyedvectors.py", line 263, in _load_specials
    super(KeyedVectors, self)._load_specials(*args, **kwargs)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/gensim/utils.py", line 529, in _load_specials
    val = np.load(subname(fname, attrib), mmap_mode=mmap)
  File "/Users/sina/POS/venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/gj/dvhshvn52d92wrk8r7cqgx9r0000gp/T/tmpglky2zrb/fasttext.model.vectors_vocab.npy'
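
The traceback ends in gensim's `_load_specials`, which tries to read `fasttext.model.vectors_vocab.npy` from the same temporary directory as the main `fasttext.model` file. A small stdlib-only sketch for checking which companion `.npy` files actually sit next to a saved model is given below. The helper name `find_sidecar_arrays` is hypothetical, and the `<model name>.<attribute>.npy` naming pattern is inferred from the path in the error message rather than taken from the Flair or gensim documentation.

```python
from pathlib import Path
from typing import List


def find_sidecar_arrays(model_path: str) -> List[str]:
    """Return the names of .npy files stored beside model_path that use its file name as a prefix.

    Hypothetical diagnostic helper; it only inspects file names and never loads the model.
    """
    model_file = Path(model_path)
    directory = model_file.parent
    if not directory.is_dir():
        return []

    # Companion arrays are expected to be named "<model file name>.<attribute>.npy".
    prefix = model_file.name + "."
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.name.startswith(prefix) and entry.name.endswith(".npy")
    )
```

Running this on the path of a model saved directly with gensim and comparing the result with what is present in the temporary directory from the error message could show whether the companion files were never written or were simply not carried along with `best-model.pt`.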

Screenshots

No response

Additional Context

No response

Environment

Python 3.9
