System Info
- transformers version: 4.34.0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.18.0
- Safetensors version: 0.4.0
- Accelerate version: 0.23.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.1.0+cu121 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am training the codellama model on a custom dataset. Training starts, but when the trainer tries to save a checkpoint it raises the error below and training stops.
ERROR:
2023-10-11 11:34:18,589 - ERROR - Error in Logs due to Object of type method is not JSON serializable
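For reference, this is the generic message json raises when it is handed a bound method; a minimal sketch reproducing just the message (illustrative only, unrelated to the Trainer internals):

import json

class Example:
    def method(self):
        return 1

# json.dumps cannot encode a bound method and raises
# "Object of type method is not JSON serializable", the same message as above.
try:
    json.dumps({"value": Example().method})
except TypeError as exc:
    print(exc)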
CODE:
import json
import torch
import pandas as pd
import datasets
from peft import LoraConfig,PeftModel
from transformers import (AutoModelForCausalLM,AutoTokenizer,TrainingArguments,BitsAndBytesConfig)
import transformers
from trl import SFTTrainer
import os
import logging
import sys
RANK = 16
LR = 1e-4
EPOCH = 10
BATCH = 11
output_dir = f"../results/10-10-2023/{RANK}_RANK--{LR}_LR--{EPOCH}_EPOCH--{BATCH}_BATCH/"
if not os.path.exists(output_dir):
    # If the directory doesn't exist, create it
    os.makedirs(output_dir)
    print(f"Directory '{output_dir}' created.")
else:
    print(f"Directory '{output_dir}' already exists.")
# Create a logger instance
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Create a formatter with the desired format
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
# Create a stream handler to output log messages to the console
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)
# Create a file handler to log messages to a file
file_handler = logging.FileHandler(f'{output_dir}/trl-trainer-codellama.txt', encoding='utf-8') # Specify the file name here
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)
console_handler = logging.StreamHandler(stream=sys.stdout)
# DEVICE = "cuda:0" if torch.cuda.is_available() else 'cpu'
MODEL_NAME = "./CodeLlama-7b-Instruct-HF"
# loading dataset
dataset = datasets.load_from_disk("../verilog-dataset/codellama_800L_74052E/")
# loading model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME,use_safetensors=True,load_in_8bit=True,trust_remote_code=True,device_map='auto')
# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, add_special_tokens=False, add_eos_token=False, add_bos_token=False)
tokenizer.pad_token = "[PAD]"
# LORA Configuration
peft_config = LoraConfig(
    lora_alpha=RANK * 2,
    lora_dropout=0.05,
    r=RANK,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "lm_head"],
)
training_arguments = TrainingArguments(
    per_device_train_batch_size=BATCH,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    learning_rate=LR,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=EPOCH,
    warmup_ratio=0.05,
    logging_steps=5,
    save_total_limit=100,
    save_strategy="steps",
    save_steps=2,
    group_by_length=True,
    output_dir=output_dir,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=800,
    tokenizer=tokenizer,
    args=training_arguments,
)
try:
    trainer.train()
except Exception as e:
    logger.error(f"Error in Logs due to {e}")
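Note that the except block above only logs the exception message, so the traceback showing where the serialization actually fails is lost (logger.exception would keep it). As a hedged debugging aid, the sketch below probes the two dictionaries that get written as JSON around checkpoint time, the TrainingArguments and the tokenizer's stored init kwargs, for values json cannot encode; the helper name is illustrative, and the assumption that the offending value lives in one of these two places is not confirmed by the log alone.

import json

def find_unserializable(mapping, label):
    # Report which entries of a plain dict json.dumps cannot handle.
    for key, value in mapping.items():
        try:
            json.dumps(value)
        except TypeError:
            print(f"{label}: field '{key}' holds a non-serializable {type(value).__name__}")

# TrainingArguments.to_dict() is the payload behind args.to_json_string().
find_unserializable(training_arguments.to_dict(), "TrainingArguments")

# Kwargs passed to AutoTokenizer.from_pretrained (add_special_tokens, add_eos_token,
# add_bos_token) are kept in tokenizer.init_kwargs and serialized into
# tokenizer_config.json when the checkpoint's tokenizer is saved, so they are another
# candidate source of a stray method object (assumption, not verified here).
find_unserializable(tokenizer.init_kwargs, "tokenizer init_kwargs")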
Expected behavior
I expect training to continue without interruption while checkpoints are being saved.