
Error while saving checkpoint during training #26732

@ghost

Description

System Info

  • transformers version: 4.34.0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.18.0
  • Safetensors version: 0.4.0
  • Accelerate version: 0.23.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am training the CodeLlama model on a custom dataset. Training starts, but when the Trainer tries to save a checkpoint it raises the error below and training stops.

ERROR:
2023-10-11 11:34:18,589 - ERROR - Error in Logs due to Object of type method is not JSON serializable
CODE:

import json
import torch
import pandas as pd
import datasets
from peft import LoraConfig, PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig)
import transformers
from trl import SFTTrainer
import os

import logging
import sys

RANK = 16
LR = 1e-4
EPOCH = 10
BATCH = 11


output_dir = f"../results/10-10-2023/{RANK}_RANK--{LR}_LR--{EPOCH}_EPOCH--{BATCH}_BATCH/"


if not os.path.exists(output_dir):
    # If the directory doesn't exist, create it
    os.makedirs(output_dir)
    print(f"Directory '{output_dir}' created.")
else:
    print(f"Directory '{output_dir}' already exists.")


# Create a logger instance
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Create a formatter with the desired format
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')

# Create a stream handler to output log messages to the console
stream_handler = logging.StreamHandler(sys.stdout)
stream_handler.setFormatter(formatter)
logger.addHandler(stream_handler)

# Create a file handler to log messages to a file
file_handler = logging.FileHandler(f'{output_dir}/trl-trainer-codellama.txt', encoding='utf-8')  # Specify the file name here
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)


# DEVICE = "cuda:0" if torch.cuda.is_available() else 'cpu'



MODEL_NAME = "./CodeLlama-7b-Instruct-HF"

# loading dataset
dataset = datasets.load_from_disk("../verilog-dataset/codellama_800L_74052E/")
# loading model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    use_safetensors=True,
    load_in_8bit=True,
    trust_remote_code=True,
    device_map='auto',
)
# loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    add_special_tokens=False,
    add_eos_token=False,
    add_bos_token=False,
)
tokenizer.pad_token = "[PAD]"

# LORA Configuration
peft_config = LoraConfig(
    lora_alpha=RANK * 2,
    lora_dropout=0.05,
    r=RANK,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj", "lm_head"],
)



training_arguments = TrainingArguments(
    per_device_train_batch_size=BATCH,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    learning_rate=LR,
    fp16=True,
    max_grad_norm=0.3,
    num_train_epochs=EPOCH,
    warmup_ratio=0.05,
    logging_steps=5,
    save_total_limit=100,
    save_strategy="steps",
    save_steps=2,
    group_by_length=True,
    output_dir=output_dir,
    report_to="tensorboard",
    save_safetensors=True,
    lr_scheduler_type="cosine",
    seed=42)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=800,
    tokenizer=tokenizer,
    args=training_arguments,
)


try:
    trainer.train()
except Exception as e:
    logger.error(f"Error in Logs due to {e}")

Expected behavior

I expect training to continue past each checkpoint save instead of stopping with this error.
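My guess (not confirmed) is that some attribute being serialized to JSON at checkpoint time is a bound method. As a purely diagnostic sketch, I can probe each field of training_arguments with json.dumps and print the ones that fail; the same loop could be pointed at trainer.state.log_history after a few logging steps:

import json

# Diagnostic only: report attributes that raise TypeError when JSON-encoded,
# e.g. "Object of type method is not JSON serializable".
for name, value in sorted(vars(training_arguments).items()):
    try:
        json.dumps(value)
    except TypeError as exc:
        print(f"{name}: {exc}")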
