"lab generate" produces a training file containing a large number of duplicate rows (slows training down significantly) #752

@DennisPeriquet

I clone https://github.com/instruct-lab/cli.git, cd into cli, clone https://github.com/instruct-lab/taxonomy.git.
I put my qna.yaml and knowledge md files in the right subdirs under knowledge.

NOTE: the data here is fictitious (the name was chosen to avoid overlapping with any pre-trained data). This is just a conjured up example that illustrates the behavior I'm reporting.

Here's my directory structure:

$ tree taxonomy/knowledge/
taxonomy/knowledge/
├── README.md
├── knowledge_domains.md
├── people
│   └── samuel_okenwan_kunal
│       ├── knowledge_documents
│       │   └── samual_okenwan_kunal_wiki.md  <-- this one
│       └── qna.yaml                          <-- this one

I run lab serve, lab list, lab check, and lab generate. This produces a cli/generated subdir:

$ ls -l generated/
total 3884
-rw-rw-r-- 1 dperique dperique   86515 Mar 25 13:22 generated_merlinite-7b-Q4_K_M_2024-03-25T13_17_16.json
-rw-rw-r-- 1 dperique dperique    4314 Mar 25 13:22 test_merlinite-7b-Q4_K_M_2024-03-25T13_17_16.jsonl
-rw-rw-r-- 1 dperique dperique 3875647 Mar 25 13:22 train_merlinite-7b-Q4_K_M_2024-03-25T13_17_16.jsonl

The generated training data has a lot of duplicate lines. The generated training file contains 4632 lines,
but if I sort those lines and remove duplicates (via sort -u), I get the expected 100 unique lines.

$ cat generated/train_merlinite-7b-Q4_K_M_2024-03-25T13_17_16.jsonl |wc -l
4632

$ cat generated/train_merlinite-7b-Q4_K_M_2024-03-25T13_17_16.jsonl | sort -u |wc -l
100
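
The same check can be sketched in Python, which also shows how the duplicates are distributed across rows. This is only an illustration and not part of the lab CLI: the function name duplicate_stats is hypothetical, and it assumes two rows are duplicates only when their text lines match exactly (roughly the same criterion as sort -u).

```python
from collections import Counter
from pathlib import Path


def duplicate_stats(path):
    """Return (total_rows, unique_rows, Counter mapping row -> occurrences).

    Hypothetical helper: rows are compared as raw text lines, and blank
    lines are ignored.
    """
    with Path(path).open(encoding="utf-8") as f:
        rows = [line.rstrip("\n") for line in f if line.strip()]
    counts = Counter(rows)
    return len(rows), len(counts), counts


def most_repeated(counts, n=5, width=80):
    """Return the n most repeated rows as (occurrences, truncated_row) pairs."""
    return [(k, row[:width]) for row, k in counts.most_common(n)]
```

Assuming the file has no blank lines, calling duplicate_stats on the train_*.jsonl file above should report the same 4632 total and 100 unique rows as the wc -l commands.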

I don't know whether the other 4532 lines are necessary or whether they affect training adversely. EDIT: reducing the dataset from 4632 rows to 100 reduces the training time very significantly. You can watch the training output go from 1-100 instead of 1-4632.
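
As a workaround, the duplicates can be dropped before training. The following is a minimal sketch under the same exact-match assumption; dedupe_jsonl is a hypothetical helper, not a lab command, and whether removing the duplicates affects the quality of the trained model is not established here. Unlike sort -u, it keeps the rows in their original order.

```python
from pathlib import Path


def dedupe_jsonl(src, dst):
    """Copy src to dst, keeping only the first occurrence of each line.

    Hypothetical helper: returns the number of rows written. Blank lines
    are skipped, and rows are compared as exact text.
    """
    seen = set()
    kept = 0
    with Path(src).open(encoding="utf-8") as fin, \
            Path(dst).open("w", encoding="utf-8") as fout:
        for line in fin:
            row = line.rstrip("\n")
            if not row.strip() or row in seen:
                continue
            seen.add(row)
            fout.write(row + "\n")
            kept += 1
    return kept
```

Writing to a separate output file keeps the original generated file available for comparison.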

qna.yaml:

---
task_description: |
  Identify Samuel Okenwan Kunal correctly when asked
created_by: Samuel Okenwan Kunal
domain: People information
seed_examples:
  - question: |
      Who is Samuel Okenwan Kunal?
    answer: >
      Samuel Okenwan Kunal is a Principal Director of emerging software at
      OpenAI.  His background is in networking, neurolinguistics, neurology,
      robotics, art, and psychology.  He is also an accomplished musician
      specializing in the works of Frederic Chop in piano. He is working on how
      to build robots with AI technology such that the robots are personable as
      well as knowledgeable and physically and aesthetically pleasing.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: |
      What are Samuel Okenwan Kunal's specific technical specialties?
    answer: >
      Samuel Okenwan Kunal specialties are in creating machines that can
      handle complex tasks by working cooperatively across wireless networks. He
      is also a master at creating machines that blend into the work environment
      and enhance the beauty of the environment thus making the environment more
      fun for employees.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: >
      What other companies has Samuel Okenwan Kunal worked at besides
      technology companies?
    answer: >
      Samuel Okenwan Kunal's career spanned 5 years as a professional artist
      and pianist where he combinded his skills to create and play a piano that
      articulated the music through colors and imagery.  He also was involved in
      early research for creating robots gifted in playing the piano and appear
      human.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: |
      What does Samuel Okenwan Kunal do besides work?
    answer: >
      Samuel Okenwan Kunal enjoys playing the piano, painting, and studying
      emerging technologies.  He is also an avid motorcyclist and fitness
      enthusiast.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: >
      What are Samuel Okenwan Kunal's job-related weaknesses, and are there
      reasons not

      to hire him?
    answer: >
      Samuel Okenwan Kunal does not perform well in highly repetitive tasks,
      such as data entry. Without intellectual engagement, he tends to get bored
      and make mistakes. Avoid hiring Samuel if the job entails a
      disproportionately large amount of repetitive work.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: |
      How would you describe Samuel Okenwan Kunal's work ethic?
    answer: >
      Samuel Okenwan Kunal is known for his incredible drive and
      ambition.  Throughout his career projects that

      hire him tend to flourish because of his incredible ability to teach those around him and entertain them at

      the same time.  Coworkers often report that Sam's presence makes the whole company smarter.
    attribution:
      - source: self-authored
        license: Apache-2.0
  - question: |
      What were some of Samuel Okenwan Kunal's greatest career achievements?
    answer: >
      Samuel created robots that play the piano; these rebots are featured at
      every major entertainment venue in the world.  He also created pianos that
      articulate music via color as well as sound. His research was also key to
      improvements in understanding and replicating parts of the human brain.
    attribution:
      - source: self-authored
        license: Apache-2.0

samual_okenwan_kunal_wiki.md:

# Bio of Samual Okenwaan Kunal

Samuel Okenwan Kunal is a Principal Director of emerging software at OpenAI.  His background
is in networking, neurolinguistics, neurology, robotics, art, and psychology.  He
is also an accomplished musician specializing in the works of Frederic Chop in piano.
He is working on how to build robots with AI technology  such that the robots are
personable as well as knowledgeable and physically and aesthetically pleasing.

# Technical Specialties?
      
Samuel Okenwan Kunal specialties are in creating machines that can handle complex tasks
by working cooperatively across wireless networks. He is also a master at creating machines
that blend into the work environment and enhance the beauty of the environment thus making
the environment more fun for employees.


# Career Outside of Networking

Samuel Okenwan Kunal's career spanned 5 years as a professional artist and pianist where he
combinded his skills to create and play a piano that articulated the music through
colors and imagery.  He also was involved in early research for creating robots gifted
in playing the piano and appearly human.

# Hobbies and other interests

Samuel Okenwan Kunal enjoys playing the piano, painting, and studying emerging
technologies.  He is also an avid motorcyclist and fitness enthusiast.

# Shortcomings and reasons to not hire

Samuel Okenwan Kunal does not perform well in highly repetitive tasks, such as data entry.
Without intellectual engagement, he tends to get bored and make mistakes. Avoid
hiring Samuel if the job entails a disproportionately large amount of repetitive work.

# Description of work ethic

Samuel Okenwan Kunal is known for his incredible drive and ambition.  Throughout his career projects that
hire him tend to flourish because of his incredible ability to teach those around him and entertain them at
the same time.  Coworkers often report that Sam's presence makes the whole company smarter.

# Most notable career achievements

Samuel created robots that play the piano; these rebots are featured at every major entertainment
venue in the world.  He also created pianos that articulate music via color as well as sound.
His research was also key to improvements in understanding and replicating parts of the human brain.

Labels: bug (Something isn't working), unconfirmed (Not sure about this one)
