Use the taxonomy schema to validate taxonomy yaml files #776

bjhargrave · 2024-04-02T21:12:52Z

Changes

Which issue is resolved by this Pull Request:
Part of #760

Description of your changes:

This change updates read_taxonomy_file to also validate that the yaml file conforms to the schema for the part of the taxonomy in which the yaml file resides. This validation is in addition to the existing yaml linting checks.
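
For readers unfamiliar with this kind of check, here is a minimal sketch of validating a parsed yaml file with jsonschema. It is not the code in this PR: `KNOWLEDGE_SCHEMA` is a hypothetical, heavily trimmed stand-in for the vendored schema files, and `schema_errors` is an illustrative helper name.

```python
import jsonschema
import yaml

# Hypothetical, trimmed-down schema for illustration only; the real schema
# files are the ones vendored into the repo by this PR.
KNOWLEDGE_SCHEMA = {
    "type": "object",
    "required": ["created_by", "seed_examples", "task_description"],
    "properties": {
        "created_by": {"type": "string"},
        "task_description": {"type": "string"},
        "seed_examples": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["question", "answer"],
                "properties": {
                    "question": {"type": "string"},
                    "answer": {"type": "string"},
                },
            },
        },
    },
}


def schema_errors(yaml_text, schema):
    """Return one readable message per schema violation (empty list if valid)."""
    contents = yaml.safe_load(yaml_text)
    validator = jsonschema.Draft7Validator(schema)
    messages = []
    for err in validator.iter_errors(contents):
        # absolute_path points at the offending element, e.g. seed_examples/0
        location = "/".join(str(part) for part in err.absolute_path) or "<root>"
        messages.append(f"{location}: {err.message}")
    return messages
```

Collecting every violation with `iter_errors`, rather than stopping at the first one, lets a caller log all problems in a file before ending with an error.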

Several tests required updating so that their test yaml files are schema-valid.

For now we include the schema files in this repo. Once anonymous access to the schema repo is available, we will change to access the schema files by using git submodule on the schema repo.

bjhargrave · 2024-04-02T21:21:36Z

I think the functional test failure is flakiness in the tests. See #748.

bjhargrave · 2024-04-03T15:15:22Z

Updated to use patterns instead of pattern for knowledge schema. See instructlab/taxonomy#654 and instructlab/schema#3 (comment).

anik120 · 2024-04-03T22:21:57Z

Updated to use patterns instead of pattern for knowledge schema. See instruct-lab/taxonomy#654 and instruct-lab/schema#3 (comment).

@bjhargrave this needs to be updated too

anik120 · 2024-04-03T22:23:34Z

I'm adding this to the next milestone

bjhargrave · 2024-04-03T22:46:55Z

this needs to be updated too

Done in b3669de#diff-c930cb39432fe5403f7d83393641f63b81edd36cdd0a7788eb6f3548dee1a13fR186

anik120 · 2024-04-04T03:13:16Z

@bjhargrave doing a quick PR #791 to get us moving with patterns

bjhargrave · 2024-04-04T13:34:31Z

doing a quick PR #791 to get us moving with patterns

Great. I rebased.

bjhargrave · 2024-04-04T15:22:52Z

@xukai92 The packaging issue which caused the functional test failure is now fixed.


anik120 · 2024-04-05T15:55:43Z

Capturing a discussion with @shivchander:

I was writing up a test case for the cli to test the knowledge workflow, and I laid out my qna.yaml as follows:

test_knowledge_valid = b"""created_by: test-bot
seed_examples:
- question: What is Operator Framework? 
  answer: 'The Operator Framework is a set of Kubernetes components and developer tools, 
  that aid in Operator development and central management on a multi-tenant cluster.'
- question: What is an Operator? 
  answer: 'The goal of an Operator is to put operational knowledge into software. 
  Previously this knowledge only resided in the minds of administrators, 
  various combinations of shell scripts or automation software like Ansible. 
  It was outside of your Kubernetes cluster and hard to integrate. 
  With Operators, CoreOS changed that. Operators implement and automate 
  common Day-1 (installation, configuration, etc.) and Day-2 (re-configuration, 
  update, backup, failover, restore, etc.) activities in a piece of software running 
  inside your Kubernetes cluster, by integrating natively with Kubernetes concepts and APIs. 
  We call this a Kubernetes-native application. 
  With Operators you can stop treating an application as a collection of primitives like Pods, 
  Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs 
  that make sense for the application.'
- question: What is Operator Lifecycle Manager? 
  answer: 'OLM is a component of the Operator Framework, 
  an open source toolkit to manage Kubernetes native applications, 
  called Operators, in an effective, automated, and scalable way. 
  OLM extends Kubernetes to provide a declarative way to install, 
  manage, and upgrade Operators and their dependencies in a cluster.'
task_description: to teach a large language model about the Operator Framework
document:
  repo: https://github.com/anik120/knowledge-doc-test
  commit: bf78d868f544e55d8e1d99f68d9105fc3b8751bd
  patterns:
  - operator-framework*.md
"""

Essentially, the seed_examples question/answer pairs I have there are from our overarching project websites https://operatorframework.io, https://olm.operatorframework.io and https://sdk.operatorframework.io, and the documents I have in https://github.com/anik120/knowledge-doc-test are README.md files from our components' GitHub repositories. In other words, the seed_examples question/answer pairs do not actually come from the documents hosted in document.repo.

The way I laid things out, the seed_examples are "product pitch/summary description" and document.repo contains all the docs I want the model to learn about.

Shiv tells me that that's the wrong way of thinking about it: the key document should really be named source, and seed_examples are examples of questions and answers that can be answered once the model is trained on the docs hosted in document.repo.

Eureka moment: Even after learning how the taxonomy interacts with the model (only a little while ago, i.e. fresh info still being processed by my brain), I was thinking about the structure of my knowledge doc the wrong way. It's likely that other users will also confuse the taxonomy/model interactions and lay out their qna.yaml files the wrong way, leading to PR submissions that are unlikely to improve model quality.

Proposed fix: Change document to source

anik120 · 2024-04-08T17:13:23Z

cli/lab.py

@@ -235,8 +235,9 @@ def diff(ctx, taxonomy_path, taxonomy_base, yaml_rules, quiet):
    if quiet:
        try:
            read_taxonomy(logger, taxonomy_path, taxonomy_base, yaml_rules)
-        except (Exception, yaml.YAMLError) as exc:
+        except (SystemExit, yaml.YAMLError) as exc:


This doesn't catch exceptions like https://github.com/instruct-lab/cli/blob/main/cli/generator/generate_data.py#L652 though

Yes. However https://github.com/instruct-lab/cli/blob/212e1c5234ff71006dd708dfc6e464c591ea7c74/cli/lab.py#L252 does not either. There are currently two code paths here: one for --quiet and one for not quiet and they have different excepts. I opened #800 to have a single mainline code path with ifs for non-quiet output.
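
As background for the except change in the diff above: `SystemExit` derives from `BaseException`, not `Exception`, so an `except Exception` clause never sees it. The sketch below demonstrates this; `validate_or_exit` and `caught_by` are hypothetical stand-ins, and it assumes the reader ends the process by calling `sys.exit`.

```python
import sys


def validate_or_exit(ok):
    """Hypothetical stand-in for a reader that ends the process on invalid input."""
    if not ok:
        sys.exit(1)  # raises SystemExit, which is not a subclass of Exception
    return "valid"


def caught_by(exc_types, ok):
    """Return the name of the exception caught, or None if nothing was raised."""
    try:
        validate_or_exit(ok)
    except exc_types as exc:
        return type(exc).__name__
    return None


# Catching SystemExit explicitly works.
print(caught_by((SystemExit,), ok=False))

# Catching only Exception does not: SystemExit propagates to the caller.
try:
    caught_by((Exception,), ok=False)
except SystemExit:
    print("SystemExit escaped the 'except Exception' clause")
```

The flip side, as noted in this thread, is that a clause listing only `SystemExit` and `yaml.YAMLError` will not catch ordinary `Exception` subclasses raised elsewhere.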

anik120 · 2024-04-08T17:17:41Z

cli/utils.py

+TAXONOMY_FOLDERS: List[str] = ["compositional_skills", "knowledge"]
+"""Taxonomy folders which are also the schema names"""


Should this be part of the schema instead?

So a file in schema/v1/, perhaps schema/v1/constants.py ?

This is possible, but I would do it in a follow-on PR. It won't be useful in the taxonomy repo, however, since that repo refers to the folder names in .github workflows (which need to work before the schema repo could be git submoduled).

Makes sense
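
To illustrate how a constant like this can tie a file to its schema, here is a small sketch that picks the schema name from the first folder of a taxonomy-relative path. `schema_name_for` is a hypothetical helper, not a function from this PR, and it assumes paths are given relative to the taxonomy base directory with `/` separators.

```python
from pathlib import PurePosixPath
from typing import List, Optional

TAXONOMY_FOLDERS: List[str] = ["compositional_skills", "knowledge"]
"""Taxonomy folders which are also the schema names"""


def schema_name_for(relative_path: str) -> Optional[str]:
    """Return the schema name for a taxonomy-relative path, or None.

    Hypothetical helper: the top-level folder doubles as the schema name.
    """
    parts = PurePosixPath(relative_path).parts
    if parts and parts[0] in TAXONOMY_FOLDERS:
        return parts[0]
    return None


print(schema_name_for("knowledge/tech/operators/qna.yaml"))
print(schema_name_for("docs/README.md"))
```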

bjhargrave requested review from abhi1092, soltysh, xukai92, markstur, hickeyma, afrittoli, spzala, Tomcli, mrutkows and anik120 as code owners April 2, 2024 21:12

This was referenced Apr 2, 2024

Use Taxonomy schema to validate YAML files #760

Closed

lab generate: keyerrors not easy to decipher #434

Closed

anik120 mentioned this pull request Apr 3, 2024

knowledge-support: Split large documents #777

Closed

bjhargrave mentioned this pull request Apr 3, 2024

knowledge "document": pattern -> patterns #785

Closed

anik120 added this to the Milestone 04/09 milestone Apr 3, 2024

anik120 assigned bjhargrave Apr 3, 2024

bjhargrave mentioned this pull request Apr 3, 2024

fix: Suppress valid message on diff --quiet #784

Closed

xukai92 mentioned this pull request Apr 4, 2024

update functional tests for newer llama_cpp_python version #788

Closed

This was referenced Apr 4, 2024

Taxonomy Path used in prompt should be relative to taxonomy base directory #797

Closed

Don't allow pyyaml to convert Yes/No/On/Off into bools #813

Closed

Cast yaml values to strings #710

Closed

fixes: Fix some minor errors found during development of schema support

834244c

Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com> Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

bjhargrave added 2 commits April 4, 2024 19:34

schema: Vendor in taxonomy schema files

7c98b0c

We will later replace with a git submodule of the schema repo once anonymous access to the repo is available. But for now, we vendor in the files. Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

schema: Use taxonomy schema to validate YAML files

6c0d640

We use jsonschema to validate that input yaml files conform to the schema for their relative path. Any validation failures are logged and the process ends with an error. See #760 Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>

xukai92 mentioned this pull request Apr 5, 2024

error when running lab generate - generate_data.py:538 'question' #601

Closed

anik120 mentioned this pull request Apr 5, 2024

Proposal: Change document to source instructlab/taxonomy#661

Closed

anik120 suggested changes Apr 8, 2024

anik120 approved these changes Apr 9, 2024

anik120 merged commit c0a1b15 into instructlab:main Apr 9, 2024

anik120 deleted the issues/760 branch April 9, 2024 18:14

derekhiggins mentioned this pull request Apr 9, 2024

lab generate: doesn't like numbers in values #709

Closed
