Skip to content

Conversation

bjhargrave
Copy link
Contributor

@bjhargrave bjhargrave commented Apr 2, 2024

Changes

Which issue is resolved by this Pull Request:
Part of #760

Description of your changes:

This change updates read_taxonomy_file to also validate that the yaml file conforms to the schema for the part of the taxonomy in which the yaml file resides. This validation is in addition to the existing yaml linting checks.

Several tests required updating so that their test yaml files were schema valid.

For now we include the schema files in this repo. Once anonymous access to the schema repo is available, we will change to access the schema files by using git submodule on the schema repo.

@bjhargrave
Copy link
Contributor Author

I think the functional test failure is flakiness in the tests. See #748.

@bjhargrave
Copy link
Contributor Author

Updated to use patterns instead of pattern for knownledge schema. See instructlab/taxonomy#654 and instructlab/schema#3 (comment).

@anik120
Copy link
Contributor

anik120 commented Apr 3, 2024

Updated to use patterns instead of pattern for knownledge schema. See instruct-lab/taxonomy#654 and instruct-lab/schema#3 (comment).

@bjhargrave this needs to be updated too

@anik120
Copy link
Contributor

anik120 commented Apr 3, 2024

I'm adding this to the next milestone

@anik120 anik120 added this to the Milestone 04/09 milestone Apr 3, 2024
@bjhargrave
Copy link
Contributor Author

bjhargrave commented Apr 3, 2024

@anik120
Copy link
Contributor

anik120 commented Apr 4, 2024

@bjhargrave doing a quick PR #791 to get us moving with patterns

@bjhargrave
Copy link
Contributor Author

doing a quick PR #791 to get us moving with patterns

Great. I rebased.

@bjhargrave
Copy link
Contributor Author

@xukai92 The packaging issue which caused the functional test failure is now fixed.

Co-authored-by: Rafael Vasquez <rafvasq21@gmail.com>
Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
We will later replace with a git submodule of the schema repo once
anonymous access to the repo is available. But for now, we vendor in
the files.

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
We use jsonschema to validate that input yaml files conform to the
schema for their relative path. Any validation failures are logged
and the process ends with an error.

See #760

Signed-off-by: BJ Hargrave <hargrave@us.ibm.com>
@anik120
Copy link
Contributor

anik120 commented Apr 5, 2024

Capturing a discussion with @shivchander:

I was writing up a test case for the cli to test knowledge workflow, but the way that I laid out my qna.yaml as follows:

test_knowledge_valid = b"""created_by: test-bot
seed_examples:
- question: What is Operator Framework? 
  answer: 'The Operator Framework is a set of Kubernetes components and developer tools, 
  that aid in Operator development and central management on a multi-tenant cluster.'
- question: What is an Operator? 
  answer: 'The goal of an Operator is to put operational knowledge into software. 
  Previously this knowledge only resided in the minds of administrators, 
  various combinations of shell scripts or automation software like Ansible. 
  It was outside of your Kubernetes cluster and hard to integrate. 
  With Operators, CoreOS changed that. Operators implement and automate 
  common Day-1 (installation, configuration, etc.) and Day-2 (re-configuration, 
  update, backup, failover, restore, etc.) activities in a piece of software running 
  inside your Kubernetes cluster, by integrating natively with Kubernetes concepts and APIs. 
  We call this a Kubernetes-native application. 
  With Operators you can stop treating an application as a collection of primitives like Pods, 
  Deployments, Services or ConfigMaps, but instead as a single object that only exposes the knobs 
  that make sense for the application.'
- question: What is Operator Lifecycle Manager? 
  answer: 'OLM is a component of the Operator Framework, 
  an open source toolkit to manage Kubernetes native applications, 
  called Operators, in an effective, automated, and scalable way. 
  OLM extends Kubernetes to provide a declarative way to install, 
  manage, and upgrade Operators and their dependencies in a cluster.
task_description: to teach a large language model about the Operator Framework
document:
  repo: https://github.com/anik120/knowledge-doc-test
  commit: bf78d868f544e55d8e1d99f68d9105fc3b8751bd
  patterns:
  - operator-framework*.md

Essentially, the seed_example question/answers I have there are from our overarching project websites https://operatorframework.io, https://olm.operatorframework.io and https://sdk.operatorframework.io, and the documents I have in https://github.com/anik120/knowledge-doc-test are README.mds from our components' GitHub repository. In other words, the seed_example question/answers do not actually come from the documents hosted in document.repo.

The way I laid things out, the seed_examples are "product pitch/summary description" and document.repo contains all the docs I want the model to learn about.

Shiv tells me that that's the wrong way of thinking about it, and the verb document should be source in reality, and seed_examples are examples of questions and answers that can be answered once the model is trained on the docs hosted in docs.repo.

Eureka moment: Even after learning (only a little while ago, ie fresh info being processed by brain still) how the taxonomy interacts with the model, I was thinking about the structure of my knowledge doc, the wrong way. It's likely that other users will also confuse the taxonomy/model interactions and lay out the qna.yaml files the wrong way, leading to PR submissions that'll likely not improve model quality.

Proposed fix: Change document to source

@@ -235,8 +235,9 @@ def diff(ctx, taxonomy_path, taxonomy_base, yaml_rules, quiet):
if quiet:
try:
read_taxonomy(logger, taxonomy_path, taxonomy_base, yaml_rules)
except (Exception, yaml.YAMLError) as exc:
except (SystemExit, yaml.YAMLError) as exc:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. However https://github.com/instruct-lab/cli/blob/212e1c5234ff71006dd708dfc6e464c591ea7c74/cli/lab.py#L252 does not either. There are currently two code paths here: one for --quiet and one for not quiet and they have different excepts. I opened #800 to have a single mainline code path with ifs for non-quiet output.

Comment on lines +114 to +115
TAXONOMY_FOLDERS: List[str] = ["compositional_skills", "knowledge"]
"""Taxonomy folders which are also the schema names"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of the schema instead?

So a file in schema/v1/, perhaps schema/v1/constants.py ?

Copy link
Contributor Author

@bjhargrave bjhargrave Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is possible but I would do in a follow-on PR. It wont be useful in taxonomy repo however since that repo refers to the folder names in .github workflows (which need to work before the schema repo could be git submoduled).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@anik120 anik120 merged commit c0a1b15 into instructlab:main Apr 9, 2024
@anik120 anik120 deleted the issues/760 branch April 9, 2024 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants