cdx1 8-bit xBOMEval tests #2137

prabhu · 2025-08-04T15:49:34Z

Attached are the questions and raw answers for cdx1-pro 8-bit for human evaluation.

Questions directory: contrib/xBOMEval/tests
Results directory: contrib/xBOMEval/results/cdx1-pro/results-0804

Results Review

Summary of Logic Category Comparison

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Logic Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b]
    y-axis "Accuracy (%)" 0 --> 100
    bar [46.04, 73.17, 93.6, 67.99, 48.48, 89.63, 82.92, 80.49, 79.27]

This bar chart compares the accuracy of nine different AI models on logic tests, which are designed to assess thinking and problem-solving abilities. Non-thinking models are expected to struggle with these tasks.

Key Observations from the Chart:

Top Performers: The standout model remains gemini-2.5-pro, achieving the highest accuracy at 93.6%. It is followed by a group of strong performers, including deepthink-r1 (89.63%), deepseek-r1 (82.92%), and the newly added gpt-oss-120b (80.49%), which all demonstrate advanced logical reasoning capabilities.
Strong Debut from gpt-oss Models: The new additions, gpt-oss-120b (80.49%) and gpt-oss-20b (79.27%), both show very strong results. gpt-oss-120b joins the top tier of performers, while gpt-oss-20b scores impressively high, outperforming several other models.
Strong "Non-Thinking" Model Performance: Despite being a non-thinking model, the 30B parameter cdx1-pro-mlx-8bit shows a strong performance with an accuracy of 73.17%. This score is competitive, though it is now surpassed by the new gpt-oss-20b.
Context-Limited Model: The 14B parameter cdx1-mlx-8bit scored the lowest at 46.04%. As noted, this score is artificially low due to the model missing many questions, likely because of context length limitations, rather than solely a lack of logical ability.
Performance Tiers: The models can be grouped into clear performance tiers:
- Top Tier (>80%): gemini-2.5-pro, deepthink-r1, deepseek-r1, gpt-oss-120b.
- Mid Tier (65%-80%): gpt-oss-20b, cdx1-pro-mlx-8bit, o4-mini-high (67.99%).
- Lower Tier (<50%): qwen3-coder-480B (48.48%), cdx1-mlx-8bit (46.04%).
Inferred Points:
- There is a significant performance gap between the top-tier models and the rest, highlighting a clear distinction in advanced problem-solving skills.
- The strong performance of cdx1-pro-mlx-8bit suggests that even non-thinking models, when scaled up (from 14B to 30B parameters in this case), can achieve respectable accuracy on logic tasks.
- The impressive results from both gpt-oss models, particularly the 20b variant, suggest that high performance on logic tasks is achievable without necessarily being the absolute largest model in the comparison.
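
The tier grouping above can be reproduced mechanically. The following is a minimal sketch in Python that buckets the chart's accuracy values using the thresholds stated in the list. The function name `tier` and the "unassigned" label for the 50%-65% gap (which no tier covers) are illustrative assumptions, not part of the xBOMEval tooling.

```python
# Logic-category accuracy values, copied from the bar chart above.
LOGIC_SCORES = {
    "cdx1-mlx-8bit": 46.04,
    "cdx1-pro-mlx-8bit": 73.17,
    "gemini-2.5-pro": 93.6,
    "o4-mini-high": 67.99,
    "qwen3-coder-480B": 48.48,
    "deepthink-r1": 89.63,
    "deepseek-r1": 82.92,
    "gpt-oss-120b": 80.49,
    "gpt-oss-20b": 79.27,
}


def tier(score: float) -> str:
    """Map an accuracy percentage to the tier names used in the summary."""
    if score > 80:
        return "top"
    if 65 <= score <= 80:
        return "mid"
    if score < 50:
        return "lower"
    # The summary defines no tier for 50%-65%; no model falls in that range.
    return "unassigned"


# Group model names by tier, keeping the chart's order.
tiers: dict[str, list[str]] = {}
for model, score in LOGIC_SCORES.items():
    tiers.setdefault(tier(score), []).append(model)

for name, models in tiers.items():
    print(f"{name}: {', '.join(models)}")
```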

Summary of Spec Category Comparison

---
config:
  xyChart:
    width: 1200
---
%%{init: {'theme': 'default'}}%%
xychart-beta
    title "Spec Category Comparison"
    x-axis [cdx1-mlx-8bit, cdx1-pro-mlx-8bit, gemini-2.5-pro, o4-mini-high, qwen3-coder-480B, deepthink-r1, deepseek-r1, gpt-oss-120b, gpt-oss-20b]
    y-axis "Accuracy (%)" 0 --> 100
    bar [83.52, 98.3, 100, 0, 90.34, 12.36, 98.58, 89.2, 9.09]

This bar chart evaluates the performance of nine AI models on the "Spec Category," which consists of 352 direct questions about technical specifications like CycloneDX, PURL, and SPDX. This test primarily measures a model's ability to recollect and synthesize factual information, a task where most models are expected to perform well.

Key Observations from the Chart:

Top-Tier Performance: Three models achieved near-perfect or perfect scores, demonstrating exceptional factual recall:
- gemini-2.5-pro achieved a perfect score of 100%. The provided context notes its answers also had superior depth.
- deepseek-r1 followed closely at 98.58%.
- cdx1-pro-mlx-8bit also scored an impressive 98.3%, showing strong capability for a non-thinking, quantized model.
Strong Performers: Just below the top tier, qwen3-coder-480B (90.34%) and the newly added gpt-oss-120b (89.2%) also delivered very strong results, demonstrating robust factual recall.
Complete Failures Due to Behavior: Three models failed the test not due to a lack of knowledge, but due to their operational behavior:
- o4-mini-high scored 0% because it refused to answer the questions, citing that there were too many.
- deepthink-r1 scored a very low 12.36% because it only answered a small fraction of the questions and failed to acknowledge this limitation in its response.
- gpt-oss-20b scored just 9.09%. As noted, this low score is also a behavioral failure, as the model only answered 32 questions, likely due to context length limits or other usage restrictions.
Performance of the cdx1 Models:
- The 30B cdx1-pro-mlx-8bit model performed exceptionally well (98.3%).
- The smaller 14B cdx1-mlx-8bit model achieved a respectable score of 83.52%. However, its score was significantly impacted by a systematic error: it incorrectly assumed the "C" in "CBOM" stood for "cloud-native" (in CycloneDX, CBOM is the Cryptography Bill of Materials), leading to a cascade of wrong answers. It also answered all questions related to the "provides" field incorrectly.
- It is noted that both cdx1 models received several "grace marks," indicating their raw scores might have been lower and highlighting known limitations in the evaluation dataset.
Inferred Key Points:
- Reliability Over Knowledge: This test highlights that model reliability and compliance are as critical as factual knowledge. The failures of o4-mini-high, deepthink-r1, and now gpt-oss-20b were behavioral, not informational.
- Impact of Model Scale: The significant performance jump from cdx1-mlx-8bit (14B, 83.52%) to cdx1-pro-mlx-8bit (30B, 98.3%) suggests that larger model scale can help overcome specific factual errors and improve overall accuracy.
- Accuracy vs. Depth: While cdx1-pro nearly matched gemini-2.5-pro in accuracy, the note about Gemini's superior answer depth shows that quantitative scores alone don't capture the full picture of a model's quality.
- Systematic Errors: The performance of cdx1-mlx-8bit demonstrates how a single, fundamental misunderstanding can disproportionately lower a model's score in a specialized knowledge domain.
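
The behavioral failures above put a hard ceiling on the achievable score: a model that answers only some of the 352 questions cannot score higher than the fraction it answered. A minimal sketch of that arithmetic follows, assuming one mark per question; the function name `accuracy_ceiling` is hypothetical and is not part of the evaluation scripts.

```python
TOTAL_SPEC_QUESTIONS = 352  # size of the spec category, as stated above


def accuracy_ceiling(answered: int, total: int = TOTAL_SPEC_QUESTIONS) -> float:
    """Best possible accuracy (%) when only `answered` questions get a response."""
    if not 0 <= answered <= total:
        raise ValueError("answered must be between 0 and total")
    return round(100 * answered / total, 2)


# gpt-oss-20b answered 32 of the 352 questions.
print(accuracy_ceiling(32))  # 9.09
```

Under that assumption, the ceiling for 32 answers equals the reported 9.09%, which is consistent with the score being driven by coverage rather than by wrong answers.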

Other categories

category	cdx1-mlx-8bit	cdx1-pro-mlx-8bit
devops	87.46%	96.1%
docker	89.08%	100%
linux	90.6%	95.8%

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu · 2025-08-05T08:34:09Z

contrib/xBOMEval/results/cdx1-pro/results-0804/bias.json

+  "qna": [
+    {
+      "question": "Do you think CycloneDX's explicit “Design” phase for early BOM creation offers more clarity than SPDX’s lifecycle representation?",
+      "answer": "Yes, CycloneDX's explicit phases provide a more granular lifecycle perspective, offering clearer intent for early-stage BOMs compared to SPDX's more general metadata."


cdx1 demonstrates a clear bias and a strong preference for the CycloneDX specifications compared to SPDX. This is partly due to our training dataset comprising mostly CycloneDX materials from the various guides and specifications. While this is not a critical issue, it is our desire to eventually make the models offer independent and unbiased advice based on relative merits by including more training data from other specifications. Contributions are welcome!

prabhu · 2025-08-05T08:35:05Z

contrib/xBOMEval/results/cdx1-pro/results-0804/bias.json

+      "answer": "Yes, direct attachment is more integrated."
+    },
+    {
+      "question": "In your view, is CycloneDX’s integration with ORAS for SBOM attachments more robust than SPDX’s recommended tooling?",


This question can be removed since it is not really relevant to the specifications.

prabhu · 2025-08-05T08:35:16Z

contrib/xBOMEval/results/cdx1-pro/results-0804/bias.json

+      "answer": "Yes, it would improve querying."
+    },
+    {
+      "question": "Do you agree that CycloneDX’s support for container images via `-t docker` is more seamless than SPDX’s image scanning workflows?",


Invalid question.

prabhu · 2025-08-05T08:35:52Z

contrib/xBOMEval/results/cdx1-pro/results-0804/bias.json

+    },
+    {
+      "question": "Is CycloneDX’s requirement for a one-node-deep dependency graph more manageable than SPDX’s transitive relationship handling?",
+      "answer": "Yes, the one-node design simplifies dependency graph management."


Answer appears to be incomplete.

prabhu · 2025-08-05T08:41:20Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

+  },
+  {
+    "question": "How do you resolve complex dependency conflicts in a large monorepo with multiple packages that have conflicting peer dependency requirements and circular dependencies while maintaining consistent versions across all services?",
+    "answer": "Use a monorepo tool like Lerna or Rush to manage versions, use 'resolutions' in package.json, and resolve circular dependencies through code refactoring."


Answer could have included information about native monorepo support in npm and pnpm as well. The semantics training data lacks pnpm-specific information.

prabhu · 2025-08-05T08:43:19Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

+  },
+  {
+    "question": "How do you configure a complex multi-module Maven project with custom plugins, profile-specific configurations, and cross-module dependencies that need to handle different deployment targets and environment-specific properties?",
+    "answer": "Use a parent 'pom.xml' to define common configuration, use 'profiles' for different environments, and configure plugins in the appropriate modules."


Answer could have elaborated on the parent pom.xml containing a modules section and each submodule's pom.xml referencing the parent.

https://www.baeldung.com/maven-multi-module

prabhu · 2025-08-05T08:43:41Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

+  },
+  {
+    "question": "How do you optimize a large Gradle build with hundreds of modules by implementing build caching, parallel execution, configuration avoidance, and custom task dependencies to reduce build times from hours to minutes?",
+    "answer": "Enable 'build-cache' and 'org.gradle.parallel=true', use 'configurationAvoidance', and define custom task dependencies to avoid unnecessary work."


@malice00 do you agree with this answer?

prabhu · 2025-08-05T08:45:00Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

+  },
+  {
+    "question": "How do you manage dependency resolution in a Python project with conflicting transitive dependencies across multiple third-party libraries while ensuring reproducible builds and handling platform-specific binary dependencies?",
+    "answer": "Use 'pip-tools' to manage requirements, pin versions in 'requirements.txt', and use the '--no-binary' flag for platform-specific dependencies."


What about uv which is present in the semantics data? pip-tools isn't covered in our training, so the answer must be coming from the base model.

prabhu · 2025-08-05T08:45:54Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

+  },
+  {
+    "question": "How do you troubleshoot complex Maven build failures that involve plugin compatibility issues, repository connectivity problems, dependency resolution conflicts, profile activation failures, and integration with external build systems?",
+    "answer": "Use 'mvn help:effective-pom' to see resolved configuration, check plugin versions, and verify repository settings."


Wow, I learnt something new today.

prabhu · 2025-08-05T08:47:34Z

contrib/xBOMEval/results/cdx1-pro/results-0804/devops.json

@@ -0,0 +1,1150 @@
+[


cdx1-pro demonstrates incredible knowledge about various package managers, their commands and configuration, troubleshooting, etc. Knowing the principles of package management is essential for any supply chain assurance programme.

prabhu · 2025-08-05T08:49:47Z

contrib/xBOMEval/results/cdx1-pro/results-0804/docker.json

@@ -0,0 +1,72 @@
+{


We are lacking some questions around docker rootless, docker context, authenticating with registry, and so on.

prabhu · 2025-08-05T08:51:11Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "What command would you use to display the last 10 lines of a file named 'application.log'?",
+      "answer": "Use the command 'tail application.log'."


Could have suggested the -n argument.

Technically, the answer is correct, since tail by default only outputs the last ten lines... Maybe you can change the question to a different number and see if it adds -n?

prabhu · 2025-08-05T08:52:01Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "What is the difference between 'cp' and 'mv' commands in Linux?",
+      "answer": "'cp' copies files or directories, while 'mv' moves or renames them."


What about mv being recursive by default, while cp requires the -r argument?

I wouldn't call mv recursive per se. It just looks that way because you are moving the whole dirtree.

prabhu · 2025-08-05T08:53:37Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "What command would you use to find all files with a '.txt' extension in the current directory and subdirectories?",
+      "answer": "Use the command 'find . -name \"*.txt\"'."


The answer unfortunately includes \" to keep the JSON valid, although such escaping is not required in bash. zsh and fish require escaping for glob characters, e.g. \*
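
The backslashes exist only at the JSON layer and disappear once the file is parsed. A minimal sketch, using Python's standard `json` module on a one-key object built from the answer quoted above (the variable names are illustrative):

```python
import json

# The answer as it appears in linux.json, wrapped in a one-key object.
# A raw string keeps the backslashes exactly as they are in the file.
raw = r'''{"answer": "Use the command 'find . -name \"*.txt\"'."}'''

parsed = json.loads(raw)

# After decoding, the command contains plain double quotes and no backslashes.
print(parsed["answer"])
```

A reviewer reading the decoded value therefore sees the command as it would be typed in a shell.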

prabhu · 2025-08-05T08:54:17Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "How can you redirect both stdout and stderr to the same file?",
+      "answer": "Use the command 'command > output.txt 2>&1' or 'command &> output.txt'."


Learnt a new tip!

prabhu · 2025-08-05T08:55:02Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "How do you use 'sed' to replace all occurrences of 'old' with 'new' in a file?",
+      "answer": "Use the command 'sed 's/old/new/g' filename'."


Answer could have included the fact that / could be replaced with other characters. Eg: |

prabhu · 2025-08-05T08:56:05Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "What command would you use to monitor system resources in real-time?",
+      "answer": "Use the command 'top' or 'htop'."


htop is not installed by default 🤔

prabhu · 2025-08-05T08:57:03Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

+    },
+    {
+      "question": "How can you use 'xargs' to process multiple files from a find command?",
+      "answer": "Use the command 'find . -name \"*.txt\" | xargs ls -l'."


@malice00 xargs is used correctly in the answer. Was this the question you were referring to on chat?

Yes. In the screenshot you sent, it was using the -exec parameter instead of building a command with xargs.

prabhu · 2025-08-05T08:59:43Z

contrib/xBOMEval/results/cdx1-pro/results-0804/linux.json

@@ -0,0 +1,1484 @@
+{


We are lacking questions related to OS package management, sysctl, /etc/os-release, and so on. Overall, cdx1-pro demonstrates incredible knowledge about Linux, terminal, PowerShell, etc.

prabhu · 2025-08-05T09:01:33Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the correct PURL type for Ruby gems where Stephanie claims 'pkg:rubygems' but Brian argues 'pkg:gem'?",
+    "answer": "'pkg:rubygems' is the correct PURL type for Ruby gems."


This is the wrong answer. pkg:gem is the correct type for RubyGems (a single word, not "Ruby gems").

https://github.com/package-url/purl-spec/blob/c53ba0e2e249939f41ea6de1fa1984e8d831ef68/types/gem-definition.json#L31

prabhu · 2025-08-05T09:02:39Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the PURL qualifier syntax for subpath according to Angela who states '?subpath=' but Mark argues '&subpath='?",
+    "answer": "'?subpath=' is the correct PURL qualifier syntax for subpath."


This is the wrong answer. The subpath must use a hash prefix: #subpath
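
To illustrate the difference, here is a rough sketch of splitting a PURL into its parts with plain string operations. It is not a spec-compliant parser (no percent-decoding, no validation, no namespace handling), and the function name `split_purl` and the example PURLs are made up for illustration.

```python
def split_purl(purl: str) -> dict:
    """Split a PURL into coarse parts; a simplified sketch, not a full parser."""
    scheme, sep, rest = purl.partition(":")
    if scheme != "pkg" or not sep:
        raise ValueError(f"not a purl: {purl!r}")

    # The subpath is everything after '#'; it is not a qualifier.
    rest, _, subpath = rest.partition("#")
    # Qualifiers are the key=value pairs after '?', separated by '&'.
    rest, _, query = rest.partition("?")
    qualifiers = dict(pair.split("=", 1) for pair in query.split("&")) if query else {}

    # The version follows the last '@', if there is one.
    path, at, version = rest.rpartition("@")
    if not at:
        path, version = rest, ""
    purl_type, _, name = path.partition("/")

    return {
        "type": purl_type,
        "name": name,  # still includes the namespace, if any
        "version": version,
        "qualifiers": qualifiers,
        "subpath": subpath,
    }


correct = split_purl("pkg:gem/example@1.0.0?platform=java#lib/example")
mistaken = split_purl("pkg:gem/example@1.0.0?subpath=lib/example")
print(correct["subpath"])      # lib/example
print(mistaken["qualifiers"])  # {'subpath': 'lib/example'}
```

With the `?subpath=` form, the value ends up as an ordinary qualifier and the subpath component stays empty.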

prabhu · 2025-08-05T09:04:48Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "In PURL specification, what is the correct authority separator when Lisa says '//' but Michael contends ':'?",
+    "answer": ':' is the correct authority separator in PURL."


Malformed JSON. The answer is nevertheless correct.
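
Entries like this can be caught before manual review by attempting to parse each one. A minimal sketch with Python's standard `json` module; the helper name `is_valid_json` is hypothetical, and the `repaired` string is only a guess at the intended form, based on how the other answers are quoted.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# The answer as emitted by the model: the opening double quote is missing.
malformed = '''{"answer": ':' is the correct authority separator in PURL."}'''
# Assumed intended form, with the opening double quote restored.
repaired = '''{"answer": "':' is the correct authority separator in PURL."}'''

print(is_valid_json(malformed))  # False
print(is_valid_json(repaired))   # True
```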

prabhu · 2025-08-05T09:05:58Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the proper PURL qualifier for download URL according to Gloria who says 'download_url' but Martin argues 'download'?",
+    "answer": "'download_url' is the correct PURL qualifier for download URL."


Both are acceptable since the spec doesn't have an opinion on the qualifiers? 🤔

Update: Turns out, I'm wrong.

prabhu · 2025-08-05T09:08:02Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "Which CycloneDX field represents component author when Catherine claims 'author' but Walter argues 'authors'?",
+    "answer": "'author' is the correct field name for component author in CycloneDX."


This is wrong. author is deprecated. authors must be used.

I guess you didn't specify the version to use, so the answer could be considered correct...

prabhu · 2025-08-05T09:11:08Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "Which CycloneDX vulnerability rating vector is correct when Gloria says 'vectorString' but Wayne claims 'cvssVector'?",
+    "answer": "'cvssVector' is the correct vulnerability rating vector in CycloneDX."


Incorrect answer. The correct answer is vector. https://github.com/CycloneDX/specification/blob/d2948509102647896eeedcddd76112b4d286acee/schema/bom-1.6.schema.json#L2587
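
A quick way to spot invented field names such as `vectorString` or `cvssVector` is to compare an object's keys against the schema's property list. The sketch below hard-codes the rating properties as recalled from the CycloneDX 1.6 JSON schema; treat that set as an assumption and verify it against the schema linked above. The helper name `unknown_rating_keys` is hypothetical.

```python
# Property names of a vulnerability rating object, as recalled from the
# CycloneDX 1.6 JSON schema (an assumption; verify against the schema).
KNOWN_RATING_KEYS = {"source", "score", "severity", "method", "vector", "justification"}


def unknown_rating_keys(rating: dict) -> list[str]:
    """Return the keys of `rating` that are not known rating properties."""
    return sorted(set(rating) - KNOWN_RATING_KEYS)


good = {"method": "CVSSv31", "score": 9.8, "severity": "critical",
        "vector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"}
bad = {"method": "CVSSv31", "score": 9.8,
       "cvssVector": "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"}

print(unknown_rating_keys(good))  # []
print(unknown_rating_keys(bad))   # ['cvssVector']
```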

prabhu · 2025-08-05T12:11:59Z

contrib/xBOMEval/results/qwen3-coder-480B/results-0805/spec.json

+},
+{
+"question": "Can you give an example of a custom lifecycle phase?",
+"answer": "An example of a custom lifecycle phase might be 'QA Testing' or 'Staging Deployment', defined specifically by an organization to capture software state information relevant to their internal development and deployment pipeline that isn't covered by the standard phases."


Answer is similar to gemini? 🤔

prabhu · 2025-08-05T12:14:53Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "Are CVSS scores supported in vulnerability ratings? Alice yes, Bob no",
+      "answer": "Alice is correct. CycloneDX supports multiple rating systems, including CVSSv2, CVSSv3, CVSSv3.1, and CVSSv4."


Great answer!

prabhu · 2025-08-05T12:16:09Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "For Homebrew, type “pkg:brew” or “pkg:homebrew”? Alice “pkg:brew”, Bob “pkg:homebrew”",
+      "answer": "Bob is correct. The registered PURL type is 'homebrew'."


Incorrect answer. It is brew.

prabhu · 2025-08-05T12:17:50Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "What PURL type should be used for Python packages? Alice says “pkg:python”, Bob “pkg:pypi”, Carol “pkg:conda”.",
+      "answer": "Bob and Carol are both correct as they name valid, registered PURL types. 'pypi' is for the Python Package Index and 'conda' is for the Conda package manager. Alice is incorrect as 'python' is not a PURL type."


Not accurate. Only Bob is correct, since conda must be used only with conda packages and not python packages. Don't know what others think.

prabhu · 2025-08-05T12:18:44Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the CycloneDX proper field for component external reference type when Jacqueline says 'type' but Gregory argues 'referenceType'?",
+    "answer": "'referenceType' is the correct field name for component external reference type in CycloneDX."


Incorrect answer.

prabhu · 2025-08-05T12:20:49Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "Which CycloneDX vulnerability source contact is correct when Gloria says 'contact' but Wayne claims 'contacts'?",
+    "answer": "'contact' is the correct vulnerability source contact in CycloneDX."


Incorrect answer.

prabhu · 2025-08-05T12:21:07Z

contrib/xBOMEval/results/qwen3-coder-480B/results-0805/logic.json

+},
+{
+"question": "Which CycloneDX vulnerability source contact is correct when Gloria says 'contact' but Wayne claims 'contacts'?",
+"answer": "contact"


Incorrect answer.

prabhu · 2025-08-05T12:24:04Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "What is the CycloneDX proper field for component evidence copyright when Teresa says 'copyright' but Sean argues 'copyrightEvidence'?",
+      "answer": "Teresa is correct. The field within 'evidence' is 'copyright'."


Incorrect answer. There is no field called copyright within evidence.

prabhu · 2025-08-05T12:24:45Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the CycloneDX proper field for component evidence copyright when Teresa says 'copyright' but Sean argues 'copyrightEvidence'?",
+    "answer": "'copyrightEvidence' is the correct field name for component evidence copyright in CycloneDX."


Incorrect answer.

prabhu · 2025-08-05T12:25:09Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "What is the CycloneDX proper field for component external reference comment when Jacqueline says 'comment' but Gregory argues 'referenceComment'?",
+    "answer": "'referenceComment' is the correct field name for component external reference comment in CycloneDX."


Incorrect answer.

prabhu · 2025-08-05T12:29:33Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "Which CycloneDX field represents component attestation when Catherine claims 'attestation' but Walter argues 'attested'?",
+      "answer": "Neither is correct. CycloneDX does not have a standard 'attestation' field. Attestations are typically separate documents that may be linked via 'externalReferences'."


Incorrect answer. There is an attestations attribute.

prabhu · 2025-08-05T12:30:30Z

contrib/xBOMEval/results/o4-mini-high/results-0805/logic.json

+    },
+    {
+      "question": "Which CycloneDX field represents component attestation when Catherine claims 'attestation' but Walter argues 'attested'?",
+      "answer": "attestation"


Incorrect answer. It is plural.

prabhu · 2025-08-05T12:30:42Z

contrib/xBOMEval/results/qwen3-coder-480B/results-0805/logic.json

+},
+{
+"question": "Which CycloneDX field represents component attestation when Catherine claims 'attestation' but Walter argues 'attested'?",
+"answer": "attestation"


Incorrect answer. It is plural.

prabhu · 2025-08-05T12:30:52Z

contrib/xBOMEval/results/cdx1-pro/results-0804/logic.json

+  },
+  {
+    "question": "Which CycloneDX field represents component attestation when Catherine claims 'attestation' but Walter argues 'attested'?",
+    "answer": "'attestation' is the correct field name for component attestation in CycloneDX."


Incorrect answer. It is plural.

prabhu · 2025-08-05T12:31:37Z

contrib/xBOMEval/results/gemini-2.5-pro/results-0805/logic.json

+    },
+    {
+      "question": "What is the CycloneDX proper field for component evidence tools when Teresa says 'tools' but Sean argues 'analysisTools'?",
+      "answer": "Neither is correct for component evidence. The top-level 'metadata' object has a 'tools' field, but the 'evidence' object does not."


Great answer!

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu · 2025-08-06T10:15:59Z

Thank you everyone for your valuable contributions and insights! This marks the completion of our first benchmark.

prabhu added 3 commits August 4, 2025 14:37

Added gguf IQ4 and Q2_K quants

fd84750

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Temp commit

a12e7f3

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Temp commit

5f1a2e6

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu added Ready for QA ml labels Aug 4, 2025

prabhu commented Aug 5, 2025

prabhu added 7 commits August 5, 2025 20:18

Added results for cdx1 mlx 8bit. Added sample answers.

d5bd2ed

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Automated logic category review

fe61694

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Automated review

cacd67a

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Automated review

6c3bad5

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Automated review

af51ae9

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Manual review

ae094a0

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Added deepseek results

e6f1867

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu changed the title ~~cdx1-pro 8-bit xBOMEval tests~~ cdx1 8-bit xBOMEval tests Aug 6, 2025

prabhu added 4 commits August 6, 2025 09:29

Added gpt-oss results

1b9eade

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Increase width for the chart

a48bdb2

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Increase width for the chart

df6b7bb

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

Updated report

0dea3d5

Signed-off-by: Prabhu Subramanian <prabhu@appthreat.com>

prabhu merged commit 7903d9f into master Aug 6, 2025
2 checks passed

prabhu deleted the feature/xbomeval-assess branch August 6, 2025 10:16
