[Sandbox] HolmesGPT #392

@aantn

Application contact email(s)

natan@robusta.dev,arik@robusta.dev

Trademark and accounts

  • If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Contributing or sponsoring entity contact email(s)

natan@robusta.dev

Project summary

HolmesGPT is an AI agent that automates cloud-native troubleshooting, bridging knowledge gaps by investigating alerts, executing runbooks, and correlating observability data in cloud-native platforms.

Project description

Troubleshooting cloud-native systems is inherently complex. A simple HTTP error might stem from an application bug, a misconfigured Kubernetes manifest, DNS resolution issues, or a downstream timeout - each living in a different layer of the stack. Engineers must piece together metrics, logs, and traces from multiple tools and run ad-hoc commands such as kubectl to gather additional context. On top of that, the knowledge needed to do so is spread across multiple teams.

While observability data is easily collected and indexed by tools like Prometheus, Loki, and Tempo, interpreting it still requires deep expertise. This knowledge gap is especially acute among newer teams: the 2024 CNCF survey found that 51% of moderately experienced cloud-native practitioners cited lack of training as a top challenge, a larger share than cited technical issues like networking or storage.

HolmesGPT helps close this gap by taking operational knowledge in the form of natural-language runbooks and executing them using large language models. The platform provides extensible integrations with open source and proprietary troubleshooting tools through native toolsets and external MCP (Model Context Protocol) servers. Users can customize the analysis by adding application knowledge and stack expertise, transforming HolmesGPT into a virtual SRE tailored to their environment.
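As a rough illustration of pairing an alert with a natural-language runbook, the sketch below assembles the text an LLM-based agent might receive. It is not HolmesGPT's implementation or API: the `RUNBOOKS` store, the `build_prompt` helper, and the prompt layout are all hypothetical. The only assumption taken from an existing convention is that the alert carries a `labels` mapping with an `alertname` key, as Prometheus Alertmanager payloads do.

```python
from textwrap import dedent

# Hypothetical runbook store: plain-text instructions keyed by alert name.
RUNBOOKS = {
    "KubePodCrashLooping": dedent("""\
        1. Fetch the pod's recent logs and last termination reason.
        2. Check whether a rollout happened shortly before the alert.
        3. Compare memory usage against the container's limits.
    """),
}


def build_prompt(alert: dict, runbooks: dict) -> str:
    """Combine an alert's labels with its matching runbook into one prompt."""
    labels = alert.get("labels", {})
    name = labels.get("alertname", "unknown")
    # Fall back to a generic instruction when no runbook matches.
    runbook = runbooks.get(name, "No runbook found; investigate from first principles.\n")
    label_lines = "\n".join(f"- {key}: {value}" for key, value in sorted(labels.items()))
    return (
        f"Investigate alert: {name}\n"
        f"Labels:\n{label_lines}\n"
        f"Runbook:\n{runbook}"
    )


# Made-up alert for illustration only.
alert = {"labels": {"alertname": "KubePodCrashLooping", "namespace": "shop", "pod": "cart-7d9f"}}
prompt = build_prompt(alert, RUNBOOKS)
print(prompt)
```

In a real agent, the model would then decide which tools to call for each runbook step; that part is deliberately left out here.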

CNCF end users can deploy HolmesGPT to improve troubleshooting across their engineering teams, using existing monitoring data with no additional instrumentation required.

Org repo URL (provide if all repos under the org are in scope of the application)

N/A

Project repo URL in scope of application

https://github.com/robusta-dev/holmesgpt/

Additional repos in scope of the application

N/A

Website URL

https://github.com/robusta-dev/holmesgpt/

Roadmap

HolmesGPT Roadmap

Roadmap context

This is a joint roadmap with inputs from all organizations that maintain HolmesGPT (currently Robusta and Microsoft).

Contributing guide

https://github.com/robusta-dev/holmesgpt/blob/master/GOVERNANCE.md

Code of Conduct (CoC)

https://github.com/robusta-dev/holmesgpt/blob/master/CODE_OF_CONDUCT.md

Adopters

https://github.com/robusta-dev/holmesgpt/blob/master/ADOPTERS.md

Maintainers file

https://github.com/robusta-dev/holmesgpt/blob/master/MAINTAINERS.md

Security policy file

https://github.com/robusta-dev/holmesgpt/blob/master/SECURITY.md

IP policy

  • If the project is accepted, I agree the project will follow the CNCF IP Policy

Will the project require a license exception?

N/A

Standard or specification?

N/A

Why CNCF?

HolmesGPT was built to help members of the cloud native community troubleshoot their existing cloud-native stack faster.

Being part of the CNCF will drive the future success of HolmesGPT by deepening the connection with other CNCF projects, and driving collaboration and community development of more cloud-native integrations.

Finally, several partners and early adopters have expressed interest in contributing to HolmesGPT, but only if it’s governed under a neutral, vendor-independent foundation like the CNCF. Contributing the project to the CNCF will make collaboration easier and encourage broader adoption.

Benefit to the landscape

HolmesGPT uses AI to correlate cloud-native data dispersed across various tools and layers. Most end users operate with a combination of open source projects - Prometheus, Loki, and others - and commercial observability platforms - such as Datadog, Dynatrace, New Relic, and Splunk. Further context often resides outside these systems in ITSM tools like ServiceNow or Jira, chat platforms, and documentation.

For example, an engineer troubleshooting a sudden network failure might find no clear indicators in metrics, logs, or traces. The root cause turns out to be a firewall rule change documented in a ServiceNow ticket - completely invisible to observability tools. HolmesGPT helps surface this kind of context automatically.
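The firewall example comes down to lining up an incident with changes recorded elsewhere. A minimal sketch of that idea, assuming change records have already been fetched from a ticketing system and normalized into dictionaries with a `closed_at` timestamp, could look like this. The record shape, the `changes_before` helper, the two-hour window, and the sample data are all hypothetical and are not taken from HolmesGPT or ServiceNow.

```python
from datetime import datetime, timedelta, timezone


def changes_before(incident_start, changes, window=timedelta(hours=2)):
    """Return change records closed within `window` before the incident, newest first."""
    earliest = incident_start - window
    hits = [c for c in changes if earliest <= c["closed_at"] <= incident_start]
    return sorted(hits, key=lambda c: c["closed_at"], reverse=True)


# Made-up incident time and change records for illustration only.
incident_start = datetime(2025, 1, 10, 14, 30, tzinfo=timezone.utc)
changes = [
    {"id": "CHG-1", "summary": "Rotate TLS certificates",
     "closed_at": datetime(2025, 1, 9, 9, 0, tzinfo=timezone.utc)},
    {"id": "CHG-2", "summary": "Tighten egress firewall rules",
     "closed_at": datetime(2025, 1, 10, 14, 5, tzinfo=timezone.utc)},
    {"id": "CHG-3", "summary": "Resize node pool",
     "closed_at": datetime(2025, 1, 10, 15, 0, tzinfo=timezone.utc)},
]

suspects = changes_before(incident_start, changes)
for change in suspects:
    print(change["id"], change["summary"])
```

Time proximity only narrows the candidates; deciding whether a change actually caused the failure still needs the kind of reasoning described above.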

By contributing HolmesGPT to the CNCF, we’re extending the landscape in a direction that aligns with where cloud-native operations are headed: intelligent and AI-driven.

Cloud native 'fit'

HolmesGPT is built specifically for cloud-native environments: it operates natively in Kubernetes, consumes telemetry from CNCF observability tools, and was built for end users running cloud-native applications.

It also exemplifies several cloud-native principles:

  • Kubernetes-native: HolmesGPT can be deployed via Helm as a stateless service on Kubernetes
  • Composable and loosely coupled: HolmesGPT integrates with a wide variety of tools inside and outside the CNCF landscape, via built-in integrations and support for MCP

Cloud native 'integration'

Runtime: HolmesGPT can run on Kubernetes

Data-sources: HolmesGPT can pull data from Prometheus, Loki, Tempo, ArgoCD, Helm, and more.

Scope (project goals): HolmesGPT helps users troubleshoot cloud-native environments, including Kubernetes

Cloud native overlap

HolmesGPT overlaps conceptually with two CNCF projects that use large language models, but with differences in scope and execution:

  1. K8sGPT
    Overlap: Both HolmesGPT and K8sGPT aim to assist with Kubernetes troubleshooting using AI.

    Difference: K8sGPT focuses on interpreting Kubernetes resource status and surfacing potential misconfigurations, whereas HolmesGPT is broader - it can query a wider variety of data sources, start investigations from plaintext (like K8sGPT) or from structured data such as Prometheus alerts, and connect with external IT tools such as ServiceNow.

  2. Kagent
    Overlap: Both HolmesGPT and Kagent aim to enable intelligent agents in Kubernetes environments using LLMs.

    Difference: Kagent is focused on providing a general-purpose framework for building, hosting, and orchestrating agentic workflows that run in Kubernetes and interact with cloud-native tools. It includes example agents with some overlap with HolmesGPT. By contrast, HolmesGPT is an opinionated, production-ready agent focused on root-cause analysis of cloud-native issues. This leads to a distinct feature set and roadmap. For example, HolmesGPT’s HTTP API can render rich output formats with Prometheus graphs embedded in them, provided the client knows how to parse them. There are also plans to incorporate traditional machine learning and AIOps algorithms to reduce log volume, so that all logs from the cluster in the past hour can fit into a large language model’s context window for anomaly detection. HolmesGPT could use Kagent as a runtime or execution framework in the future, but the project’s focus will always be a specialized solution with built-in integrations and operational logic rather than an open-ended agent framework.
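To make the log-volume idea concrete, here is one simple way to shrink logs before handing them to a model: mask the variable parts of each line and count how often each resulting template occurs. This is a generic template-masking sketch used as a stand-in, not the algorithm HolmesGPT plans to use; the regular expressions, the helper names, and the sample log lines are assumptions made for illustration.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),  # IPv4 addresses
    (re.compile(r"\b[0-9a-f]{8,}\b"), "<hex>"),            # long hex IDs
    (re.compile(r"\d+"), "<num>"),                         # any remaining digits
]


def template_of(line: str) -> str:
    """Replace variable tokens so similar lines collapse to one template."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def summarize(lines):
    """Return (template, count) pairs, most frequent first."""
    counts = Counter(template_of(l.strip()) for l in lines if l.strip())
    return counts.most_common()


# Made-up log lines for illustration only.
logs = [
    "GET /cart/1842 200 in 12ms",
    "GET /cart/77 200 in 9ms",
    "upstream 10.0.3.17 timed out after 5000ms",
    "upstream 10.0.3.22 timed out after 5000ms",
    "GET /cart/5 200 in 31ms",
]

summary = summarize(logs)
for template, count in summary:
    print(count, template)
```

Five lines collapse into two templates here; on real cluster logs the ratio depends heavily on how repetitive the workload is.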

Similar projects

Kagent
K8sGPT

Landscape

No

Business Product or Service to Project separation

HolmesGPT started as an upstream open source project maintained by Robusta.dev that also functions as the backend for Robusta’s AI-powered root cause analysis features. While Robusta.dev was the first to build on top of HolmesGPT, it is not the only one - other vendors are already adopting and integrating HolmesGPT into their own solutions.

We’ve already taken concrete steps to separate HolmesGPT from other Robusta offerings. The open source project is production-ready, self-contained, and designed to be fully usable without any dependency on commercial services.

As part of this process, we are actively working to separate the HolmesGPT documentation from other Robusta projects. Today, the documentation is hosted alongside our broader offering, but we have started moving it to standalone docs to better reflect HolmesGPT’s independent status.

We are also in the process of updating the existing README and other materials to reflect HolmesGPT’s standalone identity.

Project "Domain Technical Review"

We plan to engage soon and will update here.

CNCF contacts

No response

Additional information

No response
