Every six months or so, someone at work does a hackathon project to automate the outage analysis work an SRE would otherwise perform. And every one of them I've seen has been underwhelming and wrong.
There's like three reasons for this disconnect.
1. The agents aren't experts in your proprietary code. They can read logs and traces and make educated guesses, but there's no world model of your code in there.
2. The people building these apps are unqualified to review the output. I used to mock narcissists evaluating ChatGPT quality by asking it for their own biography, but they're at least using a domain they are an expert in. Your average MLE has no profound truths about Kubernetes or the app. At best, they're using some toy "known broken" app to demonstrate under what are basically ideal conditions, but part of the holdout set should be new outages in your app.
3. SREs themselves are not so great at causal analysis. Many junior SREs take the "it worked last time" approach, but this embeds a presumption that whatever went wrong "last time" hasn't been fixed in code. Your typical senior SRE takes a "what changed?" approach, which is depressingly effective (as it indicates most outages are caused by coworkers). At the highest echelons, I've seen research papers examining metastability and Granger causality networks, but I'm pretty sure nobody in SRE or these RCA agents can explain what they mean.
> The key insight: individual session failures look random. But when you cluster the hypotheses, failure patterns emerge.
My own insight is mostly Bayesian. Typical applications have redundancy of some kind, and you can extract useful signals by separating "good" from "bad". A simple Bayesian score of (100+bad)/(100+good) does a relatively good job of removing the "oh that error log always happens" signals. There's also likely a path using ClickHouse-level data and Bayesian causal networks, but the problem is that traditional Bayesian networks are hand-crafted by humans.
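To make the score concrete, here's a rough sketch of how I'd read it. The assumptions are mine: "bad" and "good" are counts of a given log signature on unhealthy vs. healthy replicas over the same window, the two pools are about the same size, and the function names and sample log lines are made up for illustration.

```python
from collections import Counter

PRIOR = 100  # smoothing constant from the (100+bad)/(100+good) formula


def smoothed_score(bad: int, good: int, prior: int = PRIOR) -> float:
    """Ratio (prior + bad) / (prior + good); values near 1.0 carry little signal."""
    return (prior + bad) / (prior + good)


def rank_signals(bad_logs, good_logs, prior=PRIOR):
    """Rank log signatures by how much more often they show up on bad replicas."""
    bad_counts = Counter(bad_logs)
    good_counts = Counter(good_logs)
    # Counter returns 0 for signatures seen on only one side.
    signatures = set(bad_counts) | set(good_counts)
    scores = {
        sig: smoothed_score(bad_counts[sig], good_counts[sig], prior)
        for sig in signatures
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical data: one error that "always happens", one mostly on bad replicas.
bad_logs = ["cache miss"] * 500 + ["db connection reset"] * 300
good_logs = ["cache miss"] * 480 + ["db connection reset"] * 2

ranking = rank_signals(bad_logs, good_logs)
for signature, score in ranking:
    print(f"{score:.2f}  {signature}")
```

The always-present error lands close to 1.0 and drops to the bottom, while the signature concentrated on bad replicas floats to the top. The +100 also keeps a signature seen only a handful of times from dominating the ranking.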
So yeah, you can ask an LLM for 100 guesses and do some kind of k-means clustering on them, but you can probably do a better job by doing dimensional analysis first and passing that on to the agent.
yanovskishai
I imagine it's hard to create a very generic tool for this use case - what are the supported frameworks/libs, and what does this tool assume about my implementation?
RoiTabach
This looks Amazing
Do you have a LiteLLM integration?
dwb
> The key insight
I'm so tired
BlueHotDog2
nice. what a crazy space.
how is this different vs other telemetry/analysis platforms such as langchain/braintrust etc?
halflife
Kelet as in קלט as in input?
hadifrt20
In the quickstart, the suggested fixes are called "Prompt Patches". Does that mean Kelet only surfaces root causes that are fixable in the prompt? What happens when the real bug is in tool selection or retrieval ranking, for example?
peter_parker
> They just quietly give wrong answers.
It's not only about wrong answers. Sometimes they just get stuck in a loop.