I just wrote a tool for reducing logs for LLM analysis (https://github.com/ascii766164696D/log-mcp).
Lots of logs contain uninteresting information, which easily pollutes the context. My approach uses a TF-IDF classifier plus a BERT model on GPU to further classify log lines and reduce the number that then need to be fed to an LLM. The total size of the models is 50 MB, and the classifier is written in Rust, so it classifies at >1M lines/sec. It also finds interesting cases that simple grepping can miss.
I trained it on ~90GB of logs and provide scripts to retrain the models (https://github.com/ascii766164696D/log-mcp/tree/main/scripts).
It's meant to be used with the Claude Code CLI, so it can use these tools instead of trying to read the log files directly.
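The two-stage idea described above (a cheap classifier filters most lines; only survivors or borderline cases would go to the heavier BERT stage) can be sketched in miniature. This pure-Python TF-IDF scorer is a toy stand-in, not the actual Rust implementation; the corpus, labels, and threshold are invented:

```python
import math
from collections import Counter

def tokens(line):
    return line.lower().split()

def build_idf(corpus):
    """Inverse document frequency over a small training corpus."""
    n = len(corpus)
    df = Counter(t for line in corpus for t in set(tokens(line)))
    return {t: math.log(n / df[t]) for t in df}

corpus = [
    "info request completed",
    "info health check ok",
    "error connection reset by peer",
    "fatal out of memory",
]
idf = build_idf(corpus)
# Terms seen in the lines labelled "interesting" (the last two above).
interesting_terms = {t for line in corpus[2:] for t in tokens(line)}

def score(line):
    """Sum IDF weights of tokens shared with known-interesting lines."""
    return sum(idf.get(t, 0.0) for t in set(tokens(line)) if t in interesting_terms)

def triage(lines, threshold=0.5):
    """Stage 1: keep only lines that look interesting enough for an LLM."""
    return [l for l in lines if score(l) > threshold]
```

In the real tool, lines scoring near the threshold would presumably be escalated to the second-stage model rather than dropped outright.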
sollewitt
But does it work? I've used LLMs for log analysis and they have been prone to hallucinating causes: depending on the logs, the distance between cause and effect can be larger than the context; usually we're dealing with multiple failures at once when things go badly wrong; and plenty of benign issues throw scary-sounding errors.
PaulHoule
My first take is that you could have 10 TB of logs with just a few unique lines that are actually interesting. So I am not thinking "Wow, what impressive big data you have there" but rather "if you have an accuracy of 1-10^-6 you are still overwhelmed with false positives" or "I hope your daddy is paying for your tokens"
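The arithmetic behind this worry is easy to check. Assuming (my numbers, not the commenter's) an average line of 100 bytes, 10 TB is roughly 10^11 lines, and an error rate of 10^-6 still leaves about 100,000 false positives:

```python
# Back-of-envelope: false positives at "big data" scale.
TB = 10 ** 12                    # bytes
total_bytes = 10 * TB
avg_line_bytes = 100             # assumed average log line length
lines = total_bytes // avg_line_bytes
fp_rate = 1e-6                   # i.e. an accuracy of 1 - 10^-6
false_positives = lines * fp_rate
print(f"{lines:.0e} lines -> ~{round(false_positives):,} false positives")
```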
gabeh
SQL has always been my favorite "loaded gun" API. If you have a control plane of RLS + role-based auth and you've got a data dictionary, it is trivial to get to a data-explorer chat interaction with an LLM doing the heavy lifting.
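One cheap way to keep the gun unloaded is to make the connection itself read-only instead of trying to vet the generated SQL. SQLite's query_only pragma stands in here for the Postgres RLS + role setup the comment describes; the table and queries are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ci_logs (id INTEGER, line TEXT)")
conn.execute("INSERT INTO ci_logs VALUES (1, 'ERROR build failed')")
conn.execute("PRAGMA query_only = ON")   # from here on, reads only

# An LLM-generated SELECT works fine...
rows = conn.execute(
    "SELECT line FROM ci_logs WHERE line LIKE '%ERROR%'"
).fetchall()

# ...but a hallucinated or hostile write is rejected by the engine.
try:
    conn.execute("DROP TABLE ci_logs")
    blocked = ""
except sqlite3.OperationalError as exc:
    blocked = str(exc)
```

Enforcing this at the database level (read-only role, RLS policies) is sturdier than any prompt-side "please don't write" instruction.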
TKAB
That post reads like fully LLM-generated. It's basically boasting a list of numbers that are supposed to sound impressive. If there's a coherent story, it's well hidden.
Yizahi
We have an ongoing effort to parse logs for our autotests to speed up debugging. It is very hard to do, mainly because there is a metric ton of false positives, or plain old noise, even in the info logs. Tracing the culprit can also be tricky, since an error in container A can be caused by an actual failure in container B, which may in turn depend on something else entirely, including hardware problems.
Basically, any surefire way to train an LLM to parse logs and detect real issues depends almost entirely on the readability and precision of the logging. And if the logging is good enough, then humans can debug faster and more reliably too :) . Unfortunately, the people reading logs and the people writing them almost never intersect in practice, and so the issue remains.
verdverm
This is one of those HN posts you share internally in the hopes you can work this into your sprint
_boffin_
Excited to go through this!
the_arun
The article doesn't mention which LLM was used or the total cost. If they used ChatGPT or the like, the token cost alone should be very expensive, right?
p0w3n3d
That's contrary to my experience. Logs contain a lot of noise and unnecessary information, especially in Java, so it's best to prepare them before feeding them to an LLM. Not to mention the wasted tokens...
sathish316
SQL is the best exploratory interface for LLMs. But most of the observability data we have today, like metrics, logs, and traces, is hidden behind layers of semantics and custom syntax that are hard for an agent to translate from an explore or debug intent into the actual query language.
Large-scale data like metrics, logs, and traces is optimised for particular storage and access patterns, and OLAP/SQL systems may not be the optimal way to store or retrieve it. This is one of the reasons I've been working on a Text2SQL / Intent2SQL engine for observability data, to let an agent explore the schema, semantics, and syntax of any metrics or logs data. It is open sourced as the Codd Text2SQL engine - https://github.com/sathish316/codd_query_engine/
It is far from done; it currently works for Prometheus, Loki, and Splunk in a few scenarios, and it is open to OSS contributions. You can find it in action, used by Claude Code to debug with metrics and logs queries:
Metric analyzer and Log analyzer skills for Claude Code - https://github.com/sathish316/precogs_sre_oncall_skills/tree...
"Logs" is doing some heavy lifting here. There's a very non-trivial step in deciding that a particular subset and schema of log messages deserves to be in its own columnar data table. It's a big optimization decision that adds complexity to your logging stack. For a narrow SaaS product that is probably a no-brainer.
I would like to see this approach compared to a more minimal one with, say, VictoriaLogs, where the LLM is taught to use LogsQL; overall that's a more "out of the box" architecture.
dbreunig
Check out “Recursive Language Models”, or RLMs.
I believe this method works well because it turns a long-context problem (hard for LLMs) into a coding and reasoning problem (much better!). You're leveraging the last 18 months of coding RL by changing your scaffold.
esafak
Forgive me if this is tangential to the debate, but I am trying to understand Mendral's value proposition. Is it that you save users time in setting up observability for CI? Otherwise, could you not simply use gh to fetch the logs, use their observability system's API or MCP, and cross-check both against the code? Or is there a machine-learning system that analyzes these inputs beyond merely retrieving context for the LLM? Good luck!
iririririr
Am I reading correctly that the compression is just relational records? I.e., omit the PR title and just point to it?
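If that reading is right, the "compression" is ordinary database normalisation: store the PR title once and have every log row reference it by id. A toy sketch with an invented schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE prs (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE ci_runs (pr_id INTEGER, status TEXT)")

# The title is stored exactly once...
db.execute("INSERT INTO prs VALUES (42, 'Fix flaky network test')")
# ...while each run row just points at it by id.
db.executemany("INSERT INTO ci_runs VALUES (?, ?)",
               [(42, "failed"), (42, "passed")])

# The full text is recovered on demand with a join.
rows = db.execute("""
    SELECT prs.title, ci_runs.status
    FROM ci_runs JOIN prs ON prs.id = ci_runs.pr_id
    ORDER BY ci_runs.rowid
""").fetchall()
```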
tehjoker
Interesting article, but there's no rate of investigation success quoted. The engineering is interesting, but it's hard to know whether there was any point without some measure of its usefulness.
truth_seeker
If even just the top 250 npm packages were refactored through an AI coding agent from a security, performance, and user-friendly-API point of view, the whole JS ecosystem would be in a different shape.
The same applies to other language communities, of course.
whoami4041
"LLMs are good at SQL" is quite the assertion. My experience with LLM-generated SQL in OLTP and OLAP platforms has been a mixed bag. IMO, analytics/SQL will always be a space that needs significant human input and judgement, probably because of the critical business decisions that can be made from the insights.
kikki
Unrelated; what does "mendral" mean? It's a very... unmemorable word
yellow_lead
Why the editorialization of the title? "LLMs Are Good at SQL. We Gave Ours Terabytes of CI Logs."
Title tells us nothing: what's the tl;dr?