Most important piece of information is in the linked Frontiers article:
However, the overall capability of the chatbot to fully meet user needs received a lower average score (3.1/5.0), highlighting the need for further improvements.
Also there is still the problem of hallucinations, as we see in the „Evaluation“ paragraph:
Live traffic evaluations are essential for monitoring system behavior, identifying potential issues like hallucinations in production, and understanding performance on diverse live queries.
These are quite devastating results. This is a system for scientific research on medicines, and mediocrity and hallucinations will kill people.
Would be interesting to know how much money was flushed down the toilet with these experts.
bob1029
The most important part is the database that the agent can see and how clean the data is. I pitched a custom enterprise agent to a client thinking it would be maybe 50/50 time on data vs agent tuning, but it's more like 99/1.
The alignment process goes very quickly once you have all the fish in exactly one barrel. I think pulling data dynamically from the source systems is where this turns into a game of whack-a-mole.
The problem with dynamic fetch is that you don't get any kind of persistent or compounding gains. There are queries that you simply cannot run because you'd chew through your GitHub, et al., API quotas. It takes over 48h to fully hydrate the database for GitHub items on my current project. But once that process is complete, I can query across things like issue comments and do crosscutting joins with the state of other vendor systems in milliseconds.
I am finding the MSSQL dialect to be quite agreeable to the OAI models. With absolutely no prompting they will bootstrap off information schema and extended description properties every single time. If you design the schema for your audience, the amount of "Jesus prompting" you will require is much better controlled.
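A minimal sketch of the "hydrate once, then join locally" idea, plus a schema summary a model could read before writing queries. It uses Python's built-in `sqlite3` purely as a stand-in so it runs without a server; the comment above is about MSSQL, where the information schema views and extended properties would play the role of `describe_schema`. All table names, column names, and rows are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the hydrated store (assumption: the real
# system is MSSQL; SQLite is used only so this sketch is runnable anywhere).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE github_issue_comments (
        issue_id INTEGER,
        author   TEXT,
        body     TEXT
    );
    CREATE TABLE vendor_tickets (
        ticket_id       INTEGER,
        github_issue_id INTEGER,
        status          TEXT
    );
    """
)

# Hypothetical rows, as if copied once from the source systems.
conn.executemany(
    "INSERT INTO github_issue_comments VALUES (?, ?, ?)",
    [(1, "alice", "Repro attached"), (1, "bob", "Fixed in main"), (2, "carol", "Still failing")],
)
conn.executemany(
    "INSERT INTO vendor_tickets VALUES (?, ?, ?)",
    [(100, 1, "closed"), (101, 2, "open")],
)


def describe_schema(connection):
    """Return one line per table: name plus column names and declared types."""
    lines = []
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = connection.execute(f"PRAGMA table_info({table})").fetchall()
        described = ", ".join(f"{col[1]} {col[2]}" for col in columns)
        lines.append(f"{table}({described})")
    return "\n".join(lines)


def open_ticket_comments(connection):
    """Cross-system join that needs no API calls once the data is local."""
    return connection.execute(
        """
        SELECT t.ticket_id, c.author, c.body
        FROM vendor_tickets AS t
        JOIN github_issue_comments AS c ON c.issue_id = t.github_issue_id
        WHERE t.status = 'open'
        ORDER BY t.ticket_id, c.author
        """
    ).fetchall()


print(describe_schema(conn))
print(open_ticket_comments(conn))
```

The point of the sketch is only that the join runs against local tables, so it costs no API quota, and that the schema text is cheap to hand to a model up front.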
smallnix
What was the main driver for a dynamic workflow with loops vs a rigid, forward-running-only workflow? The non-deterministic nature of these loops with LLM decision points doesn't mesh well with the transparency requirement imho
AJRF
A two-paragraph section on Evaluation after 30 paragraphs explaining the most bog-standard RAG system you've ever heard of.
Hmm...
stevex
You can almost tell the "era" that a solution was built in these days since things are changing so fast.
Mid-2026, we have very large context windows, and much smarter models than we did in 2024 when this was built. If I were to tackle this today I'd ask a current frontier model to work through the source data and design a hierarchy that would give it the ability to sift through the content itself by drilling down as it sees fit, and I expect it would nail that.
AJRF
Seeing this article and seeing the replies. Oof. Maybe Thoughtworks did some good work in traditional software engineering (not sure) - but why would you trust them to touch anything related to LLMs? They don't seem to know what they are doing.
hirako2000
I happen to have made another attempt at trusting an agentic tool, the latest "Continue" with a frontier model: 8M tokens burned in 30 minutes. App does not work.
agentdev001
I find papers/articles which discuss solutions that rely heavily on a model in the middle unreadable, if the models used are not discussed.
The data you need to get into context for a small model, vs a big boy frontier model, vs a fine-tuned open-weight big boy, are all very different. I can understand what they're doing here, and most of the 'why', but not all of the why.
Littice
The part about context discipline feels underrated.
Larger context windows don’t remove the need to decide what the model shouldn’t see.
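One way to picture "deciding what the model shouldn't see" is a selection step in front of the prompt. This is a rough sketch, not anything from the article: the `score` field, the `min_score` cutoff, and the word-count token estimate are all assumptions made up for illustration.

```python
def select_context(chunks, budget_tokens, min_score=0.5):
    """Keep the highest-scoring chunks that fit the budget; drop the rest.

    Each chunk is a dict with hypothetical keys "id", "score" and "text".
    Token cost is approximated by a whitespace word count (an assumption;
    a real system would use the model's tokenizer).
    """
    kept = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # everything after this is even less relevant
        cost = len(chunk["text"].split())
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget, try smaller ones
        kept.append(chunk)
        used += cost
    return kept


# Hypothetical retrieved chunks.
chunks = [
    {"id": "a", "score": 0.9, "text": "dosage table for compound X"},
    {"id": "b", "score": 0.8, "text": "long marketing preamble " * 10},
    {"id": "c", "score": 0.7, "text": "contraindications summary"},
    {"id": "d", "score": 0.2, "text": "unrelated footer text"},
]

selected = select_context(chunks, budget_tokens=10)
print([chunk["id"] for chunk in selected])
```

A bigger window only raises `budget_tokens`; the relevance cutoff still has to exist, which is the discipline part.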
ThePhysicist
I think for a mostly search-focused use case like the one presented here AI is great, as you don't ask it to build stuff or invent new drugs, you just want to retrieve relevant documents with laser precision, and agents can do that.
I think right now I'm mostly disappointed with agents writing code as they always degrade the quality of the codebase after a while, and the same goes for writing in general which just requires a ton of editing and mostly just sounds good but doesn't have a lot of substance in the end. I think you can really tell that these systems are trained to just produce plausible streams of text, especially in longer artefacts you notice that locally the inner consistency of what they produce is great but globally it really falls apart, it's like seeing the limits of their "intelligence".
For search, however, I really like AI. It has improved information retrieval so much for me: before, I had to think about which keywords to use and combine and which filters to apply, whereas describing what I'm looking for in plain text and then having the AI find it for me feels magical. Recently I wanted to find an artist that I heard in some old episode of the KEXP runcast (a running podcast), and I didn't remember anything except that it was rap with a kind of monotone voice, a fast beat and a strong accent. Google's agent asked a few clarifying questions and after a few rounds it found the artist for me, Genesis Owusu. That's why I think Google will win in the AI-assisted search market: they just have the best integration between fast and reasonably "smart" agents and high-quality search data. Claude or ChatGPT are too slow and don't have fast enough data retrieval, it seems; using them for search feels quite sluggish in comparison.
altmanaltman
> The author used AI assistance during the writing of this article. AI tools were used for brainstorming ideas, creating outlines, and reviewing drafts to polish language and improve clarity.
The first sentence makes it seem like they just used it to improve sentence structure etc., but the second line makes it seem like they used it for 90% of the work. Which one is true?
ai_slop_hater
> Sarang Kulkarni is a Principal Consultant at Thoughtworks
> teaches an O’Reilly course on building production-ready RAG applications
isn't this basically saying that you are a scammer? or am I paranoid?
yieldcrv
The funniest part of these systems is that I build these massive prompt concatenating controllers with a schema to constrain what the LLM sees and parses, usually a frontier model like Gemini
The model gets it wrong on occasion and I check the input file with Claude/Opus and it just laughs at how simple it is to get the document right
And in the back of my mind I’m thinking why am I not just sending the file through Opus
padolsey
These vast multi-agentic systems with roles like 'Researcher', 'Writer' (with a review loop), 'Reflection agent', seem to ~feel~ mostly right but lack evals as to the merit of agent decomposition. So it forms a satisfying enough flowchart but I see no evidence these authors actually tried other approaches or agent roles. And let's be honest: an agent is just a system prompt and output contracts, and these rich architectures seem to be pontificating beyond their worth. It all feels a bit vibe-y.
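To make the "system prompt plus output contract" claim concrete, here is a minimal sketch under stated assumptions: the `Agent` class, its field names, and the stub model function are all hypothetical, and the contract check is just "valid JSON containing the required keys". No real LLM API is called.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Agent:
    """An 'agent' reduced to its two parts: a prompt and an output contract."""

    name: str
    system_prompt: str
    required_keys: frozenset

    def run(self, model: Callable[[str, str], str], user_input: str) -> dict:
        # `model` is any function (system_prompt, user_input) -> raw text.
        raw = model(self.system_prompt, user_input)
        data = json.loads(raw)  # contract part 1: output must be JSON
        missing = self.required_keys - set(data)
        if missing:  # contract part 2: required keys must be present
            raise ValueError(f"{self.name} output missing keys: {sorted(missing)}")
        return data


def stub_model(system_prompt: str, user_input: str) -> str:
    """Stand-in for a real model call (assumption: returns JSON text)."""
    return json.dumps({"answer": "stub", "sources": ["doc-1"]})


researcher = Agent(
    name="Researcher",
    system_prompt="Answer using only the provided documents. Reply as JSON.",
    required_keys=frozenset({"answer", "sources"}),
)

result = researcher.run(stub_model, "What is known about compound X?")
print(result)
```

Seen this way, swapping 'Researcher' for 'Writer' is a change of two fields, which is why an eval comparing decompositions matters more than the flowchart.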
marsven_422
You cannot
ai_slop_hater
Why is the comment from padolsey dead? Seriously, something fishy is going on on this website.