We've tackled this problem slightly differently where I work. We have AI agents contributing to a large legacy codebase, and without proper guidance the agents quickly get lost or reimplement existing functionality.
To help the agents understand the codebase, we indexed our code into a graph database using the AST, so an agent can easily find linked pages, features, databases, tests, etc. from any one point in the code. That helped it produce much more accurate plans with far less human intervention and guidance. We combined this with semantic search: the code is indexed against our application's terminology, so when an agent is asked to investigate a task or bug for a specific feature, it finds the place in the code that implements that feature and can navigate the graph of dependencies from there to get the big picture.
We provide these tools to the coding agents via MCP, and it has worked really well for us. Devs and QAs can find the blast radius of bugs and critical changes very quickly, and the first drafts of AI-generated plans need much less feedback and correction, even for larger changes.
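To make the indexing idea a bit more concrete, here's a minimal sketch of the AST-to-graph step, using Python's stdlib `ast` and `networkx` purely for illustration (our real setup uses a dedicated graph database and links far more than call sites; the `src` and `billing` names below are made up):

```python
# Minimal sketch: index Python sources into a containment/call graph.
# Assumes the stdlib `ast` module and the `networkx` package; a real
# indexer would resolve calls properly and store the result in a graph DB.
import ast
from pathlib import Path

import networkx as nx

def index_repo(root: str) -> nx.DiGraph:
    graph = nx.DiGraph()
    for path in Path(root).rglob("*.py"):
        module = path.stem
        graph.add_node(module, kind="module", path=str(path))
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                graph.add_edge(module, f"{module}.{node.name}", kind="contains")
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                # Unresolved call edge; good enough to sketch "blast radius" queries.
                graph.add_edge(module, node.func.id, kind="calls")
    return graph

# "Blast radius" of a node = everything reachable from it in the graph.
graph = index_repo("src")
if "billing" in graph:
    print(sorted(nx.descendants(graph, "billing")))
```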
In our case, I doubt a general-purpose AST index would work as well. It might be better than a simple grep, especially for indirect dependencies or relationships, but IMO it'd be far more interesting to start looking at application frameworks or even programming languages that provide this direct traversability out of the box. When I read about Wasp[0], I thought it would be interesting to see it go in this direction and provide tooling specifically for AI agents.
[0] https://wasp.sh/
I think I agree (though I think about this maybe one level higher). I wrote about this a while ago in https://yoyo-code.com/programming-breakthroughs-we-need/#edi...
One interesting thing I got in the replies is the Unison language (content-addressed functions; a function is defined by its AST). I also recommend checking out the Dion language demo (an experimental project that stores the program as an AST).
In general I think there's a missing piece between text and storage. Structural editing is likely a dead end and writing text seems superior, but text as the storage format is just fundamentally problematic.
I think we need a good bridge that allows editing via text but stores code like a structured database (I'd go as far as to say a relational database, maybe). This would unlock a lot of IDE-like features for simple programmatic use, or let you manipulate language semantics in some interesting ways; the challenge, of course, is how to keep the mapping back to the textual input in shape.
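One very rough shape that bridge could take, sketched with Python's stdlib `ast` and `sqlite3` just for illustration: edits come in as text, but storage holds one normalized row per definition, so formatting noise never reaches storage. The hard part mentioned above, keeping the mapping to the textual input stable, is exactly what this toy version skips.

```python
# Rough sketch of a text <-> structured-storage bridge: each top-level
# function is stored as a normalized row, round-tripped through the AST
# so formatting noise never reaches storage. Stdlib only (ast, sqlite3;
# ast.unparse needs Python 3.9+). Illustrative, not a real design.
import ast
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE defs (name TEXT PRIMARY KEY, source TEXT)")

def store(edited_text: str) -> None:
    """Parse edited text and upsert each top-level def as a normalized row."""
    for node in ast.parse(edited_text).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            db.execute("INSERT OR REPLACE INTO defs VALUES (?, ?)",
                       (node.name, ast.unparse(node)))

def render(name: str) -> str:
    """Materialize one definition back into editable text."""
    (source,) = db.execute("SELECT source FROM defs WHERE name = ?", (name,)).fetchone()
    return source

store("def  add( a,b ):\n    return a+b\n")
print(render("add"))  # "def add(a, b): return a + b" -- formatting normalized away
```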
moezd
Git works universally as a storage backend, with some reasoning capacity thanks to the commit history. It didn't need to include semantics about the code or build a tree of knowledge; that would be against the Unix philosophy: do one thing and do it well.
You can build whatever you want on top of it to help your AI agents. That would actually be beneficial, so that we stop feeding raw text to this insane machinery for once.
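As a small example of "build it on top", here's a sketch (assuming only the stock `git` CLI) that turns commit history into structured records an agent could query instead of a raw text dump; the field names and the `src/billing` path are illustrative only:

```python
# Sketch of "build it on top": plain git history turned into structured
# records an agent can consume. Assumes the `git` CLI on PATH.
import subprocess

def commit_records(repo: str, path: str | None = None) -> list[dict]:
    fmt = "%H%x1f%an%x1f%ad%x1f%s"  # hash, author, date, subject, 0x1f-separated
    cmd = ["git", "-C", repo, "log", f"--pretty=format:{fmt}", "--date=iso"]
    if path:
        cmd += ["--", path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    keys = ("hash", "author", "date", "subject")
    return [dict(zip(keys, line.split("\x1f"))) for line in out.splitlines() if line]

# Hand the agent the recent history of one module instead of a raw dump.
for rec in commit_records(".", "src/billing")[:5]:
    print(rec["hash"][:8], rec["subject"])
```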
charcircuit
> I definitely reject the "git compatible" approach
If your version control system is not compatible with GitHub, it will be dead on arrival. The value of letting people gradually adopt a new solution cannot be overstated. There is also value in being compatible with existing git integrations and with scripts in projects' build systems.
PunchyHamster
No, we don't.
And you can build nearly any VCS of your dreams while still using Git as the storage backend, since it is a database of linked snapshots plus metadata. Bonus benefit: it will work with existing tooling.
The whole article is "I don't know how git works, let's make something from scratch"
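For anyone who hasn't poked at that database directly, here's a rough sketch of the plumbing layer a custom VCS front end could target. It only assumes the stock `git` CLI and an existing repository; the file name and branch name are made up:

```python
# Blob -> tree -> commit -> ref, using only git's plumbing commands.
# This is the "database of linked snapshots + metadata" a new front end
# could build on. Run inside an existing repo.
import subprocess

def git(*args: str, stdin: str | None = None) -> str:
    return subprocess.run(["git", *args], input=stdin, capture_output=True,
                          text=True, check=True).stdout.strip()

blob = git("hash-object", "-w", "--stdin", stdin="hello snapshot\n")       # content -> blob
tree = git("mktree", stdin=f"100644 blob {blob}\thello.txt\n")             # blob -> tree (snapshot)
commit = git("commit-tree", tree, "-m", "snapshot written via plumbing")   # tree + metadata -> commit
git("update-ref", "refs/heads/plumbing-demo", commit)                      # give the snapshot a name
print(commit)
```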
nylonstrung
Trustfall seems really promising for querying files as if they were a DB
https://github.com/obi1kenobi/trustfall
I've had this idea too, and I think about it every time I'm on a PR full of whitespace/non-functional noise: how nice it would be if source code weren't just text and I could be looking at a cleaner, higher-level diff instead. I think you have to go higher than the AST though; it should at least be language-aware.
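As a toy illustration of why structure beats raw text for diffs, here's a sketch using Python's stdlib `ast`: two versions that differ only in formatting compare equal once normalized through the tree. A real tool would, as said above, need to be language-aware and diff the structures themselves:

```python
# Toy structural diff: two versions that differ only in whitespace and
# formatting compare equal once normalized through the AST. Stdlib `ast`
# only (ast.unparse needs 3.9+); a real tool would diff the trees.
import ast

before = "def total(items):\n    return sum( [i.price  for i in items] )\n"
after  = "def total(items):\n    return sum([i.price for i in items])\n"

def normalized(src: str) -> str:
    return ast.unparse(ast.parse(src))

print(normalized(before) == normalized(after))  # True: formatting-only change
```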
DannyBee
Somebody call the VisualAge and ClearCase folks and let them know their time has come again!
whazor
A big challenge will be the unfamiliarity of such a new system. Many people have found that coding agents work really well with terminals, Unix tooling, and file systems. It's proven tech.
Whereas doing DB queries to navigate code would be quite unfamiliar.
benrutter
> The monorepo problem: git has difficulty dividing the codebase into modules and joining them back
Can anyone explain this one? I use monorepos every day, and although tools like pre-commit can get a bit messy, I've never found git itself to be the issue.
procaryote
One fundamental, deal-breaking problem with structure-aware version control is that your VCS now needs to know all the versions of all the languages you're writing in. It gets non-trivial fast.
stared
To me, git works. And LLMs understand it, unlike some yet-to-come tool.
If you create a new tool for version control, go for it. Then see how it fares in benchmarks (for end-to-end tools) or by vox populi: whether people actually use your new tool/skill/workflow.
No, we don’t. The genius of git is in its simplicity (while also being pretty damn complicated too).
sublinear
Missing 4 out of 5 parts. The attempt to wrest control away from the open source community doesn't even have the effort to keep going.
forrestthewoods
Definitely agree that Git is a mediocre-at-best VCS tool. This has always been the case, but LLMs are finally forcing the issue. It's a shame that a whole generation of programmers has only used Git/GitHub and thinks it's good.
Monorepo and large binary file support is the way. A good cross-platform virtual file system (VFS) is necessary; a good open source one doesn't exist today.
Ideally it comes with a copy-on-write system for cross-repo blob caching, but I suppose that's optional. It would let you commit toolchains for open source projects, which is a dream of mine.
Not sure I agree that LSP-like features need to be built in. That feels wrong; that's just a layer on top.
I do think that agent prompts/plans/summaries need to be a first-class part of commits/merges. Not sure of the full set of features required here.
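As a rough approximation of that last point with what git offers today, agent plans/summaries could ride along as notes on a commit. The sketch below assumes only the stock `git` CLI; the notes ref name `agent-plans` and the plan text are made up, and it still falls short of truly first-class, mergeable metadata:

```python
# Approximating "agent plans as part of the commit" with git notes:
# the plan rides alongside the commit without rewriting history.
import subprocess

def attach_plan(commit: str, plan: str) -> None:
    subprocess.run(["git", "notes", "--ref", "agent-plans", "add", "-f", "-m", plan, commit],
                   check=True)

def read_plan(commit: str) -> str:
    out = subprocess.run(["git", "notes", "--ref", "agent-plans", "show", commit],
                         capture_output=True, text=True, check=True)
    return out.stdout

attach_plan("HEAD", "Plan: extract billing retries into a queue; two-step migration.")
print(read_plan("HEAD"))
```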
I've recently been thinking about this too; here is my idea: https://clintonboys.com/projects/lit/
Talk is cheap. Show me the code.