SCM as a database for the code

hallh

We've tackled this problem slightly differently where I work. We have AI agents contribute to a large legacy codebase, and without proper guidance, the agents quickly get lost or reimplement existing functionality.

To help the agents understand the codebase, we indexed our code into a graph database using an AST, allowing the agent to easily find linked pages, features, databases, tests, etc from any one point in the code, which helped it produce much more accurate plans with less human intervention and guidance. This is combined with semantic search, where we've indexed the code based on our application's terminology, so when an agent is asked to investigate a task or bug for a specific feature, it'll find the place in the code that implements that feature, and can navigate the graph of dependencies from there to get the big picture.
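A minimal sketch of the AST-to-graph idea, assuming Python's standard `ast` module stands in for whatever language-specific indexer is actually used. The sample source and the names `build_call_graph` and `callers_of` are hypothetical and only illustrate how a "blast radius" query can fall out of a dependency graph; a real setup would persist the edges in a graph database rather than a dict.

```python
import ast
from collections import defaultdict, deque

# Hypothetical sample code to index.
SOURCE = '''
def load_user(user_id):
    return query_db(user_id)

def query_db(key):
    return {"id": key}

def render_profile(user_id):
    user = load_user(user_id)
    return format_user(user)

def format_user(user):
    return str(user)
'''

def build_call_graph(source):
    """Map each top-level function name to the set of plain names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            graph[node.name]  # make sure functions without calls still appear
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    graph[node.name].add(child.func.id)
    return graph

def callers_of(graph, target):
    """Return every function that calls target, directly or transitively."""
    reverse = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            reverse[callee].add(caller)
    seen, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for caller in reverse[current]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

graph = build_call_graph(SOURCE)
print(sorted(callers_of(graph, "query_db")))  # ['load_user', 'render_profile']
```

This only follows direct calls by name; method calls, imports, and links to pages, tables, or tests would need extra edge types.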

We provide these tools to the coding agents via MCP and it has worked really well for us. Devs and QAs can find the blast radius of bugs and critical changes very quickly, and first drafts of AI-generated plans need far less feedback and correction on larger changes.

In our case, I doubt that a general purpose AST would work as well. It might be better than a simple grep, especially for indirect dependencies or relationships. But IMO, it'd be far more interesting to start looking at application frameworks or even programming languages that provide this direct traversability out of the box. I remember when reading about Wasp[0] that I thought it would be interesting to see it go this way, and provide tooling specifically for AI agents.

[0] https://wasp.sh/

panstromek

I think I agree (though I'm probably thinking about this one level higher). I wrote about this a while ago in https://yoyo-code.com/programming-breakthroughs-we-need/#edi... .

One interesting thing I got in the replies was the Unison language (content-addressed functions, where a function is identified by its AST). I also recommend checking out the Dion language demo (an experimental project that stores the program as an AST).
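To make "content-addressed functions" concrete, here is a rough sketch in Python. It is only loosely inspired by the idea and is not Unison's actual hashing scheme: it assumes the identity of a function can be approximated by hashing `ast.dump` of its definition with the function's own name blanked out. The helper name `content_hash` is hypothetical.

```python
import ast
import hashlib

def content_hash(source):
    """Hash a single function definition by its AST, ignoring name and layout."""
    tree = ast.parse(source)
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.FunctionDef):
        raise ValueError("expected exactly one function definition")
    func = tree.body[0]
    # Assumption: the function's own name is not part of its identity.
    # Recursive calls would still mention the old name; this sketch ignores that.
    func.name = "_"
    # ast.dump leaves out line/column attributes by default, and comments
    # never reach the AST, so formatting differences disappear here.
    dumped = ast.dump(func)
    return hashlib.sha256(dumped.encode("utf-8")).hexdigest()

a = content_hash("def add(a, b):\n    return a + b\n")
b = content_hash("def plus(a,b):  # same body, different layout\n    return a+b\n")
c = content_hash("def add(a, b):\n    return a - b\n")
print(a == b, a == c)  # True False
```

Parameter and local variable names still affect the hash in this sketch; a fuller scheme would normalize those too.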

In general I think there's a missing piece between text and storage. Structural editing is likely a dead end, writing text seems superior, but storage format as text is just fundamentally problematic.

I think we need a good bridge that allows editing via text but stores the result like a structured database (I'd go as far as to say a relational database, maybe). This would unlock a lot of IDE-like features for simple programmatic use, or allow manipulating language semantics in interesting ways, but the challenge is of course how to keep the mapping between the textual input and the stored structure in shape.
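As a rough illustration of "text in, relational storage out", the sketch below parses Python source with the standard `ast` module and stores one row per node in an in-memory SQLite table. The `nodes` schema and the helpers `store_ast` and `names_used_in` are assumptions made up for this example, not an existing tool, and it sidesteps the hard part mentioned above (keeping text and stored structure in sync across edits).

```python
import ast
import sqlite3

# Hypothetical schema: one row per AST node, linked to its parent.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES nodes(id),
    kind TEXT NOT NULL,
    name TEXT,
    lineno INTEGER
)
"""

def store_ast(conn, source):
    """Parse source text and insert every AST node as a row."""
    conn.execute(SCHEMA)

    def visit(node, parent_id):
        # Definitions carry .name, identifier references carry .id.
        name = getattr(node, "name", None) or getattr(node, "id", None)
        cursor = conn.execute(
            "INSERT INTO nodes (parent_id, kind, name, lineno) VALUES (?, ?, ?, ?)",
            (parent_id, type(node).__name__, name, getattr(node, "lineno", None)),
        )
        node_id = cursor.lastrowid
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(ast.parse(source), None)

def names_used_in(conn, function_name):
    """List identifiers referenced anywhere inside the named function."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM nodes WHERE kind = 'FunctionDef' AND name = ?
            UNION ALL
            SELECT nodes.id FROM nodes JOIN subtree ON nodes.parent_id = subtree.id
        )
        SELECT DISTINCT nodes.name
        FROM nodes JOIN subtree ON nodes.id = subtree.id
        WHERE nodes.kind = 'Name'
        ORDER BY nodes.name
        """,
        (function_name,),
    ).fetchall()
    return [row[0] for row in rows]

SAMPLE = '''
def area(width, height):
    return scale(width * height)

def scale(value):
    return value * FACTOR
'''

conn = sqlite3.connect(":memory:")
store_ast(conn, SAMPLE)
print(names_used_in(conn, "area"))  # ['height', 'scale', 'width']
```

Once the tree is in tables, "find every reference inside this function" becomes an ordinary SQL query instead of a text search.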

moezd

Git works universally as a storage backend, with some reasoning capacity thanks to commit history. It didn't need to include semantics about the code or build a tree of knowledge. That would be against the Unix philosophy: do one thing and do it well.

You can build whatever you want on top to help your AI agents. That would actually be beneficial, so that we stop feeding raw text to this insane machinery for once.

charcircuit

>I definitely reject the "git compatible" approach

If your version control system is not compatible with GitHub it will be dead on arrival. The value of allowing people to gradually adopt a new solution cannot be overstated. There is also value in being compatible with existing git integrations or scripts in projects' build systems.

PunchyHamster

No we don't.

And you can build nearly any VCS of your dreams while still using Git as the storage backend, as it is a database of linked snapshots plus metadata. Bonus benefit: it will work with existing tooling.
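A small sketch of why layering on top of Git is workable: Git addresses file contents by hash, so any derived data (an AST index, a semantic search entry) can be keyed by the same id without touching the storage format. `git_blob_id` follows Git's blob hashing for the default SHA-1 object format (repositories created with the SHA-256 format would differ); the `DerivedIndex` class is a hypothetical cache invented for this example.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

class DerivedIndex:
    """Hypothetical cache of per-file analysis results, keyed by blob id."""

    def __init__(self, analyze):
        self._analyze = analyze  # any function from bytes to a result
        self._cache = {}
        self.misses = 0

    def lookup(self, data: bytes):
        key = git_blob_id(data)
        if key not in self._cache:
            # Unchanged content hashes to the same id, so it is analyzed once.
            self.misses += 1
            self._cache[key] = self._analyze(data)
        return self._cache[key]

# Stand-in "analysis": count lines. A real layer might store an AST here.
index = DerivedIndex(lambda data: len(data.splitlines()))
index.lookup(b"a = 1\nb = 2\n")
index.lookup(b"a = 1\nb = 2\n")
print(index.misses)  # 1
```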

The whole article is "I don't know how git works, let's make something from scratch"

nylonstrung

Trustfall seems really promising for querying files as if they were a db

https://github.com/obi1kenobi/trustfall

gfody

I've had this idea too, and every time I'm on a PR with lots of whitespace or other non-functional noise, I think about how nice it would be if source code weren't just text and I could be looking at a cleaner, higher-level diff instead. I think you have to go higher than the AST though; it should at least be language-aware.

DannyBee

Somebody call the VisualAge and ClearCase folks and let them know their time has come again!

whazor

A big challenge will be the unfamiliarity of such a new system. Many people have found that coding agents work really well with terminals, unix tooling, and file systems. It's proven tech.

Whereas doing DB queries to navigate code would be quite unfamiliar.

benrutter

> The monorepo problem: git has difficulty dividing the codebase into modules and joining them back

Can anyone explain this one? I use monorepos every day, and although tools like pre-commit can get a bit messy, I've never found git itself to be the issue.

procaryote

One fundamental deal-breaking problem with structure aware version control is that your vcs now needs to know all the versions of all the languages you're writing in. It gets non-trivial fast.

stared

To me, git works. And LLMs understand that, unlike some yet-to-come-tool.

If you create a new tool for version control, go for it. Then see how it fares in benchmarks (for end-to-end tools) or vox populi: whether people use your new tool/skill/workflow.

mtsolitary

I’ve recently been thinking about this too, here is my idea: https://clintonboys.com/projects/lit/

solarized

Talk is cheap. Show me the code.

deafpolygon

No, we don’t. The genius of git is in its simplicity (while also being pretty damn complicated too).

sublinear

Missing 4 out of 5 parts. The attempt to wrest control away from the open source community doesn't even have the effort to keep going.

forrestthewoods

Definitely agree that Git is a mediocre-at-best VCS tool. This has always been the case. But LLMs are finally forcing the issue. It’s a shame a whole generation of programmers has only used Git/GitHub and thinks it’s good.

Monorepo and large binary file support is TheWay. A good cross-platform virtual file system (VFS) is necessary; a good open source one doesn’t exist today.

Ideally it comes with a copy-on-write system for cross-repo blob caching. But I suppose that’s optional. It lets you commit toolchains for open source projects which is a dream of mine.

Not sure I agree that LSP like features need to be built in. That feels wrong. That’s just a layer on top.

Do think that agent prompts/plans/summaries need to be a first class part of commits/merges. Not sure the full set of features required here.
