Your File System Is Already A Graph Database

148 points · 67 comments · 3 days ago
kenforthewin

I keep harping on this, but the question is not "can you use your filesystem as a graph database" - of course you can - but whether this performs better or worse than a vector database approach, especially at scale.

The premise of Atomic, the knowledge base project I'm currently working on, is that there is still significant value in vectors, even in an agentic context. https://github.com/kenforthewin/atomic

kesor

So you have some folders with markdown files ... which are insanely hard to query without a tool ... impossible to traverse via their relationships ... and you call that a graph database? WHAT?!

Clicked the link expecting to see some tool or method that actually allows graph-like queries and traversals on files in a file system, but all I found was some rant about someone on the internet being wrong.

Waste of time.

embedding-shape

I've been playing around with the same thing, but trying to use local models, as my Obsidian vault obviously contains a bunch of private things I'm not willing to share with for-profit companies. But I have yet to find a small model that comes close to working as well as codex or cc, even with 96GB of VRAM to play around with.

I've started to think a fine-tuned model might be needed, specifically for "journal data retrieval" or something like that. Is anyone aware of any existing models for things like this? I'd do it myself, but since I'm unwilling to send larger parts of my data to 3rd parties, I'm struggling to collect actual data I could use for fine-tuning, ending up in a bit of a catch-22.

For some client projects I've experimented with the same idea too, with fewer restrictions, and I guess one valuable lesson is that letting LLMs write docs and add them to a "knowledge repository" tends to end up with a mess. The best success we've had is limiting the LLM's job to organizing and moving things around, never letting it add its own written text; quality seems to slowly degrade as the context fills up with the model's own text, compared to when it only relies on human-written notes.

stingraycharles

Using the same logic, a key/value database is also a graph database?

Isn’t the biggest benefit of graph databases the indexing and additional query constructs they support, like shortest path finding and whatnot?

SoftTalker

I will always be in awe of people who can remain diligent doing this level of journaling/personal information management.

I've got scraps of paper and legal pads and post-it notes and just throw them away after they've been sitting around for a while and I forget what they are about.

Jayakumark

Interesting approach, but how do you download Google Docs, XLS files, Slack threads, etc., and how are they saved in Obsidian? Are they all converted to markdown before saving, or summarized to extract key topics and then saved? What about images?

alxndr

> […] the knowledge base isn’t just for research. It’s a context engineering system. You’re building the exact input your LLM needs to do useful work.

> […] there’s a real difference between prompting “help me write a design doc for a rate limiting service” and prompting an LLM that has access to your project folder with six months of meeting notes, three prior design docs, the Slack thread where the team debated the approach, and your notes on the existing architecture.

bullen

Yep, my distributed JSON over HTTP database uses the ext4 binary tree for indexing: http://root.rupy.se

It can only handle 3-way multiple cross references by using 2 folders and a file (meta) for now, and it's very verbose on the disk (needs type=small, otherwise inodes run out before disk space)... but it's incredibly fast and practically unstoppable in read uptime!

Also, the simplicity of using text and the file system sort of guarantees longevity and stability, even if most people prefer the monolithic garbled mess that is relational databases' binary table formats...

itmitica

I can see over-engineering when I look at it. And premature optimization.

Anyway, why care how the data is stored? You need a catalog. You need an index. You need automation. It helps keep order and helps with inevitable changes and flips and pivots and whims and trends and moods and backups and restoration and snapshots and history and versioning and moon travels and collaboration and compatibility and long summer evening walks and portability.

zadikian

On the other hand, I get why cloud drive users completely disregard file structure and search everything. Two files usually don't have the same name unless you're laying it out programmatically like this. I use dir trees for code ofc, but everything else is flat in my ~/Documents.

Deep inside a project dir, it feels like some of the ease of LLMs is just not having to cd into the correct directory, but you shouldn't need an LLM for that. I'm gonna try setting up some aliases like "auto cd to wherever foo/main.py is" and see how that goes.

aleksiy123

I’ve been thinking about this in a couple of contexts, and this is pretty much how I’ve come to think about it.

Folders give you hierarchical categories.

You still want tags for horizontal grouping. And links and references for precise edges.

But that gives you a really nice foundation that should get you pretty damn far.

I also now tell the LLM to add a summary as the first section if the file is longer.
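Those three edge types can be read straight off the files. A rough sketch, assuming inline `#tags` and `[[wikilinks]]` in the notes (the `describe` helper is hypothetical, not from any tool mentioned here):

```python
import re
from pathlib import Path

TAG = re.compile(r"(?<!\S)#([\w/-]+)")   # inline #tags, not mid-word '#'
LINK = re.compile(r"\[\[([^\]|#]+)")     # targets of [[wikilinks]]

def describe(note: Path, vault: Path) -> dict:
    """Pull out the three relation types for one note:
    folders = hierarchical categories, tags = horizontal grouping,
    links = precise edges to other notes."""
    text = note.read_text(encoding="utf-8")
    return {
        "hierarchy": note.relative_to(vault).parent.parts,
        "tags": sorted(set(TAG.findall(text))),
        "links": sorted({m.strip() for m in LINK.findall(text)}),
    }
```

Run over a whole vault, the `hierarchy`/`tags`/`links` triples are exactly the node-and-edge list a graph layer (or an agent) would work from.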

mzelling

If I understand this right, the difference between the author's suggested approach and simply chatting with an AI agent over your files is hyperlinks: if your files contain links to other relevant files, the agent has an easier time identifying relevant material.

estetlinus

I am more curious about the note taking. How do you ingest data here? Export from Slack via LLMs? Store it in GitHub?

My “knowledge” is spread out across various SaaS tools (Google, Slack, Linear, Notion, etc.). I don’t see how I can centralize it without a lot of manual labour.

appsoftware

I created AS Notes (https://www.asnotes.io), an extension for VS Code, Antigravity, etc., partly because of this use case. It works like Obsidian: markdown-based, with wikilinks, mermaid rendering, and task management. In VS Code we have access to really good agent harnesses and can navigate our notes and documents in a file-system-like manner. Further, using AGENTS.md, idea files, etc., we can instruct the agent how to interact, how to add to our notes, and so on. I've found working with my notes like this really useful; provided I trim anything AI-generated that isn't going to be useful, it's an investment in the information I've gathered, since it's retained in markdown rather than getting lost in multiple chatbot UIs.

stared

A filesystem is a tree: a particular, constrained graph. Advanced topics usually require a lot of interconnections.

Maybe that is why mind maps never spoke to me. I felt that a tree structure (or even planar graphs) was not enough to cover any sufficiently complex topic.

itake

I'm wondering, though:

1. Why does AI need that folder structure? Why not a flat list of files, letting the AI agent explore with BM25, grep, etc.?

2. pre-compute compression vs compute at query time.

Karpathy (and you) are recommending pre-compressing and sorting the data into human-friendly buckets and language, based on hard-coded human opinions about how it might be queried.

Why not just let the AI calculate this at run time? Many of these use cases have very few files, and for a low-traffic knowledge store it probably costs fewer tokens if you only tokenize the files you need.
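The "compute at query time" option in the first question is cheap to try. A toy sketch that scores a flat list of documents with plain BM25 (word-regex tokenization; a real setup would use a library or a prebuilt index, and the function name here is made up):

```python
import math
import re
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the standard BM25 formula."""
    def tokenize(s: str) -> list[str]:
        return re.findall(r"\w+", s.lower())

    toks = [tokenize(d) for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N  # average document length
    df = Counter()                         # document frequency per term
    for t in toks:
        df.update(set(t))

    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in tokenize(query):
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

For a handful of markdown files this runs in well under a millisecond per query, which is the token-cost argument in miniature: no pre-built structure, just compute at retrieval time.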

WillAdams

I've found a similar structure, along with a naming convention, useful at my day job. The big thing is that the names are such that when a name is copied as a filepath, the path and extension deleted, and the underscores replaced by tabs, the text can then be pasted into a spreadsheet and summed up or otherwise manipulated.

In somewhat of an inversion, I've been getting the initial naming done by an LLM (well, I was, until CoPilot imposed file upload limits and the new VPN blocked access to it). For want of that, I just name each scan by invoice ID, then use a .bat file, made by concatenating columns in a spreadsheet, to rename them to the initial state ready for entry.
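The underscore-to-tab trick above fits in a few lines. A sketch, assuming names shaped like `2024-03-01_ACME_1234.56.pdf` (an invented example; the helper name is made up too):

```python
from pathlib import Path

def row_from_path(filepath: str) -> str:
    """Drop the directory and the extension, then turn underscores into tabs
    so the remaining fields paste into separate spreadsheet columns."""
    return Path(filepath).stem.replace("_", "\t")
```

Note that `Path.stem` strips only the final suffix, so a dot inside a field (like an amount) survives intact.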

game_the0ry

There's for sure a "second brain" product hiding in plain sight for one of the frontier AI companies. Google/Gemini should be all over this right now.

exossho

I can't remember how many file structures I've already tried... LLMs seem to be a great help here. I also used CC to organize my messy hard drive.

Now just need to find a good way to maintain the order...

bhewes

Wow, just strings in files. Are you jumping node to node via pointers, index-free?

themafia

Sure. It just fails to be atomic. Which is a property I really like.