Isn’t the biggest benefit of graph databases the indexing and additional query constructs they support, like shortest path finding and whatnot?
Anyway, why care how the data is stored? You need a catalog. You need an index. You need automation. It helps keep order, and it helps with inevitable changes and flips and pivots and whims and trends and moods and backups and restoration and snapshots and history and versioning and moon travels and collaboration and compatibility and long summer evening walks and portability.
I've started to think that maybe a fine-tuned model is needed, specifically for "journal data retrieval" or something like that. Is anyone aware of any existing models for things like this? I'd do it myself, but since I'm unwilling to send larger parts of my data to third parties, I'm struggling to collect actual data I could use for fine-tuning, ending up in a bit of a catch-22.
For some client projects I've experimented with the same idea too, with fewer restrictions, and I guess one valuable lesson is that letting LLMs write docs and add them to a "knowledge repository" tends to end up a mess. The best success we've had is limiting the LLMs' job to organizing and moving things around, never actually adding their own written text; quality seems to slowly degrade as their context fills up with their own output, compared to when they rely only on human-written notes.
Maybe that is why mind maps never spoke to me. I felt that a tree structure (or even planar graphs) was not enough to cover any sufficiently complex topic.
Symbolic links can form a graph, and you can process them as needed using readlink etc. to traverse the graph, but they'll still be considered broken if they form a cycle.
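For illustration, a minimal Python sketch of that kind of traversal: follow each link one hop at a time with readlink and track visited paths, so a cycle gets reported instead of open() eventually failing with ELOOP (the function and path names here are made up):

```python
import os

def follow_symlinks(path):
    """Walk a symlink chain edge by edge, stopping on a cycle."""
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # readlink() may return a relative target; resolve it
        # against the directory that contains the link itself.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        if path in seen:
            return path, True   # cycle: this is the "broken" case
        seen.add(path)
    return path, False

final, cycled = follow_symlinks("notes/current")
```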
It can only handle three-way cross references by using two folders and a file right now, and it's very verbose on disk (needs type=small, otherwise inodes run out before disk space)... but it's incredibly fast and practically unstoppable in read uptime!
Also, the simplicity of using text and the file system sort of guarantees longevity and stability, even if most people prefer the monolithic garbled mess that is relational databases' binary table formats...
1. Why does AI need that folder structure? Why not a flat list of files, letting the AI agent explore with BM25 / grep, etc.?
2. Pre-computed compression vs. computing at query time.
Karpathy (and you) are recommending pre-compressing and sorting the data into human-friendly buckets and language, based on hard-coded human opinions about how it might be queried.
Why not just let the AI calculate this at run time? Many of these use cases have very few files, and for a low-traffic knowledge store it probably costs fewer tokens if you only tokenize the files you need.
It doesn't. The human creating the files needs it, to make it easier to traverse in the future as the file count grows. At 52k files, that's a horrendous list to scroll through to find the thing you're looking for. Meanwhile, an AI can just `find . -type f -exec whatever {} \;` and process it however it needs. The human doesn't need to change the way they work to appease the magic rock in the box under the desk.
Why? The human would just talk to the AI agent. Why would they need to scroll through that many files?
I made a similar system with 232k files (one file might be a Slack message, a GitLab comment, etc.). It does a decent job at answering questions with only keyword search, but I think I can get better results with RAG + BM25.
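For the curious, a minimal sketch of the BM25 half of that, using the rank_bm25 package (the file layout and the naive whitespace tokenizer are assumptions, not the parent's actual setup):

```python
# pip install rank-bm25
from pathlib import Path
from rank_bm25 import BM25Okapi

# Hypothetical layout: one small text file per message/comment.
docs = [p.read_text(errors="ignore") for p in Path("store").rglob("*.txt")]
tokenized = [d.lower().split() for d in docs]  # naive whitespace tokenizer

bm25 = BM25Okapi(tokenized)
query = "gitlab pipeline failure".lower().split()

# Top 5 candidates; in a RAG setup these become the model's context,
# so only the files you actually need get tokenized.
for hit in bm25.get_top_n(query, docs, n=5):
    print(hit[:80])
```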
Just because AI exists doesn't mean we can neglect basic design principles.
If we throw everything out the window, why don't we just name every file as a hash of its content? Why bother with ASCII names at all?
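(For concreteness, that hash-naming scheme is exactly how content-addressed stores like git's object database work; a toy sketch, with made-up names:)

```python
import hashlib
from pathlib import Path

def store(blob: bytes, root: Path = Path("objects")) -> Path:
    # The filename is just the SHA-256 of the file's bytes, so
    # identical content always lands at the same path.
    root.mkdir(exist_ok=True)
    dest = root / hashlib.sha256(blob).hexdigest()
    dest.write_bytes(blob)
    return dest
```

Great for machines and dedup, useless for a human scanning a directory listing, which is rather the point.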
Fundamentally, it's the human that needs to maintain the system and fix it when it breaks, and that becomes significantly easier if it's designed in a way a human would interact with it. Take the AI away, and you still have a perfectly reasonable data store that a human can continue using.
Now I just need to find a good way to maintain the order...
Do you still have your prompt, by chance, and would you be willing to share it? I took a stab at this and it didn't want to make many changes. I think I need to be more specific, but I'm not sure how to do that in a general way.
In somewhat of an inversion, I've been getting the initial naming done by an LLM (well, I was, until Copilot imposed file upload limits and the new VPN blocked access to it). For want of that, I just name each scan by invoice ID, then use a .bat file, generated by concatenating columns in a spreadsheet, to rename them to the initial state ready for entry.
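That spreadsheet-driven rename is also a few lines of Python if you ever want to skip the .bat step; a sketch, assuming a two-column CSV (invoice_id and target_name are invented column names):

```python
import csv
from pathlib import Path

scans = Path("scans")
with open("renames.csv", newline="") as f:
    for row in csv.DictReader(f):
        src = scans / f"{row['invoice_id']}.pdf"  # scan named by invoice ID
        if src.exists():
            src.rename(scans / f"{row['target_name']}.pdf")
```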