Is this ready to go? Would love to see a Quickstart usage example somewhere.
I have a task which would potentially hugely benefit from this. I have 10 million 500-element vectors and need to find their nearest pair (could settle for an approximate match). One-time job, can crank away for weeks if required. The SQLite options seem like they will not be able to handle the load, so I was going to explore the Postgres vector extensions.
Napkin math says this would require a stupidly huge number of comparisons, but there is so much buzz about vector databases that I wanted to see if the indexes had some magic which could make this feasible without terabytes of RAM.
What have you tried so far? As a strong baseline, storing 1e7 512-bit vectors in memory should take about 640 MB of RAM. A binary Hamming distance between a query and a gallery is [popcnt(query xor elt) for elt in gallery], which should be plenty fast implemented with vectorized SSE instructions. I’d be shocked if this took more than 10ms per query per thread on modern hardware.
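A minimal sketch of that baseline in NumPy (the gallery size and the planted match at index 42 are illustrative; a real run would pack the actual bit vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gallery: 10,000 random 512-bit vectors, each packed into 64 uint8 bytes.
# At this packing, 10 million vectors would occupy ~640 MB.
gallery = rng.integers(0, 256, size=(10_000, 64), dtype=np.uint8)
query = gallery[42].copy()  # plant an exact match so the answer is known

# popcount(query XOR elt): XOR the packed bytes, unpack to bits,
# then sum the set bits per row to get Hamming distances.
xor = np.bitwise_xor(gallery, query)
dists = np.unpackbits(xor, axis=1).sum(axis=1)

best = int(dists.argmin())  # index of the nearest gallery vector
```

`unpackbits` trades some memory for portability; on recent NumPy, `np.bitwise_count` does the popcount without unpacking.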
These are not bit vectors, but float64. I could probably convert to float32 to cut down the RAM. This is a one-time job, so the data won't change.
My initial attempt has been using the scikit-learn cosine similarity function[0], which lets you take a query matrix against a target one. The actual calculation is pretty fast and seems quite optimized (at least, I can see all of my cores max out when it does the calculation). Then I iterate across the result matrix, sorting for the best matches, and record the best pair seen to date.
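That loop can be sketched in plain NumPy (after L2 normalization, cosine similarity is just a matrix product, so scikit-learn isn't strictly needed). The 2,000-vector gallery and chunk size here are toy stand-ins for the 10 million:

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2_000, 500)).astype(np.float32)  # stand-in data

# Cosine similarity reduces to a dot product once rows are L2-normalized.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

best_sim, best_pair = -1.0, (0, 0)
chunk = 500  # tune so each (chunk x N) block fits comfortably in RAM
for start in range(0, len(unit), chunk):
    block = unit[start:start + chunk]
    sims = block @ unit.T  # (chunk, N) similarity matrix
    # Mask self-similarity and pairs already covered by earlier rows,
    # so each unordered pair (i, j) with i < j is scored exactly once.
    for r in range(sims.shape[0]):
        sims[r, : start + r + 1] = -np.inf
    r, c = np.unravel_index(np.argmax(sims), sims.shape)
    if sims[r, c] > best_sim:
        best_sim, best_pair = float(sims[r, c]), (start + int(r), int(c))
```

Only the running best is kept, so memory stays at one chunk's similarity block rather than the full n² matrix.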
You are spot on, 10ms internal time for 100000 molecules, 60ms with the "stack on top": https://www.chemeo.com/similar?smiles=CCCCO
It uses a manually coded implementation with popcnt, most likely not that efficient.
It works right now, but I'm actively adding a lot of additional things that might make your life easier. The roadmap on the readme shows what I'm working on adding. Feel free to shoot me an email at carson at poole.ai and I'd be happy to give some guidance, but a quickstart is definitely at the top of my priorities also. :)
You don’t need a DB, I would avoid that for a one time job (I’ve used pgvector a lot).
Since your data fits in memory (about 20 GB at FP32: 10M × 500 dims × 4 bytes), I would start with a simple Python script that does naive exhaustive search, which is O(n^2).
You can do approximate search which will sacrifice some accuracy, but you’ll have to build an index first.
HNSW index is state of the art right now and will give you accurate and fast approximate vector search, O(logᵏn). But the tradeoff is it can take a significant amount of time to build the graph data structure, which may not be favorable given the one off nature of your situation.
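To make the index-then-prune idea concrete, here's a toy sketch using random-hyperplane LSH rather than HNSW (HNSW itself is better served by a library like hnswlib or FAISS; every size and name here is illustrative). Vectors that hash to the same bucket are candidate near-pairs, so you only compare within buckets instead of all n² pairs:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(2)
vecs = rng.standard_normal((5_000, 500)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Random-hyperplane LSH: hash each vector to the sign pattern of 16
# random projections; similar vectors tend to land in the same bucket.
planes = rng.standard_normal((500, 16)).astype(np.float32)
codes = vecs @ planes > 0            # (5000, 16) sign bits
keys = np.packbits(codes, axis=1)    # 16 bits -> 2 bytes per vector

buckets = defaultdict(list)
for i, key in enumerate(map(bytes, keys)):
    buckets[key].append(i)

# Exhaustive comparison only within buckets: far fewer than n^2 pairs.
candidate_pairs = sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())
```

The accuracy/speed trade-off shows up directly: more hash bits means fewer candidates but a higher chance the true nearest pair lands in different buckets (mitigated in practice by using several hash tables).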
I would be sure to use a vectorized (SIMD) similarity search implementation, and multithreading to split the work among all CPU cores on a beefy machine.
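A sketch of the multithreaded split, assuming NumPy backed by a multithreaded BLAS (the matmuls release the GIL, so a thread pool scales across cores; the 4,000-vector gallery and slice size are toy values):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(3)
unit = rng.standard_normal((4_000, 500)).astype(np.float32)
unit /= np.linalg.norm(unit, axis=1, keepdims=True)  # cosine = dot product

def best_in_slice(start, stop):
    """Best pair (i, j) with i in [start, stop) and j > i, by cosine sim."""
    sims = unit[start:stop] @ unit.T
    for r in range(stop - start):
        sims[r, : start + r + 1] = -np.inf  # score each pair once
    r, c = np.unravel_index(np.argmax(sims), sims.shape)
    return float(sims[r, c]), (start + int(r), int(c))

bounds = list(range(0, len(unit), 1_000)) + [len(unit)]
with ThreadPoolExecutor() as pool:  # BLAS matmuls release the GIL
    results = pool.map(best_in_slice, bounds[:-1], bounds[1:])
best_sim, best_pair = max(results)
```

The same slicing gives the multi-machine split: ship each `(start, stop)` range to a worker and take the max of the returned (similarity, pair) tuples.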
Also, this falls into the category of “embarrassingly parallel” problems - it should be straightforward to divide the work across multiple machines if really necessary, eg see Ray and the surrounding python ecosystem.
This sounds like the "Closest pair of points problem", which has fast solutions - O(n log(n)) comparisons instead of O(n^2), which makes a huge difference when you have 10 million vectors - though the classic divide-and-conquer algorithms are designed for low dimensions, so check how they scale to 500-d. You might not find an off-the-shelf solution, but there are some good explanations of the approach at least.
The performance numbers show time performance but not accuracy performance. Binary embeddings clearly have a time advantage, at the expense of accuracy. Reporting/benchmarking in this domain requires (at least) two dimensions.
I wholeheartedly agree :) I've mostly put this together this weekend, so I haven't had time to do everything yet, but that's definitely on the todo list.
I fine-tuned some SigLIP models (see here: https://huggingface.co/carsonpoole/binary-siglip-text) that should be amenable to binary quantization, so tomorrow I'm going to get an inference server running with them and then hopefully run the typical MTEB embedding benchmarks
There are also FAISS binary indexes[0], so it'd be great to compare binary index vs binary index. Otherwise it seems a little misleading to say it is a FAISS vs not FAISS comparison, since really it would be a binary index vs not binary index comparison. I'm not too familiar with binary indexes, so if there's a significant difference between the types of binary index then it'd be great to explain what that is too.
I remember stumbling upon an early discussion about this [0] a bit ago in the EleutherAI discord when searching for discussion about a paper; I'm glad to see it's turned into something public.
Being in such a crowded space, may I ask what your motivation was for creating a new vector store?
edit: I suggest creating a comparison table. It would inform potential users and help you plan your roadmap. Also talk to customers; what's the market for petabyte-scale RAG?
IMO at least there are a lot of things that other vector DBs are missing and should exist. I want to make this work at petabyte scale data with the features on the readme's roadmap plus some others I have nebulous ideas about.
Very impressive! Do you have similar scale tests at higher numbers? I’m curious what the numbers are at 1B/1T vectors, and amounts larger than what can fit in memory
not yet, but scaling is roughly linear, since it's a brute-force algorithm. so with the current version it'd probably be about 22 seconds for a 1B-vector search. the whole point of having metadata queries is to prevent those kinds of searches from being necessary, and hopefully with some FTS interspersed it can reduce the number of similarity comparisons required even more