Is this ready to go? Would love to see a Quickstart usage example somewhere.
I have a task which would potentially hugely benefit from this. I have 10 million 500-element vectors and need to find their nearest pair (could settle for an approximate match). One-time job, can crank away for weeks if required. The SQLite options seem like they will not be able to handle the load, so I was going to explore the Postgres vector extensions.
Napkin math says this would require a stupidly huge number of comparisons, but there is so much buzz about vector databases that I wanted to see if the indexes had some magic which could make this feasible without terabytes of RAM.
What have you tried so far? As a strong baseline, storing 1e7 512-bit vectors in memory should take about 640 MB of RAM. A binary Hamming distance between a query and a gallery is [popcnt(query xor elt) for elt in gallery], which should be plenty fast implemented with vectorized SSE instructions. I’d be shocked if this took more than 10ms per query per thread on modern hardware.
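A minimal sketch of that baseline in NumPy (the gallery size and the planted match at index 42 are illustrative; a real run would pack the actual bit vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gallery: 10,000 random 512-bit vectors, each packed into 64 uint8 bytes.
# At this packing, 10 million vectors would occupy ~640 MB.
gallery = rng.integers(0, 256, size=(10_000, 64), dtype=np.uint8)
query = gallery[42].copy()  # plant an exact match so the answer is known

# popcount(query XOR elt): XOR the packed bytes, unpack to bits,
# then sum the set bits per row to get Hamming distances.
xor = np.bitwise_xor(gallery, query)
dists = np.unpackbits(xor, axis=1).sum(axis=1)

best = int(dists.argmin())  # index of the nearest gallery vector
```

`unpackbits` trades some memory for portability; on recent NumPy, `np.bitwise_count` does the popcount without unpacking.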
These are not bit vectors, but float64. I could probably convert to float32 to cut down the RAM. This is a one-time job, so the data won't change.
My initial attempt has been using the scikit-learn cosine similarity function[0], which lets you take a query matrix against a target one. The actual calculation is pretty fast and seems quite optimized (at least, I can see all of my cores max out when it does the calculation). Then I iterate across the result matrix, sorting for the best matches, and record the best pair seen to date.
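That loop can be sketched in plain NumPy (after L2 normalization, cosine similarity is just a matrix product, so scikit-learn isn't strictly needed). The 2,000-vector gallery and chunk size here are toy stand-ins for the 10 million:

```python
import numpy as np

rng = np.random.default_rng(1)
vecs = rng.standard_normal((2_000, 500)).astype(np.float32)  # stand-in data

# Cosine similarity reduces to a dot product once rows are L2-normalized.
unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

best_sim, best_pair = -1.0, (0, 0)
chunk = 500  # tune so each (chunk x N) block fits comfortably in RAM
for start in range(0, len(unit), chunk):
    block = unit[start:start + chunk]
    sims = block @ unit.T  # (chunk, N) similarity matrix
    # Mask self-similarity and pairs already covered by earlier rows,
    # so each unordered pair (i, j) with i < j is scored exactly once.
    for r in range(sims.shape[0]):
        sims[r, : start + r + 1] = -np.inf
    r, c = np.unravel_index(np.argmax(sims), sims.shape)
    if sims[r, c] > best_sim:
        best_sim, best_pair = float(sims[r, c]), (start + int(r), int(c))
```

Only the running best is kept, so memory stays at one chunk's similarity block rather than the full n² matrix.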
You are spot on, 10ms internal time for 100000 molecules, 60ms with the "stack on top": https://www.chemeo.com/similar?smiles=CCCCO
It uses a manually coded implementation with popcnt, most likely not that efficient.
It works right now, but I'm actively adding a lot of additional things that might make your life easier. The roadmap on the readme shows what I'm working on adding. Feel free to shoot me an email at carson at poole.ai and I'd be happy to give some guidance, but a quickstart is definitely at the top of my priorities also. :)
You don’t need a DB, I would avoid that for a one time job (I’ve used pgvector a lot).
Since your data fits in memory (about 20 GB at FP32: 10M × 500 dims × 4 bytes), I would start with a simple Python script that does naive exhaustive search, which is O(n^2).
You can do approximate search which will sacrifice some accuracy, but you’ll have to build an index first.
HNSW index is state of the art right now and will give you accurate and fast approximate vector search, O(logᵏn). But the tradeoff is it can take a significant amount of time to build the graph data structure, which may not be favorable given the one off nature of your situation.
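To make the index-then-prune idea concrete, here's a toy sketch using random-hyperplane LSH rather than HNSW (HNSW itself is better served by a library like hnswlib or FAISS; every size and name here is illustrative). Vectors that hash to the same bucket are candidate near-pairs, so you only compare within buckets instead of all n² pairs:

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(2)
vecs = rng.standard_normal((5_000, 500)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Random-hyperplane LSH: hash each vector to the sign pattern of 16
# random projections; similar vectors tend to land in the same bucket.
planes = rng.standard_normal((500, 16)).astype(np.float32)
codes = vecs @ planes > 0            # (5000, 16) sign bits
keys = np.packbits(codes, axis=1)    # 16 bits -> 2 bytes per vector

buckets = defaultdict(list)
for i, key in enumerate(map(bytes, keys)):
    buckets[key].append(i)

# Exhaustive comparison only within buckets: far fewer than n^2 pairs.
candidate_pairs = sum(len(b) * (len(b) - 1) // 2 for b in buckets.values())
```

The accuracy/speed trade-off shows up directly: more hash bits means fewer candidates but a higher chance the true nearest pair lands in different buckets (mitigated in practice by using several hash tables).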
I would be sure to use a vectorized (SIMD) similarity search implementation, and multithreading to split the work among all CPU cores on a beefy machine.
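A sketch of the multithreaded split, assuming NumPy backed by a multithreaded BLAS (the matmuls release the GIL, so a thread pool scales across cores; the 4,000-vector gallery and slice size are toy values):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(3)
unit = rng.standard_normal((4_000, 500)).astype(np.float32)
unit /= np.linalg.norm(unit, axis=1, keepdims=True)  # cosine = dot product

def best_in_slice(start, stop):
    """Best pair (i, j) with i in [start, stop) and j > i, by cosine sim."""
    sims = unit[start:stop] @ unit.T
    for r in range(stop - start):
        sims[r, : start + r + 1] = -np.inf  # score each pair once
    r, c = np.unravel_index(np.argmax(sims), sims.shape)
    return float(sims[r, c]), (start + int(r), int(c))

bounds = list(range(0, len(unit), 1_000)) + [len(unit)]
with ThreadPoolExecutor() as pool:  # BLAS matmuls release the GIL
    results = pool.map(best_in_slice, bounds[:-1], bounds[1:])
best_sim, best_pair = max(results)
```

The same slicing gives the multi-machine split: ship each `(start, stop)` range to a worker and take the max of the returned (similarity, pair) tuples.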
Also, this falls into the category of “embarrassingly parallel” problems - it should be straightforward to divide the work across multiple machines if really necessary, eg see Ray and the surrounding python ecosystem.
This sounds like the "Closest pair of points problem", which has fast solutions - O(n log(n)) comparisons instead of O(n^2), which makes a huge difference when you have 10 million vectors - though the classic divide-and-conquer algorithms are designed for low dimensions, so check how they scale to 500-d. You might not find an off-the-shelf solution, but there are some good explanations of the approach at least.
The performance numbers show time performance but not accuracy performance. Binary embeddings clearly have a time advantage, at the expense of accuracy. Reporting/benchmarking in this domain requires (at least) two dimensions.
I wholeheartedly agree :) I've mostly put this together this weekend, so I haven't had time to do everything yet, but that's definitely on the todo list.
I fine-tuned some SigLIP models (see here: https://huggingface.co/carsonpoole/binary-siglip-text) that should be amenable to binary quantization, so tomorrow I'm going to get an inference server running with them and then hopefully run the typical MTEB embedding benchmarks
There are also FAISS binary indexes[0], so it'd be great to compare binary index vs binary index. Otherwise it seems a little misleading to say it is a FAISS vs not FAISS comparison, since really it would be a binary index vs not binary index comparison. I'm not too familiar with binary indexes, so if there's a significant difference between the types of binary index then it'd be great to explain what that is too.
I remember stumbling upon an early discussion about this [0] a bit ago in the EleutherAI discord when searching for discussion about a paper; I'm glad to see it's turned into something public.
Being in such a crowded space, may I ask what your motivation was for creating a new vector store?
edit: I suggest creating a comparison table. It would inform potential users and help you plan your roadmap. Also talk to customers; what's the market for petabyte-scale RAG?
IMO at least there are a lot of things that other vector DBs are missing and should exist. I want to make this work at petabyte scale data with the features on the readme's roadmap plus some others I have nebulous ideas about.
Very impressive! Do you have similar scale tests at higher numbers? I’m curious what the numbers are at 1B/1T vectors, and amounts larger than what can fit in memory
not yet, but scaling is roughly linear, since it's a brute-force algorithm. so with the current version it'd probably be about 22 seconds for a 1B-vector search. the whole point of having metadata queries is to prevent those kinds of searches from being necessary, and hopefully with some FTS interspersed it can reduce the number of similarity comparisons required even more