
You don’t need a DB; I would avoid one for a one-time job (and I’ve used pgvector a lot).

Since your data fits in memory (18 GB @ FP32), I would start with a simple python script that does naive exhaustive search, which is O(n^2).
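A minimal sketch of the exhaustive version, assuming NumPy and a hypothetical embeddings.npy file (shapes, k, and chunk size are placeholders). Each chunk materializes a (chunk, n) block of scores, so keep chunks small:

    import numpy as np

    emb = np.load("embeddings.npy")  # hypothetical file, shape (n, dim), float32
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot = cosine
    n = emb.shape[0]

    k, chunk = 10, 256
    top = np.empty((n, k), dtype=np.int64)

    for s in range(0, n, chunk):
        sims = emb[s:s + chunk] @ emb.T                  # cosine similarity block
        cand = np.argpartition(-sims, k, axis=1)[:, :k]  # k best per row, unordered
        order = np.take_along_axis(-sims, cand, axis=1).argsort(axis=1)
        top[s:s + chunk] = np.take_along_axis(cand, order, axis=1)
    # note: each row's own index appears in its top-k (self-similarity is 1.0)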

You can do approximate search, which will sacrifice some accuracy, but you’ll have to build an index first.

An HNSW index is state of the art right now and will give you accurate, fast approximate vector search, roughly O(log n) per query. But the tradeoff is that building the graph data structure can take a significant amount of time, which may not pay off given the one-off nature of your situation.
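For reference, building one is only a few lines with e.g. the hnswlib package; a sketch, again assuming a hypothetical embeddings.npy (the M / ef values are generic starting points, not tuned for your data):

    import hnswlib
    import numpy as np

    emb = np.load("embeddings.npy").astype(np.float32)  # hypothetical file
    n, dim = emb.shape

    index = hnswlib.Index(space="cosine", dim=dim)
    # M and ef_construction trade build time for recall; the build is the slow part
    index.init_index(max_elements=n, M=16, ef_construction=200)
    index.add_items(emb, np.arange(n))

    index.set_ef(100)  # query-time speed/recall knob
    labels, dists = index.knn_query(emb, k=10)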

I would be sure to use a vectorized (SIMD) similarity search implementation, and multithreading to split the work among all CPU cores on a beefy machine.
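With NumPy, the matmul above already dispatches to a SIMD-vectorized BLAS (OpenBLAS or MKL), which also multithreads large matrix products; the main thing to check is that the thread count matches your core count. A sketch, assuming an OpenMP-based BLAS build (the core count is a placeholder):

    import os
    # must be set before numpy is imported; use OPENBLAS_NUM_THREADS
    # instead for pthread builds of OpenBLAS
    os.environ["OMP_NUM_THREADS"] = "32"

    import numpy as np
    a = np.random.rand(8192, 960).astype(np.float32)
    sims = a @ a.T  # runs on the threaded, SIMD GEMM kernel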

Also, this falls into the category of “embarrassingly parallel” problems - it should be straightforward to divide the work across multiple machines if really necessary, e.g. see Ray and the surrounding Python ecosystem.
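A sketch of what that could look like with Ray (file name and chunk size are placeholders; each task computes its own slice of the similarity matrix):

    import numpy as np
    import ray

    ray.init()  # or ray.init(address="auto") to join an existing cluster

    emb = np.load("embeddings.npy")  # hypothetical file, shape (n, dim)
    n = emb.shape[0]
    emb_ref = ray.put(emb)           # put the matrix in the object store once

    @ray.remote
    def top_k_rows(emb, start, stop, k=10):
        sims = emb[start:stop] @ emb.T
        return np.argpartition(-sims, k, axis=1)[:, :k]

    futures = [top_k_rows.remote(emb_ref, s, min(s + 1024, n))
               for s in range(0, n, 1024)]
    blocks = ray.get(futures)        # list of per-chunk top-k index arrays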

Agree in principle but why use fp32? These are binary vectors, so OP just needs a fast Hamming distance
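A sketch of that, assuming the bits are packed 8-per-byte in a hypothetical packed_bits.npy (on NumPy >= 2.0 the lookup table can be replaced with np.bitwise_count):

    import numpy as np

    packed = np.load("packed_bits.npy")  # hypothetical, shape (n, dim // 8), uint8

    # 256-entry popcount lookup table: set-bit count for every byte value
    popcount = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None], axis=1).sum(1)

    def hamming_to_all(i):
        xor = np.bitwise_xor(packed, packed[i])  # differing bits, bytewise
        return popcount[xor].sum(axis=1)         # Hamming distance to every row

    dists = hamming_to_all(0)  # distances from row 0 to all n rows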