Using Your Vector Database as a JSON (Or Relational) Datastore

jitl · 2024-04-23T03:28:06

From the intro:

> we've seen many companies and organizations hop on the vector search bandwagon, from NoSQL database providers such as MongoDB (via Atlas Vector Search) to traditional relational databases such as Postgres (via pgvector). The general messaging I hear around these vector search plugins is largely the same and goes something like this: developers should stick with us since you can store tables/JSON in addition to vectors, so there is no need to manage multiple pieces of infrastructure!

> This kind of statement always cracks me up, as it's clearly crafted by unsophisticated marketing teams

From the conclusion:

> Once your application starts requiring more complex workloads (such as joins or aggregations), that's when you'll want to contemplate using different data stores.

I don’t know, after reading through the post I’ve come away on the side of the “unsophisticated marketing teams”; if the vector DB doesn’t have transactions, joins, or aggregations, it sounds like I’d hit its limitations pretty quickly, and then I’d rather have an all-in-one industrial-strength DB like Postgres instead of two systems.

jurip · 2024-04-23T05:28:00

"More complex workloads such as joins". How far we have fallen.

esafak · 2024-04-23T05:32:12

You joke, but you would not blame redis for not doing joins. Vector databases are really more like caches that do similarity search. Maybe we should call them vector stores.

flawsofar · 2024-04-23T07:03:55

I can see uses for joining vector queries. Give me all the people who are attracted to people like this one. Boom, embedded.

esafak · 2024-04-24T15:59:39

> Give me all the people who are attracted to people like this one

This can be handled with a simple batch query rather than a join, even if you want to query for multiple users, fetching all of their related users. In qdrant, for example: https://qdrant.tech/blog/batch-vector-search-with-qdrant/
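As a rough illustration of the batch idea (not Qdrant's actual API), here is a minimal pure-Python sketch that answers "who is most similar to each of these users" in one call instead of a join. The `users` embeddings, the `batch_search` helper, and the toy 2-d vectors are all hypothetical; a real vector store would use an approximate index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def batch_search(store, query_ids, limit=2):
    """Return the `limit` most similar ids for each query id, in a single call."""
    results = {}
    for qid in query_ids:
        # Score every other stored vector against this query (brute force).
        scored = [
            (other_id, cosine(store[qid], vec))
            for other_id, vec in store.items()
            if other_id != qid
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[qid] = [other_id for other_id, _ in scored[:limit]]
    return results


# Hypothetical user embeddings (toy 2-d vectors, for illustration only).
users = {
    "alice": [1.0, 0.0],
    "bob": [0.9, 0.1],
    "carol": [0.0, 1.0],
    "dave": [0.1, 0.9],
}

# One batch request covering several users at once.
similar = batch_search(users, ["alice", "carol"], limit=1)
```

The point is only that several independent similarity lookups can be sent together; combining the results with other tables is still left to the caller.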

Culonavirus · 2024-04-23T09:39:18

This is what gets me more and more with LLMs and their "ability to code". A join is a bedrock of a database. Always has been. Always will be. All LLMs are an absolute shitshow when it comes to any SQL logic involving joins. It's hilariously bad. But what can I expect from an LLM when apparently even people consider a JOIN a "more complex workload"? What a joke.

randomdata · 2024-04-23T13:41:36

Obviously it is a more complex workload. Consider a simple database like a hash map. Accessing one record in that map is easy, but it should be quite apparent that you need quite a bit more logic if you want to join that record with other records.

And that's not far off how current generation vector stores are implemented. They are not designed for efficient joins (nor aggregations) and changing that, like above, does require a more complex workload.

There is no free lunch in engineering, just a choice in which trade-offs you are willing to pay for.
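To make the hash map comparison concrete, here is a small sketch under made-up data: the `users` and `orders` maps and the `hash_join` helper are hypothetical, and this is a textbook in-memory hash join, not how any particular vector store works.

```python
from collections import defaultdict

# Hypothetical toy data: two "tables" stored as hash maps keyed by id.
users = {1: {"name": "ann"}, 2: {"name": "bo"}}
orders = {
    100: {"user_id": 1, "item": "book"},
    101: {"user_id": 2, "item": "pen"},
    102: {"user_id": 1, "item": "lamp"},
}


def get_user(user_id):
    """Single-record access: one hash lookup, nothing else to do."""
    return users[user_id]


def hash_join(users, orders):
    """Join users to their orders on orders.user_id."""
    # Build phase: orders are keyed by order id, so index them by the join key.
    by_user = defaultdict(list)
    for order_id, order in orders.items():
        by_user[order["user_id"]].append((order_id, order))

    # Probe phase: look up each user's orders in the index built above.
    rows = []
    for user_id, user in users.items():
        for order_id, order in by_user.get(user_id, []):
            rows.append((user["name"], order_id, order["item"]))
    return rows


joined = hash_join(users, orders)
```

Even in this toy form the join needs an extra index and a second pass, which is the additional logic a store built around single-key lookups does not come with.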

sgarland · 2024-04-23T12:06:55

Eh? SQL is one area where I’ve found LLMs – at least, ChatGPT – to be surprisingly decent. I’ve even had it translate relational algebra back and forth.

w1 · 2024-04-23T04:21:04

Postgres (with pgvector) is an unbelievably goated vector db. Idk why anyone would use anything else.

stephen37 · 2024-04-23T07:50:25

It's nice to begin with, but from what I've tried and from talking to different people using it, you usually hit scale issues quickly.

Also, as far as I know, it only offers indexes stored in memory, and it lacks support for other index types or more advanced ones like GPU indexes.

https://zilliz.com/comparison/milvus-vs-pgvector

brigadier132 · 2024-04-23T05:57:19

pgvector stores the index in memory, right? Do you really want your primary DB using all its memory for vectors and all its CPU for doing vector dot products?
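For a sense of the sizes involved, a back-of-the-envelope sketch: it assumes float32 values (4 bytes each) and a hypothetical 1536-dimensional embedding, counts only the raw vectors, and says nothing about how pgvector actually manages memory or how much an index adds on top.

```python
GIB = 1024 ** 3


def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (no index, no row overhead)."""
    return num_vectors * dim * bytes_per_value


# Hypothetical dimension; pick whatever your embedding model produces.
DIM = 1536

for count in (100_000, 1_000_000, 10_000_000):
    size = raw_vector_bytes(count, DIM)
    print(f"{count:>10,} vectors x {DIM} dims: {size / GIB:.2f} GiB")
```

Whether that footprint competes with the primary workload depends on the machine and on how much of it has to stay resident, which this estimate does not answer.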

Bella-Xiang · 2024-04-23T05:48:59

There are significant differences in performance, accuracy, and other aspects when using a general-purpose database like PostgreSQL. These differences can lead to bottlenecks in performance and data scale. You can refer to this comparison article to learn more https://myscale.com/blog/myscale-vs-postgres-opensearch/

manishsharan · 2024-04-23T13:37:36

So you want to run mission critical workloads on Postgres and do a vector similarity search on the same DB?

trosterud · 2024-04-23T11:12:30

And Datomic is the goat outside of vectors

redwood · 2024-04-23T11:17:34

This "cracks me up" framing is a bit embarrassing to read, maybe they don't realize how they come across

mritchie712 · 2024-04-23T10:55:50

Show me what I can't do in pgvector that I can do in your vector database. That's all I need to see. If your only answer is, "it doesn't scale", I'll need you to be specific about "scale".

stephen37 · 2024-04-23T12:02:53

Off the top of my head, pgvector only supports 2 index types, and those run in memory only. It doesn't support GPU indexing or disk-based indexing, and it doesn't separate queries from insertions.

Also, the different people I've talked to struggle with scale past 100K-1M vectors.

You can also have a look yourself from a performance perspective: https://ann-benchmarks.com/

gk1 · 2024-04-23T12:10:28

Here you go: https://www.pinecone.io/blog/pinecone-vs-pgvector/

tibanne · 2024-04-23T12:35:09

Here are some other potential posts that could be written in the same style:

Using a Space Shuttle to Go Grocery Shopping

Employing a Symphony Orchestra to Play Happy Birthday

Using a Chainsaw to Cut Birthday Cake

Driving a Formula 1 Car on a School Run

Booking a Five-Star Hotel for a Nap

khaki54 · 2024-04-23T11:22:47

We've come full circle!