RAG at scale: Synchronizing and ingesting billions of text embeddings (medium.com/neum_ai)
160 points by picohen 7 months ago | 55 comments



We are also developing an open-source solution for those who would like to test it out and/or contribute. It can be consumed as a web service or embedded into .NET apps. The project is codenamed "Semantic Memory" (available on GitHub) and offers customizable external dependencies, such as Azure Queues, RabbitMQ, or other alternatives for queuing, and options like Azure Cognitive Search and Qdrant for vector storage (with plans to include Weaviate and more). The architecture is similar, with queues and pipelines.

We believe that enabling custom dependencies and logic, as well as the ability to add/remove pipeline steps, is crucial. As of now, there is no definitive answer to the best chunk size or embedding model, so our project aims to provide the flexibility to inject and replace components and pipeline behavior.
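
Roughly, the idea looks like this (a minimal Python sketch of the dependency-injection pattern, not the .NET project's actual API; the class and method names are hypothetical):

    from typing import Protocol

    class Queue(Protocol):
        def enqueue(self, message: dict) -> None: ...

    class VectorStore(Protocol):
        def upsert(self, chunk_id: str, embedding: list[float], metadata: dict) -> None: ...

    class IngestionService:
        def __init__(self, queue: Queue, store: VectorStore, chunk_size: int = 512):
            # Any Queue implementation (Azure Queues, RabbitMQ, ...) and any
            # VectorStore implementation (Azure Cognitive Search, Qdrant, ...)
            # can be injected; chunk size is just another swappable parameter.
            self.queue = queue
            self.store = store
            self.chunk_size = chunk_size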

Regarding scalability, LLM text generators and GPUs remain a limiting factor in this area as well. LLMs hold great potential for analyzing input data, and I believe the focus should be less on the speed of queues and storage and more on finding the optimal way to integrate LLMs into these pipelines.


The queues and storage are the foundation on which some of these other integrations can be built on top. Agree fully on the need for LLMs within the pipelines to help with data analysis.

Our current perspective has been on leveraging LLMs as part of async processes to help analyze data. This only really works when your data follows a template, where I might be able to apply the same analysis to a vast number of documents. Otherwise it becomes too expensive to do on a per-document basis.

What types of analysis are you doing with LLMs? Have you started to integrate some of these into your existing solution?


Currently we use LLMs to generate a summary, used as an additional chunk. As you might guess, this can take time, so we postpone the summarization to the end (the current default pipeline is: extract, partition, gen embeddings, save embeddings, summarize, gen embeddings (of the summary), save embeddings).
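
To make the ordering concrete, a rough Python sketch of such a pipeline with summarization deferred to the end (step names and the run_pipeline helper are illustrative, not Semantic Memory's actual code):

    DEFAULT_STEPS = [
        "extract",          # pull raw text out of the source file
        "partition",        # split text into chunks
        "gen_embeddings",   # embed each chunk
        "save_embeddings",  # write chunk embeddings to the vector store
        "summarize",        # slow LLM call, deliberately last
        "gen_embeddings",   # embed the generated summary
        "save_embeddings",  # write the summary embedding
    ]

    def run_pipeline(document, handlers, steps=DEFAULT_STEPS):
        """Run each step in order; handlers maps step name -> callable."""
        state = {"document": document}
        for step in steps:
            state = handlers[step](state)
        return state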

Initial tests though are showing that summaries are affecting the quality of answers, so we'll probably remove it from the default flow and use it only for specific data types (e.g. chat logs).

There's a bunch of synthetic data scenarios we want to leverage LLMs for. Without going too much into details, sometimes "reading between the lines", and for some memory consolidation patterns (e.g. a "dream phase"), etc.


Makes sense. Interesting on the fact that summaries affect quality sometimes.

For synthetic data scenarios are you also thinking about synthetic queries over the data? (Try to predict which chunks might be more used than others)


yes, queries and also planning.

For instance, given the user "ask" (which could be any generic message in a copilot), decide how to query one or multiple storages. Ultimately, companies and users have different storages, and a few can be indexed with vectors (and additional fine tuned models). But there's a lot of "legacy" structured data accessible only with SQL and similar languages, so a "planner" (in the SK sense of planners) could be useful to query vector indexes, text indexes and knowledge graphs, combining the result.
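
A minimal sketch of that routing-and-merging idea in Python, assuming the retrievers (vector index, SQL over legacy tables, knowledge graph) are provided elsewhere; this only illustrates the shape of the planner, not SK's actual planner API:

    def plan_and_retrieve(ask, retrievers):
        """retrievers: list of (should_use, retrieve) pairs, e.g. a vector
        index, SQL over legacy structured data, a knowledge graph."""
        results = []
        for should_use, retrieve in retrievers:
            if should_use(ask):              # routing decision (could itself be LLM-made)
                results.extend(retrieve(ask))  # each result assumed to be a dict
        # Deduplicate while preserving order before handing context to generation
        seen, merged = set(), []
        for r in results:
            key = r.get("id", str(r))
            if key not in seen:
                seen.add(key)
                merged.append(r)
        return merged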


Really interesting library.

Is anyone aware of something similar but hooked into Google Cloud infra instead of Azure?


We could easily add that if there's interest, e.g. using Pub/Sub and Cloud Storage. If there are .NET libraries, it should be straightforward to implement the relevant interfaces. Similar considerations apply to the inference part, embedding and text generation.



Why .NET apps specifically?


Multiple reasons, some are subjective as usual in these choices. Customers, performance, existing SK community, experience, etc.

However, the recommended use is running it as a web service, so from a consumer perspective the language doesn't really matter.


We’re also building a billion-scale pipeline for indexing embeddings. Like the author, most of our pain has been scaling. If you only had to do millions, this whole pipeline would be ~100 LoC. But billions? Our system is at 20k LoC and growing.

The biggest surprise to me here is using Weaviate at the scale of billions — my understanding was that this would have tremendous memory requirements (on the order of a TB of RAM), which are prohibitively expensive ($10-50k/month for that much memory).

Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.


Co-author of article here.

Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files and a failure halfway through, it is imperative to be able to recover.

RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but from a performance perspective it delivers on the requirements of our customers. (On Weaviate we explored using product quantization.)

What type of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?


Disk retrieval is definitely slower. In-memory retrieval can typically be ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of, 50ms of latency is good enough. The best part is that the cost is driven by disk, not RAM, which means instead of $50k/month for ~1TB of RAM you're talking about $1k/month for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. Saving ~$49k/month in exchange for an extra 50ms of latency is a pretty clear, easy tradeoff.
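
The arithmetic, using the rough figures from this thread (the commenter's estimates, not benchmarks):

    # Rough cost comparison using the numbers quoted above (estimates only).
    ram_cost_per_month  = 50_000   # ~1 TB of RAM for an in-memory index
    nvme_cost_per_month = 1_000    # fast NVMe on a fast network link

    savings = ram_cost_per_month - nvme_cost_per_month   # 49,000 / month
    ratio   = ram_cost_per_month / nvme_cost_per_month   # 50x cheaper

    extra_latency_ms = 50   # ~1 ms in memory vs ~50-100 ms on disk
    print(f"~{ratio:.0f}x cheaper for roughly {extra_latency_ms} ms of extra latency")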


We've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. We've also been using Elasticsearch dense vector fields, which also seem to scale well. It's pricey of course, but we already have it in our infra, so it works well.


What type of latency requirements are you dealing with? (i.e. lookup time, ingestion time)

Were you using Postgres already, or did you migrate data into it?


I'd love to know the answer here too!

I've run a few tests on pg, and retrieving 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances. And that was without a vector index.
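
For reference, the kind of test described above looks roughly like this (table, column and connection details are made up; this is not the exact benchmark):

    # Hypothetical version of the test: fetch 100 random rows by int64 primary
    # key from a billion-row table (no vector index involved).
    import random, time
    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    ids = [random.randint(1, 1_000_000_000) for _ in range(100)]

    start = time.perf_counter()
    with conn.cursor() as cur:
        cur.execute("SELECT id, payload FROM items WHERE id = ANY(%s)", (ids,))
        rows = cur.fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"fetched {len(rows)} rows in {elapsed_ms:.0f} ms")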

Entirely possible my take was too cursory; would love to know what latencies you're getting, bryan0!


> 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances.

Is there a write up of the analysis? Something seems very wrong with that taking 700ms


We have lookup latency requirements on the Elasticsearch side. pgvector is currently a staging and aggregation database, so lookup latency is not so important there. Our requirement right now is that we need to be able to embed and ingest ~100M vectors/day, which we can achieve without any problems now.

For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.
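
A sketch of what such a pre-filtered pgvector query could look like (table and column names are made up for illustration, not the commenter's actual schema):

    # Hypothetical pgvector query that pre-filters on a regular B-tree index
    # (e.g. tenant_id) before the vector distance ordering.
    import psycopg2

    conn = psycopg2.connect("dbname=embeddings")  # placeholder connection string
    query_vec = "[" + ",".join(["0.01"] * 1536) + "]"  # embedding as a pgvector text literal

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            WHERE tenant_id = %s                 -- cheap pre-filter on a normal index
            ORDER BY embedding <-> %s::vector    -- pgvector L2 distance
            LIMIT 20
            """,
            ("tenant-123", query_vec),
        )
        results = cur.fetchall()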

yes, we use postgres pretty extensively already.


What size are your embeddings?



What kind of retrieval performance are you observing with Lance?


For a "small" dataset of 50M and 0.5TB in size with 20 results get around 50-100ms.


What statistics/metrics are used to evaluate RAG systems? Is there any paper that systematically compares different RAG methods (chunkings, models, etc.)? I would assume that such metrics would be similar to those used for evaluating summarization or question answering, but I am curious to know if there are specific methods/metrics used to evaluate RAG systems.


This is a great article about the technical difficulties of building a RAG system at scale from an engineering perspective, where performance is about speed and compute. A topic that is not addressed is how to evaluate a RAG system, where performance is about whether the system retrieves the correct context and answers questions accurately. A RAG system should be built so that the different parts (retriever, embedder, etc.) can easily be taken out and modified to improve its accuracy at answering questions. Whether a RAG system is answering questions accurately should be assessed during development and then continuously monitored.


Co-author of the article here.

You are right. Retrieval accuracy is important as well. From an accuracy perspective, are there any tools you have found useful for validating retrieval accuracy?

In our current architecture, all the different pieces within the RAG ingestion pipeline are modifiable to be able to improve loading, chunking and embedding.

As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) to test different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.


Did you consider pre-processing each chunk separately to generate useful information - summary, title, topics - that would enrich embeddings and aid retrieval? Embeddings only capture surface form. "Third letter of second word" won't match embedding for letter "t". Info has surface and depth. We get depth through chain-of-thought, but that requires first digesting raw text with an LLM.
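
The enrichment idea, as a sketch (the summarize/extract/embed helpers stand in for whatever LLM and embedding calls you would actually use):

    # Derive summary / title / topics per chunk and embed the enriched text
    # alongside (or instead of) the raw chunk. All helpers are hypothetical.
    def enrich_chunk(chunk_text, summarize, extract_title, extract_topics, embed):
        summary = summarize(chunk_text)
        title = extract_title(chunk_text)
        topics = extract_topics(chunk_text)
        enriched = (
            f"Title: {title}\nTopics: {', '.join(topics)}\n"
            f"Summary: {summary}\n\n{chunk_text}"
        )
        return {
            "text": chunk_text,
            "enriched_text": enriched,
            "embedding": embed(enriched),  # depth made explicit before embedding
        }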

Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.

Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.

A RAG system should have an ingestion LLM step for retrieval augmentation and probably hierarchical summarisation up to a decent level. It will be adding insight into the system by processing the raw documents into a more useful form.


Not at scale. Currently we do some extraction for metadata, but it's pretty simple. Doing LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them. Summarizing each document before ingestion could cost thousands of dollars when you have billions of documents.

We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but from a scale perspective. For example, if we have 1 million docs but we know they are generally similar in format / template, then we can bypass having to use an LLM to analyze them one by one and simply help create scripts to extract the right info.
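
The "analyze one, script the rest" idea in miniature (the fields and regexes below are invented for illustration, not Neum's actual selectors):

    # If a million docs share a template, derive extraction rules once
    # (possibly with an LLM's help) and apply them cheaply to every document.
    import re

    EXTRACTION_RULES = {
        "invoice_number": re.compile(r"Invoice #:\s*(\S+)"),
        "total": re.compile(r"Total:\s*\$?([\d,.]+)"),
    }

    def extract_metadata(doc_text):
        """Cheap, LLM-free extraction applied per document at scale."""
        return {
            field: (m.group(1) if (m := pattern.search(doc_text)) else None)
            for field, pattern in EXTRACTION_RULES.items()
        }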

We think there are clever approaches like this that can help improve RAG while still being scalable.


Do you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.


> From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?

You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.

Plus if you’re going to spend $$$ embedding tons of docs you’ll want to compare to a “dumb” baseline like bm25.
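
Minimal versions of two of those metrics, for a labeled set of queries with known relevant docs:

    def reciprocal_rank(relevant, retrieved):
        """MRR component: 1/rank of the first relevant doc, 0 if none found."""
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / rank
        return 0.0

    def recall_at_k(relevant, retrieved, k):
        hits = len(set(retrieved[:k]) & set(relevant))
        return hits / len(relevant) if relevant else 0.0

    # Average over a query set; run the same set against a BM25 baseline
    # to see whether the embeddings are earning their cost.
    labeled = [({"d1"}, ["d3", "d1", "d7"])]   # (relevant ids, retrieved ids)
    mrr = sum(reciprocal_rank(rel, ret) for rel, ret in labeled) / len(labeled)  # 0.5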


Yeah, especially if you're experimenting with training and applying a matrix to the embeddings generated by an off-the-shelf model to help it surface subtleties unique to your domain.
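
A sketch of applying such a learned linear adapter; here the matrix is random just so the example runs, whereas in practice it would be trained on in-domain pairs:

    import numpy as np

    dim = 1536
    W = np.eye(dim) + 0.01 * np.random.randn(dim, dim)  # placeholder for a trained matrix

    def adapt(embeddings: np.ndarray) -> np.ndarray:
        """Project off-the-shelf embeddings into the domain-tuned space and re-normalize."""
        projected = embeddings @ W
        return projected / np.linalg.norm(projected, axis=1, keepdims=True)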


It seems to me that RAG is really search, and search is generally a hard problem without an easy one size fits all solution. E.g., as people push retrieval further and further in the context of LLM generation, they're going to go further down the rabbit hole of how to build a good search system.

Is everyone currently reinventing search from first principles?


I am convinced that we should teach LLMs to use search as a tool instead of creating special search that is useful for LLMs. We now have a lot of search systems, and LLMs can in theory use any kind of text interface; the only problem is the limited context that LLMs can consume. But that is quite orthogonal to what kind of index we use for the search. In fact, for humans it is also useful that search returns limited chunks - we already have that with the 'snippets' that, for example, Google shows - we just need to tweak them a bit, maybe into two kinds of snippets: shorter, as they are now, and longer.

You can use LLMs to do semantic search on top of a keyword search - by telling the LLM to come up with a good search term that includes all the synonyms. But if vector search over embeddings really gives better results than keyword search, then we should start using it in all the other search tools used by humans.
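
That query-expansion step, sketched with a placeholder complete() function standing in for whatever chat/completion API you use:

    def expand_to_keyword_query(question, complete):
        """Ask an LLM to rewrite a natural-language question as a keyword
        query with synonyms, suitable for an existing keyword search engine."""
        prompt = (
            "Rewrite the following question as a keyword search query. "
            "Include likely synonyms, separated by OR.\n\n"
            f"Question: {question}\nQuery:"
        )
        return complete(prompt).strip()

    # e.g. "How do I speed up Postgres vector lookups?" ->
    # "(postgres OR postgresql) (vector OR embedding) (speed OR performance OR latency)"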

LLMs are the more general tool - so adjusting them to the more restricted search technology should be easier and quicker than doing it the other way around.

By the way - this prompted me to create my Opinionated RAG wiki: https://github.com/zby/answerbot/wiki


Depends on what you mean by search. Do you consider all Question Answering as search?

Some questions require multi-hop reasoning or have to be decomposed into simpler subproblems. When you google a question, often the answer is not trivially included in the retrieved text: you have to process it (filter irrelevant information, resolve conflicting information, extrapolate to cases not covered, align the same entities referred to with two different names, etc.), formulate an answer to the original question, and maybe even predict your intent based on your history to personalize the result or customize it into the format you like (markdown, json, csv, etc.).

Researchers have developed many different techniques to solve the related problems. But as LLMs are getting hyped, many people try to tell you LLM+vector store is all you need.


We're using a product from our existing enterprise search vendor, which they pitch as NLP search. Not convinced it's better than the one we already had, considering we have to use an intermediate step of having the LLM turn the user's junk input into a keyword search query, but it's definitely more expensive...


Your intuition that search is being reinvented is correct.

It's still TBD on whether these new generations of language models will democratize search on bespoke corpuses.

There's going to be a lot of arbitrary alchemy and tribal knowledge...


To some degree. The amount of data that will be brought into search solutions will be enormous; it seems like a good time to try to reimagine what that process might look like.


Also, this is search for LLMs, not for humans, so the optimal solution will be different. Even between models, it is not hard to imagine that Mistral-8b will need different results than GPT-4 with its 1.76 trillion parameters.


I think this is premature optimisation. LLMs are the general tool here - in principle we should try first to adjust LLMs to search instead of doing it the other way around.

But really I think that LLMs should use search as just one of their tools - just like humans do. I would call it Tool Augmented Generation. They should also be able to reason through many hops. A good system answers the question _What is the 10th Fibonacci number?_ by looking up the definition in Wikipedia, writing code to compute the sequence, testing and debugging it, and executing it to compute the 10th number.


Are there any good implementations of using RAG within the PostgreSQL ecosystem? I have seen blog posts from Supabase[0] and TimescaleDB[1] but not a full-fledged project. The full text search is very good within Postgres at the moment, and having semantic search within the same ecosystem is quite helpful, at least for simple use cases.

[0] https://supabase.com/docs/guides/database/extensions/pgvecto...

[1] https://www.timescale.com/blog/postgresql-as-a-vector-databa...


Isn't RAG "just" dynamically injecting relevant text into a prompt? What more would one implement to achieve RAG, beyond using Postgres' built-in full text or kNN search?
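
The "just" in question, as a sketch; embed, search, and complete stand in for your embedding model, whatever index you use (Postgres full text, pgvector kNN, etc.), and your LLM call:

    def answer_with_rag(question, embed, search, complete, k=5):
        chunks = search(embed(question), top_k=k)          # retrieval
        context = "\n\n".join(c["text"] for c in chunks)   # context assembly
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return complete(prompt)                            # generation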


What I'm looking for is a neat Python library (or equivalent) that integrates end to end with Postgres/pgvector using SQLAlchemy, enables parallel processing of a large number of documents, and creates interfaces for embeddings using OpenAI/Ollama etc. It looks like fastRAG [0] from Intel is close to what I'm envisioning, but it doesn't appear to have integration with the Postgres ecosystem yet.

[0] https://github.com/IntelLabs/fastRAG


Through the platform (Neum AI) we support the ability to do this with Postgres; it is a cloud platform, though, not a Python library.

Curious what type of customization you are looking to add that makes you want something like a library?


We need something we can orchestrate and control locally and be able to make changes to if need be. A GUI-based interface is good for more mature workflows, but our workflows are constantly evolving and require tweaking that is hard to do with a GUI and web interface.


Timescale recently released Timescale Vector [0], which adds a scalable search index (DiskANN) and efficient time-based vector search, in addition to all the capabilities of pgvector and vanilla PostgreSQL. We plan to add the document processing and embedding creation capabilities you discuss to our Python client library [1] next, but Timescale Vector integrates with LangChain and LlamaIndex today [2], which both have document chunking and embedding creation capabilities. (I work on Timescale Vector.)

[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be... [1]: https://github.com/timescale/python-vector [2]: https://www.timescale.com/ai/#resources


Or generally, what are good vector DBs? I have tried LlamaIndex, Pinecone and Milvus but they all kinda sucked in different ways.


What about them sucked?


Thanks for writing this up! I'm working on a very similar service (https://embeddingsync.com/) and I implemented almost the same thing as you've described here, but using a poll-based stateful workflow model instead of queueing.

The biggest challenge - which I haven't solved as seamlessly as I'd like - is supporting updates / deletes in the source. You don't seem to discuss it in this post; does Neum handle that?


Co-author of the article here.

We do support updates for some sources. Deletes not yet. For some sources we do polling, which is then dumped on the queues. For others we have listeners that subscribe to changes.

What are the challenges you are facing in supporting this?


Similar to you, with polling you only see new data, not the deletion events, so I can't delete embeddings unless I keep track of state and do a diff. To properly support that you/I would effectively need CDC, which gets more complex for arbitrary / self-serve databases.
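
The state-and-diff fallback looks roughly like this (a sketch, not my actual implementation); real CDC would replace it for sources that support it:

    def diff_poll(previous_ids: set, current_docs: list[dict]):
        """Compare the ids seen on the previous poll with the current snapshot;
        anything that disappeared is treated as a delete."""
        current_ids = {d["id"] for d in current_docs}
        new_or_updated = [d for d in current_docs if d["id"] not in previous_ids]
        deleted_ids = previous_ids - current_ids   # embeddings to remove downstream
        return new_or_updated, deleted_ids, current_ids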


Good article BUT I can't fathom that people would use a managed service to generate and store embeddings.

The OpenAI or Replicate embeddings APIs are already a managed service... You would still be managing it all yourself, just through a different API.

And dealing with embeddings is the kind of fun work every engineer wants to do anyway.

Still a good article but very perplexing how the company can exist


Sounds like the same people who use langchain's "Prompt replacement" methods instead of, you know, just using string formatting

https://python.langchain.com/docs/modules/model_io/prompts/p...


Some engineers find it fun, others might not. Same as everything.

IMO the fun parts are actually prototyping and figuring out the right pattern I want to use for my solution. Once you have done that, scaling and dealing with robustness tends to be a bit less fun.


Can anyone who has used such systems for some time comment on their usefulness? Is it something you can't live without, a nice to have, or something you tend to forget is available after a while?


here's how we solved engineering challenges related to RAG using open-source Metaflow: https://outerbounds.com/blog/retrieval-augmented-generation/


We also shared an article about how we run these indexing jobs at scale at deepset with Kubernetes, SQS, S3 and KEDA.

TL;DR: Queue upload events via SQS, upload files to S3, scale consumers based on queue length with KEDA, and use Haystack to turn files into embeddings.

This also works for arbitrary pipelines with your models and custom nodes (Python code snippets), and is pretty efficient.
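
The consumer side of that setup looks roughly like this (queue URL, bucket, and the index_file pipeline call are placeholders; see the linked posts and example repo for the real KEDA-scaled setup):

    # Poll SQS for upload events, fetch the file from S3, run the indexing pipeline.
    import json
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123456789/upload-events"  # placeholder

    def consume_once(index_file):
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            local_path = "/tmp/" + event["key"].split("/")[-1]
            s3.download_file(event["bucket"], event["key"], local_path)
            index_file(local_path)  # e.g. Haystack pipeline -> embeddings -> document store
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])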

Part1 (application&architecture): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Part2 (scaling): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Example code: https://github.com/ArzelaAscoIi/haystack-keda-indexing

We actually also started with Celery, but moved to SQS to improve stability.



