Retrieval-Augmented Generation: the embedding technique that's becoming standard
https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1
Here’s a post from a year ago by a company called Anyscale on how they built a custom LLM-application stack tailored to their dataset. It’s long but straightforward: it shares the code and analyses they did at each step (50+ steps), and what’s happening is pretty understandable to non-coders, I think. I guess reading through parts of this could give UpTrusters more shared vocabulary and common knowledge of which methods are good for which ends, and some ways to reason about them.
Some good snippets/concepts:
Vector DB
A database augmented with the ability to quickly return the stored records whose embeddings are nearest to a query point in embedding space.
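A minimal sketch of the lookup a vector DB performs, using brute-force cosine similarity in Python. Real vector DBs (e.g. FAISS or pgvector) build approximate indexes so this stays fast at millions of records; the records and embeddings here are made-up stand-ins.

```python
import numpy as np

# Toy in-memory "vector DB": one embedding row per stored record.
records = ["refund policy", "onboarding checklist", "pricing tiers"]
rng = np.random.default_rng(0)
vectors = rng.random((len(records), 384))                  # stand-in embeddings
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize once

def nearest(query_vec, k=2):
    """Return the k records whose embeddings are closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = vectors @ q            # cosine similarity, since rows are unit-norm
    return [records[i] for i in np.argsort(-scores)[:k]]

print(nearest(rng.random(384)))
```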
Chunking data
Splitting documents into smaller parts and embedding each part separately, so retrieval can find and return just the relevant passages instead of whole documents.
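A minimal character-window sketch of the idea (the post itself splits along the docs' section structure and experiments with chunk sizes; the 500/50 numbers here are just illustrative):

```python
def chunk(text, size=500, overlap=50):
    """Split text into overlapping windows so a sentence cut at one
    boundary still appears whole in the neighboring chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

# Each chunk is embedded and stored as its own record in the vector DB,
# so retrieval surfaces the relevant passage rather than the whole doc.
```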
re: Fine Tuning
This can be especially impactful if we have a lot of: 1) new tokens that the default tokenization process creates subtokens out of, losing the significance of the token; 2) existing tokens that have contextually different meanings in our use case
(that particular aspect probably isn’t relevant for us, until we have lots of content in rare languages or something.)
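For the curious, here's what that subtoken problem looks like in practice, sketched with OpenAI's tiktoken tokenizer (the term and the exact split shown are illustrative, not from the post):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# A domain-specific term absent from the base vocabulary gets shredded
# into subtokens, so the model never sees it as a single unit:
token_ids = enc.encode("UpTruster")
print([enc.decode([t]) for t in token_ids])  # something like ['Up', 'Tr', 'uster']
```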