The Intersection of AI and Architecture: A Conversation on Retrieval and Search
The development of Artificial Intelligence (AI) and its applications has led to an explosion of interest and innovation across many fields, including computer architecture. As we push the boundaries of what is possible with AI, it is worth examining where these two seemingly disparate areas meet.
We're at a juncture where the number of developers working on AI-related projects far exceeds the number of experts in Information Retrieval (IR) and search. That gap may look like a problem, but it also presents an opportunity for growth and innovation: we've seen this pattern play out before, with each new technology or technique maturing through a similar process.
Take the development of language models, for example. Initially, these systems were unoptimized and only worked in limited contexts. However, as they evolved, researchers and developers began to optimize them for better performance, leading to significant breakthroughs. This same pattern is likely to repeat itself with other areas of AI research.
One area that's particularly relevant to the development of AI systems is Information Retrieval (IR): finding specific information within a large collection of data. It is a ubiquitous, mainstream use case. Companies need to extract value from vast amounts of data, and that requires expertise in IR and search. Rather than every company hiring specialized researchers or developers, however, we're likely to see off-the-shelf solutions that can be integrated into existing systems.
The question remains, though: will AI models be able to abstract away the complexity of IR and search, allowing us to focus on higher-level tasks? The answer is uncertain, but the need for effective retrieval and search capabilities will only continue to grow. As we move forward, we can expect significant advances in this area, driven by researchers and developers working together.
The current state of affairs is that many developers are learning as they go, building their skills through trial and error. That is likely to continue for the foreseeable future, with a growing pool of practitioners emerging from the process and with more efficient and effective solutions arriving as the field matures.
One approach that's gaining traction is the use of pre-trained models and fine-tuning. Rather than training from scratch, developers adapt a pre-trained model to a specific task or domain, an approach that has already shown strong results in areas such as natural language processing (NLP) and computer vision.
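To make the fine-tuning workflow concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries. The model name, dataset, and hyperparameters are illustrative placeholders rather than recommendations, and a real project would add evaluation metrics and proper train/validation handling.

```python
# Minimal fine-tuning sketch: adapt a pre-trained model to a labeled text task.
# Assumes `pip install transformers datasets`; model, dataset, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")                      # any labeled text dataset works here
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)        # start from pre-trained weights

args = TrainingArguments(
    output_dir="finetuned-distilbert",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
    eval_dataset=tokenized["test"].select(range(500)),
)
trainer.train()                                     # adapts the pre-trained model to the new task
```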
Another area that's worth exploring is the use of specialized hardware and software architectures. As AI systems become more complex and demanding, we're likely to see significant advancements in this area. The development of specialized hardware, such as tensor processing units (TPUs), has already shown impressive results in certain applications. Similarly, advances in software architecture are allowing for more efficient and scalable deployment of AI models.
The future of AI and computer architecture is bright, with many exciting developments on the horizon. As we continue to push the boundaries of what's possible with AI, it's essential to explore the intersection of these two fields and consider how they can be used together to drive innovation and growth.
"WEBVTTKind: captionsLanguage: enall right everyone welcome to another episode of the twiml AI podcast I am your host Sam charington today I'm joined by Ed and enough Ed is Chief product officer at data Stacks before we get into today's conversation be sure to hit that subscribe button wherever you're listening to Today's Show Ed welcome to the podcast thank you it's great to be here I'm looking forward to our conversation we've got a bunch on the agenda we'll be talking Rag and Vector databases and assistance but before we do that I'd love to have you share a little bit about your background in fact we've got RPI in common yeah we do yes yeah um which uh I think is probably a very chilly place this time of year so it's been a while since I've been back I have you been there any recently uh I wouldn't say recently probably five years ago was the last time yeah so same for me same for me uh great school for those who haven't been there though small small great tech school but uh but but in Upstate New York and uh uh one of the reasons why I chose to move out to the West Coast when I graduated was was the winnner winners there absolutely so tell us a little bit about uh how you got from there to to here yeah so you know came out to the West Coast really wanted to to get into startups and you know everything that was going on and of course this was the early days of uh of things like internet well it was even pre- internet multimedia and all that um but uh but but shortly thereafter you know internet happened and was uh was over at wired in the early days doing uh the search engine and uh and then uh did a whole bunch of stuff started a company in in the Enterprise Java space called epic Centric uh that that had a great run uh went on to to do some other other cool stuff uh social media advertising blogging was at six apart uh for for a while the company that made uh movable type and type pad and uh and uh then went and uh ended up part of apy uh the API management company uh we we had a great run there too did an IPO got acquired by Google and uh uh and and after a few years at Google decided to uh to to come over to uh to data Stacks uh which is the company that makes Cassandra the Cassandra database and have been doing that for uh for for the last few years so a bunch of cool fun stuff primarily making stuff for for people that are are you know building websites building applications building content that that tends to be the type of of stuff I like to do I totally forgot about your epicenter employee at Plum Tree yes yes yeah so those that that was an exciting those were the days exactly awesome awesome so um tell us a little bit about uh you know data sex has been uh kind of active in uh helping organizations um kind of take on this challenge of using llms and and rag tell us about data sex's kind of angle in that sure yeah so as I mentioned you know data Stacks is is the company behind Cassandra and and Cassandra was really the original cloud native database so awful lot of companies whether you know Uber you know whether you're using Uber whether you're using Netflix Apple uh these are all companies that use the Cassandra database and and when you use do something like FedEx package tracking that's that's all on top of Cassandra that's all on data Stacks as well and so we we knew pretty early on that as people were looking to to First with ML and then as Ai and gen AI became a big thing we we knew that that was going to be pretty important that people would want to use the data that 
So we looked at how to add vector search capability to the database, and that's something we did both within Astra DB, our cloud service, and in open source, because everything we do also goes into open source: Cassandra 5.0, which is part of the Apache Software Foundation, has this vector query capability. When you want to get a database to work well with an LLM, and we'll get into this in depth when we talk about RAG, the starting point is to let the application retrieve information from the database with a vector-based query.
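As a rough illustration of what such a vector-based query looks like, here is a sketch against a Cassandra 5.0 or Astra DB cluster using the Python driver. The keyspace, table, and 384-dimension embedding size are illustrative, it assumes a reachable Cassandra node, and the exact CQL and driver behavior can vary by version, so treat this as a sketch rather than authoritative syntax.

```python
# Sketch: vector search in Cassandra 5.0 / Astra DB via the Python driver (cassandra-driver).
# Names, dimensions, and replication settings are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.documents (
        id int PRIMARY KEY,
        body text,
        embedding vector<float, 384>
    )
""")
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS ann_idx ON demo.documents (embedding)
    USING 'StorageAttachedIndex'
""")

query_vector = [0.02] * 384   # in practice, the output of your embedding model
rows = session.execute(
    "SELECT id, body FROM demo.documents ORDER BY embedding ANN OF %s LIMIT 5",
    [query_vector],
)
for row in rows:
    print(row.id, row.body[:80])
```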
Ed: We actually did a couple of iterations of this. At first we implemented it much like most of the other vector databases you see: we took HNSW, hierarchical navigable small worlds, and brought that to Cassandra. We've since switched to something called DiskANN, which is a different approach, more oriented toward optimizing disk I/O. We can talk about that too, but the goal is to give you Cassandra, one of the most scalable databases, built on technology from Facebook, Google, and Amazon, with scale-out data capability, and bring this vector capability to really large data sets. That's our angle on all of this.

Sam: Can you drill into HNSW versus DiskANN, what those mean and what their implications are?

Ed: I'll do my best; I probably won't do it justice, so please don't phone in to tell me I screwed it up. Let me talk a little about the origins and why everybody has been using these things. One of the striking things about how companies have approached the build-out of AI infrastructure is that they've been using techniques that have been around for a while. If you talk to folks, they'll tell you approximate nearest neighbor search has been around for a long time, and that better approaches are coming down the pike. But the HNSW implementation that most of the vector databases out there ended up using came out of Lucene. Lucene started as one of the original search engines, but at this point it's really the search library and indexing infrastructure a lot of folks build on. I won't name names, other than to say that most of the big names we all talk about when we talk about vector databases started with the code in the Lucene HNSW implementation, in some cases porting it to their language of choice, since not all of these systems are written on the same code base. That was a good starting point; the original vector databases named in the OpenAI blog post that kicked off the whole vector database race, and the ones that came out quickly afterwards, mostly began there.

The problem is that HNSW was not particularly optimized for disk I/O. As we started to deal with the performance of this, particularly with Cassandra being a distributed database, we looked at the structure, the levels and edges of the graph that gets constructed, which is where the hierarchical part comes from, and at how you partition it. This really comes home to roost in a distributed database: Cassandra takes a query, vector or not, and farms it out to a whole set of nodes, and people run Cassandra clusters with thousands of nodes. If you don't have something that maps that hierarchy, essentially where the semantic position of an item lives within the cluster, and that also reduces the number of disk I/O operations, you hit a performance wall.

One of the dirty secrets of vector databases is that they all perform really well on small data sets, because they effectively default to whatever fits in memory. In the early days, developers on Hacker News and elsewhere joked that for the demos people were doing, you could do everything in memory on your laptop and get better results. For Cassandra's role, we want to be the one handling really large data sets. One of our RAG demos uses the entire Wikipedia and Wikidata corpus, which is something like 500 million documents. You start to see the issues crop up with as few as a hundred thousand documents, and definitely once you're into the millions, and these trade-offs have very tangible results: not just performance, but fairly significant drop-offs in relevancy, where you just start getting junk back.
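The "everything in memory on your laptop" observation is easy to reproduce: for small corpora, exact brute-force similarity search is trivial and fast, and it is only at the scale Ed describes that approximate structures such as HNSW, and disk-aware ones such as DiskANN, earn their keep. A minimal exact-search sketch with NumPy and synthetic vectors:

```python
# Brute-force exact nearest neighbors with cosine similarity. Fine for small corpora,
# which is why small demos make every vector store look fast.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384)).astype(np.float32)   # 10k docs x 384-dim embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)       # normalize once

def top_k(query: np.ndarray, k: int = 5) -> np.ndarray:
    q = query / np.linalg.norm(query)
    scores = corpus @ q               # cosine similarity via dot product of unit vectors
    return np.argsort(-scores)[:k]    # indices of the k most similar documents

print(top_k(rng.normal(size=384).astype(np.float32)))
```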
Ed: All of the vector databases right now are in something of an arms race, because the business payoff matters. I won't spend a lot of time on the database business today, but all databases, whether it's us, whether it's Pinecone, whoever, are consumption-based businesses. We all love the hackathons, but the businesses are built on having very large data sets, and if we can't vectorize those very large data sets and make production RAG at scale possible, it's going to be really hard for anybody to build a business around this. So if you talk to anybody running a vector database, a lot of the focus is on how to make these really large data sets feasible and cost-effective; there's a big cost dimension to this as well. That's why we're really happy with DiskANN right now; we have a lot of evidence that it outperforms a lot of other options in real-world use. And we're already looking at approaches that will move beyond it, some of them specific to the distributed database space.

Sam: That was really useful. So DiskANN is an algorithm, an approach for doing approximate nearest neighbor on disk, presumably?

Ed: Exactly: an infrastructure-aware approach to approximate nearest neighbor. That's the principal difference. There are a whole bunch of other pieces to it that I'd do too crappy a job of explaining in detail, but the biggest practical difference is what we've seen running this as a service since early this year. We've had the privilege of seeing what happens when you do RAG in real-world settings, and when you start throwing a lot of I/O operations and disk accesses at the problem, you see things you wouldn't see otherwise. You wouldn't see them in a classic ChatGPT scenario, for example. Well, you would now, because ChatGPT actually does a lot of RAG as of a month or two ago, but you wouldn't have seen them in the earlier conversational scenarios that just threw everything at the model. To the extent a database was involved there, it was being used for conversation history, which is a part of RAG, but it's not where the bread and butter of RAG really is.

Sam: You made an interesting comment about the implications for not just performance but relevancy as you scale vector databases, and I'd love to hear you elaborate on that. For context, one of my observations, which I've shared on the podcast and elsewhere, is that it's really easy to get from zero to proof of concept with RAG and with LLM-based dialogue agents, but getting from that POC to a system you'd be willing to put in front of customers is a lot harder, and one of the big elements of that gap is relevancy: all the details that go into the embeddings and constructing a context for the LLM, but at heart it's a relevancy challenge. As a corollary, the folks I've run across who have deep experience in search, in getting search results tuned up, seem to get this and know how to fix it. Does that resonate with you? And talk a little more about relevance and the way you see it playing out in vector databases.
Ed: There are a couple of different pieces to it, and you're going to see a lot more of this in general as people talk about it. You've got precision, recall, and accuracy, you've got ways of measuring them, and you've got things like F1 that combine precision and recall, because they're trade-offs. These metrics fluctuate over time, and they fluctuate as a function of the data set you're operating over.

There are two ways of looking at this. Remember that most of the action around vector databases prior to this year was in using the vector database as a search engine. Go back in the Wayback Machine and look at what the names we all throw around today, Pinecone, Chroma, Weaviate, were doing a year ago: they were primarily positioning as search engines. The idea, rooted in search concepts as you pointed out, was that keyword search, which is what most folks are familiar with, has a bunch of limitations. It actually works well enough in most situations, but people wanted to move from keyword search to a semantic approach, and that's where vector search comes into play. And by the way, you can get really good recall with no LLM in the loop. Well, you do need a language model, but it's more of an LM than an LLM; nowadays we just call it an embedding model, and its only job is to reduce your text input into a vector you can run a similarity search against.

Now we're putting in much larger data sets, and we want to figure out how to measure how accurate the results are. Are we getting a lot of false positives when we search, things coming back that shouldn't? Between accuracy and precision we can calibrate the false positives. Then there's the recall side, which translates into the things that should have come back but didn't, the false negatives. This is overly reductive, but you can measure these things, and F1 combines precision and recall because there's a trade-off between them.
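For concreteness, the metrics Ed lists can be computed directly from the set of retrieved document IDs and the set of truly relevant IDs for a query; a minimal sketch:

```python
# Set-based precision, recall, and F1 for a single query's retrieval results.
def retrieval_metrics(retrieved: set, relevant: set) -> dict:
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0   # how much of what came back was right
    recall = true_positives / len(relevant) if relevant else 0.0        # how much of what was right came back
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(retrieval_metrics(retrieved={"doc1", "doc2", "doc3"},
                        relevant={"doc2", "doc3", "doc7", "doc9"}))
# -> precision 0.67, recall 0.5, F1 0.57
```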
Ed: People have created other measurements around these as well, and if you look over the next six months, this is where a lot of the action is going to be. It's not there yet, but I'll make a prediction for the new year: within a month or two, when you go to a vector database homepage, open source or commercial, you're going to see graphs of number of documents against precision, recall, and F1 score. Because once you get past the demos, past just loading a bunch of stuff in, that's where the action is.

That said, apart from the raw database capability, there's a big piece around data preparation that feeds directly into it. When people say "but in the real world...", they're right, and what they're saying is garbage in, garbage out. That's where we get into things like chunking, where I take a piece of content and break it into a set of pieces, not just because you want appropriately sized pieces from a storage standpoint, but more importantly because depending on how I chunk a document, I can lose context that the smartest LLM in the world is not going to be able to put back. Chunking is exactly what the name implies: taking, say, a large PDF and breaking it into bite-sized pieces, and there's a big difference between doing that naively and doing it with an understanding of the structure of the document. Chunking at ingestion time is where you see a lot of the work. Look at the frameworks people use for this, LangChain and LlamaIndex: they put a lot of effort into the ingestion stage, and that's part of the reason LlamaIndex is named what it is. As an aside, you can see the difference in philosophy between the two projects: LangChain, as the name implies, is about chaining LLM invocations, though it also does a bunch of ingestion; LlamaIndex also does orchestration, but as the name implies it was really about getting that content in and building the indexes. When you look at the folks behind those projects and hear Harrison or Jerry talk, you can tell they each started from a different problem they were trying to solve.

When we sit down to build a RAG app, we have to think about the entire pipeline, and both ends of it become big problems. Depending on where you want to improve the accuracy of the results from the vector database, some people will correctly say to focus on getting the data in well, because that's going to have a big impact on your scores; other folks will say that's true, but a big piece of it is also how you break apart and construct the context used to generate the vectors you look up on, and what post-filtering you do. The answer is that both are very important.
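A toy contrast between the naive and structure-aware chunking Ed describes is sketched below; real splitters in frameworks like LangChain and LlamaIndex are considerably more sophisticated, so this only illustrates the failure mode.

```python
# Naive fixed-size chunking vs. a structure-aware split on blank lines (paragraphs).
def chunk_fixed(text: str, size: int = 40) -> list[str]:
    # May cut sentences or clauses in half, losing context the LLM can't recover.
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_paragraph(text: str, max_chars: int = 40) -> list[str]:
    chunks, current = [], ""
    for para in text.split("\n\n"):                  # respect the document's own structure
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

doc = ("Section 1. Definitions.\n\nThe parties agree as follows...\n\n"
       "Section 2. Term.\n\nThis agreement begins on...")
print(chunk_fixed(doc))          # cuts mid-sentence
print(chunk_by_paragraph(doc))   # keeps sections intact
```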
Sam: Since you offered up predictions: do you think we'll get to a point where this all happens automatically in the infrastructure, where a user just brings their documents and gets great results, or will there always be a degree of tuning and tweaking needed to get desirable results out of a RAG-type system?

Ed: We could go for a while on that one; it's a bit of an optimist-versus-pessimist Rorschach test. It's a good question, and as we tease it apart we're going to get into a bunch of the subjects I know you wanted to talk about. So yes: the Holy Grail is that the process of taking my data and using it within a generative AI conversational experience becomes simple. And as we've seen, conversational doesn't just mean chat anymore: we've got multimodal, we've got drawings and pictures, and dynamically generating a graphical user interface is something we can do now. You and your listeners probably already know that gen AI doesn't just mean chat; chat just happens to be the simplest way to prototype.

The first really important piece, and this is not self-evident to a lot of folks, in fact a lot of folks in the AI research domain don't fully get it, is that real-world AI, particularly for business use cases, needs you to be able to bring your own data. And that data, and this is really important, is not a static corpus. This is why people are using RAG: we're talking about live data that is changing, that is often confidential or proprietary, things like electronic medical records or people's financial statements. That data is never going to get fine-tuned into a model. It is never going into the model, which means it's always going to live in infrastructure around the model, and once you accept that, you're talking about RAG or some variant of RAG. Earlier this year there was a charmingly naive debate about RAG versus fine-tuning, as if there were a single answer, and you had really smart people saying things like "once we have fine-tuning we won't need RAG." You are never going to train a model on people's electronic medical records or bank statements. If you want models leaking personal information, that's how you get models leaking personal information. Most people get that now, but at the time it was the intersection of the fact that a lot of people were very focused on building these really cool models and hadn't zoomed out to ask how we're going to use this stuff in more applied ways.
So this year has been the collision course of research and applied AI, in a really exciting way, and you see it playing out on a day-by-day basis. Then the question becomes: what do I have to do to get my data into a form I can effectively retrieve? That's where the frameworks come in. I mentioned a few, and there are plenty more, but the kings at this point are LangChain and LlamaIndex, which have been moving really, really fast. You get a lot of people complaining about their code bases, and I say: yes, but these folks are following the stack, and you've got to admire that.

As a consequence, there is a lot of trial and error. So, going back to your question of whether somebody is just going to make this dead simple: ultimately yes, and a lot of people are working on it. But right now, when you do a RAG project, you spend a whole lot of time on work that isn't dissimilar from the ML projects you and I were looking at and involved in in previous years. There are ways a gen AI project this year is completely different, but the way it's very similar is that, for all the talk about the cool stuff, the day-to-day is a lot of data engineering, which is a euphemism for data cleansing and a whole bunch of data munging, and it's part of why people use Python: just getting data from the state people have it in into a form that yields the best results.

Will that go away? Yes, absolutely. People are going to put wrappers around what currently requires a lot of spaghetti architecture, and you'll see a whole bunch of companies focus on just that, because you've got data sitting in the systems people already use, whether that's streaming architectures, relational databases, or non-relational databases, and all of it needs to flow into this stuff. More importantly, that's all your structured data. Where AI really starts to blow things away is with unstructured data, and for unstructured data the hello-world for most RAG frameworks and most vector databases is "chat with PDF." The PDF is at this point the canonical piece of unstructured data everybody tests with, not the most interesting, but the most pervasive. And then the question is how you take it apart.
That's where you get into the world of mundane data: is this an insurance claim, a legal contract, a research report? Every one of those has a set of heuristics, and they may be defined procedurally. In the case of a legal document, for example, you don't actually need AI to take it apart; every contract has much the same structure, so you can do a bunch of string munging to extract and chunk it, and people do. There are entire startups built on that. Or you can do something more clever, which is to have an ingestion loop where you're feeding the content into the LLM and having the LLM guide the ingestion flow, so the LLM is supervising the dismantling of this unstructured content into chunks that you can later grab at RAG time. I'm sure a lot of your listeners are in the middle of RAG hell right now and are hopefully nodding along: "I just did that this week."

So yes, all of that has to go away eventually. Part of the problem is that everybody is learning as we go, and you don't want to prematurely optimize and automate for what the use case was last week, which was just the hello-world app, when what we try to do next week is more complicated. Every time somebody brings in a new data set or a new type of data, the questions are: can we reuse the approach we used last time, can we abstract it, can we build a new framework? Hence there's a new RAG framework every week.
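A sketch of the "LLM supervising the ingestion" idea from a moment ago: a model call classifies the document, and that classification selects the chunking strategy. The call_llm function is a toy stand-in for whatever model endpoint you actually use, and none of this is a specific framework's API.

```python
# Sketch of an LLM-guided ingestion loop: a model call classifies the document and the
# classification selects the chunking strategy. `call_llm` always answers "legal_contract"
# here so the sketch runs end to end; replace it with a real model call.
def call_llm(prompt: str) -> str:
    return "legal_contract"

def classify_document(text: str) -> str:
    prompt = ("Classify this document as one of: insurance_claim, legal_contract, "
              "research_report, other.\n\n" + text[:2000])
    return call_llm(prompt).strip()

def chunk_contract(text: str) -> list[str]:
    # Contracts are highly regular, so plain string munging on numbered sections often suffices.
    head, *sections = text.split("\nSection ")
    return [head.strip()] + ["Section " + s.strip() for s in sections]

def chunk_generic(text: str, size: int = 800) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(text: str) -> list[str]:
    kind = classify_document(text)
    if kind == "legal_contract":
        return chunk_contract(text)
    return chunk_generic(text)          # fall back to a generic splitter for anything else

sample = "Agreement.\nSection 1. Definitions.\nSection 2. Term of the agreement."
print(ingest(sample))
```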
Sam: That's a fair and interesting point: you could over-optimize on PDF retrieval, get too anchored there, and totally miss multimodal, for example. There's a danger in premature optimization, which is a truism in software.

Ed: Right, your example is exactly it. Most of what people are doing right now is smart knowledge bases, which, by the way, is not a bad thing: that's a lot of use cases, a lot of practical applications, a lot of happy users, and a lot of businesses are going to save and make a lot of money just having AI knowledge bases. For those, the content being sourced is support contracts and similar documents, mostly in PDF. But multimodal, which we now obviously have in GPT-4 and even in the open models, is the next phase of magic, to be a little handwavy. It's also RAG-based, or can be, but it has a completely different ingestion flow. Really two ingestion flows, because most of what we've been talking about is data-prep-time ingestion, while multimodal also has a real-time ingestion flow. You see it in all the examples people are sharing; I'm drawing a blank on the name of the drawing program everybody's using, where you draw a sketch and it turns it into a working interface.

Sam: Right, I know the one you mean.

Ed: The way the magic works in that particular use case is that they built what is basically an ingestion plugin: it selects the piece of the drawing area, and they've wired it in so it gets sent up to GPT-4. That's a nice hello-world example, but the minute I start applying it to different use cases, okay, how do I capture this, is it something from my screen, is it something from this app, there's a whole bunch of software engineering that goes into it.

Sam: Interesting. We've talked a lot about RAG. A question I've been grappling with is: is this vector capability a feature, or is it a new platform? When folks think about vector databases, for better or for worse, they think about some of these upstart companies, but there's pgvector for Postgres, DataStax now has a vector capability, and all of the traditional database vendors will have one. The question is still open for me.

Ed: I'm at a vector database company, so I think about this multiple times a day. The TL;DR is that it's going to be both. The longer answer is that you get a new type of database when a confluence of new things creates an opening for one. Go back fifteen years or so: people were building with new languages on the client, predominantly JavaScript; new kinds of APIs, predominantly REST; and new runtimes on the back end, things like Node.js and other dynamic languages. What you ended up with was JavaScript Object Notation, JSON, on the client, JSON on the wire, and JSON on the server, so it made sense to have JSON in the database. That end-to-end story created an opportunity. Mongo was not the only JSON database, but there was a need, at least one pure JSON database emerged from it, and Mongo is doing just fine. At the same time, JSON became a data type: Postgres has wonderful JSON support now, and so do most other databases. So if you'd asked fifteen years ago whether JSON was a feature or a new type of database, the answer was both. Were there ten new JSON databases? There were, but there's only one we remember, and of the current batch of vector databases, one or two are going to be the MongoDB of this age.

Sam: I like that answer, and it makes a lot of sense. But at the same time it is also a feature: a bunch of other vendors are going to add these capabilities, and everyone is going to bring their own special sauce.
Your special sauce is horizontal scaling; someone else's might be an underlying document orientation; someone else's will be something else.

Ed: The really important piece, and this is my refrain these days, is to follow the stack, follow what people are building, because that's what matters. Going back to the example, the important part wasn't JSON as a data type; it was JSON in queries. Postgres added a JSON type and has improved it over time, but the thing Mongo did better than anybody else wasn't just storing and retrieving JSON, it was treating JSON as a first-class citizen when you did a query. It's the same thing now. From a product strategy standpoint I look at it and ask: what's the equivalent of that? Everybody at this point has added a vector-indexed column, and that's a very good starting point. But remember, all of the vector databases out there, the ones everyone associates with the pure plays and the ones that added the capability, did this before RAG became a thing. So now the question becomes: follow the stack, follow the application. Are you adding features designed to make RAG better? That's why I'll go back to my earlier prediction: the vector database website is going to have two halves. One is the recall stats, because that's the new headline stat; databases used to talk about how many requests per second they could handle, and now it's going to be that many requests per second at this level of precision and recall. The other half of the page is going to be all about RAG, because that's the canonical use case: how do they make it uniquely easier? That's where all the innovation is going to be, and that's where you'll see the difference between the databases that treat it as a checkbox feature and the ones a developer actually picks. When I sit down to write something, open source or commercial, it doesn't matter, I'm going to go to the ones where I can tell the team is focused on making it easier for me to build a RAG application. So, long and short of it: yes, it is both, it is a feature, but you'll also have one or two folks who knock it out of the park and build a business on it. That's always how it goes when we see this pattern.
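The "JSON in queries, not just JSON as a column" distinction Ed draws can be illustrated even with SQLite's built-in JSON functions, which ship with the sqlite3 module in recent Python builds; the point is that the database can reach inside the stored document at query time rather than treating it as an opaque blob.

```python
# JSON as a first-class query citizen: filter on a field *inside* the stored document.
import json, sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, doc TEXT)")
db.executemany("INSERT INTO users (doc) VALUES (?)", [
    (json.dumps({"name": "Ada", "plan": "pro"}),),
    (json.dumps({"name": "Sam", "plan": "free"}),),
])

rows = db.execute(
    "SELECT json_extract(doc, '$.name') FROM users "
    "WHERE json_extract(doc, '$.plan') = 'pro'"
).fetchall()
print(rows)   # [('Ada',)]
```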
Sam: Is there an infrastructure element to this? Clearly there is, but more specifically: I was having this same conversation with someone, and they seemed to suggest that some vector databases are GPU-native, or GPU-enabled, and take advantage of the GPU, while others, the implication being the ones that aren't pure plays, can't. Is that something you're seeing?

Ed: When we look at the process of retrieving data from a vector database, your goal is not to have to do a whole bunch of GPU-dependent vector comparisons. Some of that comes down to how efficiently you built your index, but anybody who is hitting the GPU in an unbounded way for vector retrieval at query time, you could make the argument that's a good thing to do, and certainly the GPU vendors get very excited about it, but it's not what I'd call GPU-native in a useful sense. At query time you are going to hit the embedding model; your goal is to hit it once, not on a per-row basis.

Sam: Meaning when you embed the query, when you turn the query into a vector.

Ed: Yes. It gets a little more complicated than that, but at the basic level I'm going to take my input and run it through an embedding model, and it's going to generate my embedding, my vector. You can involve the GPU in the vector comparisons and retrieval themselves, but as I said, you'll put yourself in a cost-prohibitive situation. Cost is the other key metric here: a lot of things that work really well on your laptop price you out of going to production.

Sam: You mentioned that your customers have thousands of nodes. If all of those have to have GPUs, that's another class of infrastructure cost.

Ed: Exactly. So yes, you do hit the embedding model, and by the way, that becomes a big selection problem. Generally your embedding model is partnered with, derived from, or optimized for the main model you're going to use at generation time, though it doesn't have to be. What happens, and I know you've talked a lot about RAG and LLM chaining, the reason we have things called LangChain, is that typically I ask something of the agent, and my request gets broken down, and one of the things it gets broken into is a set of vectors for the things I want to know more about, which gets farmed out into a set of queries. Maybe I get five vectors, maybe fifteen, and I do the lookups for all the things my generation model should know about when it produces the answer.
That first model doesn't have to pair with the second model. Often, when you're doing things with OpenAI, you'll use the embedding model that's vector-compatible with the generation model, but you don't have to, because you just retrieve everything and feed it textually into the generation model. By the way, your simplest RAG setup has two LLM invocations, but the reason people call it LangChain, where the name comes from, is that you can have multiple LLM iterations with branching, and you get into things like chain-of-thought for very complex or instruction-following answers, where you can do some really cool stuff.

But back to your original question, because I got a little off track, as I'm prone to do. Do I actually need a GPU in the database retrieval loop? No, I don't want one there. The bigger question is: as I'm doing insertions or queries, can the database invoke the embedding model directly, or do I have to do it in my application tier? That's more of a convenience thing, but it's an important one, and at this point most of the vector databases offer it as a capability. We do as well.

Sam: Meaning they'll take the text, as opposed to the vector, and do the embedding for you.

Ed: Yes. The other thing you're going to see, and this will be a big deal next year, is that a lot of databases are going to offer a natural-language query capability, because it turns out these models do a very good job of text-to-SQL and text-to-query-language generation. That's going to be really interesting when it happens, because it's going to blur the lines between NoSQL databases and SQL databases. A lot of effort goes into creating your queries today, and the foundation model providers are putting a lot of effort into this. The models already do a very good job, because whether Google is using its own crawl data set or everybody else is using Common Crawl, there are a lot of SQL priors on the web. You can use Llama 2, not even Code Llama, just regular Llama 2, and get very good SQL.
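A sketch of the natural-language-query idea: hand the model the schema and the question, and ask for a query back. The generate function is a stub for whichever model you use, the schema is made up for illustration, and generated queries should be reviewed or sandboxed before they are executed.

```python
# Sketch: natural-language-to-SQL with a generic LLM. `generate` is a stub for a real model call.
SCHEMA = """
CREATE TABLE orders (
    id INT PRIMARY KEY,
    customer_id INT,
    total_usd DECIMAL,
    created_at TIMESTAMP
);
"""

def generate(prompt: str) -> str:
    raise NotImplementedError("replace with a call to your model of choice")

def nl_to_sql(question: str) -> str:
    prompt = (
        "You translate questions into a single SQL query.\n"
        f"Schema:\n{SCHEMA}\n"
        f"Question: {question}\n"
        "Return only the SQL."
    )
    return generate(prompt)

# Example, once a real model is wired in:
# print(nl_to_sql("What was the average order total last month?"))
```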
Sam: When we were talking about embeddings, about embedding text, you mentioned it's more complicated than that. What was underneath that comment?

Ed: There are a couple of pieces: the ingestion standpoint and then the query piece. From the ingestion standpoint, we generate the embedding vector at inserts or upserts, that is, when we create a new record or update something. A lot happens in the application tier with chunking, which we covered: figuring out the relevant pieces. But we also get into which embedding model to use, because the dimensionality of the vector has consequences from both a cost and a performance standpoint. For cost purposes, a lot of people end up wanting to use a smaller model, say one that produces 300-dimension vectors. OpenAI is of course the gold standard; if I've got unlimited money, that's what I'd use, but it gives you a roughly 1,500-dimension vector, and each of those dimensions is a floating-point number, so it's a big thing to store and compare. So ideally maybe I use one of the small 300-dimension models off Hugging Face. The problem is that whatever I use at ingestion time is also the first model, or one of the first models, I invoke at query time. Now I've got a trade-off, because that first model is taking your raw input, "where should I go for lunch?", plus all the additional injected context, and these smaller models will give you that list of vectors to look up, but they may not necessarily be very smart about it.
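The dimensionality trade-off is largely arithmetic: each dimension is typically a 4-byte float stored for every chunk, before any index overhead or replication. A quick back-of-the-envelope with illustrative numbers:

```python
# Raw embedding storage for N chunks at different dimensionalities (float32).
def embedding_storage_gb(num_chunks: int, dims: int, bytes_per_dim: int = 4) -> float:
    return num_chunks * dims * bytes_per_dim / 1e9

for dims in (384, 768, 1536):
    print(f"{dims:>5} dims, 100M chunks: {embedding_storage_gb(100_000_000, dims):,.0f} GB")
# ~154 GB at 384 dims vs ~614 GB at 1536 dims, before the index itself.
```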
Ed: What happens in that first pass is that you're building the context. Users think in terms of prompts, but the LLM takes a context, which has your prompt plus all of the additional information you choose to supplement it with. All of that gets fed into the second model, which typically restates your original question. Depending on what you're trying to do, say you're trying to get a zero-hallucination kind of result, you may have a system prompt that says: you're going to get a question from the user, you can answer it, but you're only going to use the additional information I supply. So the context has the system prompt, the user prompt, and then a set of the RAG retrievals. But that set of retrievals is only as good as what the embedding model was able to retrieve, and if that embedding model isn't very smart, then, particularly in the situation where I'm limiting the response to what came back from the vector database, the answer might not be very good.

Then we get into the chaining situations, the stuff the LangChain folks love showing off in their demos because it's really cool, or for that matter the AutoGPT-style stuff that puts an outer loop around the whole thing. There you may be doing several steps: prompt generation, reduction, summarization, looking up a list of vectors, feeding that back in, and then prompting again. The cool part is that it's powering a conversation, not just a magic "give me the best answer": it comes back and says, "Hey Sam, did you mean this, this, or this?", because through one of these chain-of-thought branching structures it decides it needs to go back to the user and ask, and then it continues, rinse and repeat. The quality of each one of those steps matters.

And by the way, we call these embedding models, but one of the important things about RAG, and I say this as somebody who wants people to use a vector database, is that the models can generate a lot of other types of lookups. You might also ask for a set of keywords for a conventional search. The vector is not the be-all and end-all. The important part of all of this is that it's about iteratively building a smart context, and vector lookup is one of your best tools for building that context, but there are other forms of lookup as well. This is where the intersection of the vector database, the system prompts, and the logic around them, your LangChain-ing, comes in. One might imagine a trip-planning app where part of the chain is "give me a list of zip codes to look up." That's perfectly valid, and it's not a vector lookup, it's just a zip code lookup. Or "give me a numeric range of prices": the model may have been tuned on a concept like affordability, so that if the user wanted something low-cost, in this context that means between $5 and $25, in which case that becomes a separate lookup against a product catalog or restaurant pricing. The options are endless, and you already see that a lot of the apps people are building are domain-specific, built around constructing domain-specific context. It would be great to imagine you could just throw more AI computing power at this, but you can't. The model can't solve it on its own; it's a conversation between the model and the data, some of which is happening in the background while you're sitting there waiting for the response. Anyway, I don't know if I gave you too much information; maybe my answers should be mediated by an LLM first to make them more concise.

Sam: No, this is great stuff, and I appreciate the additional context on what you're seeing from a vector database and RAG perspective.
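Pulling those pieces together, here is a sketch of the "build a smart context" step: a restrictive system prompt, vector-retrieved chunks, and one non-vector lookup (a hard-coded price range standing in for a catalog query) all feed the final generation call. Every helper below is a toy stub, not any particular vendor's API.

```python
# Sketch: assembling a RAG context from a restrictive system prompt, vector-retrieved
# chunks, and a plain structured lookup. All three helpers are toy stubs.
def embed(text: str) -> list[float]:
    return [float(len(text) % 7)] * 8            # stub: a real embedding model goes here

def vector_search(query_vector: list[float], k: int = 3) -> list[str]:
    # Stub: a real ANN query against your vector store goes here.
    return ["Tacos El Norte: casual, $8-12 entrees.", "Bistro Lumiere: tasting menu, $95."][:k]

def generate(system_prompt: str, user_prompt: str) -> str:
    return f"[model answer constrained by {len(user_prompt)} chars of context]"   # stub LLM call

SYSTEM_PROMPT = (
    "Answer the user's question using ONLY the supplied context. "
    "If the context does not contain the answer, say you don't know."
)

def price_range_for(term: str) -> tuple[int, int]:
    # A non-vector lookup: 'affordable' resolved against a plain table, not embeddings.
    return {"affordable": (5, 25), "splurge": (50, 200)}.get(term, (0, 10_000))

def answer(question: str) -> str:
    chunks = vector_search(embed(question))                     # one embedding call, one ANN lookup
    low, high = price_range_for("affordable")
    context = "\n\n".join(chunks) + f"\n\nPrice range: ${low}-${high}"
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(SYSTEM_PROMPT, user_prompt)

print(answer("Where should I go for an affordable lunch?"))
```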
and dat data data cleansing data integration a lot of stuff that that by the way is not hugely a lot of fun but but the minute um you know that's it it's still it's a data engineering project we have we have a whole bunch of stuff that's very model specific and and one can lose themselves in uh in in the model domain because it's it's fascinating and so on but then you just got a whole bunch of software engineering and architecture around this stuff and right the the the part that makes it hard is that like when you start getting like like each one of those like sure there's you know half a million um uh data U uh data engineers and data scientists in the world um which by the way is not a huge number given that well you know given that there's about 25 million developers in the world right so so we look at that and so then we go and say the number of people understand software architecture you know pretty well and and whatever and you've got so so you know who can who can AR architect around these things and that's probably about million developers and then you have like number of actual like AI model data scientists which is probably actually you know at this point uh probably 100,000 to 200,000 right and then youve the problem is is like the intersection of those and you start to get down into like very small number of them the majority of whom are at hackathons in San Francisco right and so and I think this is kind of the heart of my probably my first question which is you know there's another element of this diagram which is like you we talk about rag like it's this new thing but that retrieval like that's information retrieval we've been studying this for decades and there's a smaller number of people that are really experts at ir and have been working on search and are we going to need that expertise for these systems to kind of fully meet their potential or are we going to be able to abstract that away or will the llms be able to do that for us so that you know I don't need to tweak my embeddings you know my chunking and my context and my hierarchy and all that stuff it's one of these that's I I come back to that because it's a question that I've been thinking a lot about recently and asking a lot of people about trying to come up with some uh for looking thoughts on no and I you're asking the right question and and so the good news is there there will be a ton of uh they'll I mean we've seen this plenty of times with every everything you know uh you know everything new like the first iterations of it so so you look at what happened over the last year right like the first thing was you had this stuff and it was unoptimized um so when we look at geni it was it was it was seriously unoptimized but it worked won say maybe it just barely worked but it was it was on unoptimized which meant it was expensive as hell right then we get we get to you know and then we get to to say midy year um and quantization comes onto the scene and it makes it possible to now like actually go and run your models on consumer gpus and then shortly thereafter you saw optimization coming in that allowed it to actually run without needing a GPU like they're you know you can actually it's not very fast but you can run this stuff on on Intel and so what you're see and and then by the way we all love python but python you know is like or is a magnitude slower the because it's not meant to be a performance language right um and so part of what you're see let see so first thing you see is the optimiz ation phase 
All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington, and today I'm joined by Ed Anuff. Ed is chief product officer at DataStax. Before we get into today's conversation, be sure to hit that subscribe button wherever you're listening to today's show. Ed, welcome to the podcast.

Thank you, it's great to be here. I'm looking forward to our conversation.

We've got a bunch on the agenda. We'll be talking RAG, vector databases, and assistants, but before we do that I'd love to have you share a little bit about your background. In fact, we've got RPI in common.

Yeah, we do, which I think is probably a very chilly place this time of year. It's been a while since I've been back. Have you been there recently?

I wouldn't say recently; probably five years ago was the last time.

Same for me. Great school, for those who haven't been: a small, great tech school, but in Upstate New York, and one of the reasons I chose to move out to the West Coast when I graduated was the winters there.

Absolutely. So tell us a little bit about how you got from there to here.

Sure. I came out to the West Coast because I really wanted to get into startups and everything that was going on. This was the early days of the internet, or even pre-internet multimedia, but shortly thereafter the internet happened, and I was over at Wired in the early days doing the search engine.
Then I did a whole bunch of stuff. I started a company in the enterprise Java space called Epicentric, which had a great run, and went on to do some other cool things: social media, advertising, blogging. I was at Six Apart for a while, the company that made Movable Type and TypePad, and then ended up part of Apigee, the API management company. We had a great run there too, did an IPO, and got acquired by Google. After a few years at Google I decided to come over to DataStax, the company behind Cassandra, the Cassandra database, and I've been doing that for the last few years. A bunch of cool, fun stuff, primarily making things for people who are building websites, building applications, building content. That tends to be the kind of work I like to do.

I'd totally forgotten about Epicentric and the Plumtree days.

Yes, those were exciting times.

Awesome. So tell us a little bit about this: DataStax has been active in helping organizations take on the challenge of using LLMs and RAG. Tell us about DataStax's angle on that.

Sure. As I mentioned, DataStax is the company behind Cassandra, and Cassandra was really the original cloud-native database. An awful lot of companies use it: when you're using Uber, when you're using Netflix or Apple, these are all companies on the Cassandra database, and when you do something like FedEx package tracking, that's all on top of Cassandra and on DataStax as well. We knew pretty early on, as people were looking first at ML and then as AI and gen AI became a big thing, that this was going to be important: people would want to use the data they already had in these systems that power all of those interactions, and they'd want to add AI to it. So we looked at how to add vector capability, vector search, to the database. We did it both within Astra DB, our cloud service, and, because everything we do is also open source, in Cassandra 5.0, which is part of the Apache Foundation and has this vector query capability.

When you want to get a database to work well with an LLM, and we'll get into this in depth when we talk about RAG, the starting point is to let the application retrieve information from the database with a vector-based query. We actually did a couple of iterations of this. At first we implemented it much like most of the other vector databases you see, using HNSW, the hierarchical navigable small world approach to vector queries, and we brought that to Cassandra. We've since switched to something called DiskANN, which is a different approach that's more oriented toward optimizing disk I/O; we can talk a little about that too. But the goal here is to give you Cassandra, one of the most scalable databases, built on technology from Facebook, Google, and Amazon, with scale-out data capability, and to answer the question of how we bring this vector capability to these very large data sets. That's really our angle on all of this.
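To make the vector query capability described here a bit more concrete, the sketch below shows what a vector column and an approximate-nearest-neighbor query look like in Cassandra 5.0 style CQL, issued through the DataStax Python driver. The keyspace, table, and column names are hypothetical, and the dimension (1536 here) depends entirely on the embedding model you pick; treat it as the shape of the API under those assumptions, not a tested deployment.

```python
# Minimal sketch (hypothetical keyspace/table names) of Cassandra 5.0-style
# vector search through the DataStax Python driver (cassandra-driver with
# vector support).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_keyspace")

# A table with a vector column sized to the embedding model's dimensionality.
session.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        doc_id    text PRIMARY KEY,
        body      text,
        embedding vector<float, 1536>
    )
""")

# A storage-attached index enables approximate-nearest-neighbor (ANN) queries.
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS documents_embedding_idx
    ON documents (embedding) USING 'StorageAttachedIndex'
""")

# At query time: embed the user's question elsewhere, then ask Cassandra for
# the closest stored vectors.
query_vector = [0.01] * 1536  # stand-in for a real embedding
rows = session.execute(
    "SELECT doc_id, body FROM documents ORDER BY embedding ANN OF %s LIMIT 5",
    (query_vector,),
)
for row in rows:
    print(row.doc_id)
```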
Can you drill into HNSW versus DiskANN, what those mean and what their implications are?

Sure, I'll do my best on that, though I probably won't do it justice, so please, folks listening, don't phone in to tell me I screwed it up. Let me talk a little about the origins and why everybody has been using these things. One of the amazing things, as companies have approached the build-out of AI infrastructure, is that they've been using a lot of stuff that's been around for a while. If you talk to folks, they'll say approximate nearest neighbor has been around for a while, and in fact there are better approaches coming down the pike. The HNSW implementation that most of the vector databases out there ended up using came out of Lucene. Lucene was one of the original search engines; at this point it's really the search library and index-creation infrastructure that a lot of folks use. I'll get the details wrong, so I won't name names, other than to say that most of the big names you hear when we talk about vector databases started with the code in the Lucene HNSW implementation. In some cases they ported it to their language of choice, because not all of these systems share a code base, but they started with that, and it was a good starting point, whether you were one of the original vector databases named in the OpenAI blog post that kicked off the whole vector database race or one of the people who came out very quickly afterwards.

The problem was that HNSW was not particularly optimized for disk I/O. What we found, as we started to deal with the performance of this, particularly with Cassandra, which is a distributed database, was that you have to look at the structure, essentially the levels and edges of the graph that gets constructed, because you end up breaking things down; that's where the hierarchical part comes from. Then you end up having to partition, particularly on a cluster. This really comes home to roost with a distributed database, because of the way something like Cassandra works: it takes a query, vector or non-vector, and farms it out to a whole set of nodes. People running Cassandra have thousands of nodes. So if you don't have something that lets you map that hierarchy, where the semantic position of something lives in the cluster, and that also reduces the number of disk I/O operations, you end up hitting a performance wall.
One of the dirty secrets of vector databases is that they all perform really well on small data sets, because they effectively default to what's in memory. In the early days, when these vector databases were coming out, a lot of developers on Hacker News and elsewhere would joke that for the demos people were doing, you could do everything in memory on your laptop and get better results. That's where a lot of what we focus on comes from. For Cassandra's role in this, we want to be the one handling really large data sets. For example, one of our RAG demos uses the entire Wikipedia and Wikidata data set, which is something like 500 million documents. You can actually start to see the issues crop up with as few as a hundred thousand or so documents, but definitely once you get into the millions you see that these trade-offs have very tangible consequences, not just for performance but for relevancy. You start to see fairly significant drop-offs in relevancy; you start to get junk back.

All of the vector databases right now are in an arms race, because the business payoff depends on it. The other important comment about vector databases, and databases in general, and I won't spend a lot of time on the database business today, is that all databases, whether it's us, whether it's Pinecone, whoever, are consumption-based businesses. We all love the hackathons, but the businesses are built on very large data sets, and if we can't vectorize those very large data sets and make production RAG at scale possible, it's going to be really hard for anybody to build a business around this. That's why, if you talk to anybody running a vector database, so much of the focus is on making these really large data sets feasible and cost-effective; there's a big cost dimension as well. And that's why these choices matter. We're really happy with DiskANN right now, and we have a lot of data showing it outperforms a lot of the other options in the real world, but we're already looking at different approaches that will move beyond it, some of them specific to the distributed database space.

That was really useful. So DiskANN is an algorithm, an approach for doing approximate nearest neighbor on disk, presumably?

Exactly. It's an infrastructure-aware approach to approximate nearest neighbor; that's the principal difference. There are a whole bunch of other pieces to it that I would do too crappy a job of detailing, but the biggest practical difference we've seen comes from experience: we've been running this as a service since early this year, so we've had the privilege of seeing a lot of what happens when you do RAG in a real-world setting.
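As a concrete version of the point above about small data sets flattering every vector database, here is the kind of exact, in-memory search that looks great in a laptop demo. It is a naive sketch, not how an index like HNSW or DiskANN works, and it falls over as soon as the corpus stops fitting in RAM.

```python
import numpy as np

def top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact, in-memory nearest-neighbor search by brute force."""
    # Normalize so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)[:k]  # indices of the k most similar vectors

# 100k vectors at 384 dims is only ~150 MB, so everything fits in memory and
# looks fast and accurate; the hard part starts when it no longer does.
corpus = np.random.rand(100_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)
print(top_k(query, corpus))
```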
When you start throwing a lot of I/O operations and disk accesses at the problem, you see things you wouldn't see otherwise, and that you wouldn't see in a classic ChatGPT scenario. Well, you do now, because ChatGPT actually does a lot of RAG as of a month or two ago, but you wouldn't have seen it previously in the purely conversational scenarios that just threw things at the model. Inasmuch as the database was involved there, it was being used for conversation history, which is a part of RAG, but it's not where the bread and butter of RAG really comes home.

You made an interesting comment about the implications for not just performance but relevancy as you scale vector databases, and I'd love to hear you elaborate on that. For a bit more context, one of my observations, which I've shared on the podcast and elsewhere, is that it's really easy to get from zero to POC with RAG and with dialogue agents based on LLMs, but getting from that POC to a system you'd be willing to put in front of customers is a lot harder, and one of the big elements of that gap is relevancy: all the details that go into the embeddings and constructing a context for the LLM. At its heart it's a relevancy challenge. As a corollary, the folks I've run across who have deep experience in search and in tuning search results seem to get this and know how to fix the problem. I'm wondering, first, does that resonate with you, and second, can you talk more about relevance and how you see it playing out in vector databases?

There are a couple of different pieces to it, and you're going to see a lot more of this as people talk about this stuff. You have your precision, your recall, your accuracy; you've got ways of measuring those; and you've got things like F1 that combine precision and recall, because they're trade-offs. These fluctuate over time, and they fluctuate as a function of the data set they're operating over. There are really two ways of looking at this. Remember that most of the action around vector databases prior to this year was in using the vector database as a search engine. If you go back in the Wayback Machine and look at what the names we all throw around today, Pinecone, folks like Chroma, folks like Weaviate, were doing a year ago, they were primarily looking at being search engines.
The idea, and as you pointed out a lot of this stuff is rooted in search, was that keyword search, which is what most folks are familiar with, has a bunch of limitations. It actually works well enough in most situations, but generally people wanted to move from a keyword approach to a semantic approach, and that's where vector search comes into play. By the way, you can get really good recall with no LLM in the loop; well, you do need a model, but it's more of an LM than an LLM, and nowadays we just call it an embedding model. Its only purpose is to reduce your text input to a vector that you can run a similarity search on.

So now we're putting in much larger data sets, and we want to figure out how to measure, for example, how accurate the results are. Am I getting a lot of false positives when I search? Between accuracy and precision we can calibrate the false positives, and then the recall piece translates into the things that should have come back but didn't, which is more about false negatives. This is overly reductive, but the point is you can measure these things. I also mentioned F1, which combines precision and recall, because there's a trade-off between them and you want to see both, so people have created combined measurements. Over the next six months, this is where a lot of the action is going to be. It's not there yet, but I can predict that within a month or two, and there's my prediction for the new year, when you go to a vector database homepage, open source or commercial, you're going to see all these great graphs showing number of documents against precision, recall, and F1 score, because once you get past the demos and past loading a bunch of stuff in, that's where the action is.
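For readers who want to pin down the measurements being described, here is a minimal sketch of per-query retrieval metrics. The relevance judgments are hypothetical; in practice they come from a labeled evaluation set.

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one query, given the ids actually
    retrieved and the ids a human judged relevant."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: 3 of the 5 retrieved chunks were relevant,
# but 2 relevant chunks were missed entirely.
print(retrieval_metrics(["a", "b", "c", "d", "e"], {"a", "c", "e", "x", "y"}))
# {'precision': 0.6, 'recall': 0.6, 'f1': 0.6}
```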
Now, that said, apart from the raw database capability of producing these results, there's a big data prep piece that feeds directly into it. When people say "but in the real world," what they mean is garbage in, garbage out. That's where we get into things like chunking, where I take a piece of content and break it into a set of pieces, not just so I get an appropriately sized piece from a storage standpoint, but more importantly because, depending on how I chunk the documents, I can lose a lot of context. Chunking is exactly what the name implies: taking, say, a large PDF and breaking it into bite-sized pieces. If I break things apart naively rather than with an understanding of the structure of the document, I can lose context that the smartest LLM in the world will not be able to put back. So chunking itself, on the ingestion side, is where you see a lot of work. If you look at the frameworks people use for this, LangChain and LlamaIndex, they put a lot of effort into the ingestion stage, and that's part of the reason LlamaIndex is named what it is. As an aside, you can see the difference in philosophy between the two projects. LangChain, as the name implies, is about chaining LLM invocations; LlamaIndex does that too, and LangChain does a bunch of ingestion work, but LlamaIndex, as the name implies, was really about getting that content in and building indexes, though of course it also does orchestration. When you listen to the folks behind those projects, Harrison or Jerry, you can tell they each started with a different problem to solve.

When we sit down to build a RAG app, we have to think about the whole flow, and both of these become big problems. Depending on where you want to improve the accuracy of your results from the vector database, some people will, correctly, say to focus on getting the data in well, because that has a big impact on your scores. Other folks will say that's true, but another big piece is how you break apart and construct the context used to generate the vectors you look up on, and what post-filtering you do. The answer is that both are very important.
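To ground the chunking discussion, here is a minimal sketch of the two extremes described above: a naive fixed-size splitter that can cut a document mid-thought, and a structure-aware splitter that keeps paragraphs intact. Frameworks like LangChain and LlamaIndex ship far more sophisticated versions of both; this is only meant to show why the choice matters.

```python
def naive_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character windows: simple, but happily cuts sentences and
    tables in half, which no downstream model can repair."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def paragraph_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Structure-aware splitting: pack whole paragraphs into each chunk so the
    retrieved pieces still read as coherent units."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```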
Since you offered up predictions: do you think we'll get to a point where all of this happens automatically, in the infrastructure, and a user just brings their documents and gets great results? Or will there always be a degree of fine-tuning and tweaking needed to get desirable results out of a RAG-type system?

We could go for a while on that one; it's a bit of an optimist-versus-pessimist Rorschach test. It's a good question, and as we tease it apart we'll get into a bunch of the subjects I know you wanted to talk about. What's our holy grail? The holy grail is that the process of getting my data used within a gen AI conversational experience just works. And as we've seen, conversational doesn't just mean chat anymore, because we've got multimodal, we've got drawings and pictures, and dynamically generating a graphical user interface is something we can do now. I think you and your listeners already know that gen AI doesn't just mean chat; chat just happens to be the simplest way to prototype.

The first really important piece, and this is not self-evident to a lot of folks, in fact a lot of folks in the AI research domain don't fully get it, is that real-world AI, particularly business use cases, requires you to bring your own data. And that data, this is really important, is not a static corpus. What people are trying to do, and this is why people are using RAG, involves data that is live, that is changing, and that is often confidential or proprietary. Things like electronic medical records or people's financial statements are never going to get fine-tuned into a model. They are never going into the model, which means they will always live in the infrastructure around the model. Once you accept that, you're talking about RAG or some variant of it. Earlier this year there was a charmingly naive debate about RAG versus fine-tuning, as if there were one answer, and you had really smart people saying things like "once we have fine-tuning we won't need RAG." My response was: you are never going to train a model on people's electronic medical records or bank statements. If you want models leaking personal information, that's how you get models leaking personal information. Most people get that now, but it was the intersection of a lot of people being very focused on building these really cool models without zooming out to ask how we're going to use them in more applied ways. This year has been about the collision of research and applied AI in a really exciting way, and you see it playing out day by day.

So we know that much. Then the question becomes: what do I have to do to get my data into a form I can effectively retrieve? That's where the frameworks come in. I mentioned a few, and there are plenty more, but the kings at this point are LangChain and LlamaIndex, which have been moving really fast. You hear a lot of people complaining about the code bases, and my answer is: yes, but these folks are following the stack, and you have to admire that, because that's what it's all about. As a consequence, though, there's a lot of trial and error. I see a lot of RAG projects, and going back to your question, is somebody going to make this dead simple? Ultimately, yes; a lot of people are working on it. But right now, when you do a RAG project, you spend a whole bunch of time on it. In that way a gen AI project this year looks a lot like the ML projects you and I have been looking at and involved in over previous years, and then there are a lot of ways in which they're completely different.
The way they're similar is that, for all the talk about the cool stuff, the day-to-day is really a lot of data engineering, which is a euphemism for data cleansing and, again, part of why people use Python: a whole bunch of data munging just to get the data from the state people have it in into a form that yields the best results. Will that go away? Absolutely. People are going to put wrappers around what currently requires a lot of spaghetti architecture, and you'll see a whole bunch of companies focused on exactly that, because there's a lot of data sitting in the systems people already use, whether that's streaming architectures, relational databases, or non-relational databases, and all of it needs to flow into this stuff.

More importantly, that's all your structured data. Where AI really starts to blow things away is with unstructured data. The hello world for most RAG apps, RAG frameworks, and vector databases is chat-with-PDF, and the PDF is at this point the canonical piece of unstructured data everybody tests with, not the most interesting but the most pervasive. Then the question is how we take it apart. You get into the world of mundane data: is this an insurance claim, a legal contract, a research report? Every one of those has a set of heuristics that may be defined procedurally. For a legal document, for example, you don't actually need AI to take it apart, because most contracts have roughly the same structure, so you can do a bunch of string munging to extract and chunk it, which people do; there are entire companies and startups built on that. Or you can do things more cleverly and build an ingestion loop where you feed the content to the LLM and have the LLM guide the ingestion flow, so the LLM is supervising the dismantling of this unstructured content into chunks that you'll later, at RAG time, be able to grab.
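The idea of the LLM supervising the dismantling of a document can be sketched roughly as below. The prompt, model name, and splitting logic are hypothetical stand-ins rather than any particular framework's API; a production ingestion loop would be considerably more careful.

```python
from openai import OpenAI

client = OpenAI()

def llm_guided_sections(document_text: str) -> list[str]:
    """Hypothetical sketch: ask a model to propose section boundaries for an
    unstructured document, then chunk along those boundaries instead of
    blindly splitting every N characters."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model could stand in here
        messages=[
            {"role": "system",
             "content": "Split the document into self-contained sections. "
                        "Return one section heading per line, in order, "
                        "copied verbatim from the text."},
            {"role": "user", "content": document_text[:20_000]},
        ],
    )
    headings = [h.strip() for h in resp.choices[0].message.content.splitlines() if h.strip()]

    # Use the proposed headings as split points (naive string search for brevity).
    chunks, start = [], 0
    for heading in headings[1:]:
        idx = document_text.find(heading, start)
        if idx > start:
            chunks.append(document_text[start:idx])
            start = idx
    chunks.append(document_text[start:])
    return chunks
```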
I'm sure a lot of your listeners are in the middle of RAG hell right now, so they're probably nodding: "yeah, I just did that this week." All of that has to go away, but part of the problem is that everybody's learning as we go, and you don't want to prematurely optimize and automate for what the use case was last week, which was just the hello-world app, when what we try to do next week is something more complicated. Every time somebody brings in a new data set, or a new type of data, the question is whether we can reuse the approach we used last time, whether we can abstract it, whether we can build a new framework. Hence there's a new RAG framework every week.

That's a fair and interesting point. You could over-optimize on PDF retrieval, get too anchored in it, and totally miss multimodal, for example. There's a danger to premature optimization, which is kind of a truism in software.

Right, and your example is exactly right. Most of what people are doing right now is smart knowledge bases, which is not a bad thing: lots of use cases, lots of practical applications, lots of happy users, and a lot of businesses will save and make a lot of money just having AI knowledge bases. For those, a lot of the content being sourced is support contracts and similar documents in PDF. But multimodal, the stuff you see people building now, because we obviously have multimodal in GPT-4 and you now have it even in the open models, is the next phase of magic, to be a little handwavy. That is also RAG-based, or can be, but it has a completely different ingestion flow; actually two ingestion flows. A lot of what we've been discussing is data-prep-time ingestion, but in multimodal you also have a real-time ingestion flow. You see it in all those examples where, I'm drawing a blank on the name of the drawing program everybody's using, where you draw the thing and it converts the sketch. Yeah, the sketch thing, right, you know what I'm talking about. The way the magic works in that particular use case is that they built a plugin that is basically an ingestion plugin: it selects a piece of the drawing area and sends it up to GPT-4. Again, that's a nice hello-world example. The minute I start applying it to different use cases, I have to ask how I capture the input: is it something from my screen, is it something from this app? There's a whole bunch of software engineering that goes into that.
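The real-time multimodal ingestion described here, selecting a piece of a drawing and shipping it to a vision-capable model, reduces to something like the following sketch. The file name and prompt are made up, and any multimodal chat model could stand in for the one shown.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Real-time multimodal ingestion in miniature: grab an image (here, a saved
# sketch), base64-encode it, and hand it to a vision-capable chat model.
with open("sketch.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Turn this wireframe sketch into a working HTML/CSS mockup."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```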
Interesting. So we've talked a lot about RAG. A question I've been grappling with: is a vector database, this vector capability, a feature, or is it a new platform? When folks think about vector databases, for better or worse, they think about some of these upstart companies, but there's pgvector for Postgres, DataStax now has a vector capability, and all of the traditional database vendors will have one. The question is still open for me.

Obviously I'm at a vector database company, so I think about this multiple times a day. My best thinking right now draws from a couple of things. The tl;dr is that it's going to be both. The longer answer is that you get a new type of database when a confluence of new things creates an opening for one. Go back fifteen years or so: you had people building with new languages, predominantly JavaScript; using new types of APIs, predominantly REST; and building a new runtime on the back end, things like Node.js and other dynamic languages. So you had a new language on the client, new ways of moving data, and a new runtime on the server, and the data format tying it together was JavaScript Object Notation, a.k.a. JSON. You had JSON on the client, JSON on the wire, JSON on the server, and then it made sense to have JSON in the database; that end-to-end story created an opportunity. Mongo was not the only JSON database, but there was a need, at least one pure JSON database emerged, and Mongo is doing just fine. At the same time, JSON is a data type: Postgres has wonderful JSON support now, and so do most other databases. So if you asked fifteen years ago whether JSON was a feature or a new type of database, the answer was both. Were there ten new JSON databases? There were, but there's only one we remember. So out of the current batch, one or two are going to be the Mongo of this age.

I like the analogy; that makes a lot of sense. But at the same time it is also a feature: a bunch of other people are going to add these capabilities, and everyone's going to bring their own special sauce. Your special sauce is horizontal scaling; someone else's might be an underlying document orientation; someone else's will be something else.

I think the really important piece, and this is my refrain these days, is to follow the stack, follow what people are building, because that's where the people doing this right are focused. Going back to the example: the important part wasn't JSON as a data type; the important part of what Mongo did was JSON as queries. Postgres added a JSON type and has improved it over time, but the thing Mongo did better than anybody else wasn't just storing and retrieving JSON, it was treating JSON as a first-class citizen when you ran a query. It's the same thing now. From a product strategy standpoint, I ask what the equivalent of that is today. Everybody has added vector as an indexed column, and that's a very good starting point, but remember that all of the vector databases out there, whether the ones everyone thinks of as pure plays or the ones that added the capability, did this before RAG became a thing. So now the question becomes: follow the stack, follow the application.
Are you adding features that are designed to make RAG better, and what's involved in that? We've spent a bunch of time talking about it and could go into even more detail, but that's the other piece. Going back to my prediction about vector database websites, it's basically already true: the site is going to have two halves. One half will show recall stats, because that's your new headline number; databases used to brag about how many requests per second they could handle, and now it's going to be this many requests per second at this level of precision and recall. The other half of the page is going to be all about RAG, because that's your canonical use case: how do they make it uniquely easier? That's where all the innovation is going to be, and that's where you'll see the difference from the databases that treat it as just a feature. As a developer, when I sit down to write something, I'm going to pick the ones, open source or commercial, it doesn't matter, where I can tell the team is focused on making it easier for me to build a RAG application. So, long and short of it: yes, it is a feature, but you'll also have one or two folks who knock it out of the park and build a business on it, and that's always how this plays out.

Is there an infrastructure element to this? Clearly there is, but more specifically: I was having this conversation with someone, and they suggested that some vector databases are GPU-native or GPU-enabled and take advantage of the GPU, while others, the implication was the ones that aren't pure plays, can't. Is that something you're seeing?

When we look at the process of retrieving data from a vector database, your goal is not to have to do a whole bunch of GPU-dependent vector comparisons. Some of that comes down to how efficiently you've built your index, but anybody who's hitting the GPU in an unbounded way for vector retrieval at query time is in trouble. You could argue it's a good thing to do, and the GPU vendors certainly get excited about being "GPU native," but it's not where you want to be. At query time you are going to hit the embedding model; your goal is to hit it once, not on a per-row basis.
Meaning when you embed the query, when you turn the query into a vector?

Yeah. It gets a little more complicated than that, but yes. At the most basic level, I take my input and run it through an embedding model, and it generates the embedding. The vector comparisons, the vector retrieval itself, you can involve the GPU in, but as I said, you'll put yourself in a cost-prohibitive situation. That's the other key metric: cost. A lot of things that work really well on your laptop price you out of going into production, because they simply cost more than they're worth.

You mentioned your customers have thousands of nodes. If all of those have to have GPUs, that's another class of infrastructure cost.

Exactly. So yes, you do hit the embedding model, and by the way, that becomes a big selection problem. Generally your embedding model is partnered with, derived from, or optimized alongside the main model you'll use at generation time, though it doesn't have to be. What happens, and I know you've talked a lot about RAG and LLM chaining, the reason we have things called LangChain, is that typically I ask something of the agent, and my request gets broken down into a set of vectors for the things I want to know more about, which gets farmed out into a set of queries. Maybe I get five vectors, maybe fifteen, and I do the lookups for all the things my generation model should know about when it produces the answer. That first model does not have to pair with the second model. Often, when you're working with OpenAI, you'll use the embedding model that's compatible with the generation model, but you don't have to, because you retrieve everything and feed it textually into the second model. By the way, your simplest RAG flow has two LLM invocations; the reason people call it LangChain, where the name comes from, is that you can have multiple LLM iterations with branching, and you get into things like chain-of-thought for very complex or instruction-following answers, where you can do some really cool stuff.

But back to your original question, because I got a little off it, as I'm prone to do. Do I actually need the GPU in the database retrieval loop? No, I don't want a GPU there. The bigger question is whether, as I'm doing my insertions or my queries, the database can invoke the embedding model directly, or whether I have to do it in my application tier. That's more of a convenience thing, but it's an important one, and at this point most of the vector databases offer it as a capability; we do as well.

Meaning they'll take the text, as opposed to the vector, and do the embedding for you?

Yeah.
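As a sketch of the query-time flow just described, hit the embedding model once and let the database do the lookup, here is the shape of it in Python. The `vector_search` helper is hypothetical; it stands in for whatever client call your vector store exposes, and some databases will accept raw text and run the embedding server-side instead.

```python
from openai import OpenAI

client = OpenAI()

def retrieve(question: str, vector_search) -> list[str]:
    """Query-time retrieval: embed the user's input once, then do the
    (CPU-friendly) approximate-nearest-neighbor lookup in the database.
    `vector_search` is a hypothetical wrapper around your vector store."""
    query_vector = client.embeddings.create(
        model="text-embedding-3-small",  # any embedding model; dimensions vary
        input=question,
    ).data[0].embedding

    # No GPU needed from here on: the ANN index does the heavy lifting.
    return vector_search(query_vector, limit=5)

# Some databases accept raw text and invoke the embedding model for you,
# which simply moves the step above out of the application tier.
```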
The other thing you're seeing, and this will be a big deal next year, is that a lot of the databases are going to offer a natural language query capability, because it turns out these models do a very good job of text-to-SQL and text-to-query-language generation. We're going to see that as well, and it's going to be really interesting, because it will blur the lines between, for example, NoSQL databases and SQL databases. A lot of effort goes into writing queries right now, and a lot of foundation models are putting effort into this. They already do a very good job, because whether Google is using its own crawl data set or everybody else is using Common Crawl, there's a lot of SQL on the web. You can use Llama 2, or for that matter Code Llama, but even plain Llama 2 will give you very good SQL, which is an interesting thing to see.

When we were talking about embeddings, embedding text, you mentioned it's more complicated than that. What was underneath that comment?

There are a couple of pieces to it; let's take the ingestion standpoint first and then the query piece. From the ingestion standpoint, we generate the embedding vector at inserts or upserts, when we either create a new record or update one. A lot happens in the application tier with chunking, where we're figuring out the relevant pieces, but we also get into which embedding model to use, because the dimensionality of the vector has real consequences from both a cost and a performance standpoint. What you see is that, for cost reasons, a lot of people want to use a smaller model, maybe 300 dimensions. OpenAI is of course the gold standard; if I've got unlimited money, that's how I'd do it, but it's going to give me a roughly 1,500-dimension vector, and each of those dimensions is a floating-point number, so it's a big object. Ideally, maybe I use one of the small 300-dimension models off Hugging Face. The problem is that whatever I use at ingestion time is also the first, or one of the first, models I invoke at query time. So now I've got a trade-off, because that first model is taking your raw input, "hey, where should I go for lunch?", along with all the additional injected context, and these smaller models, yes, they'll give you a list of vectors to look up, but they may not necessarily be that smart about it.
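To put rough numbers on the dimensionality trade-off, here is the back-of-the-envelope arithmetic for storing raw float32 vectors at the two sizes mentioned. Index overhead, replication, and any compression are ignored, so real footprints will be larger.

```python
def raw_vector_storage_gb(num_vectors: int, dimensions: int, bytes_per_float: int = 4) -> float:
    """Raw storage for the embeddings alone, ignoring index structures."""
    return num_vectors * dimensions * bytes_per_float / 1e9

for dims in (300, 1536):  # a small Hugging Face model vs. a large hosted one
    gb = raw_vector_storage_gb(10_000_000, dims)
    print(f"{dims} dims: ~{gb:.1f} GB for 10M chunks")
# 300 dims:  ~12.0 GB for 10M chunks
# 1536 dims: ~61.4 GB for 10M chunks
```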
What happens in that first pass is that the goal is to build the context. Users think in terms of prompts, but the LLM takes a context, which contains your prompt plus all the additional information you choose to supplement it with. All of that gets fed into the second model, which typically restates your original question, but depending on what you're trying to do, if you're going for a zero-hallucination type of result, you may have a system prompt. So the system prompt says, "you're going to get a question from the user"; underneath it is the prompt, "here's Sam's question"; and the system prompt adds, "you can answer this question for Sam, but you're only going to use the additional information I supply." The context is therefore system prompt, user prompt, and then the set of RAG retrievals. But the set of RAG retrievals is only as good as what that initial model, what we call the embedding model, was able to retrieve, and if that embedding model is just not very smart, then, particularly when I'm limiting my response to the material retrieved from the vector database, the result might not be very good.

Then we get into the chaining situations, the stuff people love showing off in their demos, particularly the LangChain folks, because it's really cool, or for that matter the AutoGPT-style approaches, which put an outer loop around the whole thing. There it may not just be one pass; you may do several LLM invocations: input prompt generation, reduction, summarization, looking up a list of vectors, feeding that in again, then prompting back. The cool part is that it's powering a conversation, not just a magic "give me the best answer." It comes back and says, "hey Sam, did you mean this, this, or this?" Through one of these chain-of-thought branching structures it decides to go back to the user and ask, then carries on; rinse and repeat. But the quality of each of these steps matters, and by the way, we call these embedding models, but one of the things that's important about RAG, and I say this as somebody who wants people to use a vector database, is that the models can generate a lot of other types of lookups. You may also ask, "give me a set of keywords for a conventional search." The vector is not the be-all and end-all. The important part of all of this is iteratively building a smart context. Vector lookup is one of your best tools for building that context, but other forms of lookup matter as well.
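A minimal sketch of the context structure described above: a system prompt that restricts the model to supplied material, the user's question, and the retrieved chunks. The prompt wording is illustrative, not a recommended template.

```python
def build_context(user_question: str, retrieved_chunks: list[str]) -> list[dict]:
    """Assemble the context: a system prompt that limits the model to the
    supplied material, followed by the user's question. The retrieved chunks
    are numbered so the model can cite which piece it relied on."""
    supplied = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return [
        {"role": "system",
         "content": "You will receive a question from the user. Answer it using "
                    "ONLY the additional information supplied below. If the answer "
                    "is not there, say you don't know.\n\n" + supplied},
        {"role": "user", "content": user_question},
    ]

# Example: these messages would be passed to the generation model as-is.
messages = build_context("Where should I go for lunch?",
                         ["Cafe A: sandwiches, $8-12.", "Bistro B: tasting menu, $95."])
```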
This is where the intersection of the vector database, the system prompts, and the logic around them, your LangChain-style orchestration, comes in, because part of it might be: imagine a trip-planning app, and the model says, "by the way, give me a list of zip codes to look up." That's perfectly valid. It's not a vector lookup, it's just a zip code lookup, and that's fine. Or, "give me a numeric range of prices." The model may be fine-tuned on concepts; we may have some concept of affordability the model can interpret, so that if the user said they wanted something low-cost, in this context that means between $5 and $25, in which case that might be a separate lookup against a product catalog or a restaurant pricing source. The options are endless. You already see that a lot of the apps people are building are very domain-specific, built around constructing domain-specific context. It would be great to imagine that I could just throw more AI computing power at this, but you can't. Well, you can, but the model can't solve it on its own. It is a conversation between the model and the data, some of which is happening in the background while you're sitting there waiting for the response.

So anyway, I don't know if I gave you too much information. Maybe my answers should be mediated by an LLM first to make them a little more concise.
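A hypothetical sketch of the "other forms of lookup" idea: the model proposes a small retrieval plan, and the application resolves each piece, a vector query, a zip-code list, a price range, before the final generation call. Every helper name here is invented for illustration.

```python
def resolve_retrieval_plan(plan: dict, vector_search, catalog) -> list[str]:
    """Turn a model-proposed retrieval plan into context snippets.
    `plan` might look like:
      {"vector_query": "casual lunch places",
       "zip_codes": ["94105", "94107"],
       "price_range": {"min": 5, "max": 25}}
    `vector_search` and `catalog` are hypothetical data-access helpers."""
    context: list[str] = []
    if query := plan.get("vector_query"):
        context += vector_search(query, limit=5)                   # semantic lookup
    if zips := plan.get("zip_codes"):
        context += catalog.restaurants_in(zips)                    # plain keyed lookup
    if price := plan.get("price_range"):
        context += catalog.priced_between(price["min"], price["max"])  # numeric range
    return context
```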
So anyway, I don't know if I gave you too much information. Maybe my answers should be mediated by an LLM first, to get them a little bit more concise.

No, this is great stuff, and I appreciate the additional context on what you're seeing from a vector database and RAG perspective.

It's fun, because what it translates into is the intersection of a whole bunch of things. Look at what we've talked about: there's a whole bunch of very mundane data prep, data cleansing, and data integration, a lot of stuff that, by the way, is not hugely fun, but it's still a data engineering project. Then we have a whole bunch of stuff that's very model-specific, and one can lose themselves in the model domain because it's fascinating. And then you've got a whole bunch of software engineering and architecture around all of this. The part that makes it hard is the intersection. Sure, there are maybe half a million data engineers and data scientists in the world, which, by the way, is not a huge number given that there are about 25 million developers in the world. The number of people who understand software architecture pretty well, who can architect around these things, is probably about a million developers. And the number of actual AI model data scientists is, at this point, probably 100,000 to 200,000. The problem is that the intersection of those groups gets you down to a very small number of people, the majority of whom are at hackathons in San Francisco.

And I think this gets to the heart of probably my first question, which is that there's another element of this diagram. We talk about RAG like it's this new thing, but that retrieval is information retrieval; we've been studying this for decades, and there's a smaller number of people who are really experts at IR and have been working on search. Are we going to need that expertise for these systems to fully meet their potential, or are we going to be able to abstract it away? Will the LLMs be able to do that for us, so that I don't need to tweak my embeddings, my chunking, my context, my hierarchy, and all that stuff? I come back to that because it's a question I've been thinking a lot about recently, and asking a lot of people about, trying to come up with some forward-looking thoughts.

You're asking the right question, and the good news is we've seen this pattern plenty of times with everything new: look at the first iterations of it. Look at what happened over the last year. The first thing was you had this stuff and it was unoptimized. When we look at GenAI, it was seriously unoptimized, but it worked, or maybe it just barely worked, and being unoptimized meant it was expensive as hell. Then we get to, say, mid-year, and quantization comes onto the scene and makes it possible to actually run your models on consumer GPUs. Shortly thereafter you saw optimization coming in that allowed it to run without needing a GPU at all; it's not very fast, but you can run this stuff on Intel. And by the way, we all love Python, but Python is an order of magnitude slower, because it's not meant to be a performance language. So the first thing you see is the optimization phase.
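As a rough illustration of why quantization mattered so much in that timeline, here is the back-of-the-envelope arithmetic on weight memory at different precisions. The model sizes are illustrative, and real memory use is higher once you add activations and the KV cache; this is a sketch, not a benchmark.

```python
# Approximate weight-only memory footprint at fp16 vs. 4-bit quantization.

def model_memory_gb(params_billion, bits_per_weight):
    """Weight memory in GB, ignoring activations and KV cache."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9

for params in (7, 13, 70):
    fp16 = model_memory_gb(params, 16)
    int4 = model_memory_gb(params, 4)
    print(f"{params}B params: ~{fp16:.0f} GB at fp16, ~{int4:.1f} GB at 4-bit")

# 7B params:  ~14 GB at fp16,  ~3.5 GB at 4-bit   -> fits a consumer GPU
# 13B params: ~26 GB at fp16,  ~6.5 GB at 4-bit
# 70B params: ~140 GB at fp16, ~35.0 GB at 4-bit
```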
The second phase is the important part, and I love the fact that you tied it back to information retrieval and search, because that is an extremely ubiquitous and mainstream use case. Every company generates information and knowledge, and AI-enabling it can't be a "let me go hire some AI research grad students" proposition; it has to be an off-the-shelf proposition, and it has to happen where that information sits. So yes, that's going to happen too. All these things make sense, but I think right now we're at the point where a lot of it is roll-your-own, and it will be for probably another year or so. The good news, again, is that if you're worried about this stuff putting developers out of their jobs: no, you just need to start working with this stuff and learning it. There's going to be an awful lot of coding happening around this for a long time to come. It should get easier, and it will get easier, but as it gets easier, people are going to do harder things. Like you just pointed out, and I'm going to use that point, people are just getting to the stage where they can do text-based RAG in a fairly formulaic way, and now we've got multimodal.

Right, so there you go. Good stuff. Well, Ed, great conversation. Thanks so much for joining us and sharing your take on RAG and vector DBs.

Awesome, this was a lot of fun.