#221 Radar Recap - Scaling Data Quality in the Age of Generative AI

The Challenges of Measuring Data Quality: A Discussion with Industry Experts

Managing data quality is crucial for any organization, but it can be hard to put a cost on data quality issues. According to Prukalpa Sankar, founder of Atlan, accepting that not everything in data can be measured directly is an important first step. Data, she explains, is a support function inside an organization, much like business operations: "if an analyst produces a report that has a significant impact on the business, it's harder to put a cost on that individual's work, because it's two layers removed from the actual business outcome." Data drives strategy and execution, and those in turn drive business results.

To make data quality more tangible, the panelists recommend metrics that fit the audience. For executives, the draw is reputation, brand trust, and simply sleeping better at night knowing that the data powering the business is accurate. Data engineers and machine learning engineers, in contrast, tend to focus on time: how much of it is spent cleaning up data and handling fire drills versus building new pipelines.

One of the most significant costs associated with poor data quality is revenue loss. As Barr Moses notes, "one data issue can easily cost millions of dollars for an organization." She cites the example of an airline that published the wrong discount on a ticket: a customer purchased at the incorrect price, sued the airline, and recouped the money. Revenue is only part of the picture, though; teams also track team efficiency, the organizational time consumed by data-related firefighting.
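
Incidents like the airline's mispriced ticket are exactly what pre-publication validation rules try to catch. As a minimal sketch (the rule, the threshold, and the field names here are invented for illustration, not something described in the session), a guard on fares might look like this:

```python
from dataclasses import dataclass

@dataclass
class Fare:
    route: str
    base_price: float   # catalog price for the route
    sale_price: float   # discounted price about to be published

def validate_fares(fares: list[Fare], max_discount: float = 0.60) -> list[str]:
    """Flag fares discounted beyond a plausible threshold before they ship."""
    errors = []
    for f in fares:
        if f.sale_price <= 0:
            errors.append(f"{f.route}: non-positive sale price {f.sale_price}")
        elif f.sale_price < f.base_price * (1 - max_discount):
            discount = 1 - f.sale_price / f.base_price
            errors.append(f"{f.route}: {discount:.0%} discount exceeds the {max_discount:.0%} limit")
    return errors

# A fat-fingered fare is blocked before it reaches customers.
fares = [Fare("JFK-LHR", 820.0, 779.0), Fare("JFK-NRT", 1400.0, 14.0)]
for problem in validate_fares(fares):
    print("BLOCKED:", problem)
```

The point is not the specific rule but where it sits: a cheap check in the publishing path turns a revenue-and-lawsuit incident into a blocked record and an alert.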

To develop a comprehensive understanding of data quality costs, organizations should consider a range of factors beyond traditional ROI calculations. By acknowledging the importance of data quality and exploring alternative metrics, businesses can better understand the value of their data assets and make informed decisions about how to prioritize their investments.

The conversation around data quality is essential for any organization looking to improve its operations and drive business success. By embracing a nuanced understanding of data quality costs and exploring new approaches to measurement, organizations can unlock the full potential of their data assets and achieve lasting improvement.

Accepting that not everything in data can be measured directly is a crucial first step in addressing this issue. Prukalpa also offers a caution: if an organization needs to be convinced that it should have good data to make decisions, the first question worth asking is whether data is really important to that company at all. Where the outcome cannot be measured directly, she suggests falling back on output metrics: measures each role and team can actually track and improve, chosen because they are known to drive the outcome.
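
What do those output metrics look like in practice? A common trio is incident count, time to detection, and time to resolution, all computable from an incident log. The sketch below is illustrative; the record shape and field names are assumptions made for this example:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: when each issue started, was detected, was resolved.
incidents = [
    {"started": datetime(2024, 6, 1, 2, 0),
     "detected": datetime(2024, 6, 1, 9, 30),
     "resolved": datetime(2024, 6, 1, 14, 0)},
    {"started": datetime(2024, 6, 8, 11, 0),
     "detected": datetime(2024, 6, 8, 11, 5),
     "resolved": datetime(2024, 6, 8, 12, 0)},
]

def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600

mttd = mean(hours(i["detected"] - i["started"]) for i in incidents)
mttr = mean(hours(i["resolved"] - i["detected"]) for i in incidents)

print(f"incidents: {len(incidents)}")
print(f"mean time to detection:  {mttd:.1f} h")
print(f"mean time to resolution: {mttr:.1f} h")
```

Numbers like these trend in a useful direction (or not) quarter over quarter, which is what makes them workable stand-ins when the business outcome itself resists measurement.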

Which metric lands best depends on who you are talking to, Barr observes. With executives, reputation and brand trust sit at the top of the list; asked about ROI, an executive is likely to say that accurate data "just makes me sleep better at night." Data engineers and machine learning engineers, meanwhile, reach for time-based measures: how much of the week goes to cleaning up data and fighting fire drills rather than building new pipelines.

In general, Barr suggests three areas to consider when thinking about the cost of data quality issues: reputation and brand trust, revenue, and team efficiency. No single one captures the full scope. Brand damage is real but stubbornly hard to quantify, a single revenue-impacting incident can cost millions of dollars, and the steady drag on team time is easy to overlook. Taken together, though, they give organizations a workable picture of what poor data quality costs.

Prukalpa adds a reminder of how lopsided the upside can be. She recalls a CEO who hired a group of sales reps and, at the same time, a single analyst; that analyst found one optimization that earned the company an extra million dollars. The ROI of the analyst was "way more than 10 sales reps," yet it is the harder number to produce, precisely because the work sits two layers removed from the business outcome.
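
One way to make those three dimensions concrete is a back-of-the-envelope annual estimate. Everything below, from the formula to the inputs, is an illustrative assumption layered on top of the panel's three categories rather than a model anyone on the panel proposed:

```python
def annual_data_quality_cost(
    incidents_per_year: int,
    hours_to_detect: float,          # mean hours before an issue is noticed
    hours_to_resolve: float,         # mean hours to fix once detected
    engineers_per_incident: float,
    loaded_hourly_rate: float,       # salary plus overhead per engineer-hour
    revenue_incidents_per_year: int, # incidents that directly hit revenue
    avg_revenue_loss: float,
) -> dict:
    """Rough cost model over reputation, revenue, and team efficiency.

    Reputation is deliberately left unpriced: the panel's point is that it
    resists quantification, so this model only flags the exposure.
    """
    firefighting_hours = (
        incidents_per_year * (hours_to_detect + hours_to_resolve) * engineers_per_incident
    )
    team_efficiency_cost = firefighting_hours * loaded_hourly_rate
    revenue_cost = revenue_incidents_per_year * avg_revenue_loss
    return {
        "team_efficiency_cost": team_efficiency_cost,
        "revenue_cost": revenue_cost,
        "reputation_exposure": "unquantified; track customer-visible incidents",
        "total_quantified": team_efficiency_cost + revenue_cost,
    }

# Example with made-up inputs.
print(annual_data_quality_cost(
    incidents_per_year=120, hours_to_detect=6, hours_to_resolve=8,
    engineers_per_incident=1.5, loaded_hourly_rate=120,
    revenue_incidents_per_year=3, avg_revenue_loss=250_000,
))
```

Even a crude model like this gives executives and engineers a shared number to argue about, which is often the real unlock.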

In conclusion, measuring data quality is a complex task that requires a nuanced approach. No single metric captures its full scope, but by tracking reputation and brand trust, revenue at risk, and team efficiency, and by substituting output metrics where outcomes cannot be measured directly, organizations can build a credible picture of their data quality costs and invest accordingly.

"WEBVTTKind: captionsLanguage: enall right all right hello hello everyone and welcome to the final session of the day of data Camp radar on scaling data quality in the age of Jaren of AI we left the best for last so everyone do give us a lot of love in the Emojis as you can see here below and let us know where you're joining from I see more than 500 people in the session already so yeah do let us know uh where you're joining from uh especially and what you thought of data Camp radar day one and what you're excited about data Camp radar day too um so of course as organizations continue to embrace Ai and machine learning the importance of maintaining high quality data has never been more critical there are arguably no better people in the data business across the board than bar Moses pral Panka and George Frasier to come talk to us about data quality so first I'm goingon to introduce bar Moses she is the CEO and co-founder of Monte Carlo a pioneering company in data reliability and the creator of the data observability category uh Monte Carlo is backed by top VCS such as X Axel ggv red Point iconic grow Salesforce Venture and ivp bar it's great to see you thanks for having me next up is prala Sankar she is the founder of atlan prala is a leading modern data and AI governance company on a mission to enable better collaboration around data between business people analysts and Engineers she has been awarded the economic Times emerging entrepreneur of the Year Forbes 30 under 30 40 under 40 and the top 10 CNBC young businesswoman of 2016a it's great to see you thanks for having me awesome and last but not least is George Frasier CEO at F Tran uh George founded F Tran to help data Engineers simplify the process of working with disperate data sources he has grown fer to be the de facto standard platform for data movement in 2023 he was named a data Nami person to watch he also has a PHD in neurobiology George great to see you great to be with you and just a few housekeeping notes before we get started there will be time for Q&A at the end so make sure to ask questions by using the Q&A feature and vote for your favorite questions if you want to chat with the other participants use the chat feature we highly encourage you to engage in the conversation if you want to network and add folks on LinkedIn and share your LinkedIn profile they will be removed automatically but do join our LinkedIn profile that is linked in the chat as well and you can connect with fellow atendees and I think this is a great starting point to start today's session um you know it's safe to say that data quality is at the top of Mind of many data leaders today especially with the generative AI boom that we see um but maybe to set this stage how would you describe the current state of data quality within the industry within organizations today what do you think are the common challenges organizations are facing when it comes to maintaining high quality data bar I'll actually start with you sure I have lots of opinions on this topic trying not to hog the entire time um yes data quality well frankly let me start by saying it has been a problem and an issue in the space for the last couple of decades so nothing is New Right we've been complaining about data quality for a long time we shall continue to to complain about the quality of our data for a long time um however I do think a few things have changed um first and foremost you know obviously generative AI products you know being more um uh or being prevalent uh at least in 
terms of the desire to to build them um uh data are put under a lot of pressure um we actually put out a survey that showed we surveyed sort of a bunch of data leadered data leaders and 100% of data leaders were cited as um uh Under Pressure to deliver generative AI products uh no one said they are not being asked to build something um however uh only 70% of them just under 70 68% of them actually feel like their data is ready for generative AI so that means that while there's a ton of pressure from sea level and board and others in the market to actually build generative Ai No One or the large majority of people don't think that their data is ready ready for that and I think that poses a good question for us as an industry to figure out why that is the case um and my hypothesis is that what I would call the data state has changed a lot in the last five to 10 years so the way in which we process transform store data has changed a ton but the way in which we manage data hasn't changed at all and so that means that you know if you go back to the survey actually 50% of um those data leaders still use manual um sort of approaches to data quality and so while we've become a lot more sophisticated with what we demand from our data and from our data infrastructure we have not become more sophisticated in how we manage data quality um you know I think manual rules will and always be important but that they're not the end allv in fact it is just the starting point um and so I think this you know in short if I had to respond to what is the state of data quality today I think there are new problem it's sort of an old problem with new challenges that we have not cut up yet um I definitely have ideas on how how we need to solve that uh but I'll pause that uh I'll pause there for a minute and see if any reactions from my esteemed uh fellow panelists proa let you react here yeah I think I I agree with everything that said but I I think the one thing to like abstract this a little bit over like I think about this concept of Data Trust more than just data quality or like and maybe this is the reason you have the three of us in this panel but like you know the way I think about this is like if you think about that final layer of trust uh and you have a human who says this number on the stash BR is broken oh my God or it doesn't look right like what's wrong right it sounds like a very simple question it's actually a very difficult question to answer because the reason a number could be off could be because the f f Tran pipeline that day broke and then run it could be because uh it never happens what are you talking about or it could be because it could be because the data quality checks that day failed it could be because someone changed the way we measured annual recording revenue and like no one forgot like no one remembered to update the data consumer right and so if you think about this flow I almost think of it as you have data producers who actually kind of want to guarantee trust where sell service data like no data producer wants to spend their time answering the question of why a number is off and on the other hand like you have data consumers who actually want to use data like no one actually cares about the quality of the data like they actually just want to use the B like a data consumer cares about making business decisions and in the middle we have this Gap and the reason we have this Gap is because we have a proliferate it's almost self created problems we have created a significant number of 
tools that have been that that have scaled massively but we have a proliferation of tools we also have significant diversity in people so any single final dashboard probably had five people touch touch it this problem just gets worse in the AI era so at least if I was a human I look at the number and I'm like uh maybe the number doesn't look right and I can do something about it if I'm AI I don't do that and that can actually like lead to pretty significant position so I think the way we think about this uh and we sit on S that layer between the producers and the consumers and bringing this stuff together is what does it mean to create these data products finally like what makes something reusable and trustworthy uh and how can you bring context across from the pipeline from data quality from all of these layers in the stack like human context to solve the trust problem or the Gap okay that's really great and George I'll let you react you know to both bar helpus framework is right um that there's a lot of layers to the system and it matters a lot where the problem is AR Rising except the part about five trans R but I mean you would be you would we we try very hard to avoid contributing to the data uh quality problem in our layer you would not believe the amount of effort that goes on behind the scenes to try to chase down the long tail of replication out of sync buz that can happen with all the systems we support um we are not perfect I can only say that we are better than everyone else I think where it happens is very important in terms of troubleshooting um you ask like why despite all the efforts in this is this um is this you know still so top of mind and the answer is because it's impossible to fix uh data quality is not a problem you can ever fix uh there are so many layers and there are so so many ways for things to go wrong and sometimes the source of pro some of the later stage problems that poupel was just um talking about are things like you can't actually 100% know whether the person who checked out at your point of sales system was actually the same person who uh you know created an account on your website so you can never actually get to 100% uh with data quality you just have to manage it uh and you have to identify what are the what are the highest priority areas where is it most important that the numbers be right prioritize those work on those um but you you I think you've got to start out acknowledging it will never be perfect yeah yeah no I I agre I I think the thing on that is kind of what you said right that I think things will always like the reality of running like especially realtime Dynamic data ecosystems is that things will always break like there's it's likely that there will always be things that are because it's that like it's just the nature of the Beast and so that's why I think a lot about like when you think about trust trust wasn't actually big because something went wrong trust breaks because someone told you your stakeholder told you that something went wrong without you telling them actually something went wrong today maybe you should like and I think that's the element of trust which is it's one layer above something and I don't think the solution is trying to make sure nothing ever goes wrong the solution is how do you go one level above and make sure that you solve for trust and then how do you measure it and manage it toward him trust is word it is very hard to win and very easy to an example of this I heard a long time ago it's funny I'm I'm in 
New York right now and I met with somebody earlier today at Bloomberg actually I still had the ice coffee that I got while I was there uh and long time ago I don't know if you know what Bloomberg is but they do data feeds um for finance um it's a kind of data management um but it's it's data uh you know like stock prices commodity prices gas prices things like that many years ago when five Tran first started one of the things I learned is a key element of their business is that uh the um is that is not is is that it is accurate even in this obscure cases you know the price of beans in Korea or whatever it is like even the most obscure data feeds they are more accurate than anybody else and that is really important because if one thing is wrong one day out of the year that is a huge problem that's something we've always tried to emulate F5 Trend in a very different context replicating a company's own data but it it speaks to how it is when when you're in the when you're in any kind of data business um you can you can be the difference between zero errors and one error is bigger than the difference between one error and Infinity uh trust is so hard hard to win and so quickly lost and bar I'll let you react and then I'll ask my next question oh I was just going to say just reflecting on this like like I wouldn't be surprised if we would be sitting in a panel like in 10 years from now still having you know sort of similar discussions except the words change and definition change so maybe we call it you know trust or data quality or whatever now like hallucinations in the context of generative AI right um but the problem Remains the Same um I think one of the the sort of interesting questions to answer is like what or to Think Through is like what what are our you know customers who are now faced with what are the challenges that our customers today are faced with and how are they dealing with that and how is that different from you know a few years ago or honestly just like a year ago um and the reality is like these problems are just not going away and so figuring out how to address those um you know in a way that uh adapts to where customers are and meeting them where they are is is uh I think super important and I want to pick up on that point because you know bar you alluded to like you know me in a panel in 10 years I'd be very excited to have a data Camp radar AI Edition 2034 uh where we discuss this um and what it seems to me is that the goalpost on data quality is Shifting every year right so like organizations make rdes they make investments but then the Ambitions of what it means to have like high quality data also shift with that so the challenges also still remain the same so you mentioned this bar earlier in the discussion is that same problems different challenges what are those challenges today so I'd love to learn that from you yeah great question and I mean I'll start by saying look like if you you know you could in some world like if model output is wrong or you know you sort of you know you you're prompting with a question and and and the answer is is wrong is it better than to not have an answer at all like is no data better than bad data um maybe I I think so but also then what's the point of having you know kind of a Q&A or chat bot if it can't provide you an answer at all right um and so like to your question the definition of good what does good look like actually becomes tricky um and how do you define like what should we strive for changes I think um but you know to 
your particular question like what are the challenges or kind of pinpointing those um I think you know kind of kind of how I'm alluded to sort of how the data state has changed over time I think the historically what we've done you know when it comes to sort of trust to pra's point was really start with can we figure out about data issues before anyone else Downstream learns about it right whether that's you know in in in generative AI or not whether it could be in a dashboard um and so you know the the the thought is that if you know we can catch issues before others Downstream do we can sort of either repair that trust or rebuild that trust um I think what we're seeing right now is that the challenge that is definitely a very important challenge to um to address and I think detection capabilities have evolved a certain agree you know I sort of talked about manual solutions for that versus not I think sort of the big kind of like next leap here for building you know data quality Data Trust whatever you want to call it um is sort of going Beyond detection and taking the next step of sort of understanding how do you actually resolve how do you actually address these problems um and when you think about the root cause of these challenges that has changed too and so in the past like you know you really most data most teams really just f on sort of cleansing the data and once you cleanse the data and was fine you brought it in you're all good five Trend never breaks so you were okay um and and in in today's world five Trend still doesn't break sorry George I'm bigging on you but um uhh but the data sort of you know if you look at the data landscape it's become really complex it's like s super super complex for even for a small team to manage right and I think if you think about sort of the core pillars of what makes up the kind of like data State I would call it there's three things the first is the data itself so actually like the data sources whatever you know kind of you're ingesting the second is uh the code so you know code written by Engineers machine learning Engineers uh data scientists analytics Engineers Etc and then the third component is the systems or the infrastructure basically the jobs running all of that and so you have multiple teams mult bu building multiple complex webs of all three of those things the problem is that data can break as a result of each one of those three so it could be as a result of the data you know just that you ingested being totally incorrect it could be the result of you know bad code bad code could be like a bad join or schema change or it could be a system failure I won't name names but systems do fa do fail right could be any elt general elt solution they use and so actually like understanding that um in order to really build reliable products you have to look at and understand each each of those components you first of all have to have an overview and sort of visibility in each of these components and then also understand can you correlate between a particular data issue that you're experiencing and say a code change or an infrastructure change or anything like that that is really really hard to do today um and so what ends up happening is that data teams are inundated with lots of you know alerts or kind of you know quality detections data quality issues and they're all flying around between you know 20 to 30 different data teams and 10 different domains and go fix like who needs to address which problem so when or which alert um so you actually 
like you know down to like the brass tax of how do we handle this those are some of the challenges that I think really sort of figuring out how do we both have really strong detection of issues but then how do we go to the next step and actually figure out what is a root cause and honestly oftentimes it's more than just one root cause so it's typically you know this storm excuse my language with like every single thing breaking right like it'll be both a data a code and a system issue um and so you know when I think about how our systems can get more sophisticated or how we build more reliable um Data Systems it has to have a more sophisticated view of what's actually um uh what are the you know V components of that and what could break that's really great and maybe George from your perspective adding on top of what bar said what are the challenges that you're seeing today when it comes to U scaling data quality or like you know improving moving the needle on data quality well I mean we look at a very particular slice of this we look at the replication piece does the data in the central data warehouse match the data in the systems of reford be it um a database like postgress or Oracle or a app like Salesforce or workday um and we you know we we've we've come a long way with uh just centralization of effort just the fact that five pipelines are standard so everyone is running the same code there's this cumulative thing where over years we fix bugs that's the main tactic at the end of the day that we use to um identify and and and squash data Integrity issues we are experimenting with some new ideas to try to get that last little bit that last 0.1% is very hard um and they include uh the the most uh exciting idea right now is the idea of doing direct sampling for ver validation um so you know from when you're when you're in the business of replication data quality can be seen as you basically just need another sync mechanism that you can use to compare against uh yeah and um we have we've done a few iterations internally um we've we've shipped things and the these are all running in the background these are not things you see as a five TR customer um and where where basically we pull samples of data from The Source or the destination and compare them uh to just create a totally outof band mechanism to verify and we've discovered for example we discovered a floating Point truncation bug when we write CSV files for loading into Data warehouses by doing this um and we think there are more things out there uh that we could we could discover and fix by doing that um and then the other side of this is at some point we want to make these capabilities customer facing because there's a lot of phantom data Integrity issues in our world we get a lot of reports from customers whether they're like oh VI train is broken the system doesn't match and sometimes they are right uh we do occasionally hit but a lot of the time they're they're compare there's something wrong with the the comparison that they're doing and that that doesn't mean that we just tell them to go away we have to figure it out we have to verify that it's a like a false alarm so we get a lot of false alarms if five trans to the extent that we can build tools for quickly um proving or disproving the the concern we're we're thinking about that too that's awesome and then pra from your side of the you know data quality Island uh what are the challenges that you're seeing today yeah so the way I think about it is I think of it as a three-step 
framework uh it's actually very similar to health I think like generally like Life Health um it's awareness um that's the first step the second step is cure uh and the third step is prevention uh and I think if you think about each of these steps like awareness for example what we're seeing with customers is uh we like we have a ton of customers who use us with five TR monardo and I think like for example five Tran we were like the metata API that came out now we have customers that say let's pull out context on what's Happening um and send out an announcement directly to my end users which is red green yellow is did the pipeline run or did it not run did it run as I expected stuff like that uh we have the same thing with anomaly detection on the so the stuff that the data producers know can we share awareness to end consumers and end users and in a way that's easy for them it's in their bi tool it's in slack it's this like green announce M that says red green yellow right like stuff like that that's first step can we create awareness of where we are the one big change we've seen is is this move to this concept of a data product where I think some of the most furthest ahead teams are actually taking all these metrics and metadata and converting it almost into a score which is a data product score and say like here like let's create a measure of like you know if you don't measure you can't really improve what's the measure of reusability and Trust as a I think about a data product so that's been I I mean I've been super surprised by how quickly that adoption has grown uh across our customers the second on cure bar alluded to this collaboration I think that's the most broken flow that exists right now because cure is a solution between business and data producers both need to come together so there's a mass like there I think we have a lot of work to do hope when we come back maybe not in 10 years like we come back in even like a year like we've made significant progress in collaboration and the third is prevention I think the biggest piece here like we're seeing a lot of adoption around data contracts and preventing so how do you take what you learned in awareness and cure it but also make it something that's more sustainable over time and I think that that's actually where there's been a a bunch of innovation I think like we launched a module but there's been a ton of innovation over the last uh the last some time U and hopefully all those three things together actually get us to a point where we solve for data trust in you know I really my vision for this is like in a few years like it becomes a really boring problem like we're not talking about it it's like it just it's there uh and then we keep improving it but it's not a it's not a problem that we should have a topic of conversation about it should become stable s data contracts so the simplest way of this is how do you help a producer and a consumer align on an selling that's the best way that we're thinking and so what do you believe are the four rules for data quality again it's it's a little bit more of a collaboration problem actually more than it's a technical problem which is what is what do we agree on is our core layers of this is what we believe and then how do you translate that into the actual data producer workflow itself that's the best example of what we're seeing um customers on the C and there's one thing that you mentioned pra which is on the collaboration side that I think is very important which is that data quality 
is often a cultural issue as much as it is you know broken pipelines or like uh you know something happening on the data collection side um can you walk us through maybe the main cultural issues that lead to poor data quality like and expand on that notion a bit more and like what can happen on organization what can organizations do today to shift their culture to prioritize dat quality so let you lead with app culture and then I i' love to listen from the remaining panelists yeah every cultural thing right I actually think of it similar like what's the base of culture like first if you believe in like if you think about if you believe in good intent which I would like to believe that most like everyone actually is trying to do the right thing for the company to a large no it's trying to destroy data PES no data producer wants to like ship something that like breaks and then spend like nobody wants to that like's like let's start like everyone wants like everyone wants good inent um so I think the first first step is really I think just shared awareness and shared context uh so first like a lack of like I remember this I remember once I got like I used to be a data leader in my previous and I got this call from a stakeholder and they were like number of this dashboard doesn't look right I remember like jumping on my bed and I look at my dashboard and there's a 2X Spike overnight and I'm like oh my God this is crazy and even I in that moment had a question on like oh my god did something Breck or is my data engineer not doing his job I even I had because I couldn't just open up airflow and look at the audit logs and see what happened like I just could so first step the reality is that five different people with their own DN and they don't fully understand each other's work unlike any other team sales teams are very homogeneous data is not it's very hetrogeneous so first how do you create shared context shared understanding awareness across the ecosystem in a way that everyone can understand second can you measured it uh which is why I'm super excited because if you create a measure then it becomes very easy for people to move to it so I think as you think about so how do you measure it is the second um and then the third is actually then just the process flows it's tooling process flows like iterative Improvement that's actually the easy part of the problem in my mind I think the first two things like for example even shared context J said this right like you break it's very easy to lose trust uh but at that time nobody says 99.5% of the time it was accurate you see the one time the number was broken and it breaks trust right and so then what's the shared understanding what are we defining as trust uh and how do you solve that human problem um I think the best examples of this is we've seen people actually have folks who understand both culture and humans and Data Drive the charge on building that initial people call it governance standards people whatever you decide to call it but that initial shared context and understanding is I think the first step to good culture yeah and I like George like you react to that like maybe what are some of the levers that you've seen that can improve on cultural level to improve data quality within an organizations maybe kind of getting inspired from five Trend customers here oh my gosh I'll let you know when I find one unfortunately a lot of data quality problems Have and Have origins in like poor systems configuration and those things are really hard to 
fix uh yeah you know um if I have one piece of advice for early stage Founders it is keep an eye on your Salesforce configuration because if that thing gets out of joint uh man it is hard to fix uh so it's it's it is a real grind trying to make progress on this a lot of it consists of going Upstream to the systems of record and improving their configuration so that they're not generating like a zillion duplicate accounts and stuff like that yeah I can attest to that I can attest to that and then uh bar from your perspective culturally maybe how do you move the needle as a data leader yeah I mean agree with what pra and George said maybe the perspective that I can add here I think the companies that we are seeing make progress it's due to a few reasons the first is there's both this organizational top down and bottom up um agreement that data matters and that the quality and Trust of that data matter um if it's just one direction that typically fails um so if there's you know you know there's like a CEO of one of the you know Fortune 500 Banks gets upset every time uh they get a report with bad data and so they actually made it a sea level initiative to um to make sure their data is is sort of you know Ready Clean to the best degree that they can Etc um that obviously creates some pressure creates you know real initiatives in the business real metrics to pera's point earlier um that is not sufficient um it's very important but the you know sort of business teams if you will business analysts but also the data governance teams um in large Enterprises there's various you know it could be the centralized data engineering platform all those people need all the different teams and um or all stakeholders in an initiative and need to care about it just as much for those teams oftentimes the motivation is that they're spending most of their days in fire drills on data issues um and by the way I saw someone sort of asking can we clarify what is a data issue I think that's a great question um bad bad data can look in various forms and has you know can can uh you know it symptoms are are so different um but generally what you know the way that that I'm thinking about this is if you look at some data product whatever data product that is again it could be um pricing recommendation it could be you know a dashboard that you're CMO is looking at um it could be a chat bot um and if you're looking at it you look at the data and it's very clear to you that the answer is wrong um maybe the most you know an example from the last few weeks um from this was from Google I think someone searched how do you how do I keep my cheese on my pizza or something like that and Google recommended you can use organic superglue that's a great way to keep search from Reddit yeah yeah exactly that's yeah exactly that's right and so um is a good example of bad data um that you know that is one you know very public and I think it went viral and you know maybe Google Google can get away with that but many other companies can't get away with that so there was um you know an airline that actually um provided the wrong uh discount on an airline ticket and so you know a consumer purchased a ticket at different price and actually sued that Airline and got the B the money back Rec coup the money um and so you know there really real repercussions to putting that data out there um and I think you know going back to your question about culture I think you know both the the teams working with data have to care about that now the thing is 
that they don't always do because they don't always understand where is your data going so if I'm building a data pipeline I don't necessarily understand who's being who's using that data um and why which makes sense if I'm way upstream and so oftentimes I find that um the companies who have made the most progress are those who are able to bring together those teams under a unified view of where do we want to go as a company um oftentimes that could start looking at just how many data incidents do we have um how quick are we to respond like what's our time to to detection of those what's our time to resolution and then you know taking this a step further uh oftentimes are teams putting together slas between each other so you know the SLA for particular data to arrive on time or to arrive in some complete State um Etc so so I would say kind of the focus on metrics I agree with prala that that's um that typically drives the right Behavior or drives some Behavior which is better than none okay I couldn't agree more and then maybe you mentioned something onc will drive some Behavior I agree with that I I'm gonna get that tattooed and then uh the one thing that you mentioned bar is the Google example right because I think this kind of is a perfect segue into the you know nuances of data quality when it comes to J AI right you mentioned that survey at the beginning of 100% exactly 100% of data leaders are pressure are you know Under Pressure to deliver generative AI use cases right um that does not sound surprising at all so you know when if you're a data leader if you're in an organization trying to build a j of AI use case what are the data quality considerations you need to have H are they different from the general data quality considerations you need to have like what are the nuances that you need to have uh like what are the nuances when it comes to the considerations of data quality when it comes to J of AI so bar I'll let you you know continue on that yeah I mean I think look if if I think about sort of the state of generative AI with Enterprises today I you know I mentioned from the survey 100% are under pressure by the way 91% are actually building something so we've almost almost all of us have succumbed to the pressure for whatever reason um and uh I think when we say we're building with gener of AI that can takes various definition so I'll give you an example just last week I spoke to one one Enterprise customer who told me we have the full entire sort of like Tech stack for generative AI build you know with all Best in Class we're we're fully ready to go we have no you know use cases or we we don't really know we don't have anything that's like tied to business outcome that we can point to but the teex stack is ready like we're ready to go and then and then um you know another customer that said we have you know 300 or so business use cases laid out we have like some great ideas for how to drive business stup we have nothing on the tech stack we're totally we you know we don't even know where to get started um and I think that represents the spectrum of where customers are at uh you know it can be anywhere on one versus the other side or in the middle I think there's more questions than answers at this point A lot of people are experimenting or sort of in the early days of um building things in in Pilots some of it is also in production um but I think early days I think by and large across all of these instances companies understand that they have to make sure that the data that's 
actually serving those llms um has to be accurate and here's why today everyone has access to the best models you know the the models being built by 5,000 phds and a billion dollars in gpus we can all access them there's no competitive Advantage for a company with them where does a competitive Advantage lie it's actually with the proprietary data that you can bring that could be you know via rag or fine tuning whatever method you choose but your proprietary data that will help differentiate your generative AI product so that you can create personalized experience for your customer or so that you can automate your own business process but without that proprietary data there's not really a mode or sort of competitive advantage and so companies are realizing that they need to get their proprietary data um in strong shape and so that means making sure that that is a high high quality data and so we are seeing um you know more and more companies thinking about how do we get ready so that when the time comes and we actually have the Tex deack and the business use case and everything and we can actually deliver on that we have the right the right data um uh and we can actually use it that's great I could agree more and then George Al let you react to that as well um I like your comment about the models uh you know everyone has access to the models and the access for differentiation is is uh what data you put into them I I've actually heard Consultants advise companies that because everyone has to has access to the public models there's the way you need to differentiate is by making your own model which is insane advice like yeah that will differentiate you it will differentiate you because your model will be much worse than every but uh it's so early days hard to speculate about this I mean all the AI stuff is so embryonic it's very exciting because it's giving uh us the ability to do something with the unstructured Tex data that we've been we've had we've had it for years um but it's it's giving us the ability to interact with unstructured text in a meaningful but programmatic way um what that turns into um time will tell I don't know if rag is going to be to be all end all um I question whether chat is even the right long-term interface for a lot of these internal applications um but I don't have like a great alternative on the tip of my tongue so I just I think it's very early days and everyone should have their eyes and ears open yeah andou up from a data quality data governance perspective how does generative AI change the conversation yeah I think it's very early but a few patterns we're seeing I think across all our customers were seeing this pattern of people deploying small language models right uh more than like they like which is where Rags fine tuning like some of this comes in that's one pattern we're seeing and as they look at that I think the two nuances outside of just normal data quality which is that we're seeing is one the importance of business terms and semantic context so for example we had a customer who's an investment and you know he was like you know when someone searches in our like someone chats saying Tam in our company Tam means total addressable Market not the eight thing that it means on the internet so one layer is just like how do you like if you get an accurate output what is the sem context that's core to the company and how do we feed that that's one layer that's becoming very important or more important than before the second is we seeing uh around this is a 
little bit more around governance but also relates a little bit of trust which is how do I depending on who's writing a question what data actually goes into the answer so for example if I'm deploying something for my HR team it's probably okay if payroll data gets used in the answer it's probably not okay if it is across the rest of the company um I buy data from LinkedIn for which has like certain terms and conditions associated with it I can only use it for this purpose not this purpose and so as you build that scale democratization uh the way I think about this you you alluded to this right like which is goalpost keeps changing actually I think that's a good thing the reason the goalpost is changing is because people are using data more and the more people use data the more they they need to trust it the more there are issues like it's it's actually a good goal post and so if you actually play this out like there's more and more people who are going through maybe the the dream of like truly democratized data where everybody actually uses data daily and like you know like that's going to play out but then how do you feed it with the the right people should only get the right context at the right time and a way that's safe and secure like how do you solve for that those problems prating and now need to be done in a very different way than before you know I think those problems of permissions are easier than you're getting ad per call but the reason people think these problems are hard is because they look at that the people who make the base models like open Ai and anthropic and mistol and all uh and they're doing like web scraping so they have a whole data pipeline that's like a scraping pipeline that is in designed that on the assumption that is public data and so everything has the same permissions domain um but a a business that's using its internal data and an AI application is not like that at all uh they have data with very complex permissions Landscapes but that's very normal in a like business intelligence uh scenario the the way you solve this is by using the same data infrastructure that you use for everything else for AI I have disappointing news a relational database is the right answer uh for your as the data platform for your AI workload as well because uh yes there there is some unstructured tech data in there relational databases have text columns they've had them for a long time they work great uh but there is also going to be a forest of other tables that tell you all of the permissions metadata that you need to know in order to manage this problem so I think you know if if you if your starting point is like a web scraping pipeline that looks like what the people who the base models are using yes the permissions problem look very hard but if your starting point is a relational database that is structured similarly to the one you use for bi uh this is whole problem you just need to you need to join to all the appropriate things and recapitulate the permissions rules of the systems of record in the SQL queries and you're ready to go it's not that it's like um so easy you're going to do it in a day but my point is it's not really new this idea of I have a database I have a bunch of data in it there are rules about who is allowed to see what like that if if you have a complete schema of the system that you're talking about that is a very solvable problem using traditional techniques and is that solv through sorry for me the question is is that solve through something like 
rag buta I'll let you answer your question so I mean I think the I I don't think my call is for like what's the new technology we need I actually think that I I think we actually spoke about this maybe like N9 months ago on maybe a different panel like and we were like you know maybe the techn the data in AI it's likely going to look pretty similar to what the like I don't think it's a technology here I do think the nuan is solving the like it does introduce a lot of new nuances around like uh just because you're processing this at a speed and at a scale and at like very like there's actually nuances to this which need to be solved for um if we really move towards that AI World um and then second there's a um there's also a human collaboration problem which is what is the policy uh it's not it's not even a technology problem it's like how do we collaborate to figure out what the right policies are for what The Right Use cases are and that was like that used to happen on like people like I've seen so many examples of people doing this like there's documents written and like published somewhere nobody ever uses them uh and it was okay because it was like a few dashboards here and there which is just not going to be okay in the future so how do you solve for that I think that's why it becomes more important okay that's awesome awesome so we do have a couple of minutes left I want to make sure that we answer some audience qia there's one that I think I think we're going to be only able to answer one but I think it's going to be very relevant um which is how do you proceed with management to make them understand the cost of data quality issues because we've been talking about putting a metric on it we've been talking about you know aligning the organization I think no metric is better than Roi or lack thereof so maybe yeah how do you how are you able to kind of uh put a cost on data quality issues so for C I'll start with you um I feel like bar might have a better answer to this one um but because I know you have a framework that you put out at some point but I think the high level your first is almost accepting that everything in data across cannot directly be measured the business value because data itself is a support function inside an or it's like Bops and management things like so like if an analyst produced a report with say like our CEO talks about this like he was like I was hiring a bunch of sales reps in my in my team same time like I hired one analyst who found this one thing that we could optimize and we actually like made an extra million dollars through that one analyst what's the ROI of that analyst thing it's way more than 10 sales reps uh um so how do you think about it's just harder to do because it's two layers removed because data needs to drive strategy or execution both of those things together Drive business business Ro uh and it's hard to De covered so that's true for any data platform tooling across and I think first just accepting that I think is helpful and then second then okay then what are the so if you can't get the outcome metric what's the output metric that you can get to is the way I think about it so we do need a notar that we can progress towards because you know that this is important to get to the outcome and you know if you're in a company that you need to convince people on that then I like I would question whether like data is actually really important to the company uh because like that's pretty straightforward like you know you should have good data 
to dive but like if you're convincing people on that like I think the first question to really have with whoever is asking you for a business case is is this really important uh for you because if it's not like like then that's okay let's have a conversation about it and then I think what's output metric and I'll let bar talk about that because I know you yeah I let bar and end with us on the framework sure so um the number one thing it depends on I'll say depends on who you're at who you're talking to if you talk to Executives the number one thing they'll tell you is I just sleep better at night uh knowing that someone sort of is that that the data that's powering my business you know whatever it is Data products dashboard J of AI call whatever you like I just sleep better at night knowing that the data is accurate which is very hard to measure to perus point um if you talk to a data engineer sheine learning engineer whatever it is um they often time will talk about how much time they spend and are they spending on sort of cleaning up data or cleaning up you know fire drills um and you know or or actually are they sort of you know building new pipelines and and doing things um doing other things so that's you know kind of like various answers that you will get I will say in general there's sort of three things that we think about the first is um reputation in brand and Trust so when your data is wrong again like you know think about the Google example I don't know if I'll Trust another Google search again after I saw the superglue example um the second cost of Revenue um and so oftentimes you know I gave the airline example but there's realc real implications you know one data issue can easily cost millions of dollars for an organization um and then the third metric that you know I mentioned is sort of Team efficiency or or team time uh your organizational time on that those are sort of the three high level metrics okay that is awesome and I think this is a great time to end today's panel to end day one of radar uh I want to say hug huge thank you for kalpa bar George for joining us for such an insightful session I truly truly appreciate everyone show them the love with the Emojis below and I also say a huge thank you for everyone who's joining from across the world you know people joining us from different time zones people even at like 2 am 3: am watching this stuff I really really appreciate it so I want to say huge huge thank you to all panelists today and to our speakers um and to our audience and in the meantime do check out the LinkedIn group keep connecting and see you tomorrow tomorrow same time same place I really appreciate everyone speak soon thank youall right all right hello hello everyone and welcome to the final session of the day of data Camp radar on scaling data quality in the age of Jaren of AI we left the best for last so everyone do give us a lot of love in the Emojis as you can see here below and let us know where you're joining from I see more than 500 people in the session already so yeah do let us know uh where you're joining from uh especially and what you thought of data Camp radar day one and what you're excited about data Camp radar day too um so of course as organizations continue to embrace Ai and machine learning the importance of maintaining high quality data has never been more critical there are arguably no better people in the data business across the board than bar Moses pral Panka and George Frasier to come talk to us about data quality so first I'm 
goingon to introduce bar Moses she is the CEO and co-founder of Monte Carlo a pioneering company in data reliability and the creator of the data observability category uh Monte Carlo is backed by top VCS such as X Axel ggv red Point iconic grow Salesforce Venture and ivp bar it's great to see you thanks for having me next up is prala Sankar she is the founder of atlan prala is a leading modern data and AI governance company on a mission to enable better collaboration around data between business people analysts and Engineers she has been awarded the economic Times emerging entrepreneur of the Year Forbes 30 under 30 40 under 40 and the top 10 CNBC young businesswoman of 2016a it's great to see you thanks for having me awesome and last but not least is George Frasier CEO at F Tran uh George founded F Tran to help data Engineers simplify the process of working with disperate data sources he has grown fer to be the de facto standard platform for data movement in 2023 he was named a data Nami person to watch he also has a PHD in neurobiology George great to see you great to be with you and just a few housekeeping notes before we get started there will be time for Q&A at the end so make sure to ask questions by using the Q&A feature and vote for your favorite questions if you want to chat with the other participants use the chat feature we highly encourage you to engage in the conversation if you want to network and add folks on LinkedIn and share your LinkedIn profile they will be removed automatically but do join our LinkedIn profile that is linked in the chat as well and you can connect with fellow atendees and I think this is a great starting point to start today's session um you know it's safe to say that data quality is at the top of Mind of many data leaders today especially with the generative AI boom that we see um but maybe to set this stage how would you describe the current state of data quality within the industry within organizations today what do you think are the common challenges organizations are facing when it comes to maintaining high quality data bar I'll actually start with you sure I have lots of opinions on this topic trying not to hog the entire time um yes data quality well frankly let me start by saying it has been a problem and an issue in the space for the last couple of decades so nothing is New Right we've been complaining about data quality for a long time we shall continue to to complain about the quality of our data for a long time um however I do think a few things have changed um first and foremost you know obviously generative AI products you know being more um uh or being prevalent uh at least in terms of the desire to to build them um uh data are put under a lot of pressure um we actually put out a survey that showed we surveyed sort of a bunch of data leadered data leaders and 100% of data leaders were cited as um uh Under Pressure to deliver generative AI products uh no one said they are not being asked to build something um however uh only 70% of them just under 70 68% of them actually feel like their data is ready for generative AI so that means that while there's a ton of pressure from sea level and board and others in the market to actually build generative Ai No One or the large majority of people don't think that their data is ready ready for that and I think that poses a good question for us as an industry to figure out why that is the case um and my hypothesis is that what I would call the data state has changed a lot in the last five to 10 years 
The way in which we process, transform, and store data has changed a ton, but the way in which we manage data hasn't changed at all. If you go back to the survey, 50% of those data leaders still use manual approaches to data quality. So while we've become a lot more sophisticated in what we demand from our data and our data infrastructure, we have not become more sophisticated in how we manage data quality. I think manual rules will always be important, but they're not the end-all; in fact, they're just the starting point. So in short, if I had to answer what the state of data quality is today: it's an old problem with new challenges that we have not caught up with yet. I definitely have ideas on how we need to solve that, but I'll pause there for a minute and see if there are any reactions from my esteemed fellow panelists.

Prukalpa, I'll let you react here.

Yeah, I agree with everything Barr said, but to abstract this a little: I think about the concept of data trust more than just data quality, and maybe that's the reason you have the three of us on this panel. The way I think about it is, consider that final layer of trust. You have a human who says, "this number on the dashboard is broken," or "it doesn't look right, what's wrong?" It sounds like a very simple question; it's actually a very difficult question to answer, because the reason a number could be off could be that the Fivetran pipeline broke that day and didn't run (which, of course, never happens, what are you talking about?), or that the data quality checks failed that day, or that someone changed the way we measure annual recurring revenue and no one remembered to tell the data consumer. If you think about this flow, you have data producers who actually want to guarantee trust: no data producer wants to spend their time answering the question of why a number is off. On the other hand, you have data consumers who actually just want to use data: no one cares about the quality of the data in itself; a data consumer cares about making business decisions. And in the middle we have this gap. The reason we have this gap is an almost self-created problem: we have built a significant number of tools that have scaled massively, so we have a proliferation of tools, and we also have significant diversity in people, so any single final dashboard probably had five people touch it. This problem just gets worse in the AI era: at least as a human, I can look at the number, think "maybe that doesn't look right," and do something about it; an AI doesn't do that, and that can lead to pretty significant consequences. So the way we think about this, sitting at that layer between the producers and the consumers and bringing this together, is: what does it mean to finally create these data products? What makes something reusable and trustworthy? And how can you bring context across from the pipeline, from data quality, from all of these layers in the stack, including human context, to solve the trust
problem and close that gap.

Okay, that's really great. George, I'll let you react.

You know, I think Barr and Prukalpa's framework is right: there are a lot of layers to the system, and it matters a lot where the problem is arising. Except the part about Fivetran pipelines breaking! We try very hard to avoid contributing to the data quality problem in our layer. You would not believe the amount of effort that goes on behind the scenes to chase down the long tail of replication out-of-sync bugs that can happen with all the systems we support. We are not perfect; I can only say that we are better than everyone else. I think where a problem happens is very important in terms of troubleshooting. You asked why, despite all the effort, this is still so top of mind, and the answer is because it's impossible to fix. Data quality is not a problem you can ever fix. There are so many layers and so many ways for things to go wrong, and sometimes the source of the later-stage problems Prukalpa was just talking about is something like: you can't actually know with 100% certainty whether the person who checked out at your point-of-sale system was the same person who created an account on your website. So you can never actually get to 100% with data quality; you just have to manage it. You have to identify the highest-priority areas, where it's most important that the numbers be right, prioritize those, and work on those. But I think you've got to start out acknowledging it will never be perfect.

Yeah, I agree. The thing on that is what you said: the reality of running real-time, dynamic data ecosystems is that things will always break; it's just the nature of the beast. So I think a lot about the fact that trust doesn't actually break because something went wrong; trust breaks because your stakeholder told you something went wrong, instead of you telling them, "actually, something went wrong today, maybe you should hold off." That's the element of trust; it sits one layer above. I don't think the solution is trying to make sure nothing ever goes wrong. The solution is to go one level above, solve for trust, and then measure and manage toward it.

Trust is very hard to win and very easy to lose. An example of this I heard a long time ago: it's funny, I'm in New York right now, and I met with somebody at Bloomberg earlier today; I still have the iced coffee I got while I was there. I don't know if you know what Bloomberg is, but they do data feeds for finance. It's a kind of data management, but the data is stock prices, commodity prices, gas prices, things like that. Many years ago, when Fivetran first started, one of the things I learned is that a key element of Bloomberg's business is that their data is accurate even in the obscure cases, you know, the price of beans in Korea or whatever it is. Even for the most obscure data feeds, they are more accurate than anybody else, and that is really important, because if one thing is wrong one day out of the year, that is a huge problem. That's something we've always tried to emulate at Fivetran, in a very different
context, replicating a company's own data. But it speaks to how it is when you're in any kind of data business: the difference between zero errors and one error is bigger than the difference between one error and infinity. Trust is so hard to win and so quickly lost.

Barr, I'll let you react, and then I'll ask my next question.

I was just going to say, reflecting on this, I wouldn't be surprised if we were sitting on a panel like this ten years from now still having similar discussions, except the words and definitions change. Maybe we call it trust, or data quality, or now, in the context of generative AI, hallucinations, but the problem remains the same. One of the interesting questions to think through is: what are the challenges our customers are faced with today, how are they dealing with them, and how is that different from a few years ago, or honestly just a year ago? The reality is that these problems are just not going away, so figuring out how to address them in a way that adapts to where customers are, and meets them where they are, is super important.

And I want to pick up on that point, because Barr, you alluded to a panel in ten years; I'd be very excited to have a DataCamp Radar AI Edition 2034 where we discuss this. What it seems to me is that the goalposts on data quality shift every year: organizations make strides, they make investments, but then the ambitions of what it means to have high-quality data shift with that, so the challenges still remain. You mentioned this earlier in the discussion: same problems, different challenges. What are those challenges today? I'd love to learn that from you.

Yeah, great question. I'll start by saying: look, in some world, if a model's output is wrong, if you prompt with a question and the answer is wrong, is that better than not having an answer at all? Is no data better than bad data? Maybe; I think so. But then what's the point of having a Q&A or chat bot if it can't provide you an answer at all? So, to your question, the definition of good, what good looks like, actually becomes tricky, and what we should strive for changes. But to pinpoint the challenges, and this goes back to how the data estate has changed over time: historically, when it comes to trust, to Prukalpa's point, what we've done is really start with "can we figure out data issues before anyone else downstream learns about them," whether that's in generative AI or in a dashboard. The thought is that if we can catch issues before others downstream do, we can either repair or rebuild that trust. That is definitely a very important challenge to address, and I think detection capabilities have evolved to a certain degree; I talked earlier about manual versus automated solutions for that.
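To make that distinction concrete, here's a minimal sketch of the kind of automated check that replaces a hand-written rule; the table history, numbers, and threshold are all invented. Instead of a fixed rule like "alert if rows < 10,000," it learns the table's normal range from trailing history:

```python
import statistics

def volume_anomaly(daily_row_counts, threshold=3.0):
    """Flag the latest load if it deviates too far from recent history.

    A manual rule would hard-code a bound; this check derives the
    normal range from the trailing window instead.
    """
    history, today = daily_row_counts[:-1], daily_row_counts[-1]
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
    z = (today - mean) / stdev
    return abs(z) > threshold, z

# Toy example: a steady table suddenly drops to near-zero rows.
counts = [98_000, 101_500, 99_700, 100_200, 102_100, 1_300]
is_anomalous, z = volume_anomaly(counts)
print(f"anomalous={is_anomalous}, z-score={z:.1f}")
```

The same pattern extends to freshness, schema, and distribution checks, which is roughly what "automated detection" means in this conversation.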
I think the big next leap here for building data quality, data trust, whatever you want to call it, is going beyond detection and taking the next step: understanding how you actually resolve, how you actually address these problems. And when you think about the root causes of these issues, those have changed too. In the past, most teams really just focused on cleansing the data, and once you cleansed the data you were fine; you brought it in, you were all good, Fivetran never breaks, so you were okay. And in today's world Fivetran still doesn't break, sorry George, I'm picking on you. But if you look at the data landscape, it's become really complex, super complex even for a small team to manage. If you think about the core pillars of what makes up the data estate, as I'd call it, there are three things. The first is the data itself: the actual data sources, whatever you're ingesting. The second is the code: code written by engineers, machine learning engineers, data scientists, analytics engineers, and so on. And the third is the systems, or the infrastructure: basically the jobs running all of that. You have multiple teams building multiple complex webs of all three of those things, and the problem is that data can break as a result of each one of the three. It could be a result of the data you ingested being totally incorrect; it could be the result of bad code, which could be a bad join or a schema change; or it could be a system failure. I won't name names, but systems do fail; it could be any general ELT solution you use. So to really build reliable products, you have to look at and understand each of those components. You first have to have an overview and visibility into each of them, and then you have to be able to correlate between a particular data issue you're experiencing and, say, a code change or an infrastructure change. That is really, really hard to do today. What ends up happening is that data teams are inundated with alerts and data quality detections, all flying around between twenty to thirty different data teams and ten different domains, with no clarity on who needs to address which problem or which alert. So when you get down to the brass tacks of how to handle this, those are the challenges: have really strong detection of issues, but then take the next step and actually figure out the root cause. And honestly, oftentimes it's more than one root cause: it's typically a storm, excuse my language, with every single thing breaking at once, a data and a code and a system issue together. So when I think about how our systems get more sophisticated, how we build more reliable data systems, it has to start with a more sophisticated view of what the components actually are and what could break.
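As a toy illustration of that correlation step, and emphatically not Monte Carlo's actual method, here's a sketch that ranks hypothetical change events from the three pillars by how close they land to an incident. All event names and timestamps are invented, and real root-cause analysis would also need lineage to know which changes could even affect the broken asset:

```python
from datetime import datetime, timedelta

# Hypothetical change feeds for the three pillars: data (source loads),
# code (deploys, schema changes), and systems (job runs).
events = [
    {"pillar": "code",    "what": "dbt model orders_daily changed", "at": datetime(2024, 6, 1, 9, 5)},
    {"pillar": "systems", "what": "ingestion job retried 3x",       "at": datetime(2024, 5, 30, 2, 0)},
    {"pillar": "data",    "what": "source schema added a column",   "at": datetime(2024, 6, 1, 8, 40)},
]

def candidate_root_causes(incident_at, events, window_hours=24):
    """Return changes that preceded the incident, nearest first."""
    window = timedelta(hours=window_hours)
    recent = [e for e in events if timedelta(0) <= incident_at - e["at"] <= window]
    return sorted(recent, key=lambda e: incident_at - e["at"])

incident = datetime(2024, 6, 1, 10, 0)
for e in candidate_root_causes(incident, events):
    print(f'{e["pillar"]:8} {e["what"]}')
```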
That's really great. And George, from your perspective, adding on top of what Barr said: what are the challenges you're seeing today when it comes to scaling data quality, to moving the needle on data quality?

Well, we look at a very particular slice of this: the replication piece. Does the data in the central data warehouse match the data in the systems of record, be it a database like Postgres or Oracle, or an app like Salesforce or Workday? We've come a long way with centralization of effort, just the fact that Fivetran pipelines are standard, so everyone is running the same code. There's a cumulative effect where, over years, we fix bugs; at the end of the day that's the main tactic we use to identify and squash data integrity issues. We are also experimenting with some new ideas to get that last little bit, because the last 0.1% is very hard. The most exciting idea right now is doing direct sampling for validation. When you're in the business of replication, data quality can be seen as: you basically just need another sync mechanism you can use to compare against. We've done a few iterations internally; we've shipped things, and these all run in the background, they're not things you see as a Fivetran customer, where we pull samples of data from the source and the destination and compare them, to create a totally out-of-band mechanism to verify. By doing this we discovered, for example, a floating-point truncation bug in how we write CSV files for loading into data warehouses, and we think there are more things out there that we could discover and fix this way. The other side of this is that at some point we want to make these capabilities customer-facing, because there are a lot of phantom data integrity issues in our world. We get a lot of reports from customers saying, "Fivetran is broken, the system doesn't match," and sometimes they are right, we do occasionally hit bugs, but a lot of the time there's something wrong with the comparison they're doing. That doesn't mean we just tell them to go away; we have to figure it out and verify that it's a false alarm. So we get a lot of false alarms at Fivetran, and to the extent that we can build tools for quickly proving or disproving the concern, we're thinking about that too.
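Here's a toy, in-memory sketch of that out-of-band sampling idea, not Fivetran's internals: pull the same random sample of primary keys from both sides, fingerprint the rows, and diff them. The seeded mismatch mirrors the floating-point truncation case George describes:

```python
import hashlib
import random

def row_fingerprint(row):
    """Stable hash of a row's values so rows can be compared across systems."""
    canonical = "|".join(str(v) for v in row)
    return hashlib.sha256(canonical.encode()).hexdigest()

def sample_and_compare(fetch_source, fetch_destination, all_ids, sample_size=100):
    """Diff a random sample of rows; the fetchers stand in for real connectors."""
    ids = random.sample(all_ids, min(sample_size, len(all_ids)))
    return [
        pk for pk in ids
        if row_fingerprint(fetch_source(pk)) != row_fingerprint(fetch_destination(pk))
    ]

# Toy "systems": the destination truncated a float during loading.
source = {1: ("alice", 3.14159265), 2: ("bob", 2.5)}
dest   = {1: ("alice", 3.141592),   2: ("bob", 2.5)}
print(sample_and_compare(source.get, dest.get, list(source)))  # -> [1]
```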
That's awesome. And then Prukalpa, from your side of the data quality island, what are the challenges you're seeing today?

Yeah, the way I think about it is a three-step framework, and it's actually very similar to health, to life and health generally: awareness is the first step, the second step is cure, and the third step is prevention. On awareness, for example, what we're seeing with customers: we have a ton of customers who use us with Fivetran and Monte Carlo, and with, say, the Fivetran metadata API that came out, we now have customers who say, "let's pull out context on what's happening and send an announcement directly to my end users," a red, green, or yellow: did the pipeline run or not, did it run as I expected, stuff like that. We have the same thing with anomaly detection. So for the things the data producers know: can we share that awareness with end consumers and end users, in a way that's easy for them? It's in their BI tool, it's in Slack, it's a little announcement that says red, green, or yellow. That's the first step: can we create awareness of where we are? The one big change we've seen is the move to this concept of a data product, where some of the furthest-ahead teams are taking all these metrics and metadata and converting them into a score, a data product score: a measure of the reusability and trust of a data product, because if you don't measure, you can't really improve. I've been super surprised by how quickly that adoption has grown across our customers. The second step, cure: Barr alluded to this collaboration, and I think that's the most broken flow that exists right now, because a cure is a solution between the business and the data producers; both need to come together, so we have a lot of work to do there. Hopefully when we come back, maybe not in ten years but even in a year, we'll have made significant progress on collaboration. And the third is prevention. The biggest piece here is that we're seeing a lot of adoption around data contracts: how do you take what you learned in awareness and cure and make it something that's sustainable over time? That's actually where there's been a bunch of innovation; we launched a module ourselves, and there's been a ton of innovation across the space recently. Hopefully all three of those things together get us to a point where we solve for data trust. My vision is that in a few years this becomes a really boring problem: we're not talking about it, it's just there, we keep improving it, but it's not a topic of conversation; it should become stable. As for data contracts, the simplest way to put it is: how do you help a producer and a consumer align on an SLA? What do we agree are our core rules for data quality, what do we believe, and then how do you translate that into the actual data producer workflow itself? Again, it's a little more of a collaboration problem than a technical problem, and that's the best example of what we're seeing customers do on the prevention side.
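A minimal sketch of what such a producer-consumer contract check might look like; the dataset name, fields, and rules below are invented:

```python
# A tiny data contract: the producer publishes expectations, and the
# consumer's pipeline checks each incoming batch against them before use.
CONTRACT = {
    "dataset": "finance.arr_monthly",
    "columns": {"account_id": str, "arr_usd": float, "month": str},
    "not_null": ["account_id", "arr_usd"],
}

def validate(batch, contract):
    """Return a list of human-readable violations (empty means the contract is met)."""
    violations = []
    for row in batch:
        for col, typ in contract["columns"].items():
            if col not in row:
                violations.append(f"missing column {col!r}")
            elif row[col] is not None and not isinstance(row[col], typ):
                violations.append(f"{col!r} has wrong type: {type(row[col]).__name__}")
        for col in contract["not_null"]:
            if row.get(col) is None:
                violations.append(f"{col!r} is null")
    return violations

# The producer changed arr_usd to a string; the contract catches it.
batch = [{"account_id": "a-1", "arr_usd": "12000", "month": "2024-06"}]
print(validate(batch, CONTRACT))  # -> ["'arr_usd' has wrong type: str"]
```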
And there's one thing you mentioned, Prukalpa, on the collaboration side, that I think is very important: data quality is often a cultural issue as much as it is broken pipelines or something happening on the data collection side. Can you walk us through the main cultural issues that lead to poor data quality, and expand on that notion a bit: what can organizations do today to shift their culture to prioritize data quality? I'll let you lead on culture, and then I'd love to hear from the remaining panelists.

Yeah. Like with every cultural thing, I think about what the base of culture is. First, if you believe in good intent, and I would like to believe that everyone is actually trying to do the right thing for the company, then no one is trying to destroy data pipelines; no data producer wants to ship something that breaks and then spend their time firefighting. Nobody wants that, so let's start there: everyone has good intent. The first step, then, is really shared awareness and shared context. I remember once, when I was a data leader in a previous role, I got a call from a stakeholder saying, "the number on this dashboard doesn't look right." I remember jumping out of bed, looking at my dashboard, seeing a 2X spike overnight, and thinking, oh my god, this is crazy. Even I, in that moment, had the question: did something break, or is my data engineer not doing his job? And even I couldn't just open up Airflow, look at the audit logs, and see what happened; I just couldn't. The reality is that you have five different people with their own DNA who don't fully understand each other's work; unlike almost any other team, sales teams are very homogeneous and data teams are not, they're very heterogeneous. So first: how do you create shared context, shared understanding, and awareness across the ecosystem in a way everyone can understand? Second: can you measure it? I'm super excited about this, because once you create a measure, it becomes very easy for people to move toward it. And the third is the process flows: tooling, process, iterative improvement. That's actually the easy part of the problem in my mind. On shared context, George said this too: it's very easy to lose trust, and in that moment nobody says "99.5% of the time it was accurate"; you see the one time the number was broken, and it breaks trust. So what's the shared understanding, what are we defining as trust, and how do you solve that human problem? The best examples I've seen are teams where folks who understand both culture and humans and data drive the charge on building that initial layer, call it governance, standards, whatever you decide to call it. That initial shared context and understanding is, I think, the first step to good culture.

Yeah. And George, I'll let you react to that: what are some of the levers you've seen that can improve data quality at the cultural level within an organization, maybe drawing inspiration from Fivetran customers?

Oh my gosh, I'll let you know when I find one! Unfortunately, a lot of data quality problems have their origins in poor systems configuration, and those things are really hard to fix. If I have one piece of advice for early-stage founders, it is: keep an eye on your Salesforce configuration, because if that thing gets out of joint, man, it is hard to fix. It's a real grind trying to make progress on this, and a lot of it consists of going upstream to the systems of record and improving their configuration so they're not generating a zillion duplicate accounts and things like that.

Yeah, I can attest to that. And then Barr, from your perspective, culturally, how do you move the needle as a data leader?

Yeah, I agree with what Prukalpa and George said; maybe the perspective I can add is this: the companies we see making progress do so for a few reasons. The first is that there's both top-down and bottom-up organizational agreement that data matters, and that
the quality and trust of that data matter. If it's only one direction, it typically fails. For example, the CEO of one Fortune 500 bank gets upset every time they receive a report with bad data, so they made it a C-level initiative to ensure their data is ready and clean to the best degree possible. That obviously creates pressure, real initiatives in the business, real metrics, to Prukalpa's point earlier. But that alone is not sufficient. It's very important, but the business teams, the business analysts, the data governance teams, and in large enterprises the centralized data engineering platform, all the different teams, all the stakeholders in an initiative, need to care about it just as much. For those teams, the motivation is often that they're spending most of their days in fire drills on data issues. And by the way, I saw someone asking whether we can clarify what a data issue is; I think that's a great question. Bad data can take various forms, and its symptoms differ widely, but generally the way I think about it is: take some data product, whatever it is, a pricing recommendation, a dashboard your CMO is looking at, a chat bot, and when you look at the data, it's very clear to you that the answer is wrong. Maybe the best example from the last few weeks was from Google: I think someone searched "how do I keep the cheese on my pizza," and Google recommended using organic superglue.

A great way to keep it on! Sourced from Reddit.

Yeah, exactly. That's a good example of bad data, a very public one, and I think it went viral. Maybe Google can get away with that, but many other companies can't. There was an airline that provided the wrong discount on a ticket: a consumer purchased the ticket at the wrong price, sued the airline, and recouped the money. So there are real repercussions to putting bad data out there. And going back to your question about culture: the teams working with data have to care about this, but they don't always, because they don't always understand where their data is going. If I'm building a data pipeline way upstream, I don't necessarily understand who is using that data or why. So oftentimes I find that the companies who have made the most progress are those who can bring those teams together under a unified view of where the company wants to go. That can start with looking at just how many data incidents you have, how quick you are to respond, what your time to detection is, and what your time to resolution is. Taking it a step further, teams often put SLAs in place between each other: an SLA for particular data to arrive on time, or to arrive in some complete state, and so on.
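Those metrics are straightforward to compute once incidents are actually logged; here's a minimal sketch over an invented incident log:

```python
from datetime import datetime

# Hypothetical incident log: when each issue started, was detected, and was resolved.
incidents = [
    {"started": datetime(2024, 6, 1, 2, 0),  "detected": datetime(2024, 6, 1, 9, 30), "resolved": datetime(2024, 6, 1, 14, 0)},
    {"started": datetime(2024, 6, 8, 0, 15), "detected": datetime(2024, 6, 8, 0, 45), "resolved": datetime(2024, 6, 8, 6, 0)},
]

def hours(delta):
    return delta.total_seconds() / 3600

ttd = [hours(i["detected"] - i["started"]) for i in incidents]   # time to detection
ttr = [hours(i["resolved"] - i["detected"]) for i in incidents]  # time to resolution

print(f"incidents: {len(incidents)}")
print(f"mean time to detection:  {sum(ttd) / len(ttd):.1f}h")
print(f"mean time to resolution: {sum(ttr) / len(ttr):.1f}h")
```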
So I would say the focus on metrics, and I agree with Prukalpa there, typically drives the right behavior, or at least drives some behavior, which is better than none.

I couldn't agree more. You mentioned that metrics will drive some behavior; I agree with that, I'm going to get it tattooed. And one thing you mentioned, Barr, is the Google example, which is a perfect segue into the nuances of data quality when it comes to generative AI. You mentioned that survey at the beginning: 100%, exactly 100%, of data leaders are under pressure to deliver generative AI use cases, which does not sound surprising at all. So if you're a data leader in an organization trying to build a generative AI use case, what are the data quality considerations you need to have? Are they different from the general considerations? What are the nuances? Barr, I'll let you continue on that.

Yeah. If I think about the state of generative AI within enterprises today: I mentioned from the survey that 100% are under pressure, and by the way, 91% are actually building something, so almost all of us have succumbed to the pressure for whatever reason. And when we say we're building with generative AI, that can take various forms. I'll give you an example: just last week I spoke to an enterprise customer who told me, "we have the full tech stack for generative AI built, all best-in-class, we're fully ready to go, but we have no use cases; we don't have anything tied to a business outcome that we can point to, but the tech stack is ready." And then another customer said, "we have 300 or so business use cases laid out, we have great ideas for how to drive business value, but we have nothing on the tech stack; we don't even know where to get started." I think that represents the spectrum of where customers are: anywhere on one side, the other side, or in the middle. There are more questions than answers at this point; a lot of people are experimenting or in the early days of building things, in pilots, some of it in production, but it's early days. By and large, across all of these instances, companies understand that they have to make sure the data actually serving those LLMs is accurate, and here's why. Today, everyone has access to the best models, the models built by 5,000 PhDs and a billion dollars in GPUs; we can all access them, so there's no competitive advantage for a company in the models themselves. Where does the competitive advantage lie? It's in the proprietary data you can bring, whether via RAG or fine-tuning or whatever method you choose. Your proprietary data is what will differentiate your generative AI product, so you can create a personalized experience for your customers or automate your own business processes. Without that proprietary data, there's not really a moat or competitive advantage. So companies are realizing they need to get their proprietary data into strong shape, and that means making sure it is high-quality data.
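A deliberately naive sketch of that retrieval-grounding pattern, with invented documents and simple word-overlap scoring standing in for the embedding search a real RAG system would use:

```python
# Ground the model in proprietary records rather than relying on what a
# public model already knows.
DOCS = [
    "Acme renewal playbook: offer a 10% discount at 90 days before expiry.",
    "Acme support policy: P1 tickets get a response within one hour.",
]

def retrieve(question, docs, k=1):
    """Rank documents by naive word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def build_prompt(question, docs):
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What discount do we offer at renewal?", DOCS))
```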
And so we're seeing more and more companies thinking about how to get ready, so that when the time comes, and they actually have the tech stack and the business use case and everything, they can deliver on it: they have the right data, and they can actually use it.

That's great, I couldn't agree more. George, I'll let you react to that as well.

I like your comment about the models: everyone has access to the models, and the axis for differentiation is what data you put into them. I've actually heard consultants advise companies that, because everyone has access to the public models, the way to differentiate is by making your own model, which is insane advice. Yes, that will differentiate you: your model will be much worse than everybody else's. But it's such early days that it's hard to speculate about this. All the AI stuff is so embryonic. It's very exciting, because it's giving us the ability to do something with the unstructured text data that we've had for years, to interact with unstructured text in a meaningful but programmatic way. What that turns into, time will tell. I don't know if RAG is going to be the be-all and end-all, and I question whether chat is even the right long-term interface for a lot of these internal applications, but I don't have a great alternative on the tip of my tongue. I just think it's very early days, and everyone should keep their eyes and ears open.

Yeah. And Prukalpa, from a data quality and data governance perspective, how does generative AI change the conversation?

I think it's very early, but we're seeing a few patterns across our customers. One pattern is people deploying small language models, which is where RAG, fine-tuning, and some of this comes in. And as they look at that, there are two nuances we're seeing beyond normal data quality. The first is the importance of business terms and semantic context. For example, we had a customer, an investment firm, who said: when someone chats and says "TAM," in our company TAM means total addressable market, not the eight other things it means on the internet. So one layer is: to get an accurate output, what is the semantic context that's core to the company, and how do we feed it in? That layer is becoming more important than before.
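A minimal sketch of that first layer, expanding company-specific terms like TAM before the question ever reaches the model; the glossary contents are invented:

```python
# Company-specific semantic context: resolve internal jargon up front so the
# model answers with the company's definitions, not the internet's.
GLOSSARY = {
    "TAM": "total addressable market (never a person's name)",
    "ARR": "annual recurring revenue, as defined by the finance team",
}

def with_glossary(question, glossary):
    """Prepend definitions for any glossary terms the question mentions."""
    used = {t: d for t, d in glossary.items() if t.lower() in question.lower().split()}
    if not used:
        return question
    defs = "\n".join(f"{t} = {d}" for t, d in used.items())
    return f"Company definitions:\n{defs}\n\nQuestion: {question}"

print(with_glossary("What was our TAM last quarter?", GLOSSARY))
```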
The second nuance is a bit more around governance, but it also relates to trust: depending on who is asking a question, what data actually goes into the answer? For example, if I'm deploying something for my HR team, it's probably okay if payroll data gets used in the answer; it's probably not okay across the rest of the company. Or I buy data from LinkedIn that has certain terms and conditions attached: I can only use it for this purpose, not that purpose. And as you scale democratization, you alluded to this, the goalposts keep changing, and I actually think that's a good thing. The reason the goalposts change is that people are using data more, and the more people use data, the more they need to trust it, and the more issues surface. It's actually a good goalpost. If you play this out, more and more people are moving toward maybe the dream of truly democratized data, where everybody actually uses data daily; that's going to play out, but then how do you make sure the right people get only the right context at the right time, in a way that's safe and secure? Those problems pre-existed, but they now need to be solved in a very different way than before.

You know, I think those permissions problems are easier than people give them credit for. The reason people think they're hard is because they look at the companies building the base models, like OpenAI and Anthropic and Mistral, who are doing web scraping: they have a whole data pipeline that's a scraping pipeline, designed on the assumption that it's public data, so everything is in the same permissions domain. But a business using its internal data in an AI application is not like that at all: it has data with very complex permissions landscapes. That's very normal, though, in a business intelligence scenario, and the way you solve it is by using the same data infrastructure that you use for everything else for AI too. I have disappointing news: a relational database is the right answer as the data platform for your AI workload as well. Yes, there is some unstructured text data in there, and relational databases have text columns; they've had them for a long time, and they work great. But there is also going to be a forest of other tables that tell you all of the permissions metadata you need in order to manage this problem. If your starting point is a web-scraping pipeline like the base model builders use, yes, the permissions problem looks very hard. But if your starting point is a relational database structured similarly to the one you use for BI, you just need to join to all the appropriate things and recapitulate the permissions rules of the systems of record in your SQL queries, and you're ready to go. It's not so easy that you'll do it in a day, but my point is that it's not really new. This idea of "I have a database, I have a bunch of data in it, and there are rules about who is allowed to see what": if you have a complete schema of the system you're talking about, that is a very solvable problem using traditional techniques.
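A minimal sketch of George's point, with an invented schema: when the documents feeding an AI answer live next to their permissions metadata in a relational database, restricting the context to what a given asker may see is an ordinary join:

```python
import sqlite3

# Toy schema: documents carry an ACL group; users belong to groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER, body TEXT, acl_group TEXT);
    CREATE TABLE memberships (user_name TEXT, acl_group TEXT);
    INSERT INTO documents VALUES
        (1, 'Q2 payroll summary', 'hr'),
        (2, 'Q2 product roadmap', 'all_staff');
    INSERT INTO memberships VALUES
        ('dana', 'all_staff');
""")

def visible_documents(user_name):
    """Recapitulate the permissions rules in the query itself."""
    return conn.execute(
        """
        SELECT d.body
        FROM documents d
        JOIN memberships m
          ON m.acl_group = d.acl_group AND m.user_name = ?
        """,
        (user_name,),
    ).fetchall()

print(visible_documents("dana"))  # payroll stays out of dana's context
```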
And is that solved through something like RAG? Sorry, the question from the audience is whether that's solved through something like RAG. Prukalpa, I'll let you answer.

I don't think my call is for some new technology. We actually spoke about this maybe nine months ago on a different panel, and we said the data stack in the AI world is likely going to look pretty similar to what it does today; I don't think it's a technology problem. It does introduce a lot of new nuances, because you're processing things at a speed and a scale where there are genuinely new things to solve for if we really move toward that AI world. And second, there's also a human collaboration problem, which is: what is the policy? That's not even a technology problem; it's how we collaborate to figure out what the right policies are for the right use cases. That used to happen in documents written and published somewhere that nobody ever used, and it was okay because it was a few dashboards here and there, which is just not going to be okay in the future. So how do you solve for that? I think that's why it becomes more important.

Okay, that's awesome. We do have a couple of minutes left, and I want to make sure we answer some audience Q&A. I think we're only going to be able to answer one, but it's very relevant: how do you proceed with management to make them understand the cost of data quality issues? We've been talking about putting a metric on it and about aligning the organization, and I think no metric is better than ROI, or the lack thereof. So how are you able to put a cost on data quality issues? Prukalpa, I'll start with you.

I feel like Barr might have a better answer to this one, because I know she has a framework she put out at some point, but at a high level: the first step is almost accepting that not everything in data can directly be measured in business value, because data itself is a support function inside an organization, like BizOps and management functions. Our CEO talks about this: he says, "I was hiring a bunch of sales reps on my team; at the same time I hired one analyst who found one thing we could optimize, and we actually made an extra million dollars through that one analyst. What's the ROI of that analyst? It's way more than ten sales reps." It's just harder to quantify, because it's two layers removed: data needs to drive strategy or execution, and those together drive business ROI, and that's hard to decompose. That's true for any data platform or tooling, and first just accepting that is helpful. Second: if you can't get the outcome metric, what's the output metric you can get? That's how I think about it. We do need a North Star we can progress toward, because we know it's important to getting to the outcome. And if you're in a company where you need to convince people of that, then I would question whether data is actually really important to that company, because that part is pretty straightforward: you should have good data to drive the business. If you're still convincing people of that, the first question to ask whoever is requesting a business case is: is this really important to you? Because if it's not, that's okay; let's have a conversation about it. And then the question becomes what the output metric is, and I'll let Barr talk about that, because I know she has the framework.

Yeah, I'll let Barr end us on the framework.

Sure. The number one thing depends on who you're talking to. If you talk to executives, the number one thing they'll tell you is, "I just sleep better at night knowing that the data powering my business, whatever it is, data products, dashboards, generative AI, whatever you like, is accurate." Which is very hard to measure,
to Prukalpa's point. If you talk to a data engineer or a machine learning engineer, they'll often talk about how much time they're spending cleaning up data and handling fire drills, versus actually building new pipelines and doing other work. So you'll get various answers. In general, there are three things we think about. The first is reputation, brand, and trust: when your data is wrong, think about the Google example again; I don't know if I'll trust another Google search after seeing the superglue answer. The second is the cost to revenue: I gave the airline example, and there are real implications; one data issue can easily cost millions of dollars for an organization. And the third metric is team efficiency, or team time: your organizational time spent on those issues. Those are the three high-level metrics.

Okay, that is awesome, and I think this is a great time to end today's panel and end day one of Radar. I want to say a huge thank you to Prukalpa, Barr, and George for joining us for such an insightful session; show them some love with the emojis below. I also want to say a huge thank you to everyone who joined from across the world, people in different time zones, some even watching at 2 or 3 am; I really appreciate it. So a huge thank you to all of today's panelists and speakers, and to our audience. In the meantime, do check out the LinkedIn group, keep connecting, and see you tomorrow, same time, same place. I really appreciate everyone. Speak soon. Thank you.