**The Future of Data and AI in Companies**
**Tom:** I think the leaders of the next wave will be conversant in AI. They'll know how to speak to their own leadership and educate them on where the company should be spending time and where it should be investing in software, because the reality is that every board and every C-suite is saying, "We need AI, we need AI, we're starting to see the productivity improvements from our competitors." They're looking to the leaders within each individual team and department to educate themselves and then ultimately develop a strategy for leveraging this technology internally. So being an expert in that domain, I think, will lead to a lot of promotions, because if you can cut two-thirds of a workforce, or if you can increase revenue by a third by making a sales team that much more effective through automation, there's a lot of value to be created.

**Richie:** Absolutely. So if you take advantage of this, you could be the real hero of your company.

**Tom:** Absolutely.

**Richie:** And are there any particular areas where you think data teams, or companies generally, should be focusing their attention to improve their data capabilities?

**Tom:** I think the first is the data pipelines that are going to be necessary to power sales optimization and customer support optimization. Those seem to be the two areas across companies where, for example, we're starting to see startups building fully automated sales development reps, so that instead of a human sending 10 emails per day, these systems are sending 1,000 or 2,000 emails a week. For those programs to be effective, the data pipelines that inform those outbound campaigns will need to be there. The same goes for customer support: for the chatbots that are able to deflect two-thirds or more of inbound customer support queries, the richer the context, the more data those bots have about the particular customer, the FAQs, or the new product features, the better and more effective they'll be. So I think that's where you'll see a lot of effort and energy, because those are the two levers: in order to make a lot of money in software, you either increase your customer's revenue or you materially reduce their costs. The sales team is where you're going to increase revenue, and the customer support team is typically where a pretty significant cost comes from. So I would expect that focusing on the data pipelines enabling those two will be great.
**The Importance of Data Pipelines**
**Tom:** The third priority is around marketing, because there's a technique in marketing called account-based marketing: building a website for a particular buyer, like a Coca-Cola or a Procter & Gamble. Historically that's been really difficult to scale because, as you can imagine, you need a lot of data to do it. Now we're starting to see the next generation of account-based marketing companies do this for every single customer in a universe, because they're automating it with machines, and there again the data pipeline context really matters. So start with the stuff that can have a direct impact on your revenue or your costs, and then go toward personalization: think about better marketing through personalization.

**Richie:** Those seem like pretty strong areas to target. Just before we wrap up, are there any companies you're particularly excited about right now?

**Tom:** Yeah, within the data ecosystem. MotherDuck, which we talked about: I think they have the potential to really change the cloud data warehouse ecosystem with their unique hybrid architecture, and Jordan, the founder and CEO there, was the tech lead on BigQuery and really understands the domain. The other realization we've had as a company is that more than 80% of cloud data warehouse workloads are small enough to process on a modern Mac, and this hybrid architecture allows companies to do that. The other one is Omni. This is the ex-Looker team, a key part of them, who got together with Chris Merritt, one of the architects of dbt. They're building a modern BI system that balances centralized control of metrics definitions with enabling an individual marketer to create a metric, say around cost of customer acquisition, and building the workflow to move that all the way up to the centralized data team. They're having a lot of success. Those are the two I'd like to highlight today.

**Richie:** OK, MotherDuck and Omni: companies to watch out for, then. All right, any final advice for companies wanting to make better use of data?

**Tom:** Keep going. It can be hard, and there tends to be a lot of resistance associated with it, but what I've seen in my career time and time again is that the companies who move really fast and experiment with next-generation technologies are often able to find a lot of alpha and develop a competitive advantage through it.

**Richie:** Great advice there. Thank you so much, Tom.

**Tom:** The pleasure was mine. Thank you, Richie.
"WEBVTTKind: captionsLanguage: enwell I think the productivity expectations of every white collar worker will now go through the roof I think everyone will be expected to be 50 to I don't know 250% more productive than they have been in the past because they have these tools at their disposal and so being familiar with all these different systems and understanding when to use them will be absolutely essential hi Tom thank you for joining me pleasure to be here Richie thanks for inviting me he excellent so we're going to talk a bit about Trends uh we'll start with a big one so generative AI has obviously been the big story of the last few years and it seems to be changing pretty fast so uh what trends are you seeing at the moment I think you're right it I mean it feels like crypto did maybe two or three years ago where every morning you'd wake up and there'd be a new paper in the world would have changed uh and I definitely feel that way I think we're all waiting with baited breath for the next generation of the models the ones at Facebook meta has promised that GPT 5 which seem like they'll be able to work on much longer duration tasks so till we know what that looks like but in the meantime I think there's definitely been a shift I would say last year to this year from preferring co-pilots which complete your sentences to agents which will fully automate work or at least some fraction of the road work okay no that's very interesting because um yeah there's been a big refrain of last um year or two um like as not take your job it's going to augment it but agent sort of promise to just like automate things and completely take away that human from the loop so um yeah tell me more about that yeah for sure so I mean I think what we the we've seen some of the productivity stats on co-pilots which are about 50 to 75% that's Microsoft in service now the other way around Microsoft 75 if the llm agents or anything like mechanical robots the promise is to increase 
productivity by about 250% so one robot would take the place two two and a half humans which is what's happened on Automotive manufacturing lines we're still really early there I would say we're starting to see the very first applications of this and security automation or customer support where Clara cut two-thirds of their customer support staff and we can debate the relative effectiveness of the Bots but they did it and they saved a lot of money doing it and so I think what we're seeing across in there people are really excited about this trend and the the nature of like entry level tasks will change there was a really interesting Twitter Thread about this where they were talking about um okay so robots have taken off in the last 10 years is robotic surgery particularly like Urology or areas in the body where there's just a very difficult way a very difficult space to navigate and as a result robots you know they can move in ways that humans can't because their fingers so to speak are much narrower there's now this training Gap where the the senior surgeons are no longer teaching the junior surgeons side by side the junior surgeons are now just watching the robot operate and so as a result there'll be this generation of Surgeons or security analysts or customer support reps who are experts in the domain and then the younger ones don't receive the benefit of the training and so that's that's an interesting Dynamic that we're sort of paying attention to um but in terms of like the technology itself we're looking for tasks where there's a lot of s summarization maybe some initial customization some of these agents are actually starting to write their own queries to hit external databases to enrich and then produce a summary output and then that goes to a classifier that says is this worth a human's attention or not so uh on the other side of AI so what about non-generative AI are you seeing any changes there I think what well it it's not moving as fast it's not to 
say that it's not important. We're starting to see combinations of generative AI and, we'll call them, classical methods. One reason for this is that generative AI mechanisms tend to be chaotic and non-deterministic. If I ask "How do I reset my password?" and then I ask the same question with an extra space before the punctuation, it's a different input that may elicit a different answer from a generative AI model. And if you start to chain these inputs and outputs across generative AI steps, the error can explode. So, and this is very early on, we're starting to see companies take the output of a generative AI step and then apply a classical machine learning method to it, to classify it or just to constrain the error, and then pass it on to the next step, and the next. We recently met a company where three or four different kinds of machine learning are being used all in one system. Generative AI is really good at generating ideas, "give me four or five great titles for this blog post I just wrote," and the deterministic, classical methods are the ones that are really good at picking. So I think you'll see a nice marriage between the two, where most generative AI applications will not be entirely generative, they won't just be wrappers, and as we get to the next step you'll see what we call constellations of models. To describe that in one more level of detail: input from the user comes in; there's a classifier, which could be a classical classifier or a generative one, that says, "What kind of query is this? Have I seen this before?" Given that output, it passes the query either to a small generative language model that knows that domain, or, if it's a completely new query, to something like GPT-4, because it can handle the whole universe, or to a classical model.
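The constellation-of-models routing Tom describes can be sketched in a few lines. Everything below is hypothetical: the intent table, the model names, and the keyword rule stand in for what would in practice be a trained classifier sitting in front of real model endpoints.

```python
# Toy "constellation of models" router: a cheap classifier decides whether a
# query matches a known intent (handled by a small, specialized model) or is
# novel (falls through to a large general model). All names are invented.

def classify_query(query):
    """Return the (hypothetical) model a query should be routed to."""
    known_intents = {
        "reset password": "small_support_model",
        "billing": "small_billing_model",
    }
    lowered = query.lower()
    for phrase, model in known_intents.items():
        if phrase in lowered:
            return model
    return "large_general_model"

def handle(query):
    # Each branch would call a different model endpoint; here we just
    # report which route was chosen.
    return "routed to " + classify_query(query)

print(handle("How do I reset password?"))   # routed to small_support_model
print(handle("Explain quantum computing"))  # routed to large_general_model
```

A real system would add the step Tom mentions: a classical model after each generative step to constrain error before chaining to the next one.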
**Richie:** OK, so that hybrid approach is really interesting. It's maybe like using the generative AI as the human interface, because humans like chatting, and then you've got something more deterministic underneath, so you don't get the error propagation problems. Now, one trend I've seen from going to a lot of data conferences: last year everyone was saying, "You must build stuff with generative AI, it's going to be amazing," and in the last few months it's changed to, "All those generative AI prototypes you built last year failed because your data quality sucks, and now you need to think about data governance and how you improve your data quality." Are you seeing anything interesting in that space?

**Tom:** Yeah, there's a lot around data security. I think there are two different kinds of problems with generative AI security. The first is what happened with Air Canada, when a passenger asked about a bereavement policy and the generative AI invented a bereavement policy that the airline was forced to adhere to, even though it had nothing to do with the real policy. So there's this hallucination problem. But there are other challenges. To give you an example: a friend of mine is running a company, and one of his employees installed an LLM on top of the cloud data warehouse, asked "What is this employee's Social Security number?", and out popped the Social Security number. So you have data security issues. We've identified five different kinds of security that will be needed around large language models: ensuring the developer environment where they're building the large language model is secure; ensuring there's no data loss, like the Social Security issue; the right access permissions to the database, meaning are you hitting the right Snowflake connections and do they have the right user account; data poisoning attempts, meaning if I'm downloading an open-source model, am I downloading the right one and has it been tampered with, which is software supply chain; and there are others. This whole space is new, and the challenge is that historically the CISO, the chief information security officer, has been predominantly responsible for securing data. The average tenure there is about 14 months after a breach; it's a very challenging job, because the surface area continues to increase. But now the head of data, and maybe even the head of engineering, are responsible for building the systems that use that data in generative AI. So both of these roles have to unite around securing the data, making a hermetic seal around it, and those workflows have not yet been created. Our sense is that for many of the largest companies in the world, who might have hundreds or thousands of LLM-enabled applications on their roadmaps, security is the single biggest blocking feature, because nobody wants to be fired for the leak. But there's no playbook yet, no standard tools, and that's really slowing a lot of the enterprise adoption.

**Richie:** So it seems like making generative AI applications more secure could be a big growth area in the next few months or years. I liked your Air Canada example; I believe the courts ruled that they had to uphold the bereavement policy the AI told the passenger about. And I heard another example where someone was chatting with a car dealership's AI sales rep and persuaded the AI to sell them a car for a dollar. I have no idea whether that's been upheld, but it seems like a good deal if you can pull it off. OK, so related to this: it seems like a lot of the cloud data warehouses, your Snowflakes and Databricks and all the rest, are rapidly adding AI features. So how
do you see data warehouses changing with this rise of AI?

**Tom:** They're at the core. A lot of the data being fed into these large language models comes from the cloud data warehouses themselves. In the pre-generative-AI era, the map of the modern data stack was: a source system, an ETL platform, the cloud data warehouse, and then three different consumers, namely BI, exploratory analytics, and machine learning. But machine learning was always post hoc; it was never in the path of production. It was customer segmentation, churn prediction, revenue maximization exercises. Now the cloud data warehouse is basically in the path of production, so to speak: extracts of that data are being fed into a machine learning pipeline that's cutting up the data, calculating the different vectors, and potentially storing them in a vector database; and then at inference time the query plus the vectors, which lead back to the documents, all go into the large language model. So these cloud data warehouses suddenly have to rethink where they sit in the stack; many of the historical ones have not been architected to serve as production databases. What's happening is there are these new pipelines: there's Spark being pushed in to calculate the different vectors, particularly at large scale, and then you have this category of vector databases, which do a lot of calculations on similarity search. There's a strategic question about how much of the vector database should be a standalone database and how much can actually exist within a cloud data warehouse; the core functions are relatively straightforward, namely clustering, with k-means or another method, and then cosine similarity, which measures how similar two, three, or four vectors are. Some of the bigger database companies have announced vector database initiatives, so that's an important competitive dynamic that we'll see play out, and it'll be really critical. The other dynamic within cloud data warehousing is a separation of compute and storage, but not in the way Snowflake talks about it. I mean literally the separation of the query engine from where the data is stored. You saw that in Snowflake's most recent earnings, where a lot of the bigger customers are asking for their data to be stored in Iceberg tables in S3. So it's very possible that that core centralized data may actually be hit by a different query engine in order to power a generative AI pipeline.

**Richie:** So you might be skirting around the data warehouse entirely: you go straight from your cloud storage in S3 or wherever to the LLM, or into your vector database, cutting out that middle part. Is that right?

**Tom:** It could be. It depends on the sizes of the data, and one of the challenges is formatting the data in the right way. If anybody has used LangChain and tried to process documents: how you chunk the documents, how you cut them up, how you structure them in order to feed the model is really critical, and the sequencing is also really important. So you may need to pre-process using a data warehouse or some other tool, and then write to Iceberg tables, with Arrow files underneath, which are then consumed. But our perception is there's no real standard design pattern quite yet; everybody's building these flows in a very customized way today. In two or three years there will be, just the way we had one in the modern data stack: source system, Fivetran, Snowflake, Looker.
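The chunk-then-retrieve-by-similarity flow discussed above can be hand-rolled in a few lines. This is only an illustration: the bag-of-words "embedding" stands in for a real embedding model, and the chunk size and corpus are invented.

```python
# Sketch of chunking a document and ranking chunks by cosine similarity
# against a query. A real pipeline would use a learned embedding model and
# a vector database; here a word-count vector stands in for the embedding.
import math
from collections import Counter

def chunk(text, size=8):
    """Split text into fixed-size word chunks (a deliberately naive strategy)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

doc = ("To reset your password open the settings page and choose security. "
       "Shipping times vary by region and carrier availability.")
chunks = chunk(doc)
query_vec = embed("how do I reset my password")
best = max(chunks, key=lambda c: cosine(query_vec, embed(c)))
print(best)  # the password-related chunk scores highest
```

How you chunk, as Tom notes, matters a great deal: a chunk boundary in the wrong place can split the very sentence a query needs.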
There was your stack; we don't have that yet.

**Richie:** That's very interesting. It seems like there are a lot of new companies in this area doing overlapping parts of the puzzle, so there are probably lots of different ways to assemble a complete flow at the moment. Do you think these companies will become bigger and overlap each other to converge on something, or will there be one winner that does everything?

**Tom:** I don't think there'll be one winner. The modern data stack had many different approaches: you could look at Databricks versus Snowflake, Fivetran versus Airbyte, Looker versus any number of the BI products. So I think there will be many different approaches. What's different in the world of AI is that, unlike business intelligence, where everybody wanted a dashboard, the output is different. One company might want to build a generative pipeline to summarize text. Another might want to build a recommendation system that combines textual information about, say, a video with statistics about how long users watch those videos, in order to show the right next video; that's a different kind of pipeline, where you're computing multimodal vectors. A third company might be trying to build production-grade video generation, a completely different kind of pipeline again. Or take legal documents, or say you're analyzing the financials of a company, an income statement: there you might want a classical method to understand exactly what's going on inside the P&L, because you can't accept any error. And the way a venture capitalist reads a P&L is very different from the way an auditor would. My point is, there are so many different outputs needed that you may actually see a much broader diversity of data pipelines, and of data pipeline vendors, to support all these different use cases.

**Richie:** So I guess once we get a modern AI stack, it's going to be a lot more rich and complex than the modern data stack that was talked about a few years ago.

**Tom:** I think so.

**Richie:** Going back to the idea of data quality: I know one of the buzziest terms at the moment is the idea of data contracts. Can you talk me through what they are and when you'd need one?

**Tom:** Sure. Data quality first: you can imagine data as a manufacturing line. You have raw ingredients that go through a processing facility and then get packaged up. Data quality, and Monte Carlo is the leader here, is really about understanding how effectively the data is coming through. Are there changes in the upstream sources? Is the distribution of the data changing, the volume of the data, the shape of the data? If so, everybody should be alerted, because that's very likely a problem. That's what the data quality movement is about. You have companies like Monte Carlo using machine learning to understand exactly what's happening, and then you also have a test-based approach, where you assert different conditions: there should never be a zero in this particular column, or it should always be numeric, for example. Once a data pipeline is working and you have an observability layer over it, kind of like a Datadog for data, you're in a really good place. The next theme within the modern data stack world is this idea of a data contract, and it's part of a broader theme that parallels what's happened in software engineering.
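The test-based data-quality approach Tom describes, asserting conditions on columns, can be sketched without any framework (tools like dbt tests or Great Expectations do this at scale). The table, column names, and rules below are invented for illustration.

```python
# Minimal test-based data-quality checks: each check returns the rows that
# violate an asserted condition, so an empty result means the assertion holds.

def check_not_zero(rows, column):
    """Assert: this column should never contain a zero."""
    return [r for r in rows if r[column] == 0]

def check_numeric(rows, column):
    """Assert: this column should always be numeric."""
    return [r for r in rows if not isinstance(r[column], (int, float))]

orders = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 0},      # violates the not-zero rule
    {"order_id": 3, "amount": "n/a"},  # violates the numeric rule
]

zero_violations = check_not_zero(orders, "amount")
type_violations = check_numeric(orders, "amount")
print(len(zero_violations), len(type_violations))  # 1 1
```

In a real pipeline these checks would run on every load, and any non-empty result would page the owning team, which is exactly the alerting behavior described above.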
Let me take a step back. The way we used to build software was one very large code base; all the engineers worked on that one code base at the same time. There are some advantages to that, but what we found is that by cutting it up into small pieces, microservices as we call them, and having small teams of engineers build each one, it was much easier for everybody to collaborate. The one requirement was that each team had to declare to the rest of the world: these are the kinds of inputs my system expects, these are the kinds of outputs my system will produce, and these are the guarantees I'll give you about how fast it will do that and how often it will update. This is exactly what's happening in the data world. Twenty years ago, MicroStrategy and Business Objects and Cognos were all controlled by a centralized data team, and access to data was really limited in order to make sure it was highly controlled and accurate. Now we've had a democratization of data: the marketing team has its own analyst and might have its own cloud data warehouse, and the same goes for the sales team, and finance might have the same thing. So data has been distributed, just the way microservices were. What data contracts promise is: let's encode the inputs, the outputs, and the SLA, the expectations around performance, into software, so we can manage all of this effectively. Today, for many of the largest companies, this doesn't exist. So it's about having a software platform, or the change management associated with encoding that, so that if the marketing team is consuming product data, the product team can't just change the format of the data and break a whole bunch of the marketing team's downstream systems.
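Encoding the inputs, outputs, and SLA into software, as just described, might look something like the sketch below. The contract fields, the table name, and the 24-hour SLA are all hypothetical; real implementations live in tools like dbt, schema registries, or dedicated contract platforms.

```python
# A toy data contract: the producing team declares the schema and a
# freshness SLA in code, and any consumer can validate rows against it.
from dataclasses import dataclass

@dataclass(frozen=True)
class Contract:
    name: str
    schema: dict           # column name -> expected Python type
    max_staleness_hours: int

# Hypothetical contract published by a product team for its event feed.
product_events = Contract(
    name="product_events",
    schema={"user_id": int, "event": str, "ts": str},
    max_staleness_hours=24,
)

def validate(contract, row):
    """Return a list of contract violations for one row (empty = valid)."""
    errors = []
    for col, typ in contract.schema.items():
        if col not in row:
            errors.append("missing column: " + col)
        elif not isinstance(row[col], typ):
            errors.append(col + ": expected " + typ.__name__)
    return errors

print(validate(product_events, {"user_id": 7, "event": "click", "ts": "2024-01-01"}))  # []
print(validate(product_events, {"user_id": "7", "event": "click"}))  # two violations
```

The point is the one Tom makes: if the product team changes a column type, the contract check fails loudly at the boundary instead of silently breaking marketing's downstream systems.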
**Richie:** OK, so this sounds like a really effective way for different teams who are working on related data, where one team is downstream of another, to work together more effectively, because they're essentially guaranteed what they're getting from the other team.

**Tom:** That's right. It feeds into this idea of data mesh, where data is distributed throughout the organization and people are each consuming and producing data for each other, as opposed to from a centralized node.

**Richie:** I'd also like to talk a bit about cloud computing. It seems like moving everything from working locally to working in the cloud has been a big trend over the last, well, more than a decade. Is this something you see continuing, or is the pendulum going to swing the other way?

**Tom:** It's absolutely true that everyone was moving from on-prem to the cloud. Now what we're starting to see is hybrid execution, and the term is a little overloaded, because there are two different parts to it. The first: customers want to hold on to their own data. We talked a little bit about Snowflake and Iceberg before, and many of the largest companies, or the privacy-centric companies, want their data stored in their own S3 buckets, or whatever buckets they have. They want to bring the software and the compute to that data, as opposed to sending the data to Salesforce or to Marketo, having the output stay there, and then somehow having to pay a third-party vendor to get it back. Bring the software in, compute here, and the software can leave if ever we decide to change it. That is becoming much more important than it has been in the past. The other part of hybrid execution is a new wave of building applications where some of the processing is done in the cloud, maybe through that architecture, and some of the processing is done in the browser. If we look at technologies like DuckDB, you can run a DuckDB instance inside a Wasm (WebAssembly) container.
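The hybrid-execution pattern being described, push heavy work down to where the data lives and ship only a small result to the client, can be sketched as follows. DuckDB-in-Wasm is the real-world version; since DuckDB is not in the Python standard library, this sketch uses the stdlib's sqlite3 purely to illustrate the same push-down shape, and the table and column names are invented.

```python
# Hybrid execution sketch: aggregate in the "cloud" database, ship only the
# tiny summary, and let the "local" side (in reality, a browser-embedded
# engine like DuckDB-Wasm) query the small extract.
import sqlite3

# "Cloud" side: a table we don't want to ship wholesale.
cloud = sqlite3.connect(":memory:")
cloud.execute("CREATE TABLE events (region TEXT, amount REAL)")
cloud.executemany("INSERT INTO events VALUES (?, ?)",
                  [("emea", 10.0), ("emea", 20.0), ("amer", 5.0)])

# Push the aggregation down; only the summary crosses the wire.
summary = cloud.execute(
    "SELECT region, SUM(amount) FROM events GROUP BY region ORDER BY region"
).fetchall()

# "Local" side: all further analysis runs on the small extract.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE summary (region TEXT, total REAL)")
local.executemany("INSERT INTO summary VALUES (?, ?)", summary)
print(local.execute("SELECT * FROM summary").fetchall())
```

The economics Tom goes on to describe follow directly: the vendor pays for one aggregation instead of hosting every interactive query.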
**Tom:** Say you have a really huge data set, 100 gigabytes. You pre-process some of it in the cloud, take two gigabytes, put it locally, and then all of the analysis and visualization on that two-gigabyte data set is actually done on the user's computer. As a result it's much faster, and it's much more capital-efficient: if you're a vendor with this kind of architecture, you're able to operate with significantly better margins, because you don't pay for as much compute.

**Richie:** Saving money certainly sounds like a good idea. So raw data sets are generally bigger, so they're going to live somewhere centralized, and by the time you've got something processed it's much smaller, so it arguably makes more sense to work on it locally. I suppose the trick is to be able to move fluidly from local to cloud and back again, so you're not worrying too much about where things run.

**Tom:** That's right. That's the hard part.

**Richie:** All right. Another big trend: business intelligence platforms have been conquering everything, certainly the data analytics space, over the last few years. You mentioned Looker, and there are others like Power BI and Tableau and all the rest. Do you see these BI platforms changing at all, or are they mature?

**Tom:** These BI platforms will change. Think about the way BI has evolved over the last 20 years; we talked a bit about this. In the early 2000s, BI was really centralized and controlled by a small number of people in order to ensure accuracy. Then Tableau was formed in 2004 and really hit its stride over the next five to ten years. That was about enabling an analyst to take control and analyze their own data, a completely bottoms-up strategy, with no centralized control, at least at the beginning. So you had a huge pendulum swing from the center to the edge. Then Looker came and said: well,
there are these next-generation cloud databases, cloud data warehouses like Snowflake and BigQuery, so why don't we try to exert some more control while still giving some flexibility? They deployed a modeling language called LookML, which allowed the data team to define a metric and then everyone in the organization to use that definition of, say, revenue. And now, I think, the pendulum is swinging back again: the next generation of companies, like Omni, are trying to allow metrics definitions to be created at the edge and then promoted all the way up through the rungs of a workflow. The big question, the one that you asked, is where AI fits into this. There's a challenge with AI in the BI world. AI has a role in SQL query completion; that seems to be a great use case. I may not know exactly the right syntax for a window function, or for a union across two different tables, or a CTE, so I can complete the query using AI. The big question is whether people will trust these AI systems to answer questions like "What is the company's revenue broken out by region and territory, or by region and product?", because if I ask that question in a slightly different way, if I flip the group-bys, I might get a different answer. Until we really solve that problem, the lack of trust, I think, will be a pretty significant barrier to fully automated "just ask a question and trust that the data is correct." And talking to different data teams, there's another nuance here, which is the interpretation of the data. Even if you have the right data, interpretation is still the really hard part. Is there a statistically significant difference between two averages, and how does that impact the decisions the business ultimately makes?
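The window-function syntax Tom mentions being hard to remember is exactly the kind of thing AI completion helps with. For a concrete picture, here is one such query: revenue alongside its regional total via `SUM(...) OVER (PARTITION BY ...)`. The table and values are invented, and the stdlib's sqlite3 is used to run it (window functions require SQLite 3.25 or later).

```python
# A revenue-by-region query using a window function, of the kind an
# AI assistant might complete for an analyst. Schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("emea", "a", 100.0), ("emea", "b", 300.0), ("amer", "a", 50.0),
])

rows = conn.execute("""
    SELECT region, product, revenue,
           SUM(revenue) OVER (PARTITION BY region) AS region_total
    FROM sales
    ORDER BY region, product
""").fetchall()
for r in rows:
    print(r)
```

Note that unlike a plain `GROUP BY`, the window function keeps every row while attaching the per-region aggregate, which is precisely the distinction that is easy to misremember and easy for a completion model to get right.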
I'm a little more cautious because of some of the some of the chaotic nature of the AI let's say there's two parts then so you've got there helping you write the code in order to actually do the analysis then the interpretation separately and it seems like I I think for simple queries it's certainly like the the SQL generation is pretty good and I agree with you that it's impossible to remember the syntax of a window function so yeah better to get the AI to write that but yeah uh for more complex queries I can certainly see have to be some trust issues there so um in that case do you think um we're going to need need to have humans in the loop there like for a long time in order to just sanity check the SQL I think our bet is that the number of people working in data will actually increase by a big big number multiple because so many many many more functions and many more businesses will now become reliant on data because they need it uh and so what will change is the kinds of tasks that analysts are doing but we'll need many many more of them and on sub of like bi and AI um obviously like the the biggest generative AI application is is chat gbt and so do you think that chat interface is going to replace um like the bi point and click interface or the two things complimentary I think so there's bi and then there's exploratory analytics my sense is in the bi landscape it will probably it probably won't have that much of an impact I mean maybe it will help you find a dashboard or might point you to a data point I think in the exploratory use case there's a much greater value there because you have a question like uh I don't know let's see um you know let's imagine on Coca-Cola like which one of the bottlers has seen the greatest amount of volume growth in the last 12 months and then you might want to dig into that in a bunch of different dimensions like let's say we wanted to tie that to uh contract structures and so in we would jump from the world of structured data 
That is, we'd tie what's happening in the core dashboards to whether there's a particular provision within the contract terms associated with these kinds of bottlers. There I could see generative AI being really helpful, because its ability to classify, and its ability to retrieve different kinds of information that might be written differently across different contracts, would be really useful. And so, it's still early so I might be completely wrong, but I think it'll probably have more of an impact on exploratory analytics than it will on core dashboarding and reporting of business metrics. Yeah, so if you're trying to ask lots of questions quickly, like you do in EDA, then you probably want a chat interface; if you know what you want to build, then just build the dashboard using the traditional tools, and then you're done. Okay, so this leads on: you said that the skills of data scientists and data analysts, their jobs, are going to change. What new skills do they need in order to cope with this new world? Well, I think the productivity expectations of every white-collar worker will now go through the roof. I think everyone will be expected to be 50 to, I don't know, 250% more productive than they have been in the past, because they have these tools at their disposal, and so being familiar with all these different systems and understanding when to use them will be absolutely essential. There's this professor at Duke who requires all his English students to use ChatGPT when they write. It's a bit of a divisive perspective, but I'm very much in agreement with it. The reason is, when those students graduate they'll be at a huge advantage if they understand how to use ChatGPT in a very sophisticated way to write. The trade that he makes with his students is: if there's a single grammatical error, you fail. And so just
as much as the student can benefit from the technology, they also need to assume some responsibility for it. So I think the role will become less about data munging, data movement, and data management, and much more about understanding: is the data correct, what is the right interpretation, how does this apply to the business? Which I think we would all agree is a far more interesting part of the job than modifying a data frame from wide to long, for example. Yeah, that's definitely a task that no data analyst or data scientist appreciates; it was always a pain, the data being in the wrong shape. Okay, so we've been talking about changes in skills. At the level of whole roles or jobs, have you seen any roles becoming more or less popular? Are job titles going to change? I don't think the job titles will change; we haven't seen the impact broadly yet. I think the main difference here is the structure of the organizations, where a lot of the core machine learning and data science functions are now being pushed to fuse with the core engineering teams. This is because the machine learning systems, the AI systems, are now being put into production, and that's a very different place from where the data team used to live, which was downstream analytics, post hoc, not in the path of production. And so one of the broader questions we're wondering about is whether the data team actually starts to live underneath engineering. Ah, have you seen any companies where that's actually happened yet? It's starting to happen, particularly in the smaller companies, where AI is a core part of the feature set, and there's an output from a cloud data warehouse, or some kind of aggregation, that's then being funneled into production. Because all of a sudden, if you think about the classic modern data stack, that whole tool chain, the whole value chain to deliver that data, needs to be production grade by the definition of a site reliability engineer working on a core website. It needs to have three or four nines of uptime reliability, and there needs to be alerting and monitoring around those core systems. And so there's this cultural fusion that I think will need to happen between the classic AI, data science, and machine learning teams and the core engineering teams. We're definitely starting to see that today, but it will take time. Okay, what sort of culture clashes are you expecting between data and engineering teams? Well, the engineering teams, like I said, many of them will carry pagers in case these systems break; they're accustomed to using libraries; they think about shipping the product really fast; they code in a very particular way. If we were to compare the Python of a web application engineer and the Python in an IPython notebook of a data scientist, one has nothing to do with the other, right? In fact, when the core software engineering team looks at an IPython notebook, they say: what is this? I can't take that code and put it into production; I need to reimplement it, or rewrite it in a different language, so it goes into the CI/CD, sorry, continuous integration, continuous deployment, flow and the code path of production. It's completely different from the way that data science has been accustomed to building, and so even the core workflows, like how to commit the code, what the code format looks like, is it packaged as a library, how do I make sure it has tests and checks and performs similarly to the other kinds of Python code within the code base, that all needs to come together. Okay, so it sounds like a lot of data scientists will need to be able to write a great function, create great packages, learn style guides and how to structure their code. Are there any other skills along those lines that you think are going to
be important for data people? No, I think that's it. Well, actually, another skill is this idea of a data product. I was a product manager at Google, and what we would do before deciding to build a product was create a product requirements document, a PRD, which described exactly what it was that we were going to build. We would socialize it within the rest of the company, we would understand the dependencies, and then once that was complete it would pass to the engineering team for implementation, obviously with some back and forth. What we're starting to see is that some of the more sophisticated data engineering and AI organizations are starting to create data PRDs, and they're starting to think about tables or APIs as data products, treated just the way a regular production-grade API would be. So they're starting to look like their own sort of engineering teams, with a data product manager, a data tech lead, and then a bunch of data engineers who are building and maintaining it. So I think that formal way of building products will come to data. Okay, so it's going from that sort of scrappy "I'll just do my analysis in a notebook" to "okay, I actually have to think about who else is reading the code, who else is consuming this", and making sure it's good for public consumption. Okay, so for companies thinking "there are all these new tools available", what do they need to do to take advantage of them? I think today it's all about experimentation, and really understanding where they work and where they don't; the ecosystem is changing really fast. I think the leaders of the next wave will be conversant in AI. They will know how to speak to their leaders and educate them on where the company should be spending time, and where the company should be investing in terms of software. Because the reality is, every board and every C-suite is saying we need AI, we need AI; we're starting to see the
productivity improvements from our competitors. They're looking to the leaders within each individual team and department to educate themselves, and then ultimately develop a strategy on how to leverage this technology internally. And so being an expert in that domain, I think, will lead to lots of promotions, because if you can cut two-thirds of a workforce, or if you can increase revenue by a third by making the sales team that much more effective through automation, there's a lot of value to be created. Absolutely, so if you take advantage of this, you could be the real hero or heroine of your company. Absolutely. Okay, and are there any particular areas where you think data teams, or companies, should be focusing their attention to improve their data capabilities? I think the first is the data pipelines that are going to be necessary to power sales optimization and customer support optimization. Those seem to be the two areas across companies where, for example, we're starting to see startups building fully automated sales development reps, so that instead of a human sending 10 emails per day, these systems are sending a thousand or two thousand emails a week. For those programs to be effective, the data pipelines to inform those outbound campaigns will need to be there. Same on customer support: for those chatbots that are able to deflect two-thirds or more of the inbound customer support queries, the richer the context is, the more data those bots have about the customer in particular, or the FAQs, or the new product features, the better and more effective they'll be. And so I think that's where you'll see a lot of effort and energy, because those are the two levers: in order to make a lot of money in software, you either increase the revenue of your customer or you materially reduce their cost. The sales team is where you're going to increase revenue; the other lever is cost.
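The pipeline point above, that richer context makes outbound campaigns and support bots better, boils down to assembling whatever the data pipeline knows about an account into the model's prompt. A toy sketch; the fields and template are invented, not any vendor's API:

```python
def build_outreach_prompt(account, recent_events):
    """Assemble CRM context into a prompt for an automated SDR email.

    The richer this context, the more specific the generated email;
    an empty pipeline produces a generic one.
    """
    context_lines = [
        f"Company: {account['name']}",
        f"Industry: {account.get('industry', 'unknown')}",
    ]
    context_lines += [f"Recent: {event}" for event in recent_events]
    context = "\n".join(context_lines)
    return "Write a short, specific outbound email using this context:\n" + context

prompt = build_outreach_prompt(
    {"name": "Acme Corp", "industry": "logistics"},
    ["opened pricing page twice", "hired a new VP of Operations"],
)
print(prompt)
```

In a real system the `account` and `recent_events` inputs would be fed by the very data pipelines being discussed, and the returned string would go to an LLM call.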
The customer support team is typically where pretty significant cost comes from, so I would expect that focusing on those data pipelines, and enabling that to happen, will be great. The third-order priority is around marketing, because there's a technique in marketing called account-based marketing: building a website for a particular buyer, like a Coca-Cola or a Procter & Gamble. Historically that's been really difficult to scale, because, as you can imagine, you need a lot of data to be able to do that. Now we're starting to see the next generation of account-based marketing companies do this for every single customer in their universe, because they're just automating it with machines, and there again the data pipelines provide the context. Okay, that's really interesting. So start with the stuff that can have a direct impact on your revenue or your costs, and then go towards personalization, and think about better marketing through personalization. Okay, those seem like pretty strong areas to target. So just before we wrap up, are there any companies that you are particularly excited about right now? Yeah, within the data ecosystem: MotherDuck, which we talked about. I think they have the potential to really change the cloud data warehouse ecosystem with their unique hybrid architecture, and Jordan, the founder and CEO there, was the tech lead on BigQuery and really understands the domain. I think the other realization that we've had as a company is that more than 80% of cloud data warehouse workloads are small enough to process on a modern Mac, and having this hybrid architecture allows companies to do that. The other one is Omni. This is the ex-Looker team, a key part of them, who got together with Chris Merritt, one of the architects of dbt. They're building a modern BI system that balances centralized control and metrics definitions with enabling an individual marketer to create a metric, say around cost of customer acquisition, and building the
workflow to have that move all the way up to the centralized data team, and they're having a lot of success. Those are the two I'd like to highlight today. Okay, MotherDuck and Omni, companies to watch out for then. All right, any final advice for companies wanting to make better use of data? Keep going. It can be hard, and there tends to be a lot of resistance associated with it, but what I've seen in my career, time and time again, is that the companies who move really fast and experiment with the next generation of technologies are often able to find a lot of alpha and develop competitive advantage through it. All right, great advice there. Thank you so much, Tom. The pleasure was mine, thank you, Richie. Hi Tom, thank you for joining me. Pleasure to be here, Richie, thanks for inviting me. Excellent. So we're going to talk a bit about trends, and we'll start with a big one: generative AI has obviously been the big story of the last few years, and it seems to be changing pretty fast, so what trends are you seeing at the moment? I think you're right. It feels like crypto did maybe two or three years ago, where every morning you'd wake up and there'd be a new paper and the world would have changed, and I definitely feel that way. I think we're all waiting with bated breath for the next generation of the models, the ones that Meta has promised, and GPT-5, which seem like they'll be able to work on much longer-duration tasks, so until then we won't know what that looks like. But in the meantime, I think there's definitely been a shift, I would say from last year to this
year: from preferring copilots, which complete your sentences, to agents, which will fully automate work, or at least some fraction of the rote work. Okay, that's very interesting, because the big refrain of the last year or two has been "AI is not going to take your job, it's going to augment it", but agents sort of promise to just automate things and completely take the human out of the loop. So, tell me more about that. Yeah, for sure. We've seen some of the productivity stats on copilots, which are about 50 to 75%; those are ServiceNow and Microsoft numbers, with Microsoft at 75. If LLM agents are anything like mechanical robots, the promise is to increase productivity by about 250%, so one robot would take the place of two to two and a half humans, which is what's happened on automotive manufacturing lines. We're still really early there; we're starting to see the very first applications of this in security automation, or in customer support, where Klarna cut two-thirds of their customer support staff. We can debate the relative effectiveness of the bots, but they did it, and they saved a lot of money doing it. And so I think what we're seeing is that people are really excited about this trend, and the nature of entry-level tasks will change. There was a really interesting Twitter thread about this. One place robots have taken off in the last 10 years is robotic surgery, particularly urology, or areas in the body where there's just a very difficult space to navigate, and as a result robots can move in ways that humans can't, because their fingers, so to speak, are much narrower. There's now this training gap, where the senior surgeons are no longer teaching the junior surgeons side by side; the junior surgeons are now just watching the robot operate. And so as a result there'll be this generation of surgeons, or security analysts, or customer
support reps who are experts in the domain, while the younger ones don't receive the benefit of the training. So that's an interesting dynamic that we're paying attention to. In terms of the technology itself, we're looking for tasks where there's a lot of summarization, maybe some initial customization. Some of these agents are actually starting to write their own queries to hit external databases, to enrich, and then produce a summary output, and then that goes to a classifier that says: is this worth a human's attention or not? So, on the other side of AI: what about non-generative AI, are you seeing any changes there? Well, it's not moving as fast, which is not to say that it's not important. We're starting to see combinations of generative AI and, we'll call them, classical methods. One reason for this is that the generative AI mechanisms tend to be chaotic and non-deterministic: if I ask "how do I reset my password" with a period, and I ask "how do I reset my password" with a space before the period, it's a different question that will elicit a different answer from a generative AI model. And if you start to chain these inputs and outputs across generative AI steps, you might have error that explodes. And so, and this is very early on, we're starting to see companies take the output of a generative AI step and then apply a classical machine learning method to it, to classify it, or just to constrain the error, and then pass it on to the next step, and the next step. We met a company recently where there are three or four different kinds of machine learning being used all in one system. Generative AI is really good at generating ideas, you know, "give me four or five great titles for this blog post that I wrote just now", and then the deterministic methods, the classical methods, are the ones that are really good at picking.
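That division of labor, a generative model proposes and a deterministic method picks, can be sketched in a few lines. Here the "generator" is stubbed with fixed candidates standing in for an LLM call, and the picker is a simple deterministic scorer; both are invented for illustration:

```python
def generate_titles(_post_text):
    # Stub standing in for an LLM call that proposes ideas.
    return [
        "Five Lessons From Our Data Pipeline Migration",
        "Pipelines!",
        "How We Rebuilt Our Data Pipeline and Cut Costs by 40%",
    ]

def pick_best(candidates, keywords):
    # Deterministic picker: score by keyword coverage, break ties by brevity.
    def score(title):
        hits = sum(kw.lower() in title.lower() for kw in keywords)
        return (hits, -len(title))
    return max(candidates, key=score)

titles = generate_titles("my post text")
best = pick_best(titles, ["pipeline", "costs"])
print(best)
```

The point is that the chaotic step (generation) is bounded by a fully deterministic step (selection), which is one simple way to constrain the error the speaker describes.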
And so I think you'll see this nice marriage between the two, where most generative AI applications will not be entirely generative; they won't just be thin wrappers. And as we get to the next step, you'll see what we call constellations of models. To talk about that in one more level of detail: input from the user comes in; there's a classifier, which could be a classical classifier or a generative classifier, that says what kind of query is this, have I seen this before; and then, given that output, it passes it either to a small generative language model that knows what it's talking about, or, if it's a completely new query, to something like GPT-4, because that can handle the whole universe of queries, or to a classical model. Okay, so that hybrid approach is really interesting. Maybe it's using the generative AI as the human interface, because humans like chatting, and then you've got something more deterministic underneath, so you haven't got the error propagation problems. Now, one trend I've seen just from going to a lot of data conferences: last year everyone was saying "you must build stuff with generative AI, it's going to be amazing", and just in the last few months it's changed to "all those generative AI prototypes you built last year failed because your data quality sucks, and now you need to think about data governance and how you improve your data quality". So is there anything interesting you're seeing in that space? Well, yeah, there's a lot around data security. I think you have two different kinds of problems with generative AI security. The first is what happened with the Canadian national airline, when a passenger asked about a bereavement policy and the generative AI created a bereavement policy that the airline was forced to adhere to, even though it had nothing to do with the real policy. So there's this hallucination problem, but there are other challenges with it. To give
you an example: a friend of mine is running a company, and one of his employees installed an LLM on top of the cloud data warehouse and asked "what is this employee's social security number?", and out popped the social security number. So you have data security issues. We've identified five different kinds of security that will be needed around large language models: ensuring the developer environment where they're building the large language model is secure; ensuring that there's no data loss, like that social security issue; having the right access permissions to the database, are you hitting the right Snowflake connections, do they have the right user account; data poisoning attempts, so if I'm downloading an open source model, am I downloading the right one, has it been tampered with, that's sort of software supply chain; and there are others. But this whole space is new, and the challenge is that historically the CISO, the chief information security officer, has been predominantly responsible for securing data, and the average tenure there is about 14 months. It's a very challenging job, because the surface area continues to increase. But now the head of data, maybe even the head of engineering, is responsible for building the systems that use those data in generative AI, and so both of these roles have to unite in terms of securing and making a hermetic seal around the data, and those workflows have not yet been created. Our sense is that for many of the largest companies in the world, which might have hundreds or thousands of LLM-enabled applications on their roadmaps, security is the single biggest blocker, because nobody wants to be fired for the leak. But there's no playbook yet, there are no standard tools, and that's really slowing a lot of the enterprise adoption. Okay, so it seems like making generative AI applications more secure could be a big growth area in the next few months or years.
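The data-loss example above, an LLM happily returning a social security number, is one reason teams put a redaction step in front of model calls. A minimal sketch; the pattern only covers US-style SSN formatting, and real data-loss-prevention tooling goes far beyond a single regex:

```python
import re

# Matches SSN-shaped strings like 123-45-6789 (US format only).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text):
    """Mask SSN-shaped strings before text is sent to an LLM."""
    return SSN_PATTERN.sub("[REDACTED-SSN]", text)

safe = redact("Employee record: Jane Doe, SSN 123-45-6789, dept 42")
print(safe)  # Employee record: Jane Doe, SSN [REDACTED-SSN], dept 42
```

A filter like this sits between the data warehouse and the model, so the prompt that leaves the secure boundary never contains the sensitive value.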
I liked your example of the Canadian airline; I believe the courts ruled that they had to uphold the bereavement policy the AI told the passenger about. And I heard another example of this, where someone was chatting with a car dealership's AI sales rep and persuaded the AI to sell them a car for a dollar. I have no idea whether that's been upheld, but it seems like a good deal if you can pull it off. Okay, so related to this: it seems like a lot of the cloud data warehouses, your Snowflakes and Databricks, are rapidly adding AI features. How do you see data warehouses changing with this rise of AI? They're at the core, right? A lot of the data that's being fed into these large language models comes from the cloud data warehouses themselves. Typically, in the pre-generative-AI era, the map of the modern data stack was: you'd have a source system, an ETL platform, the cloud data warehouse, and then three different consumers: BI, exploratory analytics, and then machine learning. But machine learning was always post hoc; it was never in the path of production; it was always customer segmentation, churn prediction, revenue maximization exercises. Now what's happening is that the cloud data warehouse is basically in the path of production, so to speak, where extracts of that data are being fed into a machine learning pipeline that's chunking up the data, calculating the different vectors, and potentially storing them in a vector database. Then, at inference time, the query goes in, the vectors lead to the documents, and the query plus the documents go to the large language model. And so these cloud data warehouses suddenly have to rethink where they sit in the stack. A lot of them, I mean
many of the historical ones, have not been architected to serve as production databases. So what's happening is there are these new pipelines: there's Spark being pushed in to calculate the different vectors, particularly at large scale, and then you have this category of vector databases, which do a lot of calculations around similarity search. There's a strategic question about how much of the vector database should be a standalone database, and how much of it can actually exist within a cloud data warehouse. The core functions are relatively straightforward: it's clustering, with k-means or another form of clustering, and then cosine similarity, how similar are two vectors, or three, or four. Some of the bigger database companies have announced vector database initiatives, so that's an important competitive dynamic that we'll see play out, and that'll be really critical. I think the other dynamic within cloud data warehousing is a separation of compute and storage, but not in the way that Snowflake talks about it; I literally mean the separation of the query engine from where the data is stored. You saw that in Snowflake's most recent earnings, where a lot of the bigger customers are asking for the data to be stored in Iceberg tables in S3, and so it's very possible that that core centralized data may actually be hit by a different query engine in order to power generative AI pipelines. Okay, so you might be skirting around the data warehouse entirely: you go straight from your cloud storage in S3, or wherever, to the LLM, or into your vector database, and you're just cutting out that middle part, is that right? Yeah, it could be. It depends on the sizes of the data, and again, one of the challenges is formatting the data in the right way.
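Going back to the cosine similarity mentioned above, the core operation behind vector-database similarity search: it is just a normalized dot product. In plain Python, with toy three-dimensional "embeddings" invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: similar documents point in similar directions.
doc_a = [0.9, 0.1, 0.0]
doc_b = [0.8, 0.2, 0.0]   # near doc_a
doc_c = [0.0, 0.1, 0.9]   # unrelated

print(cosine_similarity(doc_a, doc_b))  # close to 1
print(cosine_similarity(doc_a, doc_c))  # close to 0
```

Real embeddings have hundreds or thousands of dimensions, and a vector database's job is largely making this comparison fast over millions of vectors; the arithmetic itself is no more than this.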
A lot of the time, if anybody has used LangChain and tried to process documents: how you chunk the documents, how you cut them up, how you structure them in order to feed the model, is really critical, and the sequencing is also really important. And so you may need to pre-process using a data warehouse or some other tool, and then write to Iceberg tables, with Arrow files underneath, that will then be consumed. But, at least in our perception, there's no standard design pattern quite yet; everybody's building these flows in a very customized way today. Two or three years from now there will be, just the way we had in the modern data stack, right? Source system, Fivetran, Snowflake, Looker: there was your stack. We don't have that yet. Okay, that's very interesting, and it seems like there are a lot of new companies in this area, all doing overlapping parts of the puzzle, so there are probably lots of different ways to get a complete flow at the moment. Do you think these companies are going to become bigger and overlap each other and converge on something, or is there going to be one winner somewhere that does everything? I don't think there'll be one winner. The modern data stack had many different approaches; you could look at Databricks versus Snowflake, Fivetran versus Airbyte, Looker versus any number of the BI products. So I think there will be many different approaches. What's different within the world of AI is that, unlike business intelligence, where everybody wanted a dashboard, the output is different. A company might want to build a generative AI pipeline to summarize text. Another company might want to build a recommendation system that combines both textual information about, say, a video, and statistics about how long users watch those videos, in order to show the right next video.
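On the chunking challenge mentioned above: a minimal fixed-size chunker with overlap shows the basic idea. The sizes here are arbitrary; LangChain-style splitters layer sentence- and token-awareness on top of this:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks; the overlap preserves context
    that would otherwise be lost at chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "x" * 500
pieces = chunk_text(doc, chunk_size=200, overlap=50)
print([len(p) for p in pieces])  # [200, 200, 200]
```

How you choose `chunk_size` and `overlap`, and whether you cut at character, sentence, or token boundaries, materially changes retrieval quality, which is exactly why the speaker calls this step critical.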
That's a different kind of pipeline, where you're computing multimodal vectors. You may have a third company that's trying to build production-grade video generation; that's a third, completely different kind of pipeline. Or you might have legal documents; or let's say you were looking to analyze the financials of a company, take an income statement: you might want a classical method there to understand exactly what's going on inside the P&L, because you can't accept any error. And then the way that a venture capitalist digests a P&L is very different from the way an auditor would look at one. Anyway, my point is that there are so many different outputs necessary that you may actually see a much broader diversity of data pipelines, and of data pipeline vendors, to support all these different use cases. Okay, so I guess once we get a modern AI stack, it's going to be a lot more rich and complex compared to the modern data stack that was talked about a few years ago. I think so. Okay, so going back to the idea of data quality: I know one of the buzziest terms at the moment is the idea of data contracts. Can you talk me through what these are and when you would need one? Sure, yeah. So data quality: you can imagine data is like a manufacturing line, right? You have raw ingredients, they go through a processing facility, and then they're packaged up. Data quality, and Monte Carlo is the leader here, is really about understanding how effectively the data is coming through. Are there changes in the upstream sources? Is the distribution of the data changing, is the volume of the data changing, is the shape of the data changing? And if there are, everybody should be alerted, because that's very likely a problem. That's what the data quality movement is about.
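The checks just described, volume, shape, and value rules over each batch, can start life as plain assertions before any vendor tooling is involved. A toy sketch with invented column names and thresholds:

```python
def check_batch(rows, expected_columns, min_rows):
    """Basic data-quality checks: volume, shape, and a value rule.
    Returns a list of problems; an empty list means the batch looks healthy."""
    problems = []
    if len(rows) < min_rows:                       # volume check
        problems.append(f"too few rows: {len(rows)} < {min_rows}")
    for i, row in enumerate(rows):                 # shape check
        if set(row) != set(expected_columns):
            problems.append(f"row {i} has wrong columns")
    for i, row in enumerate(rows):                 # value rule
        if row.get("revenue") == 0:
            problems.append(f"row {i}: revenue should never be zero")
    return problems

batch = [{"region": "east", "revenue": 120.0},
         {"region": "west", "revenue": 0}]
issues = check_batch(batch, ["region", "revenue"], min_rows=2)
print(issues)  # ['row 1: revenue should never be zero']
```

Tools like the ones the speaker mentions add the observability layer on top: running checks like these on every load, learning the thresholds from history, and alerting when something drifts.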
And so you have companies like Monte Carlo, which is using machine learning to understand exactly what's happening, and then you also have a test-based approach, where you assert different conditions: "there should never be a zero in this particular column", or "it should always be numeric", for example. Once a data pipeline is working and you have an observability layer over it, kind of like a Datadog for data, you're in a really good place. The next theme within the modern data stack world is this idea of a data contract, and it's part of a broader theme that parallels what's happened in software engineering. So let's take a step back. The way that we used to build software was that all the engineers worked on one very large code base at the same time. There are some advantages to that, but what we found is that by cutting it up into small pieces, and having small teams of engineers build all those different pieces, microservices we call them, it was much easier for everybody to collaborate. The one requirement was that each team had to declare to the rest of the world: these are the kinds of inputs my system expects, these are the kinds of outputs my system will produce, and these are the guarantees I'll give you about how fast it will do that and how often it will update. And this is exactly what's happening in the data world. Twenty years ago, MicroStrategy and Business Objects and Cognos were all controlled by a centralized data team, and access to data was really limited, in order to make sure that it was highly controlled and the data was accurate. Now what's happened is we've had a democratization of data, where the marketing team has its own analyst and might have its own cloud data warehouse, and the same for the sales team, and finance might have the same thing. So now it's been distributed, just the way microservices were.
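Those microservice-style declarations, these are my inputs, these are my outputs, these are my guarantees, are exactly what ends up encoded in software. A toy sketch of such a declaration for a data feed; the field names and the freshness figure are invented:

```python
# A declared interface for a data feed: what consumers may rely on.
ORDERS_FEED_CONTRACT = {
    "produces": {"order_id": str, "amount": float, "region": str},
    "freshness_minutes": 60,   # guarantee: updated at least hourly
}

def validate_record(record, contract):
    """Reject records that break the declared schema, so a producer
    can't silently change formats and break downstream consumers."""
    schema = contract["produces"]
    if set(record) != set(schema):
        return False
    return all(isinstance(record[key], typ) for key, typ in schema.items())

ok = validate_record({"order_id": "A1", "amount": 9.5, "region": "east"},
                     ORDERS_FEED_CONTRACT)
bad = validate_record({"order_id": "A1", "amount": "9.5"},  # wrong shape/type
                      ORDERS_FEED_CONTRACT)
print(ok, bad)  # True False
```

A real contract platform would also version these declarations and enforce them in CI, so a format change fails the producer's build instead of a consumer's dashboard.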
What data contracts promise is: let's encode the inputs, the outputs, and the SLA, the expectations around performance, into software, so we can manage all of this effectively. Today, for many of the largest companies, this doesn't exist, so it's about having a software platform, and the change management associated with encoding that, so that if the marketing team is consuming product data, the product team can't just change the format of the data on the marketing team and break a whole bunch of downstream systems. Okay, so this sounds like a really effective way for different teams, who are working on similar bits of data, or on data that's downstream from another team, to work together more effectively, because they're essentially guaranteed what they're getting from the other team. That's right, yeah. It feeds into this idea of data mesh, where data is distributed all throughout the organization and people are each consuming and producing data for each other, as opposed to from a centralized node. So I'd also like to talk a bit about cloud computing. It seems like moving from working on-premises to working in the cloud has been a big trend over the last, well, more than a decade. Is this something you see continuing, or is the pendulum going to swing the other way? So I think we'll start to see the emergence of hybrid execution. It's absolutely true that everyone was moving from on-prem to the cloud; now what we're starting to see is hybrid execution, and there are two different parts to it. The first is that customers want to hold on to their own data, and we talked a little bit about Snowflake and Iceberg before. Many of the largest companies, or the privacy-centric companies, want their own data stored in their S3 buckets, or whatever buckets they have, and what they want is to bring the software and the compute to that data, as opposed to sending that data to Salesforce or sending it to Marketo and then
The output stays there, and then somehow you have to pay a third-party vendor to get at it. So instead: bring all the data here, we'll bring your software in, we'll compute, and then the software can leave if ever we decide to change it. That is becoming much, much more important than it has been in the past. The other part of hybrid execution, and this is why the term is a little bit overloaded, is a new wave of building applications where some of the processing is done in the cloud, maybe through that architecture, and some of the processing is done in the browser. If we look at technologies like DuckDB, you can run a DuckDB instance inside of a WASM container, a WebAssembly container. So let's say you have a really huge data set, like a 100-gigabyte data set: pre-process some of it in the cloud, take two gigabytes and put it locally, and then all of the analysis and visualization on that two-gigabyte data set is actually done on the user's computer. As a result it's much faster and much more capital efficient; if you're a vendor with this kind of architecture, you're able to operate with significantly better margins because you don't pay for as much compute. Okay, so saving money certainly sounds like a good idea. And it sounds like raw data sets are generally bigger, so they're going to be somewhere centralized, and by the time you've got something processed it's going to be much smaller, so that arguably makes more sense to be done locally. I suppose the trick is just to be able to move fluidly from local to cloud and back again, so you're not worrying too much about where things run. That's right, and that's the hard part. Okay. So another big trend is that business intelligence platforms have been conquering everything, certainly the data analytics space, over the last few years. You mentioned Looker, and there are others like Power BI and Tableau and all the rest. Do you see these BI
platforms changing at all, or are they mature? These BI platforms will change. Think about the way BI has evolved over the last 20 years, and we talked a bit about this: in the early 2000s, BI was really centralized and controlled by a small number of people in order to ensure accuracy; that was around the year 2000, let's say. Then Tableau was formed in 2004 and really hit its stride over the next five to ten years; that was about enabling an analyst to take control and analyze their own data, a completely bottoms-up strategy with no centralized control, at the beginning at least. So you had a huge pendulum swing from the center to the edge. Then Looker came and said: well, there are these next-generation cloud data warehouses like Snowflake and BigQuery, why don't we try to exert some more control while giving some flexibility? They deployed a modeling language called LookML, which allowed the data team to define a metric, and then everyone in the organization would use that definition of, say, revenue. Now I think where we're going is a continued swing of the pendulum: the next generation of companies, like Omni, are trying to allow metric definitions to be created at the edge and then promoted all the way through with the right workflow. The big question, the one that you asked, is where AI fits into this. There's a challenge with AI, at least in the BI world. AI has a role in SQL query completion; that seems to be a great use case. I may not know exactly the right syntax for a window function, or the right syntax for a union across two different tables, or a CTE, and so I can complete the query using AI. The big question is whether or not people will trust these AI systems to answer questions like: what is the company's revenue broken up by region and territory, or by
region and product? Because if I ask that question in a slightly different way, if I flip the group-bys effectively, I might get a different answer. Until we really solve that problem, the lack of trust, I think, will be a pretty significant barrier to entry for full automation, where you just ask a question and trust that the answer is correct. And talking to different data teams, there's another nuance here, which is the interpretation of the data. Even if you have the right data, the interpretation is still a really hard part: is there a statistically significant difference between two averages, and how does that impact the decisions the business ultimately makes? So there I'm a little more cautious, because of some of the chaotic nature of the AI. Let's say there are two parts then: there's the AI helping you write the code to actually do the analysis, and then the interpretation, separately. I think for simple queries the SQL generation is pretty good, and I agree with you that it's impossible to remember the syntax of a window function, so better to get the AI to write that. But for more complex queries I can certainly see there being some trust issues. So in that case, do you think we're going to need to have humans in the loop for a long time, just to sanity-check the SQL? I think our bet is that the number of people working in data will actually increase by a big multiple, because many, many more functions and many more businesses will now become reliant on data; they need it. So what will change is the kinds of tasks that analysts are doing, but we'll need many, many more of them. And on the subject of BI and AI: obviously the biggest generative AI application is ChatGPT, so do you think that chat interface is going to replace the BI point-and-click interface, or are the two things
complementary? So there's BI, and then there's exploratory analytics. My sense is that in the BI landscape it probably won't have that much of an impact; maybe it will help you find a dashboard, or it might point you to a data point. In the exploratory use case there's much greater value, because you have a question like, I don't know, let's imagine, for Coca-Cola: which one of the bottlers has seen the greatest amount of volume growth in the last 12 months? Then you might want to dig into that along a bunch of different dimensions; let's say we wanted to tie it to contract structures. So we would jump from the world of structured data, which is what's happening in the core dashboards, and tie it to: is there a particular provision within the contract terms associated with these kinds of bottlers? There I could see generative AI being really helpful, because its ability to classify, and its ability to retrieve different kinds of information that might be written differently across different contracts, would be really useful. So, and it's still early, so I might be completely wrong, I think it'll probably have more of an impact on exploratory analytics than it will on core dashboarding and reporting of business metrics. Yeah, so if you're trying to ask lots of questions quickly, like you do in EDA, then you probably want a chat interface; if you know what you want to build, then just build the dashboard using the traditional tools, and you're done. Okay. All right, so this leads to something you said: that data scientists' and data analysts' skills, or their jobs, are going to change. What new skills do they need in order to cope with this new world? Well, I think the productivity expectations of every white-collar worker will now go through the roof. I think everyone will be expected to be 50 to, I don't know, 250% more
productive than they have been in the past, because they have these tools at their disposal, and so being familiar with all these different systems, and understanding when to use them, will be absolutely essential. There's this professor at Duke who requires all his English students to use ChatGPT when they write. It's a bit of a divisive perspective, but I'm very much in agreement with it. The reason is that when those students graduate, they'll be at a huge advantage if they understand how to use ChatGPT in a very sophisticated way to write. The trade that he makes with his students is: if there's a single grammatical error, you fail. So just as much as the student can benefit from the technology, they also need to assume some responsibility for it. I think the role will become less about data munging, data movement, and data management, and much more about understanding: is the data correct, what is the right interpretation, how does this apply to the business? I think we would all agree that's a far more interesting part of the job than reshaping a data frame from wide to long, for example. Definitely a task that I think no data analyst or data scientist appreciates; it was always a pain, the data being in the wrong shape. Okay, so we've been talking about changes in skills; at the level of whole roles or jobs, have you seen any roles becoming more popular or less popular? Are job titles going to change? I don't think the job titles will change; we haven't seen the impact broadly yet. I think the main difference here is the structure of the organizations, where a lot of the core machine learning and data science functions are now being pushed to fuse with the core engineering teams, and this is because the machine learning systems, the AI systems, are now being put into production. That's a very different place from where the data team used to live.
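For reference, the wide-to-long reshaping mentioned above is exactly the kind of mechanical task meant here. In pandas it's `melt`; a minimal plain-Python sketch of the same operation (table and column names are invented for illustration) looks like:

```python
# Wide-to-long reshape: unpivot each non-id column into its own row,
# producing one (id, variable, value) record per cell.
wide = [
    {"country": "US", "2022": 10, "2023": 12},
    {"country": "DE", "2022": 7, "2023": 9},
]

def wide_to_long(rows, id_col, value_name):
    """A tiny stand-in for pandas' melt over a list of dicts."""
    long_rows = []
    for row in rows:
        for col, val in row.items():
            if col != id_col:
                long_rows.append(
                    {id_col: row[id_col], "variable": col, value_name: val}
                )
    return long_rows

tidy = wide_to_long(wide, "country", "value")
assert len(tidy) == 4  # 2 countries x 2 year columns
```

The long ("tidy") shape is what most plotting and aggregation tools expect, which is why analysts end up doing this reshape so often.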
Downstream, doing analytics post hoc, not in the path of production. So one of the broader themes we're wondering about is: does the data team actually start to live underneath engineering? Ah, have you seen any companies where that's actually happened yet? It's starting to happen, particularly in smaller companies where AI is a core part of the feature set: there's an output from a cloud data warehouse, or some kind of aggregation, that is then being funneled into production. Because all of a sudden, if you think about the classic modern data stack, that whole tool chain, the whole value chain to deliver that data, needs to be production grade by the definition of a site reliability engineer working on a core website: it needs to have three or four nines of uptime reliability, and there needs to be alerting and monitoring around those core systems. So there's this cultural fusion that I think will need to happen between the classic AI, data science, and machine learning teams and the core engineering teams. We're starting to see that today, but it will take time. Okay, what sort of culture clashes are you expecting between data and engineering teams? Well, the engineering teams, like I said, many of them will carry pagers, so if these systems break they get paged. They're accustomed to using libraries, they think about shipping the product really fast, and they code in a very particular way. If we were to compare the Python of a web application engineer and the Python in an IPython notebook of a data scientist, one has nothing to do with the other. In fact, when the core software engineering team looks at an IPython notebook, they say: what is this? I can't take that code and put it into production; I need to reimplement it, or rewrite it in a different language, so that it goes into the CI/CD flow, that is, continuous integration, continuous delivery.
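To make that concrete: moving notebook logic into production usually means extracting it into a plain, importable function with a test that CI runs on every commit. A minimal sketch (the function, data, and test case are hypothetical, for illustration only):

```python
import unittest

# Notebook-style logic pulled out into an importable function
# so it can be versioned, reviewed, and tested like any other code.
def revenue_by_region(rows):
    """Sum the `amount` field per `region` across input records."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals

class TestRevenueByRegion(unittest.TestCase):
    def test_sums_per_region(self):
        rows = [{"region": "EU", "amount": 10},
                {"region": "EU", "amount": 5},
                {"region": "US", "amount": 7}]
        self.assertEqual(revenue_by_region(rows), {"EU": 15, "US": 7})
```

Run with `python -m unittest`; in a CI/CD pipeline, this is the gate a commit has to pass before the code path reaches production.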
That flow, and the code path of production, is completely different from the way data scientists have been accustomed to building. So even the core workflows, like how to commit the code, what the code format looks like, whether it's packaged as a library, how to make sure it has tests and checks and performs similarly to the other Python code in the code base, all of that needs to come together. Okay, so it sounds like a lot of data scientists need to be able to write a great function, create great packages, and follow style guides for how to structure their code. Are there any other skills along those lines that you think are going to be important for data people? Well, actually, another skill is this idea of a data product. I was a product manager at Google, and what we would do before deciding to build a product was create a product requirements document, a PRD, which described exactly what we were going to build. We would socialize it within the rest of the company, we would understand the dependencies, and then once that was complete it would pass to the engineering team for implementation, with, obviously, some back and forth. What we're starting to see in some of the more sophisticated data engineering and AI organizations is that they're starting to create data PRDs, and they're starting to think about tables or APIs as data products, just the way a regular production-grade API would be treated. So they're starting to look like their own sort of engineering teams, with a data product manager, a data tech lead, and then a bunch of data engineers who are building and maintaining it. I think that formal way of building products will come to data. Okay, so it's going from that sort of scrappy "I'll just do my analysis in a notebook" to actually having to think about who else is reading the
code, who else is consuming this, and making sure it's good for public consumption. Okay. So for companies that are thinking, okay, there are all these new tools available: what do they need to do to take advantage of them? I think today it's all about experimentation, and really understanding where the tools work and where they don't; the ecosystem is changing really fast. I think the leaders of the next wave will be conversant in AI: they will know how to speak to their leaders and educate them on where the company should be spending time and where it should be investing in terms of software. Because the reality is, every board and every C-suite is saying: we need AI, we need AI, we're starting to see the productivity improvements from our competitors. They're looking to the leaders within each individual team and department to educate themselves and then ultimately develop a strategy on how to leverage this technology internally. So being an expert in that domain, I think, will lead to lots of promotions, because if you can cut two-thirds of a workforce, or if you can increase revenue by a third by making the sales team that much more effective through automation, there's a lot of value to be created. Absolutely; if you take advantage of this, you could be the real hero or heroine of your company. Absolutely. Okay, and are there any particular areas where you think data teams, or companies, should be focusing their attention to improve their data capabilities? I think the first is the data pipelines that are going to be necessary to power sales optimization and customer support optimization. Those seem to be the two areas across companies where, for example, we're starting to see startups that are building fully automated sales development reps, so that instead of a human sending ten emails per day, these systems are sending a thousand or two thousand emails a week. In order
for those programs to be effective, the data pipelines to inform those outbound campaigns will need to be there. Same on customer support: for those chatbots that are able to deflect two-thirds or more of the inbound customer support queries, the richer the context is, the more data those bots have about the particular customer, or the FAQs, or the new product features, the better and more effective they'll be. So I think that's where you'll see a lot of effort and energy, because those are the two levers: in order to make a lot of money in software, you either increase your customer's revenue or you materially reduce their cost. The sales team is where you're going to increase revenue, and customer support is typically where a pretty significant cost comes from. So I would expect that focusing on those data pipelines, enabling all that to happen, will be valuable. The third-order priority is around marketing, because there's a technique in marketing called account-based marketing: building a website for a particular buyer, like a Coca-Cola or a Procter & Gamble. Historically that's been really difficult to scale, because, as you can imagine, you need a lot of data to be able to do it. Now we're starting to see the next generation of account-based marketing companies do this for every single customer in their universe, because they're automating it with machines, and there again the data pipelines provide the context. Okay, that's really interesting. So start with the stuff that can have a direct impact on your revenue or your costs, and then go towards personalization, thinking about better marketing through personalization. Those seem like pretty strong areas to target. So just before we wrap up, are there any companies that you are particularly excited about right now? Yeah, within the data ecosystem: MotherDuck, which we talked about. I think they have the potential to really change the cloud data warehouse ecosystem with their unique hybrid architecture, and Jordan,
who's the founder and CEO there, was the tech lead of BigQuery, so he really understands the domain. I think the other realization that we've had as a company is that more than 80% of cloud data warehouse workloads are small enough to be processed on a modern Mac, and having this hybrid architecture allows companies to do that. The other one is Omni. This is the ex-Looker team, the key part of them, who got together with Chris Merritt, one of the architects of dbt. They're building a modern BI system that balances centralized control of metrics definitions with enabling an individual marketer to create a metric around, say, cost of customer acquisition, and building the workflow to have that move all the way up to the centralized data team, and they're having a lot of success. Those are the two I'd like to highlight today. Okay: MotherDuck and Omni, companies to watch out for, then. All right, any final advice for companies wanting to make better use of data? Keep going. It can be hard, and there tends to be a lot of resistance associated with it, but what I've seen in my career, time and time again, is that the companies who move really fast and experiment with next-generation technologies are often able to find a lot of alpha and develop competitive advantage through it. All right, great advice there. Thank you so much, Tom. The pleasure was mine; thank you, Richie.