#166 Optimizing Cloud Data Warehouses with Salim Syed, Head of Engineering at Capital One Software

Centralizing Data Management: A Key to Efficient Organization

The importance of effective data management cannot be overstated. In today's fast-paced business world, organizations need to ensure that their data is well-managed and governed. This requires a centralized approach to data management, where all departments and teams work together to create policies and procedures that govern the use of data. By doing so, organizations can reduce waste and inefficiencies, while also ensuring that they are using their data effectively.

To achieve this goal, it's essential to invest in good tooling that allows centralized policies to be enforced. This includes tools that give lines of business ownership and control over their data, as well as cost management systems that help organizations track and manage their expenses. With a robust system in place, organizations can ensure they are using their data effectively and efficiently.
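As an illustration of how central tooling can enforce policy while leaving ownership with the line of business, here is a minimal sketch. The `Dataset` shape and the specific checks (catalog registration, assigned ownership, tokenization of sensitive columns) are hypothetical examples of such policies, not Capital One's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    name: str
    owner: str                      # owning line of business
    registered: bool                # registered in the central catalog?
    sensitive_columns: List[str]
    tokenized_columns: List[str] = field(default_factory=list)

def policy_violations(ds: Dataset) -> List[str]:
    """Central policy checks run uniformly by the tooling; the line of
    business keeps ownership of the data itself."""
    issues = []
    if not ds.registered:
        issues.append("not registered in the central catalog")
    if not ds.owner:
        issues.append("no line-of-business owner assigned")
    untokenized = sorted(set(ds.sensitive_columns) - set(ds.tokenized_columns))
    if untokenized:
        issues.append(f"sensitive columns not tokenized: {untokenized}")
    return issues
```

Because every publishing path goes through the same check, compliance comes "for free" to the publishing team rather than being re-verified after the fact by a central group.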

Another critical aspect of effective data management is budgeting and funding. In today's business environment, costs can vary greatly from one project to another, so a proper budgeting process is essential, including approval workflows that ensure projects are approved and funded correctly. With a clear understanding of how much is being spent on data-related projects, organizations can make informed decisions about where to allocate their resources.
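A budgeting gate of this kind can be sketched in a few lines; the 80% threshold and the idea of routing over-budget requests into an approval workflow are illustrative assumptions, not a prescribed policy.

```python
def needs_approval(request_cost: float, spent_so_far: float,
                   monthly_budget: float, threshold: float = 0.8) -> bool:
    """Route a spend request to an approval workflow once projected
    spend crosses a fraction of the monthly budget (assumed policy)."""
    projected = spent_so_far + request_cost
    return projected > threshold * monthly_budget
```

Requests under the threshold proceed automatically; everything else goes to a human approver, which keeps the workflow lightweight while still catching budget-breaking asks.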

The Importance of Visibility

Visibility into data spending is also crucial for effective data management. Organizations need to be able to see exactly how much they're spending on data-related projects and what that data is being used for. By having visibility into data spending, organizations can identify areas where they can reduce waste and inefficiencies, while also ensuring that their data is being used effectively.
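As a sketch of what such visibility might look like in practice, the snippet below aggregates per-team spend from a query log and flags teams whose spend jumped against a prior period. The log fields (`team`, `cost`) and the 1.5x spike ratio are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

def spend_by_team(query_log: List[dict]) -> Dict[str, float]:
    """Sum query costs per owning team (assumes each log entry carries
    a 'team' tag and a 'cost' in dollars)."""
    totals: Dict[str, float] = defaultdict(float)
    for q in query_log:
        totals[q["team"]] += q["cost"]
    return dict(totals)

def flag_spikes(current: Dict[str, float], previous: Dict[str, float],
                ratio: float = 1.5) -> List[str]:
    """Teams whose spend grew beyond `ratio` times last period's spend;
    a team with no prior spend is flagged as soon as it spends anything."""
    return sorted(t for t, c in current.items()
                  if c > ratio * previous.get(t, 0.0))
```

In a real deployment the "log" would come from the warehouse's query history, and the flagged teams would feed an alert rather than a return value.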

A Journey of Productizing a Tool

The journey of productizing an internally built tool is not an easy one. When Capital One decided to offer the tool externally, they realized they needed to create a multi-tenant system that could protect customer data and ensure that one tenant's data never overlapped with another's. To achieve this, they had to build a new SaaS framework and adopt a different software development practice to help them manage the complexity of the system.
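The tenant-isolation requirement described above can be illustrated with a toy store in which every object is namespaced by its tenant and reads are served only to the owning tenant. This is a sketch of the concept, not Slingshot's actual architecture.

```python
class TenantStore:
    """Toy multi-tenant key-value store: data is namespaced per tenant,
    so one tenant can never read another tenant's objects."""

    def __init__(self):
        self._data = {}

    def put(self, tenant: str, key: str, value) -> None:
        self._data[(tenant, key)] = value

    def get(self, tenant: str, key: str):
        try:
            return self._data[(tenant, key)]
        except KeyError:
            # Same error whether the key is absent or belongs to another
            # tenant, so object existence does not leak across tenants.
            raise KeyError(f"{key!r} not found for tenant {tenant!r}")
```

A production system would enforce the same idea at every layer (storage, compute, network), but the invariant is identical: the tenant identity is part of every lookup.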

One of the key challenges they faced was creating a platform that could scale with their needs. As Capital One continued to grow, they realized that their initial solution wasn't sufficient to meet their demands. They had to re-architect their system to ensure that it could handle the increased traffic and usage patterns.

The Shift Towards a Federated Mindset

As organizations continue to evolve and grow, they're realizing that centralized data management may not be the best approach for all of them. The shift towards a federated mindset is becoming more evident, as companies recognize that each line of business has its own unique needs and requirements when it comes to data management.

By allowing different lines of business to have ownership over their data, organizations can reduce silos and create a single source of truth across the organization. This approach also enables organizations to break free from traditional centralized systems and explore new, more innovative ways of managing their data.

The Future of Data Management

Looking ahead, it's clear that the future of data management will be shaped by a federated mindset. As companies continue to evolve and grow, they'll need to adapt their approaches to data management to meet changing needs and requirements.

One trend that's already emerging is the importance of cloud-based solutions. With more and more organizations moving to the cloud, it's essential to have systems in place that can manage and govern data across multiple platforms. By doing so, companies can ensure that their data is secure, scalable, and easily accessible.

Final Call to Action

When it comes to cost optimization, it's never a once-and-done deal. You need to be vigilant and watch for new users, new usage patterns, and new workloads being introduced into the system. Focus on reducing waste and inefficiencies, rather than just reducing costs, and you'll be able to scale with peace of mind.
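The kind of vigilance described here, watching for runaway workloads and new users as they appear rather than reacting to a month-end bill, can be sketched as two small checks. The field names and the one-hour runtime threshold are illustrative assumptions.

```python
from typing import Iterable, List

def runaway_queries(queries: List[dict], max_seconds: int = 3600) -> List[str]:
    """IDs of queries running longer than the allowed budget; a stand-in
    for a near-real-time alert instead of a month-end surprise."""
    return [q["id"] for q in queries if q["runtime_s"] > max_seconds]

def new_actors(current_users: Iterable[str],
               baseline_users: Iterable[str]) -> List[str]:
    """Users seen this period who were absent from the baseline period;
    each one represents a new usage pattern worth a look."""
    return sorted(set(current_users) - set(baseline_users))
```

Run on a schedule against the warehouse's query history, checks like these turn cost optimization into the continuous practice the interview advocates.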

By following these best practices and embracing a federated mindset, organizations can create effective data management systems that drive growth, innovation, and success. With the right tools, strategies, and approach, companies can unlock the full potential of their data and achieve their business goals.

Episode Transcript

Salim (cold open): "When it comes to cost optimization, it's never a once-and-done deal. You have to be vigilant; it's a constant battle. You need to be watching for new users and new usage patterns. Workloads are always being introduced into the system. So don't focus on reducing cost..."

Host: Salim, it's great to have you on the show.

Salim: Happy to be here.

Host: You are the VP and Head of Slingshot Engineering at Capital One Software. To set the stage, walk us through Capital One Software and Slingshot, and your role as its head of engineering.

Salim: Capital One Software is the enterprise B2B software business of Capital One, dedicated to providing data management solutions to companies operating in the cloud. Our foray into the software business started a long time ago with building many different pieces of software in-house. At the time we were in the cloud, we just didn't have those abilities: how do you manage cost, how do you manage efficiency, how do you do data governance in the cloud? All of these we had to build ourselves. Then we realized we're not unique in this challenge; other companies are facing similar challenges while they're moving to the cloud, or are already in the cloud and now experiencing them. So our offering is to help other companies be more efficient in the cloud.

Host: That's great. Maybe expand a bit on Slingshot as well, and what Slingshot is.

Salim: Slingshot is our first B2B software product. It's a data management solution that helps companies optimize cost, and manage and accelerate their journey into Snowflake. It really allows you to remove wastage and inefficiencies from the way you're using your data platform.

Host: I want to center today's discussion on how organizations can maximize the value of managing their data in the cloud. In many ways, centralizing your data in a cloud data warehouse and building a modern data platform are table stakes for any organization trying to be data-driven, and a lot of data leaders have this top of mind. Walk us through why that is, and why nailing data management and data infrastructure should be top of mind for every data leader today.

Salim: The way you get value in a business really starts from data. The insights that come out of data are going to drive your business; they're going to help you with your customers and make their lives easier. So data is crucial. What we're seeing is that when you move to the cloud, the amount of data and the types of data increase exponentially. If you have all that data coming in and it's not well managed and well governed, you don't know where the data is, or the right access and security around it, and that makes things more challenging: either the business can't get insights in a timely manner, or you create silos that prevent teams from accessing each other's data and seeing the value that comes with sharing data across your organization. So good data management is really crucial right from the start.

Host: I couldn't agree more, because ultimately you're creating a pipeline, from collection to enabling people to draw insights, and that pipeline needs to be resilient, highly efficient, and well managed. Zooming out to the broader industry context: you work with a lot of industry partners and have worked on data management extensively at Capital One internally. Where do you think we are today when it comes to organizations effectively managing their data, as organizations mature in their data management?

Salim: This is a very good question. Different companies are at different places in their maturity of data governance and cloud adoption. But one place where I see friction is that a lot of companies move to the cloud while keeping an on-premise mindset. Let me explain what that means. In the on-premise world, you had a centralized team managing all of your data. They were the ones saying, "this is the data you can have"; they were doing the data governance, the data publishing, all of that. That is hard to maintain in the cloud, especially because you now have an explosion of data, as I explained, and all the businesses want to move at the speed of their demand. They can't be dependent on a central team, and the central team doesn't have the domain expertise across the lines of business, so it becomes a bottleneck in the cloud. That's where we see a lot of friction: adopting this new mindset where you allow businesses to move at the speed they need, while being careful about two things. One is creating silos across the businesses; the second is how you enforce centralized enterprise governance policies uniformly across the organization. To find that balance, the way Capital One looks at things is: you have central policy and central tooling, but federated ownership. The centralized team creates the policies and makes sure the tooling enforces them, putting the right guardrails around how you publish, access, and govern data, but then gives ownership to the lines of business so they can publish on their own and find data on their own without the central team as a bottleneck. I see that as the biggest challenge for companies in truly adopting the cloud mindset.

Host: That's really great. Expanding on that cloud mindset: what have you seen as ways to deal with this challenge of taking a federated approach while still centralizing data governance and controls? Walk us through the solution landscape in a bit more detail, and how data leaders should approach this aspect of data management.

Salim: It starts with what Capital One believes is the way to solve this problem, which is centralized tooling and centralized policy. I'll give you two examples: publishing to your data environment, and then consumption. First, publishing. One approach is to create centralized policies, hand them to the lines of business, and let them publish data into their environments; but from a central team it becomes very difficult to guarantee that everyone is following the practices. So what we've done is build central tooling for publishing data that makes sure all the governance needs are met: are you registering your data, are you checking for data quality, are you identifying what is sensitive and what is not, are you making sure sensitive columns like credit card numbers or personal information are encrypted or tokenized? All of that is a governance need for a publishing pipeline, and we've made it transparent to the business: if they use our central tooling, all of that comes for free. Not only does this help you be more compliant, but by making sure everything is registered and data quality is collected, it also helps on the consumption side. When our users and employees access data, they can find relevant data very quickly, because the data is registered in the right places and we're collecting the right metadata. We also give consumers enough information about the data that they can trust it: the results of data-quality checks, sample data, min/max data profiles. Then, in the same experience, we let you request access to the data, so the request goes to the right owner with your business justification, and they can approve it. The point I'm trying to make is that you want all of the data governance needs to be as easy and transparent as possible, so they don't become a hindrance to the developers, data producers, or data consumers getting their jobs done. You get both benefits: compliance with the regulatory requirements, and the ability to find the data easily, get to it quickly, and get value out of it.

Host: That's really great. You've covered the data governance aspect, but one thing you mentioned earlier was the need to develop this federated mindset and let business teams lead with their data while keeping centralized policies. What have you seen as effective ways teams can organize themselves around this model as data management evolves? Walk us through how successful data teams, producers, and consumers have organized themselves as a consequence of this shift in mindset.

Salim: The people who know the data best are the businesses. They have the domain expertise; they understand what the data means and how to use it, and you want them to own it. That's where the mindset comes in: they're the ones who know it, so give them the tooling that lets them publish data quickly and ingest metadata about the data, so everyone benefits. The other thing to make sure of is that data has proper access control, but you don't want the architecture itself to limit data sharing between organizations. If the architecture allows sharing, you can build access controls on top of it. But as soon as you create an architecture where you have to copy data out of one place into another and jump through multiple hoops, it becomes very hard and slows down data sharing. Data sharing is going to be a key component of any organization, and you want to make it easy, with the right controls in place.

Host: Speaking of controls, we're talking about data governance, and there's sometimes a bit of friction, as you mentioned, between centralized governance policies and a federated model. Who should be the stakeholders involved in setting up a data governance strategy? Walk us through what a successful data governance journey looks like for organizations as they set that up.

Salim: You have to have buy-in from your leadership, because it requires buy-in from the businesses. The businesses also need their own risk officers who allow them to manage their own risk, because every line of business has different risk around it. You want to make sure they have the right ownership and the right risk reviewers within their lines of business, as well as a central policy. The way it really works is: if you have leadership that says, "this is the way we're going to do it, and this is the benefit the whole organization will get," then you have a lot of buy-in. Without buy-in from leadership, it becomes very hard to start this from the ground up. So that, I think, is a requirement.

Host: I couldn't agree more on the importance of buy-in from leadership. Speaking of buy-in, in a lot of ways the other side of cloud transformation and effectively managing data in the cloud is ensuring that people have the tooling and skills to adopt data, which is much more of a people-and-tooling problem than purely a data management problem. Walk us through good examples of data adoption you've encountered at Capital One and with Capital One Software customers.

Salim: It comes down, again, to leadership, but it's a data mentality. You want everyone to be educated on making data-driven decisions, and it needs to show from the top down, everywhere; even during development processes you should be making decisions based on data. That's the culture you want to create first. On top of that, you want to make it very easy to find, access, and trust data. The democratization of data is very important in any organization: not just data engineers getting access to data, but making it so easy that anyone with the right access and need can get value from the data. The third piece is education: train the trainers, and even when there are inefficiencies in the way someone is accessing data, make it a teachable moment. Teachable moments have a trickle-down effect across the organization.

Host: That's great. Expanding on providing access to data: one common anti-pattern we see in organizations today is that there's limited context on how data is useful and could be used, and many organizations are trying to solve this with a metadata platform or a data catalog. I'd love for you to comment on the importance of surfacing metadata and providing a data catalog for the wider organization, and effective ways you've seen to provide that context.

Salim: Metadata is everything, and the catalog is the key to it. Metadata lets you know about your data, not just its business context or technical context. It's actually becoming bigger than how metadata was used in the past, which was about the catalog: you kept the business and technical metadata. Now there's a concept called passive metadata, which is all the cost, security, and resiliency associated with a data set, tracked along with the data. Think about a table in your environment: how often is it used, is anyone using it, how often is it updated, how much does it cost to maintain, how much does it cost to access? Those are all passive metadata. The bigger a view you have across all the metadata, the more of an edge it gives a company: not only information about what data you have in your environment, but what's valuable, what's used more, which data sets are related, and how often people join two data sets together to create an insight. All of that becomes very important. You always want to start with a catalog, but it needs to expand beyond static metadata. It needs to be live: everything that's happening to that data needs to be collected at the same time, to give you even more insight into your operational excellence.

Host: This makes me think about the importance of providing the business context on how data is being used: which teams are using it, in which queries, in which tables. I couldn't agree more on the importance of seeing how that data is being transformed. Now, as we discuss the challenges organizations face with data management, I'd love to learn more about Capital One Slingshot. Walk us through in a bit more detail what Slingshot aims to solve, since in many ways the challenges we're discussing today were challenges faced by Capital One itself.

Salim: Before I get into Slingshot, I want to explain the concept of optimization. When you are on premise, your data platform costs are pretty much static: you buy a certain number of servers, you have some constraint on how much you can use, but the cost doesn't fluctuate until you buy a bigger server or expand, and that took four to five months. When you move to the cloud, for the first time cost depends on how you use it and how much you use. The more compute you use and the more queries you run, the higher your cost will be. What that means is there's room for a lot of inefficiencies; you probably had them in the old environment too, but there they affected your performance, not your cost. Now they affect your cost. So what Slingshot does is help you optimize your cost, in a few steps. First, it provides visibility into your cost: where costs are spiking, where they're high or low. Second, it gives you near-real-time alerts and insights into the cost drivers. The last thing you want is to wait until the month-end bill comes and realize you had one query running for 30 days, a runaway query; you want to know about that as quickly as possible. Third, not everyone is a Snowflake expert who knows how to optimize their queries or their server configuration, so when we find an inefficiency we give you a recommendation: this is why we're seeing the inefficiency, and this is exactly what you can do, or here are a couple of options, and the tool lets you make the changes to your settings. All three together let you not only save costs and remove inefficiencies, but also keep everything that happens in your Snowflake environment well managed and well governed, because we also give you a way to provision and change resources through an approval process with proper tagging. That, in a nutshell, is what Slingshot does.

Host: That's great. On cost optimization, maybe we should cover that in a bit more depth. Walk us through the drivers behind ballooning costs when managing data in the cloud.

Salim: The way I see it, there are four areas where cost can be variable and can explode if you don't manage it well. The first is compute cost: modern data warehouses and data platforms always have a compute aspect that is separate from storage, and the more queries you run, the more you spend. The second is the query itself: query optimization. The third, and I'll go into detail in a second, is data set optimization: how much data you store and whether it's modeled correctly. The last one is environment optimization: you have lower environments (dev, QA) and production, and a lot of inefficiencies can happen in the lower environments as well.

Let's go back to the first one, compute. What we understood is that there's no one size that fits all times of day. Workload usually goes up and down based on the time of day and the day of the week, and you want your Snowflake resources to size up and down accordingly; today Snowflake doesn't do that for you, it gives you one size. So it's very important to know the ups and downs and make sure your compute follows them; that way you're spending most effectively. Also make sure your servers are shut down when not in use, and that you have timeouts on your queries so a runaway query doesn't keep a server up. Watching for those is very important.

The second is query optimization. The longer a query runs, the more it costs. A badly written query, one that does a Cartesian join for example, can run forever, cost you a lot of money, and you won't even know about it unless somebody is watching. So it's very important not only to build alerts for runaway queries, but also to provide advice on how to rewrite the query to get the best performance and lower the cost. Our Slingshot tool does that as well.

The third is data set optimization. I call it that because it has a few aspects. One is storage cost: you'd think storage is much cheaper in the cloud than on premise, but when you're dealing with multiple petabytes of data, the cost will sneak up on you. So it's very important to have a retention strategy for your data: know how long you want to keep it, based on business or regulatory requirements, and figure out a way to keep your storage from growing without bound. Otherwise, even if storage starts at 5% of your cost, by year three it can be 40%, because you're not purging or archiving anything old. Second, for the data you load into your warehouse, you have to understand the consumption patterns as well. There's no point loading something in real time when it's consumed once a month, because loading data has a cost too. These are the insights you want a tool to surface: what is the best way to load the data given how it's used?

The last point is environment cost. One thing we noticed is that when you're building data pipelines in lower environments, there's a lot of room for inefficiency. For example, we made a rule that in a dev environment you can't have a warehouse larger than, say, small, and just by enforcing that rule we saw significant savings. If there's a need for something larger, you go through an approval process. You also want to enforce that jobs you're testing or developing aren't running while you're not working. So pay attention to the lower environments: even though they're a much smaller percentage of total cost, they can balloon out of control if you don't apply the same inefficiency checks you apply in production. The point I'm trying to make is: don't forget about the lower environments, because that cost can sneak up as well.

Host: I love this overview, and I love how you thought about all of these edge cases and cost drivers when designing Capital One Slingshot. Switching gears slightly: thinking about how leaders should approach their data management journey to avoid ballooning costs, a lot of organizations are still early in their cloud journeys and still setting up their cloud infrastructure. What advice would you give them to make sure they're driving as much ROI as possible from data management? How do they avoid a mode where investment and costs balloon without necessarily driving much ROI from data activities?

Salim: At times, companies moving to the cloud don't see what's right around the corner: they go, and then they face all these challenges around governance, cost, or accessing data. My recommendation has always been to think about data governance and data management when you're starting the journey to the cloud. Once you've opened Pandora's box, it's much harder to put everything back in. If you don't have your data registered across your organization, you'll have to go back and collect all of it; but if you think about a central tool and central policy up front, then any new data that's created will automatically be well managed and well governed. So the first piece is to invest in good tooling that enforces the centralized policies but gives ownership to the lines of business. On the cost side I'd recommend the same: from the beginning, put in the right policies on how you're going to do chargeback to your lines of business. Are you going to have a budget or not? Are you going to have a way to request more funding? All of that needs to be part of a data platform in this modern world. Since it's a variable cost, you need proper funding, proper budgeting, and a proper way of asking for more resources, with the right approval workflows built in. And then visibility into how you're spending your money is very important.

Host: Couldn't agree more. What's interesting, hearing about your journey now leading Capital One Slingshot, is that in a lot of ways Slingshot was built on Capital One solving the problems we're discussing right now. Walk us briefly through the journey of productizing an internally built tool for the wider market. What changed, if anything, in your approach as you productized Slingshot?

Salim: Wow, that's a very good question. The first thing we had to address was that the tool we built within Capital One was very specific to Capital One; there was a lot of hidden integration with Capital One infrastructure. So we had to build a SaaS platform that was multi-tenant: it has to protect customer data and make sure tenants' data doesn't overlap. We had to create a whole new SaaS framework and a different software development practice to support this journey. One thing you'll see in software Capital One builds is that security, performance, scale, and resiliency are built into our DNA. Everything we build is built as a hardened platform, and that's how we've always started. Other startups may start by just providing the features and figure out how to harden them later; we've always made sure the product you see is hardened, enterprise-grade software.

Host: As we're closing our episode today, Salim: looking ahead, what are trends you're excited about when it comes to organizations managing their data effectively? How do you think the landscape will evolve over the foreseeable future?

Salim: I think it's already evolving into that federated mindset. More and more companies realize that it's going to be very hard to manage everything from a centralized point of view, so that's the shift that's happening, though not as fast as I'd like to see. But it's clearly evident from the strategies of different companies. The way they execute is a little different: they may keep data across multiple platforms, but what's interesting is having a central catalog and a central access policy across all your data. Those are the key parts: that way you break the silos, and you have a single way to get to data even though it's owned by different lines of business.

Host: That's definitely an exciting development, seeing that federated mindset evolve and empower different lines of business to use their data effectively. Finally, Salim, it was great having you on the show. Before we wrap up, do you have any final call to action or notes to share with listeners?

Salim: Absolutely. This is what I say quite a lot: when it comes to cost optimization, it's never a once-and-done deal. You have to be vigilant; it's a constant battle. You need to be watching for new users and new usage patterns, because workloads are always being introduced into the system. So don't focus just on reducing cost; focus on reducing waste and inefficiencies, and you'll be able to scale with peace of mind.
cost optimization it's never uh once and done deal you have to be vigilant it's a constant uh battle so you need to be vigilant you need to be watching for a new users new usage new usage pattern workloads are always introduced in the system so don't focus on reducing cost I would say focus on reducing waste and reducing inefficiencies in the system and then you will be able to scale with with peace of mind knowing that the the money you're spending is going to the value generating for the business that's really great Sal it was great having you on data frame thank you so much it's been my pleasure ohwhen it comes to the cost optimization it's never a once andone deal you have to be vigilant it's a constant uh battle so you need to be vigilant you need to be watching for a new users new usage new usage pattern workloads are always introduced in the system so don't focus on reducing cost Sal say it's great to have you on the show yeah happy to be here so you are the VP and head of slingshot engineering at Capital One software uh so maybe first to set set the stage walk us through Capital One software and slingshot and your role as it's head of engineering so Capital One software is a Enterprise B2B software uh software business of Capital One uh it's dedicated to providing data Management Solutions to uh companies that operating in the cloud and um you know our foray into into the software business you know it started a long time ago with building so many different softwares in-house softwares at the time we were in the cloud we just didn't have those abilities how do you manage cost how do you manage efficiency how do you do data governance in the cloud all these we had to build ourselves then we also realize that you know this we're not unique to this challenge there are other companies that are facing Sim their challenges while they're going to the cloud or already in the cloud and now experiencing these challenges so you know our offering is to help other 
companies be more efficient in the cloud.

That's great. Maybe expand a bit on Slingshot as well, and on what Slingshot is.

Slingshot is our first B2B product. It's a data management solution that helps companies optimize cost and manage and accelerate their journey onto Snowflake, and it allows you to remove waste and inefficiencies from the way you're using your data platform.

That's really great. I want to center today's discussion on how organizations can maximize the value of managing their data in the cloud. In many ways, centralizing your data in a cloud data warehouse and building a modern data platform are table stakes for any organization trying to be data-driven today, and a lot of data leaders have this top of mind. So maybe, for a bit of background, walk us through why that is, and why nailing data management and data infrastructure should be top of mind for every data leader today.

The way you get value in a business really starts from data. Understanding data, and the insights that come out of it, is going to drive your business; it's going to help you with your customers and make their lives easier. So data is crucial, and what we're seeing is that when you move to the cloud, the amount of data and the types of data increase exponentially. If all that data is coming in and it's not well managed and well governed, if you don't know where the data is or whether the right access controls and security are around it, it becomes more challenging: either the business can't get insights in a timely manner, or you create silos that prevent you from accessing each other's data and seeing the value that comes with sharing data across your organization. So good data management is crucial right from the start.

I couldn't agree more, because ultimately you're creating a pipeline from collection to insight, and that pipeline needs to be resilient, highly efficient, and well managed. Maybe zooming out to the broader context of the industry: you work with a lot of industry partners and have worked on data management extensively at Capital One internally. Where do you think we are today when it comes to organizations effectively managing their data? Where are we on this journey as organizations mature their data management?

Ah, this is a very good question. Different companies are at different places in their maturity of data governance and adoption of the cloud. But one of the places where I see some friction is that a lot of companies are moving to the cloud while they still have an on-premises mindset. Let me explain what that means. In the on-premises world, you had a centralized team managing all of your data. They're the ones saying, "Hey, this is the data you can have"; they're doing the data governance, the data publishing, all of that. That is really hard to maintain in the cloud, especially because now you have an explosion of data, as I explained, and the businesses all want to go at the speed of their demand. They can't be dependent on a central team, and the central team doesn't have domain expertise across the lines of business, so it becomes a bottleneck in the cloud. That's where we see a lot of friction around adopting this new mindset: you want to allow businesses to move at the speed they need, but there are two things you have to be really careful of. One is creating silos across the businesses, and the second is how you enforce enterprise-wide centralized governance policies uniformly across your organization.

To find that balance, the way Capital One looks at things is: you have central policy and central tooling, but federated ownership. What that allows the centralized team to do is create the policies and make sure the tooling enforces them, or puts the right guardrails around how you publish data, how you access data, and how you govern data, but then it gives ownership to the lines of business, so they can move, publish on their own, and find data on their own, without a bottleneck on the central team. I see that as the biggest challenge for companies to truly adopt the cloud mindset.

Okay, that's really great. Maybe expanding on that cloud mindset: what have you seen as effective ways to deal with this particular challenge of having a federated approach while still centralizing data governance and controls? Maybe walk us through the solution landscape in a bit more detail, and how data leaders should be approaching this aspect of data management.

It starts with what Capital One believes is the way to solve this problem, which is centralized tooling and centralized policy. I'll give you two examples: publishing into your data environment, and then consumption. First, publishing. One way to do it is to create centralized policies, hand them to the lines of business, and have them publish data into their environments; but from a central team it becomes very difficult to guarantee that everyone is following the practices. So what we've done is build central tooling that publishes your data while making sure all the governance needs for publishing are met. Are you registering your data? Are you checking for data quality? Are you identifying what is sensitive and what is not? Are you making sure that sensitive columns, like credit card numbers or personal information, are encrypted or tokenized? All of that is a governance need for a publishing pipeline, and we've made it transparent to the business: if they're using our central tooling, all of that comes for free.

Not only does this help you be more compliant, but because everything is registered and data quality is collected, it also helps on the consumption side. When our users and employees are accessing data, first of all they can find relevant data very quickly, because all the data is registered in the right places and we're collecting the right metadata. We also give the consumers of the data enough information to trust it: the results of the data quality checks, sample data, min/max values, a data profile. Then, in the same experience, we let you request access to the data, so the request goes to the right owner, you provide your business justification, and they can approve it. The point I'm trying to make is that you want all the data governance needs to be as easy and as transparent as possible, so they don't become a hindrance to the developers, the data producers, or the data consumers getting their job done. And you get both benefits: you get compliance and meet all the regulatory requirements, and it's easy to find the data, get to it quickly, and get value out of it.
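The publishing guardrails described here (register the dataset, record quality metadata, flag and tokenize sensitive columns) can be sketched in a few lines. This is a minimal illustration, not Capital One's actual tooling: the name-based classifier, the column patterns, and the digest-based tokenizer are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical patterns for spotting sensitive columns by name.
SENSITIVE_PATTERNS = [r"ssn", r"credit_card", r"card_number", r"email", r"phone"]

def is_sensitive(column_name: str) -> bool:
    """Flag a column as sensitive if its name matches a known pattern."""
    return any(re.search(p, column_name.lower()) for p in SENSITIVE_PATTERNS)

def tokenize(value: str) -> str:
    """Stand-in tokenizer: replace the value with a one-way digest.
    A real pipeline would use a vaulted, reversible tokenization service."""
    return hashlib.sha256(value.encode()).hexdigest()[:16]

def publish(table_name: str, rows: list[dict], registry: dict) -> list[dict]:
    """Apply the guardrails before data lands in the warehouse: register
    the dataset, record basic quality metadata, tokenize sensitive columns."""
    columns = list(rows[0].keys()) if rows else []
    sensitive = [c for c in columns if is_sensitive(c)]
    registry[table_name] = {
        "columns": columns,
        "sensitive_columns": sensitive,
        "row_count": len(rows),  # basic quality metadata for consumers
    }
    return [
        {c: (tokenize(str(v)) if c in sensitive else v) for c, v in row.items()}
        for row in rows
    ]

registry = {}
published = publish(
    "customers",
    [{"customer_id": 1, "email": "a@example.com", "segment": "retail"}],
    registry,
)
```

A production pipeline would classify by content as well as by name, and would tokenize through a vaulted service rather than a one-way hash, but the shape is the same: governance happens inside the publish path, so the line of business gets it for free.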
You mentioned a lot here about the data governance aspect, but one thing you brought up earlier was the need to develop this federated mindset: to let business teams lead with their data while having centralized policies. What have you seen as effective ways for teams to organize themselves around this model as data management evolves? Maybe walk us through how successful data teams, producers, and consumers have organized themselves as a consequence of this shift in mindset.

The best people to know the data are the businesses. They have all the domain expertise; they understand what the data means and how to use it, and you want them to own it. That's where the mindset comes in: they're the ones who know it, so give them the tooling that allows them to publish data quickly and to ingest the metadata about the data, so everyone can benefit. The other thing you need to make sure of is that data comes with proper access control, but you don't want the architecture itself to limit data sharing between organizations. If the architecture allows sharing, you can build access controls on top of it. As soon as you create an architecture where you have to copy data out of one place into another and go through multiple hoops, it becomes very hard and it slows down data sharing. Data sharing is going to be a key component of any organization, and you want to make it easy, with the right controls in place.

Speaking of controls, we're talking about data governance, and there's sometimes a bit of friction, as you mentioned, between having centralized data governance policies and a federated model. Who should be the stakeholders involved in setting up a data governance strategy? Maybe walk us through in a bit more detail what a successful data governance journey looks like for organizations as they set that up.

You have to have buy-in from your leadership, because it requires buy-in from the businesses. The businesses also have to create their own risk officers who allow them to manage their own risk, because every line of business has different risk around it. So you want to make sure they have the right ownership and the right risk reviewers within their lines of business, as well as a central policy. The way it really works is: if you have leadership that says, "This is the way we're going to do it, and this is the benefit the whole organization is going to get," then you have a lot of buy-in. Without buy-in from leadership, it becomes very hard to start this from the ground up. So that, I think, is a requirement.

I couldn't agree more, especially on the importance of buy-in from leadership. Now, speaking of buy-in, I think in a lot of ways the other side of cloud transformation and effectively managing data in the cloud is ensuring that people have the tooling and the skills to adopt data, which is as much a people and tooling problem as it is a data management problem. So maybe walk us through good examples of data adoption that you've encountered at Capital One and among Capital One Software customers.

It really comes down to, again, leadership, but it's a data mentality. You want everyone to be educated on making data-driven decisions, and it needs to show from the top down, everywhere; even during development processes, you're making decisions based on data. That's the culture you want to create first. On top of that, you want to make sure it's very easy to find, access, and trust data. The democratization of data is very important in any organization: not just data engineers getting access to data, but making it so easy that anyone with the right access and a need can get value out of it. The third piece is education: do train-the-trainer sessions, and even when there are inefficiencies in the way someone is accessing data, make it a teachable moment. Teachable moments have a trickle-down effect across the organization. That's how I see it.

Okay, that's great. Maybe expanding on the aspect of providing access to data: you mentioned effective ways of surfacing data, and I think one common anti-pattern we see in organizations today is that there's really limited context on how data is useful and how it could be used. A lot of organizations are trying to solve this with a metadata platform or a data catalog. I'd love for you to comment on the importance of surfacing metadata and providing a data catalog for the wider organization, and on effective ways you've seen to provide that context.

Metadata is everything, and the catalog is key to that. Metadata allows you to know about your data, and not just the business context or the technical context. It's actually becoming bigger than how metadata was used in the past. It used to be about the catalog: you kept the business and technical metadata. Now there's a concept called passive metadata, which is all the cost, security, and resiliency associated with a dataset, tracked along with the data. Think about a table in your environment: how often is it used? Is anyone using it? How often is it updated? How much does it cost to maintain that table? How much does it cost to access it? Those are all passive metadata, and the bigger the view you have across all the metadata, the more edge it gives a company. It's not only information about what data you have in your environment, but what's valuable, what's used more, which datasets are relatable, and how often people join two datasets together to create an insight. All of that becomes very important. So you always want to start with a catalog, but it needs to expand beyond static metadata. It needs to be live: everything that's happening to the data needs to be collected at the same time, to give you even more insight into your operational excellence.

Yeah, and this makes me think about the importance of providing the business context on how data is being used: which teams are using it, in which queries, in which tables. I couldn't agree more on the importance of seeing how that data is being transformed. Now, as we're discussing a lot of the challenges organizations face with data management, I'd love to also learn more about Capital One Slingshot. Maybe walk us through in a bit more detail what Slingshot aims to solve, since in a lot of ways the challenges we're discussing today were challenges faced by Capital One itself.

Before I get into Slingshot, I want to explain the concept of optimization. One of the things that happens in the cloud is this: when you're on premises, your data platform costs are pretty much static. You buy a certain number of servers, and you have some constraint on how much you can use, but the cost doesn't fluctuate until you buy a bigger server or expand,
and that took four to five months to do. When you move to the cloud, for the first time cost depends on how you use the platform and how much you use it. The more compute you use and the more queries you run, the higher your cost is going to be. What that means is there's room for a lot of inefficiencies, ones you probably had in the old environment too, but there they affected your performance, not your cost. Now they affect your cost.

So what Slingshot does is help you optimize your cost, and it does that in a few steps. First, it provides visibility into your cost: where costs are spiking, where they're high or low. Second, it gives you near-real-time alerts and insights into the cost drivers. The last thing you want is to wait until the month-end bill comes and realize you had one query running for 30 days, a runaway query; you want to know about that as quickly as possible. Third, not everyone is going to be a Snowflake expert who knows how to optimize their queries or their server configuration. So when we find an inefficiency, we give you a recommendation: this is why we're seeing the inefficiency, and this is exactly what you can do (or a couple of options), and the tool lets you make the changes to your settings. All three together let you not only save costs and remove inefficiencies, but also keep everything that happens in the Snowflake environment well managed and well governed, because we also give you a way to provision and change resources through an approval process with proper tagging. That, in a nutshell, is what Slingshot does.
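The near-real-time alerting described above can be approximated even without a dedicated product. The sketch below is illustrative: the query-log structure and the two-hour threshold are invented, though in Snowflake the same fields could plausibly be sourced from the ACCOUNT_USAGE QUERY_HISTORY view.

```python
from datetime import datetime, timedelta

# Hypothetical threshold: anything running longer than this is flagged.
RUNAWAY_THRESHOLD = timedelta(hours=2)

def find_runaway_queries(running_queries, now):
    """Return queries that have been executing past the threshold.
    `running_queries` is a list of dicts with 'query_id', 'user', 'started_at'."""
    alerts = []
    for q in running_queries:
        elapsed = now - q["started_at"]
        if elapsed > RUNAWAY_THRESHOLD:
            alerts.append({
                "query_id": q["query_id"],
                "user": q["user"],
                "elapsed_hours": round(elapsed.total_seconds() / 3600, 1),
            })
    return alerts

now = datetime(2024, 1, 1, 12, 0)
queries = [
    {"query_id": "q1", "user": "etl_svc", "started_at": datetime(2024, 1, 1, 6, 0)},
    {"query_id": "q2", "user": "analyst", "started_at": datetime(2024, 1, 1, 11, 30)},
]
alerts = find_runaway_queries(queries, now)  # only q1 exceeds the threshold
```

Run on a short interval, a check like this is what turns a 30-day runaway query into a same-morning page instead of a month-end surprise.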
That's great. You mentioned cost optimization, so maybe we should cover that in a bit more depth. Walk us through the drivers behind ballooning costs when managing data in the cloud.

The way I see it, there are four different areas where cost can be variable and can explode if you don't manage it well. The first is compute cost: in modern data warehouses and data platforms you always have a compute aspect, separate from your storage, and the more queries you run, the more you spend. The second is the query itself: query optimization. The third, which I'll go into in a second, is dataset optimization: how much data you store and whether it's modeled correctly. The last is environment optimization, which is around your lower environments (dev, QA) and production; a lot of inefficiencies can happen in the lower environments as well.

Let's go back to the first one, compute. What we understood is that there's no one size that fits all times of the day. Workload usually goes up and down based on the time of day and the day of the week, and you want your Snowflake resources to size up and down accordingly. Today Snowflake doesn't do that for you; it gives you one size, and you have to manage it. So it's very important to know the ups and downs and make sure your compute follows them; that way you're spending most effectively. You also want to make sure your warehouses are shut down when not in use, and that you have timeouts on your queries so a runaway query doesn't keep a server up. Watching for those is very important.
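The time-of-day sizing, auto-suspend, and query timeouts described above can be scripted. This is a sketch with an invented schedule; the generated statement follows Snowflake's ALTER WAREHOUSE syntax, using the AUTO_SUSPEND and STATEMENT_TIMEOUT_IN_SECONDS parameters.

```python
# Hypothetical schedule: business hours get a larger warehouse, nights a smaller one.
SIZE_SCHEDULE = [
    (range(8, 18), "LARGE"),   # 08:00-17:59, peak analyst/ETL load
    (range(0, 24), "XSMALL"),  # fallback for all other hours
]

def size_for_hour(hour: int) -> str:
    """Pick the warehouse size for a given hour of day."""
    for hours, size in SIZE_SCHEDULE:
        if hour in hours:
            return size
    return "XSMALL"

def resize_statement(warehouse: str, hour: int) -> str:
    """Build the ALTER WAREHOUSE statement a scheduler would execute, including
    auto-suspend and a statement timeout to stop runaway queries."""
    size = size_for_hour(hour)
    return (
        f"ALTER WAREHOUSE {warehouse} SET "
        f"WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = 60 "                    # suspend after 60s idle
        f"STATEMENT_TIMEOUT_IN_SECONDS = 3600;"  # kill queries over 1 hour
    )

stmt = resize_statement("ANALYTICS_WH", hour=9)  # peak hours, so LARGE
```

A cron job or task running this hourly approximates the "size up and down with the workload" behavior; the schedule itself would come from observing your own usage curve, not from a guess.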
The second area is query optimization. Like I said, the longer a query runs, the more it costs. If you have a badly written query, one that does a Cartesian join, for example, it can run forever, cost you a lot of money, and you won't even know about it unless somebody's watching. So it's very important not only to build in alerts for runaway queries, but also to provide advice on how to rewrite the query to get the best performance and lower the cost; our Slingshot tool does that too.

The third is dataset optimization. I call it dataset optimization because it has a few aspects to it. One is storage cost. You'd think storage is much cheaper in the cloud compared to on premises, but when you're dealing with multiple petabytes of data, the cost will sneak up on you. So it's very important to have a retention strategy for your data: know how long you want to keep it, based on business requirements or regulatory requirements. Figure out a way to keep your storage from growing forever; otherwise, even if you start with storage at 5% of your cost, by year three it will be 40%, because you're not purging or archiving anything old. Second, for a lot of the data we load into our data warehouses today, you have to understand the consumption patterns as well. There's no point loading something in real time when it's consumed once a month, because there's a cost to loading data too. These are the insights you want a tool to draw for you: what's the best way to load the data, given the usage?
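The 5%-to-40% storage figure is easy to sanity-check with a back-of-the-envelope projection. The growth rate and starting split below are illustrative assumptions, not numbers from the episode.

```python
def storage_share_by_year(compute_cost: float, storage_cost: float,
                          storage_growth: float, years: int) -> list[float]:
    """Project storage's share of total platform cost when nothing is purged
    or archived, so storage compounds while compute stays flat."""
    shares = []
    for _ in range(years):
        shares.append(storage_cost / (compute_cost + storage_cost))
        storage_cost *= 1 + storage_growth
    return shares

# Assumed: storage starts at 5% of spend and, with no retention policy,
# grows 150% a year while compute stays flat.
shares = storage_share_by_year(compute_cost=95.0, storage_cost=5.0,
                               storage_growth=1.5, years=4)
```

Under those assumptions, storage climbs from 5% of spend to over 40% by year three, which is exactly the drift a retention strategy is meant to prevent.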
The last point is environment cost. One of the things we noticed is that when you're building data pipelines in lower environments, there's a lot of room for inefficiencies. For example, one of the things we did was make a rule that in a dev environment you can't have a warehouse larger than, say, small; just by enforcing that rule we saw significant savings. If there's a need for something larger, you go through an approval process. You also want to enforce that jobs you're testing or developing aren't still running when you're not working, because those things happen. So it's very important to pay attention to the lower environments. Even though they're a much smaller percentage of total cost, they can balloon out of control if you don't put in the same inefficiency checks you put in production. The point I'm trying to make is: don't forget about the lower environments, because that cost can sneak up as well.

I love this overview, and I love how you thought about all of these edge cases and all of these drivers of ballooning cost when designing Capital One Slingshot. Maybe switching gears slightly to how leaders and organizations should think about their data management journey: a lot of organizations are still early in their cloud journeys and still setting up their cloud infrastructure. What advice would you give them to make sure they're driving as much ROI as possible when it comes to data management? How do they avoid a mode where they're increasing investment and costs are ballooning without necessarily driving a lot of ROI from data activities?

There are times when companies moving to the cloud don't see what's right around the corner: they go, and then they face all these different challenges around governance, or cost, or accessing data. My recommendation and advice has always been to think about data governance and data management when you're starting the journey to the cloud. It's much harder, once you've opened Pandora's box, to put everything back in. If you don't have your data registered across your organization, you're going to have to go back and collect all of that; but if you think about a central tool and central policy from the start, then any new data that's created will automatically be well managed and well governed. So the first piece is to invest in good tooling that lets you enforce the centralized policies but gives ownership to the lines of business. On the cost side I would recommend the same thing: it's very important, from the beginning, to put in the right policies on how you're going to do chargeback to your lines of business. Are you going to have a budget or not? Is there a way to request more funding? All of that needs to be part of the data platform in this modern world. Since it's a variable cost, you need proper funding, proper budgeting, and a proper way to ask for more resources; the right approval workflows should be built in. And then visibility is very important: visibility into the way you're spending your money.

I couldn't agree more. And what's interesting, hearing about your journey now leading Capital One Slingshot: as we discussed, in a lot of ways Slingshot was built on Capital One solving these very problems we're discussing right now. Maybe walk us briefly through the journey of productizing a tool built internally for the wider market. What changed, if anything, in your approach as you were productizing Slingshot?

Wow, that's a very good question.
The first thing we had to do: the tool we built within Capital One was very specific to Capital One, with a lot of hidden integration with Capital One infrastructure already. So we had to build a SaaS platform that is multi-tenant: it has to protect customer data and make sure that tenants' data don't overlap with each other. We had to create a whole new SaaS framework, and a different software development practice, to help with this journey. One of the things you'll see from software Capital One builds is that security, performance, scale, and resiliency are built into our DNA. Everything we build is built as a hardened platform, and that's how we've always started. Other startups will start by just providing you the features and figure out how to harden them later; we've always made sure that the product you see is hardened, enterprise-grade software.

As we're closing out today's episode, Salim, looking ahead, what are the trends you're excited about when it comes to organizations managing their data effectively? How do you think the landscape will evolve over the foreseeable future?

I think it's already evolving into that federated mindset. The more companies there are, the more they realize it's going to be very hard to manage data from a centralized point of view, so that's the shift that's happening. It's not happening as fast as I'd like to see, but it's clearly evident from the strategies of different companies. The way they execute is a little different, and they may keep data across multiple platforms, but what's interesting is having a central catalog across all of it and a central access policy across all your data. Those are the key parts: that's how you break the silos and have a single way to get data, even though it's owned by different lines of business.

That's definitely an exciting development, as we see that federated mindset evolving and empowering organizations and different lines of business to use their data effectively. Finally, Salim, it was great having you on the show. Before we wrap up, do you have any final call to action or notes to share with listeners before we end today's episode?

Yeah, absolutely. This is something I say quite a lot: when it comes to cost optimization, it's never a once-and-done deal. You have to be vigilant. It's a constant battle: you need to be watching for new users and new usage patterns, because workloads are always being introduced into the system. So don't focus on reducing cost; I would say focus on reducing waste and reducing inefficiencies in the system. Then you'll be able to scale with peace of mind, knowing that the money you're spending is generating value for the business.

That's really great, Salim. It was great having you on DataFramed. Thank you so much.

It's been my pleasure.