#219 Building a Data Platform that Drives Value _ Shuang Li, Group Product Manager at Box
The Importance of Metrics in Data Platform and Engineering
In today's fast-paced data-driven world, having a clear understanding of metrics is crucial for any organization. Metrics help us measure progress, identify areas for improvement, and make informed decisions. In this article, we will explore the importance of metrics in data platform and engineering.
At Box, we believe that metrics are essential for making an impact at the company level. We ask our team to focus on two metrics: time-to-value, and enabling new use cases by introducing streaming capability. Time-to-value measures how quickly we can deliver value to internal customers, from onboarding through to production. Onboarding is the first step: early in our journey it could take a team about a quarter just to onboard to the data platform, so shortening it directly improves time-to-value. Enabling new use cases through streaming capability, in turn, lets teams build products such as near-real-time anomaly detection.
The metrics we use at Box are designed to be relevant to every engineer on our team. We start with the bottom-level metric that is closest to the day-to-day work, which is streaming capability in the data platform. An engineer working on it understands how streaming works. To show how streaming contributes at the product and engineering level, we have a Box metric called "enable new use cases by introducing streaming." It captures whether teams can build near-real-time anomaly detection and other solutions that were not possible before.
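To make the streaming use case concrete, here is a minimal Python sketch of near-real-time anomaly detection over download events: it flags a burst of downloads from a region the company has not operated in before. The event fields, the `RegionBurstDetector` class, the one-second window, and the threshold are all hypothetical choices for illustration, not a description of Box's implementation.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadEvent:
    timestamp: float  # seconds since some epoch (hypothetical field)
    region: str       # region the download came from (hypothetical field)


class RegionBurstDetector:
    """Flags bursts of downloads from regions outside a known set."""

    def __init__(self, known_regions, window_seconds=1.0, threshold=100):
        self.known_regions = set(known_regions)
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._recent = {}  # region -> deque of recent event timestamps

    def observe(self, event):
        """Process one event; return True if it completes a suspicious burst."""
        if event.region in self.known_regions:
            return False
        window = self._recent.setdefault(event.region, deque())
        window.append(event.timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and event.timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.threshold


detector = RegionBurstDetector({"us-west"}, window_seconds=1.0, threshold=3)
for t in (0.0, 0.1, 0.2):
    if detector.observe(DownloadEvent(t, "unseen-region")):
        print("suspicious burst at", t)
```

Because each event is evaluated as it arrives, the check fits a stream-compute model rather than a nightly batch job; a production system would also need state expiry and out-of-order handling, which this sketch omits.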
As engineers work on their projects, they can see how their work contributes to the company's goals. The metrics are designed to be easy to understand and track. At the platform level, the metric is time-to-value: how long it takes an internal team to go from onboarding to a feature running in production. Everything we do on the platform aims to shorten that time.
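As a rough illustration of how a time-to-value metric could be tracked, the Python sketch below measures the number of days between the date a team starts onboarding and the date its feature reaches production, then takes the median across teams. The function names, the choice of start and end milestones, and the sample dates are assumptions made for this example.

```python
from datetime import date
from statistics import median


def time_to_value_days(onboarding_started: date, in_production: date) -> int:
    """Days from the start of onboarding to the first production launch."""
    if in_production < onboarding_started:
        raise ValueError("production date precedes onboarding start")
    return (in_production - onboarding_started).days


def median_time_to_value(journeys):
    """Median time-to-value over {team: (onboarding_started, in_production)}."""
    return median(time_to_value_days(start, end) for start, end in journeys.values())


# Hypothetical journeys for three internal teams.
journeys = {
    "security-product": (date(2024, 1, 1), date(2024, 4, 1)),
    "product-analytics": (date(2024, 5, 1), date(2024, 5, 31)),
    "compliance": (date(2024, 6, 1), date(2024, 7, 16)),
}
print(median_time_to_value(journeys))
```

A median is used here so that one unusually slow onboarding does not dominate the number; tracking the slowest journeys separately would be an equally reasonable choice.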
In addition to these metrics, we also focus on developer experience. We invest in tools and frameworks that make it easier for teams to interact with the data platform. For example, our team builds frameworks that help non-expert users aggregate their data and build business logic on top of it. This makes it easier for teams to use the data platform and to get insights or build innovations at scale.
Another area we are focusing on is the log pipeline. Logs are an essential part of any software development project, but they can be a challenge to manage. We are building a tiered log pipeline that meets the needs of different users. For real-time logs, we have a pipeline that provides immediate visibility into application performance and errors. For analytics or auditing purposes, we have a separate pipeline that stores data for compliance use cases.
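One way to picture a tiered log pipeline is as a routing decision made per log record. The Python sketch below sends errors and warnings to a real-time tier and audit-flagged records to an archival tier. The tier names, the `LogRecord` fields, and the routing policy are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass

REALTIME = "realtime"  # hypothetical tier for immediate visibility
ARCHIVE = "archive"    # hypothetical tier for analytics, auditing, compliance

REALTIME_LEVELS = {"ERROR", "WARN"}


@dataclass(frozen=True)
class LogRecord:
    level: str     # e.g. "INFO", "WARN", "ERROR"
    audit: bool    # True if the record must be retained for auditing
    message: str


def route(record: LogRecord) -> list:
    """Return the tiers this record should be written to (possibly none)."""
    tiers = []
    if record.level in REALTIME_LEVELS:
        tiers.append(REALTIME)
    if record.audit:
        tiers.append(ARCHIVE)
    return tiers


print(route(LogRecord("ERROR", True, "permission change failed")))
```

Keeping the routing rule in one small function makes the policy easy to review; in this sketch a record that matches neither rule is simply not forwarded, which is an assumption and not a recommendation.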
Finally, we are also looking at how AI can help with the data platform. AI can detect anomalies in pipelines, such as data loss or late data arrival, and can support data discovery. We are also building a data catalog to manage metadata and make it easier for users to find the data they need.
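Late data arrival can be illustrated with a simple freshness check. The Python sketch below is a plain rule-based check, not an AI model: it compares each dataset's last arrival time with an expected interval plus a grace period. The dataset names, the 24-hour interval, and the function names are assumptions for this example.

```python
from datetime import datetime, timedelta, timezone


def is_late(last_arrival, now, expected_interval, grace=timedelta(0)):
    """True if no data has arrived within the expected interval plus grace."""
    return now - last_arrival > expected_interval + grace


def late_datasets(last_arrivals, now, expected_interval, grace=timedelta(0)):
    """Names of datasets whose latest data is overdue, sorted alphabetically."""
    return sorted(
        name
        for name, last_arrival in last_arrivals.items()
        if is_late(last_arrival, now, expected_interval, grace)
    )


# Hypothetical check at 09:00 UTC for datasets expected every 24 hours.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
last_arrivals = {
    "daily_active_users": datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc),
    "enterprise_events": datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
}
print(late_datasets(last_arrivals, now, timedelta(hours=24)))
```

A check like this lets the platform alert consumers before they discover missing data themselves; a learned model could replace the fixed interval where arrival patterns vary.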
The Future of Data Platform and Engineering
Data platform and engineering trends are evolving rapidly. Knowledge built up over time can become outdated quickly when new technologies emerge. To stay ahead, we must continue to learn from others, from our own experience, and from the latest developments in the field. We are excited about what's next for data, ML, and AI, and we believe that these areas will have a significant impact on how organizations operate.
In conclusion, metrics are essential for making an impact at the company level. By focusing on time-to-value and on enabling new use cases through streaming capability, we can build near-real-time anomaly detection and other innovative solutions. We also invest in developer experience tools and frameworks to make it easier for teams to interact with the data platform. Our work on the log pipeline provides real-time insight into application performance and errors, and AI can help us monitor pipelines and discover data. As the field continues to evolve, we must stay ahead of the curve by continuing to learn and adapt to new technologies and trends.
"WEBVTTKind: captionsLanguage: enI was uh always like joking with my team like oh if we look back right building this data platform it's like a climbing a high mountain right and you you can reach the top within a day or two and for Box's case we we didn't reach the top within a year we spent like couple of years to get where we are today especially talking about the entire Cloud migration we did in the past couple of years schwang it's great to have you on the show nice meeting you again Adele I'm very excited to be on the show awesome so you are the group product manager at box where you lead the building of the data platform there so maybe to set this stage what got you into this role and how do you become a data product manager you my career I have been always passionate about what's made possible by data ML and AI including optimizations insights and predictions so when I was doing my PhD in computer science at the Ohio State University um I was doing actually a lot of theoretical research and then um the summer um it's like the year before my graduation I got an internship at quom so that was my first exposure to Industry and I was very interested actually in the project I was working on so it's basically um uplift uplifting the user experience of U streaming media over wireless networks by developing a machine learning based algorithm so I got very excited about that project because I was thinking you know this algorithm would be very likely running on the cell phones of millions of people so that's my actually the first transition into my career I decided to join Google as a software engineer instead of staying in Academia so that's the first transition and then my second transition actually also happened at Google so I um I joined Google Fiber it's like a startup inside a big company so uh our mission was to bring the high speed internet to households so it was a small team and I was actually working with people from you know different functions like oh 
product managers, business development, marketing, all these different people. I got into customer calls together with our product manager, and I was amazed by how he asked questions to understand the customers' pain points and then wrote requirements documents to solve their problems. I got very interested in this area, so I transitioned gradually into product management. That was my second transition. Over the past few years I've been in product management in payments, in electric vehicle charging, and also in big data, cloud, machine learning, and AI. My third transition happened about three years ago, when I became a Group Product Manager at Box. I was hired to build and lead a team of product managers to build the Box data platform. So over my career journey so far there have been three transitions, but I've always been very interested in big data, ML, and AI. That's how I got here.
So maybe let's jump into the meat of today's discussion, Shuang. Let's first focus on building a data platform, specifically the why behind it. Maybe walk us through first what exactly a data platform is, to dispel any myths about it, and why organizations should invest in building one.
When you talk about a data platform and ask different people, they may give you different answers. But I think what's in common is that, overall, it's the high-scale data infra every company builds to solve the business problems of the company. I hope that's brief enough, because you can have different variations depending on which company or which industry you are in. In terms of the why, maybe I can share a little more of the story at Box. When I first joined Box about three years ago, we didn't actually have a data platform. We called it data infra: different teams worked in data
silos. Each had its own data infra and a dedicated set of people managing the data infra and all the data. It was really hard to scale at that time, and there was no way for these different teams to share data. Not to mention: can you guarantee the right performance of your data infra? There were cost challenges too, and it was hard to scale. Reliability, quality, all these problems. So if we talk about the why here, think about what would happen to a company without it. Essentially, like I mentioned before, we want to have this high-scale data infra. For Box, maybe I can share a few numbers so people can understand a bit more. We are a content cloud company. We have millions of users and we manage billions of objects. When I talk about objects, these are files, folders, and other kinds of objects customers store in the Box Content Cloud. They interact with these files and folders all the time: think about all the downloads, uploads, edits, and deletions. Imagine that kind of scale, with all the user activity data, metadata, and file information you need to manage. If we work in data silos, with teams managing their own data infra, it's a really hard problem to solve. So it's very important to build such high-scale data infra to solve the business problems of the company. Hopefully that's something people can resonate with when they hear about this.
Yeah, 100%. It's hard to imagine a company like Box, operating at such a high scale of data, not needing something like a data platform. So when you reflect on your experience leading the data platform group so far, you mentioned you joined three years ago and a lot of work has happened since then. What do you think are the key components of building an
effective data platform? What makes a data platform successful?
That's actually, I think, the core of a data platform. I always want to keep things simple, but like you said, what are the key components? Maybe that's where we start when we build a platform, and then we can build on top of that. These are the building blocks of a data platform. Essentially, we want to get the data in and then make it available for different teams across the company to consume. You need to have a pipeline to get the data in: that's the data ingestion pipeline. But you need to think about all the data sources you have at a company. It could be on your platform: other teams, let's say, storing transactional data that needs to be ingested into your data platform for other teams to consume. It could also be third-party tools that, let's say, your marketing team or customer success team are using. That's off-platform data, another type of data source. For each of those data sources you need to figure out how to build the right ingestion pipeline to get the data into your data platform. So the first key component, in my mind, is the ingestion pipeline.
Second, you need to be able to process the data. That could be the ETL pipeline; we work together with our data engineering team to build that. But beyond that, there are teams who want to use your data processing capabilities. You get the data in for them, but they want to do some further processing on application-specific objects, so they probably want to use your batch compute or stream compute capabilities. At Box we have streaming use cases for sure, both on the infra side and on the customer-facing side. For example, we have a security product team whose job is basically to detect anomalies in user activity. Let's say your company only
operates in certain regions, but you find hundreds of downloads within a second from a region you have never seen before. That's a suspicious location, or suspicious activity. They want to detect it right away, or as we say, in near real time. That's when stream compute comes into the picture, and that team needs to leverage the stream compute capability we have on the data platform. So that's the second key component, data processing. It could be batch or stream, or you could have both.
The third key component, which people probably talk about all the time, is the core of the core: the data lake. Once you ingest data and process it properly, you need to store it and manage it somewhere, and then you need to support all the use cases for your customers. Here I'm talking about internal customers, but of course they build products or analytics for external customers, or dashboards for leadership. So we need to choose the right tool and the right technology, and then we need to innovate on top of that. There's performance, there's cost, and of course the feature set you have in mind. To summarize, the three key components I have in mind are the data ingestion pipeline, data processing capabilities, and the data lake.
Okay, that's really great, and there's so much to tease out here. Maybe focusing on that ingestion pipeline: for an organization like Box, walk us through the complexities of trying to capture all those data sources. How do you approach it, especially when you're early in the data platform journey?
I briefly mentioned all the data sources. We have on-platform data sources, like transactional data we capture in relational databases. But they couldn't make it available for downstream teams to do analytics or build products. It's very hard to do it there. So that's
one of the biggest data sources we need to figure out. Then there are other data sources: there's some metadata we store in other places, and beyond that there's events data, such as user activities, downloads, and uploads, which we call enterprise events. That's another data source. I think we have at least ten on-platform data sources. On the marketing, customer success, and data science side, they leverage some third-party pipelines; we use SnapLogic, actually, to get the data in. And then you have to manage all of this. If you only build one pipeline, it's not one size fits all. You need to figure out how you get data from different sources. They're in different formats, they probably have different SLA requirements, and we need to figure all of that out. And I mentioned the scale we are operating at at Box: imagine those millions of users interacting with their billions of objects every day.
Yeah, that's really awesome. The second component you mentioned is the processing pipeline, where I assume the data engineering team is managing and building these ETL pipelines. Maybe walk me through how integral the relationship is between the product management team and the data engineering team. How do you build out those ETL pipelines in a way that satisfies all the requirements you have, whether for downstream users or to make sure the right security levels are in place? Walk me through the nuances of building effective ETL pipelines in such a circumstance.
I'm a product manager, and I have a team of product managers working on the data platform. We have engineers; we call them data platform engineers. But the third party you're talking about is this data engineering team. They are not part of us, they're not part of the data platform, but we are in this
partnership together. In Box's case it's probably more complex than we all thought at the beginning of this conversation, because three years ago when I joined we didn't have a decent data platform, but at the same time the entire company was starting the migration to the cloud. So we have been building the data platform together with doing the cloud migration. It's a great thing: by migrating to the cloud you can build the right data platform leveraging cloud-native solutions. But it also adds complexity to the project, or the program, we are managing. Of course we engaged the data engineering team very early in this journey. We always have all the discussions together, because we always think they're part of us. We're doing this all together. Even today, after we have already finished the cloud migration, we are brainstorming together on what's next. This comes down to how a company structures its teams. At Box, it happened that data engineering is part of our go-to-market org, so they have a lot of close interaction with our marketing and customer success teams. Those teams are not part of product and engineering, but data engineering has a good relationship with them. Building relationships with your stakeholders, whether in building a data platform or in the cloud migration, is super important. So they are our partners, and they work with marketing and customer success. Of course I was deeply involved in all those conversations together with my team of product managers. We figured out that they have different use cases. On the marketing side, they need to use the data to run marketing campaigns, and they want to find the right target customers for the campaign. On the business side, for business
analytics, they want to create dashboards to track our revenue and do the churn forecast. Those are very different use cases from, let's say, the security product team, who want to detect anomalies in customers' content. Those are very different use cases, with different SLAs and different scale. For example, when do they expect the data to arrive? Those are probably different from what other product teams are expecting. So we need to understand all these use cases across the company, because basically all the teams at Box are internal customers of the data platform. In this journey we really enjoy the partnership with data engineering. They are part of us, and they help us broaden our relationships and our engagement with different internal customers.
Yeah, that's very fascinating. I think this is the first time I've heard of a data engineering team being part of a go-to-market team, but it's pretty interesting to see how the interlock works. Maybe switching gears a bit, let's start at the very beginning. When you started building the data platform at Box, what were some of the key steps and milestones? And if you were to zoom out beyond Box, what advice would you give to those looking to build a data platform right now? Where do you start?
I was always joking with my team: if we look back, building this data platform is like climbing a high mountain. You can't reach the top within a day or two, and in Box's case we didn't reach the top within a year. We spent a couple of years to get where we are today, especially considering the entire cloud migration we did in the past couple of years. Like I mentioned, when I joined we were in the very early phase of the migration to the cloud. It's not just the data
platform; the entire company was doing the cloud migration, and data platform is one of the foundational teams, so we went first in this cloud journey. After we finished, together with other platform teams, our service teams and application teams could start doing their migration, because their services are built on top of the platform teams' services. So in the very beginning we needed to identify the key first steps for the entire company: platform teams, you need to go first, and then other teams can build on top of you. But if we talk about the data platform specifically, we needed to identify the problem we had at that time. I think it was very clear to everybody: we were working in data silos. Every team had to allocate resources to build and maintain their own data infra. Sometimes there was data inaccuracy, and they had to figure it out by themselves. Let's say product analysts were expecting data for their monthly active user and weekly active user dashboards, but the data didn't come through. Where's the problem? They had to figure it out relying on their own resources. There was no single source of truth at that time, and that was the biggest problem. That's why we identified the goal and aligned with all these stakeholders by talking to them constantly. We needed to build a consolidated data platform: get all the data in one place, build robust, scalable, reliable ingestion pipelines, and provide the right data processing capabilities to all these teams. I think that's very important in the very beginning. I think you also asked whether I want to share any advice with other companies?
Yeah, I'd love it if you could share advice.
Okay. I think alignment is very important. That's the most important thing. In our case we're
basically dealing with the entire company. There's product, engineering, marketing, and customer success; the entire go-to-market org is in this picture as well. We have product support, who also use the data platform. Our compliance team needs to store data with a specific retention period for auditing. And then there are our data science, data engineering, product analytics, and business analytics teams. So it's a lot of teams we're dealing with. Getting alignment in the early phase is super important, but along the way you also need to talk to them constantly, because sometimes it requires realignment. Things change: they may have some different use cases, or the way you thought things would work didn't. But overall it's all about alignment. I think that's the most important thing.
Yeah, and I really like the mountain-climbing analogy you use here, because you can extend it and say: the mountain is very high, but it has multiple peaks along the way, and you want to arrive at the first peak, the second peak, the third peak. So when we're talking about the different peaks, how do you chunk up the massive journey of building a data platform into small, iterative goals that are achievable in the short term? How do you define those over time?
It's not easy, for sure. It's a company-wide program, and everybody is deeply involved in it. It's overwhelming in the beginning. In Box's case, we were not on the cloud. Not many people had cloud experience, not to mention knowledge of the best practices in the cloud. If we had said, let's just adopt everything, all the cloud services all at once, nobody could do that. It's really hard. So we tried to make things simple, and we tried to build in
an iterative way. We worked very closely with our architect group. At Box we have an architect leadership group with all the principal and distinguished architects. We worked very closely with them along the journey, and even today, after we have finished, we still have all the architectural discussions with them. We identified two groups of use cases overall. Think about all those different use cases and stakeholders; we just divided them into two groups. Group one is about uplift; group two is about lift and shift. Uplift is about our end goal: we need to build the right architecture to handle high-scale data processing in the cloud. Lift and shift is where we want to optimize the delivery time while still meeting the needs, and we'll probably come back later and see what we can do. There would be some technical debt, but that's how we needed to operate in order to meet the timeline of the cloud migration or some other goal set by the company. So trying to make things simple is really important in this journey.
Yeah, I couldn't agree more. You mentioned that if you want to adopt all of the cloud tools at the same time, it's not necessarily possible. When it comes to technology choices such as databases, frameworks, which cloud provider to choose, and so on, how do you make those decisions? How do you approach these trade-offs?
At Box we always have this debate when we talk about different choices of tools, technologies, or services: build versus buy. You need to answer this question every time. We look at the feature set, of course. These are the use cases we have in mind; does this tool have all the right features for us to solve these problems? Cost is another thing. If we buy, there's the licensing cost. How much would we spend? You need to have a more or less accurate estimate of
that, with a forecast of your traffic over the next two or three years or even longer. And if you build it yourself, there's your engineering cost. You need to pay your engineers to build a product; it doesn't come free. And then there's maintenance cost. You build your own product, let's say from scratch or on top of some open source, but you need to maintain it all the time. It's your engineering resource, for sure. But of course if there are issues, it might be faster to troubleshoot, because it's your own engineering team and they know the code. With a vendor, you have to file tickets; think about the back and forth. Sometimes it doesn't work well. So I would say, if we want to go with buy, let's say the cost is under our budget and we like the feature set provided by the vendor, we still want to look at the long term. It's not just about today, with this much traffic and these features we want. We need to think long term. Is this vendor in the right ecosystem? For example, this vendor may have great partnerships with others, and you may want to expand some use cases into areas or features provided by its partners. And what is the innovation speed of this vendor? You may go well beyond what you are looking for this year. A good partnership is important too. Not every vendor is easy to work with, so you need to look at how easy it is to have a good partnership. And for Box I have one more thing to add: the security requirements. Our security office has very strict security requirements, so when we pick vendors we need to look at all the requirements provided by our Global Security Office and make sure the vendor checks all the boxes. That's another important thing. Usually, when we decide we will probably go with a vendor, we initiate our
request to the Global Security Office very early, because that review could take a month or two. We don't want to decide we'll use a vendor and then end up getting declined by our security approval team.
Yeah, great ideas here on buy versus build. Maybe an additional component of building a data platform that we haven't touched upon yet is data observability and quality, which is really integral to maintaining high trust in the data platform. Maybe walk me through some of the best practices you can share when it comes to keeping data quality up and having data observability pipelines that monitor when data breaks. I'd love to learn how you've approached that as well, and when it comes in the journey.
Well, when we were doing the migration to the cloud, we didn't invest much in this area, to be honest. We had structured our entire data platform engineering team as data management versus data transformation. You can hear from the names that they are both about the foundation of the data platform. There was no developer experience we were thinking about. With data management, we're thinking about data at rest: think about the data lake and the ingestion pipeline, how you get the data in. Of course there are related capabilities; I'm just trying to keep this simple. With data transformation, we're talking about data in motion: the ETL, the processing, all the capabilities and orchestration we're providing. That's how we structured our engineering team, and of course I had my product managers providing coverage for both teams. But once we finished the migration, we thought: foundation, of course, will keep innovating. We'll adopt cloud-native solutions, we'll adopt best practices. But how about developer experience, like what you mentioned, data quality and data
observability? Take our product analysts. Let's say today, at 9:00 a.m. Pacific time on Tuesday, they're expecting all the data from the past 24 hours so that they can show the daily active user and monthly active user dashboards, and they want to present to leadership. But the data didn't arrive. Nobody told them, and they found out maybe one or two hours later. They had to come to data platform, and data platform says, we build and manage the pipeline, let's check with the data source. Then we talk to another team, which keeps the transactional data. That's a very hard process, or no process at all. It's just asking around. So we realized the problem, and that's just one example. On the developer experience side there are many other things, like how to discover data easily, or whether we can provide a kind of playground, a production-like environment, for teams to play around with our capabilities on the data platform before they go to production. All these things fall into developer experience. So last year we decided to restructure the team. Instead of having data management versus data transformation, both of which are in foundation, we have a data platform foundation team and a data platform developer experience team. Then it's easier to prioritize, because every time there's a developer experience request, it goes to the developer experience team, while foundation executes against a separate roadmap, and we can prioritize accordingly. So structure goes first. Without the structure, we could put data observability and everything else on the roadmap, but every time it would probably get deprioritized because we didn't have a dedicated team investing in it. Starting from last year, and including this year as well, we are investing very heavily in
developer experience. For data observability, I briefly touched upon data freshness, but I don't know, do you want me to expand more on what we have built?
Yeah, I'd love to learn more. But what I actually would like to ask, because I'll let you cover this in the next question, is: what do you think are the key components of a healthy developer experience for a data platform? Freshness here is a part of it and data quality is a part of it, but maybe expand on what makes a great developer experience for a data platform as well.
So that actually comes back to the North Star for building a data platform: what metrics are you measuring? For platform teams like data platform, you're not directly delivering customer-facing products, so you can't use revenue to measure how good or how bad this is. But for platform teams, we have teams across the company using our capabilities; they are our customers, and that you can measure. So we use time to value, or time to market, or time to production; we always use these three interchangeably. That's the metric we're measuring. So let's talk about time to value: what does this mean? Use that security product team as an example again. They want to build, let's say, a near-real-time anomaly detection product for our customers. Let's say they haven't used data platform at all, so they need to onboard to data platform; that's the first step. How easy is the onboarding process? To be honest, when we first started this journey, it was extremely hard. It would take them a quarter, like three months, just to onboard to data platform, so that they become a tenant of data platform and can start using the capabilities. After they onboard to data platform, how can we make it easier for them to discover all the different capabilities of data platform? We should have the right documentation for them:
playbooks, tutorials, office hours. But those are maybe artifacts together with process. Then they start experimenting with those capabilities: can we have the right environment for them to play with them end to end? Once they're ready to go from dev to production, we have what we call a shadow environment to provide production-like traffic, so they have production volume and production-level diversity to try with, and then they have the confidence to move to production. Along this journey there are different branches. For example, at a certain point they want to explore data. Is it easy for them to discover the data: the data sets, the data tables, even the columns? And then data observability also comes into the picture. They run their jobs and things happen. How can they troubleshoot? Or you can build some monitoring for them so they get alerts right away, and even better, you can tell them, "This is the issue," or "I already auto-recovered for you." That's the next level, for sure. But it's all along this journey. Today, for a security product team, if they want to build a new feature, from onboarding through all the exploration and experimentation and then to production, where our customers can use it, how long does it take? That's our North Star, and our goal is to shorten this time. Everything we're doing is to make this time shorter.
What's really wonderful about what you're saying is that it resonates a lot with me, because we also have our own data platform at DataCamp. One thing that is really magical when the data platform works is just how democratized data can be for the wider organization. I'm pretty data fluent, but I wouldn't call myself a data scientist. But I do know where the data I need is, I have access to it, and in a semi-production environment I'm able to experiment with it. So maybe, what are the key aspects
of making data accessible to non-technical users? When it comes to data democratization, I'd love to learn, Shuang, what are the nuances related to data democratization that anyone working on a data platform should be aware of?
Yeah, I think I briefly talked about that, with the restructuring of data platform. Now we have a dedicated data platform developer experience team, and that team is dedicated to features like this. We call it data discovery, and of course it's for both technical and non-technical users. Take our product analysts and business analysts, for example; they fall into the non-technical user category. They get a request from product teams, sometimes from leadership, like, "Can you figure out the daily active users, or the feature usage or product usage, of this newly launched feature or product?" I was talking to our manager of product analytics last year, and I asked him, "How much time do you guys spend trying to figure this out?" He told me that on average they would spend four weeks figuring out where the data is. The way it works today is, let's say a software engineer from that particular team launches a new feature. They just write everything into a BigQuery table: no comments, no annotations. Nobody knows which table it is except this person himself or herself, so there's no good documentation. Then those product analysts get a request, and they have to search all the Confluence pages, maybe the Box documents, with no luck in most cases. So they have to try out different data sets and figure out which one has the right data field they need to explore. That's a big pain point, and that's why we decided we have to invest in this area of data discovery. So now we're pushing the teams, the owners of the data and the tables, to tag their tables. They could add descriptions, and they could tag columns; for example, this
column is specifically for this new feature and it's about usage, something like that. Then we can have metadata management built right into data platform. We are actually leveraging data catalog today for data discovery, data observability, data lineage, and data classification as well. So I believe a data catalog is a very good tool for companies who want to invest, or invest more, in their data platform developer experience.
Shuang, as we close up, I'd love to discuss some of the challenges that you've encountered along the way that you think are really common when building a data platform. What would you say are the top challenges folks may encounter that are relevant to building a data platform?
A lot of challenges. Again, trying to keep things simple, maybe I can talk about the top three. First, and I think I briefly mentioned this one: when we built this data platform while doing the cloud migration, almost everybody was new to the cloud. The cloud world is very different from on-prem, and everybody was doing this and learning the best practices from the industry at the same time. So we made mistakes, but we moved forward. It was a big challenge we were tackling back then. That's the first one. Second: this data platform we're building, and the broader cloud migration project, is actually the biggest project ever for the company. It was overwhelming for everybody, but we were able to break this overwhelming work into milestones and get alignment across the company, across different stakeholders. So that's the second challenge; I'm also sharing how we tackled these challenges, because otherwise we wouldn't have gotten here. And the third is about team morale. Many engineers want to build new things, so there's a balance here between new feature
development, or what we call uplift, versus lift and shift. We need to tell them the right story, like, "We're probably doing lift and shift for some of the components for now, but we'll come back." That's exactly what we are doing now in our data platform foundation team: we're revisiting the lift-and-shift work we did during the cloud migration. So these are the three challenges I want to share.
Yeah, and when you mention cloud migration, I think a big trade-off that organizations face here is the trade-off between being cost effective while also scaling. How do you approach that as a function? As you're growing the data platform, how do you best approach being cost effective while you're growing the amount of compute and the amount of resources that you're using?
Yeah, this trade-off is a hard problem to solve. Again, it's about building a platform. For Box it's very specific: we are a SaaS company, software as a service, so we talk about the Rule of 40, which is a very important metric of business health for SaaS companies like Box. The Rule of 40 means your revenue growth rate together with your profit margin should add up to at least 40%, and then you're on the right path to sustainable growth. For the data platform team, we contribute to the profit margin part; that's the cost, and that's how our work translates to this business metric. So we do quarterly, and also monthly, cost forecasts. In these forecasts we need to take into account organic growth, meaning the traffic, and also the new use cases, and we do budget planning based on that. But of course you may go over budget sometimes, and then we need to think about whether we pay the licensing fee this year, or whether we can build a working solution
this year and then probably re-evaluate next year when we have more budget. It's always a trade-off you need to think about to make the right decision. Sometimes paying for a vendor could reduce your overall cost. To show you an example, we're uplifting our logging pipeline, because the logging pipeline is also under data platform at Box. We could pay a vendor to do the log aggregation for us, and in that way we can reduce the volume, the ingestion rate, going to another vendor, who is our logging vendor. So we're paying the first vendor, but it helps us reduce the ingestion, and then we pay less to the second vendor. In that way we can still get this vendor in but at the same time reduce the overall cost for the company, while also considering the scale we are operating at and the growth.
Something else, a challenge that you touched upon, is the story, the narrative, that you have to discuss with engineers when it comes to, let's say, maintenance or foundational work versus innovation work. Maybe walk us through in a bit more depth: how do you tell that story so that people are excited by foundational work that may not be the sexiest thing to add to a resume but is equally important for the company's bottom line?
So at Box we have two sets of metrics, overall for the company. One set, which of course every team is adopting, is the business metrics, and we call them ladder-up metrics. Our CTO, Ben Kus, actually came up with this, so let me briefly talk about what it is. To make it simple, there are three levels. The top level is the company-level metric: we're looking for profitable growth. That's very simple: with everything you contribute to that one, you're moving the needle for the company. The second level is at the product and
engineering level. Of course our go-to-market and other orgs have their own metrics, but for product and engineering we have four metrics we're tracking. I'm not going to share all of them with you, but I can give you an example and show you how we ladder up across the three levels. So this is the second level. The third level, or what we call the bottom level, is the one most relevant to everybody, every engineer and every product manager in this org, so it's our own team. For example, let's say for data platform we're introducing streaming capability on data platform, and we made it work. That's our kind of metric, and the engineers working on that can understand it; it's all streaming. But then how does streaming ladder up to the product-and-engineering-level metric? We have this Box metric called "enable new use cases by introducing streaming": you can build near-real-time anomaly detection or some other use cases. And then "enable new use cases" at the product and engineering level will ladder up to the company level, for sure: profitable growth. So for every engineer we have a metric that their work maps to, and gradually it ladders up to the top level. In the eyes of engineering, that's how I can convince them: what you are doing is making an impact on the company; we're moving the needle here. And at the same time, as I think I mentioned, at the data platform level we have our own platform metric, time to value. By introducing this, we see that we're reducing the onboarding time for our internal customers, that we're doing a great job. So these are the two sets of metrics we always tell our team, or the entire company, about.
Okay, that's really great. And as we close out our conversation, Shuang, what do you see next for the data platform at Box? And maybe walk through some broader data engineering or data platform trends that you see happening this year.
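[Editorial sketch] The three-level ladder-up structure described in the answer above can be illustrated with a small Python mapping. This is a minimal sketch under stated assumptions, not Box's actual tooling: the `LADDER` dictionary and the `ladder_up` function are hypothetical names introduced here, and only the three metric names come from the conversation.

```python
# Hypothetical sketch of "ladder-up" metrics: each metric points to the
# higher-level metric it contributes to. Only the metric names come from the
# conversation; the data structure and function are illustrative assumptions.
LADDER = {
    # team level -> product and engineering level
    "streaming capability on data platform": "enable new use cases by introducing streaming",
    # product and engineering level -> company level
    "enable new use cases by introducing streaming": "profitable growth",
    # company level: top of the ladder, nothing above it
    "profitable growth": None,
}


def ladder_up(metric: str) -> list[str]:
    """Return the chain from the given metric up to the company-level metric."""
    chain = []
    current = metric
    while current is not None:
        if current in chain:
            # Guard against a misconfigured ladder that loops back on itself.
            raise ValueError(f"cycle detected at {current!r}")
        chain.append(current)
        current = LADDER[current]  # raises KeyError for an unknown metric
    return chain


if __name__ == "__main__":
    for level in ladder_up("streaming capability on data platform"):
        print(level)
```

Walking the chain from a team-level metric makes the "your work moves the needle" story explicit: every engineer's metric resolves to the same company-level goal.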
Developer experience, for sure: we keep investing. And we're also building some tooling and frameworks to make it even easier for teams to interact with data platform for insights or for innovation at scale. We have a big group of insight products across Box, but in those teams not everybody is a big data expert. They have to do the aggregation, they have to use stream compute or batch compute, and then store the data somewhere and make it available for query. So we build this framework to help them aggregate their data, and then they can build business logic on top of that. That's one example I want to share: build some frameworks, maybe at the data platform level, put them in a common place, and then other teams can just plug in and use them very easily. And then the log pipeline, which I mentioned: we're uplifting our log pipeline. Logs are not a super shiny topic people are talking about, but they're very important for the company. There's developer troubleshooting, and I think some companies even draw insights from their logs for their businesses. The compliance team uses the log pipeline for auditing. For those use cases, we're thinking about a tiered log pipeline. Where you need real-time logs, they probably go through one pipeline; it could be more expensive, but that's the price we need to pay. If you only need to run some analytics, this could be a separate pipeline. And then there's the cold storage use case, where you just store for compliance: you retain the data for one year, but you don't do a lot of analytics on it, so that's cold storage we can put it in. That's the simple idea behind building the tiered log pipeline for the company. And of course, AI. I put that one last because everybody is talking about AI these days. Basically, it's about how AI can help data platform users do
their jobs more easily. I mentioned data observability; AI can do that for sure: detect anomalies in your pipeline, like data loss or late data arrival. AI can help you figure this out. And beyond that, I mentioned that for data discovery we're building the data catalog to do the metadata management. But even better, these days if you go to BigQuery you can just ask natural-language questions directly of your data sets, so that's something we can leverage for sure. So these are basically the trends I have seen and want to share with the audience.
That is awesome. Now, as we wrap up, Shuang, do you have any final closing words to share with the audience?
Yeah. Data, ML, AI: these areas are evolving so fast that the knowledge you have built over the years could be outdated when new technology comes up. So keep learning, learn from others, and learn from your experience. I'm very much looking forward to what's next for data, for ML, and for AI.
I couldn't agree more, and that's a great way to end today's podcast. Thank you so much, Shuang, for coming on DataFramed.
I was always joking with my team: if we look back, building this data platform is like climbing a high mountain. You can't reach the top within a day or two, and in Box's case we didn't reach the top within a year. We spent a couple of years to get where we are today, especially talking about the entire cloud migration we did in the past couple of years.
Shuang, it's great to have you on the show.
Nice meeting you again, Adel. I'm very excited to be on the show.
Awesome. So you are the group product manager at Box, where you lead the building of the data platform. Maybe to set the stage, what got you into this role, and how do you become a data product manager?
In my career, I have always been passionate about
what's made possible by data, ML, and AI, including optimizations, insights, and predictions. When I was doing my PhD in computer science at the Ohio State University, I was actually doing a lot of theoretical research. Then the summer of the year before my graduation, I got an internship at Qualcomm. That was my first exposure to industry, and I was very interested in the project I was working on. It was basically uplifting the user experience of streaming media over wireless networks by developing a machine-learning-based algorithm. I got very excited about that project because I was thinking this algorithm would very likely be running on the cell phones of millions of people. So that was the first transition in my career: I decided to join Google as a software engineer instead of staying in academia. My second transition also happened at Google. I joined Google Fiber, which is like a startup inside a big company. Our mission was to bring high-speed internet to households. It was a small team, and I was working with people from different functions: product managers, business development, marketing, all these different people. I got into customer calls together with our product manager, and I was amazed by how he asked questions to understand the pain points of the customers and then wrote requirements documents to solve the problems for our customers. So I transitioned gradually into product management, because I got very interested in this area. That was my second transition. Over the past few years I've been in product management in payments, in electric vehicle charging, and also in big data, cloud, machine learning, and AI. My third transition happened about three years ago, when I became a group product manager at Box. I was hired to build and lead a team of product managers to
build the Box data platform. So over my career journey so far there have been three transitions, but I've always been very interested in big data, ML, and AI. That's how I got here.
So maybe let's jump into the meat of today's discussion. Shuang, let's first focus on building a data platform, specifically the why behind building one. Maybe walk us through first what exactly a data platform is, to dispel any myths about it, and why organizations should invest in building one.
When you talk about a data platform and ask different people, they may give you different answers. But I think what's in common is that overall it's the high-scale data infra every company builds to solve the business problems of the company. I hope that's brief enough, because you can have different variations depending on which company or which industry you are in. In terms of the why, I can maybe share a little bit more about the story at Box. When I first joined Box about three years ago, we didn't actually have a data platform. We called it data infra, and different teams worked in data silos. They had their different data infra, and they had a dedicated set of people managing the data infra and managing all the data. It was really hard to scale at that time, and there was no way for these different teams to share data; they worked in silos. Not to mention, can you guarantee the right performance of your data infra? And there are cost challenges there: it's hard to scale, and there's reliability, quality, all these problems. So if we talk about the why here, think about what would happen to a company without it. Essentially, like I mentioned before, we want to have this high-scale data infra at the right scale. For Box, maybe I can share a few numbers so people can understand a bit more. We are a content
cloud company. We have millions of users and we manage billions of objects. When I talk about objects, these are files, folders, and other kinds of objects customers store in the Box content cloud, and they interact with these files and folders all the time: think about all the downloads, uploads, edits, and deletions. So imagine that kind of scale: you have all the user activity data, metadata, and file information you need to manage. If we work in data silos, where teams just manage their own data infra, it's a really hard problem to solve. So it's very important to build such high-scale data infra to solve the business problems of the company. Hopefully that's something people can resonate with when they hear about this.
Yeah, 100%. It's hard to imagine a company like Box, operating at such a high scale of data, not needing something like a data platform. And when you reflect on your experience leading the data platform group so far (you mentioned you joined three years ago, and a lot of work has happened since then), what do you think are the key components of building an effective data platform? What makes a data platform successful?
That's actually, I think, the core of a data platform. I always want to keep things simple, but like you said, what are the key components here? Maybe that's where we start when we build a platform, and then we can build on top of that. These are the building blocks of a data platform. Essentially, we want to get the data in and then make it available for different teams across the company to consume. You need to have a pipeline to get the data in; that's the data ingestion pipeline. But you need to think about all the data sources you have at a company. They could be on your platform, like other teams that are, let's say, storing the
transactional data but need to get that data ingested into your data platform for other teams to consume. It could also be some third-party tools that, let's say, your marketing team or customer success team are using. That's off-platform data, another type of data source. You need to figure out, for those data sources, how to build the right ingestion pipeline to get the data into your data platform. So the first key component, in my mind, is the ingestion pipeline. Second, you need to be able to process the data. That could be the ETL pipeline, which we work together with our data engineering team to build. But beyond that, there are teams who want to use your data processing capabilities. You probably get the data in for them, but they want to do some further processing, like application-specific objects, so they want to use your batch compute or stream compute capabilities. At Box we have streaming use cases for sure, both on the infra side and on the customer-facing side. For example, we have a security product team whose job is basically to detect anomalies in user activities. Let's say your company only operates in these regions, but you found some downloads, actually hundreds of downloads within a second, from a region you have never seen before. That's a suspicious location or suspicious activity. They want to detect it right away, or as we say, in near real time. That's when stream compute comes into the picture, and that team will need to leverage the stream compute capability we have on data platform. So that's the second key component, data processing; it could be batch or stream, or you could have both. And the third key component, which I think people talk about all the time, is the core of the core: the data lake. Once you ingest data and process it properly, you need to store and manage it somewhere, and then you need to support all the use cases
for your customers. Here I'm talking about internal customers, but of course they build products or analytics for external customers, or dashboards for leadership. So we need to choose the right tool and the right technology, and then we need to innovate on top of that: there's performance, there's cost, and of course the feature set you have in mind. To summarize, the three key components I have in mind are the data ingestion pipeline, the data processing capabilities, and the data lake.
Okay, that's really great, and there's so much to tease out here. Maybe focusing on that ingestion pipeline: for an organization like Box, walk through the complexities of trying to capture all those data sources. How do you approach it, especially when you're building out early in the data platform journey?
I briefly mentioned all the data sources. We have on-platform data sources, like some transactional data we capture in relational databases. But there they couldn't make it available for downstream teams to do analytics or build products; it's very hard to do it there. So that's one of the biggest data sources we needed to figure out. Then there are other data sources, like some metadata we store in other places, and beyond that there's events data, like user activities, downloads, and uploads, which we call enterprise events. That's another data source. I think we have at least 10 on-platform data sources. On the marketing, customer success, or data science side, they leverage some third-party pipelines; we actually use SnapLogic to get the data in. And then you have to manage all of it: if you only build one pipeline, it's not one size fits all. You need to figure out how you get data from different sources. They're in different formats, and they probably have
different SLA requirements, and we need to figure all of these out. And I think I mentioned the scale we are operating at at Box: imagine those millions of users interacting with their billions of objects every day. Think about the scale.
Yeah, that's really awesome. The second component that you mentioned is the processing pipeline, where I assume the data engineering team is managing it and building these ETL pipelines. Maybe walk me through how integral the relationship is between the product management team and the data engineering team, and how you build out those ETL pipelines in a way that satisfies all the requirements you have, whether for downstream users or to make sure the right security levels are in place. Walk me through the nuances of building effective ETL pipelines in such a circumstance.
I'm a product manager, and I have a team of product managers working on the data platform. We have engineers; we call them data platform engineers. But the third party you're talking about is this data engineering team. They're not part of us, not part of data platform, but we are in this partnership together. In Box's case, it's actually probably more complex than we all thought at the beginning of this conversation, because three years ago when I joined, we didn't have a decent data platform, and at the same time the entire company was starting the migration to the cloud. So we have been building the data platform together with doing the cloud migration. It's a great thing: by migrating to the cloud you can build the right data platform leveraging cloud-native solutions. But it also adds complexity to the project, or the program, we are managing. Of course, we engaged the data engineering team very early in this journey. We always have all the discussions together, because we always
think they're part of us; we're doing this all together. Even today, after we've finished the cloud migration, we are brainstorming together on what's next. This is basically about how a company structures its teams. At Box it happened that data engineering is part of our go-to-market org, so they have a lot of close interaction with our marketing and customer success teams. Those are teams that are not part of product and engineering, but data engineering has a good relationship with them. Building relationships with your stakeholders, whether in building a data platform or in the cloud migration, is super important. So they are like our partners; they work with marketing and customer success. Of course I was deeply involved in all those conversations, together with my team of product managers. We figured out that they have different use cases on the marketing side, because they need to use the data to run marketing campaigns, and they want to find the right target customers for a campaign. And on the business side, for business analytics, they want to create dashboards to track our revenue and do forecasting. Those are very different use cases from, let's say, the security product team, which wants to detect anomalies in our customers' content. Those are very different use cases, with different SLAs and different scale. For example, when do they expect the data to arrive? That's probably different from what other product teams are expecting. So we need to understand all these use cases across the company, because basically all the teams across Box are internal customers of data platform. In this journey we really enjoy the partnership with data engineering. They are like part of us, and they help us broaden our relationships, the
engagement with different internal customers.
Yeah, that's very fascinating. I think this is the first time I've heard of a data engineering team being part of a go-to-market team, but it's pretty interesting to see how the interlock works. Maybe switching gears here a bit, let's start at the very beginning. When you started building the data platform at Box, what were some of the key steps and milestones in building it? And if you were to zoom out outside of Box, what advice would you give to those looking to build a data platform right now? Where do you start?
Yeah, I was always joking with my team: if we look back, building this data platform is like climbing a high mountain. You can't reach the top within a day or two, and in Box's case we didn't reach the top within a year. We spent a couple of years to get where we are today, especially talking about the entire cloud migration we did in the past couple of years. Like I mentioned, when I joined we were in the very early phase of the migration to the cloud. It was not just data platform; the entire company was doing the cloud migration, and data platform is one of the foundational teams, so we kind of went first in this cloud journey. After we finished, together with other platform teams, our service teams and application teams could start doing the migration, because their services have been built on top of the services of platform teams. So in the very beginning, we needed to identify the key first steps for the entire company: platform teams, you need to go first, and then other teams can build on top of you. But if we talk about data platform specifically, we needed to identify the problem we had at that time. I think for everybody it was very clear:
we were working in data silos. Every team had to allocate resources to build and maintain their own data infra, and sometimes there was data inaccuracy and they had to figure it out by themselves. Let's say product analysts were expecting data for their monthly active user and weekly active user dashboards, but the data didn't come through. Where's the problem? They had to figure it out by relying on their own resources. There was no single source of truth at that time, so that was the biggest problem. That's why we identified the goal and aligned with all these stakeholders by talking to them constantly: we need to build this consolidated data platform, get all the data in one place, build robust, scalable, reliable ingestion pipelines, and provide the right data processing capabilities to all these teams. I think that's very important in the very beginning. I think you also mentioned, do you want me to share any advice with other companies?
Yeah, I'd love it if you can share advice.
Okay, yeah. I think alignment is very important; that's the most important thing. In our case we're basically dealing with the entire company. There's product, engineering, marketing, customer success; the entire go-to-market is in this picture as well. We have product support, who are also using data platform; our compliance team, who need to store the data and set a specific retention period for auditing; and then our data science, data engineering, product analytics, and business analytics teams. So it's a lot of teams we're dealing with. Getting alignment in the early phase is super important, but along the way you also need to constantly talk to them, because sometimes it requires realignment. Things change: they probably have some different use cases, or the way you thought things would work didn't. So those
are all the things, but overall it's all about alignment. I think that's the most important thing.
Yeah, and I really like the analogy of climbing a mountain that you use here, because you can extend that analogy and say, okay, the mountain is very high, but it has multiple peaks along the way, and you want to arrive at the first peak, the second peak, the third peak. So when we're talking here about the different peaks, how do you chunk up the massive journey of building a data platform into small, iterative goals that are achievable in the short term? How do you define those over time?
It's actually not easy, for sure. It's a company-wide program; everybody is deeply involved in this. But it's overwhelming in the beginning. In Box's case, we were not on the cloud. Not many people had cloud experience, not to mention the best practices in the cloud. If we were like, oh, let's just adopt everything, all the cloud services all at once, nobody could do that. It's really hard. So we tried to make things simple and build in an iterative way. We worked very closely with our architect group. At Box we have an architect leadership group with all the principal and distinguished architects, and we worked very closely with them along the journey. Even today, after we've finished, we still have all the architectural discussions with them. We identified two groups of use cases overall. Think about all those different use cases and stakeholders; we just divided them into two groups: group one is about uplift, and group two is about lift and shift. When we talk about uplift, think about our end goal: we need to build the right architecture to handle high-scale data processing in the cloud. Lift and shift is where we just want to optimize the delivery time while meeting the
needs, and probably we'll come back later and see what we can do. There would be some technical debt, but that's how we needed to operate in order to meet the timeline of the cloud migration or some other goals set by the company. So I think trying to make things simple is really important in this journey.
Yeah, I couldn't agree more. And you mentioned something here: if you want to adopt all of the cloud tools at the same time, it's not necessarily possible. When it comes to technology choices such as databases, frameworks, which cloud provider to choose, etc., how do you make those decisions? How do you approach these trade-offs?
So at Box we always have this debate when we talk about different choices of tools, technologies, or services: build versus buy. You need to answer this question every time. We would look at, of course, the feature set: these are the use cases we have in mind; does this tool have all the right features for us to solve these problems? Cost is another thing. If we buy, there's the licensing cost. How much would we spend? You need to have a more or less accurate estimate of that, with all the forecasts of your traffic over the next two or three years or even longer. And if you build it yourself, there's your engineering cost. You need to pay your engineers to build a product; it doesn't come free. Then there's maintenance cost. You build your own product, let's say from scratch or on top of some open source, but you need to maintain it all the time, and that's your engineering resources for sure. Of course, if there are issues, it might be faster to troubleshoot because it's your own engineering team and they know the code. But if it's a vendor, you have to file a ticket, and think about the back and forth; sometimes it doesn't work well. So I would say, if we want to go with buy, let's say the
cost is under our budget and we like the feature set provided by the vendor, we still want to look at the long term. It's not just that today we have this much traffic and these are the features we want; we need to think long term. Is this vendor in the right ecosystem? For example, does this vendor have great partnerships with others? You'll probably want to expand some use cases into areas or features provided by its partners. And what is the innovation speed of this vendor? You may go well beyond what you're looking for this year. And then good partnership is important. Not every vendor is easy to work with, so you need to look at how easy it is to have a good partnership. And for Box, I have one more thing to add: it's about all the security requirements. At Box, our security office has very strict security requirements, so when we pick vendors we need to look at all the requirements provided by our Global Security Office and make sure they check all the boxes. That's another important thing. Usually, when we decide we'll probably go with this vendor, we initiate our request to the Global Security Office very early, because that review could take a month or two. We don't want to decide we'll use this vendor and then end up getting declined by our security approval team.
Yeah, great ideas here on buy versus build. Maybe an additional component of building a data platform that we haven't touched upon yet is data observability and quality, which is really integral to maintaining high trust in the data platform. Maybe walk me through some of the best practices that you can share here when it comes to keeping data quality up and having data observability pipelines that monitor when data breaks. I'd love to learn here how you've approached that
as well, and when does that come in the journey?
Well, when we were doing the migration to the cloud, we didn't invest much in this area, to be honest. We actually structured our entire data platform engineering team as data management versus data transformation. You can hear from the names that they are both about the foundation of data platform; it's not developer experience we were thinking about. With data management we're thinking about data at rest: think about the data lake and the ingestion pipeline, how you get the data in. Of course there are related capabilities; I'm just trying to keep this simple. With data transformation we're talking about data in motion: the ETL, the processing, all the capabilities and orchestration we're providing. That's how we structured our engineering team, and of course I had my product managers providing coverage for both teams. But once we finished the migration, we were like, okay, on foundation we'll of course keep innovating, we'll adopt cloud-native solutions, we'll adopt best practices, but how about developer experience? Like what you mentioned, data quality and data observability. Our product analysts, let's say, today at 9:00 a.m.
Pacific time on Tuesday, are expecting all the data from the past 24 hours so that they can show the daily active user and monthly active user dashboards, and they want to present to leadership. But the data didn't arrive, and nobody told them. They found out maybe one or two hours later, and they had to come to data platform. Data platform is like, oh, we build and manage the pipeline, let's check with the data source, and then we talk to another team, which keeps the transactional data. That's a very hard process, or no process at all; it's just asking around. So we realized the problem, and that's just one of the examples on the developer experience side. There are many other things, like how to discover data easily, and whether we can provide a kind of playground, a production-like environment, for teams to play around with our capabilities on data platform before they go to production. All these things fall into developer experience. Then last year we decided to restructure the team. Instead of having data management versus data transformation, both of which are in foundation, we have a data platform foundation team versus a data platform developer experience team. Then it's easier to prioritize, because every time there's a developer experience request, it goes to the developer experience team, and foundation operates or executes against a separate roadmap, and we can prioritize accordingly. So structure actually goes first. Without the structure, we could want to have data observability and everything on the roadmap, but every time it would probably get deprioritized, because we didn't have a dedicated team investing in this. So starting from last year, and including this year as well, we are investing very heavily in developer experience. For data observability, I think I briefly touched
upon data freshness, but I don't know, do you want me to expand more on what we have built?
Yeah, I'd love to learn more, but what I'd actually like to ask, and I'll let you cover this in the next question, is: what do you think are the key components of a healthy developer experience for a data platform? Because freshness here is a part of it, and [unclear] is a part of it too, but maybe expand on what makes a great developer experience for a data platform as well.
So that actually comes back to the North Star for building a data platform: what metrics are you measuring? For platform teams like data platform, you're not directly delivering customer-facing products, so you can't use revenue to measure how good or how bad this is. But for platform teams, you actually have teams across the company using your capabilities. Those are your customers, and you can measure that. So we use time to value, or time to market, or time to production; we always use these three interchangeably. That's the metric we're measuring. Let's talk about time to value: what does this mean? Use that security product team as an example again. They want to build, let's say, a near-real-time anomaly detection product for our customers. Let's say they haven't used data platform at all. They need to onboard to data platform; that's the first step. How easy is the onboarding process? To be honest, when we first started this journey it was extremely hard. It would take them a quarter, like three months, just to onboard to data platform. Then they become a tenant of data platform and can start using the capabilities. After they onboard to data platform, how can we make it easier for them to discover all the different capabilities of data platform? We should have the right documentation for them: playbooks, tutorials, office hours. But those are maybe artifacts together with
process. Then they start experimenting with those capabilities. Can we have the right environment for them to play with them end to end? Once they're ready to go from dev to production, how do we provide what we call a shadow environment, with production-like traffic, so they have production volume and production-level diversity to try things out? Then they have the confidence to move to production. But along this journey there are different branches. For example, at a certain point they want to explore data. Is it easy for them to discover the data: the dataset, the data table, even the column? And then data observability also comes into the picture. They run their jobs and things happen. How can they troubleshoot? Or you can build some monitoring for them so they get alerts right away, and even better, you can tell them, oh, this is the issue, or, I already auto-recovered for you. That's the next level for sure. So it's all along this journey: today, for a security product team, if they want to build a new feature, from onboarding through all the exploration and experimentation to production, where our customers can use it, how long does it take? That's our North Star, and our goal is to shorten this time. Everything we're doing is to make this time shorter.
What's really wonderful about what you're saying is that it resonates a lot with me, because we also have our own data platform at DataCamp, and one thing that is really magical when the data platform works is just how democratized data can be for the wider organization. I'm pretty data fluent, but I wouldn't call myself a data scientist. But I do know where the data I need is, I have access to it, and in a semi-production environment I'm able to experiment with it. So what are the key aspects of making data accessible to non-technical users? When it comes to data
democratization, I'd love to learn here, Shuang, what are the nuances related to data democratization that anyone working on a data platform should be aware of?
Yeah, I think I briefly talked about that, the restructuring of data platform. Now we have a dedicated data platform developer experience team, and that team is dedicated to features like this, which we call data discovery, for both technical and non-technical users, of course. Take our product analysts and business analysts, for example; they fall into the non-technical user category. They get a request from product teams, sometimes from leadership, like, can you figure out the daily active users, or the feature usage or product usage of this newly launched feature or product? I was talking to our manager of product analytics last year, and I asked him, how much time do you guys spend trying to figure this out? He told me that on average they would spend four weeks figuring out where the data is. The way it works today is, let's say a software engineer from that particular team launches a new feature. They just write everything into a BigQuery table: no comments, no annotations. Nobody knows which table it is except this person himself or herself, so there's no good documentation. And then those product analysts get a request, and they have to search all the Confluence pages, maybe the Box documents, with no luck in most cases. So they have to try out different datasets and figure out which one is the right data field they need to explore. That's a big pain point, and that's why we decided we have to invest in this area, data discovery. So now we're pushing the teams, the owners of the data and the tables, to tag their tables. They can add descriptions, and they can tag columns; for example, this column is specifically for this new feature and it's about usage, something
like that. Then we can have metadata management built right into data platform. We are actually leveraging data catalog today for data discovery, data observability, data lineage, all these features, and data classification as well. So I believe data catalog is a very good tool for companies that want to really invest, or invest more, in their data platform developer experience.
Shuang, as we close up, I'd love to discuss some of the challenges that you've encountered along the way that you think are really common when building a data platform. What would you say are the top challenges folks may encounter here that are relevant to building a data platform?
A lot of challenges, right? Again, trying to keep things simple, maybe I can talk about the top three. The first, and I think I briefly mentioned this one: when we built this data platform while doing the cloud migration, almost everybody was new to the cloud. The cloud world is very different from on-prem, and everybody was doing this and learning the best practices from the industry at the same time. So we made mistakes, but we moved forward. It was a big challenge we were tackling back then. That's the first one. The second one: this data platform we're building, and the broader cloud migration project, is actually the biggest project ever at the company, so it was overwhelming for everybody. But we were able to break this overwhelming work into milestones and get alignment across the company and across different stakeholders. So that's the second challenge; I'm also sharing how we tackled these challenges, because otherwise we wouldn't have gotten here. And the third is about team morale. Many engineers want to build new things, so there's a balance here between new feature development, which we call uplift, and lift and shift. So we need to tell the
right story to them, like, oh, we're probably doing lift and shift for some of the components for now, but we'll come back. And in our data platform foundation team, that's exactly what we are doing now: we're revisiting the lift-and-shift work we did during the cloud migration. So these are the three challenges I want to share.
Yeah, and when you mention cloud migration, I think a big trade-off that organizations face here is the trade-off between being cost-effective while also scaling. How do you approach that as a function? As you're growing the data platform, how do you best approach being cost-effective while you're growing the amount of compute and the amount of resources that you're using?
Yeah, this trade-off is a hard problem to solve. Again, it's like building a platform. For Box it's very specific: we're a SaaS company, software as a service, so we talk about the rule of 40, which is a very important metric for SaaS companies like Box for business health. The rule of 40 means your revenue growth rate together with your profit margin should be at least 40%, and then you're on the right path to sustainable growth. For the data platform team, we contribute to the profit margin part; that's the cost, and that's how it's translated to this business metric. So we do quarterly, and also monthly, cost forecasts. In these forecasts we need to take into account organic growth, meaning the traffic, and also the new use cases, and we do budget planning based on that. But of course you may go over budget sometimes, and then we need to think about: shall we pay the licensing fee this year, or can we build a working solution this year and then probably re-evaluate next year when we have more budget? Because
it's always a trade-off you need to think about to make the right decision. Sometimes paying for a vendor can reduce your overall cost. To show you an example, we're uplifting our logging pipeline, because the logging pipeline is also under data platform at Box. We could pay a vendor, and the vendor could do the log aggregation for us. That way we can reduce the volume, the ingestion rate, going to another vendor, who is our logging vendor. So we're paying for the first vendor, but it helps us reduce the ingestion, and then we pay less for the second vendor. In that way we can still get this vendor in, but at the same time reduce the overall cost for the company, while also considering the scale we are operating at and the growth.
Something else, a challenge that you touched upon, is the story, the narrative that you have to discuss with engineers when it comes to, let's say, maintenance or foundational work versus innovation work. Maybe walk us through in a bit more depth how you tell that story, so that people are excited by foundational work that may not be the sexiest thing to add to a resume but is equally important for the company's bottom line.
So at Box we have two sets of metrics overall for the company. One set, which of course every team is adopting, is the business metrics, and we call them ladder-up metrics. Our CTO, Ben Kus, actually came up with this, so let me briefly talk about what it is. To make it simple, there are three levels. The top level is the company-level metric: we're looking for profitable growth. That's very simple; with everything you contribute to that one, you're moving the needle for the company. The second level is the product and engineering level. Of course, our go-to-market and other orgs have their own
metrics, but for product and engineering we have four metrics we're tracking. I'm not going to share all of them with you, but I can give you an example and show you how we ladder up through the three levels. So that's the second level. The third level, or what we call the bottom level, is the one most relevant to everybody, every engineer and every product manager in this org; it's our own team. For example, for data platform, let's say we're introducing streaming capability on data platform, and we made it work. That's our metric, and the engineers working on that can understand it; it's all streaming. But then how does streaming ladder up to the product and engineering level metric? We have this Box metric called enable new use cases by introducing streaming: you can build near-real-time anomaly detection or some other use cases. And then that enable-new-use-cases metric at the product and engineering level ladders up to the company level, profitable growth. So every engineer has a metric that their work maps to, and it gradually ladders up to the top level. In the eyes of engineering, that's how I can convince them that what they are doing is making an impact on the company; we're moving the needle here. And at the same time, I think I mentioned that at the data platform level we have our own platform metric, time to value. By introducing this, we're reducing the onboarding time for our internal customers; we're doing a great job. So these are the two sets of metrics we always tell to our team or to the entire company.
Okay, that's really great. And as we close out our conversation, what do you see coming next for the data platform at Box? And maybe walk us through some broader data engineering or data platform trends that you see happening this year.
So, developer experience for sure; we keep investing there. And then we're also building
some tooling and frameworks to make it even easier for teams to interact with data platform for insights or for innovations at scale. We have a big group of insight products across Box, but on those teams not everybody is a big data expert. They have to do the aggregation, they have to use stream compute or batch compute, then store the data somewhere and make it available for query. So we built this framework to help them aggregate their data, and then they can build business logic on top of that. That's one example I want to share: build some frameworks, maybe at the data platform level, put them in a common place, and then other teams can just plug in and use them very easily. And then the log pipeline, which I mentioned: we're uplifting our log pipeline. Logs are not a super shiny topic people are talking about, but they're very important for the company. There's developer troubleshooting, and I think some companies even draw insights from their logs for their businesses. The compliance team uses the log pipeline for auditing, those kinds of use cases. So we're thinking about a tiered log pipeline. Where you need real-time logs, they probably go through one pipeline; it could be more expensive, but that's the price we need to pay. If you only need to run some analytics, this could be a separate pipeline. And then there's the cold storage use case, where you just store for compliance: you retain the data for one year but you don't do a lot of analytics. That's cold storage we can put in there. So that's the simple idea behind building the tiered log pipeline for the company. And of course AI. I put that one last because everybody is talking about AI these days. Basically, it's how AI can help data platform users do their jobs more easily. I mentioned data observability; AI can do that
for sure: detect anomalies in your pipeline. If there's data loss or late data arrival, AI can help you figure this out. And beyond that, I mentioned that for data discovery we're building a data catalog to do the metadata management, but even better, I think these days if you go to BigQuery you can just ask natural-language questions directly about your datasets. That's something we can leverage for sure. So these are basically the trends I have seen and want to share with the audience.
That is awesome. Now, as we wrap up, Shuang, do you have any final closing words to share with the audience?
Yeah, I think data, ML, AI, these areas are evolving so fast. The knowledge you have built over the years could be outdated when there's new technology coming up. So keep learning, learn from others, learn from your experience. And I'm very much looking forward to what's next for data, for ML, and for AI.
I couldn't agree more, and that's a great way to end today's podcast. Thank you so much, Shuang, for coming on DataFramed.
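As a footnote to the data observability discussion (analysts expecting fresh data at 9:00 a.m. and only finding out hours later that it never arrived), here is a minimal sketch of a late-arrival check. It is a plain rule-based check, not the AI-based detection mentioned in the conversation, and it is not Box's implementation. The `FreshnessExpectation` class, the `check_freshness` function, and the `daily_active_users` dataset name are all hypothetical, and time zones are left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FreshnessExpectation:
    """Hypothetical description of when a dataset must be up to date."""
    dataset: str
    deadline: datetime   # when consumers expect the data to be ready
    max_age: timedelta   # how old the newest record may be at that point


def check_freshness(
    expectation: FreshnessExpectation,
    last_arrival: Optional[datetime],
    now: datetime,
) -> Optional[str]:
    """Return an alert message if data is missing or stale, else None."""
    if now < expectation.deadline:
        return None  # consumers do not need the data yet
    if last_arrival is None:
        return f"{expectation.dataset}: no data has arrived"
    if now - last_arrival > expectation.max_age:
        return (
            f"{expectation.dataset}: newest data arrived at "
            f"{last_arrival.isoformat()}, which is too old"
        )
    return None


# Example: a dashboard that needs data no older than one hour at 9:00 a.m.
expectation = FreshnessExpectation(
    dataset="daily_active_users",
    deadline=datetime(2024, 1, 2, 9, 0),
    max_age=timedelta(hours=1),
)
now = datetime(2024, 1, 2, 9, 0)

# Newest data landed at 3:00 a.m., six hours before the deadline: alert.
late_alert = check_freshness(expectation, datetime(2024, 1, 2, 3, 0), now)
# Newest data landed at 8:30 a.m., within the allowed hour: no alert.
fresh_alert = check_freshness(expectation, datetime(2024, 1, 2, 8, 30), now)
```

In a setup like the one described, the alert message would go to the data consumers and the pipeline owners at the deadline, rather than being discovered by asking around afterwards.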