Why is Finding Data so Hard with Shinji Kim, CEO at Select Star

The Challenges of Choosing the Right Data Set: A Common Issue for Data Scientists

Using the right data set is a crucial aspect of data quality, and it's an issue that many data scientists struggle with. According to Shinji Kim, CEO at Select Star, the root of the problem is simple: there are too many data sets, which makes it difficult to determine which one to use for a particular analysis.

In the past, when there were fewer data sets, data analysts and data scientists could simply export the raw data, prep it, and create reports, often in Excel spreadsheets, and that worked fine. Today, expectations are higher. Data teams want live dashboards that draw up-to-date, enriched data from multiple sources and provide richer and more accurate results for decision-making.

This shift has benefits, but it also changes the nature of the problem. Historically, the biggest complaint from data teams was that they spent too much time prepping data because they were working directly from raw data. Today, transformation tools such as dbt let teams build jobs that prep the data automatically. That is an improvement, but it has also created many more tables, joins, and views inside the database, so understanding which model is the right one to use has become increasingly important.

As a result, data scientists face a significant challenge in discovering what's going on within their data. With hundreds or thousands of different tables created for analysis, it can be overwhelming to determine which one is the right one to use. Even in a fairly small company, the data team will typically have access to a large set of data, making it difficult to navigate.

There are two main parts to this problem. First, people who have been at a company for more than about six months usually have a good idea of which data sets are trusted and which are the main ones to build on, but newcomers do not. When you first start, working this out is hard, and the result is a steep learning curve.

Second, when a data scientist, analyst, or engineer who built many of the models leaves a company, it is very hard for the rest of the data team to parse through their work and identify the main sources of truth they were using. Even when some documentation exists, much of that knowledge is embedded in their code and SQL queries, making it difficult for others to replicate and maintain their models.
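
One way to start recovering knowledge that is embedded in SQL queries is to count which tables a set of saved queries actually reference. The sketch below is only an illustration under stated assumptions: the query strings and table names are hypothetical, and the regular expression is a simplification that looks only at the identifier following FROM or JOIN (it does not handle subqueries, CTEs, or quoted identifiers).

```python
import re
from collections import Counter

# Simplified pattern: the identifier that follows FROM or JOIN.
# Assumption: plain, unquoted names such as schema.table.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def count_table_references(queries):
    """Count how many queries reference each table name."""
    counts = Counter()
    for sql in queries:
        # Use a set so a table used twice in one query is counted once.
        tables = {name.lower() for name in TABLE_REF.findall(sql)}
        counts.update(tables)
    return counts


# Hypothetical saved queries, standing in for a departing analyst's work.
saved_queries = [
    "SELECT * FROM analytics.orders o "
    "JOIN analytics.customers c ON o.customer_id = c.id",
    "SELECT count(*) FROM analytics.orders WHERE status = 'paid'",
    "SELECT * FROM raw.events e "
    "JOIN analytics.orders o ON e.order_id = o.id",
]

usage = count_table_references(saved_queries)
for table, n in usage.most_common():
    print(table, n)
```

Tables that appear in many queries are candidates for the "main sources of truth" described above, though frequency of use is only a rough signal and does not by itself show that a table is trusted.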

The industry as a whole is starting to mature as data teams adopt more software engineering practices. Still, in many companies data teams work in their local environments, running different SQL queries that end up embedded in specific dashboards. The understanding of which data sets to use, how to join them, and what to filter on therefore remains tribal knowledge inside the organization, and this is why it is hard to find the right data to use.

You mentioned the idea that it's often quite hard for data scientists to know they're using the right data set. This seems like something that's incredibly important: you want to be using the right data, and that's a very important part of data quality. So why is it such a big problem for data scientists to find which is the right data set to use for any particular analysis?

Yeah, I mean, to put it simply, we have too many data sets. In the past, when we didn't have all these data sets, a lot of data analysts and data scientists were able to just export the raw data and then prep it to create the reports, and that was fine. A lot of this data analysis has been done in Excel spreadsheets in the past, and we lived fine. But now we want to have live dashboards, with data that is enriched from multiple data sources, that is up to date, and that does provide richer and more accurate results for our decisions.

This could still be the complaint today, but in the past the biggest complaint that I've always heard from the data team is that they spend too much time on prepping the data, because they were accessing the raw data. Today there are all these transformation tables. You have dbt and others to make your own jobs to prep the data automatically, and that's all great. But now you have created a lot more tables and joins and views inside. It's a matter of understanding which one is the right model to use; that's becoming more important. But it is hard, because you now have to consider hundreds or thousands of different tables that are created for analysis, and you need to figure out which one is the right one to use.

Okay. Once you start looking through hundreds or thousands of different tables, I can see how that becomes a very difficult problem to discover what's going on. And certainly even in a fairly small enterprise or small company, you're going to get a pretty large set of data there.

Yeah, I would say in a way there are two parts. One is, if you've been in the company for, I don't know, more than six months, then you might already have a pretty good idea which are the trusted data sets and which are the main data sets that you can work off from. But when you first start, this is really hard.

Another part that I see as an issue that happens a lot is that you have a data scientist or data analyst or engineer that has built a lot of these models. One day, if that person leaves the company, it's very hard for the next data scientist or the rest of the data team to parse through what have been the main sources of truth that the person was using. There might have been documentation or whatnot, but it's always really hard to dig through all the work that they've done, because a lot of it is embedded inside of their code and in their SQL queries.

So I would say overall, and this is just the phase that we are in as an industry, the industry is starting to mature as we are adopting more software engineering practices. But in the past, and still in many companies, a lot of data teams work in their local environment. They run different SQL queries and then embed them for their specific dashboards. So the understanding of which data sets to use, how to join them, and what to filter on, a lot of this is embedded as tribal knowledge in organizations today, and this is why it's hard to find the right data to use.