The Data Science Process - A Visual Guide (Part 1)

The Evolution of Data Science: A Journey from Data Mining to CRISP-DM and the OSEMN Framework

As I reflect on my journey into the world of data, I am reminded of the early days of data mining. It was a time when the field was still gaining traction, and there was a lack of standard protocols for carrying out data mining tasks in a robust manner. The introduction of the Cross-Industry Standard Process for Data Mining (CRISP-DM) in 1996 marked a significant milestone in this journey. CRISP-DM was designed to provide a standardized workflow for data mining, ensuring that the same protocol could be adopted and applied across various industries.

The CRISP-DM framework consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Each phase is crucial in its own right, and together they form the backbone of the data mining process. Business understanding is essentially domain understanding: knowing the business or field you work in and selecting a specific area of it to study. Data understanding focuses on gathering the relevant data and becoming familiar with it. Data preparation transforms the raw data into a clean data set suitable for modeling. Modeling trains machine learning models to extract insights from the data, and evaluation checks that these models are accurate, reliable, and meet the original objectives. Finally, deployment delivers the resulting insights to stakeholders as actionable recommendations.
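As a minimal sketch, the phases can be written down as an ordered enumeration together with the moves between them. The six phase names (business understanding, data understanding, data preparation, modeling, evaluation, deployment) come from the CRISP-DM reference model; the `ALLOWED_NEXT` table and the helper functions are hypothetical names introduced here for illustration, and the backward moves are one reading of the feedback arrows in the commonly reproduced CRISP-DM diagram.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Hypothetical transition table (illustration only, not a library API).
# Forward moves follow the nominal order; backward moves model the
# feedback loops, since the process is iterative rather than linear.
ALLOWED_NEXT = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.DATA_PREPARATION, Phase.BUSINESS_UNDERSTANDING},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.EVALUATION, Phase.DATA_PREPARATION},
    Phase.EVALUATION: {Phase.DEPLOYMENT, Phase.BUSINESS_UNDERSTANDING},
    Phase.DEPLOYMENT: set(),
}


def can_move(current: Phase, target: Phase) -> bool:
    """Return True if a project may move from `current` to `target`."""
    return target in ALLOWED_NEXT[current]


def nominal_order() -> list[str]:
    """Phase names in the order a first pass through the process visits them."""
    return [phase.name for phase in sorted(Phase, key=lambda p: p.value)]
```

For example, `can_move(Phase.MODELING, Phase.DATA_PREPARATION)` is true, reflecting that modeling often sends you back to prepare the data differently, while nothing follows `Phase.DEPLOYMENT` in this simplified table.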

Fast-forwarding 14 years to 2010, another significant milestone was the introduction of the OSEMN framework (O-S-E-M-N, pronounced "awesome"), which describes the key tasks of a data scientist. When I began my own journey into the world of data in 2004, the field was still known as data mining, also called knowledge discovery in data, and much of the emphasis was on translating data into knowledge. Over the years it has matured from a core of data collection, data pre-processing, and data modeling into a comprehensive field that also encompasses software engineering, data engineering, and data visualization.

A typical data scientist's skill set has expanded significantly over the years. At its core is programming, which makes every other task possible: data collection, pre-processing, exploratory data analysis, descriptive statistics, data visualization, and model building using machine learning and deep learning. Mathematical concepts such as linear algebra, geometry, calculus, and discrete mathematics underpin the understanding of machine learning. Software engineering skills, which overlap with data engineering, are also crucial for optimizing code to run faster, deploying models, creating web applications, and developing APIs.

The typical data science life cycle begins with data collection and pre-processing, where pre-processing means transforming the data into a high-quality version that can be used to build predictive models. This is followed by exploratory data analysis, model building, and model deployment. Finally, insights are delivered to business stakeholders through storytelling, problem-solving, and communication.

Soft skills play a crucial role in the success of any Data Scientist. Insights must be presented in an engaging manner to non-technical stakeholders, making it essential to develop strong communication skills. Additionally, problem-solving is a critical aspect of data science, requiring collaboration with stakeholders to identify business problems and provide actionable recommendations.

A Closer Look at CRISP-DM

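CRISP-DM, the Cross-Industry Standard Process for Data Mining, provides a structured approach to the data mining process. It was introduced in 1996 as a set of best practices so that the same reliable workflow could be adopted across industries. Beyond giving practitioners a consistent process to follow, it was also meant to instill confidence in customers and stakeholders looking to adopt data mining in their organizations. A more in-depth historical look at CRISP-DM is provided by Wirth and Hipp (2000).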

A Closer Look at the OSEMN Framework

The OSEMN framework describes the key tasks of a data scientist. O-S-E-M-N (pronounced "awesome") stands for Obtain, Scrub, Explore, Model, and iNterpret: obtaining the data, scrubbing (cleaning) it, exploring it through descriptive statistics and visualization, modeling it with machine learning, and interpreting the results for stakeholders.

The final step, interpretation, is where soft skills such as communication, collaboration, and problem-solving come in, because results only matter once they are explained to stakeholders in terms they can act on. In that sense the framework goes beyond mere technical proficiency.
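The five OSEMN steps (Obtain, Scrub, Explore, Model, iNterpret) can be sketched as a chain of small functions. This is only an illustration using the Python standard library: the `RAW_CSV` data set, the column names `hours` and `score`, and all function names are made up for the example, and a hand-written least-squares line stands in for whatever model a real project would use.

```python
import csv
import io
import statistics

# Hypothetical in-memory data standing in for a real data source.
# The third record has a missing score on purpose.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
"""


def obtain(text):
    """Obtain: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Scrub: drop records with missing values and convert text to numbers."""
    return [
        (float(row["hours"]), float(row["score"]))
        for row in rows
        if row["hours"] and row["score"]
    ]


def explore(data):
    """Explore: compute simple descriptive statistics."""
    scores = [y for _, y in data]
    return {"n": len(data), "mean_score": statistics.mean(scores)}


def model(data):
    """Model: fit score = slope * hours + intercept by ordinary least squares."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def interpret(slope, intercept):
    """iNterpret: turn the fitted numbers into a sentence for stakeholders."""
    return (
        f"Each extra hour is associated with about {slope:.1f} more points "
        f"(baseline {intercept:.1f})."
    )


data = scrub(obtain(RAW_CSV))
summary = explore(data)
slope, intercept = model(data)
print(summary)
print(interpret(slope, intercept))
```

Keeping each step as its own function mirrors the framework: any one step can be swapped out (a database query in `obtain`, a different model in `model`) without touching the others.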

A Brief History of CRISP-DM

CRISP-DM was released in 1996, when data mining had only just started to gain traction and still lacked a standard protocol for carrying out tasks in a robust manner. Adopting these best practices helped lay a solid foundation for early adopters. The process consists of six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

A Schematic Diagram of CRISP-DM

The following is a schematic outline of the CRISP-DM framework, listing the phases in their nominal order:

Business Understanding

Data Understanding

Data Preparation

Modeling

Evaluation

Deployment

Conclusion

In conclusion, my journey into the world of data has taken me through various stages, from the early days of data mining to the more comprehensive field we know today as data science. The introduction of CRISP-DM in 1996 marked a significant milestone, providing a standardized protocol for carrying out data mining tasks. The OSEMN framework followed in 2010, summarizing the key tasks of a data scientist: obtaining, scrubbing, exploring, and modeling data, and interpreting the results.

As I reflect on my experiences, I am reminded that data science is no longer just about technical proficiency but also encompasses essential soft skills like communication, collaboration, and problem-solving. By emphasizing these aspects, we can ensure that insights gained from data mining are actionable and effective in real-world applications.

Video Transcript

In this video I'm going to be talking about the data science process, the building blocks of a typical data science workflow. So without further ado, we're starting right now.

The contents of this video are borrowed from a previous article that I have written on the Medium platform, in Towards Data Science, entitled "The Data Science Process: A Visual Guide to Standard Procedures in Data Science". So let me do it like this: I'm going to be reading the contents of the article, and I'm going to be adding examples as I go along.

Okay, so let's suppose that you're given a data problem to solve and you're expected to produce unique insights from the data given to you. The question is: what exactly do you do to take a data problem through to completion and generate data-driven insights? And most importantly of all, where do you start?

To answer this question, let us use an analogy. In the construction of a house or a building, the guiding piece of information is the blueprint. What sort of information is contained within these blueprints? It includes the building infrastructure, the layout and exact dimensions of each room, the location of the water pipes, where the electrical wires are, and so on.

Continuing from where we left off: where do we start when given a data problem? That is where the data science process comes in. As will be discussed shortly, the data science process provides a systematic approach for tackling a data problem. By following these recommended guidelines, you will be able to make use of a tried-and-true workflow in approaching data science projects.

Before we dive into the specific details of the data science process, let's have a look at the data science life cycle. At its core, a typical data science process essentially boils down to the data science life cycle, which comprises data collection, data cleaning, exploratory data analysis, model building, and model deployment. Further information on the data science roles has been provided previously in a video by Ken Jee on his YouTube channel, so I'm going to provide the link to that video in the description of this video.

An infographic of this life cycle is shown here. Here you can see that data engineers are responsible for data collection and cleaning; data analysts are responsible for cleaning the data and performing the exploratory data analysis; machine learning engineers are responsible for model building and model deployment; and data scientists are pretty much expected to be able to perform all of these tasks.

Such a process or workflow of drawing insights from data can be described by the CRISP-DM and OSEMN frameworks. Let's continue with CRISP-DM. It was released back in 1996, at a time when data mining had just started to gain traction and was missing a standard protocol for carrying out data mining tasks in a robust manner. Fourteen years later, in 2010, the OSEMN framework was introduced (O-S-E-M-N, although it sounds like "awesome", right?), and it generally describes the key tasks of a data scientist.

Personally, I ventured into the world of data back in 2004, and at the time the field was known as data mining. Much of the emphasis at the time was placed on translating data to knowledge; another common term used to refer to data mining at the time was knowledge discovery in data. A popular website back then, and also currently, is KDnuggets; that's a great resource that I always look into when learning data mining and data science. Over the years, the field has matured and evolved from the core of data collection, data pre-processing, and data modeling into what is now known as data science.

So let's take a moment here to reflect upon what the data science skill set encompasses. What I've done is create an infographic that highlights the eight essential skill sets required to become a data scientist. The infographic is shown here, and I'm also going to provide the link to it in the description of the video, along with a link to a prior video where I talked about how to become a data scientist, the learning path, and the skill sets needed.

At the core of the eight skill sets is programming, because programming allows a data scientist to perform all of the various tasks that are needed: data collection, data pre-processing, exploratory data analysis, descriptive statistics to provide initial insights on the data, model building using machine learning and deep learning, and data visualization. The underlying principles and understanding of machine learning lie in mathematics, in concepts such as linear algebra, geometry, calculus, and discrete mathematics. Another very important skill set is in the realm of software engineering, which also touches upon the field of data engineering: how you can optimize your code, how you can make your code run faster, how to deploy your model, make a web application, or create an API. So as you can see, there are a lot of skill sets involved in becoming a full-stack data scientist.

As I've mentioned previously, the typical data science life cycle starts from data collection and data pre-processing. Data pre-processing essentially transforms the data into a high-quality version that can be used for creating predictive models, and after that you deliver the insights to the business stakeholders. Insights, storytelling, and problem solving are the essential parts of the soft skills.

All right, now let's have a closer look at CRISP-DM. The acronym stands for Cross-Industry Standard Process for Data Mining, and it was introduced in 1996 in an effort to standardize the process of data mining (also referred to as knowledge discovery in data) so that a standard and reliable workflow can be adopted and applied in various industries. Essentially, CRISP-DM was created as a set of best practices so that the same protocol could be used in several industries. Aside from providing a reliable and consistent process to follow in carrying out data mining projects, it would also instill confidence in customers and stakeholders who are looking to adopt data mining in their organization. It should be noted that back in 1996 data mining had just started, and the adoption of these best practices would help to lay a solid foundation and groundwork for the early adopters. A more in-depth and historical look at CRISP-DM is provided in the article by Wirth and Hipp (2000).

So this is the schematic diagram of CRISP-DM, and it starts at business understanding. Business understanding is essentially domain understanding. If you're coming from the field of business, it's the knowledge of the business and the industry that you're working in; if you're coming from biology, it's going to be biology domain expertise. Afterwards, you select a particular area of the business, or a specific area of the domain, that you would like to study. That essentially means taking a subset or a specific niche of the domain, and you're going to have that as your data. For that particular data, you would also have to have this sort of data understanding ...