Tons of FREE Data for Data Science (TidyTuesday)

Exploring Tidy Tuesday Data Sets: A Hands-On Approach to Data Science

The world of data science is filled with exciting and challenging projects, and one of the best ways to hone your skills is to work with a variety of data sets. In this article, we look at several data sets from Tidy Tuesday, a weekly data project run by the R for Data Science (R4DS) Online Learning Community that releases a new, documented data set every Tuesday for analysis and visualization.

The Tidy Tuesday release for January 7, 2020 is a data set called "Australian Fires." It is nicely documented, with a background section that explains where the data comes from and a URL for the most recently updated data. It is a good fit for anyone who wants to practice on the weekly data or use it in their own projects.

To access this data set, one can either read it in directly from the GitHub link provided, using R or Python, or, in R, install the `tidytuesdayR` package and specify the release to download. The "Australian Fires" release includes rainfall and temperature files, among other data.

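Loading the two files in R is straightforward with `read_csv()` from the `readr` package: `rainfall <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-07/rainfall.csv")` and `temperature <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-07/temperature.csv")`. Copy the exact links from the README for that week, since repository paths can change. Running these two lines defines the variables `rainfall` and `temperature`, each holding one of the two tables. According to the documentation, `temperature.csv` has five variables (city name, date, temperature, temperature type, and site or weather station), while the rainfall file records station code, city name, year, month, day, rainfall, period, quality, latitude, longitude, and station name.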
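As a rough sketch of reading the files straight from GitHub, the snippet below builds the raw URLs from a release date and a file name. It assumes the weekly files sit in the `rfordatascience/tidytuesday` repository under a year and date folder, and that the branch is called `master`; the helper names `tt_csv_url` and `read_tt_csv` are made up for this illustration and are not part of any package.

```r
# Assumption: weekly files live under data/YEAR/DATE/ in the
# rfordatascience/tidytuesday repository; the branch name may differ.
tt_csv_url <- function(date, file, branch = "master") {
  year <- substr(date, 1, 4)
  paste(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday",
    branch, "data", year, date, file,
    sep = "/"
  )
}

rainfall_url    <- tt_csv_url("2020-01-07", "rainfall.csv")
temperature_url <- tt_csv_url("2020-01-07", "temperature.csv")

# Hypothetical wrapper: reads one file with readr (needs network access).
read_tt_csv <- function(url) {
  readr::read_csv(url)
}

# Uncomment to download the data:
# rainfall    <- read_tt_csv(rainfall_url)
# temperature <- read_tt_csv(temperature_url)
# str(temperature)
```

Keeping the URL construction separate from the download makes it easy to point the same code at another week by changing only the date and file name.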

One of the benefits of using Tidy Tuesday datasets is that they are well-documented and provide a wealth of information about their origin and structure. Additionally, these datasets are often used as examples in other projects, allowing users to build upon existing work and learn from others' approaches.

Let's move on to another dataset, "Beer Production," which comes from the Alcohol and Tobacco Tax and Trade Bureau. This dataset provides state-level beer production by year, the number of brewers by production size, and monthly beer statistics. To access it, we can use the `tidytuesdayR` package in R. If the package is already installed, a call to `tidytuesdayR::tt_load()` with either the release date or the year and week number downloads the release, and each of its four data frames can then be selected with the `$` operator.

The beer production release is a good one for practice because it spans four related data frames. By analyzing it, we can compare production across states and years, look at how the number of brewers varies with production size, or explore other relationships within the data.
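The package-based route can be sketched as follows. `tidytuesdayR::tt_load()` is the package's loader; the wrappers `load_release` and `table_sizes` are hypothetical names introduced here, and the week number 14 for the beer production release is an assumption to check against the repository. The demonstration at the bottom uses two small stand-in tables so that it runs without network access.

```r
# Hypothetical helper: download one weekly release with tidytuesdayR.
# Assumes the package is installed: install.packages("tidytuesdayR")
load_release <- function(year, week) {
  if (!requireNamespace("tidytuesdayR", quietly = TRUE)) {
    stop("Please install the tidytuesdayR package first.")
  }
  tidytuesdayR::tt_load(year, week = week)
}

# Hypothetical helper: one row per data frame in a named list of tables.
table_sizes <- function(tables) {
  data.frame(
    table = names(tables),
    rows  = vapply(tables, nrow, integer(1), USE.NAMES = FALSE),
    cols  = vapply(tables, ncol, integer(1), USE.NAMES = FALSE),
    row.names = NULL
  )
}

# Offline demonstration with two small stand-in tables:
demo <- list(
  states = data.frame(state = c("AK", "AL", "AR"), barrels = c(1, 2, 3)),
  sizes  = data.frame(brewer_size = c("small", "large"))
)
print(table_sizes(demo))

# With network access (week 14 of 2020 is assumed to be the beer release):
# tuesdata <- load_release(2020, 14)
# table_sizes(tuesdata)
# head(tuesdata[[1]])  # or use $ with the name of a data frame
```

Listing the tables first is a quick way to see which data frames a release contains before selecting one with `$`.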

Moving on, let's examine the "Volcano Eruptions" dataset, with data provided by the Smithsonian Institution. It is the Tidy Tuesday release for week 20 of 2020, which turned up when picking a week number at random. Analyzing it is a chance to learn more about volcanic eruptions and to practice machine learning and predictive modeling.
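Picking a release at random can be sketched in a few lines. The helper `pick_week` is a made-up name, and the upper bound of 52 weeks is an assumption; a given week number may not have a release in every year.

```r
# Hypothetical helper: pick a week number between 1 and max_week.
pick_week <- function(max_week = 52, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  sample.int(max_week, size = 1)
}

week <- pick_week()
cat("Try week", week, "of 2020\n")

# With tidytuesdayR installed and network access:
# tuesdata <- tidytuesdayR::tt_load(2020, week = week)
# names(tuesdata)
# Week 20 of 2020 is the volcano eruptions release:
# volcano <- tidytuesdayR::tt_load(2020, week = 20)
```

Passing a `seed` makes the pick reproducible, which helps when sharing the analysis with others.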

Finally, for those who are bored with the iris dataset, there's the "Penguins" dataset, which was covered in an earlier video on the channel. It is a good example of how data science can be used to explore and understand the natural world.

Tidy Tuesday datasets offer a wealth of opportunities for hands-on learning and exploration in the field of data science. With over 100 different datasets available, there's something for everyone, from music genres to climate change. By applying our skills to these challenging projects, we can hone our abilities and develop a deeper understanding of the world around us.

To get started with Tidy Tuesday datasets, it helps to be comfortable with a tool such as R or Python, but the project is aimed at learners. Those new to data science can lean on the R4DS Online Learning Community, which connects learners with mentors on Slack, and on the book "R for Data Science" by Hadley Wickham and Garrett Grolemund, which can be read online for free.

As we conclude this article, we encourage you to explore the various Tidy Tuesday datasets available on GitHub, either through the direct links or through the `tidytuesdayR` package. By doing so, you'll be able to apply your data science skills in real-world projects and gain a deeper understanding of the tools and techniques used in the field.

Three, two, one. Welcome back to the Data Professor YouTube channel. My name is Chanin Nantasenamat and I'm an associate professor of bioinformatics. At the end of every video on this YouTube channel, you'll notice that I always end with the slogan "the best way to learn data science is to do data science," and so in today's video I'll be focusing on just that: how you can do data science. The important part of doing data science is the data. So if you're starting out learning data science, or if you want to hone your skills in data science, sometimes finding a suitable data set to analyze might be a difficult task, because you might be bored with the typical data science data sets that you're using, like for example the iris data set, the Boston housing data set, or the Titanic data set. I mean, these data sets are good, but let's say that you have already outgrown those data sets, and then you will be wondering what's next. And so in this video I will be showing you how you can have access to an ever-growing resource of data sets. So without further ado, we're starting right now.

The data set resource that I will be talking about today is called Tidy Tuesday. Tidy Tuesday is a weekly data project in R from the R4DS, or R for Data Science, Online Learning Community. So let's have a look at what R4DS stands for. R4DS stands for R for Data Science: Visualize, Model, Transform, Tidy, and Import Data, and it is written by Hadley Wickham and Garrett Grolemund. As you can see here, you could either purchase the book or you could also access the online book, and the entire book is free here, so you could read the entire book for free. I'll be providing the links to both of these in the description of this video. So let's have a look at what the R for Data Science Online Learning Community is. This learning community is a place where learners of data science are connected to mentors, based on
the Slack platform. So let's have a look at the About. The community here was established in August of 2017 by Jesse Mostipak, and it is called the R for Data Science Online Learning Community. Essentially, it was created as a supportive and complementary online space where learners of data science and mentors can be connected, namely on the Slack platform. Here you will see that there are essentially three types of members: learners, mentors, and administrators. The administrators keep this online learning community running, the mentors contribute about one to two hours a week, and the learners are aspiring data scientists.

So how does this work? Every Tuesday a new data set is released. The goal is to give learners of data science access to new data sets where they can play with the data: they can wrangle it, handle missing data, experiment with their newfound knowledge and skills in data science, make creative data visualizations, and analyze the data. So it's something challenging for them, kind of like a sudoku for data science. If you're using R, you can easily import the data using the package available here, or if you're using Python, you can get the direct link where you can download the data set. Or actually you could just come to this GitHub, and I'm going to provide the link in the video description, so you can directly download the data set from the GitHub here.

So let's take a further look. It was designed for you to use ggplot2, tidyr, dplyr, and other tools in the tidyverse in order to play around with the new data sets that are released every week. As mentioned, it is provided as, like, a toy for you to practice data science, so it offers the ideal opportunity to do data science. You could think of this as kind of like exercise: you could try out your machine learning workflow on the
new data set. You could take this as an opportunity to exercise your data science skills, and you could experiment with the machine learning workflow that you have developed in your learning phase and then apply it to the new data set here.

If you would like to share your own data set, you could just hashtag Tidy Tuesday on Twitter, and if they like it then they will feature it here, or you could also submit it directly as an issue to the GitHub. So you use the hashtag #TidyTuesday if you create your own version: probably you download one of the data sets from Tidy Tuesday, then you modify it, make your own version, and you want to add the hashtag #TidyTuesday. But if you want to submit your own data set, then you could submit an issue on the GitHub here, and you can find out more details here.

And so let's see. You can see that there is a lot of data available, starting back in 2018, then 2019 and 2020. So let's have a look at the ones in 2020.
So on January 7, 2020, you see that there is a data set called Australian Fires, so let's have a look at that. It is very nicely documented: you get to see the background of the data set here, where it is from, and the URL is right here for the newest updated data. They're describing how they derived the data set, and if you would like to use the data set in your own data science workflow, in your own projects, then you could just copy the link here. So let's try it, shall we? Let me fire up RStudio, copy the link here, and get the data. I'll be copying both of these lines, maybe one by one. Enter. All right, cool. Rainfall, let's have a look: rainfall. All right, so the data is loaded. Temperature, right here: copy, paste it, enter.

Okay, so here there are many ways for you to get the data. You could read it in directly from the GitHub link here, and for this option you could use R, or you could use Python if you like. So use the programming language that you are familiar with and apply your own data science workflow, your skills in data science, to analyze the new weekly data. Or if you would like, you could choose one of the data sets at random and analyze it, or even do all of them and then share it on your GitHub profile. So think of it as like an exercise: the more you do it, the better you get. The more practice, the more opportunity you get to hone your data science skills.

And here, if you're using R, you could also install the tidytuesdayR package. Let's try it. Recall that above we defined the variables called rainfall and temperature, and here we see that rainfall and temperature are amongst the files that were downloaded. So let me copy the code here; this will define rainfall. Let me try from scratch, clearing it here. Okay, I have to rerun; I could just press the up arrow and it will go back to the previous commands. Downloading it. All right, let me try the rainfall. Okay, let's see. All right, it works; it gives the same data set. Let me try modifying this to temperature and
this to temperature as well. Okay, so it also works. All right, so loading in the data works like a charm.

Here it provides you the description of the data. temperature.csv comprises a total of five variables: the city name, the date, the temperature, the temperature type (minimum or maximum daily actual), and the site or weather station. Rainfall has several variables: station code, city name, year, month, day, rainfall, period, quality, latitude, longitude, and the station name. And then there are other data as well here. Let's have a look at some of the example analyses created by others, so you can play around and see how others are analyzing the data here, and you can follow along. And here is some example code for trying out.

So this is the first data set that we have taken a look at. Let's have a look at another one: song genres, NFL attendance, hotel bookings, beer production. Let's have a look at that. This data comes from the Alcohol and Tobacco Tax and Trade Bureau. It has state-level beer production by year, number of brewers by production size, and monthly beer stats. You can get the data here: if you have already installed tidytuesdayR, for the R users, then you could just type in either this one or this one, where you specify the year and then the week number. Let's give it a try. All right, so tuesdata, and then dollar sign, and then the four data frames here. You just specify the dollar sign and then select the data frame that you want to see.

You can even try your luck, right? Select a random number, enter, and then play around with the data set. If you can't decide which one you want to try, pick a random number. Okay, so number 20 seems to be about volcanoes, predicting eruptions of the volcano, I guess. Number 20, yeah: volcano eruptions, and the data was provided by the Smithsonian. So there are a lot of very interesting data sets for you to play around with. So for those of you who are bored with
the iris data set: a couple of weeks ago I released a video about the penguins data set. And so here there are so many data sets for you to play around with, more than 100. If you're finding value in this video, please give it a thumbs up, subscribe if you haven't yet done so, and hit the notification bell in order to be notified of the next video. And as always, the best way to learn data science is to do data science, and please enjoy the journey. Thank you for watching. Please like, subscribe, and share, and I'll see you in the next one. But in the meantime, please check out these videos.