Importing Data into R: A Step-by-Step Guide
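Flat files are typically simple text files that contain tabular data. Take states.csv, a flat file of comma-separated values that lists basic information on some US states. The first line gives the names of the columns, or fields; after that, each line is a record whose fields are separated by commas, hence the name comma-separated values. For example, there is a record for the state Hawaii, with the capital Honolulu and a total population of 1.42 million. This structure corresponds nicely to an R data frame: the rows correspond to the records, the columns correspond to the fields, and the field names become the column names.
The standard distribution of R provides functionality to import flat files, such as CSV and tab-delimited files, into R data frames. These functions belong to the utils package, which is loaded by default when you start R. The mother of all these data import functions is the `read.table` function.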
The `read.table` function can read any file in table format and create a data frame from it. It accepts a large number of arguments, so I won't go through each and every one of them. Instead, let's look at a `read.table` call that imports states.csv, `read.table("states.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)`, and try to understand what each argument does.
The first argument of the `read.table` function is the path to the file you want to import into R. If the file is in your current working directory, simply passing the file name as a character string works. However, if your file is located somewhere else, things get trickier, because Linux, Windows, and macOS each specify file paths differently.
To build a path to a file in a platform-independent way, you can use the `file.path` function. Suppose our states.csv file is located in the data sets folder of the home directory. You can use `file.path` like this: `file.path("~", "data sets", "states.csv")`, which returns the string `"~/data sets/states.csv"`. The `~` stands for the home directory, so on my Mac this points to a location under `/Users/me`, while on Windows it resolves to a different folder.
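As a minimal sketch of the path-building step, the snippet below joins the three components and then asks R what the home directory resolves to on the current machine. It assumes the home directory is written as `~`; the folder name is simply the one used in the text, and nothing needs to exist on disk for this to run.

```r
# Join path components; file.path() uses "/" as the separator on every platform.
path <- file.path("~", "data sets", "states.csv")
print(path)  # [1] "~/data sets/states.csv"

# path.expand() replaces the leading "~" with the current user's home directory,
# so this line prints something different on each machine.
print(path.expand(path))
```

Because the separator is handled by `file.path`, the same line of code can be shared between colleagues on different operating systems.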
This path can now be used inside `read.table` to point to the correct file. Now for the `header` argument. If you set it to `TRUE`, you tell R that the first row of the text file contains the variable names, which is the case here. `read.table` sets this argument to `FALSE` by default, which would mean that the first row is already an observation.
The next argument, `sep`, specifies how fields in a record are separated. For our CSV file the field separator is a comma, so we use a comma inside quotes. Finally, there's the `stringsAsFactors` argument. When it is `TRUE`, columns that contain strings are imported into R as factors, the data structure used to store categorical variables. It was `TRUE` by default before R 4.0.0 and has been `FALSE` by default since then. In this case, the column containing the state names shouldn't be a factor, so we set `stringsAsFactors` to `FALSE` explicitly.
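The call described above can be sketched as follows. To keep the example self-contained, it first writes a small stand-in for states.csv to a temporary folder; the column names and most values are illustrative placeholders, not the contents of the actual file.

```r
# Illustrative stand-in for states.csv (placeholder column names and values).
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.42,10931"
)
csv_path <- file.path(tempdir(), "states.csv")
writeLines(csv_lines, csv_path)

# header = TRUE: the first row holds the variable names.
# sep = ",": fields are separated by commas.
# stringsAsFactors = FALSE: keep text columns as character vectors.
states <- read.table(csv_path, header = TRUE, sep = ",",
                     stringsAsFactors = FALSE)
str(states)
```

`str()` is a convenient way to confirm the number of observations and variables and to check which type each column was given.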
If we actually run this call now, we indeed get a data frame with five observations and four variables that corresponds nicely to the CSV file we started with. The `read.table` function works fine, but it's pretty tiring to specify all these arguments every time, right? CSV files are a common and standardized type of flat file. That's why the utils package also provides the `read.csv` function.
This function is a wrapper around the `read.table` function. This means that `read.csv` calls `read.table` behind the scenes, but with different default arguments to match the CSV format. More specifically, the default for `header` is `TRUE` and the default for `sep` is a comma, so you don't have to specify these manually anymore. The `read.table` call from before therefore gives the same result as the shorter call `read.csv("states.csv", stringsAsFactors = FALSE)`. Apart from CSV files, there are also other types of flat files.
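The equivalence between the two CSV calls can be checked with a small sketch like the one below. The file contents are illustrative placeholders written to a temporary folder. Note that `read.csv` also changes a few other defaults (such as `quote`, `fill`, and `comment.char`), so the two calls agree for a simple file like this one but are not guaranteed to agree for every file.

```r
# Illustrative stand-in for states.csv (placeholder column names and values).
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.42,10931"
)
csv_path <- file.path(tempdir(), "states.csv")
writeLines(csv_lines, csv_path)

# Explicit call: header and sep are spelled out.
with_table <- read.table(csv_path, header = TRUE, sep = ",",
                         stringsAsFactors = FALSE)

# Wrapper call: header = TRUE and sep = "," are already the defaults.
with_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare the two data frames.
print(all.equal(with_table, with_csv))
```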
Take the tab-delimited file states.txt, which contains the same data. To import it with `read.table`, you again have to specify a bunch of arguments. This time you should point to the .txt file instead of the CSV file, and the `sep` argument should be set to a tab, written as `"\t"`. You can also use the `read.delim` function, which again is a wrapper around `read.table`; its defaults for `header` and `sep`, among some others, are adapted to tab-delimited files. The result of both calls is again a nice translation of the flat file into an R data frame.
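A minimal sketch of the tab-delimited case, assuming a file named states.txt with the same illustrative placeholder data as before, written to a temporary folder:

```r
# Illustrative placeholder data; commas are swapped for tabs to mimic states.txt.
csv_lines <- c(
  "state,capital,pop_mill,area_sqm",
  "South Dakota,Pierre,0.853,77116",
  "New York,Albany,19.746,54555",
  "Oregon,Salem,3.970,98381",
  "Vermont,Montpelier,0.627,9616",
  "Hawaii,Honolulu,1.42,10931"
)
txt_path <- file.path(tempdir(), "states.txt")
writeLines(gsub(",", "\t", csv_lines, fixed = TRUE), txt_path)

# Explicit call: the separator is a tab character.
with_table <- read.table(txt_path, header = TRUE, sep = "\t",
                         stringsAsFactors = FALSE)

# Wrapper call: header = TRUE and sep = "\t" are the defaults of read.delim().
with_delim <- read.delim(txt_path, stringsAsFactors = FALSE)

print(all.equal(with_table, with_delim))
```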
Now there's one last thing I want to discuss. Have a look at the US file states.csv and its European counterpart, states.eu.csv. You'll notice that the European file uses a comma as the decimal mark where the US file uses a dot. This means the comma can no longer serve as the field delimiter, so a semicolon is used instead. To deal with this easily, R provides the `read.csv2` function. Compared with `read.csv`, two defaults differ: `sep` is a semicolon, and `dec`, the argument that tells R which character is used for decimal points, is a comma.
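As a sketch of what these defaults amount to, the snippet below reads a European-style file once with `read.csv2` and once with an explicit `read.table` call. The file contents are illustrative placeholders, and the file name is just the one used in the text.

```r
# Illustrative European-style data: semicolons between fields, decimal commas.
eu_lines <- c(
  "state;capital;pop_mill;area_sqm",
  "South Dakota;Pierre;0,853;77116",
  "New York;Albany;19,746;54555",
  "Oregon;Salem;3,970;98381",
  "Vermont;Montpelier;0,627;9616",
  "Hawaii;Honolulu;1,42;10931"
)
eu_path <- file.path(tempdir(), "states.eu.csv")
writeLines(eu_lines, eu_path)

# Wrapper call: sep = ";" and dec = "," are the defaults of read.csv2().
eu <- read.csv2(eu_path, stringsAsFactors = FALSE)

# The same import with every relevant argument spelled out.
eu_explicit <- read.table(eu_path, header = TRUE, sep = ";", dec = ",",
                          stringsAsFactors = FALSE)

str(eu)
print(all.equal(eu, eu_explicit))
```

The `pop_mill` column comes back as a numeric vector, which shows that the decimal commas were interpreted as decimal points rather than as text.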
Likewise, `read.delim` has a `read.delim2` alternative. Can you spot the difference? This time only the `dec` argument had to change. Let's try to import the states.eu.csv file with the basic `read.csv` function. R gives a result, but it clearly is not the result we want: it's a data set with five observations but a single variable. If you try again with `read.csv2`, it works perfectly fine. The possibilities for importing flat files with the utils functions are practically endless, and you'll typically have to experiment a bit to get it right. Give it a try in the first set of exercises.
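The mismatch described above can be reproduced with a small sketch. It assumes a European-style file with illustrative placeholder values, reads it with both CSV functions, and then reads a tab-delimited variant with `read.delim2`.

```r
# Illustrative European-style data: semicolons between fields, decimal commas.
eu_lines <- c(
  "state;capital;pop_mill;area_sqm",
  "South Dakota;Pierre;0,853;77116",
  "New York;Albany;19,746;54555",
  "Oregon;Salem;3,970;98381",
  "Vermont;Montpelier;0,627;9616",
  "Hawaii;Honolulu;1,42;10931"
)
eu_path <- file.path(tempdir(), "states.eu.csv")
writeLines(eu_lines, eu_path)

# read.csv() splits on commas, so the semicolon-separated fields are not
# recognized as separate variables.
wrong <- read.csv(eu_path)
print(dim(wrong))

# read.csv2() splits on semicolons and treats "," as the decimal mark.
right <- read.csv2(eu_path, stringsAsFactors = FALSE)
print(dim(right))

# Tab-delimited variant with decimal commas, read with read.delim2().
eu_tab_path <- file.path(tempdir(), "states.eu.txt")
writeLines(gsub(";", "\t", eu_lines, fixed = TRUE), eu_tab_path)
tabbed <- read.delim2(eu_tab_path, stringsAsFactors = FALSE)
str(tabbed)
```

Comparing `dim(wrong)` with `dim(right)` is a quick sanity check after any import: an unexpected column count usually means the separator or decimal setting does not match the file.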
"WEBVTTKind: captionsLanguage: enhi and welcome to the first video of this importing data into our course imagine this situation a poor colleague of yours is still doing his or her analysis in cess and you want to continue working on the sess data but inside R you need an easy way to convert the sess data into an R data frame but you can't seem to find the tools to do so well getting to know these tools is exactly what this course four more specifically we will focus on five types of data data from Flat files from Excel files from other statistical software from databases and finally data imported from the web each chapter will focus on each one of these data formats and you'll learn to convert all of them into an art data frame in this first chapter we'll start with flat files they're typically Simple Text files that contain table data have a look at states. CSV a flat file containing separated values the data lists basic information on some US states the first line here gives the names of the different columns or Fields after that each line is a record and the fields are separated by a comma hence the name comma separated values for example there's a state Hawaii with a capital Honolulu and a total population of 1.42 million what would that data look like an R well actually the structure nicely corresponds to an R data frame that ideally looks like this the rows in the data frame correspond to the records and The Columns of the data frame correspond to the fields the field names are used to name the data frame columns but how to go from the CSV file to this data frame we're in luck because the standard distribution of R provides functionality to import these flat files into r as a data frame these functions belong to the utils package that is loaded by default when you start R the mother of all these data Port functions is the read.table function it can read in any file in table format and create a data frame from it the number of arguments you can specify for 
this function is huge so I won't go through each and every one of these instead let's have a look at the read. table call that import states. CSV and try to understand what happens the first argument of the read.table function is the path to the file you want to import into R if the file is in your current working directory simply passing the file name as a character string works if your file is located somewhere else things get tricky depending on the platform you're working on Linux Microsoft Mac whatever file paths are specified differently to build a path to a file in a platform independent way you can use the file. path function suppose our states. CSV file is located in the data sets folder of the home directory you can use file. path like this because I'm working on a Mech this is resulting path but for Windows resulting character string will be different this path can now be used inside read. table to point to the correct file like this now for the header arguments if you set this to true you tell r that the first row of the text file contains the variable names which is the case here read. 
table sets this argument false by default which would mean that the first row is already an observation next SE is the argument that specifies how fields in record are separated for our CSV file here the field separator is a comma so we use a comma inside quotes finally the strings as factors argument is pretty important it's true by default which means that columns or variables that are strings are imported into r as factors the data structure to store categorical variables in this case the column containing the country names shouldn't be a factor so we set strings as factors to false if we actually run this call now we indeed get a data frame with five observations and four variables that corresponds nicely to the CSV file we started with the read.table function works fine but it's pretty tiring to specify all these arguments every time right CSV files are a common and standardized type of flat files that's why the utils package also provides the read. CSV function this function is a repper around the read. table function this means that read. CSV calls read.table behind the scenes but with different default arguments to match with the CSV format more specifically the default for header is true and for SE is a comma so you don't have to manually specify these anymore this means that this read.table call from before is thus exactly the same as this read. CSV call apart from CSV files there are also other types of flood files take this tab delimited file states. dxt with the same data to import it with read. you again have to specify a bunch of arguments this time you should point to the txt file instead of the CSV file and the SE arguments should be set to a tab so back SLT you can also use the read. 
the Lim function which again is a repper around read.table the default Arguments for header and SE are adapted among some others result of both calls is again a nice translation of the flat file to an art data frame now there's one last thing I want to discuss here here have a look at this us CSV file and its European counterpart States eu. CSV you'll notice that the Europeans use commas for decimal points while normally one uses a DOT this means that they can't use the comma as the field limiter anymore they need a semicolon to deal with this easily R provides the read. csv2 function both the SE argument as the Des argument to tell which character is used for decimal points are different likewise for read. theim you have a read. theim 2 alternative Can you spot the difference again this time only the Des argument had to change let's try to import the states eu. CSV file with the basic read. CSV function AR gives result but it clearly is not result we want it's a data set with five observations but a single variable if you try again with read. 
CSV 2 it works perfectly fine this time of course the possibilities of importing flat file with the util functions are practically endless he'll typically have to experiment a bit to get it right give it a try in the first set of exerciseshi and welcome to the first video of this importing data into our course imagine this situation a poor colleague of yours is still doing his or her analysis in cess and you want to continue working on the sess data but inside R you need an easy way to convert the sess data into an R data frame but you can't seem to find the tools to do so well getting to know these tools is exactly what this course four more specifically we will focus on five types of data data from Flat files from Excel files from other statistical software from databases and finally data imported from the web each chapter will focus on each one of these data formats and you'll learn to convert all of them into an art data frame in this first chapter we'll start with flat files they're typically Simple Text files that contain table data have a look at states. 
CSV a flat file containing separated values the data lists basic information on some US states the first line here gives the names of the different columns or Fields after that each line is a record and the fields are separated by a comma hence the name comma separated values for example there's a state Hawaii with a capital Honolulu and a total population of 1.42 million what would that data look like an R well actually the structure nicely corresponds to an R data frame that ideally looks like this the rows in the data frame correspond to the records and The Columns of the data frame correspond to the fields the field names are used to name the data frame columns but how to go from the CSV file to this data frame we're in luck because the standard distribution of R provides functionality to import these flat files into r as a data frame these functions belong to the utils package that is loaded by default when you start R the mother of all these data Port functions is the read.table function it can read in any file in table format and create a data frame from it the number of arguments you can specify for this function is huge so I won't go through each and every one of these instead let's have a look at the read. table call that import states. CSV and try to understand what happens the first argument of the read.table function is the path to the file you want to import into R if the file is in your current working directory simply passing the file name as a character string works if your file is located somewhere else things get tricky depending on the platform you're working on Linux Microsoft Mac whatever file paths are specified differently to build a path to a file in a platform independent way you can use the file. path function suppose our states. CSV file is located in the data sets folder of the home directory you can use file. 
path like this because I'm working on a Mech this is resulting path but for Windows resulting character string will be different this path can now be used inside read. table to point to the correct file like this now for the header arguments if you set this to true you tell r that the first row of the text file contains the variable names which is the case here read. table sets this argument false by default which would mean that the first row is already an observation next SE is the argument that specifies how fields in record are separated for our CSV file here the field separator is a comma so we use a comma inside quotes finally the strings as factors argument is pretty important it's true by default which means that columns or variables that are strings are imported into r as factors the data structure to store categorical variables in this case the column containing the country names shouldn't be a factor so we set strings as factors to false if we actually run this call now we indeed get a data frame with five observations and four variables that corresponds nicely to the CSV file we started with the read.table function works fine but it's pretty tiring to specify all these arguments every time right CSV files are a common and standardized type of flat files that's why the utils package also provides the read. CSV function this function is a repper around the read. table function this means that read. CSV calls read.table behind the scenes but with different default arguments to match with the CSV format more specifically the default for header is true and for SE is a comma so you don't have to manually specify these anymore this means that this read.table call from before is thus exactly the same as this read. CSV call apart from CSV files there are also other types of flood files take this tab delimited file states. dxt with the same data to import it with read. 
you again have to specify a bunch of arguments this time you should point to the txt file instead of the CSV file and the SE arguments should be set to a tab so back SLT you can also use the read. the Lim function which again is a repper around read.table the default Arguments for header and SE are adapted among some others result of both calls is again a nice translation of the flat file to an art data frame now there's one last thing I want to discuss here here have a look at this us CSV file and its European counterpart States eu. CSV you'll notice that the Europeans use commas for decimal points while normally one uses a DOT this means that they can't use the comma as the field limiter anymore they need a semicolon to deal with this easily R provides the read. csv2 function both the SE argument as the Des argument to tell which character is used for decimal points are different likewise for read. theim you have a read. theim 2 alternative Can you spot the difference again this time only the Des argument had to change let's try to import the states eu. CSV file with the basic read. CSV function AR gives result but it clearly is not result we want it's a data set with five observations but a single variable if you try again with read. CSV 2 it works perfectly fine this time of course the possibilities of importing flat file with the util functions are practically endless he'll typically have to experiment a bit to get it right give it a try in the first set of exercises\n"