The Need for 2-Dimensional Labelled Data Structures in Data Science
When it comes to data analysis and modeling, NumPy arrays are incredibly powerful and serve a number of essential purposes. However, they cannot fulfill one of the most basic needs of a data scientist: a 2-dimensional labelled data structure with columns of potentially different types, on which you can easily perform a plethora of data science tasks. Such a structure can be manipulated, sliced, reshaped, grouped, joined, and merged; it supports statistics in a missing-value-friendly manner; and it handles time series. The need for such a data structure, among other issues, prompted Wes McKinney to develop the pandas library for Python.
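To make the list of operations above concrete, here is a minimal sketch using a pandas DataFrame. The table contents and column names are invented for illustration; only the pandas calls themselves are real.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: columns of different types, one missing value.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Oslo", "Lima", "Lima"],
        "year": [2020, 2021, 2020, 2021],
        "temp": [5.0, np.nan, 19.0, 21.0],
    }
)

# Slice by label: rows where city is Lima, only the temp column.
lima_temps = df.loc[df["city"] == "Lima", "temp"]

# Group and aggregate; mean() skips the missing value by default.
mean_by_city = df.groupby("city")["temp"].mean()

# Merge with a second (also hypothetical) table on the shared 'city' column.
countries = pd.DataFrame({"city": ["Oslo", "Lima"], "country": ["Norway", "Peru"]})
merged = df.merge(countries, on="city")

print(mean_by_city)
print(merged)
```

Each column keeps its own type (strings, integers, floats), which is the property a single homogeneous array cannot offer.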
The Answer Lies in the Pandas Library
Python has long been great for data munging and preparation, but less so for data analysis and modeling. The pandas library helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R. The data structure most relevant to the data manipulation and analysis workflow that pandas offers is the DataFrame, the Pythonic analog of R's data frame.
As Hadley Wickham tweeted, "A matrix has rows and columns, a data frame has observations and variables." Manipulating DataFrames in pandas can be useful in all steps of the data scientific method, from exploratory data analysis to data wrangling, pre-processing, building models, and visualization. pandas is also of great utility when importing flat files, even merely in the way it deals with missing data, comments, and many other issues that plague working data scientists. For all of these reasons, it is now standard and best practice in data science to use pandas to import flat files as DataFrames.
Using Pandas to Import Flat Files
Now that we have discussed the benefits of using pandas for data analysis and modeling, it's time to talk about how to use pandas to import flat files. To use pandas, you first need to import it. Then, in the most basic case of importing a CSV, all you need to do is call the function `read_csv()` and supply it with a single argument: the name of the file. Having assigned the resulting DataFrame to the variable `data`, you can check its first five rows, including the header, with the command `data.head()`. You can also easily convert the DataFrame to a NumPy array through the DataFrame attribute `values`.
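The basic import can be sketched as follows. The CSV contents and column names are made up for illustration, and an in-memory `io.StringIO` buffer stands in for a file on disk so the snippet runs on its own; with a real file you would pass its name to `read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice this would live in a file on disk.
csv_text = """name,age,height
Ada,36,1.65
Grace,45,1.70
Linus,28,1.80
"""

# read_csv accepts a file name or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# First five rows (only three exist here), shown with the column header.
print(data.head())

# The underlying values as a NumPy array.
data_array = data.values
print(type(data_array), data_array.shape)
```

Because the columns here have different types, the resulting array falls back to a generic object dtype. The pandas documentation also offers `data.to_numpy()` as the recommended alternative to `values`.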
Now it's your turn to play around importing flat files using Python. You'll get experience importing a flat file that is straightforward and you'll also get experience importing a flat file that has a few issues, such as one that contains comments and strings that should be interpreted as missing values. Have fun importing!
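As a starting point for the second kind of file, here is a hedged sketch of importing a tab-delimited file that contains a comment line and a placeholder string for missing values. The file contents, the `#` comment marker, and the placeholder `Nothing` are assumptions chosen for illustration; `sep`, `comment`, and `na_values` are existing `read_csv()` parameters.

```python
import io

import pandas as pd

# Hypothetical tab-delimited contents with a comment line and the
# string 'Nothing' standing in for missing values.
messy_text = """# survey export, hypothetical contents
id\tscore\tcity
1\t3.5\tOslo
2\tNothing\tLima
3\t4.0\tNothing
"""

messy = pd.read_csv(
    io.StringIO(messy_text),
    sep="\t",               # fields are separated by tabs
    comment="#",            # ignore everything after a '#'
    na_values=["Nothing"],  # treat this string as a missing value
)

print(messy)
print(messy.isna().sum())
```

With the placeholder recognized as missing, the `score` column can be read as numeric rather than as text.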
"WEBVTTKind: captionsLanguage: enCongrats you're now able to import a bunch of different types of flat files into Python as numpy arrays although arrays are incredibly powerful and serve a number of essential purposes they cannot fulfill one of the most basic needs of a data scientist to have 2-dimensional labelled data structures with columns of potentially different types that you can easily perform a plethora of data science e type things on manipulate slice reshape group by join merge perform statistics in a missing value friendly manner deal with time series the need for such a data structure among other issues prompted where's mckinney to develop the pandas library for python nothing speaks of the project de pandas more than the documentation itself python has long been great for data munging and preparation but less so for data analysis and modeling pandas helps fill this gap enabling you to carry out your entire data analysis workflow in python without having to switch to a more domain-specific language like r the data structure most relevant to the data manipulation and analysis workflow that pandas offers is the data frame and it is the pythonic analog of ours data frame as Hadley Wickham tweeted a matrix has rows and columns a data frame has observations and variables manipulating data frames in pandas can be useful in all steps of the data scientific method from exploratory data analysis to data wrangling pre-processing building models and visualization here we will see as great utility in importing flat files even merely in the way that it deals with missing data comments along with many other issues that plague working data scientists today for all of these reasons it is now standard and best practice in data science to use pandas to import flat files of as data frames later in this course we'll see how many other types of data whether they're stored in oral databases hdf5 MATLAB or Excel files can easily be imported as data frames to use pandas you 
first need to import it then if we wish to import a CSV in the most basic case all we need to do is to call the function read CSV and supply it with a single argument the name of the file having assigned the data frame to the variable data we can check the first five rows of the data frame including the header with the command data dot head we can also easily convert the data frame to a numpy array by calling the data frame attribute values now it's your turn to play around importing flat files using Python you'll get experience importing a flat file that is straightforward and you'll also get experience importing a flat file that has a few issues such as one that contains comments and strings that should be interpreted as missing values have fun importingCongrats you're now able to import a bunch of different types of flat files into Python as numpy arrays although arrays are incredibly powerful and serve a number of essential purposes they cannot fulfill one of the most basic needs of a data scientist to have 2-dimensional labelled data structures with columns of potentially different types that you can easily perform a plethora of data science e type things on manipulate slice reshape group by join merge perform statistics in a missing value friendly manner deal with time series the need for such a data structure among other issues prompted where's mckinney to develop the pandas library for python nothing speaks of the project de pandas more than the documentation itself python has long been great for data munging and preparation but less so for data analysis and modeling pandas helps fill this gap enabling you to carry out your entire data analysis workflow in python without having to switch to a more domain-specific language like r the data structure most relevant to the data manipulation and analysis workflow that pandas offers is the data frame and it is the pythonic analog of ours data frame as Hadley Wickham tweeted a matrix has rows and columns a data 
frame has observations and variables manipulating data frames in pandas can be useful in all steps of the data scientific method from exploratory data analysis to data wrangling pre-processing building models and visualization here we will see as great utility in importing flat files even merely in the way that it deals with missing data comments along with many other issues that plague working data scientists today for all of these reasons it is now standard and best practice in data science to use pandas to import flat files of as data frames later in this course we'll see how many other types of data whether they're stored in oral databases hdf5 MATLAB or Excel files can easily be imported as data frames to use pandas you first need to import it then if we wish to import a CSV in the most basic case all we need to do is to call the function read CSV and supply it with a single argument the name of the file having assigned the data frame to the variable data we can check the first five rows of the data frame including the header with the command data dot head we can also easily convert the data frame to a numpy array by calling the data frame attribute values now it's your turn to play around importing flat files using Python you'll get experience importing a flat file that is straightforward and you'll also get experience importing a flat file that has a few issues such as one that contains comments and strings that should be interpreted as missing values have fun importing\n"