Python Tutorial - Review of pandas DataFrames

Learning to Get Data In and Look at It with Pandas: A Comprehensive Guide

pandas is a library for data analysis, and its central tool is the DataFrame: a tabular data structure with labeled rows and columns. The DataFrame is the power tool of pandas, and most work with the library starts by loading data into one and inspecting it.

To begin working with DataFrames, it helps to understand the index, the special data structure that labels the rows. Indexes in pandas are tailored lists of labels that permit fast lookup and some powerful relational operations. As an example, we use a DataFrame named apple that holds Apple stock data; its index labels are dates in reverse chronological order. Labeled rows and columns improve the clarity and intuition of many data analysis tasks. When we ask for the type of the object apple, it is a DataFrame; when we ask for its shape, it has 8514 rows and six columns.
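A minimal sketch of these first checks, assuming a small synthetic stand-in for the stock data. The real apple DataFrame and its 8514 rows are not reproduced here; the dates and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Apple stock data: 10 business days, newest first.
dates = pd.date_range("2014-01-01", periods=10, freq="B")[::-1]
rng = np.random.default_rng(0)
apple = pd.DataFrame(
    rng.uniform(90, 110, size=(10, 6)),
    index=dates,
    columns=["open", "high", "low", "close", "volume", "adjusted close"],
)

print(type(apple))   # the DataFrame class
print(apple.shape)   # (number of rows, number of columns)
```

With the real data, the same two calls would report a DataFrame with 8514 rows and six columns.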

The DataFrame's columns attribute gives the names of its columns, which in this case are open, high, low, close, volume, and adjusted close. Notably, the apple.columns attribute is itself a pandas Index, while the apple.index attribute in this case is a special kind of index called a DatetimeIndex. Datetime indexes and time series are a topic for later study.
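The relationship between the two attributes can be checked directly. The sketch below assumes a tiny made-up frame with a datetime index; only the column names echo the example above.

```python
import pandas as pd

# Hypothetical three-row frame, dates newest first.
dates = pd.to_datetime(["2014-01-03", "2014-01-02", "2014-01-01"])
apple = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]},
    index=dates,
)

print(apple.columns)                        # column labels, stored in an Index
print(isinstance(apple.columns, pd.Index))  # True
print(type(apple.index))                    # DatetimeIndex, a kind of Index
```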

DataFrames can be sliced like NumPy arrays or Python lists, using colons to specify the start, end, and stride of a slice. First, we can slice from the start of the DataFrame up to, but not including, position 5 (that is, the first five rows) using the .iloc accessor to express the slice positionally. Second, we can slice from the fifth-from-last row to the end of the DataFrame using a negative index. It is also possible to slice by label with the .loc accessor.
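The slices described above might look like the following sketch. The frame is a made-up ten-row example, and the label slice uses Timestamp objects that exist in the index, which is one straightforward way to slice a reverse-chronological index by label.

```python
import pandas as pd

# Hypothetical ten-day frame, dates newest first (2014-01-10 down to 2014-01-01).
dates = pd.date_range("2014-01-01", periods=10, freq="D")[::-1]
apple = pd.DataFrame({"close": range(10), "volume": range(100, 110)}, index=dates)

first_five = apple.iloc[:5, :]     # positions 0..4; position 5 is excluded
last_five = apple.iloc[-5:, :]     # fifth-from-last row to the end
every_other = apple.iloc[::2, :]   # a stride of 2

# Label-based slicing with .loc includes both endpoints.
by_label = apple.loc[pd.Timestamp("2014-01-08"):pd.Timestamp("2014-01-06"), "close"]
print(by_label)
```

Note the difference in endpoint handling: positional slices with .iloc exclude the stop position, while label slices with .loc include the stop label.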

Another way to see just the top rows of a DataFrame is the head method. Calling head(5) returns the first five rows, and head(2) returns just the first two. The head method is particularly useful here because our DataFrame has over eight thousand rows. The opposite of head is tail: called without an argument, tail returns the last five rows by default, and tail(3) returns the last three. Like head, tail gives a quick look at a large DataFrame.
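As a quick illustration, here is a sketch with a made-up twenty-row frame (the column name and values are placeholders):

```python
import pandas as pd

# Hypothetical frame with 20 rows.
apple = pd.DataFrame({"close": range(20)})

top_default = apple.head()      # first 5 rows by default
top_two = apple.head(2)         # first 2 rows
bottom_default = apple.tail()   # last 5 rows by default
bottom_three = apple.tail(3)    # last 3 rows

print(top_two)
print(bottom_three)
```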

Another useful summary method in pandas is info, which reports the kind of index, the column labels, the number of rows and columns, and the data type of each column. pandas DataFrame slices also support broadcasting, which we'll learn more about later.
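A small sketch of info on a made-up frame. By default info prints its report; here the optional buf argument is used to capture the text in a string so it can be inspected.

```python
import io

import pandas as pd

# Hypothetical three-row frame with a datetime index.
dates = pd.to_datetime(["2014-01-03", "2014-01-02", "2014-01-01"])
apple = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]},
    index=dates,
)

buffer = io.StringIO()
apple.info(buf=buffer)       # write the report into the buffer instead of stdout
summary = buffer.getvalue()
print(summary)               # index kind, column labels, non-null counts, dtypes
```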

As an example of broadcasting, a slice can be assigned a scalar value, in this case NaN ("not a number"). The slice consists of every third row, starting from row zero, in the last column. Calling head(6) shows the change, and calling info shows that the last column now has fewer non-null entries than the others because NaN was assigned to every third element.
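The assignment can be sketched as follows, assuming a made-up nine-row frame whose last column plays the role of the adjusted close column:

```python
import numpy as np
import pandas as pd

# Hypothetical nine-row frame; both columns hold floats so NaN fits without a dtype change.
apple = pd.DataFrame(
    {
        "close": np.arange(9, dtype=float),
        "adjusted close": np.arange(9, dtype=float),
    }
)

# Broadcast one scalar across every third row (0, 3, 6) of the last column.
apple.iloc[::3, -1] = np.nan

print(apple.head(6))
print(apple.count())   # non-null entries per column
```

The count method is used here as a compact substitute for reading the non-null counts out of the info report.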

The columns of a DataFrame are themselves a specialized data structure called a Series. Extracting a single column from a DataFrame returns a Series, which has its own head method and inherits its name attribute from the DataFrame column. To extract the numerical entries from the Series, we use the values attribute. The data in this Series form a NumPy array, which is what the values attribute yields.
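A brief sketch of these Series properties, using a made-up two-column frame (the variable name low is a placeholder):

```python
import numpy as np
import pandas as pd

# Hypothetical frame; "low" and "high" echo two of the example's column names.
apple = pd.DataFrame({"low": [1.0, 2.0, 3.0], "high": [2.0, 3.0, 4.0]})

low = apple["low"]          # extracting one column returns a Series
print(type(low))
print(low.name)             # the name comes from the column label
print(low.head(2))          # a Series has its own head method

lows = low.values           # the underlying data as a NumPy array
print(type(lows))
```

The to_numpy method is an alternative to values that returns the same kind of array for numeric data like this.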

A pandas Series, then, is a one-dimensional labeled NumPy array, and a DataFrame is a two-dimensional labeled array whose columns are Series. We've seen a few concepts extending what we already knew, including head, tail, info, index, values, and Series. Take some time to practice these concepts in the exercises; they are essential for working with DataFrames effectively.

"WEBVTTKind: captionsLanguage: enlet's learn how to get data in and look at it we'll need to remember a few things about pandas first pandas is a library for data analysis the power tool of pandas is the data frame a tabular data structure with labeled rows and columns as an example we'll use a data frame with apple stock data the rows are labeled by a special data structure called an index we'll learn more about indexes later indexes in pandas are tabled lists of labels that permit fast lookup and some powerful relational operations the index labels in the apple data frame are dates in reverse chronological order labeled rows and columns improve the clarity and intuition of many data analysis tasks when we ask for the type of the object apple it's a data frame when we ask for its shape it has 8514 rows and six columns the data frame columns attribute gives the names of its columns open high low close volume and adjusted close notice the apple.columns attribute is also a pandas index actually the apple.index attribute in this case is a special kind of index called a datetime index we'll study datetime indexes and time series later data frames can be sliced like numpy arrays or python lists using colons to specify the start end and the stride of a splice first we can slice from the start of the data frame to the fifth row non-inclusive using the dot ilok accessor to express the slice positionally second we can slice from the fifth last row to the end of the data frame using a negative index remember it's also possible to slice using labels with the dot lock accessor there's another way to see just the top rows of the data frame the head method specifying head five returns the first five rows specifying head two returns just the first two rows the head method is particularly useful because our data frame here has over eight thousand rows the opposite of head is tail specifying tail without an argument returns the last five rows by default specifying tail three 
returns the last three rows again tail gives a useful summary of large data frames another useful summary method is info info returns other useful summary information including the kind of index the column labels the number of rows and columns and the data type of each column panda's data frame slices also support broadcasting we'll learn more about this later here a slice is assigned a scalar value in this case nan or not a number the slice consists of every third row starting from zero in the last column we can see head six to see the changes we can also call info and notice the last column has fewer non-null entries than the others due to our assigning nan to every third element the columns of a data frame themselves are a specialized data structure called a series extracting a single column from a data frame returns a series notice the series extracted has its own head method and inherits its name attribute from the data frame column to extract the numerical entries from the series use the values attribute the data in this series actually form a numpy array which is what the values attribute actually yields a panda series then is a one-dimensional labeled numpy array and a data frame is a two-dimensional labeled array whose columns are series we've seen a few concepts extending what we already knew including head tail info index values and series take some time to practice using these concepts in the exerciseslet's learn how to get data in and look at it we'll need to remember a few things about pandas first pandas is a library for data analysis the power tool of pandas is the data frame a tabular data structure with labeled rows and columns as an example we'll use a data frame with apple stock data the rows are labeled by a special data structure called an index we'll learn more about indexes later indexes in pandas are tabled lists of labels that permit fast lookup and some powerful relational operations the index labels in the apple data frame are dates in 
reverse chronological order labeled rows and columns improve the clarity and intuition of many data analysis tasks when we ask for the type of the object apple it's a data frame when we ask for its shape it has 8514 rows and six columns the data frame columns attribute gives the names of its columns open high low close volume and adjusted close notice the apple.columns attribute is also a pandas index actually the apple.index attribute in this case is a special kind of index called a datetime index we'll study datetime indexes and time series later data frames can be sliced like numpy arrays or python lists using colons to specify the start end and the stride of a splice first we can slice from the start of the data frame to the fifth row non-inclusive using the dot ilok accessor to express the slice positionally second we can slice from the fifth last row to the end of the data frame using a negative index remember it's also possible to slice using labels with the dot lock accessor there's another way to see just the top rows of the data frame the head method specifying head five returns the first five rows specifying head two returns just the first two rows the head method is particularly useful because our data frame here has over eight thousand rows the opposite of head is tail specifying tail without an argument returns the last five rows by default specifying tail three returns the last three rows again tail gives a useful summary of large data frames another useful summary method is info info returns other useful summary information including the kind of index the column labels the number of rows and columns and the data type of each column panda's data frame slices also support broadcasting we'll learn more about this later here a slice is assigned a scalar value in this case nan or not a number the slice consists of every third row starting from zero in the last column we can see head six to see the changes we can also call info and notice the last column 
has fewer non-null entries than the others due to our assigning nan to every third element the columns of a data frame themselves are a specialized data structure called a series extracting a single column from a data frame returns a series notice the series extracted has its own head method and inherits its name attribute from the data frame column to extract the numerical entries from the series use the values attribute the data in this series actually form a numpy array which is what the values attribute actually yields a panda series then is a one-dimensional labeled numpy array and a data frame is a two-dimensional labeled array whose columns are series we've seen a few concepts extending what we already knew including head tail info index values and series take some time to practice using these concepts in the exercises\n"