Represent a Data Set -Dimensionality reduction and visualization Lecture 3 @ Applied AI Course

The Representation of a Data Set: Understanding and Visualizing the Iris Dataset

A data set is typically written with a capital D, and there are multiple ways to represent one. One of the most commonly used representations involves writing down what it is. For example, when referring to the iris dataset, we have "setosa vsicolor virginica" as our data set C containing n data points.

In this representation, each data point is denoted by X and has a corresponding Y. The notation means that for every X in Rd, where D represents the dimension of the vector (in this case, 4), there exists a real-valued vector with four fields representing the sepal length, sepal width, petal length, and petal width, respectively. Each of these vectors has a corresponding class label Y, which can be either setosa, versicolor, or virginica.

This representation is useful because it provides a clear understanding of what each data point represents. In the case of the iris dataset, we have four features that describe each flower: sepal length and width, as well as petal length and width. By writing down this notation, we can quickly identify the type of classification task at hand and understand how each feature contributes to it.

One important thing to note is that not all values in a data set are real-valued. In some machine learning tasks, you may encounter non-real valued data or even categorical data. However, in most cases, when representing a data set with a capital D, we write everything as a real value vector.

In the context of the iris dataset, this means that each data point is represented by a four-dimensional column vector consisting of sepal length, sepal width, petal length, and petal width. This representation allows us to visualize and understand the relationships between the different features and how they contribute to the classification task at hand.

Another important aspect of representing a data set with a capital D is understanding how it relates to class labels. In this case, each data point belongs to one of three classes: setosa, versicolor, or virginica. This notation provides a clear way to identify the type of classification task and understand how each feature contributes to it.

In conclusion, representing a data set with a capital D is an essential skill for anyone working in machine learning or data science. By understanding this notation and its implications, we can clearly communicate our ideas and develop effective algorithms for classifying data points.

"WEBVTTKind: captionsLanguage: enso now let's see how to represent a data set a data set typically is written with a capital D this is one way of representing a data set that multiple ways we learn two of the most used ones data set is basically a collection of data points and it's corresponding so I let me explain what each of these first let me write down what it is right for example I just take a ride is data set setosa versicolor virginica let me explain what this means this means that our data set C contains n data points right so because I am writing I equals to 1 to n I have x and y I so for every X I have a corresponding Y I and I have n such I have n such data points right this n represents the number of data points in my data set now here the moment I say X I belong to Rd what do I know that X I is a D dimensional vector of real values so in the case of iris data set in the case of iris data set we had four features right so X I belongs to R for right which means for every data point for every flower for which you have taken observations we have for for real menus right sorry we have for real values we have the corresponding sepal length sepal width petal length and petal as i mentioned do if you're not informed what it is this basically means a column vector by default right and for every column vector with four fields you have a corresponding Y and this Y I could be either set dosa versicolor or virginica right so if somebody writes this you can quickly and of course there are other types of why is why I could be 0 or 1 why I could itself belong to real value will see different machine learning tasks for which there will be different voice but typically and not all the values may be real valued right here we're always writing everything as a real value vector right is data set everything is a real value because your sepal length sepal width petal length petal width are all real values we will see more variations of data this is this hobby represented so whenever I write like this what it means just to quickly reiterate is that my data set D is a collection of data points X I and class labels Y so x i's are my data points and y eyes are my class labels okay and i have n such data points and class labels right so where my each data point belongs to rd in the case of is its r4 which means it's a four dimensional vector again column vector which consists of sepal length sepal width petal and then petal bit and while i now belongs to one of the three types of flowers we have a third setosa what c color or virginica this is how this is one way of representing a data set there are other ways we'll see one other very useful way of representing a data set but this is one way of representing itso now let's see how to represent a data set a data set typically is written with a capital D this is one way of representing a data set that multiple ways we learn two of the most used ones data set is basically a collection of data points and it's corresponding so I let me explain what each of these first let me write down what it is right for example I just take a ride is data set setosa versicolor virginica let me explain what this means this means that our data set C contains n data points right so because I am writing I equals to 1 to n I have x and y I so for every X I have a corresponding Y I and I have n such I have n such data points right this n represents the number of data points in my data set now here the moment I say X I belong to Rd what do I know that X I is a D dimensional vector of real values so in the case of iris data set in the case of iris data set we had four features right so X I belongs to R for right which means for every data point for every flower for which you have taken observations we have for for real menus right sorry we have for real values we have the corresponding sepal length sepal width petal length and petal as i mentioned do if you're not informed what it is this basically means a column vector by default right and for every column vector with four fields you have a corresponding Y and this Y I could be either set dosa versicolor or virginica right so if somebody writes this you can quickly and of course there are other types of why is why I could be 0 or 1 why I could itself belong to real value will see different machine learning tasks for which there will be different voice but typically and not all the values may be real valued right here we're always writing everything as a real value vector right is data set everything is a real value because your sepal length sepal width petal length petal width are all real values we will see more variations of data this is this hobby represented so whenever I write like this what it means just to quickly reiterate is that my data set D is a collection of data points X I and class labels Y so x i's are my data points and y eyes are my class labels okay and i have n such data points and class labels right so where my each data point belongs to rd in the case of is its r4 which means it's a four dimensional vector again column vector which consists of sepal length sepal width petal and then petal bit and while i now belongs to one of the three types of flowers we have a third setosa what c color or virginica this is how this is one way of representing a data set there are other ways we'll see one other very useful way of representing a data set but this is one way of representing it\n"