R Tutorial - What is an anomaly

Welcome to the Course: Anomaly Detection in R

This course is all about anomaly detection in R, so let's start by considering what is meant by an anomaly. An anomaly can be defined as a data point, or a collection of data points, that does not seem to follow the same pattern as the rest of the data.

There are several ways in which a data point can differ from the rest of a data set. To make this clearer, let's consider some specific examples. The simplest type is the point anomaly, which motivates many of the techniques covered in this course. A point anomaly is a single data point that is unusual compared to the rest of the data. For example, a single unseasonably hot spring day could be considered anomalous because its temperature is extreme compared to the temperatures on all of the other days.

Point anomalies often occur in this way, as a single extreme value on one attribute of the data point. The summary function prints the minimum, the maximum, the lower and upper quartiles, and the mean and median, which together give a sense of how far an extreme point lies from the rest of the data. It's quite clear in this case that the 30 degrees Celsius day is a long way from the median of 22.4.
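As a rough illustration of this kind of summary, the sketch below uses a small made-up vector of daily temperatures. The values in temp are hypothetical, not the course's data; they are chosen only so that the median (22.4) and the maximum (30) match the figures quoted above.

```r
# Hypothetical daily temperatures in Celsius (illustrative values only,
# not the course data set)
temp <- c(19.8, 20.5, 21.1, 21.7, 22.0, 22.4, 22.8, 23.1, 23.5, 24.0, 30.0)

# Minimum, quartiles, median, mean and maximum in one call
temp_summary <- summary(temp)
print(temp_summary)

# How far the largest value sits above the median
gap <- max(temp) - median(temp)
print(gap)
```

Comparing the gap above the median with the spread between the quartiles is a quick, informal way to judge whether the largest value is out of line with the rest.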

The easiest way to get a sense of how unusual a particular value is, however, is to use a graphical summary such as a box plot. In R, this is created with the boxplot function, which takes a column of values as its input argument (here, the temperature data) and produces a box-and-whisker representation of their distribution. The ylab argument accepts a character string with which to label the y-axis; in this case, the units are degrees Celsius.
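A minimal sketch of such a call is shown here, again on hypothetical temperatures rather than the course data. The object name bp is an assumption for illustration; boxplot returns its plotting statistics invisibly, so the result can be captured and inspected.

```r
# Hypothetical daily temperatures in Celsius (illustrative values only)
temp <- c(19.8, 20.5, 21.1, 21.7, 22.0, 22.4, 22.8, 23.1, 23.5, 24.0, 30.0)

# Draw the box plot and label the y-axis with the units
bp <- boxplot(temp, ylab = "Celsius")

# Values that were drawn as separate points beyond the whiskers
print(bp$out)
```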

The box extends from the lower quartile to the upper quartile, while the whiskers stretch further and often reach the minimum and maximum values in the data. When extreme points are present, however, the whiskers stop short of them, and the extreme values are drawn as distinct points, making them easier to spot. In this case, the maximum temperature of 30 degrees Celsius stands out from the others and looks like a clear point anomaly.
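The rule that decides where the whiskers stop can be reproduced by hand. By default, R's boxplot draws a value as a separate point when it lies more than 1.5 times the height of the box beyond either edge of the box. The sketch below checks this on hypothetical temperatures using boxplot.stats, the helper that computes the box plot statistics; the variable names are illustrative assumptions.

```r
# Hypothetical daily temperatures in Celsius (illustrative values only)
temp <- c(19.8, 20.5, 21.1, 21.7, 22.0, 22.4, 22.8, 23.1, 23.5, 24.0, 30.0)

# stats$stats holds: lower whisker end, lower hinge, median,
# upper hinge, upper whisker end (coef = 1.5 is the default)
stats <- boxplot.stats(temp)
print(stats$stats)

# Recompute the cut-offs by hand from the two hinges (the box edges)
hinges <- stats$stats[c(2, 4)]
box_height <- diff(hinges)
fences <- c(hinges[1] - 1.5 * box_height, hinges[2] + 1.5 * box_height)

# Values outside the fences are the ones drawn as distinct points
flagged <- temp[temp < fences[1] | temp > fences[2]]
print(flagged)
```

The 1.5 multiplier is a convention rather than a formal test; it can be changed through the range argument of boxplot or the coef argument of boxplot.stats.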

It's worth noting that a point anomaly is not necessarily extreme. A point anomaly can also arise as an unusual combination of values across several attributes, where each value looks ordinary on its own but the combination is rare.
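To make the idea of an unusual combination concrete, here is a small sketch on made-up height and weight measurements. The data, the variable names and the choice of the Mahalanobis distance as a score are all assumptions for illustration; the text above does not prescribe a particular method. The last observation has a height and a weight that each fall inside the range of the other values, but the pairing of the two is out of step with the rest.

```r
# Hypothetical measurements (illustrative values only)
height_cm <- c(150, 155, 160, 165, 170, 175, 180, 185, 190, 185)
weight_kg <- c(51, 54, 61, 64, 71, 74, 81, 84, 91, 55)
people <- cbind(height_cm, weight_kg)

# Squared Mahalanobis distance of each row from the centre of the data,
# which accounts for the correlation between the two attributes
d2 <- mahalanobis(people, center = colMeans(people), cov = cov(people))
print(round(d2, 2))

# Row with the most unusual combination of values
most_unusual <- which.max(d2)
print(people[most_unusual, ])
```

Looking at height or weight one at a time would not single out this row, which is why a score that uses both attributes together is needed.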

A collective anomaly is a collection of similar data instances that can be considered anomalous together when compared to the rest of the data. For example, a period of 10 consecutive days of high temperatures is shown by the red points in the plot. These daily temperatures are unusual because they occur together and are likely caused by the same underlying weather event.
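A plot of this kind can be sketched as follows. The temperature series and the position of the 10-day period (days 7 to 16) are hypothetical and chosen only to mimic the situation described; they are not the data behind the original plot.

```r
# Hypothetical daily temperatures in Celsius (illustrative values only)
temp <- c(24, 25, 31, 24, 26, 25, 30, 31, 32, 31, 30,
          32, 33, 31, 30, 31, 25, 24, 26, 30, 25, 24)
day <- seq_along(temp)

# Assumed position of the 10-day warm period
in_event <- day %in% 7:16

# Colour the days in the event red and all other days black
point_col <- ifelse(in_event, "red", "black")
plot(day, temp, col = point_col, pch = 16, xlab = "Day", ylab = "Celsius")
```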

Data points in a collective anomaly may each also be point anomalies, but this needn't be true. For instance, in the case of daily temperatures in a heat wave, a single warm day in summer may be completely normal for the season, but several such days that occur consecutively can cause the event to be considered an anomaly.
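One simple way to look for such a run of consecutive warm days is with run-length encoding. The sketch below is an illustration under stated assumptions, not a method taken from the course: the temperatures are made up, and both the warm-day threshold (30 Celsius) and the minimum run length (5 days) are arbitrary choices.

```r
# Hypothetical daily temperatures in Celsius (illustrative values only)
temp <- c(24, 25, 31, 24, 26, 25, 30, 31, 32, 31, 30,
          32, 33, 31, 30, 31, 25, 24, 26, 30, 25, 24)

warm_threshold <- 30  # assumed cut-off for a "warm" day
min_run_length <- 5   # assumed minimum length of a collective anomaly

# Run-length encode the warm / not-warm indicator
runs <- rle(temp >= warm_threshold)
run_end <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# Keep only runs of warm days that last long enough
is_long_warm_run <- runs$values & runs$lengths >= min_run_length
heat_waves <- data.frame(start = run_start[is_long_warm_run],
                         end = run_end[is_long_warm_run])
print(heat_waves)
```

In this made-up series, the isolated warm days on days 3 and 20 are not reported, because on their own they do not form a long enough run; only the sustained stretch from day 7 to day 16 is.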

Collective anomalies are particularly important in studies over time where events can cause several data points to appear anomalous at the same time.