The Power of ggplot2: Unlocking Statistics with Geom Functions
In the first course, we saw that over-plotting is always a concern when drawing points with `geom_point`: every data point must be visible, or the visualization may mislead us. This is particularly challenging with low-precision or integer data, where many observations share exactly the same coordinates.
To address this issue, ggplot2 provides geometry layers, known as "geoms", each of which is backed by a statistical transformation (a "stat") that can also be called directly. In this article, we'll delve into two useful geoms: `geom_count`, a solution for low-precision and integer data, and `geom_quantile`, a tool for describing data with quantile regression.
First, let's explore the `geom_count` function. It is particularly useful with low-precision data, where jittering can create the illusion of more precision than is actually present. Jittering is a common technique for preventing over-plotting: each data point is offset from its true position by a small random amount. Because the plotted positions no longer match the recorded values, we should always mention that the data have been jittered.
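As a rough illustration of this effect, the sketch below jitters the sepal measurements from the `iris` dataset with ggplot2's `geom_jitter()`. The choice of variables, the jitter amount of 0.05 units, the seed, and the caption wording are assumptions made for the example, not values taken from the course.

```r
library(ggplot2)

# iris sepal measurements are recorded to one decimal place.
# Jitter each point by up to 0.05 units (an arbitrary choice for this sketch)
# and say so in the caption, since the plot now suggests extra precision.
p_jitter <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_jitter(
    position = position_jitter(width = 0.05, height = 0.05, seed = 1),
    alpha = 0.6
  ) +
  labs(caption = "Points jittered by up to 0.05 units in each direction")

# The positions that are actually drawn carry more decimals than the raw data
jittered <- layer_data(p_jitter)
head(jittered$x)
```

Printing `p_jitter` draws the plot; `layer_data()` is only used here to inspect the positions ggplot2 computed.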
To avoid this issue, we can use `geom_count`, a variant of `geom_point`, to plot the count of observations at each location. It counts the number of points at each unique (x, y) position and maps that count onto size, so locations with more observations get larger points. The count is mapped onto the area of the circle rather than its radius, because we perceive area more intuitively than radius. Remember that `geom_count` is associated with a stat function, `stat_sum()`, that can be called directly and produces exactly the same plot.
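A minimal sketch of this idea, using the `iris` dataset, follows. The choice of `Sepal.Length` and `Sepal.Width` and the `alpha` value are assumptions for illustration. The second plot calls `stat_sum()`, the stat behind ggplot2's `geom_count()`, to show that both routes compute the same counts.

```r
library(ggplot2)

# Many iris observations share the same (Sepal.Length, Sepal.Width) pair,
# so count the observations at each location and map the count to point size.
p_count <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_count(alpha = 0.6)

# The same layer expressed through its stat: geom_count() uses stat_sum()
p_sum <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  stat_sum(alpha = 0.6)

# Inspect the computed counts: n is the number of observations at each location
counts <- layer_data(p_count)
head(counts[, c("x", "y", "n", "prop")])
```

Each row of `counts` is one distinct location, and the `n` values add up to the number of rows in `iris`.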
While `geom_count` works well for low-precision and integer data, over-plotting can still occur if the points are colored according to another variable: the counts are then computed separately for each color group, and circles from different groups can sit on top of each other at the same location. This makes the plot particularly difficult to read, so take care when mapping color or adding further layers.
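The sketch below shows how this can happen with the `iris` data when `Species` is mapped to color. The variables and the use of transparency as a partial mitigation are assumptions for the example; the source only warns about the problem and does not prescribe a fix.

```r
library(ggplot2)

# Counts per location, ignoring species
p_plain <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_count()

# Counts per location AND species: one circle per species at each location.
# alpha is an illustrative mitigation, not something the source recommends.
p_color <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_count(alpha = 0.4)

d_plain <- layer_data(p_plain)
d_color <- layer_data(p_color)

# TRUE when at least one location holds circles from more than one species,
# which is where the colored plot over-plots
any(duplicated(d_color[, c("x", "y")]))
```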
Another useful geom is `geom_quantile`. It allows us to model quantiles, which are robust against outliers and well suited to describing data with non-constant variance. Quantiles are values that divide a dataset into equal-sized parts, such as the 25th percentile (Q1) or the median (the 50th percentile, which is also the second quartile). In contrast, ordinary linear models describe the mean, which is sensitive to outliers.
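To make the contrast between the mean and the median concrete, here is a small base R sketch. The vector `x` is made-up data chosen so that a single outlier dominates the mean.

```r
# Hypothetical data: four small values and one large outlier
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22: pulled far upward by the single outlier
median(x)  # 3: unaffected by how extreme the outlier is

# Any quantile can be requested through probs; 0.50 is the median
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q
```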
`geom_quantile` lets us choose any quantile we're interested in and model it with quantile regression. A typical use case is heteroscedasticity, where the variance of the response is not constant across the predictor variable. In such cases, the assumptions of an ordinary linear model may not hold, and quantile regression offers a way to describe how the spread of the data changes.
To illustrate `geom_quantile`, consider the dataset of economics journals (`Journals`) from the `AER` package. This dataset exhibits heteroscedasticity: the variance on the y-axis is not constant as we move along the x-axis. By using `geom_quantile` to model the 5th and 95th percentiles together with the median (the 50th percentile), we can see both the central trend and how the spread of the data changes. Like `geom_count`, `geom_quantile` is associated with a stat function that can be called directly.
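A sketch of such a plot follows. It assumes the `AER` and `quantreg` packages are installed, since ggplot2's quantile layers rely on `quantreg` for fitting. The variables mapped to the axes, price per citation against the number of subscriptions on log scales, are an assumption for illustration; the source does not say which variables it plots.

```r
library(ggplot2)

# Economics journals data from the AER package (assumed to be installed)
data("Journals", package = "AER")

# Assumed mapping for this sketch: log price per citation vs. log subscriptions
p_quant <- ggplot(Journals, aes(x = log(price / citations), y = log(subs))) +
  geom_point(alpha = 0.5) +
  # Fit quantile regression lines for the 5th, 50th and 95th percentiles
  geom_quantile(quantiles = c(0.05, 0.50, 0.95))

# The same lines through the stat that geom_quantile() uses
p_stat <- ggplot(Journals, aes(x = log(price / citations), y = log(subs))) +
  geom_point(alpha = 0.5) +
  stat_quantile(quantiles = c(0.05, 0.50, 0.95))

# Which quantiles were fitted in the second layer of the plot
fitted_q <- sort(unique(layer_data(p_quant, 2)$quantile))
fitted_q
```

If the outer lines fan out or converge along the x-axis, that is a visual sign that the spread of the response changes with the predictor.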
In conclusion, these two geoms extend the statistical toolkit that ggplot2 offers for visualization. `geom_count` handles low-precision and integer data without implying false precision, and `geom_quantile` describes data whose variance is not constant. Used with care, both lead to visualizations that represent the data more faithfully.
"WEBVTTKind: captionsLanguage: enlet's have a proper discussion of stats called from within G ohms by looking at two more useful functions G ohm count and jump on tile in the first course we saw that over plotting is always a concern whenever we use G on point every data point must be visible we discussed four ways in which our visualizations may mislead us we can now add a new giome function to our solutions for low precision and integer data and gio mount plots the count at each location a course 3 we'll see even more elegant solutions that can be applied to all four situations let's look at an example with geum count the iris dataset where we have low precision data jittering gives the impression that we have more precision than we actually do we should always mention that we've jittered our data because of this to avoid this problem we can use another variant of G on point G ohm count counts the number of observations at each location and then map's the count on to size as the point area our data is mapped onto the area of the circle as opposed to its radius since we more intuitively perceived area than radius remember that these genomes are associated with stats functions that can be called directly as shown here calling the stats function gives the exact same part we'll see this trick used with integer data in the exercises which is a very common use but be careful here you'll still encounter over plotting if the points are colored according to another variable this makes it particularly difficult to read the plot the last function I want to look at in this section is giome quantile it's another great tool for describing our data this method allows us to model quantiles which are robust as opposed to linear models which model the non robust mean we can choose any quantile we're interested in such as a medium which is just the second quantile a typical case of using quantile regression would be when you have heteroscedasticity that is the variance across the 
predictor variable is not consistent in which case linear models may not be valid here's an example of heteroscedasticity from a data set of economics journals from the eer package we won't get into the details of the data but you can see that variance on the y-axis is not consistent as we move along the x axis here we can use GM quantile to model the fifth and the 95th percentile as well as the median the fiftieth percentile like the previous gyms this is also associated with the stats function that we can actually call directly let's take these functions for a spinlet's have a proper discussion of stats called from within G ohms by looking at two more useful functions G ohm count and jump on tile in the first course we saw that over plotting is always a concern whenever we use G on point every data point must be visible we discussed four ways in which our visualizations may mislead us we can now add a new giome function to our solutions for low precision and integer data and gio mount plots the count at each location a course 3 we'll see even more elegant solutions that can be applied to all four situations let's look at an example with geum count the iris dataset where we have low precision data jittering gives the impression that we have more precision than we actually do we should always mention that we've jittered our data because of this to avoid this problem we can use another variant of G on point G ohm count counts the number of observations at each location and then map's the count on to size as the point area our data is mapped onto the area of the circle as opposed to its radius since we more intuitively perceived area than radius remember that these genomes are associated with stats functions that can be called directly as shown here calling the stats function gives the exact same part we'll see this trick used with integer data in the exercises which is a very common use but be careful here you'll still encounter over plotting if the points are 
colored according to another variable this makes it particularly difficult to read the plot the last function I want to look at in this section is giome quantile it's another great tool for describing our data this method allows us to model quantiles which are robust as opposed to linear models which model the non robust mean we can choose any quantile we're interested in such as a medium which is just the second quantile a typical case of using quantile regression would be when you have heteroscedasticity that is the variance across the predictor variable is not consistent in which case linear models may not be valid here's an example of heteroscedasticity from a data set of economics journals from the eer package we won't get into the details of the data but you can see that variance on the y-axis is not consistent as we move along the x axis here we can use GM quantile to model the fifth and the 95th percentile as well as the median the fiftieth percentile like the previous gyms this is also associated with the stats function that we can actually call directly let's take these functions for a spin\n"