Author: Jason Brownlee
Calculating the average of a variable or a list of numbers is a common operation in machine learning.
It is an operation you may use every day either directly, such as when summarizing data, or indirectly, such as a smaller step in a larger procedure when fitting a model.
The average is a synonym for the mean, a number that represents the most likely value from a probability distribution. As such, there are multiple different ways to calculate the mean based on the type of data that you’re working with.
This can trip you up if you use the wrong mean for your data. You may also encounter some of these more exotic calculations of mean values when using performance metrics to evaluate your model, such as the G-mean or the F-Measure.
In this tutorial, you will discover the difference between the arithmetic mean, the geometric mean, and the harmonic mean.
After completing this tutorial, you will know:
- The central tendency summarizes the most likely value for a variable, and the average is the common name for the calculation of the mean.
- The arithmetic mean is appropriate if the values have the same units, whereas the geometric mean is appropriate if the values have differing units.
- The harmonic mean is appropriate if the data values are ratios of two variables with different measures, called rates.
Let’s get started.
Tutorial Overview
This tutorial is divided into five parts; they are:
- What Is the Average?
- Arithmetic Mean
- Geometric Mean
- Harmonic Mean
- How to Choose the Correct Mean?
What Is the Average?
The central tendency is a single number that represents the most common value for a list of numbers.
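More technically, it is often described as the value that has the highest probability from the probability distribution that describes all possible values that a variable may have. Strictly, that description is the mode; the mean and the median coincide with it only for symmetric, single-peaked distributions such as the Gaussian.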
There are many ways to calculate the central tendency for a data sample, such as the mean, which is calculated from the values; the mode, which is the most common value in the data distribution; or the median, which is the middle value if all values in the data sample were ordered.
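As a minimal sketch of these three measures, the example below uses Python's standard statistics module (an assumption of this sketch; the examples later in the tutorial use NumPy and SciPy) on a small, made-up sample chosen so that the mean, median, and mode all differ.

```python
# mean, median, and mode of a small sample using the standard library
from statistics import mean, median, mode

# hypothetical sample, chosen so that the three measures differ
data = [1, 2, 2, 3, 4, 7, 9]

sample_mean = mean(data)      # sum of the values divided by their count
sample_median = median(data)  # middle value of the sorted sample
sample_mode = mode(data)      # most frequent value in the sample

print('Mean: %.3f' % sample_mean)
print('Median: %.3f' % sample_median)
print('Mode: %.3f' % sample_mode)
```

For this sample, the mean is 4.000, the median is 3.000, and the mode is 2.000.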
The average is the common term for the mean. They can be used interchangeably.
The mean is different from the median and the mode in that it is a measure of the central tendency that is calculated from all of the values in the data, rather than selected from among them by position or frequency. As such, there are different ways to calculate the mean based on the type of data.
Three common types of mean calculations that you may encounter are the arithmetic mean, the geometric mean, and the harmonic mean. There are other means, and many more central tendency measures, but these three are perhaps the most common; together they are known as the Pythagorean means.
Let’s take a closer look at each calculation of the mean in turn.
Arithmetic Mean
The arithmetic mean is calculated as the sum of the values divided by the total number of values, referred to as N.
- Arithmetic Mean = (x1 + x2 + … + xN) / N
Equivalently, the arithmetic mean can be calculated by multiplying the sum of the values by the reciprocal of the number of values (1 over N); for example:
- Arithmetic Mean = (1/N) * (x1 + x2 + … + xN)
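The two forms of the calculation can be checked by hand. The sketch below uses plain Python with no libraries; the variable names are illustrative only.

```python
# arithmetic mean calculated by hand in two equivalent ways
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
n = len(data)

# sum of the values divided by the number of values
mean_divide = sum(data) / n

# sum of the values multiplied by the reciprocal of the number of values
mean_reciprocal = (1.0 / n) * sum(data)

print('Sum / N: %.3f' % mean_divide)
print('(1/N) * Sum: %.3f' % mean_reciprocal)
```

Both lines report 4.500 for this list.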
The arithmetic mean is appropriate when all values in the data sample have the same units of measure, e.g. all numbers are heights, or dollars, or miles, etc.
When calculating the arithmetic mean, the values can be positive, negative, or zero.
The arithmetic mean can be easily distorted if the sample of observations contains outliers (a few values far away in feature space from all other values), and it can be a misleading summary for data that has a non-Gaussian distribution (e.g. multiple peaks, a so-called multi-modal probability distribution).
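To make the effect of an outlier concrete, the sketch below compares the mean and the median of a small, made-up sample before and after a single extreme value is added. It uses the standard statistics module, which is an assumption of this sketch rather than part of the tutorial's examples.

```python
# effect of a single outlier on the arithmetic mean versus the median
from statistics import mean, median

# hypothetical sample without outliers
data = [10, 11, 12, 13, 14]
# the same sample with one extreme value appended
with_outlier = data + [1000]

print('Mean without outlier: %.3f' % mean(data))
print('Mean with outlier: %.3f' % mean(with_outlier))
print('Median without outlier: %.3f' % median(data))
print('Median with outlier: %.3f' % median(with_outlier))
```

The mean moves from 12.000 to 176.667, whereas the median only moves from 12.000 to 12.500.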
The arithmetic mean is useful in machine learning when summarizing a variable, e.g. reporting the most likely value. This is more meaningful when a variable has a Gaussian or Gaussian-like data distribution.
The arithmetic mean can be calculated using the mean() NumPy function.
The example below demonstrates how to calculate the arithmetic mean for a list of 10 numbers.
# example of calculating the arithmetic mean
from numpy import mean
# define the dataset
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# calculate the mean
result = mean(data)
print('Arithmetic Mean: %.3f' % result)
Running the example calculates the arithmetic mean and reports the result.
Arithmetic Mean: 4.500
Geometric Mean
The geometric mean is calculated as the N-th root of the product of all values, where N is the number of values.
- Geometric Mean = N-root(x1 * x2 * … * xN)
For example, if the data contains only two values, the square root of the product of the two values is the geometric mean. For three values, the cube-root is used, and so on.
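The square-root and cube-root cases can be sketched by hand as follows. The helper names are hypothetical, and the second function uses the average of the logarithms, a mathematically equivalent form that avoids forming a very large product.

```python
# geometric mean calculated by hand for two and three values
from math import prod, log, exp

def geometric_mean_root(values):
    # N-th root of the product of the values
    return prod(values) ** (1.0 / len(values))

def geometric_mean_log(values):
    # exponential of the average logarithm (equivalent for positive values)
    return exp(sum(log(v) for v in values) / len(values))

two_values = [2, 8]       # product is 16, square root is 4
three_values = [1, 3, 9]  # product is 27, cube root is 3

gm_two = geometric_mean_root(two_values)
gm_three = geometric_mean_root(three_values)

print('Two values: %.3f' % gm_two)
print('Three values: %.3f' % gm_three)
print('Three values via logs: %.3f' % geometric_mean_log(three_values))
```

The example reports 4.000 for the two values and 3.000 for the three values, with both forms agreeing.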
The geometric mean is appropriate when the data contains values with different units of measure, e.g. some measures are heights, some are dollars, some are miles, etc.
The geometric mean does not accept negative or zero values, i.e. all values must be positive.
One common example of the geometric mean in machine learning is the so-called G-Mean (geometric mean) metric, a model evaluation metric calculated as the geometric mean of the sensitivity and specificity metrics.
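As a rough sketch of the G-Mean calculation, the example below starts from made-up confusion matrix counts for a binary classifier. The counts and variable names are hypothetical and only illustrate the arithmetic.

```python
# G-Mean as the geometric mean of sensitivity and specificity
from math import sqrt

# hypothetical confusion matrix counts for a binary classifier
true_pos, false_neg = 80, 20
true_neg, false_pos = 50, 50

# sensitivity is the true positive rate
sensitivity = true_pos / (true_pos + false_neg)
# specificity is the true negative rate
specificity = true_neg / (true_neg + false_pos)

# geometric mean of two values is the square root of their product
g_mean = sqrt(sensitivity * specificity)

print('Sensitivity: %.3f' % sensitivity)
print('Specificity: %.3f' % specificity)
print('G-Mean: %.3f' % g_mean)
```

With a sensitivity of 0.800 and a specificity of 0.500, the G-Mean is 0.632, which is lower than the arithmetic mean of the two scores (0.650).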
The geometric mean can be calculated using the gmean() SciPy function.
The example below demonstrates how to calculate the geometric mean for a list of 10 numbers.
# example of calculating the geometric mean
from scipy.stats import gmean
# define the dataset
data = [1, 2, 3, 40, 50, 60, 0.7, 0.88, 0.9, 1000]
# calculate the mean
result = gmean(data)
print('Geometric Mean: %.3f' % result)
Running the example calculates the geometric mean and reports the result.
Geometric Mean: 7.246
Harmonic Mean
The harmonic mean is calculated as the number of values N divided by the sum of the reciprocal of the values (1 over each value).
- Harmonic Mean = N / (1/x1 + 1/x2 + … + 1/xN)
If there are just two values (x1 and x2), the harmonic mean simplifies to:
- Harmonic Mean = (2 * x1 * x2) / (x1 + x2)
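A quick way to confirm that the general and the two-value forms agree is to evaluate both on the same pair of numbers. The values below are arbitrary and chosen only for easy arithmetic.

```python
# harmonic mean of two values: general form versus simplified form
values = [2.0, 8.0]
n = len(values)

# general form: N divided by the sum of the reciprocals
general = n / sum(1.0 / x for x in values)

# simplified form for exactly two values
x1, x2 = values
simplified = (2 * x1 * x2) / (x1 + x2)

print('General form: %.3f' % general)
print('Simplified form: %.3f' % simplified)
```

Both forms report 3.200 for these two values.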
The harmonic mean is the appropriate mean if the data is comprised of rates.
Recall that a rate is the ratio between two quantities with different measures, e.g. speed, acceleration, frequency, etc.
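Speed makes a useful worked example. In the sketch below, a made-up round trip covers the same distance at two different speeds; the true average speed (total distance over total time) matches the harmonic mean of the two speeds, not their arithmetic mean. This holds because the two legs cover equal distances. The harmonic_mean() function from the standard statistics module is used here as an assumption of this sketch.

```python
# average speed over a round trip with equal distances each way
from statistics import harmonic_mean

# hypothetical trip: same distance out and back at different speeds
distance_km = 120.0
speed_out = 40.0   # km/h
speed_back = 60.0  # km/h

# true average speed is total distance divided by total time
total_time = distance_km / speed_out + distance_km / speed_back
true_average = (2 * distance_km) / total_time

# compare the arithmetic and harmonic means of the two speeds
arithmetic = (speed_out + speed_back) / 2
harmonic = harmonic_mean([speed_out, speed_back])

print('True average speed: %.3f' % true_average)
print('Arithmetic mean: %.3f' % arithmetic)
print('Harmonic mean: %.3f' % harmonic)
```

The true average speed and the harmonic mean are both 48.000 km/h, whereas the arithmetic mean overstates it at 50.000 km/h.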
In machine learning, we have rates when evaluating models, such as the true positive rate or the false positive rate in predictions.
The harmonic mean does not take rates with a negative or zero value, i.e. all rates must be positive.
One common example of the use of the harmonic mean in machine learning is the F-Measure (also called the F1-Measure), a model evaluation metric calculated as the harmonic mean of the precision and recall metrics. The Fbeta-Measure generalizes it to a weighted harmonic mean of the two.
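The F1 calculation can be sketched from made-up prediction counts as follows. The counts and variable names are hypothetical, and the standard statistics module is used in place of SciPy to keep the sketch dependency-free.

```python
# F1-Measure as the harmonic mean of precision and recall
from statistics import harmonic_mean

# hypothetical prediction counts for the positive class
true_pos, false_pos, false_neg = 30, 10, 30

# precision: fraction of positive predictions that are correct
precision = true_pos / (true_pos + false_pos)
# recall: fraction of actual positives that are found
recall = true_pos / (true_pos + false_neg)

# harmonic mean of the two scores
f1 = harmonic_mean([precision, recall])
# the familiar two-value formula gives the same result
f1_formula = (2 * precision * recall) / (precision + recall)

print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
print('F1: %.3f' % f1)
```

With a precision of 0.750 and a recall of 0.500, the F1 score is 0.600.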
The harmonic mean can be calculated using the hmean() SciPy function.
The example below demonstrates how to calculate the harmonic mean for a list of nine numbers.
# example of calculating the harmonic mean
from scipy.stats import hmean
# define the dataset
data = [0.11, 0.22, 0.33, 0.44, 0.55, 0.66, 0.77, 0.88, 0.99]
# calculate the mean
result = hmean(data)
print('Harmonic Mean: %.3f' % result)
Running the example calculates the harmonic mean and reports the result.
Harmonic Mean: 0.350
How to Choose the Correct Mean?
We have reviewed three different ways of calculating the average or mean of a variable or dataset.
The arithmetic mean is the most commonly used mean, although it may not be appropriate in some cases.
Each mean is appropriate for different types of data; for example:
- If values have the same units: Use the arithmetic mean.
- If values have differing units: Use the geometric mean.
- If values are rates: Use the harmonic mean.
The exception is data that contains negative or zero values, in which case the geometric and harmonic means cannot be used directly.
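To see the three means side by side, the sketch below implements each by hand with an explicit check for non-positive values. The function names and the choice to raise a ValueError are assumptions of this sketch, not the behavior of any particular library. For positive data, the arithmetic mean is always at least as large as the geometric mean, which is at least as large as the harmonic mean.

```python
# the three means side by side, with a guard against non-positive values
from math import exp, log

def arithmetic_mean(values):
    # accepts positive, negative, and zero values
    return sum(values) / len(values)

def geometric_mean(values):
    # hypothetical guard: reject zero and negative values
    if any(v <= 0 for v in values):
        raise ValueError('geometric mean requires all values to be positive')
    return exp(sum(log(v) for v in values) / len(values))

def harmonic_mean(values):
    # hypothetical guard: reject zero and negative values
    if any(v <= 0 for v in values):
        raise ValueError('harmonic mean requires all values to be positive')
    return len(values) / sum(1.0 / v for v in values)

data = [1, 2, 4, 8]
am = arithmetic_mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print('Arithmetic: %.3f, Geometric: %.3f, Harmonic: %.3f' % (am, gm, hm))

# a zero in the data is rejected by the guard
try:
    geometric_mean([0, 1, 2])
except ValueError as err:
    print('Error: %s' % err)
```

For this list, the example reports 3.750, 2.828, and 2.133, followed by the error message for the data containing a zero.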
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Articles
- Average, Wikipedia.
- Central tendency, Wikipedia.
- Arithmetic mean, Wikipedia.
- Geometric mean, Wikipedia.
- Harmonic mean, Wikipedia.
Summary
In this tutorial, you discovered the difference between the arithmetic mean, the geometric mean, and the harmonic mean.
Specifically, you learned:
- The central tendency summarizes the most likely value for a variable, and the average is the common name for the calculation of the mean.
- The arithmetic mean is appropriate if the values have the same units, whereas the geometric mean is appropriate if the values have differing units.
- The harmonic mean is appropriate if the data values are ratios of two variables with different measures, called rates.
The post Arithmetic, Geometric, and Harmonic Means for Machine Learning appeared first on Machine Learning Mastery.