Outlier

An outlier is an extreme observation that might badly distort the test results.

Example

The teacher measured the height of a group of students: 110, 115, 130, 145, 721, 151, 160, 128, 137.
With the following results:
Average: 199.7, Standard deviation: 196.2.

She clearly made a typo.

The results after correcting the outlier to 121:
Average: 133, Standard deviation: 16.7.

If she analyzed the data during a public holiday and the correct value were not so obvious, she might need to exclude the observation instead, with the following results:
Average: 134.5, Standard deviation: 17.2.
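
The three sets of results above can be reproduced with a short Python sketch. It assumes the reported standard deviation is the sample standard deviation (dividing by n − 1), which agrees with the figures above; the variable and function names are illustrative.

```python
from statistics import mean, stdev

# Heights as recorded by the teacher, including the typo 721.
heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]

# Option 1: correct the typo to the intended value.
corrected = [121 if h == 721 else h for h in heights]

# Option 2: exclude the suspicious observation.
excluded = [h for h in heights if h != 721]

def summarize(values):
    """Return (average, sample standard deviation), rounded to one decimal."""
    return round(mean(values), 1), round(stdev(values), 1)

original_stats = summarize(heights)
corrected_stats = summarize(corrected)
excluded_stats = summarize(excluded)

print("original: ", original_stats)
print("corrected:", corrected_stats)
print("excluded: ", excluded_stats)
```

A single wrong value raises the average by about 50% and the standard deviation more than tenfold, while correcting or excluding it gives nearly the same summary.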

Why do we get outliers?

Observation errors

The observation's value is incorrect for one of the reasons below.
You should exclude these incorrect outliers.

Measurement error

A faulty measurement tool or an incorrect measurement process.

Experiment error

Example: when counting bacteria, some of the Petri dishes are contaminated and show a larger count.

Human error

Any human error, such as entering an incorrect value, reading the measurement tool incorrectly, or lying.

Incorrect statistical model

When you use the wrong model, some valid values appear as outliers. Removing them would be a mistake; instead, you should fix the model.
You don't want to exclude these observations!

Incorrect distribution

The real statistical distribution is not symmetric, and the outlier is valid.
How to fix it?
Use the correct distribution, use a non-parametric test that does not assume normally distributed data, or transform the data to fit the normal distribution better.

Mixture of populations

The sampled population is composed of two or more groups with different characteristics.
How to fix it?
Analyze each population separately, or account for the groups in the model, for example by adding a group predictor to a regression.

Valid outliers

There is a low probability of getting a genuine extreme value.
When you use a large sample size, you will almost certainly get some such observations, and you must not exclude them from the research.
For example, in a normal distribution, there is a probability of about 0.05 (more precisely, 0.0455) of getting a value more than two standard deviations from the average.
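
This probability can be checked with a minimal Python sketch that uses only the standard library. It relies on the identity P(|Z| > k) = erfc(k / √2) for a standard normal variable Z; the function name `two_sided_tail` and the sample size of 1000 are illustrative.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

p_two_sd = two_sided_tail(2)      # more than 2 standard deviations
p_196_sd = two_sided_tail(1.96)   # more than 1.96 standard deviations

# Expected number of such valid extreme values in a sample of 1000.
expected_in_1000 = 1000 * p_two_sd

print(round(p_two_sd, 4), round(p_196_sd, 4), round(expected_in_1000, 1))
```

The tail beyond exactly two standard deviations is a little under 0.05; the round figure of 0.05 corresponds to 1.96 standard deviations. In a sample of 1000 valid normal observations, roughly 45 are expected to lie beyond two standard deviations.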

Detection Methods

There are many ways to identify outliers. The following are two of the commonly used methods.

Z-score

Usually with k=3.
Lower = Average − k × Standard Deviation.
Upper = Average + k × Standard Deviation.
A potential problem is that the outliers themselves increase the standard deviation of the sample, which widens the thresholds and can hide the outliers.
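
The thresholds can be sketched in Python as follows. The sketch assumes the sample standard deviation and reuses the heights from the example at the top of the page; the function names `z_score_bounds` and `z_score_outliers` are illustrative. It also shows the potential problem in action: the typo 721 inflates the standard deviation so much that it stays inside the k=3 thresholds.

```python
from statistics import mean, stdev

def z_score_bounds(values, k=3):
    """Return (lower, upper) = average -/+ k * sample standard deviation."""
    avg = mean(values)
    sd = stdev(values)
    return avg - k * sd, avg + k * sd

def z_score_outliers(values, k=3):
    """Return the observations outside the z-score thresholds."""
    lower, upper = z_score_bounds(values, k)
    return [v for v in values if v < lower or v > upper]

heights = [110, 115, 130, 145, 721, 151, 160, 128, 137]

lower, upper = z_score_bounds(heights)
flagged_k3 = z_score_outliers(heights, k=3)
flagged_k2 = z_score_outliers(heights, k=2)

print(round(lower, 1), round(upper, 1))
print("k=3:", flagged_k3)
print("k=2:", flagged_k2)
```

With k=3 the upper threshold is about 788, so 721 is not flagged; with k=2 it is. In small samples a single extreme value can mask itself this way, which is one reason to compare the result with Tukey's Fences.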

Tukey's Fences

Usually with k=1.5.
Interquartile Range: IQR = Q3 − Q1.
Lower = Q1 − k × IQR.
Upper = Q3 + k × IQR.

Even list

[21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26].
Sorted: [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26, 38].
Divide the sorted list into two equal halves, then find the median of each half:
[11, 13, 14, 16, 17, 18], [20, 21, 22, 22, 26, 38].
Q1 = (14+16)/2 = 15.
Q3 = (22+22)/2 = 22.
IQR = Q3 − Q1 = 22 − 15 = 7.
Lower = Q1 − k × IQR = 15 − 1.5 × 7 = 4.5.
Upper = Q3 + k × IQR = 22 + 1.5 × 7 = 32.5.
Every observation below the lower threshold or above the upper threshold is an outlier. In this list, 38 is above 32.5, so it is an outlier.
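
The even-list calculation can be sketched in Python. The quartiles are taken as the medians of the lower and upper halves of the sorted list, as in the worked example; other quartile conventions, including those used by common libraries, can give slightly different values. The function name `tukey_fences` is illustrative, and this sketch handles only lists of even length.

```python
from statistics import median

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences for a list of even length.

    Q1 and Q3 are the medians of the lower and upper halves
    of the sorted list.
    """
    data = sorted(values)
    if len(data) % 2 != 0:
        raise ValueError("this sketch handles even-length lists only")
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = [21, 13, 14, 16, 38, 17, 18, 11, 20, 22, 22, 26]
lower, upper = tukey_fences(values)
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper, outliers)
```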

Odd list

[11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26].
The number 18 is in the middle of the list. We choose to include it in both divided lists (removing it from both is also valid).
[11, 13, 14, 16, 17, 18], [18, 20, 21, 22, 22, 26].
Q1 = (14+16)/2 = 15.
Q3 = (21+22)/2 = 21.5.
The rest of the calculation is identical to that of the even list.
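
Both ways of handling the middle value can be sketched in Python. The function name `quartiles_odd` and its `include_median` parameter are illustrative and not taken from any library.

```python
from statistics import median

def quartiles_odd(values, include_median=True):
    """Return (Q1, Q3) for a list of odd length.

    include_median=True puts the middle value in both halves;
    include_median=False removes it from both.
    """
    data = sorted(values)
    if len(data) % 2 != 1:
        raise ValueError("this sketch handles odd-length lists only")
    mid = len(data) // 2
    if include_median:
        lower_half, upper_half = data[:mid + 1], data[mid:]
    else:
        lower_half, upper_half = data[:mid], data[mid + 1:]
    return median(lower_half), median(upper_half)

values = [11, 13, 14, 16, 17, 18, 20, 21, 22, 22, 26]

q1_incl, q3_incl = quartiles_odd(values, include_median=True)
q1_excl, q3_excl = quartiles_odd(values, include_median=False)

print("middle value included:", q1_incl, q3_incl)
print("middle value removed: ", q1_excl, q3_excl)
```

With the middle value included, the sketch reproduces Q1 = 15 and Q3 = 21.5. With it removed, it gives Q1 = 14 and Q3 = 22, so the two conventions lead to slightly different thresholds.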
