Tukey's fences, Z-score
The outlier calculator identifies the outliers and graphs the data. It includes a scatter plot, boxplot, histogram, and optional step-by-step calculation.
Leaving empty cells is okay. The tool ignores empty cells or non-numeric cells.
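The filtering described above can be sketched in a few lines. This is only an illustration: the calculator's actual parsing rules are not documented here, and the helper name `numeric_cells` is hypothetical. The sketch assumes that a cell counts as numeric when Python's `float()` accepts it and the result is finite.

```python
import math

def numeric_cells(cells):
    """Keep only the cells that parse as finite numbers (assumed rule)."""
    values = []
    for cell in cells:
        try:
            value = float(cell)
        except (TypeError, ValueError):
            # Empty strings, text, and None are skipped.
            continue
        if math.isfinite(value):
            values.append(value)
    return values

print(numeric_cells(["1", "", "abc", " 2.5 ", None, "nan"]))  # [1.0, 2.5]
```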
What is an outlier?
An outlier is an extreme observation: a value that lies far from most of the other observations.
An outlier may be a valid value or an incorrect value.
Why should you identify outliers?
Outliers may point to incorrect observations or to an incorrectly assumed statistical distribution.
Some statistics, like the average and the standard deviation, are sensitive to outliers, while others, like the median and the mode, are robust to outliers.
Likewise, some statistical tests, like variance tests, are very sensitive to outliers, while others, like non-parametric tests, are robust to outliers.
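A small made-up sample shows the difference in sensitivity. The numbers below are invented for illustration only; the sketch adds a single extreme value and compares how much each statistic moves.

```python
import statistics

clean = [10, 11, 12, 13, 14]
with_outlier = clean + [100]  # one extreme value added

for label, sample in (("clean", clean), ("with outlier", with_outlier)):
    print(
        label,
        "| average:", round(statistics.mean(sample), 2),
        "| median:", statistics.median(sample),
        "| standard deviation:", round(statistics.stdev(sample), 2),
    )
```

The average moves from 12 to about 26.67 and the standard deviation grows many times over, while the median only moves from 12 to 12.5.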
What to do with the outlier?
After using the outlier calculator you need to decide what to do with the outliers.
You should exclude only invalid outliers.
|Type||Reason||Description||What to do?|
|Observation error||Measurement error||The measurement tool is inaccurate or not calibrated, or the measurement process is wrong||Exclude such outliers|
|Observation error||Experiment error||For example, the temperature of some subjects was higher, and this resulted in higher values||Exclude the outlier or repeat the experiment|
|Observation error||Human error||Any mistake made by a person, like incorrectly reading the measurement tool||Exclude|
|Incorrect statistical model||Incorrect distribution||For example, you assume the normal distribution and use the z-score method to identify the outliers. A skewed or heavy-tailed distribution will result in many outliers||1. Use the correct distribution to identify the outliers. 2. Use Tukey's fences method, which is less sensitive to the distribution. 3. Transform the data to fit the normal distribution. 4. Use a non-parametric test that is not sensitive to outliers|
|Incorrect statistical model||Mixed population||The data include two or more groups with different characteristics||1. Analyze each data group separately. 2. Use a model that accounts for the group, for example, add a group predictor in a regression|
|Valid outliers||Random||Whatever method you use to identify the outliers, there is a low probability of flagging valid data as an outlier. With a large sample size you will get some valid outliers. For example, when using the z-score with two standard deviations, around 4.55% of the valid observations will be flagged as outliers|
Outliers calculation methods
There are many methods to identify outliers. This outlier calculator uses the following methods.
Tukey's fences
Q1 - Lower quartile.
Q3 - Upper quartile.
Interquartile range: IQR = Q3 - Q1.
k - The fence multiplier. Usually k = 1.5 for regular outliers and k = 3 for extreme outliers.
Some people recommend using k = 2.2.
A value below the lower fence or above the upper fence is an outlier.
Lower fence formula
Lower fence = Q1 - k * IQR.
Upper fence formula
Upper fence = Q3 + k * IQR.
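The fence formulas above can be sketched as follows. This is a minimal illustration, not the calculator's implementation: the function names are hypothetical, the sample data is made up, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption. Other quartile conventions give slightly different fences on small samples.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) for the multiplier k."""
    # quantiles(..., n=4) returns the three quartile cut points.
    # The "inclusive" convention is an assumption made for this sketch.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

def tukey_outliers(values, k=1.5):
    """Return the values that fall outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]

data = [2, 4, 4, 5, 5, 6, 7, 8, 30]  # made-up sample
print(tukey_fences(data))         # (-0.5, 11.5): Q1 = 4, Q3 = 7, range = 3
print(tukey_outliers(data))       # [30]
print(tukey_outliers(data, k=3))  # [30], so 30 is also an extreme outlier
```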
Z-score
The data should be roughly symmetrical. If the data are normally distributed, you can estimate the share of valid observations that will be flagged as outliers.
Usually, we use a Z-score threshold of k = 3, allowing three standard deviations from the average. In this case, if the data are normally distributed with no invalid outliers, 0.27% of the data will be flagged as outliers on average: p(z < -3) + p(z > 3) = 0.0027, where z follows the standard normal distribution, N(0,1).
Some people use a threshold of k = 2, allowing two standard deviations from the average. In this case, if the data are normally distributed with no invalid outliers, 4.55% of the data will be flagged as outliers on average.
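The 0.27% and 4.55% figures can be checked directly from the standard normal distribution. The sketch below uses the identity p(|z| > k) = erfc(k / sqrt(2)); the function name is hypothetical.

```python
import math

def expected_outlier_fraction(k):
    """Two-sided tail probability p(|z| > k) for a standard normal z."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3):
    print(k, round(expected_outlier_fraction(k) * 100, 2), "%")
# 2 4.55 %
# 3 0.27 %
```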
Lower = Average - k * S.
Upper = Average + k * S.
Here S is the standard deviation and k is the number of standard deviations allowed.
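A minimal sketch of these limits, under stated assumptions: S is taken to be the sample standard deviation (dividing by n - 1), which may differ from what the calculator uses, and the function names and sample data are made up for illustration.

```python
import statistics

def z_score_limits(values, k=3.0):
    """Return (lower, upper) = average -/+ k standard deviations."""
    average = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (assumption)
    return average - k * s, average + k * s

def z_score_outliers(values, k=3.0):
    """Return the values that fall outside the limits."""
    lower, upper = z_score_limits(values, k)
    return [v for v in values if v < lower or v > upper]

data = [2, 4, 4, 5, 5, 6, 7, 8, 30]  # made-up sample
print(z_score_outliers(data, k=2))  # [30]
print(z_score_outliers(data, k=3))  # []
```

In this small sample the value 30 is flagged with k = 2 but not with k = 3, because the extreme value itself inflates the average and the standard deviation used to build the limits.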
Outliers - Visual Identification
Univariate outliers: outliers in data that has only one dimension (a single variable).
Multivariate outliers: outliers in multi-dimensional data (several variables).
The outlier calculator identifies only univariate outliers.
For multivariate outliers you may use the following calculators:
1. Multiple linear regression - you may find the outliers in the 'Residual' column.
2. Cluster analysis - using the Silhouette method.