Data Science Basics — How to find an Outlier?
If you have spent any time working with data, you have almost certainly come across outliers. But what exactly is an outlier?
Wikipedia definition:
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error.
Clearly, an outlier is something that doesn't match with most of our observations. For example, let's say we are measuring the heights of students in a class. Consider the data as follows:
heights = [150,158,156,162,147,164,157,256,150]
When we plot these values, we find out that something seems off!
The entry 256 stands out. Assuming the heights are recorded in centimetres, that is more than 8 feet, so it is almost certainly a mistake. This is the outlier in our dataset. Let us see what can cause an outlier.
Possible causes for Outliers
- Variance in our data: Sometimes, the outliers are genuine values that are simply extreme cases. For example, when we measure the price of tickets for a football game, some people may have paid far more than the rest, perhaps for premium seats or last-minute reservations. These observations are genuine but do not represent the majority of the data.
- Errors in measurement, data entry, or sampling: This is quite self-explanatory. Maybe while noting down the heights of the students, the examiner accidentally wrote 256 instead of 156.
Visualizing an Outlier
A common way to spot outliers is to plot the data points and look for values that sit far from the rest.
Let us take a common data set as our example. We use the train.csv dataset from the Kaggle Competition — Titanic: Machine Learning from Disaster.
Loading train.csv into a pandas data frame (df) gives one row per passenger, with columns such as Pclass, Age and Fare.
We pick the "Fare" feature from this dataset and visualize its values to look for outliers.
First, let us store the values of "Fare" in a variable X and find out how many values we have.
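import pandas as pd
df = pd.read_csv("train.csv")  # Kaggle Titanic training set, assumed to be in the working directory
X = df["Fare"].values
# Finding the length of X
len(X)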
We have 891 values in “Fare”. Now let us try to visualize these values.
1. Scatter Plot
One of the easiest ways to visualize the data is by using a Scatter Plot. It will show us what range most of our data lies in and if there are any extreme values.
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.scatter(range(1, len(X) + 1), X)
plt.xlabel("Passenger Number")
plt.ylabel("Fare")
plt.title("Fare Value for each passenger")
plt.show()
Observation: Some passengers paid a very high price for their ticket. Most fares lie between 0 and 300, but a few passengers paid over 500. This could be an error in recording the fare, or these passengers may simply have paid for a more premium ticket.
2. Box Plot
Perhaps the easiest way to visualize outliers is through plotting a box plot. It will show us the distribution of the values in “Fare” and also plot the outliers.
import seaborn as sns
plt.figure(figsize=(12,6))
sns.boxplot(x=X)
plt.xlabel("Fare")
plt.title("Boxplot for Fare")
plt.show()
We observe that even values close to 100 are being marked as outliers. This is because the box plot detects outliers using the interquartile range (IQR), which we explore in the coming paragraphs.
So there are different ways to look for outliers, but visualization alone cannot tell us exactly which points are outliers and which are not. Let us explore two simple ways to identify the outlier points in our data.
Finding an Outlier
Two ways to find an outlier are:
1. Using Z-Score
We can identify which values are outliers using the z-score. The z-score is calculated as
z = (observation - mean) / (standard deviation)
or
z = (X - μ) / σ
In statistics, the z-score tells us how many standard deviations a value lies from the mean. A common rule of thumb is to treat a value as an outlier if it lies more than 3 standard deviations from the mean, that is, if |z| > 3.
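As a small worked example of the formula, the sketch below computes z-scores for a made-up sample (the numbers are illustrative and not taken from the Titanic data). The sample has mean 5 and population standard deviation 2, so the value 9 sits exactly 2 standard deviations above the mean.

```python
import numpy as np

# Made-up sample for illustration only (not from the Titanic data)
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])

mu = sample.mean()     # mean of the sample: 5.0
sigma = sample.std()   # population standard deviation (ddof=0): 2.0

# z = (X - mu) / sigma, applied to every value at once
z = (sample - mu) / sigma
print(z)
```

None of these values would count as an outlier under the |z| > 3 rule, since the largest absolute z-score here is 2.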
There are two ways to calculate the z-score for the data:
- Using the function zscore() from scipy
import numpy as np
from scipy import stats
z_score = stats.zscore(X)
outliers = X[np.abs(z_score) > 3]
print(outliers)
Here, we calculate the z-score for each value using the built-in function from the scipy library, then keep every point whose absolute z-score exceeds 3 in outliers. These are the observations that lie more than 3 standard deviations from the mean. Printing outliers shows the flagged fares.
- Using the formula: z-score = (observation - mean) / (standard deviation)
We can also calculate the z-score for each value manually using the formula:
z-score = (X - μ) / σ
zscore = []
mean = np.mean(X)
std = np.std(X)  # population standard deviation (ddof=0)
for value in X:
    zscore.append((value - mean) / std)
out = X[np.abs(zscore) > 3]
print(out)
Here, we manually calculate the z-score for each value using the formula and then store the values whose absolute z-score exceeds 3 in the variable out.
We can see that the output is the same for both methods!
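The two methods agree because np.std and scipy's zscore() both default to the population standard deviation (ddof=0). As a rough sketch, the manual loop can also be written as a single array operation. The helper name zscore_outliers and the sample data below are made up for illustration and are not part of numpy or scipy.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    Hypothetical helper for illustration; not part of numpy or scipy.
    """
    values = np.asarray(values, dtype=float)
    # np.std defaults to the population standard deviation (ddof=0)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Made-up data: fifty values of 10 and a single extreme value of 100
data = np.array([10.0] * 50 + [100.0])
flagged = zscore_outliers(data)
print(flagged)
```

With the Titanic data, calling zscore_outliers(X) should give the same values as the loop above, since it applies the same formula.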
Visualizing which points are being considered as outliers:
If we use the z-score method to find outliers, exactly which points from this dataset will be considered outliers? We can visualize this with a scatter plot in the following way:
We first find the smallest of these outliers using the min() function, which gives 211.3375. Every flagged fare lies at or above this value and the z-score grows with the fare, so any point at or above it is also an outlier. We therefore split our scatter plot into two sections at this value.
plt.figure(figsize=(10,8))
plt.scatter(range(1, len(X) + 1), X)
plt.axhline(y=out.min(), color='red', linewidth=2)  # smallest flagged fare (211.3375)
plt.xlabel("Passenger Number")
plt.ylabel("Fare")
plt.title("Fare Value for each passenger")
plt.show()
All the points on or above the red line are being considered as outliers. Let us see how this compares with our second method, which finds outliers using the interquartile range.
2. Using the Interquartile Range
The interquartile range is defined as the difference between the 75th and 25th percentiles, or between the upper and lower quartiles.
We define:
Q1 = 25th percentile of our data
Q3 = 75th percentile of our data
IQR = Q3 - Q1
Using the IQR, an outlier can be found in the following way:
Any value x such that x < Q1 - (1.5*IQR) OR x > Q3 + (1.5*IQR)
is considered an outlier.
This is the same rule a box plot uses by default to mark its outliers!
We can find the values of Q1 and Q3 using the following code:
Q1,Q3 = np.quantile(X,[.25,.75])
print("Q1 : ",Q1)
print("Q3 : ",Q3)
We can now use the formula IQR = Q3 - Q1 to find the value of the IQR:
IQR = Q3 - Q1
print("Interquartile Range (IQR) : ",IQR)
To find the outliers, we set the upper bound as Q3 + (1.5*IQR) and the lower bound as Q1 - (1.5*IQR). Any points lying outside this range will be treated as outliers.
# Defining the lower and upper bound for our data:
lb = Q1 - 1.5*IQR
ub = Q3 + 1.5*IQR
print("Lower Bound : ",lb)
print("Upper Bound : ",ub)
Now, we can run a loop and collect the values that lie outside this range.
outs = []
for x in X:
    if x < lb or x > ub:
        outs.append(x)
print(outs)
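The same filter can be written without a loop by using a boolean mask. The sketch below is one possible way to do it. The helper name iqr_outliers, its parameter k, and the sample values are assumptions made for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; not part of numpy.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Boolean mask: True where a value falls outside the bounds
    mask = (values < lower) | (values > upper)
    return values[mask]

# Made-up sample for illustration only
sample = [5, 7, 8, 10, 12, 15, 30, 250]
print(iqr_outliers(sample))
```

For this sample, Q1 is 7.75 and Q3 is 18.75, so the bounds are -8.75 and 35.25. The value 30 stays inside them and only 250 is flagged.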
Clearly, we get many more outliers in this case compared to the Z-score method. Let us try to visualize it and see exactly what values from the data are being considered as outliers.
plt.figure(figsize=(10,8))
plt.scatter(range(1, len(X) + 1), X)
plt.axhline(y=ub, color='red', linewidth=2)  # upper bound (65.6344)
plt.xlabel("Passenger Number")
plt.ylabel("Fare")
plt.title("Fare Value for each passenger")
plt.show()
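Any point above the red line is being treated as an outlier. Fares cannot be negative and the lower bound works out below zero for this data, so only the upper bound needs to be drawn.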
Based on our data, we are free to choose the way we want to classify our outliers. Clearly, using the Z-score may give different results compared to using the IQR. Our data and the type of model we are making will decide which method would be a good choice.
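To see how much the two methods can disagree, here is a small sketch that applies both rules to the heights from the start of this post. In such a small sample, the extreme value inflates the mean and the standard deviation, so its z-score stays below 3 (about 2.79), while the IQR rule (bounds 132 and 180) still flags it.

```python
import numpy as np

# Heights from the example at the start of the post
heights = np.array([150, 158, 156, 162, 147, 164, 157, 256, 150])

# Z-score rule: flag values with |z| > 3
z = (heights - heights.mean()) / heights.std()
z_flagged = heights[np.abs(z) > 3]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.quantile(heights, [0.25, 0.75])
iqr = q3 - q1
iqr_flagged = heights[(heights < q1 - 1.5 * iqr) | (heights > q3 + 1.5 * iqr)]

print("Largest |z|:", np.abs(z).max())
print("Z-score outliers:", z_flagged)
print("IQR outliers:", iqr_flagged)
```

Here the z-score rule returns nothing and the IQR rule returns 256, which suggests the IQR rule may be the safer choice for very small samples.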
You can check out the code related to this post on my GitHub: