
Detecting outlying patterns in Categorical Variables



Playing with data has always been amazing, continuous data most of all. Analysing categorical data, however, is a pain. Of course, there are different ways to handle categorical data, such as converting it to binary form, creating dummy columns, or factorizing it so that each category gets its own number. I wonder if it's only me, but these methods have never really helped when I get real data with many categorical variables and many categories in each of them.
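For reference, two of the encodings mentioned above can be sketched with pandas roughly as follows. The small `df` frame is made up for illustration and is not the data set used in this post.

```python
import pandas as pd

# Hypothetical toy data, not the data set used in this post.
df = pd.DataFrame({"Country": ["ALBANIA", "ANGOLA", "ALBANIA", "BAHRAIN"]})

# Dummy (one-hot) columns: one new column per distinct category.
dummies = pd.get_dummies(df["Country"], prefix="Country")

# Factorizing: each category gets an integer code in order of first appearance.
codes, categories = pd.factorize(df["Country"])

print(dummies.shape)     # (rows, number of distinct countries)
print(list(codes))       # integer code per row
print(list(categories))  # category behind each code
```

With three distinct countries this gives three dummy columns; with more than 100 categories it would give more than 100 columns, which is one reason these encodings get unwieldy on real data.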


Real data at work mostly comes with categorical variables, sometimes with more than 100 categories.


For example, say we are given the problem of finding the most uncommon patterns in the categorical variables.

Below is the data with country and country codes. 

AFGHANISTAN AFG
ALBANIA ALB
ALGERIA DZA
AMERICAN SAMOA ASM
ANDORRA AND
ANGOLA AGO
ANGUILLA AIA
BAHAMAS BHS
BAHRAIN BHR
BANGLADESH BGD
BARBADOS BRB
AFGHANISTAN AFG
ALBANIA ALB
ALGERIA DZA
AMERICAN SAMOA ASM
ANDORRA AND

Recently I was working on an outlier detection problem where I had to find the outlying patterns in the data. Current outlier detection algorithms are powerful at handling continuous data. But when it comes to finding the uncommon patterns in categorical variables, I never got the desired result from them. I spent many days trying the one-class SVM and k-means algorithms to find the outliers in my data. As a result, I was able to get the outlying continuous data, but not the uncommon patterns in the categorical variables.


Then, I decided to write my own code from scratch to get the most uncommon patterns in my data.

To explain my program, let's have a look at the data. The data is uploaded to my GitHub profile.

link to my github profile :     



Let's visualize the data to see its pattern. The bar chart below is made from a sample of the data.

The plot shows the frequency of each country and country code pattern.




From the plot, we can see that patterns like ‘COMOROS-COM’ and ‘CAMBODIA-KHM’ have a frequency of 2, whereas patterns like ‘BARBADOS-BRB’ and ‘AFGHANISTAN-AFG’ have a frequency of 1. As the graph was drawn on a sample of the data, some records may be missing from it. But that's fine. The graph is just to illustrate what the requirement is.

Our requirement is to find the most uncommon patterns i.e. the patterns with low frequency. 
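As a minimal sketch of that requirement, the snippet below counts how often each country and code pattern occurs and keeps the rare ones. It uses only a handful of rows copied from the sample listed earlier, and the `threshold` value of 1 is an assumed cut-off for this illustration.

```python
from collections import Counter

# A few (country, code) rows copied from the sample listed earlier in the post.
rows = [
    ("AFGHANISTAN", "AFG"),
    ("ALBANIA", "ALB"),
    ("ANGOLA", "AGO"),
    ("AFGHANISTAN", "AFG"),
    ("ALBANIA", "ALB"),
    ("BARBADOS", "BRB"),
]

# Join each row into one pattern string and count how often it occurs.
pattern_counts = Counter("-".join(row) for row in rows)

# Patterns seen no more than `threshold` times are treated as uncommon.
threshold = 1  # assumed cut-off for this illustration
uncommon = sorted(p for p, n in pattern_counts.items() if n <= threshold)
print(uncommon)  # ['ANGOLA-AGO', 'BARBADOS-BRB']
```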

This problem is entirely based on the categorical data. 

Now here comes my solution. 

I have used the Python program below to get the outlying patterns.


import pandas as pd
import matplotlib.pyplot as plt

# Load the data; the plot below also expects a 'Serial' column in the file.
file = pd.read_csv("/Users/maitree/outlier_detection/countrycode.csv")
data = file[['Country', 'Countrycode']]


def outlie(data, threshold=1):
    # Treat every value as a string so the columns can be joined into one pattern.
    data = data.astype(str)
    concat = data.apply(lambda row: ''.join(row), axis=1)
    # Number of records that share each record's pattern.
    count = concat.map(concat.value_counts())
    # 1 = normal (count above the threshold), -1 = outlier.
    result = count.apply(lambda x: 1 if x > threshold else -1)
    return concat, count, result


file['concat'], file['count'], file['outlier'] = outlie(data, threshold=1)

normal = file[file['outlier'] == 1]
abnormal = file[file['outlier'] == -1]

# Green triangles are normal records, red circles are outliers.
plt.figure()
plt.plot(normal['Serial'], normal['count'], 'g^')
plt.plot(abnormal['Serial'], abnormal['count'], 'ro')
plt.xlabel('Serial')
plt.ylabel('count')
plt.show()


Let's visualize the result. The plot below shows the index against the count of each record's pattern. The red points are the abnormal points, where the count of the pattern is 1 (this 1 is the threshold limit I have used in my program).


I have also created a reusable module which needs to be fed all the categorical variables and the threshold limit that we need. This reusable module is available on my GitHub profile. It can be used with as many categorical variables as we have.


Path for the reusable module:

https://github.com/Maitree1986/outlier_detection/blob/master/outlie.py
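The linked module is not reproduced here. As a rough sketch of how the same idea might extend to any number of categorical columns, the example below counts each combination of values and flags the rare ones. The function name `flag_rare_patterns`, its parameters, and the toy data (including the deliberately mismatched code) are hypothetical and are not taken from the repository.

```python
import pandas as pd


def flag_rare_patterns(df, columns, threshold=1):
    """Hypothetical helper: label each row 1 (normal) or -1 (uncommon)."""
    # Compare values as strings, as the program above does.
    keys = df[columns].astype(str)
    # Number of rows sharing the same combination of values in `columns`.
    counts = keys.groupby(columns)[columns[0]].transform("size")
    # Combinations seen more than `threshold` times are normal, the rest uncommon.
    labels = counts.gt(threshold).map({True: 1, False: -1})
    return df.assign(count=counts, outlier=labels)


# Made-up toy data; the last row pairs a country with an unexpected code.
toy = pd.DataFrame({
    "Country": ["ALBANIA", "ALBANIA", "ANGOLA", "ALBANIA"],
    "Countrycode": ["ALB", "ALB", "AGO", "ALG"],
})
result = flag_rare_patterns(toy, ["Country", "Countrycode"], threshold=1)
print(result)
```

Passing a longer list of column names would count combinations across all of them, at the cost of patterns becoming rarer as more columns are added.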









Comments

  1. It's an intriguing question really. I didn't understand the code above. What was the algorithm used?

