How to Do KNN with Python & scikit-learn

KNN, or K-nearest neighbors, is one of the easiest and most popular machine learning algorithms available to data scientists and machine learning enthusiasts.

In this post, we are going to implement a KNN model with Python and the scikit-learn library.

You can also implement KNN in R, but that is beyond the scope of this post.

I am not going to discuss the under-the-hood concepts of KNN here; I will only demonstrate the implementation.

If you want to learn more about KNN, you can visit here.

Also, if you want to learn more about the scikit-learn library, which I am using here, click here.

Further, if you are interested in implementing logistic regression using Azure ML Studio, click here.

Alright, so let’s start with the KNN implementation in Python with the scikit-learn library.

KNN with Python & scikit-learn

 

You can download the dataset from https://www.kaggle.com/ntnu-testimon/paysim1

Let’s import a few libraries that are needed for almost any machine learning work in Python and Jupyter Notebook.
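
The original import cell is not shown, so here is a minimal sketch of what it might contain, assuming the usual NumPy, pandas, matplotlib and seaborn stack that this post relies on. The aliases are the conventional ones, not something taken from the original notebook.

```python
# Core libraries for data handling and plotting.
# In a Jupyter notebook you may also want the magic "%matplotlib inline"
# (left as a comment here because it is not valid plain Python).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Record the versions in use; handy when reproducing results later.
versions = {
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "seaborn": sns.__version__,
}
print(versions)
```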

Let’s load the dataset using pandas.

Now let’s review the head to verify the dataset.
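
Here is a sketch of the load-and-inspect step. The name pay_data and the helper load_pay_data are my assumptions (the original cell is missing), and a few rows copied from this post stand in for the Kaggle CSV so the snippet runs without the download.

```python
import io

import pandas as pd

# A few rows copied from this post, standing in for the real CSV.
SAMPLE_CSV = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
1,PAYMENT,11668.14,C2048537720,41554.0,29885.86,M1230701703,0.0,0.0,0,0
"""


def load_pay_data(source):
    """Read the transactions from a CSV path or a file-like object."""
    return pd.read_csv(source)


# With the real download, pass the path of the CSV file instead of the sample.
pay_data = load_pay_data(io.StringIO(SAMPLE_CSV))

print(pay_data.head())   # first five rows
print(pay_data.shape)    # (rows, columns)
```

On the full data set the same two calls give the head and the shape reported in this post.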

step  type      amount   nameOrig     oldbalanceOrg  newbalanceOrig  nameDest     oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud
1     PAYMENT   9839.64  C1231006815  170136         160296          M1979787155  0               0               0        0
1     PAYMENT   1864.28  C1666544295  21249          19384.7         M2044282225  0               0               0        0
1     TRANSFER  181      C1305486145  181            0               C553264065   0               0               1        0
1     CASH_OUT  181      C840083671   181            0               C38997010    21182           0               1        0
1     PAYMENT   11668.1  C2048537720  41554          29885.9         M1230701703  0               0               0        0


(6362620, 11)

There are 6,362,620 rows (observations) and 11 columns (variables).

Let’s drop the unnecessary columns.
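
Comparing the table headers in this post before and after this step, the columns that disappear are nameOrig, nameDest and isFlaggedFraud. A minimal sketch of the drop, with two illustrative rows and the assumed name pay_data, could look like this:

```python
import pandas as pd

# Two illustrative rows with the original 11 columns (values from this post).
pay_data = pd.DataFrame({
    "step": [1, 1],
    "type": ["PAYMENT", "TRANSFER"],
    "amount": [9839.64, 181.0],
    "nameOrig": ["C1231006815", "C1305486145"],
    "oldbalanceOrg": [170136.0, 181.0],
    "newbalanceOrig": [160296.36, 0.0],
    "nameDest": ["M1979787155", "C553264065"],
    "oldbalanceDest": [0.0, 0.0],
    "newbalanceDest": [0.0, 0.0],
    "isFraud": [0, 1],
    "isFlaggedFraud": [0, 0],
})

# Drop the two account identifiers and the isFlaggedFraud column.
pay_data = pay_data.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
print(pay_data.columns.tolist())
```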

Let’s divide our data set into two parts:

  1. Dataset where isFraud == 1. We will call it pay_data_fraud.
  2. Dataset where isFraud == 0. We will call it pay_data_nofraud.
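
The split described in the list can be sketched with boolean masks. The small frame here is only a stand-in for the real data, and pay_data is an assumed name for it.

```python
import pandas as pd

# Stand-in for the full data set (a few rows and columns from this post).
pay_data = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "isFraud": [0, 0, 1, 1, 0],
})

# Rows flagged as fraud and rows that are not.
pay_data_fraud = pay_data[pay_data["isFraud"] == 1]
pay_data_nofraud = pay_data[pay_data["isFraud"] == 0]

print(pay_data_fraud.head())
print(pay_data_nofraud.head())
```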

First rows of pay_data_fraud:

step  type      amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud
1     TRANSFER  181     181            0               0               0               1
1     CASH_OUT  181     181            0               21182           0               1
1     TRANSFER  2806    2806           0               0               0               1
1     CASH_OUT  2806    2806           0               26202           0               1
1     TRANSFER  20128   20128          0               0               0               1

First rows of pay_data_nofraud:

step  type     amount    oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud
1     PAYMENT  9839.64   170136         160296.36       0               0               0
1     PAYMENT  1864.28   21249          19384.72        0               0               0
1     PAYMENT  11668.14  41554          29885.86        0               0               0
1     PAYMENT  7817.71   53860          46042.29        0               0               0
1     PAYMENT  7107.77   183195         176087.23       0               0               0

Let’s check the dimensions of the fraud and no-fraud data sets.

This is a huge data set, and any model will take a long time to run on it. KNN in particular has to compare each new point against the training points, so let’s reduce the data to a size that runs in reasonable time. We have over 8,000 rows in the fraud data set, so let’s take twice that many from the no-fraud data set.
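
One way to do this reduction is a random sample of the no-fraud rows. Whether the original notebook sampled randomly or simply took the first rows is not shown, so treat the use of sample and the random_state value as my assumptions; the synthetic frames only stand in for the real ones.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins: 10 fraud rows and 100 no-fraud rows.
rng = np.random.default_rng(0)
pay_data_fraud = pd.DataFrame({"amount": rng.uniform(100, 10_000, 10), "isFraud": 1})
pay_data_nofraud = pd.DataFrame({"amount": rng.uniform(100, 10_000, 100), "isFraud": 0})

print(pay_data_fraud.shape, pay_data_nofraud.shape)

# Keep twice as many no-fraud rows as there are fraud rows
# (never more than are available).
n_keep = min(len(pay_data_nofraud), 2 * len(pay_data_fraud))
pay_data_nofraud = pay_data_nofraud.sample(n=n_keep, random_state=42)

print(pay_data_nofraud.shape)
```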

Let’s join these two data sets to build the data set that we will use to run the KNN model.
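
A sketch of the join using pandas concat. The name pay_data_knn is hypothetical, and the rows are a small stand-in for the real frames.

```python
import pandas as pd

pay_data_fraud = pd.DataFrame({"amount": [181.0, 2806.0], "isFraud": [1, 1]})
pay_data_nofraud = pd.DataFrame({
    "amount": [9839.64, 1864.28, 11668.14, 7817.71],
    "isFraud": [0, 0, 0, 0],
})

# Stack the two frames; ignore_index=True gives a clean 0..n-1 index,
# which avoids misaligned rows when new columns are attached later.
pay_data_knn = pd.concat([pay_data_fraud, pay_data_nofraud], ignore_index=True)

print(pay_data_knn["isFraud"].value_counts())
```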

Let’s draw a few charts to understand the data better.

We are going to use seaborn and matplotlib in this analysis.
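
The plotting code from the original notebook is not available, so the two charts in this sketch are my own choice of examples, not the author's: a count of transactions per type split by isFraud, and a box plot of amount by isFraud.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small stand-in for the combined data set.
pay_data = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14, 2806.0],
    "isFraud": [0, 0, 1, 1, 0, 1],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# How many transactions of each type are fraudulent?
sns.countplot(data=pay_data, x="type", hue="isFraud", ax=axes[0])
axes[0].set_title("Transactions by type")

# How do amounts differ between fraud and no-fraud rows?
sns.boxplot(data=pay_data, x="isFraud", y="amount", ax=axes[1])
axes[1].set_title("Amount by isFraud")

fig.tight_layout()
```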

[Figures: seaborn plots 1 to 5]

 

scikit-learn estimators expect numeric input, so we cannot fit KNN directly on the categorical columns in our data set (here, the type column). We will use label encoding and one-hot encoding to resolve this issue.

But first, we need to import the label encoder and one-hot encoder modules into our session.
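
Here is a hedged sketch of that encoding step using LabelEncoder and OneHotEncoder from sklearn.preprocessing. The exact code of the original notebook is not shown; the type_0 to type_4 column names follow the output in this post, and the sample rows are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Stand-in rows covering the five transaction types in the data set.
pay_data = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "DEBIT", "PAYMENT"],
    "amount": [9839.64, 181.0, 181.0, 500.0, 250.0, 1864.28],
    "isFraud": [0, 1, 1, 0, 0, 0],
})

# Step 1: map each category to an integer code (classes are sorted alphabetically).
label_encoder = LabelEncoder()
type_codes = label_encoder.fit_transform(pay_data["type"])

# Step 2: expand the integer codes into one 0/1 column per category.
onehot_encoder = OneHotEncoder()
type_matrix = onehot_encoder.fit_transform(type_codes.reshape(-1, 1)).toarray()

# Step 3: attach the new columns, reusing pay_data's index so rows stay aligned.
type_columns = [f"type_{i}" for i in range(type_matrix.shape[1])]
type_df = pd.DataFrame(type_matrix.astype(int), columns=type_columns, index=pay_data.index)
pay_data = pd.concat([pay_data.drop(columns="type"), type_df], axis=1)

print(pay_data.head())
```

If you only need the dummy columns, pandas get_dummies does the same job in a single call.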

step  amount    oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  isFraud  type_0  type_1  type_2  type_3  type_4
1     9839.64   170136         160296.36       0               0               0        0       0       0       0       1
1     1864.28   21249          19384.72        0               0               0        0       1       0       0       0
1     181       181            0               0               0               1        0       0       0       0       1
1     181       181            0               21182           0               1        0       1       0       0       0
1     11668.14  41554          29885.86        0               0               0        0       0       0       0       1

Let’s check the data set for missing values.

Every column reported as True in this check contains missing values. For this analysis, we will simply replace the missing values with 0, but you can explore other ideas.
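
A minimal sketch of the check and the fill, on a toy frame with a couple of gaps. The frame and the name has_missing are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two columns.
pay_data = pd.DataFrame({
    "amount": [181.0, np.nan, 2806.0],
    "oldbalanceDest": [0.0, 21182.0, np.nan],
    "isFraud": [1, 1, 1],
})

# True for every column that has at least one missing value.
has_missing = pay_data.isnull().any()
print(has_missing)

# Simplest possible treatment: replace every missing value with 0.
pay_data = pay_data.fillna(0)
```

Filling with 0 is quick but crude; a column median or dropping the affected rows are common alternatives worth comparing.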

In this KNN model:

  • The dependent variable is isFraud.
  • It contains the values 0 and 1.
  • The remaining columns are the independent variables.
  • We create ind_var for the independent variables and dep_var for the dependent variable.
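
The variables in the list above can be created like this. The frame is a small stand-in, and pay_data is an assumed name for the prepared data set.

```python
import pandas as pd

pay_data = pd.DataFrame({
    "amount": [9839.64, 181.0, 181.0],
    "oldbalanceOrg": [170136.0, 181.0, 181.0],
    "type_1": [0, 0, 1],
    "type_4": [1, 1, 0],
    "isFraud": [0, 1, 1],
})

# Target: the isFraud column (0 or 1).
dep_var = pay_data["isFraud"]

# Features: every other column.
ind_var = pay_data.drop(columns="isFraud")

print(ind_var.shape, dep_var.shape)
```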

Let’s import train_test_split from sklearn and create the training and testing data sets.
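
The split and the model-fitting cells are missing from this page, so here is a sketch of how they might look. The test_size, random_state and n_neighbors values are my assumptions, not the settings used for the results in this post, and the synthetic data only stands in for ind_var and dep_var.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for ind_var / dep_var: fraud rows get larger values.
rng = np.random.default_rng(0)
n_rows = 300
is_fraud = rng.integers(0, 2, size=n_rows)
ind_var = pd.DataFrame({
    "amount": rng.normal(1000, 200, n_rows) + 1500 * is_fraud,
    "oldbalanceOrg": rng.normal(5000, 1000, n_rows) + 3000 * is_fraud,
})
dep_var = pd.Series(is_fraud, name="isFraud")

# Hold out 30% of the rows for testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    ind_var, dep_var, test_size=0.3, random_state=42, stratify=dep_var
)

# Fit KNN with 5 neighbours and score it on the held-out rows.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy:.3f}")
```

Because KNN works on distances, features with large ranges dominate the result, so scaling the columns (for example with StandardScaler) before fitting is worth trying on the real data.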

KNN Model Summary

This KNN model shows high accuracy, but it is a very basic model. If you want to explore KNN further, I suggest trying different techniques for handling missing values and different model parameters, such as the number of neighbors.

Please leave a comment and let me know your feedback.

About akhilendra

Hi, I’m Akhilendra and I write about Business Analysis, Data Science, IT & Web. Join me on Twitter, Facebook & Google+
