I am working on a classification dataset that is imbalanced, which means the output (dependent) variable in the dataset has an uneven class distribution.
The dataset is about diabetes and I downloaded it from Kaggle. It has 2000 instances with eight input variables and one output (target) variable. The target variable has two classes, 1 and 0: 1 means the person is diabetic and 0 means the person is not diabetic.
Out of 2000 instances, 1316 have outcome 0 (these people don't have diabetes) and 684 have outcome 1 (people with diabetes), so the dataset clearly contains more records of people without diabetes. A machine learning model trained on imbalanced data tends to favour the majority class: its overall accuracy can look acceptable while it performs poorly on the minority class. To reduce this effect, we rebalance the data so that the classes of the target variable are either equally distributed or split roughly 60% and 40%.
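Before resampling, it helps to confirm the class counts and shares. The following is a minimal sketch: since the Kaggle file is not available here, `outcome` is a stand-in vector built from the counts quoted above, not the real `diabetes$Outcome` column.

```r
# Stand-in for diabetes$Outcome, built from the counts quoted in the text
# (assumption: 1316 zeros and 684 ones; not read from the actual CSV).
outcome <- factor(c(rep(0, 1316), rep(1, 684)))

# Absolute number of instances per class
class_counts <- table(outcome)
print(class_counts)

# Share of each class as a percentage, rounded to one decimal place
class_share <- round(100 * prop.table(class_counts), 1)
print(class_share)
```

On the real data, the same two calls can be applied to `diabetes$Outcome` directly.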
I have imported the dataset into RStudio:
diabetes <- read.csv("diabetes-dataset.csv", sep = ",", header = TRUE)
Now, I am factorising the output variable, Outcome.
diabetes$Outcome <- as.factor(diabetes$Outcome)
Now, I have partitioned the dataset based on the output variable.
# holds instances where Outcome is 1
diabetes_true <- diabetes[diabetes$Outcome == 1, ]
# holds instances where Outcome is 0
diabetes_false <- diabetes[diabetes$Outcome == 0, ]
As the class with value 0 is over-represented, I have performed under-sampling on diabetes_false, as shown below.
Under-sampling simply means reducing the number of instances of the over-represented (majority) class.
# Under-sample the majority class to rebalance the outcome:
# keep 1026 randomly chosen rows and assign the result back
diabetes_false <- diabetes_false[sample(nrow(diabetes_false), 1026, replace = FALSE), ]
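The sample size of 1026 follows from the target 40/60 split: with 684 minority instances fixed, the majority class must be 1.5 times as large. A short sketch of that arithmetic, where `target_minority_share` is a name introduced here purely for illustration:

```r
n_minority <- 684              # diabetic instances (Outcome == 1), from the text
target_minority_share <- 0.40  # desired share of the minority class

# Solve n_minority / (n_minority + n_majority) = target_minority_share
# for n_majority
n_majority <- round(n_minority * (1 - target_minority_share) / target_minority_share)
print(n_majority)
```

Changing `target_minority_share` to 0.50 would give the size needed for an equal split instead.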
Finally, I have combined the two data frames, diabetes_true and diabetes_false. The classes are now distributed as 40% (684 diabetic) and 60% (1026 non-diabetic).
diabetes_final <- rbind(diabetes_true, diabetes_false)
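To check the whole procedure end to end, here is a hedged, self-contained sketch on synthetic data. The `toy` data frame and its `Glucose` column are made up for illustration; only the class sizes match the real dataset. `set.seed()` is an addition not present in the steps above, included so that the random subsample is reproducible.

```r
set.seed(42)  # fix the random draw so the subsample is reproducible

# Synthetic stand-in for the Kaggle data: only the class sizes match the text;
# the Glucose values are random filler.
toy <- data.frame(
  Glucose = round(runif(2000, 70, 200)),
  Outcome = factor(c(rep(0, 1316), rep(1, 684)))
)

# Partition by class
toy_true  <- toy[toy$Outcome == 1, ]
toy_false <- toy[toy$Outcome == 0, ]

# Keep 1026 randomly chosen majority rows, drawn without replacement
toy_false <- toy_false[sample(nrow(toy_false), 1026), ]

# Recombine and inspect the class shares after under-sampling
toy_final <- rbind(toy_true, toy_false)
print(prop.table(table(toy_final$Outcome)))
```

Note that `rbind()` leaves the rows grouped by class, so it may be worth shuffling them (for example with `toy_final[sample(nrow(toy_final)), ]`) before splitting into training and test sets.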
Visual representation of the class distribution of the output variable before and after under-sampling
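A comparable chart can be drawn with base R's `barplot()`. This is only a rough sketch using the counts quoted in the text, and it is not necessarily how the original figure was produced; the names `before` and `after` are introduced here for illustration.

```r
# Class counts before and after under-sampling, taken from the text
before <- c("0" = 1316, "1" = 684)
after  <- c("0" = 1026, "1" = 684)

# Draw the two bar charts side by side on a shared y-axis range
old_par <- par(mfrow = c(1, 2))
barplot(before, main = "Before under-sampling",
        xlab = "Outcome", ylab = "Instances", ylim = c(0, 1400))
barplot(after, main = "After under-sampling",
        xlab = "Outcome", ylab = "Instances", ylim = c(0, 1400))
par(old_par)  # restore the previous plotting layout
```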
Thank you and have a good day :)