Of Data Sets and Their Effects on Classification

Recently, I was bogged down by a lot of questions regarding the performance of classification algorithms. I was wondering how to tell apart a good data set from a bad one. Some say the more samples you have, the better the performance. Yet certain critical fields, like precision medicine, have limited data. How do we go about it then? The thought that came to my mind (at least for supervised learning) was that we should be asking whether the data set has an equal population from all the output classes.

Following this chain of thought, I realized that not all real-world data sets can be expected to be so “nicely behaved”. So do we prune the data set to have equal class populations (losing some information in the process), or keep it as it is to glean the maximum information? This seems to me the classic problem of exploration vs. exploitation. Depending on which side of the table you are on, you would want either the maximum possible accuracy or the maximum possible information about the predictor-class relationship. Do we get both at the same time?
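For concreteness, here is a minimal sketch of what the pruning option might look like, assuming plain NumPy arrays; the helper name `downsample_to_balance` is my own invention, not a library function.

```python
import numpy as np

def downsample_to_balance(X, y, rng=None):
    """Hypothetical helper: prune every class down to the size of the
    rarest one, discarding information in the process."""
    if rng is None:
        rng = np.random.default_rng()
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    # For each class, keep a random subset of size n_min.
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]
```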

To delve deeper, I did a small classification project. I first simulated a data set: the class labels were generated from a binomial distribution (20 samples), and the two predictors were random numbers generated from normal distributions with standard deviation 1 and class means at (0,0) and (1,1). Under the assumption of equal covariance matrices, I superimposed the class boundaries of Linear Discriminant Analysis (LDA) and of the optimal Bayes classifier on the sample point distribution. The graph can be seen below.

[Figure: optimal_lda — LDA and optimal Bayes class boundaries superimposed on the simulated sample points]
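A minimal sketch of how such a simulation might be set up, using scikit-learn’s LinearDiscriminantAnalysis (the seed and variable names here are illustrative choices, not my original code):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# 20 samples: class labels drawn from a binomial (Bernoulli) distribution,
# then two predictors drawn from N(mean, 1) with the class mean at
# (0,0) for class 0 and (1,1) for class 1.
n = 20
y = rng.binomial(1, 0.5, size=n)
means = np.array([[0.0, 0.0], [1.0, 1.0]])
X = means[y] + rng.normal(scale=1.0, size=(n, 2))

# LDA estimates the class means and the pooled covariance from the sample.
lda = LinearDiscriminantAnalysis().fit(X, y)
w, b = lda.coef_[0], lda.intercept_[0]
print(f"LDA boundary:   {w[0]:.3f}*x1 + {w[1]:.3f}*x2 + {b:.3f} = 0")

# The optimal Bayes classifier uses the TRUE parameters instead. For equal
# priors and common covariance sigma^2 * I, its boundary is the perpendicular
# bisector of the segment joining the true means: x1 + x2 = 1.
print("Bayes boundary: x1 + x2 - 1 = 0")
```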

I surmised that LDA estimates the mean and covariance from the given sample points, whereas the optimal classifier knows the true distribution; hence the difference between the two boundaries. I also did the exercise of varying the number of sample points and observing its effect on the error estimate. The graph is attached below.

[Figure: error-vs-sample-set-size — estimated error rate as a function of the number of samples]
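A sketch of that experiment under the same assumed setup: train LDA on increasingly large samples and estimate the error rate on a large held-out test set (the sizes below are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [1.0, 1.0]])

def simulate(n):
    """Draw n samples from the two-class Gaussian mixture described above."""
    y = rng.binomial(1, 0.5, size=n)
    X = means[y] + rng.normal(scale=1.0, size=(n, 2))
    return X, y

# A large held-out test set gives a stable estimate of the error rate.
X_test, y_test = simulate(100_000)

for n in [20, 50, 100, 500, 1000]:
    X_train, y_train = simulate(n)
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    err = 1.0 - lda.score(X_test, y_test)
    print(f"n = {n:5d}  test error = {err:.3f}")
```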

To put it simply, there was no apparent trend. Normally, one would expect the error rate to decrease as the data set grows. One reason for this anomaly might be that the sample sizes are still within the range of what people call “small data sets”. Another is that the classes are not well separated: the Euclidean distance between their means is small relative to their variance.
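The “not well separated” point can be made precise. For two equal-prior Gaussian classes with common covariance σ²I, even the optimal Bayes classifier errs at rate Φ(−Δ/2), where Δ is the distance between the means in units of σ. Plugging in the parameters of the simulation above:

```python
import numpy as np
from scipy.stats import norm

# Two equal-prior Gaussian classes with common covariance sigma^2 * I:
# the Bayes error rate is Phi(-delta / 2), where delta is the distance
# between the class means measured in units of sigma.
mu0, mu1, sigma = np.array([0.0, 0.0]), np.array([1.0, 1.0]), 1.0
delta = np.linalg.norm(mu1 - mu0) / sigma    # sqrt(2) ~ 1.414
bayes_error = norm.cdf(-delta / 2)           # ~ 0.24
print(f"Bayes error rate: {bayes_error:.3f}")
```

An irreducible error of roughly 24% leaves little headroom for improvement, so sampling noise can easily mask any downward trend in the estimated error.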

Such seemingly theoretical excursions into the quality of data sets, from the perspective of classification performance, are necessary. In the current AI landscape, I believe that approaches to solving real-world challenges should look at the “conditioning” of the available data alongside algorithmic advances. I will stop here for now. In the follow-up post I will discuss working with a real-world data set. Hint: you might want to look up Materials Informatics!
