Hold Out Method & Random Sub-Sampling Method

#machinelearning #sampling #holdout #randomsubsampling

Hajare Akash Feb 10 2021 · 4 min read

Hold-Out Method

The hold-out method is the simplest of the cross-validation (CV) techniques.

But why do we need it?

Suppose you train a model on a given dataset using some algorithm. You then measure the accuracy of the trained model on the same training data and find it to be 95%, or maybe even 100%. What does this mean? Is your model ready for prediction?

The answer is no. Why?

Because your model was trained on that very data: it has already seen every record and has fitted them closely, so a high score on the same records says little about how well it generalizes. When you ask it to predict on a new set of data, it may well give much worse accuracy, because it has never seen that data before and has learned quirks of the training set rather than the general pattern. This is the problem of overfitting.

To tackle this problem, the hold-out method comes into the picture. It is a resampling technique whose basic idea is to divide the available dataset into two parts: train and test. On the first part (train) you train the model; on the second part (test), which the model has not seen, you make predictions and check how well the model works. If the model achieves good accuracy on the test data, it has not overfitted the training data and can be trusted for prediction. If it performs poorly, the model is not to be trusted and we need to revise it, for example by changing the algorithm.
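The gap between accuracy on the training data and accuracy on unseen data can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic one-feature data with 30% label noise, the 1-nearest-neighbour classifier (chosen because it memorizes its training data), and all function names.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) whose label is predicted correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(ys)

rng = random.Random(0)

def make_point():
    # Illustrative rule: class 1 if x > 0.5, with 30% of the labels flipped.
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.3:
        y = 1 - y
    return x, y

points = [make_point() for _ in range(300)]
train, test = points[:200], points[200:]
train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

# Scoring on the training data: every point is its own nearest neighbour.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# Scoring on data the model has never seen.
test_acc = accuracy(train_x, train_y, test_x, test_y)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {test_acc:.2f}")
```

The training score is perfect only because the model is scored on records it has memorized; the score on the unseen records is the one that says something about generalization.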

This is how we proceed with the hold-out method:

1.  In the first step, we randomly divide our available data into two subsets: a training set and a test set. Setting test data aside is our work-around for dealing with the imperfections of a non-ideal world, such as limited data and resources, and the inability to collect more data from the generating distribution. Here, the test set shall represent new, unseen data to our learning algorithm; it’s important that we only touch the test set once to make sure we don’t introduce any bias when we estimate the generalization accuracy. Typically, we assign 2/3 of the data to the training set and 1/3 to the test set. Other common training/test splits are 60/40, 70/30, 80/20, or even 90/10.

2.  With our test samples set aside, we pick a learning algorithm that we think could be appropriate for the given problem. Now, what about the hyperparameter values depicted in the figure above? As a quick reminder, hyperparameters are the parameters of our learning algorithm, or meta-parameters if you will. We have to specify these hyperparameter values manually; unlike the actual model parameters, the learning algorithm doesn’t learn them from the training data.

Since hyperparameters are not learned during model fitting, we need some sort of “extra procedure” or “external loop” to optimize them separately – this holdout approach is ill-suited for the task. So, for now, we have to go with some fixed hyperparameter values – we could use our intuition or the default parameters of an off-the-shelf algorithm if we are using an existing machine learning library.

3.  Our learning algorithm fit a model in the previous step. The next question is: How “good” is the model that it came up with? That’s where our test set comes into play. Since our learning algorithm hasn’t “seen” this test set before, it should give us a pretty unbiased estimate of the model’s performance on new, unseen data! So, what we do is take this test set and use the model to predict the class labels. Then, we take the predicted class labels and compare them to the “ground truth,” the correct class labels, to estimate the model’s generalization accuracy.

4.  Finally, we have an estimate of how well our model performs on unseen data.
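The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the synthetic two-class data, the 70/30 split, the toy threshold classifier standing in for the learning algorithm, and every function name are hypothetical choices, not part of the method itself.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Step 1: shuffle the records and split them into train and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """Step 2: a toy learner with no tunable hyperparameters.
    It places a threshold halfway between the two class means."""
    class0 = [x for x, y in train if y == 0]
    class1 = [x for x, y in train if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """Predict class 1 for values above the threshold, class 0 otherwise."""
    return int(x > threshold)

def accuracy(threshold, data):
    """Step 3: compare the predicted labels with the ground-truth labels."""
    return sum(predict(threshold, x) == y for x, y in data) / len(data)

# Synthetic data (assumption): class 0 centred at 0.0, class 1 centred at 3.0.
rng = random.Random(42)
records = ([(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
           + [(rng.gauss(3.0, 1.0), 1) for _ in range(50)])

train, test = holdout_split(records, test_fraction=0.3)
threshold = fit_threshold(train)           # the test set is not touched here
test_accuracy = accuracy(threshold, test)  # Step 4: the test set is used once

print(f"train size: {len(train)}, test size: {len(test)}")
print(f"estimated generalization accuracy: {test_accuracy:.2f}")
```

Changing `seed` changes which records land in the test set, and with it the accuracy estimate.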

Pros -

  • It is computationally inexpensive: the model is trained and evaluated only once.
  • Its running time is therefore low.

Cons -

  • Evaluation based on the hold-out set can have a high variance because it depends heavily on which data points end up in the training set and which in the test set.
  • The evaluation will be different every time the division into train and test data changes.
  • The hold-out method is not well suited to a sparse data set, by which we mean here a data set whose classes are not equally distributed (more commonly called an imbalanced data set). For example, consider the following data set:

    Here, for convenience, I have not written out the features (x1, x2, ...) in detail.

    Suppose we apply the hold-out method with the 70-30 split shown. In the train set, the majority of records belong to class A and only a single record belongs to class B. If we train our algorithm on that train set, do you think it will learn enough to predict class B? When we evaluate this model on the test set, we will get wrong predictions for class B, and the error will increase.

    In this situation, random sub-sampling is a better approach than the hold-out method.
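How much a single split can starve the training set of the rare class is easy to check. The sketch below assumes a hypothetical data set of 7 records of class A and 3 of class B (the figure's exact counts are not reproduced here) and prints the class counts of a 70-30 split for a few random seeds; the function name is illustrative.

```python
import random
from collections import Counter

# Hypothetical imbalanced data set: only the labels matter for this check.
labels = ["A"] * 7 + ["B"] * 3

def holdout_label_counts(labels, test_fraction=0.3, seed=0):
    """Split the labels once and count each class in the train and test parts."""
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    train, test = shuffled[n_test:], shuffled[:n_test]
    return Counter(train), Counter(test)

for seed in range(5):
    train_counts, test_counts = holdout_label_counts(labels, seed=seed)
    print(f"seed {seed}: train {dict(train_counts)}  test {dict(test_counts)}")
```

Each seed stands for one possible hold-out split. Whenever most B records fall into the test part, the model has little or nothing to learn class B from.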

Random Sub-sampling Method

Let’s understand how random sub-sampling works:

  • Random sub-sampling performs ‘k’ iterations over the entire dataset, i.e. each iteration works on its own copy (replica) of the given data.
  • In each iteration [for each replica], a fixed number of observations is chosen at random without replacement and kept aside as the test set; the remaining observations form the training set.
  • The model is fitted to the training set of each iteration, and an estimate of the prediction error is obtained from the corresponding test set.
  • Let the estimated prediction error (PE) on the i-th test set be denoted by Ei. The overall error estimate is the average of the separate estimates: E = (E1 + E2 + ... + Ek) / k.
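Pros -

  • It is a better approach than the hold-out method for sparse (class-imbalanced) datasets, because the estimate no longer rests on a single split.

Cons -

  • The same record may be selected for the test set in more than one iteration, while other records may never be tested.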
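The procedure can be sketched as a small function. It is a minimal illustration, assuming a hypothetical 10-record data set and a toy majority-class model standing in for a real learning algorithm; `random_subsampling`, `fit_majority` and `error_rate` are names invented for this sketch.

```python
import random
from collections import Counter

def random_subsampling(records, fit, error, k=4, test_size=3, seed=0):
    """Average the test error over k independent random train/test splits."""
    rng = random.Random(seed)
    errors = []
    for _ in range(k):
        # Within one iteration the test set is drawn without replacement.
        test_idx = set(rng.sample(range(len(records)), test_size))
        train = [r for i, r in enumerate(records) if i not in test_idx]
        test = [r for i, r in enumerate(records) if i in test_idx]
        model = fit(train)
        errors.append(error(model, test))  # this is E_i
    return sum(errors) / k, errors

def fit_majority(train):
    """Toy model: always predict the most frequent label of the train set."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def error_rate(model, test):
    """Fraction of test records whose label differs from the prediction."""
    return sum(label != model for _, label in test) / len(test)

# Hypothetical data: (feature, label) pairs, 7 of class A and 3 of class B.
records = [(i, "A") for i in range(7)] + [(i, "B") for i in range(7, 10)]

mean_error, errors = random_subsampling(records, fit_majority, error_rate,
                                        k=4, test_size=3)
print("per-iteration errors:", errors)
print("averaged error estimate:", mean_error)
```

Each iteration draws its split independently of the others, which is why the same record can land in several test sets.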

Example:

Here, for convenience, I represent the dataset by just its label (Y) column. Blue represents the test set and red represents the train set.

As shown above, suppose we select k = 4 as the number of iterations.

  • We form k copies of the given dataset and split each of them into train and test sets. (In each iteration we sample without replacement to form the train-test split, the same as in the hold-out method.)
  • Then we fit the model on each train set and evaluate its performance on the corresponding test set. We now have 4 errors; the final error is the average of these 4 errors.
  • This approach is better than the hold-out method because we may get a different train-test split every time. So it can happen that the records of class B fall in the training set, in which case the model can learn the patterns of B better.
  • As for the cons: the same record may be selected again and again for the test sets.

Blue represents the test set and red represents the train set.

Here you can see that in iterations 2, 3 and 4 the majority of B records are in the test set. We face the same problem again: the model will not be able to learn class B and will fail in validation.
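The repeated selection of records can be made visible by counting how often each record ends up in a test set. The sketch assumes a hypothetical data set of 10 records, k = 4 iterations and 3 test records per iteration; the function name is illustrative.

```python
import random
from collections import Counter

def count_test_appearances(n_records, k=4, test_size=3, seed=0):
    """Count how many of the k test sets each record index falls into."""
    rng = random.Random(seed)
    appearances = Counter()
    for _ in range(k):
        # Each iteration draws its own test set, independently of the others.
        appearances.update(rng.sample(range(n_records), test_size))
    return appearances

counts = count_test_appearances(n_records=10)
for index in range(10):
    print(f"record {index}: in {counts[index]} of 4 test sets")
```

With 4 test sets of 3 records each, 12 test slots are filled from only 10 records, so at least one record must be tested more than once, while others may never be tested at all.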

To solve this problem, there is another good method called K-fold cross-validation.
