Whenever we perform a machine learning task, we follow the usual pipeline: preprocessing the data, engineering features, choosing a model, training and testing it, fine-tuning the whole thing, and finally deploying the resulting model for prediction. In this process we deal with two kinds of parameters. The first are the weights, which are bound to the model and updated during training. The second are the hyperparameters, which are the focal point of this piece.
Hyperparameters are independent of the training process described above: they are set before training and control how the optimization proceeds. Examples of hyperparameters to be optimized include the learning rate, dropout rate, batch size, and number of layers of a neural network, among others. Their values are not exclusively numeric, since they may also be categorical, such as the choice of optimizer (Adam, SGD, momentum, ...).
In fact, we can classify hyperparameters into three types (Hutter, 2018); a minimal sketch of such a search space follows the list below:
- Continuous: Learning rate
- Integer: number of layers, number of units
- Categorical:
- Finite domain, unordered:
- Example 1: algo ∈ {SVM, RF, NN}
- Example 2: activation function ∈ {ReLU, Leaky ReLU, tanh}
- Example 3: operator ∈ {conv3x3, separable conv3x3, max pool, ...}
- Special case: binary
- Conditional hyperparameters: a hyperparameter B is only active if another hyperparameter A is set a certain way
- Example 1:
- A = choice of optimizer (Adam or SGD)
- B = Adam’s second momentum hyperparameter (only active if A = Adam)
- Example 2:
- A = type of layer k (convolution, max pooling, fully connected, ...)
- B = conv. kernel size of that layer (only active if A = convolution)
- Example 3:
- A = choice of classifier (RF or SVM)
- B = SVM’s kernel parameter (only active if A = SVM)
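To make these distinctions concrete, here is a minimal Python sketch of a search space containing all of the kinds of hyperparameters listed above. The dictionary layout and the helper active_hyperparameters are purely illustrative assumptions, not taken from any particular library:

```python
# Illustrative only: a hand-rolled description of a search space covering
# continuous, integer, categorical and conditional hyperparameters.
search_space = {
    "learning_rate": {"type": "continuous",  "range": (1e-4, 1e-1)},
    "num_layers":    {"type": "integer",     "range": (1, 8)},
    "optimizer":     {"type": "categorical", "choices": ["adam", "sgd"]},
    # Conditional: only active when optimizer == "adam"
    "adam_beta2":    {"type": "continuous",  "range": (0.9, 0.999),
                      "condition": ("optimizer", "adam")},
}

def active_hyperparameters(config):
    """Return the hyperparameter names that are active for a given configuration."""
    active = []
    for name, spec in search_space.items():
        cond = spec.get("condition")
        if cond is None or config.get(cond[0]) == cond[1]:
            active.append(name)
    return active

print(active_hyperparameters({"optimizer": "sgd"}))   # adam_beta2 stays inactive
print(active_hyperparameters({"optimizer": "adam"}))  # adam_beta2 becomes active
```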
Let’s now talk about the main approaches to hyperparameter optimization, namely: manual search, grid search, random search, and the most sophisticated one of the pack, Bayesian optimization.
Manual tuning:
A hunch, a rule of thumb, or simple intuition is all it takes to pick a set of hyperparameters for a given dataset and model. This modus operandi is heuristic: human expertise drives the whole optimization process. The usual requirement is a certain amount of tuning experience, since the expert knows which hyperparameter values and combinations tend to return the highest scores.
The problem with manual tuning is that it relies on domain expertise and past experience. Applied to novel cases, it becomes a matter of guesswork and approximation that may never lead the model at hand to convergence.
Grid search:
A brute-force search for the optimal hyperparameters: every combination of the predefined values is evaluated, so the quality of the result depends entirely on the values provided. An example implementation of this method is the GridSearchCV module in the Python library scikit-learn.
The setback of this method is that while it may be suitable for small training datasets and models with few hyperparameter variations, it is extremely expensive in both resources and time.
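As a small illustration, the following sketch runs scikit-learn's GridSearchCV on a toy dataset; the grid values and the SVC model are arbitrary example choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search tries every combination: 3 values of C x 3 kernels = 9 configurations,
# each evaluated with 5-fold cross-validation.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```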
Random search:
A search method based on randomness, introduced by James Bergstra and Yoshua Bengio in 2012. Here we provide a statistical distribution for each hyperparameter from which values are randomly sampled. Although it may seem unlikely for randomness to find the optimal hyperparameters, research and experimentation have shown that in most cases it outperforms grid search and is surprisingly effective.
The inconvenience of this method lies in its randomness: it can be effective, but it is as naive as it gets in how it approaches the hyperparameter space at hand. The number of search iterations is set according to the available time and resources, so it can still be expensive, though not as expensive as grid search.
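A comparable sketch with scikit-learn's RandomizedSearchCV, where each hyperparameter is given a distribution and the budget is capped by n_iter; the distributions, ranges, and the random forest model below are just illustrative choices:

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Instead of a fixed grid, each hyperparameter gets a distribution to sample from.
param_distributions = {
    "n_estimators": randint(50, 300),
    "max_depth": randint(2, 20),
    "max_features": loguniform(0.1, 1.0),
}

# n_iter caps the budget: only 20 randomly sampled configurations are evaluated.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```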
Bayesian optimization:
This approach is very popular in research and has been adopted in a multitude of papers. Without delving into the details, it can be summed up as a method based on two principles: randomness and probability distributions. It does not attack the hyperparameter search space head on like grid or random search. Instead, it builds a probability distribution of an objective function (for example, the model's score) with regard to the set of hyperparameters, P(score | hyperparameters). The optimization then focuses on this distribution (also called the surrogate in the literature), since it is far cheaper to evaluate, and it is continually updated after each evaluation of the true objective function.
The main idea of Bayesian optimization is to spend less time evaluating the objective function and more time on promising sets of hyperparameters, hence the main characteristic of this technique: it is an informed search.
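As a hedged sketch of the idea, the following uses the third-party scikit-optimize library (assumed to be installed as skopt), whose gp_minimize fits a Gaussian-process surrogate of the objective; the search space, the MLPClassifier model, and all ranges are illustrative assumptions:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

def objective(params):
    # gp_minimize minimizes, so return the negative cross-validated accuracy.
    lr, n_units = params
    model = MLPClassifier(hidden_layer_sizes=(n_units,), learning_rate_init=lr,
                          max_iter=200, random_state=0)
    return -cross_val_score(model, X, y, cv=3).mean()

# Search space: a log-uniform continuous learning rate and an integer layer width.
space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate_init"),
    Integer(16, 128, name="hidden_units"),
]

# The surrogate of P(score | hyperparameters) is refit after each of the 20
# evaluations and decides which configuration to try next.
result = gp_minimize(objective, space, n_calls=20, random_state=0)
print(result.x, -result.fun)
```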
Sources:
https://www.automl.org/wp-content/uploads/2018/09/chapter1-hpo.pdf
https://media.neurips.cc/Conferences/NIPS2018/Slides/hutter-vanschoren-part1-2.pdf
Author: Ettaik Noureddine, PhD candidate at FSBM