It is non-trivial to run Bayesian Optimization in a parallel manner. In deep learning, model training takes a long time, so parallel computing is especially important, and ASHA leverages this.
#### Successive Halving Algorithm (SHA)
Example:
- There are 64 hyper-parameter configurations to evaluate
- SHA will strategize:
- Referring to the table below, SHA will first evaluate all 64 hyperparameter configurations for 1 epoch each.
- The performance of these 64 trials will be compared against one another, and the top **1/reduction factor** of trials will be “promoted”.
- In our example, we arbitrarily set the reduction factor to 4, so 1/4 × 64 = 16 trials will be “promoted” to the next rung, where another 3 epochs will be run. The same comparison is performed at each subsequent rung until only 1 trial is left. (A minimal code sketch of this procedure follows the table below.)
![[Pasted image 20210917132956.png]]
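To make the rung arithmetic concrete, here is a minimal, self-contained sketch of synchronous SHA in Python. The search space, the `train_and_eval` stand-in and the fake loss it returns are illustrative assumptions, not taken from the table or the paper.

```python
import random

def train_and_eval(cfg, epochs):
    # Stand-in for real training: returns a fake validation loss that
    # improves with more epochs and depends on the learning rate.
    return abs(cfg["lr"] - 1e-2) + 1.0 / epochs + random.uniform(0, 0.01)

def successive_halving(n_configs=64, reduction_factor=4, min_epochs=1):
    # Hypothetical search space: tune only the learning rate here.
    configs = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(n_configs)]
    epochs, rung = min_epochs, 1
    while True:
        # Evaluate every surviving config at this rung's epoch budget.
        scored = sorted(((train_and_eval(c, epochs), c) for c in configs),
                        key=lambda pair: pair[0])   # lower loss is better
        print(f"rung {rung}: {len(configs)} trial(s) at {epochs} epoch(s)")
        if len(configs) == 1:
            return scored[0][1]                     # single survivor remains
        # Promote only the top 1/reduction_factor to the next rung.
        keep = max(1, len(configs) // reduction_factor)
        configs = [c for _, c in scored[:keep]]
        epochs *= reduction_factor
        rung += 1

best = successive_halving()
print("best config:", best)
```

Running this reproduces the schedule in the table: 64 trials at rung 1, then 16, 4 and finally 1 trial at the top rung.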
##### The key insight from SHA is that we do not need to fully evaluate all trials for a full 64 epochs in order to select the optimal hyperparameter configurations.
##### More often than not, the least promising trials can be identified after a number of epochs and we can perform early stopping of these trials to reallocate our resources to other more promising ones.
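To put a rough number on this: assuming the rung budgets in the table are 1, 4, 16 and 64 epochs (so promoted trials simply continue training), SHA spends about 64×1 + 16×3 + 4×12 + 1×48 = 208 epochs in total, versus 64×64 = 4,096 epochs if every configuration were trained for the full 64 epochs.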
#### Asynchronous Algorithm
Because SHA needs to wait for all trials in a rung to complete before it can continue on to the next rung, it is difficult to parallelize optimally in the presence of *stragglers*, i.e. trials that run more slowly than the rest.
Suppose we have 10 compute nodes and allocate 1 node per trial. SHA will leave more and more nodes unutilized as the number of trials decreases at higher rungs (e.g. 6 idle nodes at rung 3).
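A quick back-of-the-envelope check of that idle-node arithmetic (the 10-node cluster and the one-node-per-trial assignment are just the assumptions from the example above):

```python
n_trials, reduction_factor, n_nodes = 64, 4, 10
rung = 1
while n_trials >= 1:
    busy = min(n_trials, n_nodes)   # each remaining trial occupies one node
    print(f"rung {rung}: {n_trials:>2} trial(s), {busy} node(s) busy, {n_nodes - busy} idle")
    n_trials //= reduction_factor
    rung += 1
```

At rung 3 only 4 trials remain, so 6 of the 10 nodes sit idle, and at rung 4 it is 9 idle nodes.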
ASHA tweaks the SHA algorithm by **promoting configurations whenever possible**, rather than waiting for all trials in a rung to finish. Referring to the same example in the table above, ASHA will promote one trial to the next rung for every 4 trials evaluated.
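A rough sketch of that promotion rule in Python is shown below. The function name, the `rung_results`/`promoted` bookkeeping and the integer config ids are illustrative assumptions; a real implementation (such as the one described in the paper or in tuning libraries) adds scheduling and fault tolerance on top of this.

```python
import itertools

_new_config_ids = itertools.count()

def get_next_job(rung_results, promoted, reduction_factor=4, top_rung=4):
    """Decide what a free worker should do next under ASHA.

    rung_results: dict mapping rung -> list of (loss, config_id) for trials
                  that have already finished that rung
    promoted:     set of (rung, config_id) pairs that were already promoted
    """
    # Scan from the highest rung downwards and promote whenever possible:
    # a trial is promotable once it sits in the top 1/reduction_factor of
    # the results observed *so far* in its rung.
    for rung in range(top_rung - 1, 0, -1):
        results = sorted(rung_results.get(rung, []), key=lambda r: r[0])
        promotable = results[: len(results) // reduction_factor]
        for loss, config_id in promotable:
            if (rung, config_id) not in promoted:
                promoted.add((rung, config_id))
                return config_id, rung + 1   # resume this config at the next rung
    # Nothing is promotable yet: start a brand-new configuration at the
    # bottom rung instead of letting the worker sit idle.
    return next(_new_config_ids), 1
```

With a reduction factor of 4 this matches the behaviour described above: the first promotion out of the bottom rung happens as soon as 4 trials have reported results there, i.e. roughly one promotion per 4 trials evaluated, and a worker with nothing to promote simply starts a new configuration.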
When compared to other existing algorithms that can be run in a parallel setting, ASHA generally performs as well as or better than them, according to [this paper](https://arxiv.org/pdf/1810.05934.pdf).
My intuition from the paper is that ASHA may perform better because it is able to **make full use of parallel computing to evaluate as many configurations as possible within a given time frame**, whereas other algorithms may not be able to utilise parallel computing as fully in the presence of stragglers.