It is non-trivial to run Bayesian Optimization in a parallel manner. In DL, model training takes a long time, so parallel computing is especially important, and ASHA leverages this.

#### Successive Halving Algorithm (SHA)
Example:
- There are 64 hyperparameter configurations to evaluate.
- SHA will strategize as follows:
	- Referring to the table below, SHA first evaluates all 64 hyperparameter configurations for 1 epoch each.
	- The performance of these 64 trials is compared, and the top **1/reduction factor** of trials will be “promoted”.
	- In our example, we arbitrarily set the reduction factor to 4, so 1/4 * 64 = 16 trials will be “promoted” to the next rung, where another 3 epochs will be run. A similar comparison is performed at each of the subsequent rungs until only 1 trial is left.

![[Pasted image 20210917132956.png]]

##### The key insight from SHA is that we do not need to fully evaluate all trials for a full 64 epochs in order to select the optimal hyperparameter configuration.
##### More often than not, the least promising trials can be identified after a small number of epochs, and we can early-stop these trials to reallocate our resources to other, more promising ones.

#### Asynchronous Algorithm
SHA needs to wait for all trials in a rung to complete before it can continue on to the next rung, so it is difficult to parallelise optimally in the presence of *stragglers*, i.e. trials that run more slowly than the rest.

Suppose we have 10 compute nodes and allocate 1 node per trial. SHA will result in more and more leftover nodes sitting idle as the number of trials decreases at higher rungs (e.g. 6 idle nodes at rung 3).

ASHA tweaks the SHA algorithm by **promoting configurations whenever possible**, rather than waiting for all trials in a rung to finish. Referring to the same example in the table above, ASHA will promote one trial to the next rung for every 4 trials evaluated.

When compared to other existing algorithms that can be run in a parallel setting, ASHA generally performs as well as or better, according to [this paper](https://arxiv.org/pdf/1810.05934.pdf). My intuition from the paper is that ASHA may perform better because it is able to **make full use of parallel computing to evaluate as many configurations as possible within a given time frame.** Other algorithms may not be able to utilise parallel computing as fully in the presence of stragglers.
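
To make ASHA’s promotion rule concrete, here is a minimal Python sketch (not the paper’s reference implementation nor any library’s API; the helper names `report` and `get_promotable`, the rung schedule, and the simulated losses are illustrative assumptions). It reuses the example above: 64 configurations, a reduction factor of 4, and rungs assumed to sit at 1, 4, 16 and 64 epochs.

```python
import random

REDUCTION_FACTOR = 4
RUNGS = [1, 4, 16, 64]  # assumed epoch budget at each rung, matching the example above

# results[rung] maps trial_id -> validation loss reported at that rung
results = {r: {} for r in range(len(RUNGS))}
promoted = {r: set() for r in range(len(RUNGS))}

def report(trial_id, rung, loss):
    """Record the metric of a trial that just finished its current rung."""
    results[rung][trial_id] = loss

def get_promotable(rung):
    """ASHA's rule: a trial can be promoted as soon as it sits in the top
    1/REDUCTION_FACTOR of the trials that have completed this rung *so far*,
    instead of waiting for the whole rung to finish as synchronous SHA does."""
    finished = sorted(results[rung], key=results[rung].get)   # best (lowest) loss first
    top_k = len(finished) // REDUCTION_FACTOR                 # size of the promotable bucket
    candidates = [t for t in finished[:top_k] if t not in promoted[rung]]
    if candidates:
        promoted[rung].add(candidates[0])
        return candidates[0]
    return None

# Simulated asynchronous run: trials report results one at a time, in any order.
random.seed(0)
for trial_id in range(64):
    report(trial_id, rung=0, loss=random.random())
    promotable = get_promotable(rung=0)
    if promotable is not None:
        # A real tuner would now resume this trial on a free worker and train it
        # up to RUNGS[1] = 4 epochs; here we just log the promotion decision.
        print(f"after {trial_id + 1} results on rung 0: promote trial {promotable} to rung 1")
```

Because the rule only looks at trials that have already reported, a straggler never blocks promotions for the rest of the rung; roughly one trial becomes promotable for every 4 results, so about 16 of the 64 trials reach rung 1, matching the synchronous example above.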