Google Switch Transformers: One Expert is Better than Two
Contrary to the common wisdom that two heads are better than one, in January 2021 the Google Brain team published the Switch Transformers paper [1], which tells us that one expert is better: more precisely, we keep many experts, but for each specific context we select only one, the one best suited to that context.
The Switch Transformer is a Transformer in which the feed-forward network is replaced by a set of experts and a routing gate that selects the best expert to process each incoming token. Each expert is itself a simple feed-forward neural network.
The figure below shows a Switch layer with four experts, where each input token is routed to the best expert to produce a better token representation at the output. This architecture allows us to increase the number of experts without increasing the computational cost per token, and it makes the experts easy to parallelize.
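To make the routing concrete, here is a minimal PyTorch-style sketch of such a layer (the class name, dimensions, and the use of PyTorch are illustrative assumptions, not the authors' implementation): a linear router scores every token, the token is dispatched to its single highest-probability expert, and the expert output is scaled by that router probability so the gate stays differentiable.

```python
# Minimal sketch of a Switch feed-forward layer (illustrative only).
# Capacity factors, the load-balancing loss and expert parallelism are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        # Each expert is a simple two-layer feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to a stream of tokens.
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)   # (tokens, num_experts)
        gate, expert_idx = probs.max(dim=-1)             # top-1 routing
        out = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale the expert output by the router probability so the
                # routing decision receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Ignoring the router itself, each token passes through exactly one expert, so the per-token FLOPs match those of a single dense feed-forward layer no matter how many experts are added.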

Where this idea came from
Increasing the model size has been an effective path towards powerful neural language models, but it comes at the cost of computation. The authors wanted to increase the number of model parameters without a proportional increase in computational cost, so they started from the Mixture-of-Experts (MoE) paradigm, which had already been shown to perform well (Shazeer et al., 2017) [2].
From the MoE Transformer to the Switch Transformer
An MoE layer consists of a number of experts and a trainable gating network which selects a combination of the top-k experts to process each input. This architecture is not widely adopted, mainly because of its computational cost.
To make MoE computationally efficient, the authors had to simplify it. They came up with the Switch Transformer architecture, which has many experts as in the MoE paradigm but does not combine the top-k experts: for each input token, the Switch Transformer selects only one expert. Thus, ignoring the routing cost, each input token has the same computational cost as in a vanilla Transformer.
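The difference between the two routing rules can be sketched for a single token as follows (a schematic in PyTorch with illustrative function names, not the paper's code): MoE runs and mixes the k best experts, while Switch routing runs only the single best one.

```python
# Schematic comparison of MoE top-k routing vs. Switch top-1 routing for one
# token (illustrative only). `experts` is a list of callables mapping a
# (d_model,) vector to a (d_model,) vector; `router_logits` has one entry
# per expert.
import torch
import torch.nn.functional as F

def moe_topk(token, experts, router_logits, k=2):
    # Classic MoE: run the k best experts and mix their outputs, weighted by
    # the renormalized gate probabilities -> k expert forward passes.
    probs = F.softmax(router_logits, dim=-1)
    top_p, top_i = probs.topk(k)
    top_p = top_p / top_p.sum()
    return sum(p * experts[i](token) for p, i in zip(top_p, top_i.tolist()))

def switch_top1(token, experts, router_logits):
    # Switch routing: run only the single best expert and scale its output
    # by the gate probability -> one expert pass, as in a dense FFN.
    probs = F.softmax(router_logits, dim=-1)
    gate, idx = probs.max(dim=-1)
    return gate * experts[idx.item()](token)
```

With top-1 routing the per-token compute stays at one expert pass, equal to a dense feed-forward layer, while top-k routing multiplies it roughly by k.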
This architecture outperforms MoE top-2 routing, as shown in the figure below. Thus, one expert is better than combining two.
It seems that the different experts become highly specialized in syntax and semantics, so combining more than one expert for a given token is not efficient.

It’s also faster than the T5 Transformer
Compared to T5, Google's state-of-the-art Transformer, the results show that having more parameters (experts) speeds up training when the computational cost per step is kept fixed and equal for T5-Base and Switch-Base. The Switch-Base 64-expert model reaches at step 60k the same performance that T5-Base reaches at step 450k, which is a 7.5x speedup (450k / 60k steps at equal step time), as shown in the figure below.

Is it a New BERT?
This model makes it easy to parallelize the experts without needing to parallelize the whole model (unless it is very large), which reduces the complexity of training relatively large models. This could encourage researchers to adopt and explore this architecture more widely.
Ref.
1. Fedus, W., Zoph, B., Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021.
2. Shazeer, N. et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017.