Google Switch Transformers : One Expert is Better than Two

TeqnoVerse
3 min readJan 27, 2021


Common wisdom says that two heads are better than one. In January 2021, the Google Brain team published the Switch Transformers paper [1], which argues the opposite: keep many experts, but for each context select only one of them, the expert that is best suited to that context.

A Switch Transformer is a Transformer in which the feed-forward neural net is replaced by a set of experts plus a routing gate that selects the best expert to process each incoming token. Each expert is itself a simple feed-forward neural network.

The figure below shows a Switch layer with four experts, where each input token is routed to its best expert to produce a better token representation at the output. This architecture lets us increase the number of experts without increasing the computational cost per token, and makes it easy to parallelize the experts.

Switch Transformer encoder block
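To make the routing concrete, here is a minimal NumPy sketch of a Switch layer. The sizes, random weights, and ReLU experts are illustrative assumptions, not the paper's actual implementation: a router matrix scores each token against every expert, and only the argmax (top-1) expert processes the token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4

# Router: one trainable matrix that scores each token against each expert.
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))

# Each expert is an independent two-layer feed-forward net.
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def switch_layer(tokens):
    """Route each token to its single best-scoring expert (top-1)."""
    gate = softmax(tokens @ W_router)      # (n_tokens, n_experts) probabilities
    best = gate.argmax(axis=-1)            # chosen expert index per token
    out = np.empty_like(tokens)
    for i, tok in enumerate(tokens):
        W1, W2 = experts[best[i]]
        h = np.maximum(tok @ W1, 0.0)      # the chosen expert's ReLU FFN
        # Scale by the gate probability so the router stays differentiable.
        out[i] = gate[i, best[i]] * (h @ W2)
    return out, best

tokens = rng.normal(size=(5, d_model))
out, routing = switch_layer(tokens)
print(out.shape, routing)
```

Note that however many experts exist, each token only ever touches one of them, which is why parameter count can grow without growing per-token compute.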

Where this idea came from

Increasing the model size has been an effective path towards powerful neural language models. But increasing the model size comes at the cost of computation. The authors wanted to increase the number of model parameters without a proportional increase in computational cost. So they decided to start with the Mixture-of-Expert (MoE) paradigm as it has been shown to perform better in (Shazeer, 2017)[2].

From MoE Transformer to Switch Transformer

The MoE consists of a number of experts and a trainable gating network that selects a combination of the top-k experts to process each input. This architecture is not widely adopted, mainly because of its computational cost.

To make MoE computationally efficient, the authors simplified it. The resulting Switch Transformer architecture keeps many experts, as in the MoE paradigm, but does not combine the top-k experts: for each input token, the Switch Transformer selects only one expert. Thus, ignoring the routing cost, each input token has the same computational cost as in a vanilla Transformer.

This architecture outperforms MoE with top-2 routing, as shown in the figure below. Thus, one expert is better than a combination of two.

It seems that the different experts become highly specialized by syntax and semantics, so that combining more than one expert for a specific token is not efficient.

Switch Transformer outperforms MoE with top-2 routing

It’s also faster than T5-Transformer

Compared to the T5 transformer, a state-of-the-art Transformer from Google, the results show that having more parameters (experts) speeds up training when the computational cost is kept fixed and equal for T5-Base and Switch-Base. The Switch-Base 64-expert model reaches at step 60k the same performance that T5-Base reaches at step 450k, a 7.5x speedup in training steps, as shown in the figure below.
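The speedup figure quoted above follows directly from the two step counts:

```python
# Steps each model needs to reach the same quality level (from the comparison above).
t5_steps = 450_000      # T5-Base
switch_steps = 60_000   # Switch-Base, 64 experts

speedup = t5_steps / switch_steps
print(speedup)  # 7.5
```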

Is it a New Bert?

This model makes it easy to parallelize the experts without the need to parallelize the whole model unless it’s very large, which reduces the complexity of training relatively large models. This could encourage researchers to widely adopt and explore this architecture.

Ref.

1. Fedus, W., Zoph, B., Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. 2021.

2. Shazeer, N., et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. 2017.

