Global-batch load balance almost free lunch to improve your MoE LLM training

qwenlm.github.io Infra & hardware 1 min read

Background

The Mixture-of-Experts (MoE) architecture has become a popular technique for scaling up model parameters. Typically, one MoE layer consists of a router (often parameterized as a single linear layer) and a group of experts (for transformer-based models, each expert is one feedforward layer). Given an input, only a subset of the experts is activated, and their outputs are aggregated according to the scores assigned by the router.
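The routing described above can be illustrated with a minimal NumPy sketch. It is not the implementation from the original article. The softmax router, the top-k selection rule, the ReLU feedforward experts, and all names here (`moe_layer`, `router_w`, `experts_w1`, `experts_w2`, `top_k`) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, experts_w1, experts_w2, top_k=2):
    """Hypothetical MoE layer (illustrative only).

    x:          (tokens, d_model) float array
    router_w:   (d_model, n_experts), the single linear router
    experts_w1: (n_experts, d_model, d_ff)
    experts_w2: (n_experts, d_ff, d_model)
    """
    # Router: one linear layer followed by softmax (assumed) to score experts.
    scores = softmax(x @ router_w)                       # (tokens, n_experts)
    # Keep only the top_k highest-scoring experts per token (assumed rule).
    top_idx = np.argsort(-scores, axis=-1)[:, :top_k]    # (tokens, top_k)

    out = np.zeros_like(x)
    for e in range(router_w.shape[1]):
        # Tokens that selected expert e in any of their top_k slots.
        token_ids, _ = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue  # this expert received no tokens
        # Expert e: a two-matrix feedforward block with ReLU (assumed).
        hidden = np.maximum(x[token_ids] @ experts_w1[e], 0.0)
        expert_out = hidden @ experts_w2[e]
        # Aggregate expert outputs weighted by the router scores.
        out[token_ids] += scores[token_ids, e][:, None] * expert_out
    return out, top_idx

# Usage: 5 tokens, model width 4, 3 experts, 2 active experts per token.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
out, chosen = moe_layer(
    x,
    router_w=rng.normal(size=(4, 3)),
    experts_w1=rng.normal(size=(3, 4, 8)),
    experts_w2=rng.normal(size=(3, 8, 4)),
    top_k=2,
)
```

Nothing in this sketch forces the router to spread tokens evenly, so some experts can end up with most of the tokens while others receive few or none. That imbalance is what load-balancing techniques such as the one named in the title aim to address.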

Read the original on qwenlm.github.io
