Enabling Muon Optimizer in DeepSpeed #7509


Open · wants to merge 30 commits into master

Conversation


@PKUWZP (Collaborator) commented on Aug 23, 2025

Authorship: @pengdurice and @PKUWZP

Related Issue: #7438

Introduction

Muon, a new optimizer that has recently attracted the community's attention, shows promising results in training large language models. Adding Muon to DeepSpeed, a popular open-source framework for large-scale training and inference, is therefore important for DeepSpeed users and developers. A previous PR (huge thanks to @qimcis) attempted this adoption and is a good starting point, but substantially more work is needed to make Muon fully compatible with DeepSpeed. This PR fully enables the Muon optimizer within DeepSpeed.

Issues and solutions

Issues

  1. With ZeRO stage 1, 2, or 3, the optimizer states are partitioned within the same data-parallel group. Each process therefore already handles only a subset of the model parameters, so the data-parallel partitioning logic in the original Muon code is unnecessary.
  2. The parameters (and their gradients) are flattened into a 1D vector before being passed to the optimizer, which breaks the core assumption of Muon: it works by orthogonalizing the update for each matrix-shaped parameter (dim >= 2). See the sketch below.
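
For context, Muon's core step approximately orthogonalizes each matrix-shaped update with a Newton-Schulz iteration, which is why a flattened 1D gradient cannot be fed to it directly. Below is a minimal sketch of that iteration, based on the publicly available Muon reference implementation; the coefficients and step count are illustrative and may differ from what this PR ships:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a matrix-shaped update via Newton-Schulz iteration."""
    assert G.ndim >= 2, "Muon's orthogonalization needs matrix-shaped (dim >= 2) parameters"
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients from the reference code
    X = G.float()
    X = X / (X.norm() + 1e-7)           # scale so the spectral norm is <= 1
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```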

Solutions

To address these issues, this PR makes the following changes:

  1. We simplify the Muon code by removing its internal partitioning and update logic.

  2. We move the Muon update into the get_flat_partition function of the stage 1 and 2 DeepSpeedZeroOptimizer, where per-parameter gradients are collected before being flattened and consumed by the optimizer to update the model parameters. Because each gradient still has its original shape at this point, the Muon update can be applied directly (see the sketch after this list).

  3. We also save the momentum buffer in the optimizer's state so that training converges smoothly after resuming from a saved checkpoint.

  4. We add comprehensive unit tests to validate the Muon optimizer's correctness and functionality.
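
To illustrate points 2 and 3, here is a simplified sketch of the per-parameter Muon step as it would run in the ZeRO stage 1/2 gradient-collection path, before flattening. It reuses the newton_schulz_orthogonalize sketch above; the function name, state layout, and scaling here are illustrative and omit the surrounding ZeRO bookkeeping:

```python
import torch

def muon_transform_grad(grad: torch.Tensor, state: dict, momentum: float = 0.95) -> torch.Tensor:
    """Apply the Muon transform to one parameter's gradient while it still has
    its original shape, i.e. before ZeRO flattens it into the 1D partition."""
    if grad.ndim < 2:
        # Non-matrix parameters (biases, norms) keep their plain gradient and are
        # handled by the auxiliary Adam path of MuonWithAuxAdam.
        return grad
    # Point 3: keep the momentum buffer in the optimizer state so that training
    # resumes smoothly from saved checkpoints.
    buf = state.setdefault("momentum_buffer", torch.zeros_like(grad))
    buf.mul_(momentum).add_(grad)
    update = newton_schulz_orthogonalize(buf)   # see the sketch in the Issues section
    # Rescale so the update magnitude is comparable to an Adam-style step;
    # the exact scaling used in the PR may differ.
    return update * max(1.0, grad.size(-2) / grad.size(-1)) ** 0.5
```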

Future directions and roadmap

Several follow-up items are of interest:

  • Create a CPU-offload version.
  • Apply Muon to ZeRO stage 3.
  • Use the highly optimized Adam implementation for the Adam part of the MuonWithAuxAdam optimizer.
  • Pursue more efficient implementations, e.g. (a) specialized kernels for the Newton-Schulz iteration and Muon updates; (b) parallelized updates across parameters (currently each parameter is updated separately and sequentially).
