add scattermoe kernel for fast MoE training #40365

Draft: wants to merge 5 commits into main
Conversation

mayank31398
Contributor

No description provided.

@Rocketknight1
Member

cc @ArthurZucker

@ArthurZucker
Collaborator

Hello @mayank31398! Nice PR, happy to add something like that. Do you mind using kernels, like what we do for GPT_OSS? That way we keep a slow path that is compatible with all torch versions, all hardware, etc., we don't touch the core modeling code, and we just have the kernel on the Hub!

WDYT? 🤗
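
For reference, the hub-kernel pattern being suggested looks roughly like the sketch below. This is only an illustration: the repo id `kernels-community/scattermoe`, the kernel call signature, and the tensor shapes are assumptions, not part of this PR.

```python
import torch

try:
    from kernels import get_kernel                           # HF `kernels` package
    scattermoe = get_kernel("kernels-community/scattermoe")  # hypothetical Hub repo id
except Exception:
    scattermoe = None                                         # kernel unavailable -> slow path


def moe_forward(hidden_states, expert_weights, routing_weights, selected_experts):
    """hidden_states: (tokens, in); expert_weights: (experts, in, out);
    routing_weights / selected_experts: (tokens, top_k)."""
    if scattermoe is not None and hidden_states.is_cuda:
        # fast path: fused Triton kernel pulled from the Hub (assumed signature)
        return scattermoe.forward(hidden_states, expert_weights,
                                  routing_weights, selected_experts)

    # slow path: plain PyTorch, works on any device and any torch version
    out = torch.zeros(hidden_states.shape[0], expert_weights.shape[-1],
                      device=hidden_states.device, dtype=hidden_states.dtype)
    for e in range(expert_weights.shape[0]):
        token_idx, k_idx = torch.where(selected_experts == e)
        if token_idx.numel() == 0:
            continue
        expert_out = hidden_states[token_idx] @ expert_weights[e]
        out.index_add_(0, token_idx,
                       expert_out * routing_weights[token_idx, k_idx, None])
    return out
```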

@mayank31398
Contributor Author

@ArthurZucker scattermoe doesn't support bias for now; I will add that soon!
Meanwhile, supporting every model is hard, since some models store their expert weights as a ModuleList instead of a single 3D tensor :/
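
To illustrate the two layouts being contrasted here (a generic sketch; the shapes and module names are assumptions, not taken from any particular model): a fused kernel such as scattermoe wants the stacked 3D form, while ModuleList-style experts have to be stacked first.

```python
import torch
import torch.nn as nn

num_experts, hidden, intermediate = 8, 64, 256

# Layout A: one nn.Linear per expert, as many MoE modeling files do
experts_list = nn.ModuleList(
    nn.Linear(hidden, intermediate, bias=False) for _ in range(num_experts)
)

# Layout B: a single 3D parameter of shape (experts, in, out), which a
# batched/fused expert kernel can consume directly
experts_3d = nn.Parameter(torch.empty(num_experts, hidden, intermediate))

# Bridging A -> B means materialising a stacked copy (extra memory, and the
# copy is detached from the original parameters unless handled carefully)
stacked = torch.stack([lin.weight.t() for lin in experts_list])  # (8, 64, 256)
```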

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: granitemoe

@shawntan
Contributor

> Hello @mayank31398! Nice PR, happy to add something like that. Do you mind using kernels, like what we do for GPT_OSS? That way we keep a slow path that is compatible with all torch versions, all hardware, etc., we don't touch the core modeling code, and we just have the kernel on the Hub!
>
> WDYT? 🤗

Is there an existing Triton kernel you could point me to that I could follow?
