-
Notifications
You must be signed in to change notification settings - Fork 322
Add Int4PlainInt32 Tensor #2845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2845
Note: Links to docs will display an error until the docs builds have been completed. ❌ 4 New FailuresAs of commit 282f1a8 with merge base 3bf21d0 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
"int4_xpu_int_zp is referring to the format used by int4 weight-only quantization on XPU with int zero point, which is a groupwise quantization format." | ||
INT4_XPU_INT_ZP = "int4_xpu_int_zp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't include int4 and xpu in the name, can you name this in terms of of how the quantized data is packed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The int4 weight xpu is a plain format tensor according to this doc, it just pack 2 int4 weight elements in a byte and then store the 4*int4 as int32. So I change it to the plain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, we have plain that stores 2*int4 as int8, can you reuse it or would need a new one? https://github.com/pytorch/ao/blob/main/torchao/quantization/quantize_/workflows/int4/int4_tensor.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liangan1 can you use PLAIN_INT32
for packing_format, and rename things accordingly (tensor subclass, files etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jerryzh168. I have added PLAIN_INT32 to be used by the xpu int4. Per my understanding, the packing format should be a dispatch policy to select the right tensor subclassing and a tensor subclass should cover a specific quantization recipe. So I suppose I should keep the current tensor name for int4 xpu.
In this PR, we just want to enable the int xpu with int zp domain. The current oneDNN backend can not support the float zp as CUDA/CPU backend and the feature is WIP. I plain to reuse this packing format in the future and dispatch the tensor with the zero point domain information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can reuse the packing format and the tensor for float32 zero_point as well in the future I think, but today we structure tensor subclass by: dtype + packing_format, so Int4PlainInt32
might be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. change it to Int4PlainInt32. pls help to review again.
torchao/quantization/quantize_/workflows/int4/int4_xpu_tensor.py
Outdated
Show resolved
Hide resolved
@@ -14,6 +14,9 @@ | |||
from .int4.int4_tensor import ( | |||
Int4Tensor, | |||
) | |||
from .int4.int4_xpu_tensor import ( | |||
Int4XPUTensorIntZP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we should rename the tensor names etc. as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my understanding, the packing format should show the weight packing format, but I also want add the zp domian information in this tensor, so keep the tensor name as Int4XPUTensorIntZP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the packing format should contain a description of how the quantized weights are laid out in memory, it should not mention XPU
, and also not INTZP
since these are not relevant to how the quantized weights laid out.
can you change it to something like Int4PlainInt32
since PLAIN_INT32
is the packing_format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int4PlainInt32 is good for me. But I still have some concern about the relation between packing_format and subclassing tensor? it must be 1 x 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, it's 1x1, we are also going to do a follow up refactor to have Int4PackingFormat instead having a global PackingFormat, since there is not much reuse
theoretically, 1 packing format can correspond to multiple tensors with different dtypes, but we found that there is not much reuse of packing_format, that's why we want to move it to local to each dtype (int4, float8, intx)
what's the concern for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Your explanation is clear enough. No more concern now.
torchao/quantization/quantize_/workflows/int4/int4_xpu_tensor.py
Outdated
Show resolved
Hide resolved
eps = 1e-6 | ||
scale_dtype = None | ||
zero_point_dtype = torch.int32 | ||
scale, zero_point = _choose_qparams_affine( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we can use the official op?
ao/torchao/quantization/quant_primitives.py
Line 1194 in f02354d
def choose_qparams_affine( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return Int4WeightOnlyConfig( | ||
group_size=group_size, | ||
packing_format="plain_int32", | ||
zero_point_domain=ZeroPointDomain.INT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we don't need this anymore I think, also we want to remove ZeroPointDomain in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed. But I have a question that how to selelct the int zp domain for user if there is no this param?
torchao/quantization/__init__.py
Outdated
"Int4PreshuffledTensor", | ||
"Int4MarlinSparseTensor", | ||
"IntxOpaqueTensor", | ||
"IntxOpaqueTensor", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
int4 weight-only quantization on XPU with oneDNN as backend (groupwise quantization only) | ||
|
||
Tensor Attributes: | ||
qdata: packed int4 weigh, always viewed as a 2D (N, K/2) tensor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be good to callout the dtype here, since it's important in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
|
||
Non-Tensor Attributes: | ||
block_size: the block size for quantization, representing the granularity, for groupwise quantization, will have block_size (1, group_size). | ||
we only support group_size = 32/64/128. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an assert in from_hp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected. There is no block size limitation for XPU backend.
scale: (K/group_size, N), dtype is the same as the original Tensor dtype | ||
zero_point: (K/group_size, N) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also dtypes for these for more clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the dtype information for zero_point.
This PR is used to enable the Int4XPUTensorIntZP. The pacing format name is "int4_xpu_int_zp"
Testcase:
bash python test/quantization/quantize_/workflows/int4/test_int4_xpu.py