
Conversation

Collaborator

ZhiweiYan-96 commented Oct 9, 2024

[ghstack-poisoned]

pytorch-bot bot commented Oct 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137566

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit 8b14d42 with merge base d7f3cd0:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 9, 2024
ZhiweiYan-96 added a commit that referenced this pull request Oct 9, 2024
ghstack-source-id: 403047a
Pull Request resolved: #137566
ZhiweiYan-96 marked this pull request as draft October 9, 2024 07:26
@ZhiweiYan-96
Collaborator Author

@zhuyuhua-v Could you please review the PR?

ZhiweiYan-96 requested a review from EikanWang October 9, 2024 07:28
ZhiweiYan-96 added the module: xpu Intel XPU related issues, topic: not user facing topic category, and ciflow/xpu Run XPU CI tasks labels Oct 9, 2024
"oneDNN input matrixes must have the same ranks");
TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined");

at::Device curDevice = at::Device(at::kXPU, at::xpu::current_device());
Collaborator

Please unify the code style. curDevice -> cur_device.

Collaborator Author

Thanks for your suggestion; the naming has been changed.

mb = dst.size(0);
TORCH_CHECK(
mb == m1.size(0) && mb == m2.size(0),
"batch size mismatch, dst mb: ",
Collaborator

Is mb a common term? Can users fully understand the exact meaning of mb?

Collaborator Author

Thanks for the suggestion. mb means minibatch here, but I reviewed the code and removed mb, since int4_gemm has no need to handle batching currently.

scale_usr_md = dnnl::memory::desc(scale_dims, scale_user_dt, scale_strides);
zp_usr_md = dnnl::memory::desc(zp_usr_dims, zp_user_dt, zp_usr_strides);
dst_usr_md = dnnl::memory::desc(dst_dims, dst_usr_dt, dst_strides);
// STEP4: create dnnl::memory
Collaborator

Where are STEP 2 and STEP 3?

Collaborator Author

I have removed this kind of comment and added new comments in the code.

args.insert({DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_WEIGHTS, zp_usr_m});

sycl::event matmul_event = dnnl::sycl_interop::execute(matmul_p, stream, args, deps);
if (!dst.is_same(result))
Collaborator

When is dst not the same as result?

Collaborator Author

This issue stems from the WOQ matmul being ported from the regular matmul. int4_gemm currently has no case where dst differs from result, so I have removed the code.

sycl::event matmul_event = dnnl::sycl_interop::execute(matmul_p, stream, args, deps);
if (!dst.is_same(result))
result.copy_(dst);
result = resize_as_onednn_mat1(mat1_, result);
Collaborator

When is resize_as_onednn_mat1 required?

Collaborator Author

removed

args.insert({DNNL_ARG_SCRATCHPAD, scratchpad_memory});

if (attr.with_binary())
attr.construct_post_binary(matmul_pd, args);
Collaborator

attr constructs the post binary here. However, dnnl::post_ops po = attr.extract_post_ops(dst); has already extracted the post-ops, and pattr.set_post_ops(po); has already assigned them to the matmul primitive attribute. Is this valid behavior?

Collaborator Author

int4 has no post-ops currently; I have removed the code, thanks.

dnnl::memory::data_type::s8);
// Set fpmath mode with `apply_to_int=true` to apply fpmath mode behavior to
// integral primitives (in this example, matmul).
pattr.set_fpmath_mode(dnnl::fpmath_mode::f16, true);
Collaborator

OneDNN supports both f16 and bf16. Why do we need to constrain the dtype?

Collaborator Author

ZhiweiYan-96 Dec 16, 2024

We now have a control statement to determine which dtype is used for fpmath_mode, thanks. However, bf16 currently hits a runtime issue in the oneDNN version we use; the bf16 dtype works in newer oneDNN versions.
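For reference, a minimal sketch of the kind of dtype-dependent fpmath_mode selection described above, assuming the activation tensor and the primitive attribute are available as act and pattr; the helper name is illustrative, not the PR's actual code.

#include <ATen/ATen.h>
#include <oneapi/dnnl/dnnl.hpp>

// Sketch only: choose the fpmath mode from the activation dtype.
// apply_to_int = true extends the mode to integral primitives such as
// this int4 matmul; bf16 requires a sufficiently new oneDNN.
inline void set_int4_fpmath_mode(dnnl::primitive_attr& pattr,
                                 const at::Tensor& act) {
  auto mode = (act.scalar_type() == at::kBFloat16)
      ? dnnl::fpmath_mode::bf16
      : dnnl::fpmath_mode::f16;
  pattr.set_fpmath_mode(mode, /*apply_to_int=*/true);
}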

@ZhiweiYan-96
Collaborator Author

@liangan1 Could you please review the PR?

TORCH_CHECK(
dims == mat1.dim() && dims == mat2.dim(),
"oneDNN input matrixes must have the same ranks");
TORCH_CHECK(result.defined(), "oneDNN matmul result should be defined");
Contributor

Since you have flattened mat1 and mat2 into dims=2 and the result is also a 2-dimensional empty tensor, when will dim=3 occur, and when will result be undefined? Can you show an example?

Collaborator Author

Some of this logic is outdated; we have removed such code in the gemm integration now.
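For context, a minimal sketch of the 2-D flattening being discussed. The names are illustrative and only mirror the resize_as_onednn_mat1 helper quoted later in this review; this is not the PR's exact code.

#include <ATen/ATen.h>

// Collapse all leading (batch) dimensions of the activation so the int4
// matmul only ever sees a 2-D problem, then restore the shape afterwards.
at::Tensor flatten_to_2d(const at::Tensor& mat1) {
  return mat1.reshape({-1, mat1.size(-1)});      // [B..., K] -> [B*..., K]
}

at::Tensor restore_result_shape(const at::Tensor& out_2d,
                                const at::Tensor& mat1, int64_t n) {
  auto sizes = mat1.sizes().vec();               // original [B..., K]
  sizes.back() = n;                              // -> [B..., N]
  return out_2d.view(sizes);
}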

Attr attr,
const c10::optional<Tensor>& g_idx,
const std::vector<sycl::event>& deps,
Tensor b_raw = at::Tensor()) {
Contributor

Change to bias_raw?

Collaborator Author

bias is not present in the weight_int4pack_mm API, and I have removed the bias-related code in the newest commit. Thanks for the suggestion.

(b.size(0) == 1 && b.size(1) == 1),
"matmul supports [m, n] or [1, n] or [m, 1] or [1, 1] when bias dim is 2 ...");
if (b.size(0) == 1 && b.size(1) == 1)
b = b.expand({1, n}).contiguous();
Contributor

liangan1 Oct 15, 2024

In the other cases (e.g., b.dim() = 1/3/0), you always expand b to the same dim as m1. Does it work when m1.dim() == 3 while b.dim() == 2? According to the oneDNN docs: "all tensors (including bias, if it exists) must have the same number of dimensions."

Collaborator Author

I have removed this code, as it is bias-related. Thanks for the reminder.

auto m2_usr_dt = get_onednn_dtype(m2);
auto scale_user_dt = get_onednn_dtype(scale_); // half <==> fp16
// auto zp_user_dt = dnnl::memory::data_type::s4; // int32, representing 8xint4
auto zp_user_dt = get_onednn_dtype(zp_);
Contributor

liangan1 Oct 15, 2024

Suggest changing xxx_user_xxx to xxx_usr_xxx to unify the style. Since oneDNN supports different data types, suggest changing the comment to "e.g., half <==> f16".

return output.view_symint(sizes);
}

sycl::event woq_matmul_int4(
Contributor

Suggest adding more function description here, e.g., the supported activation data types and the data layout information for both inputs.

Collaborator Author

A more detailed description was added in older commits.


m2_usr_dims = {compressed_k, n};
scale_dims = {num_groups, n};
zp_dims = {1};
Contributor

The dims of zp_dims are not aligned with the original zp inputs. With this limitation, only symmetric or per-tensor quantization is supported. Please add comments about this limitation of oneDNN.

Collaborator Author

OneDNN provides us with a way to support asymmetry, allowing us to handle asymmetrical scenarios. I'm currently testing it, and if it works, I will modify it here to support both symmetric and asymmetric logic.
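As a rough illustration of the asymmetric path being tested, the sketch below sets grouped weight scales and zero-points on the primitive attribute, loosely following oneDNN's weights-decompression matmul example. The set_scales/set_zero_points overloads that take a groups argument exist only in recent oneDNN releases, so treat the signatures as an assumption rather than the PR's final code.

#include <oneapi/dnnl/dnnl.hpp>

// Sketch: per-group weight scales (f16) and zero-points (s8) for the
// weight-only-quantized int4 matmul. Mask (1 << 0) | (1 << 1) means the
// quantization parameters vary along both K and N, with group_size
// elements per group along K and 1 along N.
void set_woq_int4_quant_attr(dnnl::primitive_attr& pattr,
                             dnnl::memory::dim group_size) {
  const int wei_mask = (1 << 0) | (1 << 1);
  pattr.set_scales(DNNL_ARG_WEIGHTS, wei_mask, {group_size, 1},
                   dnnl::memory::data_type::f16);
  pattr.set_zero_points(DNNL_ARG_WEIGHTS, wei_mask, {group_size, 1},
                        dnnl::memory::data_type::s8);
}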

airMeng left a comment

There should be a prepack process since OneDNN doesn't support the most popular layout

ZhiweiYan-96 added a commit that referenced this pull request Nov 2, 2024
ghstack-source-id: 451a44c
Pull Request resolved: #137566
ZhiweiYan-96 added a commit that referenced this pull request Nov 5, 2024
ghstack-source-id: ab34a0e
Pull Request resolved: #137566
@liangan1
Contributor

liangan1 commented Nov 5, 2024

There should be a prepack process since OneDNN doesn't support the most popular layout

https://github.com/intel/torch-xpu-ops/pull/1035/files This PR is used to do int4 weight prepack.

ZhiweiYan-96 added a commit that referenced this pull request Nov 24, 2024
ghstack-source-id: 9863ceb
Pull Request resolved: #137566
ZhiweiYan-96 added a commit that referenced this pull request Nov 28, 2024
ghstack-source-id: 6eb581c
Pull Request resolved: #137566
ZhiweiYan-96 added a commit that referenced this pull request Nov 28, 2024
ghstack-source-id: 6919d03
Pull Request resolved: #137566
b, n_bit=4, q_group_size=q_group
)
# b_int4pack [n, k//8]
b_int4pack = torch._convert_weight_to_int4pack(
Contributor

This should be b_int4pack [k//8, n]

Collaborator Author

Thanks for the reminder; I have modified the description here.

sizes[sizes.size() - 1] = n;
return output.view_symint(sizes);
}

Contributor

Should this be removed?

Collaborator Author

Yes, this code has been removed in the newest commit; thanks for the reminder.

Tensor m1 = is_onednn_matmul_strides(mat1_) ? mat1_ : mat1_.contiguous();
//m2_ may be a 4 dims fake tensor in torchAO with shape {N / 8, K / (16 * innerKTiles), 32, innerKTiles / 2}
//Tensor m2 = mat2_.flatten(0, -2); //ToDo: change to the fke shape: mat2_.flatten(0, -2); // N1
Tensor m2 = is_onednn_matmul_strides(mat2_) ? mat2_ : mat2_.contiguous();
Contributor

Remove these comments.

Collaborator Author

removed

auto expected_m1_md = matmul_pd.src_desc();
auto expected_m2_md = matmul_pd.weights_desc();
auto expected_dst_md = matmul_pd.dst_desc();

Contributor

Need to remove this part.

zeros = min_val + scales * (2 ** (n_bit - 1))
zeros = min_int - min_val.div(scales).round()
zeros = torch.clamp(zeros, min_int, max_int)
zeros = zeros.to(torch.int8)
assert torch.isnan(zeros).sum() == 0
Contributor

This is also used in tinygemm; we should not change this one.

Collaborator Author

Thanks for pointing this out; I have moved the code to xpu/test_gemm.py.

const at::Tensor& zp, // [k/group_size, N]
int64_t group_size,
Attr attr,
const std::vector<sycl::event>& deps = {});
Collaborator

Why does this operation require deps?

Collaborator Author

Formerly, fs1 required us to add events at the oneDNN integration layer for profiling purposes. Here it is just intended to keep the API consistent with conv/gemm. Do we need to remove this?

const at::Tensor& scale, // [K/group_size, N]
const at::Tensor& zp, // [k/group_size, N]
int64_t group_size,
Attr attr,
Collaborator

Suggested change
Attr attr,
std::optional<Attr> attr = std::nullopt,

Collaborator Author

Will remove attr, as we do not append post-ops currently.

const at::Tensor& zp, // [k/group_size, N]
int64_t group_size,
Attr attr,
const std::vector<sycl::event>& deps = {});
Collaborator

Suggested change
const std::vector<sycl::event>& deps = {});
const std::optional<std::vector<sycl::event>>& deps = std::nullopt);


// qscale:[K/qGroupSize, N]
// qzp:[K/qGroupSize, N]
woq_matmul_int4(C, A, B, qScale, qZeros, qGroupSize, onednn::Attr());
Collaborator

Is there any case where we need to fuse other operations? What is the motivation for providing attributes here?

Comment on lines +422 to +426
const Tensor& A,
const Tensor& B,
int64_t qGroupSize,
const Tensor& qScale,
const Tensor& qZeros) {
Collaborator

@ZhiweiYan-96, the code style of Blas.cpp is snake_case; why is the style of these variables camelCase?

Collaborator Author

ZhiweiYan-96 Mar 10, 2025


at::Device cur_device = at::Device(at::kXPU, at::xpu::current_device());
auto engine = GpuEngineManager::Instance().get_engine(cur_device);
auto stream = GpuStreamManager::Instance().get_stream();
Collaborator

@ZhiweiYan-96, may I know where the guard code is that ensures all the input tensors are on the same device?
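For illustration, a minimal sketch of the kind of guard being asked about; the function and tensor names are assumptions, not the PR's actual signature.

#include <ATen/ATen.h>
#include <c10/core/DeviceGuard.h>

// Sketch: verify all inputs live on the same XPU device before querying
// the oneDNN engine/stream for that device.
void woq_int4_device_checks(const at::Tensor& mat1, const at::Tensor& mat2,
                            const at::Tensor& scale, const at::Tensor& zp) {
  TORCH_CHECK(mat1.is_xpu() && mat2.is_xpu() && scale.is_xpu() && zp.is_xpu(),
              "weight_int4pack_mm: all inputs must be XPU tensors");
  TORCH_CHECK(mat1.device() == mat2.device() &&
                  mat1.device() == scale.device() &&
                  mat1.device() == zp.device(),
              "weight_int4pack_mm: all inputs must be on the same XPU device");
  // In the caller, a scoped DeviceGuard makes the engine/stream queries
  // target the tensors' device:
  //   c10::DeviceGuard guard(mat1.device());
}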

dst_md = dnnl::memory::desc(dst_dims, dst_dt, dst_strides);

std::unordered_map<int, dnnl::memory> args;
dnnl::post_ops po = attr.extract_post_ops(dst);
Collaborator

The po appears to be unused. Has this file been added to the torch linter?

Collaborator Author

ZhiweiYan-96 Mar 10, 2025

  • Post-ops are not required at present. We can remove the post-op and add it back when it is necessary.
  • All files in xpu/detail/*.cpp are on the linter checking list. I have met this before; it seems the linter does not check for this unused-variable style issue.

ZhiweiYan-96 and others added 2 commits March 4, 2025 01:48
Comment on lines 29 to 30
auto engine = GpuEngineManager::Instance().get_engine(cur_device);
auto stream = GpuStreamManager::Instance().get_stream();
Collaborator

Suggested change
auto engine = GpuEngineManager::Instance().get_engine(cur_device);
auto stream = GpuStreamManager::Instance().get_stream();
auto& engine = GpuEngineManager::Instance().get_engine();
auto& stream = GpuStreamManager::Instance().get_stream();

Collaborator Author

Thanks for the information. I have updated the code.

@ZhiweiYan-96
Collaborator Author

ZhiweiYan-96 commented Mar 25, 2025

Update

  1. Removed all usage of the post-ops code (attr, post-op), since there is currently no post-op requirement for int4 gemm.
  2. Removed the sycl::event-related code. We will add it back when it is really required.

ZhiweiYan-96 requested a review from EikanWang March 25, 2025 08:33
EikanWang moved this from Pre-Review Required to Approved in PyTorch Intel Apr 7, 2025
EikanWang marked this pull request as ready for review April 7, 2025 13:17
EikanWang requested a review from gujinghui as a code owner April 7, 2025 13:17
@pytorchmergebot
Collaborator

Rebased gh/ZhiweiYan-96/47/orig onto refs/remotes/origin/viable/strict because #147962 was rebased, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/137566)

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #147962


pytorchmergebot pushed a commit that referenced this pull request Apr 8, 2025
…tration (#147962)

Pull Request resolved: #147962
Approved by: https://github.com/jerryzh168, https://github.com/guangyey, https://github.com/EikanWang
ghstack dependencies: #137566

Co-authored-by: xiaolil1 <xiaoli.liu@intel.com>
github-project-automation bot moved this from Approved to Done in PyTorch Intel Apr 8, 2025
timocafe pushed a commit to timocafe/pytorch that referenced this pull request Apr 16, 2025
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025
ghstack-source-id: c2c4f90
Pull Request resolved: pytorch/pytorch#137566
github-actions bot deleted the gh/ZhiweiYan-96/32/head branch May 15, 2025 02:18
Labels
ciflow/xpu Run XPU CI tasks Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: xpu Intel XPU related issues open source topic: not user facing topic category
Projects
Status: Done