
Conversation

bowang
Contributor

@bowang bowang commented Jun 7, 2017

This PR implements channel groups in convolutional layers (Conv1D, Conv2D, Conv3D, Conv2DTranspose).

Grouped convolution was first implemented in AlexNet as a way to share filter parameters across feature maps. A detailed discussion of this feature can be found here. It is supported by Caffe (doc) and used in its reference CaffeNet model. Adding this feature to TensorFlow makes it easier to compare models across the two frameworks and to migrate from Caffe to TensorFlow.
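
For context, a minimal sketch of the split/convolve/concatenate pattern this PR takes (not the PR's actual code; the helper name and shapes are illustrative):

```python
import tensorflow as tf

def grouped_conv2d(x, kernel, groups):
    """Grouped 2-D convolution: split the channels, convolve each group
    with its own slice of the kernel, and concatenate the results.

    x:      [batch, height, width, in_channels]
    kernel: [k_h, k_w, in_channels // groups, out_channels]
    """
    x_groups = tf.split(x, groups, axis=3)
    k_groups = tf.split(kernel, groups, axis=3)
    outputs = [tf.nn.conv2d(xg, kg, strides=[1, 1, 1, 1], padding='SAME')
               for xg, kg in zip(x_groups, k_groups)]
    return tf.concat(outputs, axis=3)

# Example: 128 input channels, 128 output channels, 8 groups.
x = tf.random.normal([4, 32, 32, 128])
kernel = tf.random.normal([3, 3, 16, 128])
y = grouped_conv2d(x, kernel, groups=8)
print(y.shape)  # (4, 32, 32, 128)
```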

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@jhseu
Contributor

jhseu commented Jun 7, 2017

Francois, mind commenting whether this would be useful to add?

@fchollet
Contributor

fchollet commented Jun 7, 2017

I don't think it's currently widely used enough to justify becoming part of the core layers. The proposed implementation would also not be useful in any practical setting, because it relies on manually splitting tensors, running independent convolution ops, and concatenating the outputs. This will be very slow and inefficient.

When we want to add support for convolution groups, it should happen at the op level, not be manually implemented as a graph of ops on the Python side. In theory, a convolution with groups (e.g. 8 groups in a 128-channel space) should be significantly faster than a regular convolution, but with this setup it would be dramatically slower. Since the added speed and efficiency is the core reason for using them, that is clearly a big issue.
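
To make the efficiency argument concrete, a back-of-the-envelope count of multiply-accumulates per output position, assuming 3x3 kernels and the 8-group / 128-channel example above:

```python
k, c_in, c_out, groups = 3, 128, 128, 8   # 3x3 kernels assumed for illustration

dense_macs = k * k * c_in * c_out                                     # 147456
grouped_macs = groups * k * k * (c_in // groups) * (c_out // groups)  # 18432

print(dense_macs / grouped_macs)  # 8.0, i.e. an 8x theoretical reduction
```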

@jhseu
Contributor

jhseu commented Jun 7, 2017

Thanks for the pull request! If you'd like to add support for this feature in the meantime, consider adding it to contrib, or reopening the feature request and discussing it there.

@jhseu jhseu closed this Jun 7, 2017
@bowang
Contributor Author

bowang commented Jun 7, 2017

The lack of group convolution prevents migrating many pre-trained models from Caffe to TensorFlow. For example, 3 out of 5 example Caffe models use this feature. Users have to manually split the Caffe kernel into multiple TensorFlow conv layers, which is especially laborious and error-prone for large networks. This manual approach is also done in Python and is no faster, if not slower, than doing the split inside the conv layer.

In terms of performance, if the group number is set to 1 (the default), there is no difference from the current version. It would be ideal to implement this feature in the op itself, but supporting it at the API level can be a first step: it lets users migrate to TensorFlow, and then we can profile and optimize.

@sleepfin

sleepfin commented Jun 9, 2017

@bowang
I think it's much slower when the group count is large using your code.
For example, group=32 is used in ResNeXt, and group convolution is more than 2 times slower than normal convolution (group=1).
(In my case, comparing group=32 against group=1, total_ops is 128% and total_params is 34%.)

I find this may have something to do with the device.
The results above are from running on GPU. When I run my network on CPU, group convolution seems to be more than 3 times faster than normal convolution.

Is the GPU somehow slower when the computation is loop-intensive?

Is there any way to optimize this?

@bowang
Contributor Author

bowang commented Jun 9, 2017

@sleepfin do you mean a 32-group convolution is 2x slower than 32 normal convolutions?

It is understandable that group convolution is slower than normal convolution, since it decomposes a big convolution into multiple smaller convolutions; batching usually helps performance.

A fair comparison should be one 32-group 4-channel convolution vs 32 single-group 4-channel convolutions, rather than one single-group 128-channel convolution.

@fchollet
Contributor

fchollet commented Jun 9, 2017

Again, if we add it, it should be done at the op level and should be fast.

If we add a feature as part of the core API, then users should have a reasonable expectation that the feature does what it claims to do. People's motivation for using convolution groups is that they should result in cheaper, faster convolutions. If we provide this option but the result is actually slower convolutions, we are not meeting user expectations.

@sleepfin

sleepfin commented Jun 10, 2017

@bowang
But in a real case we usually replace "one single-group 128-channel convolution" with a "32-group 4-channel convolution", as in ResNeXt, which means the group parameter is either 1 or 32.
Also notice that on a CPU device the "32-group 4-channel convolution" is faster than the "one single-group 128-channel convolution", but it is the opposite on a GPU device.
My point is that it would be great to have a high-performance group convolution on both CPU and GPU.

@sleepfin

@bowang
Another question:
Are the BN, activation, and bias_add layers applied to:
1 -> the output of a single path of the conv groups, or
2 -> the output of the final concat layer?
In your code, I think it will be the first case if I use arg_scope.
I'm not sure whether it is correct, and more efficient, to apply those layers to the final concat output.

@bowang
Contributor Author

bowang commented Jun 13, 2017

@sleepfin Thanks for the performance measurements! My hypothesis is that the GPU/CPU performance difference comes from how group convolution changes both the per-op and the total computation workload.

Since the current implementation decomposes a single large convolution into a number of small convolutions, the workload of each small convolution may not be enough to fully utilize a GPU. This drop in GPU utilization may explain the slowdown.

Meanwhile, the total workload of the group convolution is smaller than that of a single dense convolution. Since a CPU has much lower compute capability, it is always saturated, so the speedup on CPU comes from the reduction in total workload.

This is just my hypothesis, but I hope it generates some ideas for further performance debugging.

@bowang
Contributor Author

bowang commented Jun 13, 2017

@sleepfin Bias and activation are applied on the concatenated results of the conv groups, as you can see on line 194 (concat), line 218 (bias add), and line 221 (activation). BN is implemented as a separate layer, which I think is out of the scope of the conv layers.
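
A rough sketch of that ordering (not the PR's actual code; the tensors below are stand-ins):

```python
import tensorflow as tf

# Stand-ins for the per-group convolution outputs (32 groups, 4 channels each).
group_outputs = [tf.random.normal([8, 32, 32, 4]) for _ in range(32)]
bias = tf.zeros([128])

y = tf.concat(group_outputs, axis=3)  # concatenate the group outputs first
y = tf.nn.bias_add(y, bias)           # then a single bias add over all 128 channels
y = tf.nn.relu(y)                     # then the activation
```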

@kwotsin
Contributor

kwotsin commented Jul 11, 2017

A recent paper highlights the possible usefulness of group convolutions for mobile uses: https://arxiv.org/abs/1707.01083

Looking forward to seeing group convolution implemented in TF.

@kwotsin
Contributor

kwotsin commented Jul 13, 2017

> Again, if we add it, it should be done at the op level and should be fast.

@fchollet Do you have a typical pathway to recommend for people interested in efficiently implementing a layer in TF (i.e., where specifically at the op level should one implement it)?

@lgeiger
Contributor

lgeiger commented Dec 19, 2019

@fchollet #25818 added support for Conv2D group convolutions using cuDNN on GPUs. Would you be open to revisiting this addition for Keras? I'd be happy to send a PR.
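
As I understand the op-level support added there, a grouped convolution can then be expressed without any Python-side splitting, by giving the kernel fewer input channels than the input tensor (GPU/cuDNN only at that point; shapes below are illustrative):

```python
import tensorflow as tf

x = tf.random.normal([8, 32, 32, 128])   # 128 input channels
f = tf.random.normal([3, 3, 16, 128])    # 16 = 128 / 8, i.e. 8 groups

# The op performs a grouped convolution when the kernel's input-channel
# dimension is a divisor of the input's channel count.
y = tf.nn.conv2d(x, f, strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)  # (8, 32, 32, 128)
```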

@danijar
Contributor

danijar commented Jan 7, 2020

@lgeiger @fchollet It would be great to have group convolutions and their transposed version in Keras. There are at least depthwise convolutions, but no DepthwiseConv2DTranspose for building image decoders.
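
Until such a layer exists, one possible (and admittedly slow) workaround in the spirit of this thread is to split the channels and run one Conv2DTranspose per channel; a hypothetical sketch (the helper name is made up, not an existing API):

```python
import tensorflow as tf

def depthwise_conv2d_transpose(x, kernel_size=3, strides=2):
    """Emulate a depthwise transposed convolution by upsampling each channel
    with its own single-filter Conv2DTranspose and concatenating the results."""
    channels = tf.split(x, x.shape[-1], axis=-1)
    ups = [tf.keras.layers.Conv2DTranspose(1, kernel_size, strides=strides,
                                           padding='same')(c)
           for c in channels]
    return tf.concat(ups, axis=-1)

y = depthwise_conv2d_transpose(tf.random.normal([4, 8, 8, 16]))
print(y.shape)  # (4, 16, 16, 16)
```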
