🚀 Feature
Communication compression
PyTorch features several DDP Communication Hooks that compress messages exchanged between workers in distributed optimization. If communication time is a bottleneck, these hooks can speed up distributed training. Of course, compression is only beneficial if the time required to compress the messages is significantly smaller than the time spent communicating. Speeding up compression times can open up communication savings to a wider range of models and hardware.
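As a concrete illustration, here is a minimal sketch of attaching the PowerSGD communication hook to a DDP model. It assumes a distributed process group has already been initialized (e.g. via torchrun) with one GPU per process; the toy model, device handling, and parameter values are illustrative only.

```python
# Minimal sketch: registering the PowerSGD DDP communication hook.
# Assumes torch.distributed is already initialized and one GPU per process.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Illustrative single-node device selection and toy model.
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[device.index])

state = powerSGD.PowerSGDState(
    process_group=None,           # use the default process group
    matrix_approximation_rank=1,  # rank of the low-rank approximation
    start_powerSGD_iter=10,       # warm-up iterations with uncompressed allreduce
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)
```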
PowerSGD
For training problems with a strong communication bottleneck, the current PowerSGD hook in PyTorch already improves training times (@SciPioneer), but a recent paper argues that in many settings the communication savings do not yet outweigh the added compression time.
Proposed optimizations
The recent DALL-E paper uses PowerSGD for large-scale distributed training, and the paper's appendix contains many recommendations on how to implement the algorithm efficiently. The most actionable recommendation is:
- the creation of a specialized CUDA kernel for the orthogonalization of matrices with many rows and few columns.
Orthogonalization is the most expensive step in PowerSGD compression, and based on timing results from the DALL-E authors, there is potential for speedups of up to 100x in this operation.
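For context, the matrices being orthogonalized are tall and skinny (one column per rank of the low-rank approximation). The sketch below shows a per-column Gram-Schmidt loop of this kind; it is illustrative rather than the exact hook implementation, and the shapes and names are assumptions.

```python
# Illustrative sketch of per-column Gram-Schmidt orthogonalization on a tall,
# skinny matrix (many rows, few columns), the shape PowerSGD works with.
# Not the exact PyTorch implementation; shapes and names are assumptions.
import torch

def orthogonalize_columns(m: torch.Tensor, eps: float = 1e-8) -> None:
    """Orthonormalize the columns of `m` in place, one column at a time."""
    for i in range(m.shape[1]):
        col = m[:, i : i + 1]
        col /= torch.norm(col) + eps          # normalize the current column
        if i + 1 < m.shape[1]:
            rest = m[:, i + 1 :]
            rest -= (col.t() @ rest) * col    # remove the component along `col`

# Example: a flattened gradient of ~1M elements approximated with rank 4.
p = torch.randn(1_000_000, 4)
orthogonalize_columns(p)
```

Each column requires its own norm, matrix-vector product, and update, so for these shapes the loop spends much of its time on many small kernel launches; presumably this overhead is what a specialized, fused CUDA kernel would eliminate.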
Benefits
- With faster compression, communication compression will yield speedups for a wider range of models and on faster communication hardware.
- A faster orthogonalization operation will allow PowerSGD to use higher ranks (more accurate compression), which currently make compression too slow. With more accurate compression, we can avoid drops in model accuracy (see the sketch after this list).
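To make the rank trade-off concrete, here is a back-of-the-envelope sketch with hypothetical layer dimensions: PowerSGD communicates two factors of shapes n x r and m x r instead of the full n x m gradient, so the communicated volume grows linearly with the rank r.

```python
# Back-of-the-envelope compression ratios for PowerSGD (hypothetical shapes):
# an n x m gradient is replaced by two factors of shapes n x r and m x r.
n, m = 4096, 4096
for r in (1, 4, 32):
    ratio = (n * m) / (n * r + m * r)
    print(f"rank={r:2d}: ~{ratio:7.1f}x fewer elements communicated")
```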
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang