[RFC] Introducing NumPy-compatible coding experience into MXNet #14253
Motivation
Today, deep learning scientists spend the majority of their time on data processing, debugging tensor algorithms, and tuning model parameters, rather than architecting models from scratch, thanks to the abundance of pre-trained models in deep learning model zoos. This has made the usability of tensor APIs a key factor in whether a framework is widely adopted.
MXNet was initially designed with a focus on memory efficiency, computation throughput, and scalability. Usability problems have begun to surface as more and more models exhibit dynamic behavior, e.g. tensor shapes unknown before runtime, control flow depending on runtime results, etc. Here we highlight the most frequent usability complaints from users.
- Scalar tensors (a.k.a. zero-dim tensors) are not supported. For example, given `a = [0, 1, 2]`, `a[1]` will generate an `NDArray` of shape `(1,)`, instead of `()` as in NumPy.
- Zero-size tensors are not supported. For example, a tensor of shape `(0, 16, 256)` cannot be passed to an operator, because our system currently treats 0, the first dimension size, as unknown rather than as a concrete number.
- Many operators' signatures and functionality are not NumPy compatible, e.g. `nd.dot` vs. `np.dot`, `nd.concatenate` vs. `np.concatenate`, etc.
- Many NumPy operators are missing. See the reference link to GitHub issues.
- Operators whose output shapes can only be determined at runtime are not supported, e.g. `data[data < 0]` cannot run.
- The programming experience is diverged due to the separation of imperative and symbolic operators registered under `mxnet.ndarray` and `mxnet.symbol`.
- Control flow operators are hard to use. Users have to understand their complicated signatures, instead of writing native Python code using `for`, `while`, `if/else`, etc.
For example, we have learned (the hard way) that it does not make much sense to ask users to write code like the following to perform a cumulative sum.
```python
import mxnet.ndarray as F
data = F.arange(5)

def sum(state, i):
    s = state + data[i]
    return s, [s, i + 1]

def sum_cond(state, i):
    return i < 4

out, state = F.contrib.while_loop(sum_cond, sum, [F.zeros((1,)), F.zeros((1,))],
                                  max_iterations=5)
```
Instead, users should be able to write native Python code like the following and, if required, let the framework serialize it into a computation graph for optimization and deployment.
```python
import numpy as np

data = np.arange(5)
out = 0
i = 0
while i < 5:
    out = out + data[i]
    i = i + 1
```
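To make the first pain point above equally concrete, the scalar-tensor discrepancy can be reproduced in a few lines. This is a minimal sketch assuming `mxnet` and `numpy` are importable; the printed shapes reflect MXNet's behavior at the time of writing.

```python
import mxnet as mx
import numpy as np

a = mx.nd.array([0, 1, 2])
print(a[1].shape)   # (1,) -- MXNet returns a one-element tensor

b = np.array([0, 1, 2])
print(b[1].shape)   # ()   -- NumPy returns a true zero-dim scalar
```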
It is not hard to see that all of the above pain points stem from the lack of a NumPy-compatible coding experience in MXNet. Properly supporting control flow and consolidating the imperative and symbolic coding styles require fundamental changes to the codebase, such as a new graph IR and executor; this is extremely non-trivial and should be executed with a long-term plan. In the meantime, however, we can improve usability by fixing the handling of zero-dim/zero-size tensors and by implementing NumPy operators in MXNet. The following sections discuss how to achieve these short-term goals.
Support of zero-dim and zero-size tensors
What's the problem?
Zero-dim and zero-size tensors are valid tensors in NumPy. The former, whose shape is `()`, represents a scalar in `numpy.ndarray` format. The latter, which has one or more dimensions of size zero in its shape, is useful as a placeholder in many `ndarray` operations, such as concatenating a zero-size `ndarray` with another `ndarray`. MXNet does not support them because the empty shape `()` and shapes containing zero dimension sizes are reserved to indicate unknown shape information, which must be filled in during the shape inference stage before moving forward to tensor computations.
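For reference, NumPy treats both kinds of tensors as first-class values. A minimal demonstration using plain NumPy only:

```python
import numpy as np

s = np.array(3.14)          # zero-dim tensor: shape is ()
print(s.shape, s.ndim)      # () 0

z = np.zeros((0, 16, 256))  # zero-size tensor: first dim size is 0
out = np.concatenate([z, np.ones((2, 16, 256))], axis=0)
print(out.shape)            # (2, 16, 256) -- z acts as a placeholder
```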
How to resolve the problem?
We can first change the current semantics to comply with the NumPy definition:
- Change the definition of unknown shapes from `ndim = 0` to `ndim = -1` in the `TShape` class.
- Change the definition of unknown dimension sizes from `dim_size = 0` to `dim_size = -1` in the `TShape` class.
After this, we need to scan the entire codebase and modify the code accordingly wherever `shape.ndim() == 0` or `shape.Size() == 0` is used to check for unknown shapes.
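In Python-flavored pseudocode, the semantic shift looks roughly as follows. The helper names are illustrative only; they mirror the C++ `TShape` checks described above rather than any existing API.

```python
# Current semantics: 0 is overloaded to mean "unknown".
def shape_is_known_old(shape):
    return len(shape) != 0 and all(d != 0 for d in shape)

# Proposed semantics: -1 means "unknown"; () and 0 become valid values.
def shape_is_known_new(ndim, shape):
    return ndim != -1 and all(d != -1 for d in shape)

assert not shape_is_known_old(())           # () was "unknown" before
assert shape_is_known_new(0, ())            # now a valid scalar shape
assert shape_is_known_new(3, (0, 16, 256))  # now a valid zero-size shape
```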
Please note that although MXNet's shape is a type inheriting from `nnvm::Tuple`, which is often used to represent a list-like object such as `axis=(1, 2, 3)`, we will not change the meaning of an empty tuple. This separation of definitions keeps the roles of empty shapes and empty tuples clearly decoupled.
We propose to break down the effort into the following steps.
1. Copy `tuple.h` from NNVM to MXNet and rename `nnvm::TShape` to `mxnet::TShape`.
2. Replace all the places in MXNet where `nnvm::Tuple` and `nnvm::TShape` are used with `mxnet::Tuple` and `mxnet::TShape`, respectively.
3. Change the definition of `TShape` in `tuple.h` to use `ndim = -1` to indicate unknown shapes and `dim_size = -1` to indicate unknown dimension sizes.
4. Modify all the existing shape inference and utility functions where `ndim == 0` and `dim_size == 0` are used to accommodate the above changes.
5. Modify the NNVM passes `InferShape`, `PlanMemory`, and `Gradient`, where `nnvm::TShape` is used, to accommodate the above changes.
6. Add sufficient unit tests.
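As an illustration of what the last step could cover, a test might assert that zero-dim and zero-size shapes flow through basic operators once the new semantics are in place. This is a hypothetical sketch; the assertions only hold after the changes above land.

```python
import mxnet as mx

def test_zero_dim_and_zero_size():
    # Zero-dim: indexing should yield shape () as in NumPy (post-change).
    a = mx.nd.array([0, 1, 2])
    assert a[1].shape == ()

    # Zero-size: a 0-length dimension should pass through operators.
    z = mx.nd.zeros((0, 16, 256))
    out = mx.nd.concat(z, mx.nd.ones((2, 16, 256)), dim=0)
    assert out.shape == (2, 16, 256)
```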
How is backward compatibility guaranteed?
By default, we do not change the original definition of output shapes in shape inference functions; we only change `ndim == 0` to `ndim == -1` for unknown-shape verification. No backward compatibility issues are expected for all but one case: `NDArray` indexing. To elaborate, the current behavior guarantees that `x[i]` always returns a tensor with `ndim >= 1`. We can keep this behavior unchanged by default and implement a global switch that users can turn on to get NumPy-compatible results.
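One possible shape of such a switch is sketched below; the function name and scoping are hypothetical, not a committed API.

```python
import mxnet as mx

a = mx.nd.array([0, 1, 2])
print(a[1].shape)         # (1,) -- legacy behavior, on by default

mx.set_np_compat(True)    # hypothetical switch enabling NumPy semantics
print(a[1].shape)         # ()   -- NumPy-compatible zero-dim result
mx.set_np_compat(False)   # restore legacy behavior
```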
Previous discussion of this topic can be seen here.
Implementation of NumPy operators
What to do?
To address the problem of operator incompatibility with NumPy, and to alleviate the pain of the diverged programming experience caused by the operator namespace separation (`mxnet.ndarray` and `mxnet.symbol`), we propose creating a new namespace `mxnet.numpy`, adopting operator APIs from NumPy, and implementing those operators under the new namespace. `mxnet.numpy` should provide the same imperative programming experience as NumPy and will gradually replace all the non-neural-network operators in the current codebase. While implementing NumPy operators in MXNet, we may be able to leverage TVM to generate high-performance kernels (ref.).
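Under this proposal, user code could look essentially like NumPy code. The snippet below is illustrative only, since `mxnet.numpy` does not exist yet at the time of writing.

```python
from mxnet import numpy as mnp  # proposed namespace (hypothetical)

x = mnp.arange(6).reshape(2, 3)
y = mnp.ones((3, 2))
z = mnp.dot(x, y)   # NumPy-compatible signature, executed by MXNet
print(z.shape)      # (2, 2)
```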
Can `mxnet.numpy` operators be used in Gluon for hybridization?
The newly implemented NumPy operators can still be accessed through the module (`ndarray`/`symbol`) delegate `F` in Gluon, e.g. `F.numpy.dot`. This works because the new operators are still registered under `mxnet.ndarray` and `mxnet.symbol` behind the scenes. Users are simply encouraged to access NumPy operator APIs through `mxnet.numpy` when writing pure imperative code, and through the Gluon APIs for a hybrid coding experience.
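For example, a `HybridBlock` could call the proposed operators through `F` as sketched below (illustrative until the namespace is merged):

```python
import mxnet as mx
from mxnet.gluon import HybridBlock

class Dot(HybridBlock):
    def hybrid_forward(self, F, x, y):
        # F resolves to mxnet.ndarray (imperative) or mxnet.symbol
        # (hybridized); F.numpy.dot is the proposed NumPy-style operator.
        return F.numpy.dot(x, y)

block = Dot()
block.hybridize()  # the same code now builds a symbolic graph
```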
Where to contribute code?
A dev branch has been opened for this proposal.
https://github.com/apache/incubator-mxnet/tree/numpy