
[RFC] Introducing NumPy-compatible coding experience into MXNet #14253

@reminisce

Description

Motivation

Today, deep learning scientists spend the majority of their time on data processing, debugging tensor algorithms, and tuning model parameters, rather than architecting models from scratch, thanks to the abundance of pre-trained models available in deep learning model zoos. This shift has made the usability of tensor APIs a key factor in whether a framework is widely adopted.

MXNet was initially designed with a focus on memory efficiency, computation throughput, and scalability. Usability problems have begun to surface as more and more models exhibit dynamic behavior, e.g. tensor shapes unknown until runtime, control flow depending on runtime results, etc. Here we highlight the most frequent usability complaints from users.

  • Scalar tensors (aka zero-dim tensors) are not supported. For example, given a = [0, 1, 2], a[1] will generate an NDArray of shape (1,), instead of () as in NumPy.
  • Zero-size tensors are not supported. For example, a tensor of shape (0, 16, 256) cannot be passed to an operator, because our system currently treats the first dimension size, 0, as unknown rather than as a concrete size.
  • Many operators' signatures and functionality are not NumPy compatible, e.g. nd.dot vs. np.dot, nd.concatenate vs. np.concatenate, etc. (a concrete dot comparison follows the code examples below).
  • Many NumPy operators are missing. See the reference link to GitHub issues.
  • Operators whose outputs' shapes can only be determined at runtime are not supported, e.g. data[data < 0] cannot run.
  • A divergent programming experience due to the separation of imperative and symbolic operators registered under mxnet.ndarray and mxnet.symbol.
  • Control flow operators are hard to use. Users have to understand the complicated signatures of control flow operators instead of writing native Python code with for, while, if/else, etc.
    For example, we have learned (the hard way) that it does not make much sense to ask users to write code like the following to perform a cumulative sum.
def sum(state, i):
    # Loop body: add data[i] to the running sum and advance the counter.
    s = state + data[i]
    return s, [s, i + 1]

def sum_cond(state, i):
    # Loop condition: keep iterating while i < 4.
    return i < 4

out, state = F.contrib.while_loop(sum_cond, sum, [F.zeros((1,)), F.zeros((1,))],
                                  max_iterations=5)

Instead, users should be able to write native Python code like the following and, if required, let the framework serialize it into a computation graph for optimization and deployment.

import numpy as np

data = np.arange(5)
out = 0
i = 0
while i < 5:
    out = out + data[i]
    i += 1
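To make the operator-compatibility pain point above concrete, here is a minimal comparison of np.dot and nd.dot for inputs with more than two dimensions (shapes follow the two libraries' documented semantics): np.dot contracts the last axis of the first argument with the second-to-last axis of the second, while nd.dot contracts it with the first axis of the second.

import numpy as np

a = np.ones((2, 3, 4))
b = np.ones((4, 5, 6))
try:
    np.dot(a, b)  # np.dot pairs a's last axis (4) with b's second-to-last (5)
except ValueError as e:
    print(e)      # shapes (2,3,4) and (4,5,6) not aligned

# mx.nd.dot pairs a's last axis with b's *first* axis instead, so the same
# shapes do compose there and yield a (2, 3, 5, 6) result:
#   import mxnet as mx
#   mx.nd.dot(mx.nd.ones((2, 3, 4)), mx.nd.ones((4, 5, 6))).shape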

All of the above pain points boil down to the lack of a NumPy-compatible coding experience in MXNet. Fully addressing control flow support, and consolidating the imperative and symbolic coding styles into a single, more flexible one, requires fundamental changes to the codebase, such as a new graph IR and executor; that work is extremely non-trivial and should be executed with a long-term plan. In the meantime, we can improve usability by fixing the handling of zero-dim/zero-size tensors and implementing NumPy operators in MXNet. The rest of this proposal discusses how to achieve these short-term goals.

Support of zero-dim and zero-size tensors

What's the problem?

Zero-dim and zero-size tensors are valid tensors in NumPy. The former, whose shape is (), represent scalars in numpy.ndarray form. The latter, which have one or more dimensions of size zero in their shapes, are useful as placeholders in many ndarray operations, such as concatenating a zero-size ndarray with another ndarray. MXNet does not support either, because the empty shape () and zero dimension sizes are reserved to indicate unknown shape information, which must be filled in during the shape inference stage before tensor computation can proceed.
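For illustration, both kinds of tensors are perfectly legal in NumPy:

import numpy as np

a = np.arange(3)
s = a[1]
print(s.shape)                 # (): a zero-dim (scalar) tensor

z = np.zeros((0, 16, 256))     # a zero-size tensor: the first dimension is 0
print(z.size)                  # 0
out = np.concatenate([z, np.ones((2, 16, 256))], axis=0)
print(out.shape)               # (2, 16, 256): z acts as a placeholder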

How to resolve the problem?

We can first change the current semantics to comply with the NumPy definition:

  1. Change the definition of unknown shapes from ndim = 0 to ndim = -1 in the TShape class.
  2. Change the definition of unknown dimension sizes from dim_size = 0 to dim_size = -1 in the TShape class.

After this, we need to audit the entire codebase and update every place where shape.ndim() == 0 or shape.Size() == 0 is used to check for unknown shapes, as sketched below.
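The following is a minimal Python sketch of the old versus new encodings (plain Python mirroring the proposed C++ convention; the helper names are illustrative, not the actual TShape API):

UNKNOWN = -1  # the new sentinel, replacing the old sentinel value 0

def ndim_is_known(ndim):
    return ndim != UNKNOWN

def dim_size_is_known(dim_size):
    return dim_size != UNKNOWN

# Previously-reserved values become legal shapes under the new semantics:
assert ndim_is_known(0)        # ndim == 0 now means a scalar tensor, shape ()
assert dim_size_is_known(0)    # dim_size == 0 now means a zero-size dimension
assert not ndim_is_known(-1)   # -1 marks a shape still unknown before inference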

Please note that although MXNet's shape type inherits from nnvm::Tuple, which is often used to represent a list-like object such as axis=(1, 2, 3), we will not change the meaning of an empty tuple. Keeping separate definitions for the empty shape and the empty tuple keeps their roles clearly decoupled.
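NumPy itself already keeps these two roles apart, as the short example below shows: an empty tuple passed as axis is an ordinary empty list-like argument, whereas an empty shape denotes a scalar.

import numpy as np

a = np.arange(6).reshape(2, 3)
print(np.sum(a, axis=()).shape)  # (2, 3): axis=() means "reduce over no axes"
print(np.sum(a).shape)           # (): a full reduction yields a zero-dim scalar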

We propose to break down the effort into the following steps.

  1. Copy tuple.h from NNVM to MXNet and rename nnvm::TShape to mxnet::TShape.
  2. Replace all the places in MXNet where nnvm::Tuple and nnvm::TShape are used with mxnet::Tuple and mxnet::TShape, respectively.
  3. Change the definition of TShape in tuple.h to use ndim = -1 to indicate unknown shapes and dim_size = -1 to indicate unknown shape dim sizes.
  4. Modify all the existing shape inference and utility functions where ndim == 0 and dim_size == 0 are used to accommodate the above changes.
  5. Modify NNVM passes, InferShape, PlanMemory, and Gradient, where nnvm::TShape is used, to accommodate the above changes.
  6. Add sufficient unit tests.

How is backward compatibility guaranteed?

By default, we do not change the original definition of output shapes in shape inference functions; we only change ndim == 0 to ndim == -1 for unknown-shape verification. No backward compatibility issues are expected except in one case: NDArray indexing. Specifically, the current behavior guarantees that x[i] always returns a tensor with ndim >= 1. We can keep that behavior unchanged by default and provide a global switch that users can turn on to get NumPy-compatible results.
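A sketch of what the opt-in behavior could look like (the switch name below is hypothetical; the concrete API would be decided as part of this proposal):

import mxnet as mx

a = mx.nd.array([0, 1, 2])
print(a[1].shape)        # (1,): the current behavior, ndim >= 1 is preserved

# Hypothetical global switch to opt into NumPy-compatible indexing:
#   mx.set_np_shape(True)
#   print(a[1].shape)    # would then print (), matching NumPy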

Previous discussion of this topic can be seen here.

Implementation of NumPy operators

What to do?

To address the problem of operator incompatibility with NumPy, and to alleviate the pain of the diverged programming experience caused by the operator namespace separation (mxnet.ndarray vs. mxnet.symbol), we propose creating a new namespace, mxnet.numpy, adopting the operator APIs from NumPy, and implementing those APIs under the new namespace. mxnet.numpy should provide the same imperative programming experience as NumPy and will gradually replace all the non-neural-network operators in the current codebase. While implementing NumPy operators in MXNet, we can also leverage TVM to generate high-performance kernels (ref.).
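A sketch of the imperative experience this namespace is intended to provide (the import path follows the proposal above and may evolve):

from mxnet import numpy as np   # the proposed NumPy-compatible namespace

a = np.arange(12).reshape(3, 4)
b = np.ones((4, 2))
c = np.dot(a, b)                # same signature and semantics as numpy.dot
print(c.shape)                  # (3, 2)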

Can mxnet.numpy operators be used in Gluon for hybridization?

The newly implemented NumPy operators can still be accessed through the module (ndarray/symbol) delegate F in Gluon, e.g. F.numpy.dot. This works because the new operators are still registered under mxnet.ndarray and mxnet.symbol behind the scenes. Users are simply encouraged to access the NumPy operator APIs through mxnet.numpy when writing pure imperative code, and through the Gluon APIs when a hybrid coding experience is desired.
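A hedged sketch of what hybridizable Gluon code could look like under this proposal (F.numpy.dot is the proposed delegate path, not a shipped API; the surrounding code uses the existing Gluon parameter idioms):

from mxnet.gluon import HybridBlock

class ProposedDense(HybridBlock):
    def __init__(self, units, in_units, **kwargs):
        super(ProposedDense, self).__init__(**kwargs)
        with self.name_scope():
            self.weight = self.params.get('weight', shape=(in_units, units))

    def hybrid_forward(self, F, x, weight):
        # F resolves to mxnet.ndarray when run imperatively and to
        # mxnet.symbol after hybridize(); either way the NumPy-compatible
        # operator is reached through the proposed F.numpy delegate.
        return F.numpy.dot(x, weight)

Calling hybridize() on such a block would then trace the same code into a symbolic graph, exactly as with today's F.dot.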

Where to contribute code?

A dev branch has been opened for this proposal.
https://github.com/apache/incubator-mxnet/tree/numpy

@junrushao1994 @szha @eric-haibin-lin @zheng-da @yzhliu
