Prevent Unions during outer concatenation with awkward arrays #898

@ivirshup

cc: @grst

We would like to stop returning unions from outer concatenation with awkward arrays.

This presents us with some options about what we'd like to return, and some questions about how we're going to implement it. We haven't implemented anything yet, since we are blocked by a few bugs in awkward array. The tooling for handling unions should also be improving in 2.1.x.

Proposed output

Non-universal fields

When combining records, fields not contained in all arrays become optional.

e.g.

  • Input:
    • n * var * {a: int, b: int}
    • m * var * {a: int}
  • Output:
    • (n + m) * var * {a: int, b: ?int}
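The rule above can be sketched in plain Python as a toy model. It works on dicts of type strings rather than awkward's type objects, `merge_record_fields` is a hypothetical name, and it assumes that fields sharing a name also share a type:

```python
from __future__ import annotations


def merge_record_fields(records: list[dict[str, str]]) -> dict[str, str]:
    """Toy model: fields missing from any input record become optional ("?type")."""
    # Collect every field name, keeping the first type seen for it.
    # Assumption: same-named fields have the same type in every input.
    first_seen: dict[str, str] = {}
    for fields in records:
        for name, typ in fields.items():
            first_seen.setdefault(name, typ)

    merged: dict[str, str] = {}
    for name, typ in first_seen.items():
        in_all = all(name in fields for fields in records)
        already_optional = typ.startswith("?")
        merged[name] = typ if in_all or already_optional else "?" + typ
    return merged


merged = merge_record_fields([{"a": "int", "b": "int"}, {"a": "int"}])
print(merged)  # {'a': 'int', 'b': '?int'}
```

Only `b` is marked optional here, because `a` appears in both inputs.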

Fill values

Empty variable-length arrays will be used as fill values when there are missing entries.

  • Input:
    • n * var * int
    • None
  • Output:
    • n * var

This will come up when we need to expand an axis for outer join, or when an AnnData being concatenated does not have a value.
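As a rough illustration of the intended fill behaviour, here is a plain-Python sketch that uses nested lists in place of awkward arrays. `concat_with_fill` is a hypothetical helper, and a missing input is represented by `None` together with the length it would have had:

```python
from __future__ import annotations


def concat_with_fill(
    arrays: list[list[list[int]] | None], lengths: list[int]
) -> list[list[int]]:
    """Concatenate ragged arrays, filling each missing input with empty lists."""
    out: list[list[int]] = []
    for arr, n in zip(arrays, lengths):
        if arr is None:
            # Missing input: contribute `n` empty variable-length entries.
            out.extend([] for _ in range(n))
        else:
            out.extend(arr)
    return out


filled = concat_with_fill([[[1, 2], [3]], None], [2, 3])
print(filled)  # [[1, 2], [3], [], [], []]
```

The result keeps one entry per original row, so its length is the sum of the input lengths.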

Potential issue

This works only when the type has an empty value. That holds for simple ragged arrays, but not for every array type that awkward can represent. For example, n * 20 * int cannot use "empty arrays" as the fill value because all dimension sizes are fixed. I believe that as long as one of the dimensions is variable we can make this work. Otherwise we would need a null fill value.
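The condition above can be expressed as a small check. This is a sketch on a simplified description of the type (the dimensions below the outermost axis, given as integers or the string "var"), not on awkward's own type objects, and `can_use_empty_fill` is a hypothetical name:

```python
from __future__ import annotations


def can_use_empty_fill(inner_dims: list[int | str]) -> bool:
    """Return True if at least one dimension below the outer axis is variable.

    `inner_dims` lists the dimensions after the outermost length,
    e.g. ["var"] for n * var * int and [20] for n * 20 * int.
    """
    return any(dim == "var" for dim in inner_dims)


print(can_use_empty_fill(["var"]))      # n * var * int
print(can_use_empty_fill([20]))         # n * 20 * int
print(can_use_empty_fill([20, "var"]))  # n * 20 * var * int
```

When the check fails, the fill would have to be a null value, as noted above.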

Related discussion

Implementation

Here's an implementation of the internals. We'll still need the ability to merge unions from upstream to actually implement this.

Defining `create_empty_array` + tests
from __future__ import annotations
import awkward as ak
import numpy as np


def empty_lists(length: int) -> ak.contents.ListArray:
    return ak.contents.ListArray(
        ak.index.Index(np.broadcast_to(np.int32(0), length)),
        ak.index.Index(np.broadcast_to(np.int32(0), length)),
        ak.contents.EmptyArray(),
    )

def create_empty_array(length: int, typ: ak.types.Type) -> ak.contents.Content | ak.Array:
    """Create an empty array of a set length and type.
    
    Requires that the type can be "empty", (i.e. variable length)
    
    Parameters
    ----------
    length
        Length of the array to create
    typ
        Type of the array to create
    """
    if isinstance(typ, ak.types.ListType):
        return empty_lists(length)
    elif isinstance(typ, ak.types.RecordType):
        return ak.contents.RecordArray(
            [create_empty_array(length, t) for t in typ.contents],
            typ.fields,
            length
        )
    elif isinstance(typ, ak.types.ArrayType):
        # Strip off the high-level array wrapper
        return ak.Array(create_empty_array(length, typ.content))
    elif isinstance(typ, ak.types.OptionType):
        # option should propagate by itself
        return create_empty_array(length, typ.content)
    elif isinstance(typ, ak.types.RegularType):
        return ak.contents.RegularArray(
            create_empty_array(typ.size * length, typ.content),
            typ.size
        )
    elif isinstance(typ, ak.types.UnionType):
        raise NotImplementedError("Union type not implemented")
    elif isinstance(typ, ak.types.NumpyType):
        raise ValueError("Fixed size type, cannot contain empty arrays")
    else:
        raise Exception("Should be unreachable")


# Tests

def check_empty_array(input_array):
    empty_array = create_empty_array(5, input_array.type)
    # Check type (but remove outer array type)
    combined = ak.concatenate([input_array, empty_array])
    assert combined.type.content == input_array.type.content
    assert len(combined) == len(input_array) + 5

def test_empty_arrays():
    # 2 * {a: var * int64, b: var * int64}
    check_empty_array(ak.Array([{"a": [1, 2], "b": [1, 2]}, {"a": [3], "b": [4]}]))

    # 2 * {a: var * int64, b: option[var * int64]}
    check_empty_array(ak.Array([{"a": [1, 2], "b": [1, 2]}, {"a": [3]}]))

    # 2 * var * {a: int64, b: string}
    check_empty_array(ak.Array([[{"a": 1, "b": "foo"}], [{"a": 2, "b": "bar"}, {"a": 3, "b": "baz"}]]))

    # 2 * var * {a: int64, b: ?string}
    check_empty_array(ak.Array([[{"a": 1, "b": "foo"}], [{"a": 2, "b": "bar"}, {"a": 3}]]))


test_empty_arrays()  # Run tests

In the end, outer concatenation along axis 0 for awkward arrays in obsm or varm would look something like:

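def outer_join_awkward(
    arrays: list[ak.Array | MissingType], lengths: list[int]
) -> ak.Array:
    # `MissingType` and `not_missing` are placeholders that are not defined in this issue.
    # Concatenating zero-length slices gives the merged type without copying any data.
    result_type = ak.merge_union_of_records(
        ak.concatenate([a[0:0] for a in arrays if not_missing(a)])
    ).type
    return ak.concatenate(
        [
            a if not_missing(a) else create_empty_array(length, result_type)
            for a, length in zip(arrays, lengths)
        ]
    )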
