Description
cc: @grst
We would like to stop returning unions from outer concatenation with awkward arrays.
This presents us with some options about what we'd like to return, and some questions about how we're going to implement it. We haven't implemented anything yet, since we are blocked by a few bugs in awkward array. The tooling for handling unions should also be improving in 2.1.x.
Proposed output
Non-universal fields
When combining records, fields not present in every array become optional.
For example:
- Input:
n * var * {a: int, b: int}
m * var * {a: int}
- Output:
(n + m) * var * {a: int, b: ?int}
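The field rule above can be sketched without awkward at all. This is only an illustration: record types are modelled as plain `{field: type-string}` dicts, `merge_record_fields` is a hypothetical helper name, and it assumes a field shared between inputs has the same type everywhere.

```python
from __future__ import annotations


def merge_record_fields(*records: dict[str, str]) -> dict[str, str]:
    """Merge record types given as {field: type-string} dicts.

    A field missing from at least one input becomes optional ("?type").
    Assumption: a field present in several inputs has the same type in each.
    """
    # Collect every field name, keeping first-seen order and type.
    merged: dict[str, str] = {}
    for record in records:
        for field, typ in record.items():
            merged.setdefault(field, typ)

    result: dict[str, str] = {}
    for field, typ in merged.items():
        universal = all(field in record for record in records)
        result[field] = typ if universal else f"?{typ}"
    return result


print(merge_record_fields({"a": "int", "b": "int"}, {"a": "int"}))
# {'a': 'int', 'b': '?int'}
```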
Fill values
Empty variable-length arrays will be used as fill values when there are missing entries.
- Input:
n * var * int
None
- Output:
n * var
This will come up when we need to expand an axis for an outer join, or when an AnnData object being concatenated does not have a value.
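As a rough model of the fill behaviour, using nested Python lists in place of awkward arrays: a missing input (`None`) contributes one empty list per observation. The name `outer_concat` and the use of `None` for a missing entry are assumptions made for this sketch only.

```python
from __future__ import annotations


def outer_concat(
    arrays: list[list[list[int]] | None], lengths: list[int]
) -> list[list[int]]:
    """Concatenate ragged lists along axis 0, filling missing inputs with empty lists."""
    if len(arrays) != len(lengths):
        raise ValueError("arrays and lengths must have the same number of entries")
    result: list[list[int]] = []
    for array, length in zip(arrays, lengths):
        if array is None:
            # Missing input: one empty list per row it would have contributed.
            result.extend([] for _ in range(length))
        else:
            result.extend(array)
    return result


print(outer_concat([[[1, 2], [3]], None], [2, 3]))
# [[1, 2], [3], [], [], []]
```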
Potential issue
This only works when the type admits an empty value. That is true for simple ragged arrays, but not for every array type that awkward can represent. For example, n * 20 * int cannot have "empty arrays" as the fill value because all dimension sizes are fixed. I believe that as long as one of the dimensions is variable we can make this work. Otherwise we would need a null fill value.
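To make the condition concrete, here is a small sketch that builds one fill entry from a list of inner dimensions, where each dimension is either a fixed integer size or the string `"var"`. The helper `empty_fill` and this encoding of dimensions are hypothetical, chosen only to illustrate the rule; they are not awkward's API.

```python
from __future__ import annotations


def empty_fill(dims: list[int | str]) -> list:
    """Build a single fill entry for the inner dimensions `dims`.

    "var" yields an empty list; a fixed size k yields k copies of the fill
    for the remaining dimensions. If the dimensions run out before a "var"
    is reached, every size is fixed and no empty fill value exists.
    """
    if not dims:
        raise ValueError("All dimensions are fixed size; no empty fill value exists")
    head, *rest = dims
    if head == "var":
        return []
    return [empty_fill(rest) for _ in range(head)]


print(empty_fill(["var"]))     # []
print(empty_fill([2, "var"]))  # [[], []]

try:
    empty_fill([20])  # models n * 20 * int
except ValueError as err:
    print(err)
```

In this model, n * 20 * var * int is fine (each entry is 20 empty lists), while n * 20 * int raises, matching the case where a null fill value would be needed instead.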
Related discussion
- first attempt to support awkward arrays #647 (comment)
- Currently blocked by:
  - The ability to turn Union[{a: int}, {a: int, b: int}] into [{a: int, b: ?int}]
  - ak.concatenate is currently non-commutative, weird about strings, and sometimes throws an error.
Implementation
Here's an implementation of the internals. We'll still need the ability to merge unions from upstream to actually implement.
Defining `create_empty_array` + tests
from __future__ import annotations

import awkward as ak
import numpy as np


def empty_lists(length: int) -> ak.contents.ListArray:
    # Every list starts and stops at offset 0, so each one is empty.
    return ak.contents.ListArray(
        ak.index.Index(np.broadcast_to(np.int32(0), length)),
        ak.index.Index(np.broadcast_to(np.int32(0), length)),
        ak.contents.EmptyArray(),
    )


def create_empty_array(length: int, typ: ak.types.Type) -> ak.contents.Content | ak.Array:
    """Create an empty array of a set length and type.

    Requires that the type can be "empty" (i.e. variable length).

    Parameters
    ----------
    length
        Length of the array to create
    typ
        Type of the array to create
    """
    if isinstance(typ, ak.types.ListType):
        return empty_lists(length)
    elif isinstance(typ, ak.types.RecordType):
        return ak.contents.RecordArray(
            [create_empty_array(length, t) for t in typ.contents],
            typ.fields,
            length,
        )
    elif isinstance(typ, ak.types.ArrayType):
        # Strip off the high-level array
        return ak.Array(create_empty_array(length, typ.content))
    elif isinstance(typ, ak.types.OptionType):
        # option should propagate by itself
        return create_empty_array(length, typ.content)
    elif isinstance(typ, ak.types.RegularType):
        return ak.contents.RegularArray(
            create_empty_array(typ.size * length, typ.content),
            typ.size,
        )
    elif isinstance(typ, ak.types.UnionType):
        raise NotImplementedError("Union type not implemented")
    elif isinstance(typ, ak.types.NumpyType):
        raise ValueError("Fixed size type, cannot contain empty arrays")
    else:
        raise Exception("Should be unreachable")


# Tests
def check_empty_array(input_array):
    empty_array = create_empty_array(5, input_array.type)
    # Check type (but remove outer array type)
    combined = ak.concatenate([input_array, empty_array])
    assert combined.type.content == input_array.type.content
    assert len(combined) == len(input_array) + 5


def test_empty_arrays():
    # 2 * {a: var * int64, b: var * int64}
    check_empty_array(ak.Array([{"a": [1, 2], "b": [1, 2]}, {"a": [3], "b": [4]}]))
    # 2 * {a: var * int64, b: option[var * int64]}
    check_empty_array(ak.Array([{"a": [1, 2], "b": [1, 2]}, {"a": [3]}]))
    # 2 * var * {a: int64, b: string}
    check_empty_array(ak.Array([[{"a": 1, "b": "foo"}], [{"a": 2, "b": "bar"}, {"a": 3, "b": "baz"}]]))
    # 2 * var * {a: int64, b: ?string}
    check_empty_array(ak.Array([[{"a": 1, "b": "foo"}], [{"a": 2, "b": "bar"}, {"a": 3}]]))


test_empty_arrays()  # Run tests
In the end, outer concatenation along axis 0 for awkward arrays in obsm or varm would look something like:
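# `MissingType` and `not_missing` are placeholders for however missing
# entries end up being represented; they are not defined here.
def outer_join_awkward(
    arrays: list[ak.Array | MissingType], lengths: list[int]
) -> ak.Array:
    # Concatenating zero-length slices gives the merged type without copying data.
    result_type = ak.merge_union_of_records(
        ak.concatenate([a[0:0] for a in arrays if not_missing(a)])
    ).type
    return ak.concatenate(
        [
            a if not_missing(a) else create_empty_array(l, result_type)
            for a, l in zip(arrays, lengths)
        ]
    )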