Discussion - data.table and record types #4910

@DavisVaughan

Hi data.table team!

I would like to start a discussion about a feature request: allowing record types as columns of a data.table. If you aren't familiar with the term, we define a record type as a classed list of equal-length vectors, where the length() of the object is the length of those vectors, not the length of the list.

Record types aren't particularly common in R yet, but there is one example in base R: POSIXlt. I'm aware that POSIXlt is converted to POSIXct upon entry into a data.table, and I understand why you do this. However, if you look beyond POSIXlt, I think that record types can be a powerful way to pack a lot of meaning into a single vector.
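To make the definition concrete, here is a minimal base R sketch using POSIXlt. The dates are arbitrary, and the exact set of fields can vary between R versions, so no field count is shown:

```r
# POSIXlt is a classed list of per-element fields (sec, min, hour, ...)
x <- as.POSIXlt(c("2019-01-01", "2019-02-01", "2019-03-01"), tz = "UTC")

is.list(x)          # the object is a list underneath
length(x)           # reports the number of date-times, not the number of fields
names(unclass(x))   # the underlying fields
lengths(unclass(x)) # one entry per field, giving that field's length
```

The point is only that length() follows the fields rather than the list itself, which is the property that makes a record type behave like an ordinary vector.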

As an example, I've developed a new package called clock that makes heavy use of these record types. As things stand, they don't work as columns of a data.table:

library(clock)
library(data.table)

x <- year_month_day(2019, 1:3)
x
#> <year_month_day<month>[3]>
#> [1] "2019-01" "2019-02" "2019-03"
unclass(x)
#> $year
#> [1] 2019 2019 2019
#> 
#> $month
#> [1] 1 2 3
#> 
#> attr(,"precision")
#> [1] 2

y <- duration_milliseconds(c(1e9, 10))
y
#> <duration<millisecond>[2]>
#> [1] 1000000000 10
unclass(y)
#> $ticks
#> [1] 11  0
#> 
#> $ticks_of_day
#> [1] 49600     0
#> 
#> $ticks_of_second
#> [1]  0 10
#> 
#> attr(,"precision")
#> [1] 8


data.table(x = x)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent
data.table(y = y)
#> Error in dimnames(x) <- dn: length of 'dimnames' [1] not equal to array extent

data.frame(x = x)
#>         x
#> 1 2019-01
#> 2 2019-02
#> 3 2019-03
data.frame(y = y)
#>            y
#> 1 1000000000
#> 2         10

clock builds on the vctrs_rcrd type from the vctrs package. That type provides a lot of S3 method scaffolding that makes it easier to create new record types on top of it. Because it is now much more straightforward to construct a record type in R, I expect more of them to start appearing in the wild over the next few years.
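As a rough illustration of how little code a new record type needs, here is a minimal sketch built on vctrs. The class name hypo_ym and its constructor are hypothetical, invented for this example, and are not how clock implements its types:

```r
library(vctrs)

# Hypothetical year-month record type: two equal-length integer fields
new_hypo_ym <- function(year = integer(), month = integer()) {
  new_rcrd(list(year = year, month = month), class = "hypo_ym")
}

# A format() method is what printing and data.frame display rely on
format.hypo_ym <- function(x, ...) {
  sprintf("%04d-%02d", field(x, "year"), field(x, "month"))
}

ym <- new_hypo_ym(rep(2019L, 3), 1:3)

length(ym)   # number of records, not number of fields
fields(ym)   # names of the underlying fields
format(ym)   # one formatted string per record
```

Everything else (subsetting, combining, comparison) is inherited from the vctrs_rcrd scaffolding, which is why I expect such types to become more common.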

I realize that this would probably be a lot of work. In the tidyverse, it became much easier to add support for these types once we supported data frame columns that are themselves data frames (df-cols, for short). Record types can be thought of in a similar way, and they often use the same underlying code for ordering, slicing, or comparing.
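For readers who haven't come across df-cols, here is a small base R sketch of the idea. The column names are made up for illustration:

```r
# A data frame whose second column is itself a data frame (a df-col)
df <- data.frame(id = 1:3)
df$ym <- data.frame(year = rep(2019L, 3), month = 1:3)

ncol(df)       # the df-col counts as a single column
names(df$ym)   # its fields, analogous to the fields of a record

# Slicing rows of the outer frame slices every field of the df-col together
sliced <- df[2:3, ]
sliced$ym$month
```

A record type needs the same treatment: any row-wise operation on the column has to be applied to each underlying field in lockstep.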

If you do think that this is worth pursuing, I am happy to discuss this further!