Description
This issue is a roadmap and checklist for ongoing work on refactoring DMatrix (see RFC #4354).
The first steps are to use a common interface to external data, unifying the way DMatrix objects are constructed and simplifying the process of adding new external data sources.
- Create adapter interface to external data, implement constructors for SimpleDMatrix (Demo of external data adapters #5044)
- Extend data adapters to SparsePageDMatrix (Use adapters for SparsePageDMatrix #5092)
- Extend data adapters to cudf data loading (Use dynamic types for array interface columns instead of templates #5108)
- Add support for cupy (Support dmatrix construction from cupy array #5206)
- Implement DMatrix slice via adapter (Implement slice via adapters #5198)
Once the above is done, all DMatrix construction will go through adapters, so missing value handling and use of threads will be consistent.
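To illustrate why routing every constructor through an adapter makes missing value handling consistent, here is a minimal Python sketch of the idea. It is not the XGBoost implementation (which is C++): `DenseAdapter`, `CSRAdapter`, `is_missing`, and `build_rows` are hypothetical names invented for this example, and threading is left out entirely.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Entry = Tuple[int, int, float]  # (row index, column index, value)


class DenseAdapter:
    """Hypothetical adapter over a row-major list of rows."""

    def __init__(self, rows: Sequence[Sequence[float]]) -> None:
        self.rows = rows
        self.num_rows = len(rows)

    def entries(self) -> Iterator[Entry]:
        for i, row in enumerate(self.rows):
            for j, value in enumerate(row):
                yield i, j, value


class CSRAdapter:
    """Hypothetical adapter over CSR arrays (indptr, indices, data)."""

    def __init__(self, indptr: Sequence[int], indices: Sequence[int],
                 data: Sequence[float]) -> None:
        self.indptr = indptr
        self.indices = indices
        self.data = data
        self.num_rows = len(indptr) - 1

    def entries(self) -> Iterator[Entry]:
        for i in range(self.num_rows):
            for k in range(self.indptr[i], self.indptr[i + 1]):
                yield i, self.indices[k], self.data[k]


def is_missing(value: float, missing: float) -> bool:
    # NaN never compares equal to itself, so it needs an explicit check.
    if math.isnan(missing):
        return math.isnan(value)
    return value == missing


def build_rows(adapter, missing: float = float("nan")) -> List[List[Tuple[int, float]]]:
    """Single construction path: every source is filtered the same way."""
    rows: List[List[Tuple[int, float]]] = [[] for _ in range(adapter.num_rows)]
    for i, j, value in adapter.entries():
        if not is_missing(value, missing):
            rows[i].append((j, value))
    return rows
```

Because `build_rows` only depends on `entries()` and `num_rows`, supporting a new data source in this sketch means writing one more adapter class rather than one more constructor.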
Then I plan to start reducing the number of classes associated with DMatrix.
- Merge SimpleCSRSource into SimpleDMatrix (Remove SimpleCSRSource #5315)
- Merge SparsePageSource into SparsePageDMatrix
The final goal is to save memory by constructing histogram matrices for the `hist` and `gpu_hist` algorithms directly from external data using adapters. The interface still needs some discussion: for example, a user who wants to build the histogram DMatrix directly could pass an enum to the constructor indicating the DMatrix type.
- Develop interface for instantiating histogram DMatrix directly
- Build constructor for EllPack matrix directly using adapters
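As a starting point for that interface discussion, the sketch below shows one possible shape of an enum-selected constructor. Everything in it is an assumption made for illustration: `DMatrixKind`, `QuantizedMatrix`, `column_cuts`, and `create_matrix` are hypothetical names, the input is dense with no missing values, and the cut selection is a simple rank-based stand-in rather than the quantile sketch XGBoost uses.

```python
from bisect import bisect_left
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence, Union


class DMatrixKind(Enum):  # hypothetical enum passed to the constructor
    SIMPLE = "simple"
    HISTOGRAM = "histogram"


@dataclass
class QuantizedMatrix:
    cuts: List[List[float]]  # per-column upper bin boundaries
    bins: List[List[int]]    # per-row bin index of each feature


def column_cuts(values: Sequence[float], max_bin: int) -> List[float]:
    """Pick at most max_bin boundaries; the last one is the column maximum."""
    distinct = sorted(set(values))
    n = len(distinct)
    if n <= max_bin:
        return distinct
    # Integer arithmetic keeps the last index exactly n - 1.
    return [distinct[((b + 1) * n) // max_bin - 1] for b in range(max_bin)]


def create_matrix(rows: Sequence[Sequence[float]], kind: DMatrixKind,
                  max_bin: int = 256) -> Union[List[List[float]], QuantizedMatrix]:
    if kind is DMatrixKind.SIMPLE:
        # Keeps a full floating-point copy of the data.
        return [list(row) for row in rows]
    # First pass over the source: compute the cuts for each column.
    cuts = [column_cuts(col, max_bin) for col in zip(*rows)]
    # Second pass: store only bin indices, never a float copy of the data.
    bins = [[bisect_left(cuts[j], v) for j, v in enumerate(row)] for row in rows]
    return QuantizedMatrix(cuts=cuts, bins=bins)
```

A call such as `create_matrix(rows, DMatrixKind.HISTOGRAM, max_bin=2)` returns only cuts and bin indices. That is the memory saving this roadmap item is after, at the cost of reading the source data twice in this sketch.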
Lastly:
- Enable weighted sketching for `DeviceQuantileDMatrix`.