Skip to content

Conversation

blee1234
Copy link
Collaborator

@blee1234 blee1234 commented May 31, 2022


Overview

Currently, there is no support for custom data sources. File types not accounted for in feathr will simply throw a file system exceptions, as it defaults to looking for the file locally.

This pr will add support for user provided data sources through the new case class DataPathHandler through the FeathrClient. This is important particularly in some private environments, where they use privately owned data sources/file types.

For the user this change should be backwards compatible, as we only adding additional builder methods, and all the classes with their required parameters are only accessible in feathr's offline domain and its tests.

The class design will go over the added classes, and the API Flow will discuss how these hooks are passed around, and where they are invoked. The Testing section goes over how tests were modified and which tests were added.


Class Design

The key case class that enclose the handler functionality, DataPathHandler, is defined in DataSourceAccessor.scala, and its child classes DataAccessorHandler and DataLoaderHandler are defined in DataSourceAccessor.scala and DataLoaderFactory.scala respectively.

Both DataAccessorHandler and DataLoaderHandler have a shared validatePath functionality, that performs the same path checking functionality. Perhaps in future, we can refactor this to be cleaner.

DataAccessorHandler has a custom hook getAccessor which depending on the validatePath condition will return a user provided data accessor.

DataLoaderHandler has the following custom hooks, which like DataAccessorHandler will only be used if validatePath is true:

  1. createDataFrame => used to create a data frame given a path
  2. createUnionDataFrame => used to create a data frame given multiple paths
  3. writeDataFrame => used to write a data frame to a path

It is necessary to provide the DataLoaderHandler hook interface for users, as creating/writing data frame depends on the data/path type.

The interface was copied from SparkIOUtils, which handles a lot of the data writing/reading outside of data loaders.

More details regarding the main access points for file writing/reading is in the following section.


API Flow

For the current scope, the functionality is exposed primarily though FeathrClient. As such, there was a key investigation on which modules are dependent on type of file path, and tracking those APIs to FeathrClient.

The key access points for writing/reading files, while also high enough on call chain to prevent redundant refactoring are the following modules.

  1. DataLoaderFactory => returns typed DataAccessors that use DataLoaders to access dataframes
  2. DataSourceAccessor => returns path specific DataAccessors that use DataLoaders to access dataframes
  3. SparkIOUtils => Util package that reads/writes dataframes to/from paths

For data loaders and data accessors, the following object/call dependency graph was created to track the overall flow for instantiating data loaders and accessors,

V4 - Dali Object+Class Dependency Graph - Frame + Feathr drawio

For the SparkIOUtils, it was traced separately.

Default Values

For most of the of apis, default values were not chosen, as we don't want to data handlers to accidentally be dropped during the api flow.


Testing

External Integration Testing

Work was done in a downstream library to verify that data handling works.

Modifications to existing Feathr tests

Work was done to modify many unit tests, as although additional data handler parameters were added to a lot of classes, default values were not added.

The reasoning for this is because we don't want opaque parameter dropping. Only top-level components, such as the feathr client, and local test related components had default values appendend, to reduce the refactoring needed.

[WIP] Feathr Regression Testing

Additional work is being done to add regression testing to Feathr.


FAQ

  1. Was there any consideration to use DI to avoid the long chain of dependencies?
    • DI is not within scope, as we would need to perform a lot more refactoring, add an addition depenency on DI such as Guice, or use existing scala native patterns such as cake pattern, which is both unusable specifically for our use case, and equally complex.
  2. Why choose a list of handlers, and not a map of data type to handlers?
    • This would be useful if the user provides a config for their data source. Otherwise, this approach is less complex, as we don't have to deal with config parsing, and refactoring all data loading/writing to use inputLocation.
  3. Was there any existing classes to utilize for hooking custom data sources such as InputLocation?
    • The decision was made not at this time to use the existing InputLocation classes as all InputLocation:
      • Classes cannot be passed as a parameter, and thus would not be "hookable"
      • Batch data loaders and data accessors currently do not utilize the InputLocation to read/write to file system. They default to using SimplePath, and reading/writing files as regular file paths via FileFormat.loadHdfsDataFrame

@blee1234 blee1234 requested a review from jaymo001 May 31, 2022 20:24
@blee1234 blee1234 added the work-in-progress/do-not-merge Work in Progress PR, do not merge label May 31, 2022
@blee1234 blee1234 changed the title [WIP] Adding custom data loader handler support. Adding custom data loader handler support. May 31, 2022
@blee1234 blee1234 removed the work-in-progress/do-not-merge Work in Progress PR, do not merge label Jun 1, 2022
@blee1234 blee1234 requested review from xiaoyongzhu and hangfei June 1, 2022 20:26
@bozhonghu
Copy link
Collaborator

Overall change LGTM but we will probably need some help to integrate this into the FCM branch also.

@jaymo001 jaymo001 merged commit 3dc3549 into main Jun 2, 2022
bozhonghu pushed a commit that referenced this pull request Jun 2, 2022
* main:
  added init py for import error (#308)
  add contribution text (#153)
  Add 'feathr hocon' command (#317)
  Feathr UI: API spec alignment and ux experience improvement (#303)
  bump version (#315)
  Adding custom data loader handler support. (#309)
  Update product_recommendation_demo.ipynb
  Update product_recommendation_demo.ipynb
  Product sample (#314)
  [Feathr] Add a example for product recommendation (#312)
@jaymo001
Copy link
Collaborator

jaymo001 commented Jun 4, 2022

Follow-up ticket: #328

@xiaoyongzhu xiaoyongzhu deleted the dali_support branch June 7, 2022 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants