Adding custom data loader handler support. #309
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Overview
Currently, there is no support for custom data sources. File types not accounted for in feathr will simply throw a file system exceptions, as it defaults to looking for the file locally.
This pr will add support for user provided data sources through the new case class
DataPathHandler
through theFeathrClient
. This is important particularly in some private environments, where they use privately owned data sources/file types.For the user this change should be backwards compatible, as we only adding additional builder methods, and all the classes with their required parameters are only accessible in feathr's
offline
domain and its tests.The
class design
will go over the added classes, and theAPI Flow
will discuss how these hooks are passed around, and where they are invoked. TheTesting
section goes over how tests were modified and which tests were added.Class Design
The key case class that enclose the handler functionality,
DataPathHandler
, is defined in DataSourceAccessor.scala, and its child classesDataAccessorHandler
andDataLoaderHandler
are defined in DataSourceAccessor.scala and DataLoaderFactory.scala respectively.Both
DataAccessorHandler
andDataLoaderHandler
have a sharedvalidatePath
functionality, that performs the same path checking functionality. Perhaps in future, we can refactor this to be cleaner.DataAccessorHandler
has a custom hookgetAccessor
which depending on thevalidatePath
condition will return a user provided data accessor.DataLoaderHandler
has the following custom hooks, which like DataAccessorHandler will only be used ifvalidatePath
is true:It is necessary to provide the
DataLoaderHandler
hook interface for users, as creating/writing data frame depends on the data/path type.The interface was copied from SparkIOUtils, which handles a lot of the data writing/reading outside of data loaders.
More details regarding the main access points for file writing/reading is in the following section.
API Flow
For the current scope, the functionality is exposed primarily though
FeathrClient
. As such, there was a key investigation on which modules are dependent on type of file path, and tracking those APIs toFeathrClient
.The key access points for writing/reading files, while also high enough on call chain to prevent redundant refactoring are the following modules.
For data loaders and data accessors, the following object/call dependency graph was created to track the overall flow for instantiating data loaders and accessors,
For the SparkIOUtils, it was traced separately.
Default Values
For most of the of apis, default values were not chosen, as we don't want to data handlers to accidentally be dropped during the api flow.
Testing
External Integration Testing
Work was done in a downstream library to verify that data handling works.
Modifications to existing Feathr tests
Work was done to modify many unit tests, as although additional data handler parameters were added to a lot of classes, default values were not added.
The reasoning for this is because we don't want opaque parameter dropping. Only top-level components, such as the feathr client, and local test related components had default values appendend, to reduce the refactoring needed.
[WIP] Feathr Regression Testing
Additional work is being done to add regression testing to Feathr.
FAQ
InputLocation
?InputLocation
classes as allInputLocation
:FileFormat.loadHdfsDataFrame