Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops #392
Merged
yongtang merged 3 commits into tensorflow:master on Aug 4, 2019
Conversation
yongtang (Member, Author):
/cc @CaptainDuke to take a look as well.
This PR is part of the effort to enhance performance and ease of use of the tf.data pipeline, as discussed in tensorflow#382 and tensorflow#366.

Previously, HDF5Dataset was fairly manual: the user had to find out the datasets (columns) in the HDF5 file themselves. The idea in this PR is to let the user call list_hdf5_datasets to probe the shape, dtype, and name of the datasets within an HDF5 file. A subsequent call to read_hdf5 then brings the content into a shaped Tensor that can be used later in TensorFlow. read_hdf5 has the option to specify a slice (or a sub-block) of the dataset. This should open up the possibility, in the future, of binding a class to an HDF5 file by implementing `__len__` and `__getitem__`.

With the list_hdf5_datasets and read_hdf5 ops, it also becomes possible to ease HDF5Dataset in eager mode. In eager mode, HDF5Dataset could just call list_hdf5_datasets to find out all the necessary information, then call read_hdf5 in pieces to maintain the `batch_size` to be fed into tf.keras. The limitation is graph mode, where the user still has to specify almost everything (dtype, shape, name) for HDF5Dataset to work.

This PR has not changed the HDF5Dataset implementation to use the list_hdf5_datasets and read_hdf5 ops, but that could be done easily; see #384 for similar changes.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
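The eager-mode pattern described above (probe with list_hdf5_datasets, then read in batch_size pieces) can be sketched in plain Python. Everything here is a hypothetical stand-in, not the actual tensorflow-io ops: `list_datasets` and `read_slice` mimic the probe/read interface against an in-memory dict, so the sketch runs without TensorFlow or an HDF5 file.

```python
# In-memory stand-in for an HDF5 file with one dataset ("column").
_FILE = {"/features": list(range(10))}

def list_datasets():
    """Hypothetical stand-in for list_hdf5_datasets: probe name, shape, dtype."""
    return {name: {"shape": (len(values),), "dtype": "int64"}
            for name, values in _FILE.items()}

def read_slice(name, start, count):
    """Hypothetical stand-in for read_hdf5 reading a start/count sub-block."""
    return _FILE[name][start:start + count]

def batched_reads(name, batch_size):
    """Yield the dataset in batch_size pieces, as an eager HDF5Dataset could."""
    total = list_datasets()[name]["shape"][0]
    for start in range(0, total, batch_size):
        yield read_slice(name, start, min(batch_size, total - start))

for batch in batched_reads("/features", 4):
    print(batch)
# → [0, 1, 2, 3]
# → [4, 5, 6, 7]
# → [8, 9]
```

The same start/count read interface is what would let a class bind an HDF5 file via `__len__` (length from the probed shape) and `__getitem__` (a sub-block read per index or slice), as the description suggests.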
yongtang (Member, Author):
@CaptainDuke I changed the Plan to merge this PR shortly.
yongtang (Member, Author):
/cc @terrytangyuan in case you want to take a look.
Member:
@yongtang Thanks!
1 similar comment
i-ony pushed a commit to i-ony/io that referenced this pull request on Feb 8, 2021
Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops (#392)
* Rework on HDF5: add list_hdf5_datasets and read_hdf5 ops
* Process default value of count and start
* Support HDF5Dataset in graph mode
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>