[WIP] Update Image to use variant for reference image objects #186
yongtang wants to merge 2 commits into tensorflow:master from
Conversation
/cc @suphoff

Some additional items:
Force-pushed from 61d2060 to 26032f6
Next step is to expose attributes for variant
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

With PR #184 merged, this PR has been rebased and all tests passed. /cc @terrytangyuan to take a look as well.
| """ | ||
|
|
||
| def __init__(self, filename): | ||
| """Create a ImageReader. |
There was a problem hiding this comment.
ImageReader -> ImageDataset
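For reference, the corrected docstring would read roughly as follows (a minimal sketch; the class body and the `Args` description are illustrative, not copied from the diff):

```python
class ImageDataset:
  def __init__(self, filename):
    """Create an ImageDataset.

    Args:
      filename: A string or `tf.string` tensor with the path of the image file.
    """
    self._filename = filename  # illustrative only
```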
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
@suphoff Yes, the idea is about right. There are several additional notes here:
As we discussed, the idea is to store metadata, and this metadata is not necessarily related to Dataset. But we also want to make sure the graph creation fits into the
@suphoff Adding
You can take a look at io/tensorflow_io/core/kernels/dataset_ops.h, lines 307 to 325 (at 4019a0c).
@suphoff One additional note about
As text files are separated by lines and there is nothing else, the function above does not do anything.
Let me change the PR to Work-in-Progress, so that we can think this through further. There are several things we want to solve here:
Variant Tensor is immutable; maybe we could have a pass-through operation to explicitly read data, e.g.,
We can pass through the content (not do anything) if the input has already been resolved, e.g.,
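A rough sketch of the pass-through idea (hypothetical `resolve` helper; a plain string filename stands in for the variant reference, since the variant-backed kernel itself is not part of this sketch):

```python
import tensorflow as tf

def resolve(image_or_ref):
  """Materialize the content only when given a reference; otherwise pass through."""
  if image_or_ref.dtype == tf.string:
    # A reference: explicitly read and decode the data here.
    return tf.image.decode_image(tf.io.read_file(image_or_ref))
  # Already-resolved content: pass it through without doing anything.
  return image_or_ref
```

The actual op would take the variant tensor and consult its stored metadata rather than a filename string.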
@yongtang : Variants may be immutable, but you can wrap a pointer to a reference-counted C++ object into one, just like the Dataset implementation. As for distribution across host/device I see two scenarios:
I just don't see a lot of usage for enumerating files matched by a wildcard on one host and sending the filenames to a second host, but I admit I could be totally wrong here. A second issue I have not investigated is graph saving/restore with reference-counted objects wrapped in variant tensors (I just haven't looked into graph save/restore yet). Happy to discuss in a VC or on gitter.
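For the Dataset case this wrapping is already visible from Python: a `tf.data.Dataset` can be converted to a scalar variant tensor that only carries a handle to the reference-counted C++ dataset object, and rebuilt from it (real `tf.data.experimental` API; the range dataset is just an example):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)

# Wrap the dataset into a scalar DT_VARIANT tensor. The variant holds a handle
# to the underlying reference-counted C++ dataset object, not the data itself.
variant = tf.data.experimental.to_variant(ds)

# Rebuild a dataset from the variant; the element structure must be supplied.
restored = tf.data.experimental.from_variant(variant, structure=ds.element_spec)
print(list(restored.as_numpy_iterator()))  # [0, 1, 2, 3, 4]
```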
@suphoff It may not be a big issue for images, as images are likely small in size. But for other data formats the data could be huge, like GBs of data (and only a part or a small chunk of the data is truly used). In that situation, serializing tensors of GB size and passing them around from one host to another is not efficient. A reference to the filename/entry will help distribute work with data that are not dense. The Dataset in tensorflow was designed as an iterable or iterator, so it is not well suited for distributed systems. The distribute strategy helps to an extent, but it still only applies to dense datasets where every byte will be used, not other formats where only a part or small chunks of the data are needed.
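As an illustration of passing references instead of data, filenames matched by a wildcard can be sharded so each worker reads only its own slice (standard `tf.data` API; the glob pattern and worker numbers are placeholders):

```python
import tensorflow as tf

NUM_WORKERS = 4   # placeholder values for illustration
WORKER_INDEX = 0

# Enumerate the files once; only small string references flow through the graph.
files = tf.data.Dataset.list_files("/data/images/*.webp", shuffle=False)

# Each worker keeps every NUM_WORKERS-th filename and reads only those files,
# so no serialized image content has to be shipped between hosts.
local_files = files.shard(num_shards=NUM_WORKERS, index=WORKER_INDEX)
local_images = local_files.map(tf.io.read_file)
```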
@yongtang : I agree serializing GBs of data would not be a good solution.
@suphoff Eventually, if each file (or archived object) is still too large, it is possible to split the file or object so that each host is only responsible for a chunk of the data. So host one processes
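A minimal sketch of that chunking idea (hypothetical `read_my_chunk` helper; it assumes the format can be cut at arbitrary byte boundaries, which real formats usually need extra care for):

```python
import tensorflow as tf

def read_my_chunk(filename, host_index, num_hosts):
  """Read only the byte range this host is responsible for."""
  size = tf.io.gfile.stat(filename).length   # total file size in bytes
  chunk = size // num_hosts
  start = host_index * chunk
  # The last host also picks up any remainder.
  length = (size - start) if host_index == num_hosts - 1 else chunk
  with tf.io.gfile.GFile(filename, "rb") as f:
    f.seek(start)
    return f.read(length)
```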