Add read_text to read lines from splittable text file by yongtang · Pull Request #397 · tensorflow/io

yongtang · 2019-07-30T18:17:02Z

This PR is part of the effort to rework Dataset so that large files are first read into Tensors, which speeds up performance. See #382 and #366 for related discussions.

Summary:

1) read_text can read a text file within the range [offset, offset+length].
2) This gives us a splittable text file that can be read in chunks (similar to Hadoop).
3) The plan is to read a text file in big chunks and then wire it up with tf.data.Dataset.
4) read_text is a primitive C++ op, so it can be used in tf.data as well as in other places.
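The range-based reading described above can be illustrated with a minimal pure-Python sketch. This is not the read_text kernel from this PR; the function names `read_text_range` and `iter_chunks` are hypothetical, and the boundary rule (a chunk that does not start at byte 0 discards its first line, and every chunk keeps reading lines while the position is at or before its end) is an assumption borrowed from the usual Hadoop-style convention, so that each line is returned by exactly one chunk.

```python
import os


def iter_chunks(file_size, chunk_size):
    """Yield (offset, length) pairs that cover a file of file_size bytes."""
    for offset in range(0, file_size, chunk_size):
        yield offset, min(chunk_size, file_size - offset)


def read_text_range(path, offset, length):
    """Return the lines owned by the byte range [offset, offset + length].

    Assumed convention (not taken from the PR):
    - a chunk with offset > 0 discards its first line, because the
      previous chunk reads that line to completion;
    - a chunk keeps reading whole lines while the read position is
      at or before offset + length.
    """
    end = offset + length
    lines = []
    with open(path, "rb") as f:
        f.seek(offset)
        if offset > 0:
            f.readline()  # partial or boundary line, owned by the previous chunk
        while f.tell() <= end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line.rstrip(b"\r\n"))
    return lines


def read_all_lines_in_chunks(path, chunk_size):
    """Read a whole file chunk by chunk and concatenate the results."""
    size = os.path.getsize(path)
    out = []
    for offset, length in iter_chunks(size, chunk_size):
        out.extend(read_text_range(path, offset, length))
    return out
```

Because each chunk is read independently, the `(offset, length)` pairs could be handed to separate workers or mapped over in a tf.data pipeline; how the actual op is wired into tf.data.Dataset is not shown here.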

Note: once PR 393 is merged, I will convert TextDataset to use this op (and remove the native C++ implementation of TextDataset).

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

yongtang · 2019-08-04T15:52:13Z

Will merge this PR soon as well. It exposes a primitive kernel op, read_text, which allows reading text in slices. In most cases this is more useful for the tf.data API.

* Add read_text to read lines from splittable text file
* Use read_text to implement TextDataset
* Fix python 3 failure

yongtang added kokoro:force-run kokoro:run Kokoro CI labels Jul 30, 2019

kokoro-team removed kokoro:run Kokoro CI kokoro:force-run labels Jul 30, 2019

yongtang force-pushed the read_text branch 2 times, most recently from 0705db4 to 902867a Compare July 31, 2019 23:37

yongtang added 3 commits July 31, 2019 23:53

Use read_text to implement TextDataset

632330e

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

Fix python 3 failure

33220b3

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>

yongtang force-pushed the read_text branch from 902867a to 33220b3 Compare August 4, 2019 00:48

yongtang merged commit a8506f6 into tensorflow:master Aug 4, 2019

yongtang deleted the read_text branch August 4, 2019 18:53

yongtang merged 3 commits into tensorflow:master from yongtang:read_text
