Add --recursive option to fs ls command #513
justinTM wants to merge 4 commits into databricks:main

Conversation
pietern left a comment:
Thanks for your PR, this looks useful.
Two general comments:
- It seems to me that recursive implies absolute paths, or at least paths relative to the specified path. As is, it would allow a recursive listing to show only basenames, which doesn't make a lot of sense. Can you make non-absolute recursive listing show relative paths? (See the sketch after this list.)
- Please add a couple of tests for this behavior.
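For illustration (hypothetical paths, not output from this change), the difference the first point is getting at:

```
# Paths relative to the specified directory -- unambiguous:
$ databricks fs ls --recursive dbfs:/logs
driver/stdout
executor-0/stdout

# Basenames only -- the two files become indistinguishable:
$ databricks fs ls --recursive dbfs:/logs
stdout
stdout
```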
databricks_cli/dbfs/cli.py (outdated)

```python
if f.is_dir:
    recursive_echo(this_dbfs_path.join(f.basename))

recursive_echo(dbfs_path) if recursive else echo_path(dbfs_path)
```
This modifies existing behavior to no longer list files in the specified path, but just the path itself.
Hi @pietern, thank you for your helpful feedback! I believe I fixed this in the latest commit:

databricks-cli/databricks_cli/dbfs/api.py, line 94 in 3330faf
databricks_cli/dbfs/cli.py (outdated)

```python
def echo_path(files):
    table = tabulate([f.to_row(is_long_form=l, is_absolute=absolute) for f in files],
                     tablefmt='plain')
```
What happens to tabulate if the max width of these files is different?
@pietern thanks again. I believe tabulate will only be called once now, after the latest commit, and the max file width should be a single value.
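To illustrate the concern (hypothetical rows): tabulate computes column widths per call, so a single call over all rows yields one consistent alignment, whereas a tabulate call per directory would let each directory pick its own widths:

```python
from tabulate import tabulate

rows = [['dbfs:/a/short', 10], ['dbfs:/a/much-longer-name', 20]]
print(tabulate(rows, tablefmt='plain'))
# dbfs:/a/short             10
# dbfs:/a/much-longer-name  20
```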
databricks_cli/dbfs/cli.py (outdated)

```python
for f in files:
    if f.is_dir:
        recursive_echo(this_dbfs_path.join(f.basename))
```
Do we have a rough estimate of how many more requests this would generate? Do we have customers who are already doing this manually, so that we are just giving them a shortcut?
Hi @stormwindy, thanks for your review. I honestly can't say how many more requests it will generate; it's totally dependent on the customer's file structure. It will be one extra request for each directory in the path and each of their subdirectories.
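As a worked example (hypothetical tree): listing a root that directly contains 3 directories, each holding 2 subdirectories with only files below those, issues 1 + 3 + 3 × 2 = 10 list requests in total, one per directory visited.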
I see. What are the use cases for this, from your perspective? Unless there is a crazy-deep folder structure, this shouldn't cause any problems.
Getting the file paths of all worker logs in a cluster: #512
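For example (hypothetical log path; only the --recursive flag itself comes from this PR), one command replaces walking each executor directory by hand:

```
$ databricks fs ls --recursive dbfs:/cluster-logs/0123-456789-abcd123/executor/
```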
Overall, the idea to add this option is good on our end, and I am happy with the overall implementation direction. I have left one comment about the code; after that, @pietern can decide if/when to approve the PR. Thanks a lot for the contribution.
fix Dbfs files assignment
```python
paths = self.client.list_files(*args, **kwargs)
files = [p for p in paths if not p.is_dir]
for p in paths:
    files = files + self._recursive_list(p) if p.is_dir else files
```
I had to read this a couple of times to understand what you're doing. I think it should be simplified to iterate over paths just once, with an if p.is_dir branch inside that handles both the file case and the directory case (see the sketch below).
Separate comment: this also needs to pass along the headers from the **kwargs to be consistent with the list_files API today. Otherwise they are used on the first call but not on later calls.
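A minimal sketch combining both suggestions, assuming FileInfo objects expose is_dir and dbfs_path and that headers is the keyword argument list_files accepts today; this is illustrative, not the PR's actual code:

```python
def _recursive_list(self, dbfs_path, headers=None):
    # Single pass over the listing: collect files directly and
    # expand directories with a recursive call.
    files = []
    for f in self.client.list_files(dbfs_path, headers=headers):
        if f.is_dir:
            # Forward headers so nested calls behave like the first one.
            files.extend(self._recursive_list(f.dbfs_path, headers=headers))
        else:
            files.append(f)
    return files
```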
```python
assert len(files) == 2
assert TEST_FILE_INFO0 == files[0]
assert TEST_FILE_INFO1 == files[1]
```
There doesn't seem to be a test for a real recursive call.
The variable TEST_FILE_JSON2 is unused. When you add a test that uses it and makes a real recursive call, I expect it to fail for the reason I commented on above: the mismatch between FileInfo and DbfsPath.
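A sketch of what such a test might look like, assuming a dbfs_api fixture whose client.list_files is a mock supporting side_effect, plus hypothetical fixtures TEST_DIR_INFO (a directory entry) and TEST_FILE_INFO2 (built from the unused TEST_FILE_JSON2); names and wiring are illustrative:

```python
def test_recursive_list(dbfs_api):
    # Two successive list_files calls: the top-level listing returns a
    # file and a directory; the nested call returns the directory's file.
    dbfs_api.client.list_files.side_effect = [
        [TEST_FILE_INFO0, TEST_DIR_INFO],
        [TEST_FILE_INFO2],
    ]
    files = dbfs_api._recursive_list(DbfsPath('dbfs:/'))
    assert files == [TEST_FILE_INFO0, TEST_FILE_INFO2]
```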