pygmt.which: Refactor to get rid of temporary files#3148
Conversation
pygmt/src/which.py
Outdated
| if not path: | ||
| result = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="pandas") |
There was a problem hiding this comment.
We can set output_type to numpy to simplify the code below no?
There was a problem hiding this comment.
Trying to think what's the most efficient way to get the output from which. It seems a bit overkill going virtualfile -> pandas -> numpy -> list[str].
There was a problem hiding this comment.
Yes, it's not very efficient. Instead, we can add another internal type output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings like which. Actually, GMT provides an API function GMT_Get_Strings which does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.
There was a problem hiding this comment.
Instead, we can add another internal type
output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings likewhich.
Done in PR #3157.
Actually, GMT provides an API function
GMT_Get_Stringswhich does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.
This API function requires the data family to be GMT_IS_VECTOR/GMT_IS_MATRIX, so it cannot be used in this case (family is GMT_IS_DATASET). xref: https://github.com/GenericMappingTools/gmt/blob/fcb795ef196714fe51f7e5e68d30a18e981294d0/src/gmt_api.c#L15785
CodSpeed Performance ReportMerging #3148 will degrade performances by 46.35%Comparing Summary
Benchmarks breakdown
|
…ay of trailing texts
|
Ping @GenericMappingTools/pygmt-maintainers for final reviews. The |
|
The cache GMT artifacts workflow is failing at https://github.com/GenericMappingTools/pygmt/actions/runs/8680209408/job/23800357490: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/caching.py", line 103, in cache_data
which(fname=datasets, download="a")
File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 607, in new_module
return module_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 780, in new_module
return module_func(*bound.args, **bound.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/runner/work/pygmt/pygmt/pygmt/src/which.py", line 67, in which
paths = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="strings")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/runner/work/pygmt/pygmt/pygmt/clib/session.py", line 1940, in virtualfile_to_dataset
return result.to_strings()
^^^^^^^^^^^^^^^^^^^
File "/Users/runner/work/pygmt/pygmt/pygmt/datatypes/dataset.py", line 156, in to_strings
return np.char.decode(textvector) if textvector else np.array([], dtype=str)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/runner/micromamba/envs/pygmt/lib/python3.12/site-packages/numpy/core/defchararray.py", line 615, in decode
_vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: string operation on non-string arrayThe error doesn't really indicate what happened, so I added a It seems like already downloaded files might return as None? |
It should return the path if the file is already downloaded. Not sure where |
It seems to happen when I have a mix of downloaded and not downloaded files. I get |
|
Now I can reproduce the issue locally. It doesn't always happen for a mix of downloaded/undownloaded files, for example: |
|
Here is a minimum example to reproduce the issue. From what I can see, the issue only occurs if we try to download a tile like |
Address #2730.
Output to a pandas.DataFrame via virtualfiles and then convert the DataFrame to a list.