GMT_DATASET.to_dataframe: Return an empty DataFrame if a file contains no data #3131
Conversation
```python
return df
```

```python
def dataframe_from_gmt(fname):
```
For reference, GMT provides two special/undocumented modules, `read` and `write` (their sources are `gmt/src/gmtread.c` and `gmt/src/gmtwrite.c`), that can read a file into a GMT object (e.g., reading a tabular file as `GMT_DATASET`, or reading a grid as `GMT_GRID`). Currently, we frequently use the special `read` module in the doctests of the `pygmt.clib.session` module (similar to lines 46-50 below). We may want to make it public in the future, as already done in GMT.jl (https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtread-Tuple{String} and https://www.generic-mapping-tools.org/GMT.jl/dev/#GMT.gmtwrite).
Co-authored-by: Yvonne Fröhlich <94163266+yvonnefroehlich@users.noreply.github.com>
pygmt/datatypes/dataset.py (Outdated)
```diff
- df = pd.concat(objs=vectors, axis=1)
+ df = pd.concat(objs=vectors, axis=1) if vectors else pd.DataFrame()
```
An empty pd.DataFrame() won't have any columns. Should there still be columns returned (even if there are no rows)? How would this work with #3117 for example?
A DataFrame with columns but no rows is still empty. So I guess it's fine.
```python
In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df
Out[3]:
Empty DataFrame
Columns: []
Index: []

In [4]: df = pd.DataFrame(columns=None)

In [5]: df
Out[5]:
Empty DataFrame
Columns: []
Index: []

In [6]: df = pd.DataFrame(columns=["col1", "col2"])

In [7]: df
Out[7]:
Empty DataFrame
Columns: [col1, col2]
Index: []

In [8]: df.empty
Out[8]: True
```
Column names are set here like so (lines 1853 to 1861 in 1eb6dec).
So we would do something like:

```python
import pandas as pd
df = pd.DataFrame(columns=None)
assert df.empty
df.columns = ["col1", "col2"]
```

But setting column names to `["col1", "col2"]` errors with:
```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.columns = ["col1", "col2"]

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:6310, in NDFrame.__setattr__(self, name, value)
   6308 try:
   6309     object.__getattribute__(self, name)
-> 6310     return object.__setattr__(self, name, value)
   6311 except AttributeError:
   6312     pass

File properties.pyx:69, in pandas._libs.properties.AxisProperty.__set__()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/generic.py:813, in NDFrame._set_axis(self, axis, labels)
   808 """
   809 This is called from the cython code when we set the `index` attribute
   810 directly, e.g. `series.index = [1, 2, 3]`.
   811 """
   812 labels = ensure_index(labels)
-> 813 self._mgr.set_axis(axis, labels)
   814 self._clear_item_cache()

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/managers.py:238, in BaseBlockManager.set_axis(self, axis, new_labels)
   236 def set_axis(self, axis: AxisInt, new_labels: Index) -> None:
   237     # Caller is responsible for ensuring we have an Index object.
-> 238     self._validate_set_axis(axis, new_labels)
   239     self.axes[axis] = new_labels

File ~/mambaforge/envs/pygmt/lib/python3.12/site-packages/pandas/core/internals/base.py:98, in DataManager._validate_set_axis(self, axis, new_labels)
    95     pass
    97 elif new_len != old_len:
--> 98     raise ValueError(
    99         f"Length mismatch: Expected axis has {old_len} elements, new "
   100         f"values have {new_len} elements"
   101     )

ValueError: Length mismatch: Expected axis has 0 elements, new values have 2 elements
```
I think we need to refactor `GMT_DATASET.to_dataframe()` to accept more pandas-specific parameters (e.g., `column_names`, `index_col`). Then `virtualfile_to_dataset` will be called like:

```python
result = self.read_virtualfile(vfname, kind="dataset").contents.to_dataframe(columns=column_names)
if output_type == "numpy":  # numpy.ndarray output
    return result.to_numpy()
return result  # pandas.DataFrame output
```
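The dispatch sketched above can be illustrated with plain pandas, independent of GMT. This is a hypothetical sketch, not the actual `virtualfile_to_dataset` implementation; the helper name `finalize_result` is made up for illustration:

```python
import pandas as pd


def finalize_result(df: pd.DataFrame, output_type: str):
    """Hypothetical sketch of the proposed dispatch: return a
    numpy.ndarray or the pandas.DataFrame depending on output_type."""
    if output_type == "numpy":  # numpy.ndarray output
        return df.to_numpy()
    return df  # pandas.DataFrame output


result = finalize_result(pd.DataFrame(data=[[1, 2]], columns=["x", "y"]), "numpy")
print(type(result).__name__)  # → ndarray
```

Keeping the DataFrame as the single internal representation and only converting at the very end avoids duplicating the column-name logic for the numpy path.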
Better to do it in a separate PR so that #3117 can focus on parsing the column names from the header.
pygmt/datatypes/dataset.py (Outdated)
```python
# Return an empty DataFrame if no columns are found.
if len(vectors) == 0:
    return pd.DataFrame()
```
Currently, it returns an empty DataFrame without columns and rows, but an empty DataFrame with columns is also allowed, e.g.,

```python
return pd.DataFrame(columns=column_names)
```

I guess either is fine. I think we can use `return pd.DataFrame()` now and make changes if necessary.
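For illustration, the only difference between the two forms is whether the column labels survive; both count as empty (a minimal pandas-only sketch with made-up column names):

```python
import pandas as pd

# Both frames are "empty" (no rows); only the second keeps column labels.
df_bare = pd.DataFrame()
df_named = pd.DataFrame(columns=["col1", "col2"])
assert df_bare.empty and df_named.empty
print(list(df_bare.columns))   # → []
print(list(df_named.columns))  # → ['col1', 'col2']
```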
I think we should return the `column_names`, so that users who want to e.g. do `pd.concat` on multiple `pd.DataFrame` outputs from running an algorithm like `pygmt.select` in a for-loop can do so in a more straightforward way. Note that we should also set the dtypes of the columns properly, even if the rows are empty, otherwise the dtypes will all become `object`:

```python
df1 = pd.DataFrame(data=[[0, 1, 2]], columns=["x", "y", "z"])
print(df1.dtypes)
# x    int64
# y    int64
# z    int64
# dtype: object
df2 = pd.DataFrame(columns=["x", "y", "z"])
print(df2.dtypes)
# x    object
# y    object
# z    object
# dtype: object
pd.concat(objs=[df1, df2]).dtypes
# x    object
# y    object
# z    object
# dtype: object
```

See my other suggestion at #3131 (comment) on not returning an empty `pd.DataFrame()` early, until the dtype is set with `df.astype(dtype)` below.
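A minimal sketch of the `df.astype(dtype)` fix mentioned above: casting the empty frame to a concrete dtype before concatenation keeps the combined result numeric. The column names and `float64` dtype here are illustrative, not the actual ones in `to_dataframe`:

```python
import pandas as pd

# An empty frame with named columns defaults to object dtype...
df_empty = pd.DataFrame(columns=["x", "y", "z"])
# ...but casting it first (as df.astype(dtype) would in to_dataframe)
# lets a later pd.concat preserve the numeric dtypes.
df_typed = df_empty.astype("float64")
df_data = pd.DataFrame(data=[[0.0, 1.0, 2.0]], columns=["x", "y", "z"])
out = pd.concat(objs=[df_data, df_typed])
print(out.dtypes)
```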
Co-authored-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
weiji14 left a comment:
Should probably find a way to test the case where an empty pd.DataFrame is returned with column names and set dtypes, but it looks a bit tricky 🙂
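One way such a test could look, as a pandas-only sketch: the real test would have to go through a virtual file that GMT leaves empty, and the helper name `check_empty_result` and column names here are made up:

```python
import pandas as pd


def check_empty_result(df: pd.DataFrame, expected_columns, expected_dtype="float64"):
    """Hypothetical helper: assert an empty DataFrame still carries
    the expected column names and dtypes."""
    assert df.empty
    assert list(df.columns) == list(expected_columns)
    assert all(str(dt) == expected_dtype for dt in df.dtypes)


# Simulate an empty result that kept its columns and dtypes.
check_empty_result(pd.DataFrame(columns=["x", "y"]).astype("float64"), ["x", "y"])
print("ok")
```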
Sometimes, GMT may output nothing to a virtual file; for example, `gmt select` may find no data points that satisfy the specified criteria. In such cases, `GMT_DATASET.to_dataframe` raises an exception. Instead of raising an exception, `GMT_DATASET.to_dataframe` should return a reasonable value, and I think an empty DataFrame (i.e., `pd.DataFrame()`) makes more sense than `None`. This PR updates the `GMT_DATASET.to_dataframe()` method to return an empty DataFrame in such cases and also adds two tests (one for benchmarking and one for the empty-DataFrame case).
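The core of the change can be sketched with plain pandas, no GMT needed: calling `pd.concat` on an empty list of column vectors raises `ValueError: No objects to concatenate`, so the method should short-circuit and return an empty DataFrame instead. The function name `vectors_to_dataframe` is made up for this sketch:

```python
import pandas as pd


def vectors_to_dataframe(vectors):
    """Sketch of the fix: pd.concat(objs=[], axis=1) raises ValueError,
    so return an empty DataFrame when no column vectors were parsed."""
    if len(vectors) == 0:
        return pd.DataFrame()
    return pd.concat(objs=vectors, axis=1)


print(vectors_to_dataframe([]).empty)  # → True
```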