Description
Describe the enhancement requested
I'd like to cast a string array to float, but it can contain entries that do not represent floats.
import pyarrow as pa
import pyarrow.compute as pc
arr = pa.array(["1.2", "3", "10-20", None, "nan", ""])
out = pc.cast(arr, pa.float64(), safe=False) # raises ArrowInvalid
print(out)  # expected: [1.2, 3, null, null, nan, null]
My current workaround is to export to pandas and use pandas.to_numeric(errors="coerce").
However, it would be nice if pyarrow had some built-in machinery to deal with this situation:
- Add a function that yields a boolean mask of all values that are castable:
    def is_castable(arr, target_type, options=None) -> Array[bool]:
        """Returns boolean mask of values that can be cast to target_type, under the chosen options."""
Such a function would also be useful for extracting the set of all values that cannot be cast.
- Either a force_cast function or an errors={"raise", "coerce"} option like pandas.to_numeric that catches conversion errors, essentially as a shortcut for:
    def force_cast(array, target_type, options=None):
        # 1. check which values are castable with the given options
        #    (is_castable is the proposed function from the previous item)
        mask = is_castable(array, target_type, options)
        # 2. mask out all non-castable values
        array = pc.if_else(mask, array, pa.scalar(None, type=array.type))
        # 3. cast the result (guaranteed to succeed)
        if options is not None:
            return pc.cast(array, options=options)
        return pc.cast(array, target_type)
Alternatives
Of course, one can emulate is_castable in this particular case with a regex, but this is problematic because the regex must be kept in sync with pyarrow's internal logic for deciding when a string is castable to float.
Component(s)
C++, Python