ENH: Add an is_castable function and/or errors=coerce option to cast #48972

@randolf-scholz

Describe the enhancement requested

I'd like to cast a string array to float, but it can contain entries that do not represent floats.

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(["1.2", "3", "10-20", None, "nan", ""])

out = pc.cast(arr, pa.float64(), safe=False)  # raises ArrowInvalid

print(out)  # desired output: [1.2, 3, null, null, nan, null]

My current workaround is to export to pandas and use pandas.to_numeric(errors="coerce").
However, it would be nice if pyarrow had some built-in machinery to deal with this situation:

  1. Add a function that yields a boolean mask of all values that are castable.

    def is_castable(arr, target_type, options=None) -> Array[bool]:
        """Returns boolean mask of values that can be cast to target_type,
        under the chosen options."""

    Such a function would also be useful for extracting the set of all values that cannot be cast.

  2. Either a force_cast function or an errors={"raise", "coerce"} option, like the one in pandas.to_numeric, that catches conversion errors. This would essentially be a shortcut for

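    def force_cast(array, target_type, options=None):
        # 1. check which values are castable with the given options
        #    (is_castable is the function proposed in item 1; it does not exist yet)
        mask = is_castable(array, target_type, options)
        # 2. replace all non-castable values with null
        array = pc.if_else(mask, array, pa.scalar(None, type=array.type))
        # 3. cast the result (guaranteed to succeed); options would be forwarded here
        return pc.cast(array, target_type)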

Alternatives

Of course, one can emulate is_castable in this particular case with a regex, but this is fragile: the pattern has to be kept in sync with PyArrow's internal rules for when it considers a string castable to float.

Component(s)

C++, Python
