Conversation

@drin
Contributor

@drin drin commented Nov 12, 2025

This decomposes a custom optimizer rule into the datafusion expression simplifier (work-in-progress).

Which issue does this PR close?

Closes #18319.

Rationale for this change

This change transforms binary expressions that compare the result of date_trunc with a constant value into an equivalent form that compares the column directly, which can be utilized better (improved performance).

For Bauplan, we can see the following (approximate average over a handful of runs):

Q1:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE date_trunc('month', pickup_datetime) <= '2025-01-08'::DATE

Q2:

SELECT PULocationID, trip_miles, tips
  FROM taxi_fhvhv
 WHERE pickup_datetime < date_trunc('month', '2025-02-08'::DATE)

| Query | Time (s) | Options                         |
|-------|----------|---------------------------------|
| Q1    | ~3       | no cache, optimization enabled  |
| Q1    | ~35      | no cache, optimization disabled |
| Q2    | ~3       | no cache, optimization enabled  |
| Q2    | ~3       | no cache, optimization disabled |
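
To illustrate why Q1 can be answered as cheaply as Q2, here is a minimal sketch of the kind of rewrite involved. It is not the code in this PR: the `Date` struct, `Op` enum, and `rewrite_month_cmp` function are hypothetical names, and it assumes plain calendar dates and month granularity only (no timestamps, time zones, or NULLs).

```rust
/// Toy calendar date (hypothetical type, not a DataFusion type).
/// Field order makes the derived ordering chronological.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Date {
    pub year: i32,
    pub month: u32,
    pub day: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Stand-in for date_trunc('month', d).
pub fn trunc_month(d: Date) -> Date {
    Date { day: 1, ..d }
}

/// First day of the month after the one containing `d`.
pub fn next_month(d: Date) -> Date {
    if d.month == 12 {
        Date { year: d.year + 1, month: 1, day: 1 }
    } else {
        Date { year: d.year, month: d.month + 1, day: 1 }
    }
}

/// Evaluates `lhs <op> rhs`.
pub fn eval(lhs: Date, op: Op, rhs: Date) -> bool {
    match op {
        Op::Lt => lhs < rhs,
        Op::LtEq => lhs <= rhs,
        Op::Gt => lhs > rhs,
        Op::GtEq => lhs >= rhs,
    }
}

/// Rewrites `date_trunc('month', col) <op> c` into `col <new_op> bound`,
/// so the column is compared directly and no per-row truncation is needed.
pub fn rewrite_month_cmp(op: Op, c: Date) -> (Op, Date) {
    let floor = trunc_month(c);
    // Whether the constant already sits on a month boundary.
    let aligned = floor == c;
    match op {
        Op::LtEq => (Op::Lt, next_month(floor)),
        Op::Lt if aligned => (Op::Lt, floor),
        Op::Lt => (Op::Lt, next_month(floor)),
        Op::GtEq if aligned => (Op::GtEq, floor),
        Op::GtEq => (Op::GtEq, next_month(floor)),
        Op::Gt => (Op::GtEq, next_month(floor)),
    }
}
```

Under these assumptions, the Q1 predicate `date_trunc('month', pickup_datetime) <= '2025-01-08'` rewrites to `pickup_datetime < '2025-02-01'`, which is the same bound that Q2 computes with `date_trunc('month', '2025-02-08'::DATE)`.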

What changes are included in this PR?

A few additional support functions and additional match arms in the simplifier match expression.

Are these changes tested?

Our custom rule has tests of the expression transformations and for correct evaluation results. These will be added to the PR after the implementation is in approximately good shape.

Are there any user-facing changes?

Better performance and an occasionally confusing explain plan. In short, date_trunc('month', col) = '2025-12-03'::DATE will always be false (because a truncated value can never equal a constant that is not itself on a truncation boundary), so the filter may be simplified to an unexpected expression (false).
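
The equality case can be sketched as follows. This is only an illustration under simplifying assumptions (calendar dates as tuples, month granularity, NULL handling ignored); `Ymd`, `EqRewrite`, and `rewrite_month_eq` are hypothetical names and not part of DataFusion or this PR.

```rust
/// (year, month, day); tuples compare lexicographically, which is
/// chronological order for this layout. Hypothetical helper type.
pub type Ymd = (i32, u32, u32);

/// Possible outcomes of simplifying `date_trunc('month', col) = c`.
#[derive(Debug, PartialEq, Eq)]
pub enum EqRewrite {
    /// The constant is not on a month boundary, so no row can match.
    AlwaysFalse,
    /// Equivalent to `start <= col AND col < end`.
    HalfOpenRange { start: Ymd, end: Ymd },
}

/// Stand-in for date_trunc('month', d).
pub fn month_start(d: Ymd) -> Ymd {
    (d.0, d.1, 1)
}

/// First day of the month after the one containing `d`.
pub fn following_month(d: Ymd) -> Ymd {
    if d.1 == 12 {
        (d.0 + 1, 1, 1)
    } else {
        (d.0, d.1 + 1, 1)
    }
}

pub fn rewrite_month_eq(c: Ymd) -> EqRewrite {
    // A truncated value always lies on a boundary; a constant that does
    // not lie on one can never be equal to it.
    if month_start(c) != c {
        return EqRewrite::AlwaysFalse;
    }
    EqRewrite::HalfOpenRange { start: c, end: following_month(c) }
}
```

With this sketch, `rewrite_month_eq((2025, 12, 3))` yields `AlwaysFalse`, while `rewrite_month_eq((2025, 12, 1))` yields the range from 2025-12-01 up to (but excluding) 2026-01-01.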

Explain plan details below (may be overkill but it was fun to figure out):

Initial query:

SELECT  PULocationID
           ,pickup_datetime
      FROM taxi_view_2025
     WHERE date_trunc('month', pickup_datetime) = '2025-12-03'

After simplify_expressions:

logical_plan after simplify_expressions                    | Projection: taxi_view_2025.PULocationID, taxi_view_2025.pickup_datetime                                                                                            |
|                                                            |   Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)          |
|                                                            |     TableScan: taxi_view_2025

Before and after date_trunc_optimizer (our custom rule):

logical_plan after optimize_projections                    | Filter: date_trunc(Utf8("month"), CAST(taxi_view_2025.pickup_datetime AS Timestamp(Nanosecond, None))) = TimestampNanosecond(1764720000000000000, None)            |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]                                                                                             |
| logical_plan after date_trunc_optimizer                    | Filter: Boolean(false)                                                                                                                                             |
|                                                            |   TableScan: taxi_view_2025 projection=[PULocationID, pickup_datetime]

@github-actions github-actions bot added the optimizer Optimizer rules label Nov 12, 2025
@drin drin marked this pull request as draft November 12, 2025 15:18
@drin
Contributor Author

drin commented Nov 12, 2025

@UBarney something I think could definitely be improved is the handling of the source expressions along the transformation and failure paths (https://github.com/drin/datafusion/blob/8cba13ceafcf0df047e753f20bf54ad85a02f019/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L690-L720).

I try to avoid moving until I know what to return (transformed expression or source expression), but I don't know rust/datafusion well enough to know best practices for when to clone and when to move and how to avoid either until necessary.
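
For reference, one common way to defer the move is to inspect the expression through a shared borrow, reduce the decision to a plain value, and only then either build the replacement or hand the original back by move. The sketch below uses a toy `Expr` enum and a toy `Rewritten` wrapper (both hypothetical, only loosely modeled on the idea behind DataFusion's `Transformed`), and `TOY_UNIT` is an invented stand-in for a truncation boundary; it is not the code in this PR.

```rust
/// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Bool(bool),
    Trunc(Box<Expr>),
    Eq(Box<Expr>, Box<Expr>),
}

/// Toy result wrapper recording whether a rewrite happened.
pub struct Rewritten {
    pub expr: Expr,
    pub changed: bool,
}

/// Invented boundary size: a literal is "aligned" if divisible by it.
pub const TOY_UNIT: i64 = 10;

pub fn simplify(expr: Expr) -> Rewritten {
    // Step 1: decide by borrowing. Nothing is moved or cloned here, and
    // the borrow ends once the decision is reduced to a bool.
    let fold_to_false = match &expr {
        Expr::Eq(lhs, rhs) => match (lhs.as_ref(), rhs.as_ref()) {
            (Expr::Trunc(_), Expr::Literal(v)) => v % TOY_UNIT != 0,
            _ => false,
        },
        _ => false,
    };

    // Step 2: commit. The source expression is either dropped in favor
    // of the replacement or returned untouched by move (no clone).
    if fold_to_false {
        Rewritten { expr: Expr::Bool(false), changed: true }
    } else {
        Rewritten { expr, changed: false }
    }
}
```

The failure path returns the same `expr` that was passed in, so no clone is needed on either path; a clone would only become necessary if the replacement had to reuse part of the source while the source itself also had to survive.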

This is a work-in-progress but decomposes an existing custom optimizer
rule into some places in the expression simplifier that seem appropriate
at first glance.

This is essentially a messy code dump, but hopefully done in a way that
someone with experience can appropriately integrate into the datafusion
codebase.
@drin drin force-pushed the octalene.feat-optimize-datetrunc branch from 8cba13c to e4b2cf5 Compare November 12, 2025 15:31
@github-actions

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jan 12, 2026
@drin
Contributor Author

drin commented Jan 13, 2026

I will try to push this forward this week

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jan 13, 2026
Labels

optimizer Optimizer rules

Development

Successfully merging this pull request may close these issues.

Optimize the evaluation of DATE_TRUNC(<col>) == <constant>) when pushed down

1 participant