Skip to content

Parquet stats pruning: IS NULL/IS NOT NULL Simplification for Literal Arguments #19092

@adriangb

Description

@adriangb

This is a dump of a conversation with Claude Code

Problem

When a predicate like b IS NOT NULL OR a = 24 is rewritten for a file schema where column b doesn't exist, the b IS NOT NULL becomes IS NOT NULL(NULL literal). This should simplify to false, allowing the predicate to become a = 24 for proper row group pruning.

Currently:

  1. PhysicalExprSimplifier doesn't simplify IS NULL(literal) or IS NOT NULL(literal)
  2. build_predicate_expression treats IS NOT NULL(literal) as "unhandled" → returns true
  3. Result: true OR (pruning_expr) = true → no pruning possible

Solution

Implement both fixes so that IS NOT NULL(lit(1)) and similar expressions benefit from simplification at multiple levels.


Change 1: Add IS NULL/IS NOT NULL Simplification to PhysicalExprSimplifier

File: datafusion/physical-expr/src/simplifier/mod.rs

Modifications:

  1. Add new module import:

    pub mod is_null;
  2. Update f_up method to chain the new simplification:

    fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> {
        let rewritten = simplify_not_expr(&node, self.schema)?
            .transform_data(|node| is_null::simplify_is_null_expr(&node))?
            .transform_data(|node| {
                unwrap_cast::unwrap_cast_in_comparison(node, self.schema)
            })?;
        Ok(rewritten)
    }

New File: datafusion/physical-expr/src/simplifier/is_null.rs

Create a new module following the pattern of not.rs:

//! Simplify IS NULL and IS NOT NULL expressions on literals
//!
//! - IS NULL(literal) -> true/false based on literal.value().is_null()
//! - IS NOT NULL(literal) -> true/false based on !literal.value().is_null()

use std::sync::Arc;
use datafusion_common::{tree_node::Transformed, Result, ScalarValue};
use crate::expressions::{lit, IsNullExpr, IsNotNullExpr, Literal};
use crate::PhysicalExpr;

pub fn simplify_is_null_expr(
    expr: &Arc<dyn PhysicalExpr>,
) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
    // Handle IS NULL(literal)
    if let Some(is_null_expr) = expr.as_any().downcast_ref::<IsNullExpr>() {
        if let Some(literal) = is_null_expr.arg().as_any().downcast_ref::<Literal>() {
            let result = literal.value().is_null();
            return Ok(Transformed::yes(lit(ScalarValue::Boolean(Some(result)))));
        }
    }

    // Handle IS NOT NULL(literal)
    if let Some(is_not_null_expr) = expr.as_any().downcast_ref::<IsNotNullExpr>() {
        if let Some(literal) = is_not_null_expr.arg().as_any().downcast_ref::<Literal>() {
            let result = !literal.value().is_null();
            return Ok(Transformed::yes(lit(ScalarValue::Boolean(Some(result)))));
        }
    }

    Ok(Transformed::no(Arc::clone(expr)))
}

Tests to Add (in mod.rs)

#[test]
fn test_simplify_is_null_literal() -> Result<()> {
    let schema = test_schema();
    let mut simplifier = PhysicalExprSimplifier::new(&schema);

    // IS NULL(1) -> false
    let is_null_expr = Arc::new(IsNullExpr::new(lit(ScalarValue::Int32(Some(1)))));
    let result = simplifier.simplify(is_null_expr)?;
    assert_eq!(as_literal(&result).value(), &ScalarValue::Boolean(Some(false)));

    // IS NULL(NULL) -> true
    let is_null_expr = Arc::new(IsNullExpr::new(lit(ScalarValue::Int32(None))));
    let result = simplifier.simplify(is_null_expr)?;
    assert_eq!(as_literal(&result).value(), &ScalarValue::Boolean(Some(true)));

    Ok(())
}

#[test]
fn test_simplify_is_not_null_literal() -> Result<()> {
    let schema = test_schema();
    let mut simplifier = PhysicalExprSimplifier::new(&schema);

    // IS NOT NULL(1) -> true
    let expr = Arc::new(IsNotNullExpr::new(lit(ScalarValue::Int32(Some(1)))));
    let result = simplifier.simplify(expr)?;
    assert_eq!(as_literal(&result).value(), &ScalarValue::Boolean(Some(true)));

    // IS NOT NULL(NULL) -> false
    let expr = Arc::new(IsNotNullExpr::new(lit(ScalarValue::Int32(None))));
    let result = simplifier.simplify(expr)?;
    assert_eq!(as_literal(&result).value(), &ScalarValue::Boolean(Some(false)));

    Ok(())
}

Change 2: Handle Literal Arguments in build_predicate_expression

File: datafusion/pruning/src/pruning_predicate.rs

Modify build_predicate_expression function (~line 1415-1427):

Before:

if let Some(is_null) = expr_any.downcast_ref::<phys_expr::IsNullExpr>() {
    return build_is_null_column_expr(is_null.arg(), schema, required_columns, false)
        .unwrap_or_else(|| unhandled_hook.handle(expr));
}
if let Some(is_not_null) = expr_any.downcast_ref::<phys_expr::IsNotNullExpr>() {
    return build_is_null_column_expr(
        is_not_null.arg(),
        schema,
        required_columns,
        true,
    )
    .unwrap_or_else(|| unhandled_hook.handle(expr));
}

After:

if let Some(is_null) = expr_any.downcast_ref::<phys_expr::IsNullExpr>() {
    // If argument is a literal, evaluate directly
    if let Some(literal) = is_null.arg().as_any().downcast_ref::<phys_expr::Literal>() {
        let result = literal.value().is_null();
        return Arc::new(phys_expr::Literal::new(ScalarValue::Boolean(Some(result))));
    }
    return build_is_null_column_expr(is_null.arg(), schema, required_columns, false)
        .unwrap_or_else(|| unhandled_hook.handle(expr));
}
if let Some(is_not_null) = expr_any.downcast_ref::<phys_expr::IsNotNullExpr>() {
    // If argument is a literal, evaluate directly
    if let Some(literal) = is_not_null.arg().as_any().downcast_ref::<phys_expr::Literal>() {
        let result = !literal.value().is_null();
        return Arc::new(phys_expr::Literal::new(ScalarValue::Boolean(Some(result))));
    }
    return build_is_null_column_expr(
        is_not_null.arg(),
        schema,
        required_columns,
        true,
    )
    .unwrap_or_else(|| unhandled_hook.handle(expr));
}

Files to Modify

  1. datafusion/physical-expr/src/simplifier/mod.rs - Add is_null module import and chain simplification
  2. datafusion/physical-expr/src/simplifier/is_null.rs - New file for IS NULL/IS NOT NULL simplification
  3. datafusion/pruning/src/pruning_predicate.rs - Handle literal arguments in build_predicate_expression

Testing

  1. Run the original failing test:

    cargo test --package datafusion --test core_integration --all-features -- schema_adapter::schema_adapter_integration_tests::test_parquet_missing_column --exact --nocapture
    
  2. Run simplifier tests:

    cargo test --package datafusion-physical-expr --lib simplifier
    
  3. Run pruning predicate tests:

    cargo test --package datafusion-pruning --lib
    

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions