Skip to content

feat(substrait): remove dependency on datafusion default features#13594

Merged
alamb merged 2 commits intoapache:mainfrom
notfilippo:fr/remove-parquet-from-substrait
Dec 4, 2024
Merged

feat(substrait): remove dependency on datafusion default features#13594
alamb merged 2 commits intoapache:mainfrom
notfilippo:fr/remove-parquet-from-substrait

Conversation

@notfilippo
Copy link
Member

Which issue does this PR close?

Closes #13593

What changes are included in this PR?

  • physical feature, which enables production and consumption of physical substrait plans.

Are these changes tested?

Yes, running cargo tree -p datafusion-substrait with and without the --no-default-features shows that parquet and many other dependencies are not included.

Are there any user-facing changes?

No, the new feature is enabled by default. Users who might want to benefit from this change will have to use default-features = false when including the datafusion-substrait crate.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Nov 28, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @notfilippo -- this looks like a good change to me.

I wonder if there is some way we can add a CI test for this (to avoid breaking it in the future) -- perhaps something like

run: cargo check --all-targets --no-default-features -p datafusion-common

@alamb
Copy link
Contributor

alamb commented Dec 2, 2024

I think we need to

  1. Add this to CI (suggested diff below)
  2. Make sure it actually compiles
index 50bebc5b4..0b2338cd3 100644
--- a/.github/workflows/rust.yml
+++ b/.github/workflows/rust.yml
@@ -80,9 +80,12 @@ jobs:
       - name: Check datafusion-common without default features
         run: cargo check --all-targets --no-default-features -p datafusion-common

-      - name: Check datafusion-functions
+      - name: Check datafusion-functions without default features
         run: cargo check --all-targets --no-default-features -p datafusion-functions

+      - name: Check datafusion-substrait without default features
+        run: cargo check --all-targets --no-default-features -p datafusion-substrait
+
       - name: Check workspace in debug mode
         run: cargo check --all-targets --workspace

I tested with cargo check --all-targets --no-default-features -p datafusion-substrait and the crate doesn't compile:

andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo check --all-targets --no-default-features -p datafusion-substrait
warning: function `coerce_file_schema_to_view_type` is never used
   --> datafusion/core/src/datasource/file_format/mod.rs:428:15
    |
428 | pub(crate) fn coerce_file_schema_to_view_type(
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: function `coerce_file_schema_to_string_type` is never used
   --> datafusion/core/src/datasource/file_format/mod.rs:492:15
    |
492 | pub(crate) fn coerce_file_schema_to_string_type(
    |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

warning: `datafusion` (lib) generated 2 warnings
    Checking datafusion-substrait v43.0.0 (/Users/andrewlamb/Software/datafusion/datafusion/substrait)
error[E0432]: unresolved import `datafusion::datasource::physical_plan::ParquetExec`
  --> datafusion/substrait/tests/cases/roundtrip_physical_plan.rs:25:61
   |
25 | use datafusion::datasource::physical_plan::{FileScanConfig, ParquetExec};
   |                                                             ^^^^^^^^^^^ no `ParquetExec` in `datasource::physical_plan`
   |
note: found an item that was configured out
  --> /Users/andrewlamb/Software/datafusion/datafusion/core/src/datasource/physical_plan/mod.rs:34:25
   |
34 | pub use self::parquet::{ParquetExec, ParquetFileMetrics, ParquetFileReaderFactory};
   |                         ^^^^^^^^^^^
note: the item is gated behind the `parquet` feature
  --> /Users/andrewlamb/Software/datafusion/datafusion/core/src/datasource/physical_plan/mod.rs:33:7
   |
33 | #[cfg(feature = "parquet")]
   |       ^^^^^^^^^^^^^^^^^^^

error[E0432]: unresolved import `datafusion_substrait::physical_plan`
  --> datafusion/substrait/tests/cases/roundtrip_physical_plan.rs:29:27
   |
29 | use datafusion_substrait::physical_plan::{consumer, producer};
   |                           ^^^^^^^^^^^^^ could not find `physical_plan` in `datafusion_substrait`
   |
note: found an item that was configured out
  --> /Users/andrewlamb/Software/datafusion/datafusion/substrait/src/lib.rs:79:9
   |
79 | pub mod physical_plan;
   |         ^^^^^^^^^^^^^
note: the item is gated behind the `physical` feature
  --> /Users/andrewlamb/Software/datafusion/datafusion/substrait/src/lib.rs:78:7
   |
78 | #[cfg(feature = "physical")]
   |       ^^^^^^^^^^^^^^^^^^^^

error[E0425]: cannot find function `parquet_test_data` in module `datafusion::test_util`
   --> datafusion/substrait/tests/cases/roundtrip_physical_plan.rs:154:43
    |
154 |     let testdata = datafusion::test_util::parquet_test_data();
    |                                           ^^^^^^^^^^^^^^^^^ help: a function with a similar name exists: `arrow_test_data`
    |
   ::: /Users/andrewlamb/Software/datafusion/datafusion/common/src/test_util.rs:201:1
    |
201 | pub fn arrow_test_data() -> String {
    | ---------------------------------- similarly named function `arrow_test_data` defined here
    |
note: found an item that was configured out
   --> /Users/andrewlamb/Software/datafusion/datafusion/core/src/test_util/mod.rs:60:39
    |
60  | pub use datafusion_common::test_util::parquet_test_data;
    |                                       ^^^^^^^^^^^^^^^^^
note: the item is gated behind the `parquet` feature
   --> /Users/andrewlamb/Software/datafusion/datafusion/core/src/test_util/mod.rs:59:7
    |
59  | #[cfg(feature = "parquet")]
    |       ^^^^^^^^^^^^^^^^^^^

error[E0599]: no method named `register_parquet` found for struct `datafusion::prelude::SessionContext` in the current scope
   --> datafusion/substrait/tests/cases/roundtrip_physical_plan.rs:145:9
    |
145 |     ctx.register_parquet("data", "tests/testdata/data.parquet", explicit_options)
    |         ^^^^^^^^^^^^^^^^
    |
help: there is a method `register_arrow` with a similar name
    |
145 |     ctx.register_arrow("data", "tests/testdata/data.parquet", explicit_options)
    |         ~~~~~~~~~~~~~~

error[E0599]: no method named `register_parquet` found for struct `datafusion::prelude::SessionContext` in the current scope
   --> datafusion/substrait/tests/cases/roundtrip_physical_plan.rs:155:9
    |
155 |     ctx.register_parquet(
    |     ----^^^^^^^^^^^^^^^^
    |
help: there is a method `register_arrow` with a similar name
    |
155 |     ctx.register_arrow(
    |         ~~~~~~~~~~~~~~

Some errors have detailed explanations: E0425, E0432, E0599.
For more information about an error, try `rustc --explain E0425`.
error: could not compile `datafusion-substrait` (test "substrait_integration") due to 5 previous errors
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$

@alamb alamb marked this pull request as draft December 2, 2024 18:23
@alamb
Copy link
Contributor

alamb commented Dec 2, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@notfilippo notfilippo marked this pull request as ready for review December 3, 2024 13:30
@github-actions github-actions bot added the development-process Related to development process of DataFusion label Dec 3, 2024
@notfilippo
Copy link
Member Author

notfilippo commented Dec 3, 2024

Thanks @alamb – Weirdly enough I didn't receive any pings from GitHub about activity on this issue / PR... Anyway thanks for pointing out the issue with the roundtrip test. The fix I posted should be enough.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @notfilippo -- makes sense to me

@alamb alamb merged commit 11d49b4 into apache:main Dec 4, 2024
@alamb
Copy link
Contributor

alamb commented Dec 4, 2024

Thanks again @notfilippo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[substrait] make dependency on parquet optional

2 participants