Add tests and CI for optional pyarrow module#1711
Conversation
|
@wjones127 looks like the newly added CI job is failing. |
@houqp Thanks. I got it passing now in my branch: https://github.com/wjones127/arrow-datafusion/runs/5008841849?check_suite_focus=true |
alamb
left a comment
There was a problem hiding this comment.
https://github.com/apache/arrow-datafusion/runs/5009634913?check_suite_focus=true
I tried to run this locally and I got an error (even after following the suggestion of setting PYO3_PYTHON=$(which python)).
PYO3_PYTHON=$(which python) cargo test --features=pyarrow -p datafusion -- pyarrow
Finished test [unoptimized + debuginfo] target(s) in 0.16s
Running unittests (target/debug/deps/datafusion-59322809a250018b)
running 2 tests
test pyarrow::tests::test_py_scalar ... FAILED
test pyarrow::tests::test_roundtrip ... FAILED
failures:
---- pyarrow::tests::test_py_scalar stdout ----
thread 'pyarrow::tests::test_py_scalar' panicked at 'pyarrow not found
Executable: /Users/alamb/Software/arrow-datafusion/target/debug/deps/datafusion-59322809a250018b
Python path: ["/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python39.zip", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages"]
HINT: try `pip install pyarrow`
HINT: if wrong Python path, try `PYO3_PYTHON=$(which python)`: PyErr { type: <class 'ModuleNotFoundError'>, value: ModuleNotFoundError("No module named 'pyarrow'"), traceback: Some(<traceback object at 0x110e69740>) }', datafusion/src/pyarrow.rs:97:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- pyarrow::tests::test_roundtrip stdout ----
thread 'pyarrow::tests::test_roundtrip' panicked at 'pyarrow not found
Executable: /Users/alamb/Software/arrow-datafusion/target/debug/deps/datafusion-59322809a250018b
Python path: ["/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python39.zip", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/lib-dynload", "/usr/local/Cellar/python@3.9/3.9.10/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages"]
HINT: try `pip install pyarrow`
HINT: if wrong Python path, try `PYO3_PYTHON=$(which python)`: PyErr { type: <class 'ModuleNotFoundError'>, value: ModuleNotFoundError("No module named 'pyarrow'"), traceback: Some(<traceback object at 0x110e69740>) }', datafusion/src/pyarrow.rs:97:26
failures:
pyarrow::tests::test_py_scalar
pyarrow::tests::test_roundtrip
test result: FAILED. 0 passed; 2 failed; 0 ignored; 0 measured; 903 filtered out; finished in 0.02s
error: test failed, to rerun pass '-p datafusion --lib'
Any thoughts @wjones127 ?
| locals.get_item("python_path").unwrap().extract().unwrap(); | ||
|
|
||
| Err(err).expect( | ||
| format!( |
|
Thank you for the contribution @wjones127 ❤️ |
TBH, I get this too on my local. It seems to be using your homebrew Python 3.9 but not the version of Python that you've installed pyarrow in. (On my local, I was trying to use a pyenv-installed Python, but it didn't seem to want to use that.) To get it to pass, I set At the very least, it seems to work in CI without much configuration. I assumed this is specific to my Mac OS + PyEnv setup, and that |
|
I also tried PYO3_PRINT_CONFIG=1 PYO3_PYTHON=$(which python) cargo test --features=pyarrow -p datafusion -- pyarrow
error: failed to run custom build command for `pyo3 v0.15.1`
Caused by:
process didn't exit successfully: `/Users/alamb/Software/arrow-datafusion/target/debug/build/pyo3-0d2dc5a0a08d5a5d/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=PYO3_CROSS
cargo:rerun-if-env-changed=PYO3_CROSS_LIB_DIR
cargo:rerun-if-env-changed=PYO3_CROSS_PYTHON_VERSION
cargo:rerun-if-env-changed=PYO3_PRINT_CONFIG
-- PYO3_PRINT_CONFIG=1 is set, printing configuration and halting compile --
implementation=CPython
version=3.9
shared=true
abi3=false
lib_name=python3.9
lib_dir=/usr/local/opt/python@3.9/Frameworks/Python.framework/Versions/3.9/lib
executable=/Users/alamb/Software/virtual_envs/arrow_dev/bin/python
pointer_width=64
build_flags=WITH_THREAD
suppress_build_script_link_lines=falseSo it seems like maybe pyo3 is picking up the right executable (my virtual env) but not the correct libdir 🤔 |
Have you checked that your virtualenv has the python3.9.lib file? I think some installations just have the Python binary and no libraries. Otherwise, I do think there is something on MacOS about shared libraries vs the Frameworks folder libraries that I don't yet comprehend, but might be key here. |
@alamb Actually this seems to be a known issue: PyO3/pyo3#1741 (comment). Python doesn't seem to do the virtual environment path computation for embedded systems on Mac OS. A workaround is to extract that into This is orthogonal to the original issue I had, which is that you must use a framework based Python install. That's default in Homebrew and Python.org installers, but not in pyenv. I have updated the error help to reflect these two issues. |
…sion into 1635-pyarrow-test
d545c46 to
0764a3b
Compare
|
Sorry -- I missed this one -- I just started CI again. When that passes I'll plan to merge it in. Thanks again @wjones127 |
* feat: add join type for logical plan display (#1674) * (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` (#1689) * Move `information_schema` tests out of execution/context.rs to `sql_integration` tests (#1684) * Move tests from context.rs to information_schema.rs * Fix up tests to compile * Move timestamp related tests out of context.rs and into sql integration test (#1696) * Move some tests out of context.rs and into sql * Move support test out of context.rs and into sql tests * Fixup tests and make them compile * Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers (#1691) * Memory manager no longer track consumers, update aggregatedMetricsSet * Easy memory tracking with metrics * use tracking metrics in SPMS * tests * fix * doc * Update datafusion/src/physical_plan/sorts/sort.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * make tracker AtomicUsize Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Implement TableProvider for DataFrameImpl (#1699) * Add TableProvider impl for DataFrameImpl * Add physical plan in * Clean up plan construction and names construction * Remove duplicate comments * Remove unused parameter * Add test * Remove duplicate limit comment * Use cloned instead of individual clone * Reduce the amount of code to get a schema Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Add comments to test * Fix plan comparison * Compare only the results of execution * Remove println * Refer to df_impl instead of table in test Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Fix the register_table test to use the correct result set for comparison * Consolidate group/agg exprs * Format * Remove outdated comment Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * refine test in repartition.rs & coalesce_batches.rs (#1707) * Fuzz test for spillable sort (#1706) * Lazy TempDir creation in DiskManager (#1695) * Incorporate dyn scalar kernels (#1685) * Rebase * impl ToNumeric for ScalarValue * Update macro to be based on * Add floats * Cleanup * Newline * add annotation for select_to_plan (#1714) * Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed (#1700) * Change physical_expr creation API * Refactor API usage to avoid creating ExecutionContextState * Fixup ballista * clippy! * Fix can not load parquet table form spark in datafusion-cli. (#1665) * fix can not load parquet table form spark * add Invalid file in log. * fix fmt * add upper bound for pub fn (#1713) Signed-off-by: remzi <13716567376yh@gmail.com> * Create SchemaAdapter trait to map table schema to file schemas (#1709) * Create SchemaAdapter trait to map table schema to file schemas * Linting fix * Remove commented code * approx_quantile() aggregation function (#1539) * feat: implement TDigest for approx quantile Adds a [TDigest] implementation providing approximate quantile estimations of large inputs using a small amount of (bounded) memory. A TDigest is most accurate near either "end" of the quantile range (that is, 0.1, 0.9, 0.95, etc) due to the use of a scalaing function that increases resolution at the tails. The paper claims single digit part per million errors for q ≤ 0.001 or q ≥ 0.999 using 100 centroids, and in practice I have found accuracy to be more than acceptable for an apprixmate function across the entire quantile range. The implementation is a modified copy of https://github.com/MnO2/t-digest, itself a Rust port of [Facebook's C++ implementation]. Both Facebook's implementation, and Mn02's Rust port are Apache 2.0 licensed. [TDigest]: https://arxiv.org/abs/1902.04023 [Facebook's C++ implementation]: https://github.com/facebook/folly/blob/main/folly/stats/TDigest.h * feat: approx_quantile aggregation Adds the ApproxQuantile physical expression, plumbing & test cases. The function signature is: approx_quantile(column, quantile) Where column can be any numeric type (that can be cast to a float64) and quantile is a float64 literal between 0 and 1. * feat: approx_quantile dataframe function Adds the approx_quantile() dataframe function, and exports it in the prelude. * refactor: bastilla approx_quantile support Adds bastilla wire encoding for approx_quantile. Adding support for this required modifying the AggregateExprNode proto message to support propigating multiple LogicalExprNode aggregate arguments - all the existing aggregations take a single argument, so this wasn't needed before. This commit adds "repeated" to the expr field, which I believe is backwards compatible as described here: https://developers.google.com/protocol-buffers/docs/proto3#updating Specifically, adding "repeated" to an existing message field: "For ... message fields, optional is compatible with repeated" No existing tests needed fixing, and a new roundtrip test is included that covers the change to allow multiple expr. * refactor: use input type as return type Casts the calculated quantile value to the same type as the input data. * fixup! refactor: bastilla approx_quantile support * refactor: rebase onto main * refactor: validate quantile value Ensures the quantile values is between 0 and 1, emitting a plan error if not. * refactor: rename to approx_percentile_cont * refactor: clippy lints * suppport bitwise and as an example (#1653) * suppport bitwise and as an example * Use $OP in macro rather than `&` * fix: change signature to &dyn Array * fmt Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * fix: substr - correct behaivour with negative start pos (#1660) * minor: fix cargo run --release error (#1723) * Convert boolean case expressions to boolean logic (#1719) * Convert boolean case expressions to boolean logic * Review feedback * substitute `parking_lot::Mutex` for `std::sync::Mutex` (#1720) * Substitute parking_lot::Mutex for std::sync::Mutex * enable parking_lot feature in tokio * Add Expression Simplification API (#1717) * Add Expression Simplification API * fmt * Add tests and CI for optional pyarrow module (#1711) * Implement other side of conversion * Add test workflow * Add (failing) tests * Get unit tests passing * Use python -m pip * Debug LD_LIBRARY_PATH * Set LIBRARY_PATH * Update help with better info * Update parking_lot requirement from 0.11 to 0.12 (#1735) Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version. - [Release notes](https://github.com/Amanieu/parking_lot/releases) - [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md) - [Commits](Amanieu/parking_lot@0.11.0...0.12.0) --- updated-dependencies: - dependency-name: parking_lot dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Prevent repartitioning of certain operator's direct children (#1731) (#1732) * Prevent repartitioning of certain operator's direct children (#1731) * Update ballista tests * Don't repartition children of RepartitionExec * Revert partition restriction on Repartition and Projection * Review feedback * Lint * API to get Expr's type and nullability without a `DFSchema` (#1726) * API to get Expr type and nullability without a `DFSchema` * Add test * publically export * Improve docs * Fix typos in crate documentation (#1739) * add `cargo check --release` to ci (#1737) * remote test * Update .github/workflows/rust.yml Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Move optimize test out of context.rs (#1742) * Move optimize test out of context.rs * Update * use clap 3 style args parsing for datafusion cli (#1749) * use clap 3 style args parsing for datafusion cli * upgrade cli version * Add partitioned_csv setup code to sql_integration test (#1743) * use ordered-float 2.10 (#1756) Signed-off-by: Andy Grove <agrove@apache.org> * #1768 Support TimeUnit::Second in hasher (#1769) * Support TimeUnit::Second in hasher * fix linter * format (#1745) * Create built-in scalar functions programmatically (#1734) * create build-in scalar functions programatically Signed-off-by: remzi <13716567376yh@gmail.com> * solve conflict Signed-off-by: remzi <13716567376yh@gmail.com> * fix spelling mistake Signed-off-by: remzi <13716567376yh@gmail.com> * rename to call_fn Signed-off-by: remzi <13716567376yh@gmail.com> * [split/1] split datafusion-common module (#1751) * split datafusion-common module * pyarrow * Update datafusion-common/README.md Co-authored-by: Andy Grove <agrove@apache.org> * Update datafusion/Cargo.toml * include publishing Co-authored-by: Andy Grove <agrove@apache.org> * fix: Case insensitive unquoted identifiers (#1747) * move dfschema and column (#1758) * add datafusion-expr module (#1759) * move column, dfschema, etc. to common module (#1760) * include window frames and operator into datafusion-expr (#1761) * move signature, type signature, and volatility to split module (#1763) * [split/10] split up expr for rewriting, visiting, and simplification traits (#1774) * split up expr for rewriting, visiting, and simplification * add docs * move built-in scalar functions (#1764) * split expr type and null info to be expr-schemable (#1784) * rewrite predicates before pushing to union inputs (#1781) * move accumulator and columnar value (#1765) * move accumulator and columnar value (#1762) * fix bad data type in test_try_cast_decimal_to_decimal * added projections for avro columns Co-authored-by: xudong.w <wxd963996380@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Yijie Shen <henry.yijieshen@gmail.com> Co-authored-by: Phillip Cloud <417981+cpcloud@users.noreply.github.com> Co-authored-by: Matthew Turner <matthew.m.turner@outlook.com> Co-authored-by: Yang <37145547+Ted-Jiang@users.noreply.github.com> Co-authored-by: Remzi Yang <59198230+HaoYang670@users.noreply.github.com> Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com> Co-authored-by: Dom <dom@itsallbroken.com> Co-authored-by: Kun Liu <liukun@apache.org> Co-authored-by: Dmitry Patsura <talk@dmtry.me> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com> Co-authored-by: Will Jones <willjones127@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: r.4ntix <r.4ntix@gmail.com> Co-authored-by: Jiayu Liu <Jimexist@users.noreply.github.com> Co-authored-by: Andy Grove <agrove@apache.org> Co-authored-by: Rich <jychen7@users.noreply.github.com> Co-authored-by: Marko Mikulicic <mmikulicic@gmail.com> Co-authored-by: Eduard Karacharov <13005055+korowa@users.noreply.github.com>

Which issue does this PR close?
Closes #1635.
Rationale for this change
The build was reported as broken, so we should test it.
In addition, datafusion-contrib/datafusion-python#21 will benefit from the completed implementation of roundtrip conversion.
What changes are included in this PR?
to_pyarrowimplementation forScalarValuepyarrowconversion logic.Are there any user-facing changes?
ScalarValue.to_pyarrowwill no longer throw a not implemented error. Does that count as an API change?