Skip to content

Conversation

@oleksandr-pavlyk
Copy link
Contributor

@oleksandr-pavlyk oleksandr-pavlyk commented Dec 6, 2023

Implemented data-parallel merge-sort algorithm, which powers array API operations dpct.tensor.sort and dpctl.tensor.argsort. Implemented set functions: dpctl.tensor.unique_values, dpctl.tensor.unique_counts, dpctl.tensor.unique_inverse, and doctl.tensor.unique_all.

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to an issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you opening the PR as a draft?

@github-actions
Copy link

github-actions bot commented Dec 6, 2023

@github-actions
Copy link

github-actions bot commented Dec 6, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_20 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 7, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_23 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

@coveralls
Copy link
Collaborator

coveralls commented Dec 7, 2023

Coverage Status

coverage: 89.993% (+0.3%) from 89.693%
when pulling 5469832 on feature/sorting
into f8b2202 on master.

@github-actions
Copy link

github-actions bot commented Dec 7, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_27 ran successfully.
Passed: 881
Failed: 50
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 8, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_32 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 8, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_33 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_35 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_40 ran successfully.
Passed: 889
Failed: 43
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_41 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

@oleksandr-pavlyk oleksandr-pavlyk marked this pull request as ready for review December 9, 2023 13:12
@github-actions
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_43 ran successfully.
Passed: 894
Failed: 38
Skipped: 59

@github-actions
Copy link

github-actions bot commented Dec 9, 2023

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_44 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_49 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_67 ran successfully.
Passed: 891
Failed: 41
Skipped: 59

@ndgrigorian
Copy link
Collaborator

ndgrigorian commented Dec 18, 2023

@oleksandr-pavlyk
Tests seem to pass using the asynchronous unique_* functions. Should I go ahead and remove set_functions_sync.py and transition to the asynchronous implementations?

For me, using the async implementation shaves off ~1ms for small arrays

In [2]: x = dpt.ones((10), dtype="c8")

In [3]: %time z = dpt.unique_values(x)
CPU times: user 267 ms, sys: 20.1 ms, total: 287 ms
Wall time: 296 ms

In [4]: %time z = dpt.unique_values(x)
CPU times: user 2.77 ms, sys: 0 ns, total: 2.77 ms
Wall time: 4.52 ms

In [5]: %time z = dpt.unique_values(x)
CPU times: user 1.51 ms, sys: 371 µs, total: 1.88 ms
Wall time: 1.8 ms

In [6]: %time z = dpt.unique_values(x)
CPU times: user 1.6 ms, sys: 0 ns, total: 1.6 ms
Wall time: 1.59 ms

In [7]: %time z = dpt.unique_values(x)
CPU times: user 1.36 ms, sys: 0 ns, total: 1.36 ms
Wall time: 1.29 ms

In [8]: %time z = dpt.unique_values(x)
CPU times: user 2.05 ms, sys: 0 ns, total: 2.05 ms
Wall time: 1.78 ms

compared to synchronized

In [2]: x = dpt.ones((10), dtype="c8")

In [3]: %time z = dpt.unique_values(x)
CPU times: user 264 ms, sys: 14.6 ms, total: 279 ms
Wall time: 287 ms

In [4]: %time z = dpt.unique_values(x)
CPU times: user 1.6 ms, sys: 1.07 ms, total: 2.67 ms
Wall time: 3.43 ms

In [5]: %time z = dpt.unique_values(x)
CPU times: user 1.62 ms, sys: 1.08 ms, total: 2.7 ms
Wall time: 3.78 ms

In [6]: %time z = dpt.unique_values(x)
CPU times: user 2.69 ms, sys: 0 ns, total: 2.69 ms
Wall time: 3.03 ms

In [7]: %time z = dpt.unique_values(x)
CPU times: user 0 ns, sys: 3.01 ms, total: 3.01 ms
Wall time: 3.33 ms

@oleksandr-pavlyk
Copy link
Contributor Author

@ndgrigorian I have switched to asynchronous implementation per your feedback, but kept the synchronous one around just in case.

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_74 ran successfully.
Passed: 890
Failed: 42
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_77 ran successfully.
Passed: 895
Failed: 37
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_78 ran successfully.
Passed: 892
Failed: 40
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_79 ran successfully.
Passed: 896
Failed: 36
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_82 ran successfully.
Passed: 895
Failed: 37
Skipped: 59

@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_83 ran successfully.
Passed: 894
Failed: 38
Skipped: 59

oleksandr-pavlyk and others added 18 commits January 8, 2024 13:30
This is `std::vector<sycl::event>` passed by reference to collect
events associated with host_task submissions.

The synchronizing call `mask_positions` is releasing GIL before
wait on these events is called.

It is either this, or accumulated host tasks must be returned to
the user.
Changed hyperparameter choices to be different for CPU and GPU, resulting
in 20% performance gain on GPU.

The non-recursive implementation allows to avoid repeated USM allocations,
resulting in performance gains for large arrays.

Furthermore, corrected base step kernel to accumulate in outputT rather than
in size_t, which additionally realizes savings when int32 is used as
accumulator type.

Using example from gh-1249, previously, on my Iris Xe laptop:

```
In [1]: import dpctl.tensor as dpt
   ...: ag = dpt.ones((8192, 8192), device='gpu', dtype='f4')
   ...: bg = dpt.ones((8192, 8192), device='gpu', dtype=bool)

In [2]: cg = ag[bg]

In [3]: dpt.all(cg == dpt.reshape(ag, -1))
Out[3]: usm_ndarray(True)

In [4]: %timeit -n 10 -r 3 cg = ag[bg]
212 ms ± 56 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```

while with this change:

```
In [4]: %timeit -n 10 -r 3 cg = ag[bg]
178 ms ± 24.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
```
As the array will be copied if it isn't C-contiguous regardless,
this change does not impact performance and permits calls to set functions
with strided, ND data
Set functions were making copies and then sorting the original data,
resulting in incorrect results.
`unique_all` would not return a `UniqueAllResult` tuple when early exiting for an array of all unique elements

Fixed problems through `unique` functions where the cumulative sum was being allocated on a separate (default) queue, causing inputs with a non-default queue to fail

`unique` functions now behave correctly for 0-size array inputs
Were previously returning a 1D array of indices rather than an array with the same shape as input `x`
Changes to sync implementation of set functions so that test
suite passes with sync implementation as it does for async
implementation.
@github-actions
Copy link

github-actions bot commented Jan 8, 2024

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_85 ran successfully.
Passed: 893
Failed: 16
Skipped: 82

@github-actions
Copy link

github-actions bot commented Jan 8, 2024

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_87 ran successfully.
Passed: 893
Failed: 16
Skipped: 82

…x FP types

We use extended comparison operators compatible with NumPy's behavior:

https://numpy.org/devdocs/reference/generated/numpy.sort.html

Specifically, we use [R, nan] block ordering for reals, and
[(R, R), (R, nan), (nan, R), (nan, nan)] for complexes.
@github-actions
Copy link

Array API standard conformance tests for dpctl=0.15.1dev2=py310h15de555_89 ran successfully.
Passed: 895
Failed: 14
Skipped: 82

Copy link
Collaborator

@ndgrigorian ndgrigorian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @oleksandr-pavlyk , very close to full array API coverage.

Copy link
Contributor

@vtavana vtavana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, works fine for zero dimensional arrays and arrays with nan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants