First implementation of single-GPU FFT using cuFFT #238
magnatelee merged 27 commits into nv-legate:branch-22.05
Conversation
cunumeric/array.py
Outdated
```python
raise ValueError(
    "Axis is out of bounds for array of size {}".format(self.ndim)
)
fft_axes = [x % self.ndim for x in fft_axes]
```
I guess if you did this first, then you could have avoided the np.abs call at line 2128.
Not really. axes = [-5, 10] is not a valid input and it should error out before sanitizing. Unless there is a compelling reason to deviate from numpy behavior?
I'm not suggesting we deviate from NumPy. I believe your checking code at line 2128 is incorrect because of the asymmetry between positive and negative indices in Python (which made me find it somewhat odd). Index n is invalid for a sequence of size n, but -n is valid, as it points to the first element. Line 2128 doesn't take this into account and will reject the latter. (Let me know if you think otherwise.) A usual way of handling negative indices in Python would be to add the size of the sequence to any negative indices and then check if they are still negative or greater than or equal to the size.
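The normalization described above can be sketched like this (an illustrative example, not the PR's code):

```python
def sanitize_axes(axes, ndim):
    # Shift negative indices by ndim first, then bounds-check,
    # mirroring NumPy's axis normalization: -ndim is valid (first
    # axis), while ndim itself is out of bounds.
    sanitized = []
    for ax in axes:
        if ax < 0:
            ax += ndim
        if not 0 <= ax < ndim:
            raise ValueError(
                "Axis is out of bounds for array of dimension {}".format(ndim)
            )
        sanitized.append(ax)
    return sanitized

sanitize_axes([-3, 1], 3)  # [0, 1]: -n is a valid axis for an n-dim array
```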
cunumeric/array.py
Outdated
```python
if user_sizes:
    # Zero padding if any of the user sizes is larger than input
    zeropad_input = self
    if np.any(np.greater(fft_s, fft_input.shape)):
```
If you're repeatedly passing these to NumPy calls, why not make NumPy ndarrays from them up front? The conversion between a Python list and a NumPy ndarray may not be as cheap as you'd expect.
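For example (with hypothetical stand-in values for `s` and `fft_input.shape`), converting once up front avoids repeating the list-to-ndarray conversion inside every NumPy call:

```python
import numpy as np

s = [4, 8]            # user-requested FFT sizes (example values)
input_shape = (2, 8)  # stand-in for fft_input.shape

# Convert the Python containers to ndarrays once...
fft_s = np.asarray(s)
in_shape = np.asarray(input_shape)

# ...so later checks operate on ndarrays directly instead of
# re-converting a list on each call
needs_padding = bool(np.any(fft_s > in_shape))  # True: 4 > 2
```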
cunumeric/array.py
Outdated
```python
fft_input_shape = list(fft_input.shape)
for idx, ax in enumerate(fft_axes):
    fft_input_shape[ax] = s[idx]
fft_input_shape = tuple(fft_input_shape)
```
cunumeric/array.py
Outdated
```python
fft_input = ndarray(
    shape=fft_input_shape,
    thunk=zeropad_input._thunk.get_item(slices),
).copy()
```
What's the reason for this copy?
From your comments on Slack
(Feb 16):
what I’d do for a functional implementation is to make a padded array, fill it with zeros, slice it to a sub-array whose shape matches the original array’s, and do a copy between the two
(Feb 24):
for zero padding, I think you can do something like this:
1. create an empty deferred array A of the size
2. call fill to zero it out
3. slice A to match the shape of s and call copy
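That recipe, sketched with plain NumPy arrays (the PR works on deferred arrays; `zero_pad` is a hypothetical helper, not the actual implementation):

```python
import numpy as np

def zero_pad(arr, padded_shape):
    # 1. create an empty array of the padded size
    padded = np.empty(padded_shape, dtype=arr.dtype)
    # 2. fill it with zeros
    padded.fill(0)
    # 3. slice it to the input's shape and copy the input in
    slices = tuple(slice(0, n) for n in arr.shape)
    padded[slices] = arr
    return padded

y = zero_pad(np.ones((2, 3)), (4, 5))  # 4x5 array; ones in the top-left 2x3
```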
Tests fail without the copy. But if this copy is redundant, I'd be happy to amend it with your feedback. Maybe the thunk is not being assigned correctly?
I think I got confused by the fact that lines 2180-2183 already make a copy of the input. I think the copy at line 2193 should be moved to that if statement to make sure we make only one copy of the input.
src/cunumeric/fft/fft.cu
Outdated
```cpp
  fftDirection direction)
{
  const Point<DIM> zero = Point<DIM>::ZEROES();
  CHECK_CUFFT(cufftXtExec(plan, (void*)in.ptr(zero), (void*)out.ptr(zero), (int)direction));
```
I don't think you need these casts to void*; that conversion would happen implicitly. If you insist on casting, please use static_cast. (I saw other places like this, so I'd like you to fix them all.)
src/cunumeric/fft/fft.cu
Outdated
```cpp
}

// Copy input to temporary buffer to perform FFTs one by one
DeferredBuffer<INPUT_TYPE, DIM> input_buffer(
```
Again, let's use create_buffer.
```cpp
for (auto& ax : axes) {
  // Create the plan
  cufftHandle plan;
  CHECK_CUFFT(cufftCreate(&plan));
```
Again, let me know how the existing plan cache can replace this code.
src/cunumeric/fft/fft.cu
Outdated
```cpp
  num_elements_out * sizeof(OUTPUT_TYPE),
  cudaMemcpyDefault,
  stream));
CHECK_CUDA(cudaStreamSynchronize(stream));
```
You don't need this synchronization, as the runtime will do it for you.
```cpp
// Perform the FFT operation as multiple 1D FFTs along the specified axes, single R2C/C2R operation.
template <int DIM, typename OUTPUT_TYPE, typename INPUT_TYPE>
__host__ static inline void cufft_over_axis_r2c_c2r(AccessorWO<OUTPUT_TYPE, DIM> out,
```
This function looks quite similar to cufft_over_axis_c2c. Can you think of a way to factor out the common parts?
* Added sizes to fft input
* Added initial support for per-axis FFTs
* Added axes sanitization
* Added working implementation of zero padding and size truncation
* Added normalization
* Added hermitian transform functions
* Clean-up of code
* Fixed runtime issues
* Fixed repeating axes
* Work over axes (#3)
* First working version for single axes
* R2C working for 3D + axes
* Added fixes for R2C + axes
* Fixed C2R for 3D
* Clean-up part I
* Removed axes boolean in C++
* Some minor renaming and refactoring
* Fixed several issues, moved FFT to its own module within cuNumeric
* Refactored and expanded test lists
* Added default values to public API
* Added conversions to R2C/C2R
* Added docstrings and odd type tests
* Fixed issue when running C2C/Z2Z with real values
* Further refactoring, removing unnecessary code
* Addressed PR feedback, refactor / code cleaning
* Added host synchronization on FFT with internal data copies
* Fixed an issue that caused C2R to run over axes unnecessarily
* Fixed issues after rebase
* Final fixes from MR
* Replaced manual stream creation with cached streams
* Minor fixes from MR feedback
* Minor fixes from last MR feedback
for more information, see https://pre-commit.ci
Force-pushed 447b990 to 3c8d3c8
cunumeric/array.py
Outdated
```python
# Shape
fft_input = self
fft_input_shape = np.asarray(list(self.shape))
```
why do you create a list here? can't this be np.asarray(self.shape)?
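For reference, `np.asarray` accepts a tuple directly, so the intermediate `list()` round-trip is redundant:

```python
import numpy as np

shape = (4, 8, 3)            # self.shape is already a tuple
a = np.asarray(list(shape))  # current code: extra list() conversion
b = np.asarray(shape)        # equivalent, without the conversion
assert (a == b).all() and a.dtype == b.dtype
```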
cunumeric/array.py
Outdated
```python
# Shape
fft_input = self
fft_input_shape = np.asarray(list(self.shape))
fft_output_shape = np.asarray(list(self.shape))
```
cunumeric/array.py
Outdated
```python
# Normalization
fft_norm = FFTNormalization.from_string(norm)
do_normalization = any(
    [
```
cunumeric/config.py
Outdated
```python
@staticmethod
def real_to_complex_code(dtype):
    if dtype == np.float64:
        return FFT_D2Z()
```
I guess we could cache these objects in a dictionary. we don't want to create a fresh instance.
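One way to cache them (a sketch with stand-in classes, not the PR's actual config types):

```python
import numpy as np

class FFT_D2Z:  # stand-ins for the real code classes in config.py
    pass

class FFT_R2C:
    pass

# Build the instances once at module load; look them up by dtype thereafter
_REAL_TO_COMPLEX = {
    np.dtype(np.float64): FFT_D2Z(),
    np.dtype(np.float32): FFT_R2C(),
}

def real_to_complex_code(dtype):
    try:
        return _REAL_TO_COMPLEX[np.dtype(dtype)]
    except KeyError:
        raise TypeError("Unsupported dtype: {}".format(dtype))
```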
cunumeric/config.py
Outdated
```python
if dtype == np.float64:
    return FFT_D2Z()
elif dtype == np.float32:
    return FFT_R2C()
```
cunumeric/config.py
Outdated
```python
@staticmethod
def complex_to_real_code(dtype):
    if dtype == np.complex128:
        return FFT_Z2D()
```
cunumeric/config.py
Outdated
```python
if dtype == np.complex128:
    return FFT_Z2D()
elif dtype == np.complex64:
    return FFT_C2R()
```
```python
# Match these to fftType in fft_util.h
class FFT_R2C:
```
I'd probably refactor these classes into one template that changes its properties based on the constructor arguments. you don't have to do that refactoring, but you're welcome to try.
What constructor arguments would you suggest?
I'll probably leave this to you, as you seem to have a clear idea of how you'd like these classes to look, and any changes on my end might steer this again into C++ territory.
Ok. I'll approve this PR and make some follow-up changes if you don't mind me doing it.
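For what it's worth, one possible shape for that refactoring (hypothetical names and attributes; the actual follow-up change may differ):

```python
import numpy as np

class FFTCode:
    """One parameterized class replacing the per-transform FFT_* classes."""

    def __init__(self, name, input_dtype, output_dtype):
        self.name = name                  # e.g. "R2C"
        self.input_dtype = input_dtype    # dtype the transform consumes
        self.output_dtype = output_dtype  # dtype the transform produces

# Instances configured via constructor arguments instead of distinct classes
FFT_R2C = FFTCode("R2C", np.float32, np.complex64)
FFT_D2Z = FFTCode("D2Z", np.float64, np.complex128)
FFT_C2R = FFTCode("C2R", np.complex64, np.float32)
FFT_Z2D = FFTCode("Z2D", np.complex128, np.float64)
```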
@mferreravila Like I said in the other comment, I'll take this over from you and polish it up, unless you're interested in finishing it up yourself. Just let me know.
* duplicate conda envs from cunumeric
* remove old env file
* update README
Add support for single-GPU FFTs using cuFFT as the back-end.