Make reduced-precision cuBLAS mode opt-in #519
manopapad merged 1 commit into nv-legate:branch-22.10
Conversation
```cpp
if (nullptr == disable_tensor_cores) {
  // No request to disable tensor cores so turn them on
  cublasStatus_t status = cublasSetMathMode(cublas_, CUBLAS_TENSOR_OP_MATH);
const char* fast_math = getenv("CUNUMERIC_FAST_MATH");
```
I chose this over CUNUMERIC_DISABLE_TENSOR_CORES, because if we don't set anything then cuBLAS will still use tensor cores, but only where that wouldn't result in precision loss.
```cpp
const char* fast_math = getenv("CUNUMERIC_FAST_MATH");
if (fast_math != nullptr && atoi(fast_math) > 0) {
  // Enable acceleration of single precision routines using TF32 tensor cores.
  cublasStatus_t status = cublasSetMathMode(cublas_, CUBLAS_TF32_TENSOR_OP_MATH);
```
CUBLAS_TENSOR_OP_MATH is deprecated.
Are TF32 tensor cores not used when this math mode isn't set? This table says that CUBLAS_DEFAULT_MATH can use tensor cores whenever possible, which might include TF32 tensor cores:
https://docs.nvidia.com/cuda/cublas/index.html#cublasmath_t
If we want to be absolutely sure about precision, I guess we should set CUBLAS_PEDANTIC_MATH in the else branch.
AFAIU in CUBLAS_PEDANTIC_MATH mode tensor cores are disabled altogether, along with (I assume) other mostly-safe optimizations like operation reordering.
It sounds like CUBLAS_DEFAULT_MATH is guaranteed to respect the (minimum) level of precision prescribed by the operation (e.g. cublasSgemmEx prescribes the use of fp32-wide intermediates) but can adjust upwards if it makes sense, and will use tensor cores where safe. This is in contrast to CUBLAS_TF32_TENSOR_OP_MATH, which allows the use of tf32 intermediates for cublasSgemmEx, and CUBLAS_TENSOR_OP_MATH, which allows the use of fp16 intermediates.
Therefore, I don't think we want CUBLAS_PEDANTIC_MATH to be the default. How about I just change the default to be CUBLAS_DEFAULT_MATH, and activate CUBLAS_PEDANTIC_MATH if the user passes CUNUMERIC_DISABLE_TENSOR_CORES? We can then pass this flag during testing, if necessary.
I just wanted to confirm that the default setting is sufficient to avoid the precision problem we're seeing on A100s. CUBLAS_PEDANTIC_MATH is unnecessary if CUBLAS_DEFAULT_MATH already does the right thing.