
Conversation

@JackCaoG (Collaborator) commented May 3, 2024

Fix the bug where, if an in-place operation is applied multiple times, aliasing won't happen.

Consider this case without the PR:

# Tensor ID 1, alias ID 1
t1 = torch.randn(5,5).to('xla:0')
# Tensor ID 2, alias ID 1
t1 += 1
# Tensor ID 3, alias ID 2
t1 *= 3

At mark_step time, we check the input buffer, which has tensor ID 1, against the output alias ID, which is 2; since they don't match, we skip donating the input buffer of size (5, 5).

for (size_t i = 0; i < indices.size(); ++i) {
  size_t tensor_index = indices[i];
  int64_t tensor_id = tensors[tensor_index]->data()->alias_id;
  output_tensor_id_map[tensor_id] = i;
}

auto it = output_tensor_id_map.find(data_info->tensor_id);
// Parameter buffer's TensorId in output_tensor_id_map means
// this buffer is not needed after execution since XLATensor will get a
// new buffer.
if (it != output_tensor_id_map.end()) {
  lowering_ctx->builder()->AddBufferDonor(/*param_number=*/i,
                                          /*param_index=*/{});
  buffer_donor_indexs.push_back(i);
}

The alias ID should track the tensor ID of the input buffer, not the tensor ID of the last base.
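
To make the bookkeeping concrete, here is a toy Python model of the two policies (hypothetical names, not the actual XLATensor code). Chaining alias_id off the last base breaks the donation check after the second in-place op; propagating the input buffer's tensor ID keeps it intact.

from dataclasses import dataclass

_next_id = 0

def unique_id():
    global _next_id
    _next_id += 1
    return _next_id

@dataclass
class ToyTensor:
    tensor_id: int
    alias_id: int

def inplace_buggy(t):
    # Bug: alias_id chains off the previous tensor's tensor_id, so after
    # two in-place ops it no longer names the input buffer's tensor ID.
    return ToyTensor(tensor_id=unique_id(), alias_id=t.tensor_id)

def inplace_fixed(t):
    # Fix: propagate alias_id so it keeps tracking the tensor ID the
    # input buffer had when this graph started.
    return ToyTensor(tensor_id=unique_id(), alias_id=t.alias_id)

t = ToyTensor(tensor_id=unique_id(), alias_id=1)  # input buffer, tensor ID 1
t = inplace_buggy(inplace_buggy(t))  # tensor_id=3, alias_id=2
assert t.alias_id != 1  # mismatch, so donating the (5, 5) buffer is skipped

_next_id = 1  # replay the same program with the fix
t = ToyTensor(tensor_id=1, alias_id=1)
t = inplace_fixed(inplace_fixed(t))  # tensor_id=3, alias_id=1
assert t.alias_id == 1  # matches the input buffer tensor ID, so it is donated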

@JackCaoG requested review from alanwaketan and wonjoo-wj on May 3, 2024 17:37
@JackCaoG (Author) commented May 3, 2024

@alanwaketan @wonjoolee95 I think this one is ready for review.

auto input_tensor = bridge::GetXlaTensor(input);
auto output_tensor = bridge::GetXlaTensor(output);
output_tensor->data()->alias_id = input_tensor->GetUniqueId();
if (input_tensor->CurrentDataHandle() != nullptr ||

Collaborator commented

I guess we can always use alias_id?

@JackCaoG (Author) commented May 3, 2024

Haha, that's what I thought, but actually no. Look at my example below:

    # x.tensor_id = 1, x.alias_id = 1
    x = torch.randn(5, 5).to(xla_device())
    # x.tensor_id = 2, x.alias_id should be 1
    x += 1
    xm.mark_step()
    # x.tensor_id = 3, x.alias_id should be 2 since the input tensor ID
    # will be 2 for this graph
    x *= 1
    xm.mark_step()

If we always used alias_id, the alias_id of x in the second graph would be 1, but we need it to be 2.

@JackCaoG (Author) commented

In the second execution the input tensor ID is 2, and we need the alias ID to always match the input tensor ID. In other words, we should not carry alias_id across mark_step.

@JackCaoG (Author) commented

This is a bit tricky: even though the underlying buffer is aliased, we still create a new PjRtBuffer object for x after the first mark_step. That DeviceData object (a wrapper around the PjRtBuffer) will have a data_info with tensor_id 2, since x's tensor ID is 2 after the first mark_step.
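
A minimal sketch of that point (toy classes, not the real PjRtBuffer/DeviceData API): the DeviceData backing x in the second graph records tensor_id 2, so the second graph needs alias_id 2, not the alias_id 1 carried over from the first graph.

class ToyDeviceData:
    # Stands in for the DeviceData wrapping the PjRtBuffer; its data_info
    # records the tensor ID x had when the previous graph was cut.
    def __init__(self, tensor_id):
        self.tensor_id = tensor_id

# After the first mark_step, x has tensor_id 2, so its fresh backing
# DeviceData carries tensor_id 2 even though the buffer is aliased.
x_input = ToyDeviceData(tensor_id=2)

# For `x *= 1` in the second graph to donate x's buffer, the output's
# alias_id must equal the input DeviceData's tensor_id (2), not the
# alias_id (1) from the previous graph.
alias_id_for_second_graph = x_input.tensor_id
assert alias_id_for_second_graph == 2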

@alanwaketan (Collaborator) commented May 3, 2024

I guess resetting alias_id after mark_step is probably very complicated. This is more like a simplified way to achieve that, assuming IR/outputs become DeviceData/inputs.

@JackCaoG (Author) commented

We can do that too (reset alias_id to the tensor ID after processing the input_output_alias info). That might make this code less confusing, haha.
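
A sketch of that follow-up, with a hypothetical helper (not actual pytorch/xla code): after the executed graph's input/output alias info is processed, reset every live tensor's alias_id to its own tensor_id so nothing stale carries across mark_step.

from dataclasses import dataclass

@dataclass
class ToyData:
    tensor_id: int
    alias_id: int

def reset_alias_ids(live_tensor_data):
    # Drop the stale alias chain so the next graph starts from each
    # tensor's current ID.
    for d in live_tensor_data:
        d.alias_id = d.tensor_id

data = [ToyData(tensor_id=2, alias_id=1)]
reset_alias_ids(data)
assert data[0].alias_id == 2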

Collaborator commented

That sounds like a good follow-up, but feel free to skip it.

@alanwaketan (Collaborator) left a comment

LGTM.

@JackCaoG merged commit e3fc033 into master on May 3, 2024
@jeffhataws (Collaborator) commented

Will this go into 2.4? Any chance it can be backported to 2.3?

@JackCaoG (Author) commented Jun 4, 2024

This will be part of 2.4. We don't do dot releases, so it is unlikely this one will be in the 2.3 release.
