Hi authors,
First of all, thank you so much for your excellent work and for sharing it with the community!
I have a quick question regarding the training details mentioned in your supplementary material. You noted the following:
"Inspired by the training strategy of SVD, we first train the model from scratch with an AdamW [46] optimizer on RGBD images, with a fixed learning rate of 1e-4 for 40K iterations. Then, we finetune the temporal layers in the decoder for another 20K iterations on video data."
Could you please clarify what learning rate was used for this second stage (the additional 20K iterations of finetuning on video data)? Did you keep it fixed at 1e-4, or was a different learning rate or schedule applied?
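For clarity, here is how I'm currently reading the two-stage recipe — everything below is just my summary of the quoted passage, and the stage-2 learning rate is the one value I can't fill in (left as `None`):

```python
# My interpretation of the training recipe from the supplementary material.
# Only the stage-2 learning rate is unknown; all other values are quoted.
STAGE_1 = {
    "data": "RGBD images",
    "optimizer": "AdamW",
    "lr": 1e-4,            # fixed, as stated
    "iterations": 40_000,
    "trainable": "full model (trained from scratch)",
}
STAGE_2 = {
    "data": "video",
    "optimizer": "AdamW",  # assumed unchanged (not stated explicitly)
    "lr": None,            # <-- the value I'm asking about: 1e-4, or something else?
    "iterations": 20_000,
    "trainable": "temporal layers in the decoder",
}
```

If the stage-2 rate was lowered or scheduled (e.g. with warmup or decay), knowing that would help a lot for reproduction.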
Thank you in advance for your time and help!