Hello,
First of all, I wanted to show my appreciation. I was already a fan of EfficientSAM, and EfficientTAM is great. EfficientSAM was included in a package I was developing for microscopy annotation in Fiji (https://github.com/segment-anything-models-java/SAMJ-IJ), and I am now including EfficientTAM for image, video, and 2.5D segmentation, since thanks to your work it is now feasible to run models like this on CPU over several images.
As far as I understand, EfficientTAM takes the weights of EfficientSAM, pretrains the image encoder on a segmentation task on the SA-1B dataset, and then trains for video segmentation on the SA-V dataset.
In my opinion, and as the paper's metrics seem to confirm, EfficientTAM is better than EfficientSAM at image segmentation and on par with or better than SAM 1. How can this be?
As far as I have seen, the EfficientTAM image encoder is just a vanilla ViT. How can a Tiny vanilla encoder compare in performance to the SAM 1 Huge encoder?
Even the SAM 2 Base encoder, which is much bigger and introduces several prior biases through the Hiera ViT, is outperformed by the Small version of a vanilla ViT.
Is it just a matter of pretraining with MAE on ImageNet, then pretraining on SA-1B for segmentation, and then training for video?
Does the remarkable performance gain of a vanilla Tiny/Small ViT come just from training longer, on more data, and in a smart way with several different objectives?
Has any other trick been introduced?
I am very curious about this, because the segmentation performance of the Tiny and Small models is very impressive.
Thanks a lot for your time and congrats again on a great job!
Regards,
Carlos