sashakunitsyn/vlrm-blip2-opt-2.7b
Image-to-Text • 4B params
An unsupervised reinforcement learning approach enhances image captioning using vision-language models such as CLIP and BLIP2-ITM as reward models, reaching a 0.90 R@1 CLIP recall score on MS-COCO.
In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning, with vision-language models such as CLIP and BLIP2-ITM serving as reward models. The RL-tuned model generates longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP recall score on the MS-COCO Karpathy test split. Weights are available at https://huggingface.co/sashakunitsyn/vlrm-blip2-opt-2.7b.
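As a concrete illustration, the sketch below captions an image with the released checkpoint and then scores the caption with CLIP image-text similarity, the kind of signal the paper uses as an RL reward. This is a minimal sketch under assumptions, not the authors' reference code: it assumes the repository hosts full BLIP-2 weights loadable with transformers' Blip2Processor / Blip2ForConditionalGeneration, and the CLIP checkpoint (openai/clip-vit-base-patch32) and sample image URL are illustrative choices.

    # Minimal sketch (assumptions flagged in comments): caption an image with
    # the released checkpoint, then score the caption with CLIP image-text
    # similarity, i.e. the kind of quantity used as an RL reward in the paper.
    import requests
    import torch
    from PIL import Image
    from transformers import (
        Blip2ForConditionalGeneration,
        Blip2Processor,
        CLIPModel,
        CLIPProcessor,
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Assumption: the repo exposes full BLIP-2 weights via from_pretrained;
    # consult the model card if the checkpoint is distributed differently.
    processor = Blip2Processor.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained(
        "sashakunitsyn/vlrm-blip2-opt-2.7b", torch_dtype=torch.float16
    ).to(device)

    # Arbitrary COCO validation image, for illustration only.
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
    ids = model.generate(**inputs, max_new_tokens=60)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
    print("caption:", caption)

    # CLIP image-text similarity as a scalar score. This checkpoint and the
    # exact scoring are illustrative; the paper's reward formulation may differ.
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    batch = clip_proc(
        text=[caption], images=image, return_tensors="pt",
        padding=True, truncation=True,
    ).to(device)
    with torch.no_grad():
        score = clip(**batch).logits_per_image[0, 0].item()
    print("CLIP similarity:", score)

In the paper's setup such a similarity acts as the reward maximized during RL fine-tuning; here it is only computed post hoc for a single caption.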
Get this paper in your agent:
  curl -LsSf https://hf.co/cli/install.sh | bash
  hf papers read 2404.01911