TL;DR: In this blog, I walk through the chain of thought behind my design for the ScoreRS project. I explain why I believe this work is important, detail the hypotheses and solutions that shaped it, and share the key challenges I overcame during development.
For the past two years, vision-language models (VLMs) have reshaped the paradigm of the computer vision community. Almost overnight, researchers across the field began embracing the power of language to enhance visual capabilities, drawn by its unprecedented open-endedness and simpler formulations of complex natural vision problems.
The remote sensing (RS) community has also benefited from these paradigm shifts, producing excellent works ranging from RemoteCLIP to GeoChat. However, I've observed that even when we scale up domain-specific data volumes and model sizes, these "specialized models" often underperform compared to general-purpose VLMs on challenging and generalized RS tasks. While I recognize that our smaller, resource-constrained team cannot match the engineering capabilities of large industrial companies, I wanted to explore what improvements might be possible and identify the fundamental challenges we face.
Last year, our team invested significant effort in building RS-specific VLMs (LHRS-Bot). At the time, we primarily evaluated these VLMs on basic classification, question answering, and grounding tasks. While we successfully built these models and published our findings, I felt we were merely following trends rather than breaking new ground. Most of these tasks could be solved by more lightweight models with even higher precision. After discussions with @Zhengzhuo, I became further discouraged, questioning what unique role VLMs could actually play in the RS domain.
Over the following year, we witnessed numerous impressive applications powered by LLMs, such as Copilot, NotebookLM, and web agents. I've observed that many challenging tasks can now be effectively solved by LLMs.
This remarkable success prompted me to reconsider the role of LLMs in RS. In our early exploration, we focused primarily on well-formulated RS problems like classification and question answering.
<aside> đź’ˇ
However, many challenges in the RS domain aren't so neatly defined.
</aside>
RS problems can be divided into two categories. First, there are problems involving different combinations of spectral bands or interactions with physical parameters to determine environmental states or retrieve specific parameters. Most of these processes, I should acknowledge, are difficult even for humans to fully solve or understand due to the complex interactions in our physical world, let alone formalize with language. For these problems, VLMs may not be particularly useful since their core capability revolves around language—a human creation that can't always capture these intricate physical processes.
On the other hand, many human activities involving RS images—such as disaster analysis, urban planning, and transportation management—can be readily formalized through language. In these scenarios, we typically identify important characteristics in images and make decisions based on our knowledge (learned in academic settings, from books, or through past experiences with similar situations). These types of tasks, indeed, could be effectively addressed with VLMs.
Therefore, I've come to believe that VLMs could bring real benefits to the RS community through their continuously improving reasoning and planning abilities. Their vast knowledge base (drawn from books, websites, reports, and other sources in their pretraining corpus) functions similarly to an RS expert's accumulated experience. Moreover, their capacity to interact with RAG (Retrieval-Augmented Generation) systems and participate in multi-agent collaborations further extends what they can accomplish.
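To make the RAG idea above concrete, here is a minimal, self-contained sketch of how retrieved "expert knowledge" could be folded into a VLM prompt for an RS task. Everything here (the toy knowledge base, `retrieve`, `build_prompt`) is a hypothetical illustration, not part of any real ScoreRS or LHRS-Bot API, and the word-overlap retriever is a stand-in for a real embedding-based one.

```python
# Toy "expert knowledge" snippets an RS assistant might retrieve.
# These facts and all function names are illustrative assumptions.
KNOWLEDGE_BASE = [
    "Flooded croplands appear dark and smooth in SAR imagery.",
    "Urban damage assessment compares pre- and post-event building footprints.",
    "NDVI below 0.2 usually indicates bare soil or built-up areas.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank snippets by naive word overlap with the query
    (a stand-in for a real embedding-based retriever)."""
    q_words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(image_caption: str, question: str) -> str:
    """Assemble a grounded prompt: retrieved context, then the
    image description, then the task."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(question))
    return (
        f"Context:\n{context}\n\n"
        f"Image: {image_caption}\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt(
    "A SAR scene of croplands after heavy rainfall.",
    "Which croplands are flooded in this SAR imagery?",
))
```

The design point is only that retrieval happens before generation: the model answers with domain facts injected into its context, much as an RS analyst consults references before making a call.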
With these insights in mind, I felt greatly encouraged and began my journey to develop RS-specific VLMs. However, rather than building an application framework or system for RS image analysis, I decided to focus on improving the capabilities of foundation models. This approach aligns with my expertise, and I firmly believe that the capabilities of the base model are the core of any successful application system.
I began by reassessing the gap between our specialized VLMs and other flagship general-purpose VLMs. After analyzing numerous comparative conversations (while controlling for base model differences like Llama-2 versus Llama-3), I discovered two key insights:
<aside> ⚠️
Important note: Here, "more likely" is important because during multiple prompting attempts, general VLMs succeed more frequently, but domain-specific VLMs also demonstrate the ability to produce correct answers. The difference is in consistency rather than absolute capability.
</aside>