Goals: To introduce INQUIRE, a text-to-image retrieval benchmark for evaluating multimodal vision-language models on expert-level natural world queries. The benchmark includes iNat24, a new dataset of five million natural world images, paired with 250 expert-level retrieval queries spanning 16 broad natural world categories. The queries require reasoning about species identification, context, behavior, and image appearance.
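For orientation, a single benchmark entry can be pictured as a query paired with its exhaustively labeled matching images. This is a minimal sketch only: the field names and values below are illustrative assumptions for exposition, not the actual INQUIRE/iNat24 schema.

```python
# Illustrative only: field names and values are assumptions, not the released schema.
example_query = {
    "query_id": 42,                      # hypothetical identifier
    "query_text": "a hermit crab using plastic as its shell",  # example of an expert-level query
    "category": "context",               # one of the broad natural world categories
    "relevant_image_ids": [              # all matching images, labeled within iNat24
        "inat24_0001937",
        "inat24_0428115",
    ],
}
```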
Overview: INQUIRE evaluates two core retrieval tasks: INQUIRE-Fullrank, which ranks the full five-million-image dataset, and INQUIRE-Rerank, which reranks a fixed initial set of 100 images per query. Compared to existing image retrieval datasets, INQUIRE is larger, contains more image matches per query, and demands both advanced image understanding and domain expertise. Evaluations of recent multimodal models show that INQUIRE poses a significant challenge: the best models fail to achieve an mAP@50 above 50%. Reranking with more powerful multimodal models improves retrieval performance, though substantial room for improvement remains.
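To make the two tasks and the reported metric concrete, the sketch below mimics the evaluation loop with random vectors standing in for a real CLIP-style encoder and a stronger reranking model; all names and sizes are illustrative assumptions, not the benchmark's actual API, and the AP@50 formulation is the common one (the paper's exact normalization may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in practice these would come from a CLIP-style model
# applied to the query text and to every image in the collection.
n_images, dim = 10_000, 512
image_embs = rng.normal(size=(n_images, dim))
image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)
query_emb = rng.normal(size=dim)
query_emb /= np.linalg.norm(query_emb)

# INQUIRE-Fullrank: rank the entire image collection by similarity to the query.
scores = image_embs @ query_emb
full_ranking = np.argsort(-scores)

# INQUIRE-Rerank: re-score a fixed initial top-100 with a (stronger) model.
top100 = full_ranking[:100]
rerank_scores = rng.normal(size=100)  # placeholder for a multimodal model's scores
reranked = top100[np.argsort(-rerank_scores)]

def average_precision_at_k(ranked_ids, relevant_ids, k=50):
    """Standard AP@k: average the precision at each rank (<= k) where a relevant image appears."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids[:k], start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    denom = min(len(relevant), k)
    return sum(precisions) / denom if denom else 0.0

relevant = rng.choice(n_images, size=30, replace=False)  # hypothetical ground-truth matches
print("AP@50 (fullrank):", average_precision_at_k(full_ranking, relevant))
print("AP@50 (reranked):", average_precision_at_k(reranked, relevant))
# mAP@50 is this per-query value averaged over all benchmark queries.
```

Because the stand-in scores are random, the printed values are meaningless; the point is the pipeline shape, namely that reranking only reorders the fixed initial candidate set, so it can raise AP@50 only for matches already retrieved in the top 100.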