Goals: To improve image-based representations for taxonomic classification by leveraging DNA information during training. The method performs contrastive learning across three modalities – images, DNA sequences, and textual taxonomic labels – aligning their embeddings in a shared space, which enables accurate taxonomic predictions from image data alone at inference time.
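A minimal sketch of the tri-modal alignment idea described above, using a symmetric InfoNCE objective averaged over the three modality pairs. The function names, the temperature value, and the equal-weight pairwise averaging are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def log_softmax(logits):
    # Row-wise log-softmax with a max shift for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE between two batches of embeddings whose rows are
    # paired: row i of `a` and row i of `b` come from the same specimen.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature              # (N, N) scaled cosine similarities
    idx = np.arange(len(a))                     # positives lie on the diagonal
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return (loss_ab + loss_ba) / 2

def trimodal_contrastive_loss(img, dna, txt, temperature=0.07):
    # Average the pairwise contrastive losses over all three modality pairs,
    # pulling matched image/DNA/text embeddings together in a shared space.
    return (info_nce(img, dna, temperature)
            + info_nce(img, txt, temperature)
            + info_nce(dna, txt, temperature)) / 3.0
```

With this objective, perfectly aligned embeddings across the three modalities yield a much lower loss than unrelated ones, which is what drives the shared embedding space.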
Overview: CLIBD is the first work to integrate three data modalities for biodiversity classification, surpassing previous single-modality approaches by over 11% accuracy on zero-shot learning tasks. Its ability to generalize to unseen species is particularly valuable in real-world settings, where encountering new or rare species is common. Although DNA enables highly accurate taxonomic placement, CLIBD uses DNA data only during training, allowing rapid, cost-effective image-based inference. Future work will explore using Barcode Index Numbers (BINs) as species proxies to create more varied training examples, particularly for rare taxa.
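The image-only inference mentioned above can be sketched as nearest-key classification in the shared embedding space: class "key" embeddings (e.g. per-taxon DNA or text embeddings) are computed once offline, and at test time each image embedding is matched to its most similar key. The function name, the toy genus labels, and the use of a simple cosine-similarity argmax are illustrative assumptions.

```python
import numpy as np

def predict_taxa(img_emb, key_emb, key_labels):
    # Zero-shot, image-only inference: assign each image embedding the label
    # of its nearest class key by cosine similarity. The keys are precomputed
    # during training, so no DNA sequencing is needed at inference time.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    keys = key_emb / np.linalg.norm(key_emb, axis=1, keepdims=True)
    sims = img @ keys.T                      # (num_images, num_classes)
    return [key_labels[i] for i in sims.argmax(axis=1)]
```

Because the keys can come from modalities never seen at test time, this setup also supports zero-shot prediction for taxa with no training images.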