Goals: The BIOSCAN-5M project follows up on the BIOSCAN-1M dataset release with a substantially larger and more thoroughly curated release of 5M image-DNA barcode pairs with partial taxonomic labels. This release forms the basis for current work on multimodal vision-DNA-language models for biodiversity monitoring. The project explores three distinct machine learning tasks that showcase the power of integrating multiple data modalities: pretraining a masked language model on DNA barcode sequences, zero-shot transfer learning for images and DNA barcodes, and multi-modality benchmarking through contrastive learning across DNA barcodes, image data, and taxonomic information.
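To make the first task concrete, the following is a minimal sketch of the data-preparation step for BERT-style masked language modeling on DNA barcodes: split a barcode into non-overlapping k-mer tokens and randomly mask a fraction of them. The k-mer size, mask rate, and toy sequence are illustrative assumptions, not the project's actual hyperparameters.

```python
# Hypothetical sketch of MLM input preparation for DNA barcodes.
# Token size (k=4) and mask rate (15%) are assumptions for illustration.
import random


def kmer_tokenize(seq, k=4):
    """Split a DNA sequence into non-overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with a mask token.

    Returns the masked token list and a parallel label list holding the
    original token at masked positions and None elsewhere -- the standard
    target format for masked-language-model pretraining.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


barcode = "AACATTATATTTTATTTTTGGAGC" * 3  # toy stand-in for a COI barcode
tokens = kmer_tokenize(barcode)
masked, labels = mask_tokens(tokens)
```

The model is then trained to recover the original k-mers at the masked positions, giving it a representation of barcode sequence structure without requiring taxonomic labels.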
Overview: The BIOSCAN-5M dataset demonstrates the ability to perform contrastive learning across DNA barcodes, image data, and taxonomic information, resulting in a general shared embedding space that enables flexible taxonomic classification using various combinations of input modalities. A key feature is the flexibility to use both DNA- and image-based queries and keys for taxonomic classification, which is crucial for real-world biodiversity monitoring scenarios. The dataset and research were presented to the leadership at the Center for Biodiversity Genomics, who now use these vision models in their intake process. The work was also presented to LIFEPLAN, another international collaboration, at the Vector Institute's Computer Vision Symposium. Future work includes disseminating the dataset to encourage its adoption as a benchmark and building "BIOSCAN Browser," a web-based interface for exploring and querying the dataset.
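The mixed-modality query/key classification described above can be sketched as nearest-neighbor lookup in the shared space: a query embedding (from either the image or the DNA encoder) inherits the label of its most similar key. The embeddings, labels, and similarity choice below are illustrative assumptions, not outputs of the BIOSCAN-5M models.

```python
# Hypothetical sketch: taxonomic classification by cosine nearest neighbor
# in a shared embedding space. Vectors and taxon labels are made up.
import numpy as np


def cosine_nn_label(query, keys, labels):
    """Return the label of the key most similar to the query under cosine similarity."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return labels[int(np.argmax(k @ q))]


# Toy shared space: pretend contrastive training aligned these vectors,
# so keys from different modalities live in the same space.
keys = np.array([
    [1.0, 0.1, 0.0],  # DNA embedding of a labeled Diptera specimen
    [0.0, 1.0, 0.2],  # image embedding of a labeled Hymenoptera specimen
])
labels = ["Diptera", "Hymenoptera"]

# An image embedding of an unlabeled specimen can be matched against
# keys from either modality -- the flexibility the text describes.
image_query = np.array([0.9, 0.2, 0.0])
predicted = cosine_nn_label(image_query, keys, labels)  # -> "Diptera"
```

Because queries and keys are interchangeable across modalities, the same lookup supports image-to-DNA, DNA-to-image, and within-modality classification without retraining.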