Microsoft’s Azure AI team has introduced Florence-2, a groundbreaking vision foundation model that handles a wide range of vision and vision-language tasks through a single, unified, prompt-based representation. Available under a permissive MIT license, Florence-2 comes in two sizes, 232M and 771M parameters, and excels at tasks such as captioning, object detection, visual grounding, and segmentation, performing on par with or better than many larger vision models. The model has the potential to change the way enterprises approach vision applications, offering one unified model in place of investments in several task-specific ones.
What sets Florence-2 apart is its ability to understand spatial information at different scales, from broad image-level concepts to fine-grained pixel details, as well as semantic granularity, from high-level captions to detailed region descriptions. To train it, Microsoft generated a comprehensive visual dataset called FLD-5B, comprising 5.4 billion annotations on 126 million images, covering everything from high-level descriptions to specific regions and objects. Florence-2 uses a sequence-to-sequence architecture that pairs an image encoder with a multi-modality encoder-decoder, enabling the model to handle varied vision tasks without requiring task-specific architectural modifications.
The model’s performance is impressive: it outperforms larger models on a variety of tasks, including object detection, captioning, visual grounding, and visual question answering. Its compact size and versatility make it an attractive option for developers, who can replace several separate task-specific vision models with a single one, reducing compute costs and development time.
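To make the prompt-based, task-agnostic design concrete, here is a minimal sketch of how Florence-2 can be driven through the Hugging Face transformers library. The model ID (`microsoft/Florence-2-base`) and task tokens (`<CAPTION>`, `<OD>`, etc.) reflect the model's public Hugging Face release, but treat them as assumptions and verify against the current model card; the `post_process_generation` helper ships with the model's remote code.

```python
# Sketch of Florence-2's unified, prompt-based interface. Model ID and task
# tokens are assumptions drawn from the public Hugging Face release -- check
# the model card before relying on them.

# One text token selects each task: the same weights perform captioning,
# detection, grounding, etc., with no task-specific heads added.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

def build_prompt(task, text=""):
    """Compose the prompt: a task token plus optional extra text
    (e.g. the phrase to locate when doing phrase grounding)."""
    return TASK_PROMPTS[task] + text

def load_florence2(model_id="microsoft/Florence-2-base"):
    """Fetch the 232M 'base' variant (the 771M variant is Florence-2-large).
    The lazy import keeps the heavy dependency optional."""
    from transformers import AutoModelForCausalLM, AutoProcessor
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor

def run_task(model, processor, image, task, text=""):
    """Run one vision task and return the parsed, task-specific output
    (e.g. a caption string, or boxes plus labels for detection)."""
    prompt = build_prompt(task, text)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw, task=TASK_PROMPTS[task], image_size=(image.width, image.height)
    )
```

Switching a pipeline from detection to captioning is then a one-line prompt change, e.g. `run_task(model, processor, img, "object_detection")` versus `run_task(model, processor, img, "caption")`, rather than a swap to a different model.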