Meta continues to push the boundaries of artificial intelligence and computer vision with the introduction of Sapiens. By combining vast amounts of data with advanced computing resources, the company is shaping human-centric AI and taking a leading position in research in this field.
Foundation AI Models
Foundation AI models are models pre-trained on large datasets, through which they accumulate a broad knowledge base. This allows them to perform a wide variety of AI tasks and to be adapted to specific domains. They serve as a foundational tool because they arrive with broad, general capabilities built in; later, depending on the need and the domain in which the AI will be applied, they can be fine-tuned with specific data for specialized tasks.
Such models include classical machine learning models, which are often used to predict continuous values; deep learning models; and generative models, which create new content.
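To make the pre-train-then-adapt idea concrete, here is a minimal PyTorch sketch of fine-tuning a generically pre-trained backbone for a specialized task. The backbone, class count, and learning rate are illustrative assumptions, not details of any particular foundation model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative example only: Sapiens itself is not distributed via torchvision.

# 1. Start from a model pre-trained on a large, generic dataset.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# 2. Freeze the general-purpose features learned during pre-training.
for param in backbone.parameters():
    param.requires_grad = False

# 3. Replace the output head with one sized for the specialized domain,
#    e.g. a hypothetical 10-class task.
num_domain_classes = 10
backbone.fc = nn.Linear(backbone.fc.in_features, num_domain_classes)

# 4. Fine-tune only the new head on the domain-specific data.
optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
```

Freezing the backbone preserves the general knowledge acquired during pre-training, while only the small task-specific head learns from the domain data.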
There are several key features that illustrate how these models work:
- Pre-training on large datasets: These models are trained on vast amounts of data covering a broad range of content, such as books, articles, images, and websites. This broad exposure enables them to perform a wide variety of tasks.
- Versatility: The same model can be applied to many different applications, such as text generation and related tasks (summarization, translation), image generation from text, and speech recognition.
- Scalability: As mentioned, these models are flexible and can be trained on increasingly larger datasets, making the technology more capable and efficient over time.
Foundation models serve as the basis for a wide range of applications. Here are a few examples of products that combine foundation models with fine-tuning to create high-quality applications:
- ChatGPT by OpenAI – Provides answers and tips, summarizes notes, and generates written content.
- DALL·E by OpenAI – Creates realistic images and artwork from natural language descriptions.
Meta’s Sapiens
Meta introduced a new family of pre-trained computer vision models called Sapiens, following the well-known principle in the field that bigger models and more data yield better systems. These models improve results in areas such as 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction.
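As an illustration of how one pre-trained vision backbone can serve several human-centric tasks, here is a hedged Python sketch of the shared-encoder-plus-task-heads pattern. Every class name, dimension, and channel count below is an assumption for illustration; this is not Meta's actual architecture or code.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Stand-in for a large pre-trained vision transformer."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Placeholder patch embedding; a real encoder would be a ViT
        # pre-trained on hundreds of millions of human images.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(images)  # (B, embed_dim, H/16, W/16) feature map

class TaskHead(nn.Module):
    """Lightweight per-task decoder."""
    def __init__(self, embed_dim: int, out_channels: int):
        super().__init__()
        self.decode = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.decode(features)

encoder = SharedEncoder()
heads = {
    "pose_keypoints": TaskHead(768, out_channels=17),  # illustrative joint count
    "body_part_seg": TaskHead(768, out_channels=28),   # illustrative class count
    "depth": TaskHead(768, out_channels=1),            # per-pixel depth
    "surface_normals": TaskHead(768, out_channels=3),  # xyz normal vector
}

images = torch.randn(2, 3, 224, 224)
features = encoder(images)
outputs = {task: head(features) for task, head in heads.items()}
```

In Sapiens itself, the heavy lifting is done by a large vision transformer pre-trained on human images, which is then fine-tuned for each task; the sketch above only conveys why one expensive pre-training run can pay off across all four tasks.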
Key features include:
- 300 million photos: The Sapiens models were pre-trained on Humans-300M, a dataset compiled by Meta that contains 300 million diverse, unlabeled images of humans. These images were used to train a family of vision transformers with parameter counts ranging from 300 million to 2 billion.
- Compute resources: The largest model, Sapiens-2B, was pre-trained in PyTorch on 1024 A100 GPUs for 18 days, which equates to approximately 442,368 GPU-hours (see the quick check below). For comparison, Meta notes that this is significantly less compute than its Llama language models required (e.g., 1.46 million GPU-hours for the 8B model and 30.84 million GPU-hours for the 405B model).
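The GPU-hours figure follows directly from the reported hardware and duration; a one-line Python check:

```python
# 1024 GPUs running around the clock for 18 days:
gpus, days, hours_per_day = 1024, 18, 24
print(gpus * days * hours_per_day)  # 442368, matching the ~442,368 GPU-hours above
```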
The success of the Sapiens models highlights the importance of scale in AI development. Meta attributes the superior performance of these models to three main factors:
- Large-scale pre-training on a curated dataset focused on understanding humans.
- The use of high-capacity vision transformers with high-resolution capabilities.
- High-quality annotations derived from both augmented studio and synthetic data.
Together, these factors emphasize the critical role of scale in driving advancements in computer vision.
Final Word
These models demonstrate that scale, data, and high-quality annotations are essential to improving artificial intelligence. By investing in foundation models, Meta and other technology giants are pushing the boundaries of the technology even further and opening up new opportunities.
Sources: Ada Lovelace Institute, Meta, IBM