Gemma 3: Introducing Google’s Latest Open AI Model

2025-03-18

Google unveils its most versatile and powerful open AI model yet, taking multimodal capabilities mainstream.


Artificial intelligence – artistic impression. Gemma 3 is Google’s most versatile and powerful open AI model yet. Image credit: Rawpixel via Freepik, free license

The Evolution of Open AI Takes a Quantum Leap

In a significant advancement for the open AI ecosystem, Google has launched Gemma 3, the newest iteration of its open-weight language model family. This release marks a fundamental shift in what developers can expect from accessible AI models, bringing multimodal capabilities, expanded multilingual support, and dramatically extended context windows to the forefront.

Gemma 3 doesn’t just improve on its predecessor—it reimagines what’s possible in the open AI space, positioning itself as a formidable competitor to both proprietary and open models across the industry.

Breaking Down Gemma 3’s Core Capabilities

Google’s latest AI offering comes in four distinct parameter sizes: 1 billion, 4 billion, 12 billion, and 27 billion. Each variant is available in both pre-trained and instruction-tuned versions, providing flexibility for different implementation needs.
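
For reference, the corresponding checkpoints on the Hugging Face Hub follow a predictable naming scheme, sketched below ("-pt" for pre-trained, "-it" for instruction-tuned; verify the exact IDs on the Hub):

# Gemma 3 checkpoint IDs on the Hugging Face Hub (verify on the Hub)
GEMMA3_MODELS = [
    f"google/gemma-3-{size}-{variant}"
    for size in ("1b", "4b", "12b", "27b")
    for variant in ("pt", "it")
]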

While the 1B model remains text-only with a 32K token context window, the larger models (4B, 12B, and 27B) break new ground with:

  • Multimodal processing: Seamlessly handles both text and images
  • Expansive context window: 128K tokens, a 16x increase from Gemma 2’s 8K
  • Multilingual prowess: Support for 140+ languages beyond English

This represents a significant leap in capabilities compared to Gemma 2, which was limited to text-only processing with an 8K context window.

Technical Innovations Driving Performance

Gemma 3’s impressive performance stems from three key architectural enhancements that deserve closer examination.

Context Length: Efficient Scaling to 128K

Extending context length to 128K tokens posed significant computational challenges. Rather than training models from scratch with longer sequences, Google employed a strategic approach:

  • Models were pretrained with 32K token sequences
  • Only the 4B, 12B, and 27B variants were scaled to 128K tokens at pretraining’s end
  • RoPE positional embeddings were upgraded from a 10K base frequency in Gemma 2 to 1M in Gemma 3
  • An additional scaling factor of 8 was applied for longer contexts (illustrated in the sketch below)
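
To make these numbers concrete, here is a minimal sketch of how a larger RoPE base and a linear scaling factor change the rotary frequencies. It assumes standard rotary embeddings with simple linear position scaling, and an illustrative head size; it is not Google's exact implementation:

# Illustrative RoPE frequency computation (assumes standard rotary
# embeddings with linear position scaling; not Google's exact code)
import torch

def rope_inverse_frequencies(head_dim: int, base: float, scale: float = 1.0):
    # One inverse frequency per pair of dimensions; a larger base slows
    # the rotations so distant positions remain distinguishable
    exponents = torch.arange(0, head_dim, 2).float() / head_dim
    inv_freq = 1.0 / (base ** exponents)
    # Dividing by the scale factor stretches the usable position range
    return inv_freq / scale

gemma2_like = rope_inverse_frequencies(head_dim=128, base=10_000.0)
gemma3_like = rope_inverse_frequencies(head_dim=128, base=1_000_000.0, scale=8.0)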

To optimize memory usage, Gemma 3 refined the interleaved sliding-window attention mechanism from Gemma 2, adjusting the ratio of local to global layers from 1:1 to 5:1 and reducing the window size from 4096 to 1024 tokens. These optimizations maintain performance while enabling the massive context window expansion; a toy version of the layer pattern is sketched below.
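
The sketch assumes a simple repeating pattern (the actual layer schedule in Gemma 3 may differ; this only illustrates the 5:1 ratio):

# Toy 5:1 local-to-global layer pattern (actual schedule may differ)
def attention_pattern(num_layers: int, local_per_global: int = 5):
    pattern = []
    for i in range(num_layers):
        if (i + 1) % (local_per_global + 1) == 0:
            pattern.append("global")      # full-context attention layer
        else:
            pattern.append("local-1024")  # 1024-token sliding-window layer
    return pattern

print(attention_pattern(12))
# ['local-1024', 'local-1024', 'local-1024', 'local-1024', 'local-1024',
#  'global', 'local-1024', 'local-1024', 'local-1024', 'local-1024',
#  'local-1024', 'global']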

Multimodal Architecture: Vision Meets Language

Gemma 3 employs the SigLIP vision encoder to transform images into tokens that integrate with the language model. Key implementation details include:

  • Images are resized to 896×896 square format
  • An adaptive “pan and scan” algorithm enables high-resolution image processing by creating multiple 896×896 crops
  • Different attention mechanisms are employed for text versus images:
    • Text uses one-way (causal) attention
    • Images receive full bidirectional attention

This approach allows the model to understand visual content with remarkable clarity while maintaining efficient text processing.
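
As a rough illustration of that mixed scheme, the mask below lets text tokens attend causally while tokens of a single image attend to each other bidirectionally (a simplified one-image sketch, not the production masking code):

# Simplified mixed attention mask: causal for text, fully bidirectional
# among image tokens (True = query row may attend to key column)
import torch

def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    n = is_image.shape[0]
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    image_pairs = is_image[:, None] & is_image[None, :]
    return causal | image_pairs

# Example: 2 text tokens, a 3-token image, then 2 more text tokens
flags = torch.tensor([0, 0, 1, 1, 1, 0, 0], dtype=torch.bool)
print(mixed_attention_mask(flags).int())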

# Sample code for multimodal inference with Gemma 3
import torch
from transformers import pipeline

# Load the 4B instruction-tuned checkpoint as an image-text-to-text pipeline
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    device="cuda",
    torch_dtype=torch.bfloat16
)

# Chat-style input mixing an image with a text question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "path/to/your/image.jpg"},
            {"type": "text", "text": "What can you tell me about this image?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
# The pipeline returns the full chat history; the last turn is the reply
print(output[0]["generated_text"][-1]["content"])

Multilingual Enhancement: Global AI Access

To achieve meaningful multilingual support, Google doubled the amount of non-English training data compared to previous versions. The tokenizer was also completely revamped, now featuring:

  • SentencePiece tokenizer with 262K entries (same as Gemini 2.0)
  • Significantly improved encoding for Chinese, Japanese, and Korean text
  • Slight increase in token counts for English and code to accommodate broader language support

This balance ensures Gemma 3 performs well across diverse linguistic contexts while maintaining strong performance in English.
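
A quick way to see this in practice is to load the tokenizer and compare token counts across scripts (a small sketch; exact counts depend on the tokenizer version, and the checkpoint requires accepting Google's license on the Hub):

# Inspecting the Gemma 3 tokenizer; counts shown are illustrative
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
print(f"Vocabulary size: {len(tokenizer)}")  # roughly 262K entries

for text in ["Hello, world!", "こんにちは、世界", "안녕하세요"]:
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    print(f"{text!r} -> {n_tokens} tokens")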

Performance Benchmarks: How Does It Stack Up?

Gemma 3’s 27B instruction-tuned model has earned an impressive 1339 LMSys Elo score, placing it among the top 10 models on the Chatbot Arena leaderboard. This puts it on par with o1-preview and above many other non-thinking open models, even when limited to text-only inputs.

Across specialized benchmarks, Gemma 3 demonstrates competitive performance:

  • MMLU-Pro: 67.5 (27B)
  • LiveCodeBench: 29.7 (27B)
  • Bird-SQL: 54.4 (27B)
  • GPQA Diamond: 42.4 (27B)
  • MATH: 69.0 (27B)
  • FACTS Grounding: 74.9 (27B)
  • MMMU: 64.9 (27B)

While it shows impressive strength in reasoning, mathematics, factual accuracy, and multimodal tasks, Gemma 3 does show some weakness in the SimpleQA benchmark (10.0 for the 27B model). However, when compared to Gemini 1.5 models, Gemma 3 often comes close and occasionally outperforms them, demonstrating its value as a high-performing yet accessible model.

Deploying Gemma 3: From Server to Edge

Google has prioritized deployment flexibility, making Gemma 3 available across diverse computing environments.

Server Deployment with Transformers

Gemma 3 launches with day-zero support in Hugging Face transformers, requiring just a simple installation:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

The implementation includes two specialized model classes:

  • Gemma3ForConditionalGeneration: For multimodal vision-language models (4B, 12B, 27B)
  • Gemma3ForCausalLM: For text-only usage (1B model or using larger models without vision)

For text-only inference, the implementation is straightforward:

import torch
from transformers import AutoTokenizer, Gemma3ForCausalLM

# Load the instruction-tuned 4B model for text-only generation
model = Gemma3ForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

# A batch containing a single conversation
messages = [
    [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are a helpful assistant."}]
        },
        {
            "role": "user",
            "content": [{"type": "text", "text": "Explain quantum computing in simple terms."}]
        },
    ],
]

# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

generation = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
decoded = tokenizer.decode(generation[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(decoded)

Edge Computing and Low-Resource Deployment

Gemma 3’s range of model sizes makes it particularly well-suited for edge deployment:

Apple Silicon (MLX)

Gemma 3 launches with immediate support in mlx-vlm for Apple devices:

pip install git+https://github.com/Blaizzy/mlx-vlm.git

python -m mlx_vlm.generate --model mlx-community/gemma-3-4b-it-4bit --max-tokens 100 --temp 0.0 --prompt "What does this image show?" --image path/to/image.jpg

Llama.cpp Integration

Pre-quantized GGUF files enable deployment on CPU-only systems:

./build/bin/llama-cli -m ./gemma-3-4b-it-Q4_K_M.gguf
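
The same GGUF file can also back llama.cpp's HTTP server for serving rather than interactive use (a typical invocation assuming a default build; adjust the path and port as needed):

./build/bin/llama-server -m ./gemma-3-4b-it-Q4_K_M.gguf --port 8080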

Hugging Face Endpoints

One-click deployment options exist for Gemma-3-27b-it and Gemma-3-12b-it through the Inference Catalog, with optimized TGI configurations for production use.

The Future Landscape: What Gemma 3 Means for AI Development

Gemma 3 represents a significant milestone in democratizing advanced AI capabilities. By bringing multimodal processing, extensive multilingual support, and massive context windows to the open model space, Google has raised the bar for what developers can expect from accessible AI models.

The competitive performance against closed models like Gemini 1.5 Pro suggests we’re entering an era where open models can rival their proprietary counterparts in many real-world applications. This accessibility empowers a broader range of businesses, researchers, and developers to build sophisticated AI solutions without the constraints of closed systems.

As the AI landscape continues evolving, Gemma 3 stands as a testament to Google’s commitment to advancing open AI technology while addressing growing demands for more capable, ethical, and accessible models. The combination of cutting-edge capabilities with flexible deployment options positions Gemma 3 as a compelling option for organizations of all sizes looking to leverage AI’s transformative potential.


Sources: HuggingFace, 24 News HD

Written by Alius Noreika
