From Prototype to Production: Implementing Multimodal AI in SaaS Products
Multimodal AI models can process text, images, audio, and video simultaneously. They promise to revolutionize content generation and understanding. But moving from a demo to a production-ready SaaS feature is a completely different challenge.
At Easygenerator, I led the development of an automated content generation pipeline using Google’s Vertex AI multimodal models. Here’s the journey from research prototype to production feature.
What Is Multimodal AI?
Traditional AI models are single-mode:
- Text models: GPT, Claude, LLaMA (text in → text out)
- Image models: DALL-E, Stable Diffusion (text → image)
- Audio models: Whisper (audio → text)
Multimodal models like Gemini, GPT-4V, and Claude 3 Opus can process multiple input types simultaneously:
Input: Image of a chart + Text prompt
Output: Detailed analysis + Generated presentation slide
This unlocks entirely new capabilities for content creation.
The Use Case: Automated Course Content Generation
Our SaaS product helps organizations create e-learning content. The manual process was:
1. Subject matter expert provides source materials (PDFs, slides, videos)
2. Instructional designer extracts key concepts
3. Designer creates structured lessons with visuals
4. Content goes through review cycles
Goal: Automate steps 2-3 using multimodal AI while maintaining quality standards.
Architecture: Google Vertex AI + GCP
We chose Google Cloud Platform’s Vertex AI for several reasons:
Why Vertex AI?
- Native multimodal support: Gemini models handle text, images, PDFs, and video
- Enterprise features: VPC integration, audit logs, data residency controls
- Scalability: Auto-scaling inference with pay-per-token pricing
- Safety filters: Built-in content moderation for educational content
System Architecture
[User Upload] → [Cloud Storage]
                      ↓
              [Cloud Function]
                      ↓
        ┌─────────────┴─────────────┐
        ↓                           ↓
[Document Parser]           [Media Processor]
        ↓                           ↓
        └─────────────┬─────────────┘
                      ↓
             [Vertex AI Gemini]
                      ↓
        ┌─────────────┴─────────────┐
        ↓                           ↓
[Content Structurer]         [Asset Generator]
        ↓                           ↓
        └─────────────┬─────────────┘
                      ↓
             [Quality Checker]
                      ↓
          [PostgreSQL + Storage]
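Each box maps to a small, independently deployable step. As a rough illustration (not our exact code), the orchestrating Cloud Function looks something like this; parseDocuments, processMedia, generateLesson, validateOutput, and persistLesson are hypothetical names for the stages in the diagram:
async function handleUpload(event: { bucket: string; name: string }) {
  // Fan out: extract text from documents, transcode/describe media
  const [parsedDocs, media] = await Promise.all([
    parseDocuments(event.bucket, event.name),
    processMedia(event.bucket, event.name)
  ]);
  // Single multimodal Gemini call with all inputs attached
  const draft = await generateLesson(parsedDocs, media);
  // Quality gate before anything reaches the user
  const validated = await validateOutput(draft);
  await persistLesson(validated); // PostgreSQL + Storage
}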
Implementation Phases
Phase 1: Research & Prototyping (2 weeks)
Goals: Validate technical feasibility and quality
Activities:
- Tested Gemini 1.5 Pro on sample course materials
- Benchmarked output quality against human-created content
- Experimented with prompt engineering strategies
- Measured costs per generation
Key Learning: Direct PDF + prompt works surprisingly well for structured educational content, achieving ~75% quality vs. human baseline.
Phase 2: Pipeline Development (4 weeks)
Goals: Build robust, scalable infrastructure
Challenges:
1. Document Preprocessing
Raw documents needed cleanup:
- Extract text from scanned PDFs (OCR)
- Preserve formatting and structure
- Handle multi-language content
- Size limits (Gemini: 10MB per request)
Solution: Cloud Functions pipeline:
import { Storage } from '@google-cloud/storage';

async function preprocessDocument(bucketName: string, fileUri: string) {
  const storage = new Storage();
  const file = storage.bucket(bucketName).file(fileUri);
  // Extract and chunk content; chunkDocument is our internal splitter
  // that OCRs scanned pages and keeps headings and lists intact
  const chunks = await chunkDocument(file, {
    maxTokens: 30000,
    preserveStructure: true
  });
  return chunks;
}
2. Prompt Engineering at Scale
Different content types need different prompts:
- Technical documentation → Step-by-step procedures
- Marketing materials → Engaging narratives
- Research papers → Evidence-based summaries
Solution: Template system with dynamic prompt construction:
const promptTemplate = {
  technical: `Analyze this technical document and create a structured lesson...`,
  marketing: `Transform this content into an engaging learning experience...`,
  research: `Extract key findings and create evidence-based training...`
};

// buildPrompt merges the template for the detected content type
// with document context and the user's preferences
const prompt = buildPrompt(contentType, context, userPreferences);
3. Handling Multimodal Inputs
Combining text, images, and videos in single requests:
const request = {
  contents: [{
    role: 'user',
    parts: [
      { text: prompt },
      { fileData: { mimeType: 'application/pdf', fileUri: pdfUri } },
      { fileData: { mimeType: 'video/mp4', fileUri: videoUri } },
      { inlineData: { mimeType: 'image/jpeg', data: base64Image } }
    ]
  }]
};
const response = await model.generateContent(request);
4. Quality Control
AI output needs validation before user sees it:
- Content accuracy checks
- Inappropriate content filtering
- Format validation
- Hallucination detection
Solution: Multi-stage validation pipeline:
async function validateOutput(content: GeneratedContent) {
  // Stage 1: Safety filters (moderateContent is our wrapper
  // around Vertex AI's safety ratings)
  const safetyCheck = await moderateContent(content);
  if (!safetyCheck.passed) throw new SafetyError();
  // Stage 2: Structural validation (required fields, lesson format)
  validateStructure(content);
  // Stage 3: Fact-checking against the source documents
  await verifyFactualClaims(content, sourceDocuments);
  return content;
}
Phase 3: Cost Optimization (2 weeks)
Initial costs: $2.50 per course generation (not viable at scale)
Optimization strategies:
1. Smart Chunking
- Only send relevant sections to AI
- Cache common transformations
- Savings: 40% reduction in tokens
2. Model Selection
- Use Gemini Flash for simple content
- Reserve Pro for complex multimodal tasks (routing sketch below)
- Savings: 60% cost reduction on 70% of requests
3. Batch Processing
- Group similar requests
- Amortize API overhead
- Savings: 15% efficiency gain
Final cost: $0.80 per course generation (3x improvement)
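To make the model-selection strategy concrete, here is a rough router sketch; the heuristic and names are illustrative, not the production logic:
// Illustrative heuristic: requests carrying media parts go to Pro,
// text-only requests go to the much cheaper Flash model
function pickModel(parts: Array<{ text?: string; fileData?: unknown }>) {
  const hasMedia = parts.some((p) => p.fileData !== undefined);
  return vertexAI.getGenerativeModel({
    model: hasMedia ? 'gemini-1.5-pro' : 'gemini-1.5-flash'
  });
}

const response = await pickModel(request.contents[0].parts).generateContent(request);
In production the heuristic can be richer (document length, requested output format), but even this binary split captures most of the savings because the majority of requests are text-only.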
Phase 4: Production Rollout (3 weeks)
Strategy: Gradual rollout with human oversight
- Week 1: Internal beta with content team (shadow mode; sketched below)
- Week 2: Invite 50 power users with “AI co-pilot” framing
- Week 3: General availability with clear AI labeling
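Concretely, shadow mode meant running AI generation alongside the existing manual flow and logging the result without ever showing it. A sketch, with startManualWorkflow, generateLesson, and logShadowResult as hypothetical helpers:
async function createLesson(input: LessonInput, user: User) {
  // The existing human workflow is untouched and is what the user sees
  const manualDraft = await startManualWorkflow(input, user);
  // Fire-and-forget AI generation; log for offline quality comparison
  generateLesson(input)
    .then((aiDraft) => logShadowResult(user.id, manualDraft.id, aiDraft))
    .catch((err) => logShadowError(user.id, err)); // AI failure never blocks the user
  return manualDraft;
}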
Monitoring:
// Track quality metrics per generation
metrics.track('ai_generation', {
  duration_ms: generationTime,
  token_count: tokensUsed,
  cost_usd: estimatedCost,
  user_edited: userMadeEdits,
  user_rating: userFeedback
});
Results
After 3 months in production:
- Adoption: 35% of new courses use AI generation
- Time savings: 60% reduction in content creation time
- Quality: 4.2/5 user rating (human baseline: 4.5/5)
- Cost: Sustainable at $0.80 per generation
- Unedited rate: 45% of AI content used without modification
Lessons Learned
1. Start with Narrow Use Cases
Don’t try to solve everything with AI. We succeeded by:
- Focusing on structured educational content (not creative writing)
- Targeting document-heavy workflows
- Setting clear quality expectations
2. Human-in-the-Loop Is Critical
Even great AI needs human oversight:
- AI as co-pilot: Generate drafts, humans refine
- Confidence scoring: Flag low-confidence outputs for review (sketched below)
- Feedback loops: User edits improve future prompts
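One way to realize the confidence-scoring bullet; note this is a sketch, and the boolean fields on GeneratedContent are our own assumptions, not Vertex AI outputs:
// Illustrative only: Gemini does not return one overall confidence
// number, so we derive a score from our own validation signals
function routeForReview(content: GeneratedContent): 'publish' | 'review' {
  const checks = [
    content.citationsResolved, // every claim traced to a source chunk
    content.structureValid,    // output matches the lesson schema
    content.safetyPassed       // passed the safety filters
  ];
  // Auto-publish only when every signal passes; otherwise human review
  return checks.every(Boolean) ? 'publish' : 'review';
}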
3. Cost Management from Day One
AI costs scale with usage. Plan for:
- Token budgets per user/organization (see the sketch after this list)
- Caching strategies
- Model selection based on task complexity
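For token budgets, the check can sit in front of every generation call. This is a sketch with assumed lookup helpers, not our actual schema:
async function enforceTokenBudget(orgId: string, estimatedTokens: number) {
  // Assumed helpers: monthly usage and plan limit looked up per org
  const used = await getMonthlyTokenUsage(orgId);
  const limit = await getPlanTokenLimit(orgId);
  if (used + estimatedTokens > limit) {
    throw new Error(`Token budget exceeded for org ${orgId}: ${used}/${limit}`);
  }
}

// Before each call: estimate, enforce, then record actual usage afterwards
await enforceTokenBudget(org.id, estimateTokens(request));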
4. Prompt Engineering Is a Product Feature
Your prompts ARE your product differentiation:
- Version control your prompts (sketched below)
- A/B test prompt variations
- Collect user feedback on outputs
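In practice, prompt versioning can be as simple as a registry keyed by content type and version, so experiments and rollbacks are explicit. A sketch under those assumptions (hashUserId is a hypothetical stable hash):
// Hypothetical prompt registry: every change ships as a new version,
// and A/B tests pin users to a version for clean comparisons
const promptRegistry: Record<string, string> = {
  'technical/v3': `Analyze this technical document and create a structured lesson...`,
  'technical/v4': `Analyze this technical document. First list prerequisites, then...`
};

function resolvePrompt(contentType: string, userId: string): string {
  // Stable bucketing: half of users see v4 during the experiment
  const variant = hashUserId(userId) % 2 === 0 ? 'v3' : 'v4';
  return promptRegistry[`${contentType}/${variant}`];
}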
5. Multimodal != Magic
Combining modalities adds complexity:
- File format compatibility issues
- Increased latency (larger requests)
- Higher error rates
Only go multimodal when single-mode isn’t sufficient.
Technical Deep Dive: Vertex AI Integration
Setting Up Vertex AI
import { VertexAI } from '@google-cloud/vertexai';

const vertexAI = new VertexAI({
  project: 'your-project-id',
  location: 'us-central1'
});

const model = vertexAI.getGenerativeModel({
  model: 'gemini-1.5-pro',
  generationConfig: {
    maxOutputTokens: 8192,
    temperature: 0.4,
    topP: 0.95
  },
  safetySettings: [
    {
      category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
      threshold: 'BLOCK_MEDIUM_AND_ABOVE'
    }
  ]
});
Handling Large Files
async function processLargeDocument(fileUri: string) {
  // Upload to Cloud Storage if not already there
  const gcsUri = await uploadToGCS(fileUri);
  // Reference by URI: file-based input supports far larger
  // documents than inlining bytes into the request
  const request = {
    contents: [{
      role: 'user',
      parts: [
        { text: yourPrompt },
        {
          fileData: {
            mimeType: 'application/pdf',
            fileUri: gcsUri
          }
        }
      ]
    }]
  };
  const result = await model.generateContent(request);
  return result.response.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
}
Error Handling
// Minimal sleep helper for backoff
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function robustGeneration(request: GenerateRequest) {
  const maxRetries = 3;
  let attempt = 0;
  while (attempt < maxRetries) {
    try {
      return await model.generateContent(request);
    } catch (error: any) {
      if (error.code === 'RESOURCE_EXHAUSTED') {
        // Rate limited: exponential backoff (1s, 2s, 4s)
        await sleep(2 ** attempt * 1000);
        attempt++;
      } else if (error.code === 'INVALID_ARGUMENT') {
        // Bad request: retrying won't help
        throw error;
      } else {
        // Treat everything else as transient and retry
        attempt++;
      }
    }
  }
  throw new Error('Max retries exceeded');
}
What’s Next?
Future enhancements:
- Fine-tuning: Custom models for our domain
- Multi-agent workflows: Specialized agents for different content types
- Real-time collaboration: AI suggestions during editing
- Personalization: Learn from user preferences over time
Conclusion
Multimodal AI is production-ready for SaaS products, but requires careful engineering:
- Choose the right cloud provider and models
- Build robust pipelines with quality controls
- Manage costs from day one
- Keep humans in the loop
The technology is powerful, but success comes from thoughtful product integration, not just API calls.
Building multimodal AI into your SaaS product? I’d love to discuss architecture approaches and lessons learned. Connect with me on LinkedIn.