From Prototype to Production: Implementing Multimodal AI in SaaS Products
Multimodal AI models can process text, images, audio, and video simultaneously. They promise to revolutionize content generation and understanding. But moving from a demo to a production-ready SaaS feature is a completely different challenge.
At Easygenerator, I led the development of an automated content generation pipeline using Google’s Vertex AI multimodal models. Here’s the journey from research prototype to production feature.
What Is Multimodal AI?
Traditional AI models are single-mode:
- Text models: GPT, Claude, LLaMA (text in → text out)
- Image models: DALL-E, Stable Diffusion (text → image)
- Audio models: Whisper (audio → text)
Multimodal models like Gemini, GPT-4V, and Claude 3 Opus can process multiple input types simultaneously:
Input: Image of a chart + Text prompt
Output: Detailed analysis + Generated presentation slide
This unlocks entirely new capabilities for content creation.
The Use Case: Automated Course Content Generation
Our SaaS product helps organizations create e-learning content. The manual process was:
1. Subject matter expert provides source materials (PDFs, slides, videos)
2. Instructional designer extracts key concepts
3. Designer creates structured lessons with visuals
4. Content goes through review cycles
Goal: Automate steps 2-3 using multimodal AI while maintaining quality standards.
Architecture: Google Vertex AI + GCP
We chose Google Cloud Platform’s Vertex AI for several reasons:
Why Vertex AI?
- Native multimodal support: Gemini models handle text, images, PDFs, and video
- Enterprise features: VPC integration, audit logs, data residency controls
- Scalability: Auto-scaling inference with pay-per-token pricing
- Safety filters: Built-in content moderation for educational content
System Architecture
[User Upload] → [Cloud Storage]
                      ↓
              [Cloud Function]
                      ↓
        ┌─────────────┴─────────────┐
        ↓                           ↓
[Document Parser]           [Media Processor]
        ↓                           ↓
        └─────────────┬─────────────┘
                      ↓
             [Vertex AI Gemini]
                      ↓
        ┌─────────────┴─────────────┐
        ↓                           ↓
[Content Structurer]         [Asset Generator]
        ↓                           ↓
        └─────────────┬─────────────┘
                      ↓
             [Quality Checker]
                      ↓
          [PostgreSQL + Storage]
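Each box maps to a small, independently deployable step. As a rough illustration (not our exact code), the orchestrating Cloud Function looks something like this; parseDocuments, processMedia, generateLesson, validateOutput, and persistLesson are hypothetical names for the stages in the diagram:
async function handleUpload(event: { bucket: string; name: string }) {
  // Fan out: extract text from documents, transcode/describe media
  const [parsedDocs, media] = await Promise.all([
    parseDocuments(event.bucket, event.name),
    processMedia(event.bucket, event.name)
  ]);
  // Single multimodal Gemini call with all inputs attached
  const draft = await generateLesson(parsedDocs, media);
  // Quality gate before anything reaches the user
  const validated = await validateOutput(draft);
  await persistLesson(validated); // PostgreSQL + Storage
}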
Implementation Phases
Phase 1: Research & Prototyping (2 weeks)
Goals: Validate technical feasibility and quality
Activities:
- Tested Gemini 1.5 Pro on sample course materials
- Benchmarked output quality against human-created content
- Experimented with prompt engineering strategies
- Measured costs per generation
Key Learning: Direct PDF + prompt works surprisingly well for structured educational content, achieving ~75% quality vs. human baseline.
Phase 2: Pipeline Development (4 weeks)
Goals: Build robust, scalable infrastructure
Challenges:
1. Document Preprocessing
Raw documents needed cleanup:
- Extract text from scanned PDFs (OCR)
- Preserve formatting and structure
- Handle multi-language content
- Size limits (Gemini: 10MB per request)
Solution: Cloud Functions pipeline:
import { Storage } from '@google-cloud/storage';

async function preprocessDocument(bucketName: string, fileUri: string) {
  const storage = new Storage();
  const file = storage.bucket(bucketName).file(fileUri);
  // Extract and chunk content; chunkDocument is our internal splitter
  // that OCRs scanned pages and keeps headings and lists intact
  const chunks = await chunkDocument(file, {
    maxTokens: 30000,
    preserveStructure: true
  });
  return chunks;
}
2. Prompt Engineering at Scale
Different content types need different prompts:
- Technical documentation → Step-by-step procedures
- Marketing materials → Engaging narratives
- Research papers → Evidence-based summaries
Solution: Template system with dynamic prompt construction:
const promptTemplate = {
  technical: `Analyze this technical document and create a structured lesson...`,
  marketing: `Transform this content into an engaging learning experience...`,
  research: `Extract key findings and create evidence-based training...`
};

// buildPrompt merges the template for the detected content type
// with document context and the user's preferences
const prompt = buildPrompt(contentType, context, userPreferences);
3. Handling Multimodal Inputs
Combining text, images, and videos in single requests:
const request = {
  contents: [{
    role: 'user',
    parts: [
      { text: prompt },
      { fileData: { mimeType: 'application/pdf', fileUri: pdfUri } },
      { fileData: { mimeType: 'video/mp4', fileUri: videoUri } },
      { inlineData: { mimeType: 'image/jpeg', data: base64Image } }
    ]
  }]
};
const response = await model.generateContent(request);
4. Quality Control
AI output needs validation before user sees it:
- Content accuracy checks
- Inappropriate content filtering
- Format validation
- Hallucination detection
Solution: Multi-stage validation pipeline:
async function validateOutput(content: GeneratedContent) {
  // Stage 1: Safety filters (moderateContent is our wrapper
  // around Vertex AI's safety ratings)
  const safetyCheck = await moderateContent(content);
  if (!safetyCheck.passed) throw new SafetyError();
  // Stage 2: Structural validation (required fields, lesson format)
  validateStructure(content);
  // Stage 3: Fact-checking against the source documents
  await verifyFactualClaims(content, sourceDocuments);
  return content;
}
Phase 3: Cost Optimization (2 weeks)
Initial costs: $2.50 per course generation (not viable at scale)
Optimization strategies:
1. Smart Chunking
- Only send relevant sections to AI
- Cache common transformations
- Savings: 40% reduction in tokens
2. Model Selection
- Use Gemini Flash for simple content
- Reserve Pro for complex multimodal tasks (routing sketch below)
- Savings: 60% cost reduction on 70% of requests
3. Batch Processing
- Group similar requests
- Amortize API overhead
- Savings: 15% efficiency gain
Final cost: $0.80 per course generation (3x improvement)
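To make the model-selection strategy concrete, here is a rough router sketch; the heuristic and names are illustrative, not the production logic:
// Illustrative heuristic: requests carrying media parts go to Pro,
// text-only requests go to the much cheaper Flash model
function pickModel(parts: Array<{ text?: string; fileData?: unknown }>) {
  const hasMedia = parts.some((p) => p.fileData !== undefined);
  return vertexAI.getGenerativeModel({
    model: hasMedia ? 'gemini-1.5-pro' : 'gemini-1.5-flash'
  });
}

const response = await pickModel(request.contents[0].parts).generateContent(request);
In production the heuristic can be richer (document length, requested output format), but even this binary split captures most of the savings because the majority of requests are text-only.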
Phase 4: Production Rollout (3 weeks)
Strategy: Gradual rollout with human oversight
- Week 1: Internal beta with content team (shadow mode; sketched below)
- Week 2: Invite 50 power users with “AI co-pilot” framing
- Week 3: General availability with clear AI labeling
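Concretely, shadow mode meant running AI generation alongside the existing manual flow and logging the result without ever showing it. A sketch, with startManualWorkflow, generateLesson, and logShadowResult as hypothetical helpers:
async function createLesson(input: LessonInput, user: User) {
  // The existing human workflow is untouched and is what the user sees
  const manualDraft = await startManualWorkflow(input, user);
  // Fire-and-forget AI generation; log for offline quality comparison
  generateLesson(input)
    .then((aiDraft) => logShadowResult(user.id, manualDraft.id, aiDraft))
    .catch((err) => logShadowError(user.id, err)); // AI failure never blocks the user
  return manualDraft;
}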
Monitoring:
// Track quality metrics per generation
metrics.track('ai_generation', {
  duration_ms: generationTime,
  token_count: tokensUsed,
  cost_usd: estimatedCost,
  user_edited: userMadeEdits,
  user_rating: userFeedback
});
Results
After 3 months in production:
- Adoption: 35% of new courses use AI generation
- Time savings: 60% reduction in content creation time
- Quality: 4.2/5 user rating (human baseline: 4.5/5)
- Cost: Sustainable at $0.80 per generation
- Unedited rate: 45% of AI content used without modification
Lessons Learned
1. Start with Narrow Use Cases
Don’t try to solve everything with AI. We succeeded by:
- Focusing on structured educational content (not creative writing)
- Targeting document-heavy workflows
- Setting clear quality expectations
2. Human-in-the-Loop Is Critical
Even great AI needs human oversight:
- AI as co-pilot: Generate drafts, humans refine
- Confidence scoring: Flag low-confidence outputs for review (sketched below)
- Feedback loops: User edits improve future prompts
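One way to realize the confidence-scoring bullet; note this is a sketch, and the boolean fields on GeneratedContent are our own assumptions, not Vertex AI outputs:
// Illustrative only: Gemini does not return one overall confidence
// number, so we derive a score from our own validation signals
function routeForReview(content: GeneratedContent): 'publish' | 'review' {
  const checks = [
    content.citationsResolved, // every claim traced to a source chunk
    content.structureValid,    // output matches the lesson schema
    content.safetyPassed       // passed the safety filters
  ];
  // Auto-publish only when every signal passes; otherwise human review
  return checks.every(Boolean) ? 'publish' : 'review';
}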
3. Cost Management from Day One
AI costs scale with usage. Plan for:
- Token budgets per user/organization (see the sketch after this list)
- Caching strategies
- Model selection based on task complexity
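For token budgets, the check can sit in front of every generation call. This is a sketch with assumed lookup helpers, not our actual schema:
async function enforceTokenBudget(orgId: string, estimatedTokens: number) {
  // Assumed helpers: monthly usage and plan limit looked up per org
  const used = await getMonthlyTokenUsage(orgId);
  const limit = await getPlanTokenLimit(orgId);
  if (used + estimatedTokens > limit) {
    throw new Error(`Token budget exceeded for org ${orgId}: ${used}/${limit}`);
  }
}

// Before each call: estimate, enforce, then record actual usage afterwards
await enforceTokenBudget(org.id, estimateTokens(request));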
4. Prompt Engineering Is a Product Feature
Your prompts ARE your product differentiation:
- Version control your prompts (sketched below)
- A/B test prompt variations
- Collect user feedback on outputs
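In practice, prompt versioning can be as simple as a registry keyed by content type and version, so experiments and rollbacks are explicit. A sketch under those assumptions (hashUserId is a hypothetical stable hash):
// Hypothetical prompt registry: every change ships as a new version,
// and A/B tests pin users to a version for clean comparisons
const promptRegistry: Record<string, string> = {
  'technical/v3': `Analyze this technical document and create a structured lesson...`,
  'technical/v4': `Analyze this technical document. First list prerequisites, then...`
};

function resolvePrompt(contentType: string, userId: string): string {
  // Stable bucketing: half of users see v4 during the experiment
  const variant = hashUserId(userId) % 2 === 0 ? 'v3' : 'v4';
  return promptRegistry[`${contentType}/${variant}`];
}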
5. Multimodal != Magic
Combining modalities adds complexity:
- File format compatibility issues
- Increased latency (larger requests)
- Higher error rates
Only go multimodal when single-mode isn’t sufficient.
Technical Deep Dive: Vertex AI Integration
Setting Up Vertex AI
import { VertexAI } from '@google-cloud/vertexai';

const vertexAI = new VertexAI({
  project: 'your-project-id',
  location: 'us-central1'
});

const model = vertexAI.getGenerativeModel({
  model: 'gemini-1.5-pro',
  generationConfig: {
    maxOutputTokens: 8192,
    temperature: 0.4,
    topP: 0.95
  },
  safetySettings: [
    {
      category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
      threshold: 'BLOCK_MEDIUM_AND_ABOVE'
    }
  ]
});
Handling Large Files
async function processLargeDocument(fileUri: string) {
  // Upload to Cloud Storage if not already there
  const gcsUri = await uploadToGCS(fileUri);
  // Reference by URI: file-based input supports far larger
  // documents than inlining bytes into the request
  const request = {
    contents: [{
      role: 'user',
      parts: [
        { text: yourPrompt },
        {
          fileData: {
            mimeType: 'application/pdf',
            fileUri: gcsUri
          }
        }
      ]
    }]
  };
  const result = await model.generateContent(request);
  return result.response.candidates?.[0]?.content?.parts?.[0]?.text ?? '';
}
Error Handling
// Minimal sleep helper for backoff
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function robustGeneration(request: GenerateRequest) {
  const maxRetries = 3;
  let attempt = 0;
  while (attempt < maxRetries) {
    try {
      return await model.generateContent(request);
    } catch (error: any) {
      if (error.code === 'RESOURCE_EXHAUSTED') {
        // Rate limited: exponential backoff (1s, 2s, 4s)
        await sleep(2 ** attempt * 1000);
        attempt++;
      } else if (error.code === 'INVALID_ARGUMENT') {
        // Bad request: retrying won't help
        throw error;
      } else {
        // Treat everything else as transient and retry
        attempt++;
      }
    }
  }
  throw new Error('Max retries exceeded');
}
What’s Next?
Future enhancements:
- Fine-tuning: Custom models for our domain
- Multi-agent workflows: Specialized agents for different content types
- Real-time collaboration: AI suggestions during editing
- Personalization: Learn from user preferences over time
Conclusion
Multimodal AI is production-ready for SaaS products, but requires careful engineering:
- Choose the right cloud provider and models
- Build robust pipelines with quality controls
- Manage costs from day one
- Keep humans in the loop
The technology is powerful, but success comes from thoughtful product integration, not just API calls.
Building multimodal AI into your SaaS product? I’d love to discuss architecture approaches and lessons learned. Connect with me on LinkedIn.