aryem.dev

From Prototype to Production: Implementing Multimodal AI in SaaS Products

Multimodal AI models can process text, images, audio, and video simultaneously. They promise to revolutionize content generation and understanding. But moving from a demo to a production-ready SaaS feature is a completely different challenge.

At Easygenerator, I led the development of an automated content generation pipeline using Google’s Vertex AI multimodal models. Here’s the journey from research prototype to production feature.

What Is Multimodal AI?

Traditional AI models are single-mode: each one handles a single input type, such as text only or images only.

Multimodal models like Gemini, GPT-4V, and Claude Opus can process multiple input types simultaneously:

Input: Image of a chart + Text prompt
Output: Detailed analysis + Generated presentation slide

This unlocks entirely new capabilities for content creation.

The Use Case: Automated Course Content Generation

Our SaaS product helps organizations create e-learning content. The manual process was:

  1. Subject matter expert provides source materials (PDFs, slides, videos)
  2. Instructional designer extracts key concepts
  3. Designer creates structured lessons with visuals
  4. Content goes through review cycles

Goal: Automate steps 2-3 using multimodal AI while maintaining quality standards.

Architecture: Google Vertex AI + GCP

We chose Google Cloud Platform’s Vertex AI for several reasons:

Why Vertex AI?

  1. Native multimodal support: Gemini models handle text, images, PDFs, and video
  2. Enterprise features: VPC integration, audit logs, data residency controls
  3. Scalability: Auto-scaling inference with pay-per-token pricing
  4. Safety filters: Built-in content moderation for educational content

System Architecture

[User Upload] → [Cloud Storage]
                      ↓
                [Cloud Function]
                      ↓
           ┌──────────┴──────────┐
           ↓                     ↓
    [Document Parser]      [Media Processor]
           ↓                     ↓
           └──────────┬──────────┘
                      ↓
              [Vertex AI Gemini]
                      ↓
           ┌──────────┴──────────┐
           ↓                     ↓
    [Content Structurer]   [Asset Generator]
           ↓                     ↓
           └──────────┬──────────┘
                      ↓
              [Quality Checker]
                      ↓
            [PostgreSQL + Storage]

Implementation Phases

Phase 1: Research & Prototyping (2 weeks)

Goals: Validate technical feasibility and quality

Activities:

Key Learning: Direct PDF + prompt works surprisingly well for structured educational content, achieving ~75% quality vs. human baseline.

Phase 2: Pipeline Development (4 weeks)

Goals: Build robust, scalable infrastructure

Challenges:

1. Document Preprocessing

Raw documents needed cleanup:

Solution: Cloud Functions pipeline:

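import { Storage } from '@google-cloud/storage';

// chunkDocument is an application helper (not shown) that splits a file into model-sized chunks.
async function preprocessDocument(bucketName: string, filePath: string) {
  const storage = new Storage();
  // bucket().file() only builds a reference; it makes no network call, so no await is needed
  const file = storage.bucket(bucketName).file(filePath);

  // Extract and chunk content
  const chunks = await chunkDocument(file, {
    maxTokens: 30000,
    preserveStructure: true
  });

  return chunks;
}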
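The `chunkDocument` helper itself is not shown. As a rough illustration of what structure-preserving chunking can look like, here is a minimal sketch. It assumes the document has already been extracted to plain text with blank lines between paragraphs, and it approximates tokens as about four characters each, which is a heuristic and not the tokenizer Gemini uses. The names `estimateTokens` and `chunkByParagraph` are hypothetical.

```typescript
// Rough token estimate: ~4 characters per token. This is a heuristic assumption,
// not the tokenizer the model actually uses.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical stand-in for chunkDocument: packs whole paragraphs into chunks
// that stay within maxTokens, so paragraph boundaries are never cut.
function chunkByParagraph(text: string, maxTokens: number): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let currentTokens = 0;

  for (const paragraph of paragraphs) {
    const tokens = estimateTokens(paragraph);
    // Close the current chunk when this paragraph would push it over budget.
    if (current.length > 0 && currentTokens + tokens > maxTokens) {
      chunks.push(current.join('\n\n'));
      current = [];
      currentTokens = 0;
    }
    // A single oversized paragraph becomes its own chunk rather than being split.
    current.push(paragraph);
    currentTokens += tokens;
  }

  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}
```

Packing whole paragraphs keeps headings and their body text together more often than a fixed character window would, at the cost of slightly uneven chunk sizes.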

2. Prompt Engineering at Scale

Different content types need different prompts:

Solution: Template system with dynamic prompt construction:

const promptTemplate = {
  technical: `Analyze this technical document and create a structured lesson...`,
  marketing: `Transform this content into an engaging learning experience...`,
  research: `Extract key findings and create evidence-based training...`
};

const prompt = buildPrompt(contentType, context, userPreferences);
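The body of `buildPrompt` is not shown above. One way such a function could work, sketched under assumptions: the context and preference fields (`audience`, `lessonCount`, `tone`) are hypothetical, the function is named `buildPromptSketch` to make clear it is not the production version, and the template strings are copied as-is, ellipses included.

```typescript
type ContentType = 'technical' | 'marketing' | 'research';

// Hypothetical context and preference shapes; the real fields are not shown in the post.
interface PromptContext {
  audience: string;
  lessonCount: number;
}

interface UserPreferences {
  tone?: string;
}

// Base templates, copied verbatim (including the elided endings).
const baseTemplates: Record<ContentType, string> = {
  technical: 'Analyze this technical document and create a structured lesson...',
  marketing: 'Transform this content into an engaging learning experience...',
  research: 'Extract key findings and create evidence-based training...'
};

// Appends context and optional preferences to the template for the content type.
function buildPromptSketch(
  contentType: ContentType,
  context: PromptContext,
  preferences: UserPreferences = {}
): string {
  const lines = [
    baseTemplates[contentType],
    `Target audience: ${context.audience}.`,
    `Produce ${context.lessonCount} lessons.`
  ];
  if (preferences.tone) {
    lines.push(`Use a ${preferences.tone} tone.`);
  }
  return lines.join('\n');
}
```

Keeping the templates in a typed record means adding a new content type without a template fails at compile time rather than at request time.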

3. Handling Multimodal Inputs

Combining text, images, and videos in a single request:

const request = {
  contents: [{
    role: 'user',
    parts: [
      { text: prompt },
      { fileData: { mimeType: 'application/pdf', fileUri: pdfUri } },
      { fileData: { mimeType: 'video/mp4', fileUri: videoUri } },
      { inlineData: { mimeType: 'image/jpeg', data: base64Image } }
    ]
  }]
};

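// generateContent is called on the model returned by vertexAI.getGenerativeModel(), not on the client
const response = await model.generateContent(request);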
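In practice the `parts` array is usually assembled from whatever the user uploaded. A minimal sketch of that step follows, assuming each asset is either already in Cloud Storage (referenced by URI) or available as raw bytes (inlined as base64). The `Asset` shape and the `toParts` name are hypothetical; the `Part` type is a local mirror of the three part shapes used in the request above, not the SDK's own type.

```typescript
// Local mirror of the part shapes used in the request above.
type Part =
  | { text: string }
  | { fileData: { mimeType: string; fileUri: string } }
  | { inlineData: { mimeType: string; data: string } };

// Hypothetical description of an uploaded asset.
interface Asset {
  mimeType: string;
  gcsUri?: string;    // gs:// URI when the file already lives in Cloud Storage
  bytes?: Uint8Array; // raw bytes when the file is small enough to inline
}

// Builds the parts array: prompt first, then one part per asset.
function toParts(prompt: string, assets: Asset[]): Part[] {
  const parts: Part[] = [{ text: prompt }];
  for (const asset of assets) {
    if (asset.gcsUri) {
      parts.push({ fileData: { mimeType: asset.mimeType, fileUri: asset.gcsUri } });
    } else if (asset.bytes) {
      const data = Buffer.from(asset.bytes).toString('base64');
      parts.push({ inlineData: { mimeType: asset.mimeType, data } });
    } else {
      throw new Error(`Asset of type ${asset.mimeType} has neither gcsUri nor bytes`);
    }
  }
  return parts;
}
```

Preferring the URI when both are present keeps request payloads small; inlining is left for small images that never touch Cloud Storage.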

4. Quality Control

AI output needs validation before the user sees it:

Solution: Multi-stage validation pipeline:

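// SourceDocument stands for whatever type the pipeline uses for the uploaded source material.
async function validateOutput(content: GeneratedContent, sourceDocuments: SourceDocument[]) {
  // Stage 1: Safety filters
  // moderateContent is not shown here; it appears to be an application-level wrapper
  // rather than a method of the @google-cloud/vertexai client
  const safetyCheck = await vertexAI.moderateContent(content);
  if (!safetyCheck.passed) throw new SafetyError();

  // Stage 2: Structural validation
  validateStructure(content);

  // Stage 3: Fact-checking (for sourced content)
  await verifyFactualClaims(content, sourceDocuments);

  return content;
}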
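Of the three stages, structural validation is the easiest to make concrete. The sketch below shows one possible check; the `GeneratedCourse` and `Lesson` shapes are assumptions, since the real `GeneratedContent` type is not shown, and `findStructureProblems` is a hypothetical name.

```typescript
// Assumed output shape; the real GeneratedContent type is not shown in the post.
interface Lesson {
  title: string;
  body: string;
}

interface GeneratedCourse {
  title: string;
  lessons: Lesson[];
}

// Returns a list of human-readable problems; an empty list means the structure is acceptable.
function findStructureProblems(course: GeneratedCourse): string[] {
  const problems: string[] = [];
  if (course.title.trim() === '') problems.push('course title is empty');
  if (course.lessons.length === 0) problems.push('course has no lessons');
  course.lessons.forEach((lesson, index) => {
    if (lesson.title.trim() === '') problems.push(`lesson ${index + 1} has no title`);
    if (lesson.body.trim() === '') problems.push(`lesson ${index + 1} has no body`);
  });
  return problems;
}
```

Collecting every problem instead of throwing on the first one makes it possible to log all of them, or to feed them back into a regeneration prompt.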

Phase 3: Cost Optimization (2 weeks)

Initial costs: $2.50 per course generation (not viable at scale)

Optimization strategies:

1. Smart Chunking

2. Model Selection

3. Batch Processing

Final cost: $0.80 per course generation (roughly a 3x reduction from $2.50)
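With pay-per-token pricing, per-course cost can be estimated from token counts before a request is sent. A minimal sketch of that arithmetic follows. The `TokenPrices` shape and function names are hypothetical, and any prices passed in are placeholders you would replace with the current published rates for your model.

```typescript
// Prices per one million tokens, in USD. Values are supplied by the caller;
// nothing here reflects actual Vertex AI pricing.
interface TokenPrices {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

// Estimated cost of one generation from its input and output token counts.
function estimateCostUsd(inputTokens: number, outputTokens: number, prices: TokenPrices): number {
  const inputCost = (inputTokens / 1_000_000) * prices.inputPerMillionUsd;
  const outputCost = (outputTokens / 1_000_000) * prices.outputPerMillionUsd;
  return inputCost + outputCost;
}

// How many times cheaper the optimized pipeline is than the original.
function costReductionFactor(beforeUsd: number, afterUsd: number): number {
  return beforeUsd / afterUsd;
}
```

For the figures above, `costReductionFactor(2.5, 0.8)` comes out at about 3.1.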

Phase 4: Production Rollout (3 weeks)

Strategy: Gradual rollout with human oversight

Monitoring:

// Track quality metrics
metrics.track('ai_generation', {
  duration_ms: generationTime,
  token_count: tokensUsed,
  cost_usd: estimatedCost,
  user_edited: userMadeEdits,
  user_rating: userFeedback
});
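Events like these are only useful once aggregated. The sketch below shows one way to roll them up into rollout-level numbers. The field names follow the `metrics.track` call above; treating `user_rating` as an optional numeric score is an assumption, and `summarizeGenerations` is a hypothetical name.

```typescript
interface GenerationEvent {
  duration_ms: number;
  token_count: number;
  cost_usd: number;
  user_edited: boolean;
  user_rating?: number; // assumed numeric; absent when the user gave no feedback
}

interface GenerationSummary {
  count: number;
  editRate: number;         // share of generations the user edited afterwards
  avgCostUsd: number;
  avgRating: number | null; // null when no event carries a rating
}

function summarizeGenerations(events: GenerationEvent[]): GenerationSummary {
  if (events.length === 0) {
    return { count: 0, editRate: 0, avgCostUsd: 0, avgRating: null };
  }
  const edited = events.filter(e => e.user_edited).length;
  const totalCost = events.reduce((sum, e) => sum + e.cost_usd, 0);
  const ratings = events
    .map(e => e.user_rating)
    .filter((r): r is number => r !== undefined);
  const avgRating = ratings.length > 0
    ? ratings.reduce((sum, r) => sum + r, 0) / ratings.length
    : null;
  return {
    count: events.length,
    editRate: edited / events.length,
    avgCostUsd: totalCost / events.length,
    avgRating
  };
}
```

The edit rate is a handy proxy for output quality during a gradual rollout: if most generations are edited heavily, the prompts need work before widening access.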

Results

After 3 months in production:

Lessons Learned

1. Start with Narrow Use Cases

Don’t try to solve everything with AI. We succeeded by:

2. Human-in-the-Loop Is Critical

Even great AI needs human oversight:

3. Cost Management from Day One

AI costs scale with usage. Plan for:

4. Prompt Engineering Is a Product Feature

Your prompts ARE your product differentiation:

5. Multimodal != Magic

Combining modalities adds complexity:

Only go multimodal when single-mode isn’t sufficient.

Technical Deep Dive: Vertex AI Integration

Setting Up Vertex AI

import { VertexAI, HarmCategory, HarmBlockThreshold } from '@google-cloud/vertexai';

const vertexAI = new VertexAI({
  project: 'your-project-id',
  location: 'us-central1'
});

const model = vertexAI.getGenerativeModel({
  model: 'gemini-1.5-pro',
  generationConfig: {
    maxOutputTokens: 8192,
    temperature: 0.4,
    topP: 0.95
  },
  safetySettings: [
    {
      // Use the SDK enums so the settings type-check in TypeScript
      category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
      threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
    }
  ]
});

Handling Large Files

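async function processLargeDocument(fileUri: string, prompt: string) {
  // Upload to Cloud Storage if not already there (uploadToGCS is an application helper, not shown)
  const gcsUri = await uploadToGCS(fileUri);

  // Reference the file by gs:// URI instead of inlining base64 data. Per-file size
  // limits still apply, so check the current Vertex AI limits for your model.
  const request = {
    contents: [{
      role: 'user',
      parts: [
        { text: prompt },
        {
          fileData: {
            mimeType: 'application/pdf',
            fileUri: gcsUri
          }
        }
      ]
    }]
  };

  const result = await model.generateContent(request);
  // The Vertex AI Node SDK returns text on the candidate's parts, not via a text() helper
  return result.response.candidates?.[0]?.content.parts[0]?.text ?? '';
}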

Error Handling

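import { GenerateContentRequest } from '@google-cloud/vertexai';

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function robustGeneration(request: GenerateContentRequest) {
  const maxRetries = 3;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await model.generateContent(request);
    } catch (error) {
      // Assumes the thrown error carries a status string in `code`; verify against
      // the error shape your SDK version actually throws
      const code = (error as { code?: string }).code;
      if (code === 'INVALID_ARGUMENT') {
        // Bad request, don't retry
        throw error;
      }
      // Rate limited (RESOURCE_EXHAUSTED) or transient error: back off
      // exponentially so retries never fire in a tight loop
      if (attempt < maxRetries - 1) {
        await sleep(2 ** attempt * 1000);
      }
    }
  }

  throw new Error('Max retries exceeded');
}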
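When many requests hit a rate limit at once, plain exponential backoff makes them all retry at the same moments. A common refinement, which is not part of the code above, is to cap the delay and add random jitter. Here is a minimal sketch; the `backoffDelayMs` name, the default base and cap values, and the injectable `random` parameter (there so the function can be tested deterministically) are all assumptions.

```typescript
// Capped exponential backoff with "full jitter": the delay is a random value
// between 0 and min(capMs, baseMs * 2^attempt).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  random: () => number = Math.random
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * exponential);
}
```

It could replace the fixed `2 ** attempt * 1000` delay in a retry loop, for example `await sleep(backoffDelayMs(attempt))`.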

What’s Next?

Future enhancements:

Conclusion

Multimodal AI is production-ready for SaaS products, but requires careful engineering:

The technology is powerful, but success comes from thoughtful product integration, not just API calls.


Building multimodal AI into your SaaS product? I’d love to discuss architecture approaches and lessons learned. Connect with me on LinkedIn.