Mohammad Ausaf

AI Studio

Enterprise Agentic AI Platform for Autonomous Content Generation

Overview

AI Studio is a multi-tenant agentic AI platform at Galleri5 that orchestrates content generation across 78+ models from 7 providers (Replicate, FAL, ElevenLabs, BytePlus, Google, RunPod, and an internal H100 GPU cluster). Users describe what they want in natural language; the system decomposes it into a DAG of tasks and executes them autonomously. The platform includes a filmmaking module co-developed with Microsoft for AI-driven feature films and episodic series.

How a Request Flows

A user submits a natural language prompt. Gemini parses it into a JSON workflow — a DAG of nodes with dependencies (e.g., image generation → video synthesis → audio overlay). Each node specifies a tool, inputs, and execution mode.

Nodes are topologically sorted (Kahn's algorithm), and dependencies are resolved at runtime. A downstream node references upstream outputs directly via source-output key mappings. Fast tools (text, image gen) run synchronously. Heavy tools (video, lip-sync) submit async jobs and poll status endpoints — or receive webhook callbacks. Background tasks monitor completion and trigger downstream nodes.
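The ordering step can be sketched as follows — a minimal Kahn's-algorithm pass, assuming each workflow node is keyed by an id and lists its dependency ids (the node names below are illustrative, not the production schema):

```python
from collections import deque

def topological_order(nodes):
    """Kahn's algorithm over a workflow DAG: nodes maps node_id -> dependency ids."""
    indegree = {nid: len(deps) for nid, deps in nodes.items()}
    dependents = {nid: [] for nid in nodes}
    for nid, deps in nodes.items():
        for dep in deps:
            dependents[dep].append(nid)
    # start from nodes with no unmet dependencies
    queue = deque(nid for nid, d in indegree.items() if d == 0)
    order = []
    while queue:
        nid = queue.popleft()
        order.append(nid)
        for child in dependents[nid]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if len(order) != len(nodes):
        raise ValueError("workflow contains a cycle")
    return order

# the image -> video -> audio chain from the example above
workflow = {"image_gen": [], "video_synth": ["image_gen"], "audio_overlay": ["video_synth"]}
```

Any node whose indegree never reaches zero signals a cycle, which is why the same pass doubles as a cheap executability check.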

A provider factory routes each tool invocation to the appropriate backend. Cost-aware selection downgrades to cheaper models when credits are low. The workflow executes end-to-end without human intervention.
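The cost-aware downgrade can be sketched like this — a toy selector over (model, cost, quality) candidates, where the threshold, tuple shape, and model names are assumptions for illustration:

```python
def select_model(candidates, credits_remaining, low_credit_threshold=100):
    """Pick the best-quality model the org can afford; downgrade when credits run low.

    candidates: list of (model_id, cost_per_call, quality_rank) tuples,
    where a higher quality_rank is better.
    """
    affordable = [c for c in candidates if c[1] <= credits_remaining]
    if not affordable:
        raise RuntimeError("insufficient credits for any candidate model")
    if credits_remaining < low_credit_threshold:
        # low on credits: fall back to the cheapest affordable model
        return min(affordable, key=lambda c: c[1])[0]
    # otherwise prefer the highest-quality affordable model
    return max(affordable, key=lambda c: c[2])[0]
```

The real factory would also account for provider availability and per-model capability, but the affordability check is the core of the downgrade path.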

Episodic Content Pipeline

The filmmaking module decomposes high-level episode briefs into a production-ready hierarchy: Episodes → Scenes → Shots. Each shot carries its own magnification (CU, MCU, WS, LS), action description, composition reference (paper edit), and explicit links to character, location, and prop assets. A state machine tracks every entity through its lifecycle — from creation through asset readiness, generation, review, and final approval.
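The lifecycle described above can be modeled as a small transition table — a sketch with assumed state names matching the creation → asset readiness → generation → review → approval flow, including the rejection loop back to generation:

```python
from enum import Enum

class ShotState(Enum):
    CREATED = "created"
    ASSETS_READY = "assets_ready"
    GENERATING = "generating"
    IN_REVIEW = "in_review"
    APPROVED = "approved"

# Allowed transitions; a rejection in review loops back to generation.
TRANSITIONS = {
    ShotState.CREATED: {ShotState.ASSETS_READY},
    ShotState.ASSETS_READY: {ShotState.GENERATING},
    ShotState.GENERATING: {ShotState.IN_REVIEW},
    ShotState.IN_REVIEW: {ShotState.APPROVED, ShotState.GENERATING},
    ShotState.APPROVED: set(),  # terminal
}

def advance(current, target):
    """Move an entity to a new state, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Centralizing transitions in one table means every entity type (episode, scene, shot) can share the same guard logic.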

The asset library maintains character turnaround sheets, expression grids, pose sheets, location establishing shots, and prop references — all versioned with approval workflows. When a shot is queued for generation, the system resolves all asset references, validates that each has been approved, and assembles them into a structured generation request. Shots within the same scene share location and character assets, enforcing visual continuity across sequential frames.
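The resolve-and-validate step before generation might look like this — a sketch assuming shots carry a list of asset ids and the library tracks an approval status per asset (field names are illustrative):

```python
def resolve_shot_assets(shot, asset_library):
    """Resolve a shot's asset references and require every one to be approved.

    shot: dict with an "assets" list of asset ids.
    asset_library: dict of asset_id -> asset record.
    """
    resolved = []
    for asset_id in shot["assets"]:
        asset = asset_library.get(asset_id)
        if asset is None:
            raise KeyError(f"unknown asset reference: {asset_id}")
        if asset["status"] != "approved":
            raise ValueError(f"asset {asset_id} is not approved (status={asset['status']})")
        resolved.append(asset)
    return resolved
```

Failing fast here is what enforces the quality gate: a shot can never be queued against a draft or rejected reference.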

Bulk generation runs through GCP Cloud Tasks — hundreds of shots queued and dispatched asynchronously, each routed to the appropriate provider based on the shot's requirements. This pipeline generated the Mahabharat series currently streaming on JioHotstar.

Workflow Planner: LLM-Driven Agentic Orchestration

The Workflow Planner is a standalone agentic system that converts natural language prompts into executable multi-step plans. It uses Gemini to generate a DAG of WorkflowNodes — each specifying a tool, input parameters, execution mode, and dependency references. The tool manifest is entirely config-driven — new tools can be registered without code changes.

Before execution, the system classifies whether a user message requires a full workflow or a simple chat response. If it's a workflow, the planner generates nodes, validates for circular dependencies and missing references, performs topological ordering, and handles runtime dependency resolution — caching upstream outputs and injecting them into downstream inputs automatically.
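The validation pass can be sketched as two checks over the dependency map — missing references, then cycle detection via iterative DFS coloring (the dict shape is an assumption; the real planner validates full WorkflowNode objects):

```python
def validate_workflow(nodes):
    """Check a planned DAG for missing references and circular dependencies.

    nodes: dict of node_id -> list of dependency ids.
    Returns a list of error strings; an empty list means the plan is executable.
    """
    errors = []
    for nid, deps in nodes.items():
        for dep in deps:
            if dep not in nodes:
                errors.append(f"{nid} references missing node {dep}")

    # iterative DFS with white/grey/black coloring to find back edges
    WHITE, GREY, BLACK = 0, 1, 2
    color = {nid: WHITE for nid in nodes}

    def has_cycle(start):
        stack = [(start, iter(nodes[start]))]
        color[start] = GREY
        while stack:
            nid, it = stack[-1]
            for dep in it:
                if dep not in nodes:
                    continue  # already reported as a missing reference
                if color[dep] == GREY:
                    return True  # back edge -> cycle
                if color[dep] == WHITE:
                    color[dep] = GREY
                    stack.append((dep, iter(nodes[dep])))
                    break
            else:
                color[nid] = BLACK
                stack.pop()
        return False

    for nid in nodes:
        if color[nid] == WHITE and has_cycle(nid):
            errors.append("workflow contains a circular dependency")
            break
    return errors
```

Collecting all errors instead of raising on the first lets the planner hand the full list back to the LLM for a single repair round.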

The provider layer abstracts 8+ execution backends behind a uniform interface. Provider configs are parameterized via templates, so adding a new provider is a config change, not a code change. Async providers are handled transparently — the planner manages job submission, status polling, and result extraction identically regardless of whether the underlying provider is synchronous or webhook-based.
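A config-driven registry along these lines makes "new provider = config change" concrete — every field name and endpoint below is illustrative, not the production schema:

```python
# Each provider is a template entry; adding a backend is a new dict entry,
# not new code. Execution mode drives how the dispatcher handles the job.
PROVIDER_TEMPLATES = {
    "fal": {
        "base_url": "https://fal.run/{model}",
        "mode": "async",        # submit a job, then poll a status endpoint
        "poll_interval_s": 5,
    },
    "replicate": {
        "base_url": "https://api.replicate.com/v1/predictions",
        "mode": "webhook",      # the provider calls us back on completion
    },
    "internal_h100": {
        "base_url": "http://gpu-dispatcher.internal/{model}",
        "mode": "sync",
    },
}

def build_request(provider, model, payload):
    """Expand a provider template into a concrete request description."""
    template = PROVIDER_TEMPLATES[provider]
    return {
        "url": template["base_url"].format(model=model),
        "mode": template["mode"],
        "payload": payload,
    }
```

The dispatcher then branches on `mode` once, in one place, rather than per provider — which is what makes sync, polling, and webhook backends look identical to the planner.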

Prompt Engineering & Character Consistency

Each generation call goes through a structured prompt assembly layer that composes multimodal inputs from multiple reference sources. The prompt is built in sections: composition requirements from the paper edit image, character list with explicit appearance constraints from reference sheets, prop descriptions with visual requirements, shot details (magnification, framing, action), and an indexed reference image map so the model knows which image corresponds to which entity.
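The sectioned assembly might look like this — a sketch where the section labels, field names, and the convention that image 1 is the paper edit are all assumptions for illustration:

```python
def assemble_shot_prompt(shot, characters, props, reference_images):
    """Compose a sectioned multimodal prompt for one shot.

    reference_images: ordered list of (label, url); the index map at the end
    tells the model which attached image corresponds to which entity.
    """
    sections = []
    sections.append("COMPOSITION: match the framing of the paper edit (image 1).")
    sections.append("CHARACTERS: " + "; ".join(
        f"{c['name']} must match the reference sheet exactly (face, clothing, build)"
        for c in characters))
    if props:
        sections.append("PROPS: " + "; ".join(p["description"] for p in props))
    sections.append(
        f"SHOT: {shot['magnification']} | {shot['framing']} | {shot['action']}")
    sections.append("REFERENCE IMAGES: " + "; ".join(
        f"image {i + 1} = {label}" for i, (label, _url) in enumerate(reference_images)))
    return "\n".join(sections)
```

Keeping the sections in a fixed order gives the model a stable structure to attend to across hundreds of shots.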

Character consistency is enforced through reference-based generation — the same character turnaround sheet is passed to every shot featuring that character, with the prompt explicitly instructing the model to match facial features, clothing, and build. Multiple reference images are encoded and sent alongside the prompt. Automatic template selection picks the right prompt structure (turnaround, expression grid, multi-angle sheet) based on asset metadata.

Review Loops & Quality Control

Generated assets pass through an iterative review cycle. After bulk generation, outputs enter a review interface where reviewers can refine via conversation, annotate with visual markup, or provide written feedback. Each decision is tracked with cycle count, reviewer ID, and timestamp.

Rejected assets re-enter the generation loop with feedback injected into the next prompt iteration. Quality gates enforce that assets must be approved before being used in downstream shot generation. The system maintains a complete revision history for auditability.
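The feedback-injection step can be sketched as a pure function over the review history — the record shape and section label are assumptions:

```python
def next_generation_prompt(base_prompt, review_history):
    """Fold reviewer feedback from rejected cycles into the next prompt iteration.

    review_history: list of dicts with "decision" and "feedback" keys.
    Only rejection feedback is injected; approvals end the loop upstream.
    """
    rejections = [r["feedback"] for r in review_history
                  if r["decision"] == "rejected" and r.get("feedback")]
    if not rejections:
        return base_prompt
    notes = "\n".join(f"- {fb}" for fb in rejections)
    return f"{base_prompt}\n\nREVISION NOTES (address all of these):\n{notes}"
```

Because the full history is re-injected each cycle, a later generation cannot silently regress on feedback that was already addressed.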

Inference Gateway (Blog)

The Inference Gateway is a standalone distributed system that serves as the single entry point for all AI job submissions across every customer org. The gateway was built on top of the GPU Dispatcher's core architecture — slot-based concurrency, priority queuing, crash recovery — extending those patterns to orchestrate both external API providers and an internal 8-server H100 fleet. Jobs are enqueued per-org with priority (HIGH / NORMAL / LOW) and dispatched using Weighted Round Robin — ensuring fair capacity distribution across tenants.

Before dispatching any job, four concurrency levels are checked in sequence: org global cap, org per-API cap, provider API cap, provider global cap. Org-level uses Redis counter-based tracking. Provider-level uses Redis slot-based concurrency with TTL (inherited from the GPU Dispatcher) — slots self-heal if a process crashes while holding one. Jobs are claimed via lease-based locking to prevent double-dispatch across workers.
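The provider-level slot mechanism can be sketched with an in-memory stand-in — production uses Redis keys with TTLs, but the self-healing behaviour is the same: an unreleased slot simply expires, so a crashed holder cannot leak capacity forever.

```python
import time

class SlotPool:
    """In-memory stand-in for Redis slot-based concurrency with TTL.

    A slot auto-expires if its holder crashes without releasing it, so
    capacity self-heals over time.
    """
    def __init__(self, capacity, ttl_s):
        self.capacity = capacity
        self.ttl_s = ttl_s
        self.slots = {}  # slot holder -> expiry timestamp

    def acquire(self, job_id, now=None):
        now = now if now is not None else time.monotonic()
        # drop expired slots first: this is the self-healing step
        self.slots = {k: exp for k, exp in self.slots.items() if exp > now}
        if len(self.slots) >= self.capacity:
            return False  # provider is at its concurrency cap
        self.slots[job_id] = now + self.ttl_s
        return True

    def release(self, job_id):
        self.slots.pop(job_id, None)
```

The org-level counters are simpler (plain increments/decrements with repair sweeps), so only the provider side needs TTL semantics.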

Crash recovery: Write-ahead logging happens before enqueue, so even a crash between those steps is recoverable. Four background workers run continuously — sweeping expired leases, detecting orphaned jobs, reconciling pending state, and repairing concurrency counters — ensuring no job is ever lost or double-dispatched.

Idempotent submissions prevent duplicate work across retries. Rate limiting uses atomic operations for race-safe throughput checks. Failed jobs retry with exponential backoff; exhausted jobs move to a Dead Letter Queue for manual recovery. The system routes to 81+ models across 8 provider adapters — FAL, Replicate, BytePlus, Google, ElevenLabs, RunPod, GPU direct, and generic HTTP (sync/async/webhook).
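The retry/DLQ decision can be sketched as one function — the attempt cap, base delay, and jitter range below are illustrative defaults, not the production tuning:

```python
import random

def schedule_retry(job, max_attempts=5, base_delay_s=2.0, max_delay_s=300.0):
    """Decide the next step for a failed job: retry with capped exponential
    backoff plus jitter, or move it to the Dead Letter Queue once attempts
    are exhausted.

    job: dict with an "attempts" counter of retries already made.
    Returns ("retry", delay_seconds) or ("dead_letter", None).
    """
    attempt = job["attempts"] + 1
    if attempt > max_attempts:
        return ("dead_letter", None)
    delay = min(base_delay_s * (2 ** (attempt - 1)), max_delay_s)
    delay *= random.uniform(0.5, 1.0)  # jitter avoids synchronized retry storms
    return ("retry", delay)
```

The jitter matters at this scale: without it, a provider outage would resurrect hundreds of jobs on the same tick and immediately trip the rate limiter again.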

Hard Problem #1: Stateless Planner in Multi-Turn Chat

The workflow planner is stateless — it takes a prompt and outputs a DAG. But it lives inside a chat with history, uploaded files, and previous generations. A user says "now make that image into a video" — and "that image" refers to something from 5 messages ago.

Solution: Before calling the planner, I build a context window — recent messages, file references with their storage URLs, previous workflow outputs. This gets injected into the planner prompt so it can resolve references like "that image" to actual asset URLs. The generated workflow executes independently, but its outputs feed back into the chat as assistant messages, maintaining conversational continuity.
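The context-window assembly might look like this — a sketch with assumed message, file, and output shapes, showing how prior artifacts become resolvable references for the stateless planner:

```python
def build_planner_context(messages, files, workflow_outputs, max_messages=10):
    """Assemble the context block injected ahead of the planner prompt so a
    stateless planner can resolve references like "that image" to real URLs.
    """
    lines = ["RECENT CONVERSATION:"]
    for m in messages[-max_messages:]:
        lines.append(f"{m['role']}: {m['content']}")
    if files:
        lines.append("UPLOADED FILES:")
        lines.extend(f"- {f['name']}: {f['url']}" for f in files)
    if workflow_outputs:
        lines.append("PREVIOUS OUTPUTS:")
        lines.extend(f"- {o['kind']}: {o['url']}" for o in workflow_outputs)
    return "\n".join(lines)
```

Truncating to the last few messages keeps the planner prompt bounded while still covering the "5 messages ago" case in practice.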

Hard Problem #2: Error Handling Across 78 Models

External AI providers fail in different ways — rate limits, balance exhaustion, model unavailability, timeouts, malformed responses. Each failure mode needs different handling: retry with backoff, fallback to alternate providers, refund credits, or alert via Slack.

Solution: AI-powered error classification. Errors are sent to an LLM that categorises them by type — billing issues, rate limits, provider outages, malformed responses. Each category routes to a different alert channel with appropriate urgency. Financial errors escalate immediately. Timeouts trigger automatic cancellation via provider cancel APIs. On failure, credits auto-refund. No manual intervention needed.
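A rule-based stand-in for the LLM classifier shows the routing shape — the categories mirror the ones above, but the keyword rules and channel names are illustrative:

```python
# Category -> alert route. Financial errors are urgent and refund credits.
ROUTES = {
    "billing": {"channel": "#alerts-critical", "urgent": True, "refund": True},
    "rate_limit": {"channel": "#alerts-infra", "urgent": False, "refund": False},
    "provider_outage": {"channel": "#alerts-infra", "urgent": True, "refund": True},
    "malformed_response": {"channel": "#alerts-dev", "urgent": False, "refund": True},
}

def classify_error(message):
    """Crude keyword classifier standing in for the LLM call."""
    msg = message.lower()
    if "insufficient" in msg or "balance" in msg or "payment" in msg:
        return "billing"
    if "429" in msg or "rate limit" in msg:
        return "rate_limit"
    if "503" in msg or "unavailable" in msg or "timeout" in msg:
        return "provider_outage"
    return "malformed_response"

def route_error(message):
    category = classify_error(message)
    return {"category": category, **ROUTES[category]}
```

The LLM earns its keep on the long tail: provider error strings are inconsistent enough that keyword rules alone misroute, while the routing table itself stays deterministic.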

Production Infrastructure

The system runs across GCP, Azure, and AWS — with MongoDB Atlas as the primary datastore and Redis for caching and coordination.

Caching is Redis-first with MongoDB fallback — sub-10ms retrieval in the happy path, graceful degradation if Redis goes down. Rate limiting uses sliding windows per-user and per-provider. Connection pools are sized differentially based on traffic patterns to prevent exhaustion under load.
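The read path can be sketched as follows — stores are mocked as dicts here, and the backfill-on-miss behaviour is an assumption about the implementation:

```python
def cached_get(key, redis_store, mongo_store):
    """Redis-first read with MongoDB fallback.

    If Redis is down or misses, serve from Mongo and backfill the cache
    on a best-effort basis so the next read hits the fast path.
    """
    try:
        value = redis_store.get(key)
        if value is not None:
            return value
    except ConnectionError:
        pass  # graceful degradation: Redis is down, fall through to Mongo
    value = mongo_store.get(key)
    if value is not None:
        try:
            redis_store[key] = value  # backfill for the next read
        except ConnectionError:
            pass  # cache write is best-effort
    return value
```

Swallowing the cache-write failure is deliberate: a degraded cache should never turn a successful Mongo read into a user-facing error.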

The credit system is a unified ledger across all features — a single pool prevents cross-module arbitrage. Per-model pricing is hot-swappable without deploys. I lead the on-call rotation, maintain runbooks, and own monitoring and alerting for the production infrastructure.

Built with AI Studio. Production content generated using this platform.