
Multi-Provider AI Gateway

High-availability AI routing microservice

Python · FastAPI · Redis · Gemini · OpenAI · Claude

Overview

The AI Gateway is a Python FastAPI microservice that acts as a smart proxy in front of multiple AI providers. It selects the best provider based on availability, cost budget, and routing rules, and falls back automatically when a provider fails or exceeds its budget. Responses are semantically cached in Redis, cutting AI costs by up to 40%.

Key Features

Provider routing: Gemini → OpenAI → Claude → Ollama (configurable order)
Automatic fallback on provider timeout, rate limit, or error
Per-provider budget caps with live usage tracking
Semantic response caching via Redis (similarity-based key matching)
Async job queue for non-blocking resume processing
Resume parsing with structured JSON output
Usage logging to PostgreSQL for cost analytics
REST API consumed by the NestJS backend
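The configurable routing order and per-provider budget caps could be modeled roughly like this. This is a minimal sketch: the class name, field names, and budget figures are placeholders, not the gateway's actual schema.

```python
from dataclasses import dataclass

# Illustrative config model -- names, fields, and dollar amounts are
# assumptions, not the gateway's real schema.
@dataclass
class ProviderConfig:
    name: str
    priority: int        # lower number = tried first
    budget_usd: float    # spend cap for this provider
    spent_usd: float = 0.0

    @property
    def under_budget(self) -> bool:
        return self.spent_usd < self.budget_usd

# Default chain: Gemini -> OpenAI -> Claude -> Ollama
DEFAULT_CHAIN = [
    ProviderConfig("gemini", priority=1, budget_usd=50.0),
    ProviderConfig("openai", priority=2, budget_usd=100.0),
    ProviderConfig("claude", priority=3, budget_usd=100.0),
    ProviderConfig("ollama", priority=4, budget_usd=float("inf")),  # local, free
]
```

Changing the routing order then only means editing priorities in one place, and the budget check stays uniform across providers.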

How It Works

1. Request Arrives

The NestJS backend sends a prompt to the AI Gateway REST endpoint with a task type (score, parse, optimize, test).
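The request contract for this step might look as follows. This is a stdlib sketch for illustration; the real service presumably validates with Pydantic, and the field names are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class TaskType(str, Enum):
    SCORE = "score"
    PARSE = "parse"
    OPTIMIZE = "optimize"
    TEST = "test"

@dataclass
class GatewayRequest:
    prompt: str
    task: TaskType

    @classmethod
    def from_dict(cls, payload: dict) -> "GatewayRequest":
        # TaskType(...) raises ValueError for unknown task types,
        # rejecting malformed requests at the boundary.
        return cls(prompt=payload["prompt"], task=TaskType(payload["task"]))
```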

2. Cache Check

The gateway computes a semantic hash of the prompt and checks Redis. If a similar request was made recently, it returns the cached response instantly.
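The cache lookup can be sketched like this. Note the assumptions: a plain dict stands in for Redis, and a normalized-prompt hash stands in for the gateway's real similarity-based matching (which would typically compare embeddings, not exact hashes).

```python
import hashlib

def semantic_key(prompt: str) -> str:
    # Stand-in for embedding-based similarity: normalize case and
    # whitespace so near-identical prompts collide on the same key.
    normalized = " ".join(prompt.lower().split())
    return "aigw:cache:" + hashlib.sha256(normalized.encode()).hexdigest()

cache = {}  # stands in for Redis (e.g. GET/SETEX with a TTL)

def get_or_set(prompt, compute):
    """Return (response, cache_hit); call the provider only on a miss."""
    key = semantic_key(prompt)
    if key in cache:
        return cache[key], True
    value = compute(prompt)
    cache[key] = value
    return value, False
```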

3. Provider Selection

On a cache miss, the router checks which providers are available and under budget, then picks the highest-priority one.
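The selection logic reduces to a filter plus a priority sort. A minimal sketch, assuming each provider record carries availability, spend, budget, and priority fields (the dict shape is illustrative):

```python
def select_provider(providers):
    """Return the first available, under-budget provider by priority,
    or None if every provider is down or over budget."""
    candidates = [
        p for p in providers
        if p["available"] and p["spent"] < p["budget"]
    ]
    if not candidates:
        return None
    # Lower priority number = tried first.
    return min(candidates, key=lambda p: p["priority"])
```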

4. AI Call + Fallback

The gateway calls the selected provider. If the call fails (timeout, rate limit, or error), it automatically retries with the next provider in the chain.
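The fallback chain is essentially a loop over providers in priority order. A sketch, assuming `call` is some function that invokes one provider and raises on failure (the real gateway would catch provider-specific timeout and rate-limit exceptions rather than bare `Exception`):

```python
def call_with_fallback(chain, call):
    """Try each provider in order; return (provider, response) from the
    first that succeeds, or raise if the whole chain fails."""
    errors = {}
    for provider in chain:
        try:
            return provider, call(provider)
        except Exception as exc:  # in practice: timeout / rate-limit / API errors
            errors[provider] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```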

5. Log & Return

The gateway logs token count, cost, and latency to the usage_log table, then returns a structured JSON response to the NestJS backend.
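The usage logging could look roughly like this. SQLite stands in for PostgreSQL here, and the column names are guesses based on the fields the page mentions (tokens, cost, latency), not the actual usage_log schema.

```python
import sqlite3

# In-memory SQLite stands in for the real PostgreSQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE usage_log (
        provider   TEXT,
        task       TEXT,
        tokens     INTEGER,
        cost_usd   REAL,
        latency_ms REAL
    )
""")

def log_usage(provider, task, tokens, cost_usd, latency_ms):
    # One row per AI call; aggregating by provider/task gives
    # the cost-analytics view.
    conn.execute(
        "INSERT INTO usage_log VALUES (?, ?, ?, ?, ?)",
        (provider, task, tokens, cost_usd, latency_ms),
    )
    conn.commit()
```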

Tech Stack

Runtime

  • Python 3.11
  • FastAPI
  • Uvicorn

AI Providers

  • Google Gemini
  • OpenAI GPT-4o
  • Anthropic Claude
  • Ollama (LLaMA)

Caching

  • Redis
  • Semantic similarity hashing

Queue

  • Background task queue
  • Async workers

Data

  • PostgreSQL (usage logs)
  • Pydantic schemas

Architecture

The gateway sits between the NestJS backend and all AI providers, acting as a smart router with caching, budget control, and async processing built in.

Consumers
  • NestJS Backend: caller service

Gateway Core
  • FastAPI Server: routes & validation
  • Provider Router: priority + budget logic
  • Redis Cache: semantic hashing
  • Async Queue: background workers

AI Providers
  • Gemini API: primary · cheapest
  • OpenAI GPT-4o: fallback #1
  • Claude API: fallback #2
  • Ollama (LLaMA): local · free

Storage
  • PostgreSQL: usage logs & cost

Data Flow

  1. Request in → FastAPI validates schema → Router checks Redis cache for semantic match.
  2. Cache miss → Router picks provider by priority and budget → calls AI provider API.
  3. Provider fails → automatic fallback to next in chain (Gemini → OpenAI → Claude → Ollama).
  4. Response → cached in Redis with semantic key → usage logged to PostgreSQL → returned to NestJS.
  5. Heavy tasks (large resume processing) queued as async background jobs to avoid request timeouts.
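Step 5's background queue can be sketched with asyncio. The page doesn't specify the queue implementation, so this uses a plain `asyncio.Queue` with one worker; the job ID and payload are illustrative stand-ins for resume-processing jobs.

```python
import asyncio

async def worker(queue, results):
    # Drains jobs off the queue so request handlers can enqueue
    # and return immediately instead of blocking until completion.
    while True:
        job_id, payload = await queue.get()
        results[job_id] = "processed:" + payload  # stand-in for resume processing
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    results = {}
    task = asyncio.create_task(worker(queue, results))
    await queue.put(("job-1", "resume.pdf"))  # handler enqueues and returns
    await queue.join()   # wait here only for the demo; handlers wouldn't block
    task.cancel()
    return results
```

In production this role is often filled by FastAPI's `BackgroundTasks` for light work or a dedicated queue (e.g. a Redis-backed one) for heavy jobs.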
