Mongolian Transcription App

AI-powered video subtitle generator

An end-to-end transcription platform that converts Mongolian video and audio files into SubRip (.srt) subtitles using Google Speech-to-Text v2. Upload a media file through the web interface, track progress in real time, and download time-aligned subtitles ready for YouTube or DaVinci Resolve.

Under the hood, it runs a scalable microservice stack built entirely from scratch:

- The FastAPI backend handles file presigning, job orchestration, and database state management.
- The worker service segments media into manageable chunks, extracts audio with FFmpeg, calls Google Speech-to-Text v2, and assembles SRT output with proper timestamps.
- MinIO (S3-compatible) stores uploaded media and results.
- PostgreSQL and Redis provide persistence and queueing.
- The Next.js frontend provides upload progress, drag-and-drop support, error handling, and a live video preview with a subtitle track.
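
The worker's first two steps (segmenting media and extracting audio with FFmpeg) can be sketched roughly as follows. This is a minimal illustration, not the project's actual worker code: the `Chunk` type, the function names, the 55-second chunk length, and the 16 kHz mono WAV output are all assumptions chosen for the example. Only the FFmpeg options themselves (`-ss`, `-t`, `-vn`, `-ac`, `-ar`, `-c:a`) are standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int
    start: float     # offset into the source media, in seconds
    duration: float  # length of this chunk, in seconds


def plan_chunks(total_seconds: float, chunk_seconds: float = 55.0) -> list[Chunk]:
    """Split [0, total_seconds) into consecutive chunks of at most chunk_seconds.

    The 55-second default is an assumed value for illustration only.
    """
    if total_seconds <= 0 or chunk_seconds <= 0:
        return []
    chunks: list[Chunk] = []
    start = 0.0
    while start < total_seconds:
        length = min(chunk_seconds, total_seconds - start)
        chunks.append(Chunk(len(chunks), start, length))
        start += chunk_seconds
    return chunks


def ffmpeg_extract_args(src: str, dst: str, chunk: Chunk) -> list[str]:
    """Build an FFmpeg argument list that extracts one chunk as 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",                  # overwrite the output file if it exists
        "-ss", f"{chunk.start:.3f}",     # seek to the chunk start (input option)
        "-t", f"{chunk.duration:.3f}",   # read only this many seconds
        "-i", src,
        "-vn",                           # drop any video stream
        "-ac", "1",                      # downmix to mono
        "-ar", "16000",                  # resample to 16 kHz (assumed target rate)
        "-c:a", "pcm_s16le",             # 16-bit PCM audio
        dst,
    ]
```

Each argument list could then be passed to `subprocess.run(args, check=True)`. Keeping the chunk's `start` offset alongside the audio file is what later lets the worker shift per-chunk timestamps back onto the full media timeline.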

This system is optimized for Mongolian speech recognition and designed with SaaS scalability in mind, so it can be extended to other languages or speech models.

Project Background

While creating my own YouTube videos, I quickly realized how limited and expensive Mongolian transcription tools are.
Existing apps were either inaccurate or locked behind costly subscriptions, and there was no simple solution for generating subtitles in Mongolian. Out of curiosity, I asked ChatGPT if it would be possible to build my own transcription application — and to my surprise, it said yes.
That conversation sparked this entire project: I decided to build a complete, end-to-end system myself.

Key Features

- Upload any video/audio file and get an .srt subtitle file automatically.
- Uses Google Speech-to-Text v2 (latest_long) model for accurate Mongolian transcription.
- Real-time progress tracking via job polling and toast notifications.
- Modular architecture with Docker Compose for local development.
- CORS-safe presigned uploads to S3-compatible storage (MinIO), so the browser sends files directly to the bucket.
- Designed for deployment to Google Cloud Run in production (planned).
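
The job-polling feature listed above boils down to a small loop: fetch the job status, report progress, and stop on a terminal state or a timeout. The sketch below shows that logic in Python for consistency with the backend; the real frontend would do the equivalent in TypeScript. The status field names (`state`, `progress`), the state values, and the function signature are assumptions, not the app's actual API.

```python
import time
from typing import Callable

# Assumed terminal state names; the real job schema may differ.
TERMINAL_STATES = {"done", "failed"}


def poll_job(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 600.0,
    on_progress: Callable[[dict], None] = lambda status: None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status repeatedly until the job reaches a terminal state.

    fetch_status is injected so the loop stays independent of any HTTP client;
    sleep and clock are injectable to make the loop easy to test.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        on_progress(status)  # e.g. update a progress bar or show a toast
        if status.get("state") in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval)
```

In practice `fetch_status` would wrap a GET request to a hypothetical job endpoint, and `on_progress` would drive the progress UI.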

Frontend: Next.js, TypeScript, Tailwind CSS, React Hooks

Backend: FastAPI (Python), PostgreSQL, Redis, MinIO, FFmpeg

AI/Cloud: Google Cloud Speech-to-Text v2

Infrastructure: Docker, Docker Compose, Cloud Run (planned)

Tools: GitHub Actions (planned CI/CD), Stripe, Firebase, Firestore

Future Development Roadmap

Phase 1 — Cloud Migration

- Deploy backend and worker to Google Cloud Run
- Replace MinIO with Google Cloud Storage
- Migrate metadata from Postgres to Firestore

Phase 2 — SaaS Features

- Add Google Sign-In via Firebase Auth
- Integrate Stripe Checkout for subscription tiers (Starter / Pro)
- Add automatic usage tracking and limits

Phase 3 — Advanced Speech Features

- Enable long-running (batch) recognition to handle audio longer than synchronous requests allow
- Support custom vocabulary hints to boost accuracy
- Generate multi-segment SRT with precise timestamps
- Optional “Polish mode” for enhanced text formatting and punctuation
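
As an illustration of the multi-segment SRT item, the sketch below merges per-chunk recognition results into one SRT document by shifting each segment by its chunk's offset. The input shape (a list of `(chunk_offset, [(start, end, text), ...])` tuples with chunk-relative times) and the function names are assumptions for the example. The cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) follows the usual SubRip convention.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_srt(chunks: list[tuple[float, list[tuple[float, float, str]]]]) -> str:
    """Merge chunk-relative segments into a single SRT string.

    Each item is (chunk_offset_seconds, segments), where every segment is
    (start, end, text) measured from the beginning of its own chunk.
    """
    cues = []
    for offset, segments in chunks:
        for start, end, text in segments:
            text = text.strip()
            if text:  # skip empty recognition results
                cues.append((offset + start, offset + end, text))
    cues.sort(key=lambda cue: cue[0])  # order cues on the full-media timeline

    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n"
            f"{format_srt_timestamp(start)} --> {format_srt_timestamp(end)}\n"
            f"{text}"
        )
    return "\n\n".join(blocks) + "\n" if blocks else ""
```

For example, a segment at 1.0 to 3.25 seconds inside a chunk that starts at 55 seconds becomes a cue running from `00:00:56,000` to `00:00:58,250`.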

Phase 4 — Monitoring & Polish

- Add Cloud Logging & Error Reporting dashboards
- Improve UI design and marketing landing page
- Add team/workspace support for collaborative projects

Want to get in touch? Drop me a line!