OpenAI: Multimodal Applications with GPT-4o
Build multimodal AI applications with GPT-4o vision, Whisper audio transcription, TTS speech synthesis, and gpt-image-1 image generation.
Course Curriculum
10 Lessons
Vision: Images & Document Understanding
Learn how GPT-4o processes images using multimodal inputs. Covers URL-based and base64 image encoding, detail levels (low/high) and their token impact, and structured output extraction from visual content like invoices and documents.
Vision Applications
Practice using GPT-4o vision API by analyzing images via URL and base64 encoding, comparing detail levels, and extracting structured data from invoice images using Pydantic models.
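As a preview of the pattern this lesson teaches, here is a minimal sketch of a base64 vision request with a configurable detail level. It assumes the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the helper names (`encode_image_b64`, `build_vision_message`, `analyze_image`) are illustrative, not part of the SDK.

```python
import base64


def encode_image_b64(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_vision_message(prompt: str, image_b64: str, detail: str = "low") -> list:
    """Build a chat message mixing text and an inline (data URL) image.

    detail="low" caps the image at a small fixed token cost; detail="high"
    tiles the image for finer analysis at a higher token cost.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    "detail": detail,
                },
            },
        ],
    }]


def analyze_image(path: str, prompt: str) -> str:
    """Send a local image to GPT-4o and return the model's text reply."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_vision_message(prompt, encode_image_b64(path)),
    )
    return response.choices[0].message.content
```

For URL-based input, replace the `data:` URL with a plain `https://` image URL; the message shape is otherwise identical.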
Audio Input & Speech Recognition with Whisper
Learn to transcribe and translate audio using OpenAI Whisper. Covers basic transcription, word-level timestamps, translation to English, and building voice command routing pipelines by combining Whisper with GPT classification.
Speech Recognition Pipeline
Practice building audio transcription workflows with Whisper: basic transcription, word-level timestamps, multilingual translation, and a voice command routing system powered by Whisper + GPT classification.
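A minimal sketch of the transcription and command-routing pattern from this lesson, assuming the `openai` Python SDK. The command labels and the `normalize_label` helper are illustrative assumptions; the client is passed in so the routing logic can be exercised without network access.

```python
# Hypothetical command vocabulary for the voice-routing example.
COMMANDS = ["set_timer", "play_music", "send_message", "unknown"]


def transcribe(client, audio_path: str, word_timestamps: bool = False):
    """Transcribe an audio file with Whisper, optionally with word-level timestamps."""
    kwargs = {"model": "whisper-1"}
    if word_timestamps:
        # verbose_json is required to receive timestamp data
        kwargs.update(response_format="verbose_json",
                      timestamp_granularities=["word"])
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(file=f, **kwargs)


def translate_to_english(client, audio_path: str):
    """Translate non-English speech directly to English text."""
    with open(audio_path, "rb") as f:
        return client.audio.translations.create(model="whisper-1", file=f)


def normalize_label(raw: str) -> str:
    """Coerce the classifier's reply onto a known label, defaulting to 'unknown'."""
    label = raw.strip().lower()
    return label if label in COMMANDS else "unknown"


def classify_command(client, transcript: str) -> str:
    """Ask a GPT model to map a transcript onto one of the command labels."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the user's request as one of: "
                        f"{', '.join(COMMANDS)}. Reply with the label only."},
            {"role": "user", "content": transcript},
        ],
    )
    return normalize_label(resp.choices[0].message.content)
```

Routing then becomes a dictionary lookup from the returned label to a handler function.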
Audio Output: Text-to-Speech
Learn to generate high-quality speech with OpenAI TTS models. Covers standard synthesis with tts-1, all six built-in voices, expressive speech with gpt-4o-mini-tts style instructions, and low-latency audio streaming.
Voice Output Applications
Practice generating speech with OpenAI TTS: synthesize audio with tts-1, compare all six voices, create expressive speech with style instructions using gpt-4o-mini-tts, and implement low-latency audio streaming.
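A minimal sketch of the synthesis patterns in this lesson, assuming the `openai` Python SDK; `synthesize` and `synthesize_expressive` are illustrative helper names. The streaming variant uses the SDK's `with_streaming_response` helper so audio bytes are written to disk as they arrive.

```python
# The six built-in voices covered in the lesson.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


def synthesize(text: str, voice: str = "alloy", out_path: str = "speech.mp3") -> str:
    """Generate speech with tts-1 and stream it to a file with low latency."""
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {VOICES}")
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    # Streaming response: chunks are written as they arrive rather than
    # buffering the whole file in memory first.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


def synthesize_expressive(text: str, style: str, out_path: str = "speech.mp3") -> str:
    """Generate expressive speech with gpt-4o-mini-tts using a style instruction."""
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input=text,
        instructions=style,  # e.g. "Speak slowly, in a warm, reassuring tone."
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```

Comparing voices is then a simple loop over `VOICES`, writing one file per voice.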
Image Generation with gpt-image-1
Learn to generate and edit images using OpenAI's gpt-image-1 model. Covers text-to-image generation, quality and size settings, inpainting with masks, and batch generation of multiple variations in a single API call.
Image Generation Applications
Practice generating and editing images with gpt-image-1: create images from text prompts, compare quality settings, perform inpainting edits with masks, and generate multiple image variations in a single API call.
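A minimal sketch of generation and inpainting with gpt-image-1, assuming the `openai` Python SDK; the helper names and the size set are illustrative. Unlike earlier image models, gpt-image-1 returns base64-encoded image data rather than URLs, so the helpers decode to raw bytes.

```python
import base64

# Sizes accepted by gpt-image-1 (assumption based on the lesson's settings coverage).
VALID_SIZES = {"1024x1024", "1536x1024", "1024x1536", "auto"}


def generate_images(prompt: str, n: int = 1, size: str = "1024x1024",
                    quality: str = "medium") -> list:
    """Generate n image variations in one API call; returns decoded image bytes."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {sorted(VALID_SIZES)}")
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    result = client.images.generate(
        model="gpt-image-1",
        prompt=prompt,
        n=n,              # batch generation: multiple variations per call
        size=size,
        quality=quality,  # e.g. "low", "medium", "high"
    )
    return [base64.b64decode(item.b64_json) for item in result.data]


def inpaint(image_path: str, mask_path: str, prompt: str) -> bytes:
    """Edit only the transparent region of the mask, guided by the prompt."""
    from openai import OpenAI

    client = OpenAI()
    with open(image_path, "rb") as img, open(mask_path, "rb") as mask:
        result = client.images.edit(
            model="gpt-image-1", image=img, mask=mask, prompt=prompt,
        )
    return base64.b64decode(result.data[0].b64_json)
```

Writing the returned bytes with `open(path, "wb").write(data)` yields a viewable image file.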
Capstone Briefing: Multimodal Meeting Assistant
Architecture overview of the capstone project: a meeting assistant that transcribes audio with Whisper, analyzes slides with GPT-4o vision, generates structured reports with GPT-4.1, and narrates summaries with TTS — demonstrating orchestration of all multimodal capabilities.
Capstone Project: Multimodal Meeting Assistant
Build a complete multimodal meeting assistant that transcribes audio with Whisper, analyzes presentation slides with GPT-4o, generates a structured JSON report with GPT-4.1, and narrates the executive summary with expressive TTS.
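The capstone's orchestration can be sketched as a pipeline that wires the four stages together. This sketch injects each stage as a callable so the flow can be tested with stubs before the real Whisper, GPT-4o, GPT-4.1, and TTS calls are plugged in; all names here are illustrative, not prescribed by the course.

```python
from dataclasses import dataclass


@dataclass
class MeetingReport:
    """Aggregated output of the four pipeline stages."""
    transcript: str
    slide_notes: list
    summary: str
    audio_path: str


def run_meeting_assistant(audio_path, slide_paths, *,
                          transcribe, analyze_slide, summarize, narrate):
    """Orchestrate the multimodal pipeline.

    transcribe(audio_path)            -> str   (Whisper stage)
    analyze_slide(image_path)         -> str   (GPT-4o vision stage, per slide)
    summarize(transcript, notes)      -> str   (GPT-4.1 structured-report stage)
    narrate(summary)                  -> str   (TTS stage; returns audio file path)
    """
    transcript = transcribe(audio_path)
    slide_notes = [analyze_slide(p) for p in slide_paths]
    summary = summarize(transcript, slide_notes)
    audio_path_out = narrate(summary)
    return MeetingReport(transcript, slide_notes, summary, audio_path_out)
```

Dependency injection keeps each stage independently swappable, which mirrors how the capstone lessons build the stages one API at a time.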