OpenAI: Multimodal Applications with GPT-4o
Build multimodal AI applications with GPT-4o vision, Whisper audio transcription, TTS speech synthesis, and gpt-image-1 image generation.
Course Curriculum
10 Lessons
Vision: Images & Document Understanding
Learn how GPT-4o processes images using multimodal inputs. Covers URL-based and base64 image encoding, detail levels (low/high) and their token impact, and structured output extraction from visual content like invoices and documents.
Vision Applications
Practice using GPT-4o vision API by analyzing images via URL and base64 encoding, comparing detail levels, and extracting structured data from invoice images using Pydantic models.
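As a preview of the pattern this lesson teaches, here is a minimal sketch of a base64 vision request with a configurable detail level. It assumes the official `openai` Python SDK and an `OPENAI_API_KEY` in the environment; the helper names (`encode_image_b64`, `build_vision_message`, `analyze_image`) are illustrative, not part of the SDK.

```python
import base64


def encode_image_b64(path: str) -> str:
    """Read an image file and return its base64-encoded contents."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def build_vision_message(prompt: str, image_b64: str, detail: str = "low") -> list:
    """Build a chat message mixing text and an inline (data URL) image.

    detail="low" caps the image at a small fixed token cost; detail="high"
    tiles the image for finer analysis at a higher token cost.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{image_b64}",
                    "detail": detail,
                },
            },
        ],
    }]


def analyze_image(path: str, prompt: str) -> str:
    """Send a local image to GPT-4o and return the model's text reply."""
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=build_vision_message(prompt, encode_image_b64(path)),
    )
    return response.choices[0].message.content
```

For URL-based input, replace the `data:` URL with a plain `https://` image URL; the message shape is otherwise identical.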
Audio Input & Speech Recognition with Whisper
Learn to transcribe and translate audio using OpenAI Whisper. Covers basic transcription, word-level timestamps, translation to English, and building voice command routing pipelines by combining Whisper with GPT classification.
Speech Recognition Pipeline
Practice building audio transcription workflows with Whisper: basic transcription, word-level timestamps, multilingual translation, and a voice command routing system powered by Whisper + GPT classification.
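A minimal sketch of the transcription and command-routing pattern from this lesson, assuming the `openai` Python SDK. The command labels and the `normalize_label` helper are illustrative assumptions; the client is passed in so the routing logic can be exercised without network access.

```python
# Hypothetical command vocabulary for the voice-routing example.
COMMANDS = ["set_timer", "play_music", "send_message", "unknown"]


def transcribe(client, audio_path: str, word_timestamps: bool = False):
    """Transcribe an audio file with Whisper, optionally with word-level timestamps."""
    kwargs = {"model": "whisper-1"}
    if word_timestamps:
        # verbose_json is required to receive timestamp data
        kwargs.update(response_format="verbose_json",
                      timestamp_granularities=["word"])
    with open(audio_path, "rb") as f:
        return client.audio.transcriptions.create(file=f, **kwargs)


def translate_to_english(client, audio_path: str):
    """Translate non-English speech directly to English text."""
    with open(audio_path, "rb") as f:
        return client.audio.translations.create(model="whisper-1", file=f)


def normalize_label(raw: str) -> str:
    """Coerce the classifier's reply onto a known label, defaulting to 'unknown'."""
    label = raw.strip().lower()
    return label if label in COMMANDS else "unknown"


def classify_command(client, transcript: str) -> str:
    """Ask a GPT model to map a transcript onto one of the command labels."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the user's request as one of: "
                        f"{', '.join(COMMANDS)}. Reply with the label only."},
            {"role": "user", "content": transcript},
        ],
    )
    return normalize_label(resp.choices[0].message.content)
```

Routing then becomes a dictionary lookup from the returned label to a handler function.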
Audio Output: Text-to-Speech
Learn to generate high-quality speech with OpenAI TTS models. Covers standard synthesis with tts-1, all six built-in voices, expressive speech with gpt-4o-mini-tts style instructions, and low-latency audio streaming.
Voice Output Applications
Practice generating speech with OpenAI TTS: synthesize audio with tts-1, compare all six voices, create expressive speech with style instructions using gpt-4o-mini-tts, and implement low-latency audio streaming.
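A minimal sketch of the synthesis patterns in this lesson, assuming the `openai` Python SDK; `synthesize` and `synthesize_expressive` are illustrative helper names. The streaming variant uses the SDK's `with_streaming_response` helper so audio bytes are written to disk as they arrive.

```python
# The six built-in voices covered in the lesson.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]


def synthesize(text: str, voice: str = "alloy", out_path: str = "speech.mp3") -> str:
    """Generate speech with tts-1 and stream it to a file with low latency."""
    if voice not in VOICES:
        raise ValueError(f"voice must be one of {VOICES}")
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    # Streaming response: chunks are written as they arrive rather than
    # buffering the whole file in memory first.
    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice=voice,
        input=text,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


def synthesize_expressive(text: str, style: str, out_path: str = "speech.mp3") -> str:
    """Generate expressive speech with gpt-4o-mini-tts using a style instruction."""
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="nova",
        input=text,
        instructions=style,  # e.g. "Speak slowly, in a warm, reassuring tone."
    ) as response:
        response.stream_to_file(out_path)
    return out_path
```

Comparing voices is then a simple loop over `VOICES`, writing one file per voice.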
Image Generation with gpt-image-1
Learn to generate and edit images using OpenAI's gpt-image-1 model. Covers text-to-image generation, quality and size settings, inpainting with masks, and batch generation of multiple variations in a single API call.
Image Generation Applications
Practice generating and editing images with gpt-image-1: create images from text prompts, compare quality settings, perform inpainting edits with masks, and generate multiple image variations in a single API call.
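A minimal sketch of generation and inpainting with gpt-image-1, assuming the `openai` Python SDK; the helper names and the size set are illustrative. Unlike earlier image models, gpt-image-1 returns base64-encoded image data rather than URLs, so the helpers decode to raw bytes.

```python
import base64

# Sizes accepted by gpt-image-1 (assumption based on the lesson's settings coverage).
VALID_SIZES = {"1024x1024", "1536x1024", "1024x1536", "auto"}


def generate_images(prompt: str, n: int = 1, size: str = "1024x1024",
                    quality: str = "medium") -> list:
    """Generate n image variations in one API call; returns decoded image bytes."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {sorted(VALID_SIZES)}")
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    result = client.images.generate(
        model="gpt-image-1",
        prompt=prompt,
        n=n,              # batch generation: multiple variations per call
        size=size,
        quality=quality,  # e.g. "low", "medium", "high"
    )
    return [base64.b64decode(item.b64_json) for item in result.data]


def inpaint(image_path: str, mask_path: str, prompt: str) -> bytes:
    """Edit only the transparent region of the mask, guided by the prompt."""
    from openai import OpenAI

    client = OpenAI()
    with open(image_path, "rb") as img, open(mask_path, "rb") as mask:
        result = client.images.edit(
            model="gpt-image-1", image=img, mask=mask, prompt=prompt,
        )
    return base64.b64decode(result.data[0].b64_json)
```

Writing the returned bytes with `open(path, "wb").write(data)` yields a viewable image file.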
Capstone Briefing: Multimodal Meeting Assistant
Architecture overview of the capstone project: a meeting assistant that transcribes audio with Whisper, analyzes slides with GPT-4o vision, generates structured reports with GPT-4.1, and narrates summaries with TTS — demonstrating orchestration of all multimodal capabilities.
Capstone Project: Multimodal Meeting Assistant
Build a complete multimodal meeting assistant that transcribes audio with Whisper, analyzes presentation slides with GPT-4o, generates a structured JSON report with GPT-4.1, and narrates the executive summary with expressive TTS.
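The capstone's orchestration can be sketched as a pipeline that wires the four stages together. This sketch injects each stage as a callable so the flow can be tested with stubs before the real Whisper, GPT-4o, GPT-4.1, and TTS calls are plugged in; all names here are illustrative, not prescribed by the course.

```python
from dataclasses import dataclass


@dataclass
class MeetingReport:
    """Aggregated output of the four pipeline stages."""
    transcript: str
    slide_notes: list
    summary: str
    audio_path: str


def run_meeting_assistant(audio_path, slide_paths, *,
                          transcribe, analyze_slide, summarize, narrate):
    """Orchestrate the multimodal pipeline.

    transcribe(audio_path)            -> str   (Whisper stage)
    analyze_slide(image_path)         -> str   (GPT-4o vision stage, per slide)
    summarize(transcript, notes)      -> str   (GPT-4.1 structured-report stage)
    narrate(summary)                  -> str   (TTS stage; returns audio file path)
    """
    transcript = transcribe(audio_path)
    slide_notes = [analyze_slide(p) for p in slide_paths]
    summary = summarize(transcript, slide_notes)
    audio_path_out = narrate(summary)
    return MeetingReport(transcript, slide_notes, summary, audio_path_out)
```

Dependency injection keeps each stage independently swappable, which mirrors how the capstone lessons build the stages one API at a time.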