AI Instructor · Live Labs Included

OpenAI: Multimodal Applications with GPT-4o

Build multimodal AI applications with GPT-4o vision, Whisper audio transcription, TTS speech synthesis, and gpt-image-1 image generation.

Intermediate
4h 40m
10 Lessons
OPENAI-203
OpenAI Multimodal Developer Badge

About This Course

Build real-world applications that combine vision, audio, and image generation using the OpenAI API. Learn to analyze images and documents with GPT-4o, transcribe and translate audio with Whisper, generate expressive speech with TTS, create and edit images with gpt-image-1, and orchestrate all modalities in a multimodal meeting assistant capstone project.

Course Curriculum

10 Lessons
01
AI Lesson

Vision: Images & Document Understanding

20m

Learn how GPT-4o processes images using multimodal inputs. Covers URL-based and base64 image encoding, detail levels (low/high) and their token impact, and structured output extraction from visual content such as invoices and documents.
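
A minimal sketch of the URL-based pattern this lesson covers, assuming the OPENAI_API_KEY environment variable is set; the image URL is hypothetical, and the detail field is where the low/high token trade-off is made:

    # Sketch: GPT-4o vision via an image URL (hypothetical example URL).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://example.com/photo.jpg",  # hypothetical URL
                            "detail": "low",  # "low" caps token cost; "high" tiles the image for detail
                        },
                    },
                ],
            }
        ],
    )
    print(response.choices[0].message.content)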

02
Lab Exercise

Vision Applications

30m · 4 Exercises

Practice using GPT-4o vision API by analyzing images via URL and base64 encoding, comparing detail levels, and extracting structured data from invoice images using Pydantic models.

  • Analyze Image via URL: Send an image URL to GPT-4o using the multimodal content format and return the text description. (~7 min)
  • Base64 Image Encoding & Analysis: Encode a local image file as base64 and analyze it with GPT-4o using the data URI format. (~7 min)
  • Detail Levels & Token Impact: Compare low vs. high detail image analysis and measure the token cost difference for each setting. (~7 min)
  • Structured Invoice Parsing: Extract structured invoice data from an image using GPT-4o and responses.parse() with a Pydantic ParsedInvoice model, as sketched after this list. (~9 min)
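
The structured-parsing exercise follows a pattern like this sketch, assuming a recent openai SDK with the Responses API; the ParsedInvoice fields and the invoice.png path are illustrative, not the course's exact schema:

    # Sketch: extract typed invoice data from a base64-encoded image.
    import base64

    from openai import OpenAI
    from pydantic import BaseModel

    class ParsedInvoice(BaseModel):  # illustrative fields
        vendor: str
        invoice_number: str
        total: float

    client = OpenAI()

    with open("invoice.png", "rb") as f:  # hypothetical local file
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.responses.parse(
        model="gpt-4o",
        input=[
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "Extract the invoice fields."},
                    {"type": "input_image", "image_url": f"data:image/png;base64,{b64}"},
                ],
            }
        ],
        text_format=ParsedInvoice,
    )
    print(response.output_parsed)  # a ParsedInvoice instance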
03
AI Lesson

Audio Input & Speech Recognition with Whisper

20m

Learn to transcribe and translate audio using OpenAI Whisper. Covers basic transcription, word-level timestamps, translation to English, and building voice command routing pipelines by combining Whisper with GPT classification.
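
Both endpoints this lesson introduces take an open audio file handle; a minimal sketch, assuming local MP3 files with these hypothetical names:

    # Sketch: Whisper transcription and translation-to-English.
    from openai import OpenAI

    client = OpenAI()

    # Transcription: speech in, text out in the source language.
    with open("meeting.mp3", "rb") as audio:  # hypothetical file
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)
    print(transcript.text)

    # Translation: non-English speech in, English text out, one call.
    with open("interview_fr.mp3", "rb") as audio:  # hypothetical file
        translation = client.audio.translations.create(model="whisper-1", file=audio)
    print(translation.text)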

04
Lab Exercise

Speech Recognition Pipeline

30m · 4 Exercises

Practice building audio transcription workflows with Whisper: basic transcription, word-level timestamps, multilingual translation, and a voice command routing system powered by Whisper + GPT classification.

  • Basic Audio Transcription: Transcribe an MP3 audio file using whisper-1 and return the plain-text transcript. (~7 min)
  • Transcription with Word Timestamps: Get verbose JSON transcription output with word-level timestamps, including start/end times and duration (see the sketch after this list). (~7 min)
  • Multilingual Audio Translation: Translate non-English audio to English using the Whisper translations endpoint in a single API call. (~7 min)
  • Voice Command Routing Pipeline: Build an end-to-end voice command system: transcribe audio, classify intent with GPT-4.1-mini, and route to the appropriate handler. (~9 min)
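
The word-timestamp exercise hinges on one detail: timestamp_granularities is only honored with response_format="verbose_json". A sketch, with a hypothetical file name:

    # Sketch: word-level timestamps from whisper-1.
    from openai import OpenAI

    client = OpenAI()

    with open("speech.mp3", "rb") as audio:  # hypothetical file
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio,
            response_format="verbose_json",  # required for timestamp granularities
            timestamp_granularities=["word"],
        )

    for word in result.words:
        print(f"{word.start:6.2f}s - {word.end:6.2f}s  {word.word}")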
05
AI Lesson

Audio Output: Text-to-Speech

20m

Learn to generate high-quality speech with OpenAI TTS models. Covers standard synthesis with tts-1, all six built-in voices, expressive speech with gpt-4o-mini-tts style instructions, and low-latency audio streaming.
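
A minimal sketch of standard synthesis, assuming OPENAI_API_KEY is set; the input text and output path are illustrative:

    # Sketch: synthesize speech with tts-1 and save it as MP3.
    from openai import OpenAI

    client = OpenAI()

    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",  # one of: alloy, echo, fable, onyx, nova, shimmer
        input="Welcome to the multimodal applications course.",
    ) as response:
        response.stream_to_file("welcome.mp3")  # hypothetical output path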

06
Lab Exercise

Voice Output Applications

30m · 4 Exercises

Practice generating speech with OpenAI TTS: synthesize audio with tts-1, compare all six voices, create expressive speech with style instructions using gpt-4o-mini-tts, and implement low-latency audio streaming.

  • Generate Speech with tts-1: Generate audio from text using tts-1 with the alloy voice and save the MP3 file using stream_to_file. (~7 min)
  • Compare All Six Voices: Generate the same text with all six built-in voices (alloy, echo, fable, onyx, nova, shimmer) and save each to a separate file. (~7 min)
  • Expressive Speech with gpt-4o-mini-tts: Use the instructions parameter with gpt-4o-mini-tts to generate expressive speech with different speaking styles and tones. (~7 min)
  • Stream Audio for Low Latency: Stream TTS audio in chunks using with_streaming_response and iter_bytes, measuring time-to-first-byte for latency analysis (sketched after this list). (~9 min)
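
The streaming exercise can be approximated with a sketch like this, timing the first chunk as it arrives; the chunk size and file names are illustrative:

    # Sketch: stream TTS audio and measure time-to-first-byte.
    import time

    from openai import OpenAI

    client = OpenAI()

    start = time.monotonic()
    first_byte_at = None

    with client.audio.speech.with_streaming_response.create(
        model="tts-1",
        voice="alloy",
        input="Streaming keeps playback latency low for long passages.",
    ) as response:
        with open("streamed.mp3", "wb") as out:  # hypothetical output path
            for chunk in response.iter_bytes(chunk_size=4096):
                if first_byte_at is None:
                    first_byte_at = time.monotonic() - start  # latency to first audio
                out.write(chunk)

    print(f"time to first byte: {first_byte_at:.3f}s")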
07
AI Lesson

Image Generation with gpt-image-1

20m

Learn to generate and edit images using OpenAI's gpt-image-1 model. Covers text-to-image generation, quality and size settings, inpainting with masks, and batch generation of multiple variations in a single API call.
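
A minimal sketch of text-to-image generation; gpt-image-1 returns base64 image data, and the prompt and output path here are illustrative:

    # Sketch: generate a PNG with gpt-image-1 and save it to disk.
    import base64

    from openai import OpenAI

    client = OpenAI()

    result = client.images.generate(
        model="gpt-image-1",
        prompt="A watercolor lighthouse at dawn",  # illustrative prompt
        size="1024x1024",
        quality="medium",  # "low" | "medium" | "high"
    )

    with open("lighthouse.png", "wb") as f:  # hypothetical output path
        f.write(base64.b64decode(result.data[0].b64_json))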

08
Lab Exercise

Image Generation Applications

30m · 4 Exercises

Practice generating and editing images with gpt-image-1: create images from text prompts, compare quality settings, perform inpainting edits with masks, and generate multiple image variations in a single API call.

  • Generate Image from Prompt: Generate a PNG image from a text prompt using gpt-image-1, decode the base64 response, and save it to disk. (~7 min)
  • Quality & Size Settings: Generate images at different quality levels (low, medium, high) and sizes, comparing file sizes and token usage across settings. (~7 min)
  • Image Editing with Inpainting: Edit an existing image using inpainting: provide a base image and an RGBA mask, then describe what to paint into the transparent area (see the sketch after this list). (~7 min)
  • Generate Multiple Image Variations: Generate multiple image variations of the same prompt in a single API call using the n parameter, saving each variation to a separate file. (~9 min)
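
The inpainting exercise pairs a base image with an RGBA mask whose transparent pixels mark the region to repaint; a sketch with hypothetical file names:

    # Sketch: inpainting with gpt-image-1 via images.edit.
    import base64

    from openai import OpenAI

    client = OpenAI()

    result = client.images.edit(
        model="gpt-image-1",
        image=open("room.png", "rb"),      # hypothetical base image
        mask=open("room_mask.png", "rb"),  # transparent where the edit goes
        prompt="Add a potted fern in the empty corner",
    )

    with open("room_edited.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))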
09
AI Lesson

Capstone Briefing: Multimodal Meeting Assistant

20m

Architecture overview of the capstone project: a meeting assistant that transcribes audio with Whisper, analyzes slides with GPT-4o vision, generates structured reports with GPT-4.1, and narrates summaries with TTS — demonstrating orchestration of all multimodal capabilities.
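
The report-generation stage described above might look like this sketch, assuming the Responses API and illustrative MeetingReport fields (the course defines its own schema):

    # Sketch: combine transcript and slide notes into a typed report.
    from openai import OpenAI
    from pydantic import BaseModel

    class MeetingReport(BaseModel):  # illustrative fields
        executive_summary: str
        decisions: list[str]
        action_items: list[str]

    client = OpenAI()

    def generate_meeting_report(transcript: str, slide_notes: str) -> MeetingReport:
        response = client.responses.parse(
            model="gpt-4.1",
            input=(
                "Summarize this meeting.\n\n"
                f"Transcript:\n{transcript}\n\nSlides:\n{slide_notes}"
            ),
            text_format=MeetingReport,
        )
        return response.output_parsed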

10
Lab Exercise

Capstone Project: Multimodal Meeting Assistant

1h 0m · 5 Exercises

Build a complete multimodal meeting assistant that transcribes audio with Whisper, analyzes presentation slides with GPT-4o, generates a structured JSON report with GPT-4.1, and narrates the executive summary with expressive TTS.

  • Transcribe Meeting Audio: Implement transcribe_meeting() using Whisper with verbose_json and word+segment timestamp granularities to get a full timestamped transcript. (~10 min)
  • Analyze Meeting Slide: Implement analyze_slide() to extract structured data from a presentation slide image using GPT-4o vision and the SlideAnalysis Pydantic model. (~10 min)
  • Generate Structured Meeting Report: Implement generate_meeting_report() combining the transcript and slide analysis with GPT-4.1 to produce a structured MeetingReport with action items and decisions. (~10 min)
  • Narrate Meeting Summary: Implement narrate_summary() to convert the MeetingReport into an expressive MP3 audio summary using gpt-4o-mini-tts with professional narration instructions. (~10 min)
  • Orchestrate Full Meeting Pipeline: Implement run_meeting_assistant() to chain all four steps: transcription, slide analysis, report generation, and TTS narration, saving all output files (sketched after this list). (~10 min)
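
The final exercise is plain function composition; a sketch of run_meeting_assistant(), assuming the four helpers above are implemented as described in the earlier exercises (their exact signatures here are hypothetical):

    # Sketch: chain the four capstone stages end to end.
    from pathlib import Path

    def run_meeting_assistant(audio_path: str, slide_path: str, out_dir: str = "output"):
        Path(out_dir).mkdir(exist_ok=True)

        transcript = transcribe_meeting(audio_path)          # Whisper, timestamped
        slide = analyze_slide(slide_path)                    # GPT-4o vision -> SlideAnalysis
        report = generate_meeting_report(transcript, slide)  # GPT-4.1 -> MeetingReport
        narrate_summary(report, f"{out_dir}/summary.mp3")    # gpt-4o-mini-tts narration

        # Persist the structured report alongside the narrated summary.
        Path(f"{out_dir}/report.json").write_text(report.model_dump_json(indent=2))
        return report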

This course includes:

  • 24/7 AI Instructor Support
  • Live Lab Environments
  • 5 Hands-on Labs
  • 6 Months Access
  • Completion Badge
  • Certificate of Completion
OpenAI Multimodal Developer Badge

Earn Your Badge

Complete all lessons to unlock the OpenAI Multimodal Developer achievement badge.

Category
Skill Level Intermediate
Total Duration 4h 40m
Achievement Badge

OpenAI Multimodal Developer

Awarded for completing Multimodal Applications with GPT-4o. Demonstrates ability to analyze images with GPT-4o vision, transcribe and translate audio with Whisper, generate expressive speech with TTS, create and edit images with gpt-image-1, and orchestrate multimodal pipelines.

Course OpenAI: Multimodal Applications with GPT-4o
Criteria Complete all lessons and exercises in OPENAI-203: Multimodal Applications with GPT-4o
Valid For 730 days

Skills You'll Earn

GPT-4o Vision · Whisper · Text-to-Speech · Image Generation · Multimodal AI · Pipeline Orchestration

Complete all lessons in this course to earn this badge