How to Use Synthesia to Make a Training Video With an AI Avatar

March 29, 2026 · Editorial Team · 6 min read · synthesia ai-video training-video

Corporate training videos have a reputation problem. They're usually too long, filmed once under fluorescent lights by someone who didn't want to be on camera, and never updated when the process changes. Synthesia solves most of that. You paste in a script, pick an avatar, and get a presenter-led video without a camera, studio, or on-screen talent.

I've used it to build onboarding modules for teams where consistent delivery matters more than production value. Being able to update a video in minutes without re-recording anything is the feature that justifies the cost on its own.

Setting up a new project

Log into your Synthesia workspace and click New Video. You have two paths:

Start from a template: recommended for first-timers. Synthesia has category-specific templates for onboarding, compliance training, product walkthroughs, and more.
Start from scratch: blank canvas, full control.

For most training videos, starting from a template is faster. The templates handle scene layout, text placement, and basic pacing. You're just swapping in your content.

If you choose a template, pick one that matches your tone. Synthesia's "Corporate Clean" and "Modern Minimal" templates work well for internal training. The more stylized templates (split-screen, news desk) can feel distracting when the goal is knowledge transfer.

Picking an avatar

The avatar library in Synthesia has over 150 options as of early 2026. For training content, the selection criteria are:

Demographic match: pick an avatar that reflects your team's makeup if possible. Synthesia's library includes avatars across many ages, ethnicities, and presentation styles.
Formality: some avatars are in casual attire, some in business dress, some in scrubs or workwear. Match to your training context.
Expression range: some avatars have higher expressiveness scores (visible in the avatar detail view). More expressive avatars work better for conversational topics; flatter ones are fine for compliance or procedural content.

You can also create a custom avatar from your own video footage, though this requires a separate recording session and takes a few business days for Synthesia to process.

One practical tip: audition a few avatars with the same 30-word test script before committing to your project avatar. The difference in lip sync quality and gesture naturalness between avatars is real and matters over the length of a 5-minute training video.

Script-to-video workflow

This is the core of Synthesia's workflow. Each video is made up of scenes, and each scene has:

A text script that the avatar reads aloud
A background (color, image, or video)
Optional overlay elements (bullet points, logos, images, screen recordings)

Click into a scene and paste your script into the script field. The avatar reads exactly what you type. Punctuation drives pacing: commas create short pauses, periods create longer ones. If you want an explicit pause, you can use the SSML tag <break time="1s"/> inside the script text.
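
If you draft scripts outside the editor, you can add explicit pauses before pasting. Here is a minimal sketch, assuming you want the same pause between every paragraph of a draft. The function name add_pauses is made up for illustration, and nothing here talks to Synthesia; it only builds the text you would paste into the script field.

```python
def add_pauses(paragraphs, seconds=1):
    """Join script paragraphs with an SSML break tag between them."""
    tag = f'<break time="{seconds}s"/>'
    # Drop empty paragraphs so we never emit two pauses in a row.
    cleaned = [p.strip() for p in paragraphs if p.strip()]
    return f" {tag} ".join(cleaned)


script = add_pauses(["Welcome to the team.", "Let's start with your laptop setup."])
print(script)
# Welcome to the team. <break time="1s"/> Let's start with your laptop setup.
```

Preview the scene after pasting, since a pause that looks right on paper can feel long when spoken.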

A few script conventions that improve output quality:

Write for speech, not for reading. Short sentences. Active voice. No parenthetical asides.
Spell out acronyms the first time: "Service Level Agreement (SLA)" reads better than just "SLA."
Write numbers as words for cleaner delivery: "twenty-five percent" instead of "25%."
Test any jargon or proper nouns. Synthesia sometimes mispronounces unusual words. You can add phonetic corrections in the pronunciation dictionary (found under workspace Settings).
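
These conventions are easy to check mechanically before you paste anything. The sketch below is a rough linter of my own, not a Synthesia feature: the function name lint_script, the regular expressions, and the 20-word sentence limit are all assumptions you should tune to your own scripts.

```python
import re


def lint_script(text, max_words=20):
    """Flag patterns in a narration script that tend to read poorly aloud."""
    findings = []
    # Numbers written as digits, with an optional percent sign.
    for token in re.findall(r"\d+(?:[.,]\d+)*%?", text):
        findings.append(f"number written as digits: {token}")
    # Multi-word parentheticals; a bare "(SLA)" after its expansion is allowed.
    for aside in re.findall(r"\([^()]*\s[^()]*\)", text):
        findings.append(f"parenthetical aside: {aside}")
    # Acronyms are worth expanding on first use and auditioning for pronunciation.
    for acronym in sorted(set(re.findall(r"\b[A-Z]{2,}\b", text))):
        findings.append(f"acronym to check: {acronym}")
    # Long sentences; the default limit is an arbitrary starting point.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = len(sentence.split())
        if words > max_words:
            findings.append(f"long sentence ({words} words)")
    return findings


for finding in lint_script("Uptime rose 25% under the SLA (which we signed last year)."):
    print(finding)
# number written as digits: 25%
# parenthetical aside: (which we signed last year)
# acronym to check: SLA
```

It won't catch mispronounced proper nouns, so you still need to listen to a preview.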

Scene length follows from word count. A typical training presenter covers about 130 words per minute, so a scene with 100 words of script runs roughly 45 seconds. Keep individual scenes to 60 to 90 seconds at most; shorter scenes are easier to update later.
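
The arithmetic is simple enough to script. This is a rough estimate only, assuming the 130 words per minute figure above holds for your chosen voice; the function names are mine, and the word count is naive (an SSML tag would be counted as words).

```python
WORDS_PER_MINUTE = 130  # speaking pace quoted above; adjust for your voice


def scene_seconds(script, wpm=WORDS_PER_MINUTE):
    """Estimate narration time in seconds from the word count."""
    return len(script.split()) / wpm * 60


def too_long(script, limit_seconds=90):
    """True if the estimated narration exceeds the per-scene limit."""
    return scene_seconds(script) > limit_seconds


hundred_words = " ".join(["word"] * 100)
print(round(scene_seconds(hundred_words)))  # 46
print(too_long(hundred_words))  # False
```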

Using screen recordings and visuals

Training videos work better when the avatar explains something while viewers see it happen. Synthesia supports overlaying:

Screen recordings (MP4 or WebM, recorded separately)
Static images (JPG, PNG)
PDFs converted to images
Text callouts and bullet lists added directly in the editor

The layout editor lets you resize and reposition the avatar window and your overlay content. Common layouts:

Layout	Best for
Avatar full screen	Introductions, transitions
Avatar inset, content full screen	Software walkthroughs, showing documents
Split screen	Comparison, two-column information
Avatar centered, text below	Key takeaways, summaries

For software training, record your screen in a separate screen recorder (OBS or built-in system tools), then import the recording as a video overlay in Synthesia. The avatar narrates while the screen recording plays.
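
Before an upload session, it helps to sort a folder of assets into what can be imported directly and what needs converting first. A small sketch, assuming only the formats listed above are accepted; the mapping and the function name classify_assets are illustrative, so check Synthesia's current upload requirements before relying on it.

```python
from pathlib import Path

# Formats named in this article; the grouping is this sketch's own.
OVERLAY_FORMATS = {
    ".mp4": "screen recording",
    ".webm": "screen recording",
    ".jpg": "image",
    ".png": "image",
}


def classify_assets(paths):
    """Split file paths into usable overlays and files needing conversion."""
    usable, needs_conversion = {}, []
    for path in paths:
        kind = OVERLAY_FORMATS.get(Path(path).suffix.lower())
        if kind:
            usable[path] = kind
        else:
            needs_conversion.append(path)
    return usable, needs_conversion


usable, needs_conversion = classify_assets(["login.MP4", "policy.pdf", "logo.png"])
print(needs_conversion)  # ['policy.pdf']
```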

Multi-language videos

This is where Synthesia's value proposition gets very clear for global teams. Once your English video is done, you can duplicate it and change the script language. The avatar will deliver the translated script in the target language with matching lip sync.

Steps:

Click the three-dot menu on your project and select Duplicate.
Open the duplicate and change the script text to your translated version.
In the voice selector, choose a voice in the target language. Synthesia has voices for over 140 languages and accents.
If needed, change the avatar to one that matches the regional audience.
Generate.

The translation itself is not automatic. Synthesia doesn't currently auto-translate within the platform, so you supply the translated script, prepared with any translation tool you like. Language detection then ensures the lip sync engine matches the script language.

The practical outcome: a 10-module English onboarding series can become a 10-module Spanish series (or French, German, Arabic, Mandarin) with the same visual consistency, without hiring separate talent for each language.
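
Because you supply every translated script yourself, the bookkeeping grows quickly: ten modules in five languages is fifty scripts. A minimal tracking sketch, assuming you record finished translations as (module, language) pairs; the module names and the function missing_translations are hypothetical.

```python
def missing_translations(modules, languages, translated):
    """Return (module, language) pairs that still need a translated script.

    `translated` is a set of (module, language) pairs already prepared.
    """
    return [
        (module, lang)
        for lang in languages
        for module in modules
        if (module, lang) not in translated
    ]


modules = [f"onboarding-{n:02d}" for n in range(1, 11)]  # hypothetical names
done = {(m, "es") for m in modules} | {("onboarding-01", "fr")}
todo = missing_translations(modules, ["es", "fr"], done)
print(len(todo))  # 9
print(todo[0])  # ('onboarding-02', 'fr')
```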

Updating a video without re-recording

This is honestly the feature that makes Synthesia worth it for training content, because training content is never final. A process changes. A product is updated. A compliance requirement shifts.

In Synthesia, updating a video means changing text, not re-recording. Open the project, find the scene that needs updating, change the script or swap the overlay image, and regenerate. Only the changed scenes re-render. The avatar, voice, and style stay identical to the original.

A real example: an IT onboarding module I built originally referenced a software UI that changed significantly six months later. Updating it took about 20 minutes: swap the screen recording overlays, update the script text in three scenes, regenerate those scenes, re-export. No new recording session, no scheduling anyone on camera, no editing a timeline.
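
If you keep scene scripts in text files outside Synthesia, a quick comparison tells you which scenes to reopen. This is a sketch of that idea rather than anything built into the platform; it assumes one script string per scene, in scene order, and the function name changed_scenes is my own.

```python
from itertools import zip_longest


def changed_scenes(old_scripts, new_scripts):
    """Return 1-based numbers of scenes whose script text differs."""
    # zip_longest pads the shorter list with None, so added or
    # removed scenes at the end are reported as changed too.
    pairs = zip_longest(old_scripts, new_scripts)
    return [n for n, (old, new) in enumerate(pairs, start=1) if old != new]


old = ["Welcome aboard.", "Open the Settings tab.", "That's it."]
new = ["Welcome aboard.", "Open the Preferences tab.", "That's it."]
print(changed_scenes(old, new))  # [2]
```

It compares scenes by position, so inserting a scene in the middle marks every later scene as changed.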

For compliance training where you need dated proof of updates, Synthesia's version history tracks when changes were made and lets you export a record.

Exporting and sharing

When your video is ready, click Export to download an MP4. Alternatively, Synthesia generates a shareable link that lets viewers watch the video without downloading it, which is convenient for linking from an LMS.

If you're using an LMS like Docebo, TalentLMS, or Moodle, you can export SCORM packages from Synthesia to embed tracking directly. SCORM export is available on Business and Enterprise plans.

Resolution export options: 1080p on all paid plans, 4K on Enterprise. For most training purposes, 1080p is sufficient.

Practical tips before you start

Keep scripts concise. A 5-minute training video is usually more effective than a 15-minute one covering the same content, because most people can stay focused for 5 minutes. Break long training sequences into short modules rather than one long video.
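
To see how a long draft might break into modules, you can pack paragraphs greedily against a word budget. A rough sketch, assuming a pace of about 130 words per minute and a 5-minute cap per module; split_into_modules is a made-up helper, and it only cuts at paragraph boundaries, so a single oversized paragraph still ends up as its own module.

```python
def split_into_modules(paragraphs, max_minutes=5, wpm=130):
    """Greedily pack paragraphs into modules that fit the time budget."""
    budget = max_minutes * wpm  # word budget per module
    modules, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        # Start a new module when this paragraph would overflow the budget.
        if current and count + words > budget:
            modules.append(current)
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        modules.append(current)
    return modules


draft = [" ".join(["word"] * 300) for _ in range(5)]  # five 300-word paragraphs
print([len(m) for m in split_into_modules(draft)])  # [2, 2, 1]
```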

Use the avatar sparingly in visually heavy scenes. If a scene is 80% screen recording, you don't need the avatar visible at all. Hide it and let the voice narrate. Reserve the avatar's face time for introductions, summaries, and moments where the human-presenter feel actually adds warmth.

And test the pronunciation dictionary early. Nothing undermines trust in a training video faster than hearing your company name pronounced wrong three times in a row.