MiniMax Audio Models

Overview

MiniMax provides audio capabilities in StoryFlow, including voice cloning and text-to-speech (TTS). Use these when you need consistent narration voices across scenes.

MiniMax Voice Clone

What it does

Create a reusable voice ID by cloning timbre and speaking style from an audio sample.

Inputs

  • Audio Sample (Required): MP3, M4A, or WAV; 10 seconds to 5 minutes long; 20 MB maximum.
  • Demo Text (Required): used to generate a short preview after cloning.
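
The sample limits above can be checked locally before uploading. The following is a minimal sketch, not part of StoryFlow or MiniMax: the function name `check_audio_sample` is hypothetical, the caller is assumed to already know the file size and duration, and "20MB" is assumed to mean 20 × 1024 × 1024 bytes.

```python
from pathlib import Path

# Limits taken from the Inputs list above.
ALLOWED_EXTENSIONS = {".mp3", ".m4a", ".wav"}
MIN_DURATION_S = 10
MAX_DURATION_S = 5 * 60
MAX_SIZE_BYTES = 20 * 1024 * 1024  # assumption: "20MB" read as 20 MiB


def check_audio_sample(filename: str, size_bytes: int, duration_s: float) -> list[str]:
    """Return a list of problems; an empty list means the sample fits the limits."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("format must be MP3, M4A, or WAV")
    if not MIN_DURATION_S <= duration_s <= MAX_DURATION_S:
        problems.append("duration must be between 10 s and 5 min")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file must be 20 MB or smaller")
    return problems
```

For example, `check_audio_sample("narrator.wav", 3_000_000, 45.0)` returns an empty list, while a 5-second clip is reported as too short.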

Parameters

| Parameter | Type | Default | Allowed | What it does |
| --- | --- | --- | --- | --- |
| voice_model | string | speech-2.5-hd-preview | speech-2.5-hd-preview | Selects the cloning model. |
| accuracy | number | 0.8 | 0.0 to 1.0 | Controls how strictly the cloned voice matches the sample. Higher = closer match. |
| need_noise_reduction | boolean | true | true, false | Enables noise reduction on the sample before cloning. |
| need_volume_normalization | boolean | true | true, false | Normalizes volume for more consistent output. |
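
To make the defaults and allowed values concrete, here is a minimal sketch that assembles the Voice Clone parameters as a plain dictionary and rejects out-of-range values. The helper name `build_clone_params` is hypothetical, and how the resulting values reach StoryFlow (node settings, a request body, or something else) is not specified here.

```python
CLONE_MODEL = "speech-2.5-hd-preview"  # the only allowed value listed above


def build_clone_params(accuracy: float = 0.8,
                       need_noise_reduction: bool = True,
                       need_volume_normalization: bool = True,
                       voice_model: str = CLONE_MODEL) -> dict:
    """Collect Voice Clone parameters, using the documented defaults."""
    if voice_model != CLONE_MODEL:
        raise ValueError(f"voice_model must be {CLONE_MODEL!r}")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    return {
        "voice_model": voice_model,
        "accuracy": accuracy,
        "need_noise_reduction": need_noise_reduction,
        "need_volume_normalization": need_volume_normalization,
    }
```

Calling `build_clone_params()` with no arguments yields the defaults from the table; `build_clone_params(accuracy=1.5)` raises `ValueError`.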

Tips

  • Use clean, single-speaker audio with minimal music/noise.
  • Provide 30–90 seconds of steady speech for better timbre stability.
  • A cloned voice ID is temporary: use it in a TTS request within 7 days to keep it.
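
If you track cloned voices yourself, the 7-day window can be computed with a few lines. This is a sketch under one assumption: the window is counted from the moment the clone was created, which the page does not state explicitly. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=7)  # from the tip above


def first_use_deadline(cloned_at: datetime) -> datetime:
    """Latest time by which the voice ID should be used in a TTS request.

    Assumption: the 7 days start when the clone is created.
    """
    return cloned_at + RETENTION_WINDOW


def is_still_usable(cloned_at: datetime, now: datetime) -> bool:
    """True while the first-use deadline has not passed."""
    return now <= first_use_deadline(cloned_at)


cloned_at = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
deadline = first_use_deadline(cloned_at)
```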

MiniMax TTS

What it does

Generate speech from text, using either system preset voices or your custom cloned voice ID.

Inputs

  • Text Input (Required): the content to speak.

Parameters

| Parameter | Type | Default | Allowed | What it does |
| --- | --- | --- | --- | --- |
| voice_model | string | speech-2.6-turbo | speech-2.6-turbo, speech-2.6-hd | Selects the TTS model. Turbo is faster; HD is higher quality. |
| voice_id | string | male-qn-qingse | Preset list | Selects a system timbre (preset voice). Disabled when use_custom_voice=true. |
| use_custom_voice | boolean | false | true, false | Enables using your cloned voice ID instead of a preset timbre. |
| custom_voice_id | string | "" | - | Your cloned voice ID (typically starts with voice_clone_). Only available when use_custom_voice=true. |
| emotion | string | neutral | neutral, happy, sad, angry, fearful, disgusted, surprised | Controls emotional tone (supported in MiniMax 2.6 models). |
| text_normalization | boolean | false | true, false | Improves reading of numbers/dates/symbols in English, with slight added latency. |
| speed | number | 1.0 | 0.5 to 2.0 | Controls speaking rate. |
| vol | number | 1.0 | 0.1 to 10.0 | Controls output volume. |
| pitch | number | 0 | -12 to 12 (integer) | Controls vocal pitch shift. |
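
The interaction between voice_id, use_custom_voice, and custom_voice_id is easiest to see in code. The sketch below validates the documented ranges and keeps only the voice field that is active. It is illustrative only: `build_tts_params` is a hypothetical helper, the `text` key is an assumed name for the Text Input, and the preset voice list is not reproduced here, so voice_id is not checked against it.

```python
TTS_MODELS = {"speech-2.6-turbo", "speech-2.6-hd"}
EMOTIONS = {"neutral", "happy", "sad", "angry", "fearful", "disgusted", "surprised"}


def build_tts_params(text: str, *, voice_model: str = "speech-2.6-turbo",
                     voice_id: str = "male-qn-qingse",
                     use_custom_voice: bool = False, custom_voice_id: str = "",
                     emotion: str = "neutral", text_normalization: bool = False,
                     speed: float = 1.0, vol: float = 1.0, pitch: int = 0) -> dict:
    """Collect TTS parameters, using the documented defaults and ranges."""
    if not text.strip():
        raise ValueError("text is required")
    if voice_model not in TTS_MODELS:
        raise ValueError(f"voice_model must be one of {sorted(TTS_MODELS)}")
    if emotion not in EMOTIONS:
        raise ValueError(f"emotion must be one of {sorted(EMOTIONS)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= vol <= 10.0:
        raise ValueError("vol must be between 0.1 and 10.0")
    if isinstance(pitch, bool) or not isinstance(pitch, int) or not -12 <= pitch <= 12:
        raise ValueError("pitch must be an integer between -12 and 12")
    if use_custom_voice and not custom_voice_id:
        raise ValueError("custom_voice_id is required when use_custom_voice is true")

    params = {
        "text": text,  # assumed key name for the Text Input
        "voice_model": voice_model,
        "use_custom_voice": use_custom_voice,
        "emotion": emotion,
        "text_normalization": text_normalization,
        "speed": speed,
        "vol": vol,
        "pitch": pitch,
    }
    # Only one voice selector applies: the preset is disabled in custom mode.
    if use_custom_voice:
        params["custom_voice_id"] = custom_voice_id
    else:
        params["voice_id"] = voice_id
    return params
```

With `use_custom_voice=True` and a cloned ID such as `"voice_clone_demo"` (a made-up value), the result contains custom_voice_id and omits voice_id; with the defaults it contains the preset voice_id instead.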

Tips

  • Split long scripts into smaller paragraphs for more controllable pacing.
  • Keep the same voice settings across all scenes to maintain consistency.
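
Both tips can be combined: split the script into short chunks, then pair every chunk with one shared set of voice settings. The sketch below is one possible approach, not a StoryFlow feature. The `max_chars` value of 500 is an arbitrary illustration rather than a documented limit, and the `text` key and helper names are hypothetical.

```python
import re

# One settings dictionary reused for every chunk keeps the voice consistent.
SHARED_SETTINGS = {
    "voice_model": "speech-2.6-hd",
    "voice_id": "male-qn-qingse",
    "speed": 1.0,
    "vol": 1.0,
    "pitch": 0,
}


def split_script(script: str, max_chars: int = 500) -> list[str]:
    """Split on blank lines, then pack sentences into chunks of about max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    chunks = []
    for paragraph in re.split(r"\n\s*\n", script.strip()):
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def plan_requests(script: str) -> list[dict]:
    """Pair each chunk with the same voice settings."""
    return [{"text": chunk, **SHARED_SETTINGS} for chunk in split_script(script)]
```

Splitting on paragraph boundaries first keeps natural pauses where the author placed them; the sentence-level packing only applies to paragraphs that exceed the chosen size.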