TanStack AI provides support for text-to-speech generation through dedicated TTS adapters. This guide covers how to convert text into spoken audio using OpenAI and Gemini providers.
Text-to-speech (TTS) is handled by dedicated TTS adapters that follow the same tree-shakeable architecture as the other adapters in TanStack AI.
```ts
import { generateSpeech } from '@tanstack/ai'
import { openaiTTS } from '@tanstack/ai-openai'

// Generate speech from text (openaiTTS reads OPENAI_API_KEY from the environment)
const result = await generateSpeech({
  adapter: openaiTTS('tts-1'),
  text: 'Hello, welcome to TanStack AI!',
  voice: 'alloy',
})

// result.audio contains base64-encoded audio data
console.log(result.format) // 'mp3'
console.log(result.contentType) // 'audio/mpeg'
```
```ts
import { generateSpeech } from '@tanstack/ai'
import { geminiTTS } from '@tanstack/ai-gemini'

// Generate speech from text (geminiTTS reads GOOGLE_API_KEY from the environment)
const result = await generateSpeech({
  adapter: geminiTTS('gemini-2.5-flash-preview-tts'),
  text: 'Hello from Gemini TTS!',
})

console.log(result.audio) // Base64-encoded audio
```
All TTS adapters support these common options:
| Option | Type | Description |
|---|---|---|
| text | string | The text to convert to speech (required) |
| voice | string | The voice to use for generation |
| format | string | Output audio format (e.g., "mp3", "wav") |
OpenAI provides several distinct voices:
| Voice | Description |
|---|---|
| alloy | Neutral, balanced voice |
| echo | Warm, conversational voice |
| fable | Expressive, storytelling voice |
| onyx | Deep, authoritative voice |
| nova | Friendly, upbeat voice |
| shimmer | Clear, gentle voice |
| ash | Calm, measured voice |
| ballad | Melodic, flowing voice |
| coral | Bright, energetic voice |
| sage | Wise, thoughtful voice |
| verse | Poetic, rhythmic voice |
OpenAI supports the following output formats:

| Format | Description |
|---|---|
| mp3 | MP3 audio (default) |
| opus | Opus audio (good for streaming) |
| aac | AAC audio |
| flac | FLAC audio (lossless) |
| wav | WAV audio (uncompressed) |
| pcm | Raw PCM audio |
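When serving or playing generated audio you also need the MIME type that matches the chosen format. A minimal sketch of such a lookup follows; the mapping table is an assumption (it is not part of TanStack AI, and some providers report slightly different strings), so prefer `result.contentType` whenever the library returns it:

```typescript
// Hypothetical helper: map an output format to a likely MIME type.
// Use result.contentType from the TTS result when it is available.
const FORMAT_CONTENT_TYPES: Record<string, string> = {
  mp3: 'audio/mpeg',
  opus: 'audio/opus',
  aac: 'audio/aac',
  flac: 'audio/flac',
  wav: 'audio/wav',
  pcm: 'audio/pcm',
}

function contentTypeFor(format: string): string {
  // Fall back to a generic binary type for unrecognized formats
  return FORMAT_CONTENT_TYPES[format] ?? 'application/octet-stream'
}
```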
```ts
const result = await generateSpeech({
  adapter: openaiTTS('tts-1-hd'),
  text: 'High quality speech synthesis',
  voice: 'nova',
  format: 'mp3',
  modelOptions: {
    speed: 1.0, // 0.25 to 4.0
  },
})
```
| Option | Type | Description |
|---|---|---|
| speed | number | Playback speed (0.25 to 4.0, default 1.0) |
| instructions | string | Voice style instructions (GPT-4o audio models only) |
Note: The instructions and stream_format options are only available with gpt-4o-audio-preview and gpt-4o-mini-audio-preview models, not with tts-1 or tts-1-hd.
The TTS result includes:
```ts
interface TTSResult {
  id: string          // Unique identifier for this generation
  model: string       // The model used
  audio: string       // Base64-encoded audio data
  format: string      // Audio format (e.g., "mp3")
  contentType: string // MIME type (e.g., "audio/mpeg")
  duration?: number   // Duration in seconds (if available)
}
```
```ts
// Convert base64 to audio and play
function playAudio(result: TTSResult) {
  const audioData = atob(result.audio)
  const bytes = new Uint8Array(audioData.length)
  for (let i = 0; i < audioData.length; i++) {
    bytes[i] = audioData.charCodeAt(i)
  }

  const blob = new Blob([bytes], { type: result.contentType })
  const url = URL.createObjectURL(blob)
  const audio = new Audio(url)
  audio.play()

  // Clean up when done
  audio.onended = () => URL.revokeObjectURL(url)
}
```
```ts
import { writeFile } from 'fs/promises'

async function saveAudio(result: TTSResult, filename: string) {
  const audioBuffer = Buffer.from(result.audio, 'base64')
  await writeFile(filename, audioBuffer)
  console.log(`Saved to ${filename}`)
}

// Usage
const result = await generateSpeech({
  adapter: openaiTTS('tts-1'),
  text: 'Hello world!',
})
await saveAudio(result, 'output.mp3')
```
| Model | Quality | Speed | Use Case |
|---|---|---|---|
| tts-1 | Standard | Fast | Real-time applications |
| tts-1-hd | High | Slower | Production audio |
| gpt-4o-audio-preview | Highest | Variable | Advanced voice control |
| gpt-4o-mini-audio-preview | High | Fast | Balanced quality/speed |
| Model | Status | Notes |
|---|---|---|
| gemini-2.5-flash-preview-tts | Experimental | May require Live API for full features |
```ts
try {
  const result = await generateSpeech({
    adapter: openaiTTS('tts-1'),
    text: 'Hello!',
  })
} catch (error) {
  if (error.message.includes('exceeds maximum length')) {
    console.error('Text is too long (max 4096 characters)')
  } else if (error.message.includes('Speed must be between')) {
    console.error('Invalid speed value')
  } else {
    console.error('TTS error:', error.message)
  }
}
```
The TTS adapters use the same environment variables as other adapters: OPENAI_API_KEY for OpenAI and GOOGLE_API_KEY for Gemini.
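For local development you can export these before starting your app (the variable names match the adapter comments above; the values shown are placeholders):

```shell
# Provider API keys picked up automatically by the TTS adapters
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="your-google-api-key"
```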
For production use or when you need explicit control:
```ts
import { createOpenaiTTS } from '@tanstack/ai-openai'
import { createGeminiTTS } from '@tanstack/ai-gemini'

// OpenAI
const openaiAdapter = createOpenaiTTS('your-openai-api-key')

// Gemini
const geminiAdapter = createGeminiTTS('your-google-api-key')
```
- Text Length: OpenAI TTS supports up to 4096 characters per request. For longer content, split the text into chunks.
- Voice Selection: Choose voices appropriate for your content; use onyx for authoritative content, nova for friendly interactions.
- Format Selection: Use mp3 for general use, opus for streaming, wav for further processing.
- Caching: Cache generated audio to avoid regenerating the same content.
- Error Handling: Always handle errors gracefully, especially in user-facing applications.
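The text-length practice above can be sketched as a small chunking helper. This is a hypothetical utility (`chunkText` is not part of TanStack AI) that splits text into pieces under the 4096-character limit, preferring sentence boundaries so each chunk can be sent to generateSpeech separately:

```typescript
// Hypothetical helper: split long text into chunks that fit OpenAI's
// 4096-character limit, breaking at sentence ends where possible.
function chunkText(text: string, maxLength = 4096): string[] {
  const chunks: string[] = []
  let remaining = text.trim()
  while (remaining.length > maxLength) {
    // Find the last sentence-ending punctuation within the limit
    const slice = remaining.slice(0, maxLength)
    const breakAt = Math.max(
      slice.lastIndexOf('. '),
      slice.lastIndexOf('! '),
      slice.lastIndexOf('? '),
    )
    // Fall back to a hard cut when no sentence boundary is found
    const cut = breakAt > 0 ? breakAt + 1 : maxLength
    chunks.push(remaining.slice(0, cut).trim())
    remaining = remaining.slice(cut).trim()
  }
  if (remaining.length > 0) chunks.push(remaining)
  return chunks
}
```

Each chunk can then be passed to generateSpeech in turn and the resulting audio segments concatenated or played sequentially.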
