OpenAI Codex Has No Audio Tools — Add Them in 30 Seconds

Add music and audio generation to OpenAI Codex in minutes. Full guide: Suno V5, ElevenLabs, Mureka via AnyCap CLI and MCP server. Includes video+music pipeline examples.

by AnyCap


Yes, OpenAI Codex can generate music and audio, though not out of the box. Codex ships with no built-in audio tools. Once you add the AnyCap CLI or MCP server, your Codex agent can compose background tracks, generate sound effects, produce voiceovers, and deliver finished audio files without leaving the agent session.


Why Codex Has No Built-In Audio Tools

Codex is a code-generation engine. Its sandbox exposes shell and file I/O, but no hooks into audio synthesis APIs. The models behind it are text and code specialists: they can write code that generates audio, but they cannot synthesize audio themselves.

AnyCap adds a lightweight CLI binary and MCP server that turn anycap music generate and anycap audio generate into first-class tool calls inside Codex.


What Audio Generation Unlocks for Codex

  • Background music for videos, demos, and presentations your agent builds
  • Voiceover narration for documentation, explainer videos, or onboarding flows
  • Sound effects for games, apps, and interactive prototypes
  • Podcast-style audio — synthesize an interview or explainer from a script
  • Localized audio — generate the same voiceover in multiple languages from one prompt

Available Audio & Music Models

Model            | Type  | Best For                             | Quality | Credits/min
Suno V5.5        | Music | Full songs with lyrics and vocals    | ★★★★★   | ~30
Suno V5          | Music | Instrumental, background, ambient    | ★★★★★   | ~25
ElevenLabs Music | Music | Cinematic, orchestral, game audio    | ★★★★★   | ~20
Mureka V8        | Music | Consistent style, batch production   | ★★★★☆   | ~15
ElevenLabs TTS   | Voice | Voiceover, narration, text-to-speech | ★★★★★   | ~5
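
Because the rates above are quoted per minute, a rough budget for a track is rate × duration / 60. The sketch below hard-codes the approximate rates from the table and rounds up. The `mureka-v8` identifier and the rounding rule are assumptions made for illustration, not documented AnyCap behavior, so actual billing may differ.

```shell
# Approximate credits-per-minute rates copied from the table above.
credits_per_minute() {
  case "$1" in
    suno-v5-5)        echo 30 ;;
    suno-v5)          echo 25 ;;
    elevenlabs-music) echo 20 ;;
    mureka-v8)        echo 15 ;;  # hypothetical identifier, not shown in this guide
    elevenlabs-tts)   echo 5 ;;
    *) echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# estimate_credits MODEL SECONDS
# Prints an estimated credit cost, rounded up to a whole credit (assumed rule).
estimate_credits() (
  rate=$(credits_per_minute "$1") || exit 1
  seconds=$2
  # Integer ceiling of rate * seconds / 60.
  echo $(( (rate * seconds + 59) / 60 ))
)
```

For example, `estimate_credits suno-v5-5 180` prints 90, which is three minutes at roughly 30 credits per minute.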

Method 1: AnyCap MCP Server (Natural-Language Tool Calls)

The MCP server exposes music_generate and audio_generate as native tools. The Codex agent invokes them in response to natural-language requests, so no explicit shell commands are needed.

Setup

Step 1: Install AnyCap in the Codex sandbox

curl -fsSL https://anycap.ai/install.sh | sh
anycap login   # paste your API key once

Step 2: Add to your Codex task configuration

In your codex.yaml or task system prompt, add:

tools:
  - type: mcp_server
    command: anycap mcp serve
    env:
      ANYCAP_API_KEY: "${ANYCAP_API_KEY}"

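For the OpenAI Codex CLI, which reads MCP servers from ~/.codex/config.toml:

[mcp_servers.anycap]
command = "anycap"
args = ["mcp", "serve"]
env = { "ANYCAP_API_KEY" = "your_key_here" }

Or, for MCP clients that take a JSON mcpServers block:

{
  "mcpServers": {
    "anycap": {
      "command": "anycap",
      "args": ["mcp", "serve"],
      "env": { "ANYCAP_API_KEY": "your_key_here" }
    }
  }
}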
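
If the MCP server fails to start, the usual causes are a missing binary or a missing key. A small preflight check, run before starting the agent session, can catch both. This is a minimal sketch: the `preflight` function name is hypothetical, and the key check only matters when the configuration relies on the ANYCAP_API_KEY environment variable rather than on credentials stored by anycap login.

```shell
# preflight: succeed only if the anycap CLI is on PATH and a key is set.
preflight() {
  if ! command -v anycap >/dev/null 2>&1; then
    echo "anycap not found on PATH; run the install script first" >&2
    return 1
  fi
  if [ -z "${ANYCAP_API_KEY:-}" ]; then
    echo "ANYCAP_API_KEY is not set" >&2
    return 1
  fi
  echo "anycap preflight ok"
}
```

Calling `preflight || exit 1` at the top of a task script stops the run early instead of failing partway through a generation step.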

Step 3: Ask for audio in natural language

Generate a 60-second upbeat background track suitable for a product
demo video. Energetic but not distracting. Save to assets/bg-track.mp3.

The agent calls music_generate with Suno V5, waits for the result, and saves the file — no further user input required.


Method 2: AnyCap CLI (Direct Shell Calls)

The CLI gives explicit control over every parameter. Use this when your Codex task needs precise audio specifications.

Install

curl -fsSL https://anycap.ai/install.sh | sh && anycap login

Generate background music

# Upbeat background track, 60 seconds
anycap music generate \
  --model suno-v5 \
  --prompt "Upbeat lo-fi hip-hop, 80 BPM, focus music, no lyrics" \
  --duration 60 \
  --output assets/bg-music.mp3

Output:

✓ Queued    suno-v5  [job: mus-2291]
✓ Composing 60s …
✓ Complete  assets/bg-music.mp3  (2.8 MB)
  Credits used: 25
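
In an unattended agent run it helps to wrap the call so that a missing output directory or an empty result is treated as a failure. The wrapper below is a sketch that uses only the flags shown above. The `generate_track` name is hypothetical, and the check for a non-empty file is an extra safeguard, not something the CLI is known to require.

```shell
# generate_track MODEL DURATION_SECONDS OUTPUT PROMPT
generate_track() (
  model=$1 duration=$2 output=$3 prompt=$4
  # Create the destination directory so saving cannot fail on a missing path.
  mkdir -p "$(dirname "$output")" || exit 1
  anycap music generate \
    --model "$model" \
    --prompt "$prompt" \
    --duration "$duration" \
    --output "$output" || exit 1
  # Treat a missing or empty file as a failure even if the CLI exited 0.
  [ -s "$output" ] || { echo "no audio written to $output" >&2; exit 1; }
)
```

A call such as `generate_track suno-v5 60 assets/bg-music.mp3 "Upbeat lo-fi hip-hop, 80 BPM, focus music, no lyrics"` reproduces the command above with those checks added.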

Generate a full song with lyrics

anycap music generate \
  --model suno-v5-5 \
  --prompt "Indie pop song about shipping code at midnight, verse-chorus structure" \
  --duration 180 \
  --output assets/midnight-coder.mp3

Generate a voiceover from text

anycap audio generate \
  --model elevenlabs-tts \
  --voice "Rachel" \
  --text "Welcome to AnyCap. The capability layer that gives AI agents eyes, ears, and a voice." \
  --output assets/welcome-narration.mp3
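
For the multi-language voiceover use case listed earlier, one approach is to keep a text file per language and loop over them. The sketch below assumes the scripts are already translated (for example scripts/en.txt and scripts/de.txt) and that the TTS model reads each text in whatever language it is written. The `narrate_scripts` name and the directory layout are hypothetical.

```shell
# narrate_scripts SCRIPT_DIR OUT_DIR [VOICE]
# Reads every *.txt file in SCRIPT_DIR and writes OUT_DIR/<name>.mp3,
# printing each output path as it is produced.
narrate_scripts() (
  script_dir=$1 out_dir=$2 voice=${3:-Rachel}
  mkdir -p "$out_dir" || exit 1
  for script in "$script_dir"/*.txt; do
    [ -f "$script" ] || continue   # skip when the glob matches nothing
    name=$(basename "$script" .txt)
    anycap audio generate \
      --model elevenlabs-tts \
      --voice "$voice" \
      --text "$(cat "$script")" \
      --output "$out_dir/$name.mp3" || exit 1
    echo "$out_dir/$name.mp3"
  done
)
```

Stopping at the first failed file keeps a partial batch from looking like a complete one.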

Generate cinematic/orchestral score

anycap music generate \
  --model elevenlabs-music \
  --prompt "Epic orchestral intro, building tension, suitable for a product announcement" \
  --duration 30 \
  --output assets/announcement-score.mp3

Upload to Drive and share

anycap drive upload assets/bg-music.mp3 --share
# → https://drive.anycap.ai/f/xyz789
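
When a later step needs the share link (for example to paste it into a report), the URL can be captured from the command output. This sketch assumes the link is printed somewhere on stdout, as the example above suggests. The exact output format is not specified here, so the pattern simply takes the first https URL it finds. The `upload_and_get_url` name is hypothetical.

```shell
# upload_and_get_url FILE
# Uploads FILE with sharing enabled and prints the first https URL in the output.
upload_and_get_url() (
  file=$1
  [ -f "$file" ] || { echo "no such file: $file" >&2; exit 1; }
  # Assumption: the share link appears in the command's stdout.
  url=$(anycap drive upload "$file" --share \
        | grep -Eo 'https://[^[:space:]]+' | head -n 1)
  [ -n "$url" ] || { echo "no share URL found in output" >&2; exit 1; }
  echo "$url"
)
```

Usage: `link=$(upload_and_get_url assets/bg-music.mp3) || exit 1`.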

Real Workflow: Video + Music Pipeline in Codex

The most powerful use case is pairing music generation with video generation to produce complete multimedia output:

# 1. Generate the video
anycap video generate \
  --model veo-3-1 \
  --prompt "Product launch reveal animation, clean minimal design" \
  --duration 30 \
  --no-audio \
  --output /workspace/launch-video.mp4

# 2. Generate the soundtrack
anycap music generate \
  --model elevenlabs-music \
  --prompt "Cinematic product reveal score, builds to reveal moment, 30 seconds" \
  --duration 30 \
  --output /workspace/launch-score.mp3

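# 3. Merge using ffmpeg (available in Codex sandbox)
#    -map takes the video stream from the first input and the audio from the second;
#    -shortest ends the output when the shorter stream ends.
ffmpeg -i /workspace/launch-video.mp4 \
       -i /workspace/launch-score.mp3 \
       -map 0:v:0 -map 1:a:0 \
       -c:v copy -c:a aac \
       -shortest \
       /workspace/launch-final.mp4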

# 4. Upload and share
anycap drive upload /workspace/launch-final.mp4 --share

The result is a complete product launch video with an original score, generated entirely inside a Codex agent session.
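
Because -shortest trims the output to the shorter input, a length mismatch between the video and the score silently cuts one of them. A quick duration comparison before merging makes that visible. The sketch below uses ffprobe, which ships with ffmpeg. The function names and the 0.5-second default tolerance are illustrative choices.

```shell
# media_duration FILE
# Prints the container duration in seconds (may be fractional).
media_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# durations_match FILE_A FILE_B [TOLERANCE_SECONDS]
# Succeeds when the two durations differ by no more than the tolerance.
durations_match() (
  a=$(media_duration "$1") || exit 1
  b=$(media_duration "$2") || exit 1
  tol=${3:-0.5}
  # awk does the floating-point comparison; exit status 0 means "close enough".
  awk -v a="$a" -v b="$b" -v t="$tol" \
    'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= t) }'
)
```

Running `durations_match /workspace/launch-video.mp4 /workspace/launch-score.mp3 || echo "length mismatch" >&2` before step 3 flags a score that came back shorter or longer than requested.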


Model Selection Guide

Use Suno V5.5 when:

  • You want a full song with lyrics, melody, and vocals
  • You need a specific music genre (pop, rock, jazz, electronic)
  • The output will be the primary audio content (not just background)

Use Suno V5 when:

  • You need instrumental/ambient background music
  • You want consistent, loopable tracks
  • The music will play under voiceover or video

Use ElevenLabs Music when:

  • You need cinematic, orchestral, or game-style audio
  • The music needs to sync to specific emotional beats
  • You need professional quality for commercial use

Use Mureka V8 when:

  • You are generating many tracks with a consistent style
  • You need batch production at lower cost per minute
  • Style consistency across multiple pieces matters more than peak quality

Use ElevenLabs TTS when:

  • You need spoken-word audio: narration, voiceover, documentation read-aloud
  • You want multiple speaker voices for different characters
  • You need multilingual audio from the same text

FAQ

Q: Can Codex generate audio without me installing anything?
A: No. Codex's default sandbox has no audio synthesis tools. AnyCap is the fastest path to add them — installation takes about 30 seconds.

Q: Do I need an ElevenLabs or Suno account?
A: No. AnyCap aggregates all models under one API key. You pay only AnyCap credits — no separate subscriptions required.

Q: What audio formats are supported?
A: MP3 (default), WAV, and OGG. Pass --format wav for lossless output. WAV is recommended before any mixing or post-processing.

Q: Can I specify the key or tempo of generated music?
A: Yes, through the prompt. Include musical parameters such as "120 BPM, key of C major, 4/4 time signature". Suno V5 and ElevenLabs Music both follow musical specification prompts reliably.

Q: How long can a generated audio track be?
A: Maximum duration is 5 minutes (300 seconds) per call. For longer content, generate in segments and concatenate with ffmpeg.
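
The segment-and-concatenate approach can be sketched in two helpers: one splits a target length into chunks of at most 300 seconds, and one joins the generated files with ffmpeg's concat demuxer. The function names are hypothetical. Stream copy (-c copy) assumes all segments share the same codec and encoding parameters, and the list format assumes paths contain no single quotes.

```shell
# plan_segments TOTAL_SECONDS [MAX_SECONDS]
# Prints one segment length per line, each at most MAX_SECONDS (default 300).
plan_segments() (
  remaining=$1 max=${2:-300}
  while [ "$remaining" -gt 0 ]; do
    if [ "$remaining" -gt "$max" ]; then len=$max; else len=$remaining; fi
    echo "$len"
    remaining=$((remaining - len))
  done
)

# concat_audio OUTPUT SEGMENT...
# Joins the segments in order without re-encoding.
concat_audio() (
  output=$1; shift
  list=$(mktemp) || exit 1
  for seg in "$@"; do
    # The list file lives in a temp directory, so make relative paths absolute.
    case "$seg" in /*) ;; *) seg="$PWD/$seg" ;; esac
    printf "file '%s'\n" "$seg" >> "$list"
  done
  ffmpeg -y -f concat -safe 0 -i "$list" -c copy "$output"
  status=$?
  rm -f "$list"
  exit "$status"
)
```

For a 12-minute track, `plan_segments 720` prints 300, 300, and 120 on separate lines. Each value can be passed as --duration to a separate generate call, and the resulting files handed to concat_audio.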

Q: Can the TTS voice be customized?
A: Yes. Pass --voice with a voice name (Rachel, Josh, Bella, etc.), and use the --stability and --similarity parameters for fine-tuning. Run anycap audio voices to list all available voices.

Q: Does this work in Claude Code and Cursor too?
A: Yes. The same CLI and MCP server install identically in Claude Code and Cursor terminals. The only difference is where you put the MCP config file.