Speech Recognition and Text-to-Speech: AI That Can Listen and Speak
“Hey Google, what’s the weather today?” Within moments, the AI understands your voice and answers in a human-sounding voice. Ever wondered how this technology works? Let’s explore the world of Speech AI: Speech Recognition (speech to text) and Text-to-Speech (text to speech)! 🎙️
Speech Recognition (ASR): Speech → Text
Automatic Speech Recognition (ASR) is the technology that converts human speech into written text.
How ASR Works
Audio (Waveform)
↓
Feature Extraction (MFCC, Spectrogram)
↓
Acoustic Model (Phonemes)
↓
Language Model (Words)
↓
Text Output
1. Audio Preprocessing
Speech is recorded as a waveform, then converted into a representation that a model can process.
MFCC (Mel-Frequency Cepstral Coefficients):
import librosa
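# Load audio (librosa resamples to 22,050 Hz mono unless sr is given)
audio, sr = librosa.load('speech.wav')
# Extract 13 MFCCs per frame; the result has shape (13, n_frames)
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)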
Spectrogram: A visualization of how the frequency content of the audio changes over time.
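To make the idea of a spectrogram concrete, here is a minimal sketch that computes one column of it (the magnitude spectrum of a single frame) with a naive DFT, plus the Hz-to-mel conversion that MFCCs build on. The sample rate, frame length, and test tone are assumptions chosen for illustration; real pipelines use an FFT library such as librosa or NumPy instead of this slow pure-Python loop.

```python
import cmath
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (common HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def frame_spectrum(frame: list[float]) -> list[float]:
    """Magnitude spectrum of one Hann-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    spectrum = []
    for k in range(n // 2 + 1):  # keep only the non-negative frequencies
        total = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        spectrum.append(abs(total))
    return spectrum


SAMPLE_RATE = 8000  # Hz (assumed for this toy example)
FRAME_LENGTH = 256  # samples per frame (assumed)
TONE_HZ = 1000.0    # synthetic test tone standing in for speech

frame = [
    math.sin(2.0 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(FRAME_LENGTH)
]
magnitudes = frame_spectrum(frame)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LENGTH
print(f"strongest frequency: {peak_hz} Hz, about {hz_to_mel(peak_hz):.0f} mel")
```

Stacking these per-frame spectra side by side over time gives the spectrogram; MFCCs add a mel filterbank, a logarithm, and a cosine transform on top.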
2. Acoustic Model
Maps the audio features to the smallest units of speech sound (phonemes).
Example phonemes:
- “Halo” → /h/ /a/ /l/ /o/
- “AI” → /eɪ/ /aɪ/
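The article does not say how the acoustic model turns frames into phonemes. One common approach in neural systems is CTC-style decoding, where the model emits one label per audio frame and repeated labels and blanks are then collapsed. The sketch below assumes that approach; the blank symbol and the frame labels are made up for illustration.

```python
BLANK = "_"  # hypothetical symbol for the CTC "no phoneme" label


def ctc_greedy_collapse(frame_labels: list[str]) -> list[str]:
    """Merge consecutive repeated frame labels, then drop blanks."""
    phonemes = []
    previous = None
    for label in frame_labels:
        if label != previous and label != BLANK:
            phonemes.append(label)
        previous = label
    return phonemes


# One label per audio frame, as a CTC-style model might emit for "Halo"
frames = ["h", "h", "_", "a", "a", "a", "_", "l", "_", "o", "o"]
print(ctc_greedy_collapse(frames))
```

The blank matters when a phoneme genuinely repeats: `["l", "_", "l"]` stays two phonemes, while `["l", "l"]` collapses to one.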
3. Language Model
Picks the most likely sequence of words.
Example:
Acoustic model output: "reka nisi"
Language model: "recognize" is more likely than "reka nisi"
Final output: "recognize"
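The example above can be sketched as a simple rescoring step: each candidate gets an acoustic score and a language-model score, and the combination decides the winner. All probabilities below are invented for illustration, and `lm_weight` is a hypothetical tuning knob, not a value from any real system.

```python
import math

# Hypothetical acoustic log-probabilities: the audio slightly favors "reka nisi"
acoustic_logprob = {"reka nisi": math.log(0.6), "recognize": math.log(0.4)}
# Hypothetical language-model log-probabilities: "recognize" is far more plausible
lm_logprob = {"reka nisi": math.log(0.001), "recognize": math.log(0.2)}


def best_transcription(acoustic: dict, language: dict, lm_weight: float = 1.0) -> str:
    """Return the candidate with the highest combined log-score."""

    def score(text: str) -> float:
        return acoustic[text] + lm_weight * language[text]

    return max(acoustic, key=score)


print(best_transcription(acoustic_logprob, lm_logprob))                 # with LM
print(best_transcription(acoustic_logprob, lm_logprob, lm_weight=0.0))  # audio only
```

With the language model switched off (`lm_weight=0.0`), the acoustically closer but implausible candidate wins, which is exactly the failure the language model is there to prevent.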
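Modern ASR Models
Traditional: HMM + GMM
- Hidden Markov Models (HMM)
- Gaussian Mixture Models (GMM)
- Accuracy: ~80% (a rough, indicative figure; real results depend on the dataset and recording conditions)
Deep Learning Era
RNN/LSTM:
- Process audio as a sequence
- Accuracy: ~90% (indicative)
CNN + Attention:
- WaveNet (Google), a convolutional model best known for speech synthesis that has also been applied to recognition
- Accuracy: ~95% (indicative)
Transformer-Based:
- Whisper (OpenAI)
- Wav2Vec 2.0 (Meta)
- Accuracy: ~98% (indicative; best case on clean audio)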
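ASR accuracy is usually reported through word error rate (WER), with accuracy being roughly 1 minus WER. The article does not say how its percentages were measured, so the following is only a minimal sketch of the standard metric: word-level edit distance divided by the number of reference words. The two sentences are made-up test data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one previous row
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = "turn on the kitchen lights"
hypothesis = "turn on kitchen light"  # one deletion, one substitution
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, accuracy: {1 - wer:.2f}")
```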
Whisper by OpenAI
A state-of-the-art, open-source ASR model.
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe
result = model.transcribe("audio.mp3")
print(result["text"])
Model sizes:
| Model | Parameters | Required VRAM | Relative speed | Accuracy |
|---|---|---|---|---|
| tiny | 39 M | ~1 GB | ~32x | Good |
| base | 74 M | ~1 GB | ~16x | Better |
| small | 244 M | ~2 GB | ~6x | Strong |
| medium | 769 M | ~5 GB | ~2x | Robust |
| large | 1550 M | ~10 GB | 1x | Best |
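One way to use this table in practice is to pick the largest model that fits the available GPU memory. The helper below is a hypothetical convenience function, not part of the whisper package, and the VRAM figures are the approximate values from the table.

```python
from typing import Optional

# (model name, approximate VRAM in GB, relative speed), taken from the table
WHISPER_MODELS = [
    ("tiny", 1, 32),
    ("base", 1, 16),
    ("small", 2, 6),
    ("medium", 5, 2),
    ("large", 10, 1),
]


def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits the budget."""
    fitting = [name for name, vram, _ in WHISPER_MODELS if vram <= available_vram_gb]
    return fitting[-1] if fitting else None


for budget in (0.5, 1, 4, 16):
    print(f"{budget} GB -> {pick_model(budget)}")
```

The returned name can be passed straight to `whisper.load_model(...)`. Since larger models are slower, a latency-sensitive application may prefer a smaller model than the budget allows.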
Applications of Speech Recognition
🏠 Virtual Assistants
- Google Assistant, Siri, Alexa
- Voice commands for smart homes
📝 Transcription Services
- Otter.ai, Rev.com, Descript
- Automatic meeting transcription
- Subtitle generation
📞 Customer Service
- IVR (Interactive Voice Response)
- Call center analytics
🎮 Gaming
- In-game voice commands
- Real-time translation for multiplayer
♿ Accessibility
- Speech-to-text for people with disabilities
- Voice control for hands-free operation
Text-to-Speech (TTS): Text → Speech
Text-to-Speech is the technology that converts written text into human-sounding speech.
The Evolution of TTS
1. Concatenative TTS (1970-2000)
Cut recordings of human speech into pieces and join them back together.
"Hello" + "world" = "Hello world"
Pros: Natural-sounding within each recorded unit
Cons: Choppy or robotic at the joins, inflexible
2. Parametric TTS (2000-2015)
Statistical models (HMM) generate the speech.
Examples: Festival TTS (with HMM-based voices); eSpeak is often listed here too, although it uses formant synthesis rather than HMMs
Pros: Flexible
Cons: Still robotic
3. Neural TTS (2015-present)
Deep learning for highly realistic speech.
Examples: Google WaveNet, Amazon Polly, ElevenLabs
Neural TTS Architecture
Text → Text Analysis → Linguistic Features
↓
Acoustic Model
↓
Spectrogram/Mel-spectrogram
↓
Vocoder (Waveform)
↓
Audio Output
1. Text Analysis
Text normalization:
"12:30 PM" → "twelve thirty P M"
"2024" → "two thousand twenty four"
"Dr. Smith" → "Doctor Smith"
2. Acoustic Model
Generates a spectrogram from the text.
Tacotron 2:
- Seq2seq with attention
- Output: Mel-spectrogram
3. Vocoder
Converts the spectrogram into an audio waveform.
WaveNet:
- Autoregressive neural network
- Very natural-sounding speech
- Slow (real-time synthesis is challenging)
HiFi-GAN:
- GAN-based vocoder
- Real-time, high quality
Popular Modern TTS Tools
Google Cloud Text-to-Speech
from google.cloud import texttospeech
# Requires Google Cloud credentials to be configured in the environment
client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="Hello, world!")
# Choose the language and voice gender
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)
# The response carries the encoded audio as raw bytes
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)
ElevenLabs (Voice Cloning)
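Clone a voice from a few minutes of sample audio!
# Note: the top-level generate() helper comes from older releases of the
# elevenlabs package; newer releases route calls through a client object.
from elevenlabs import generate, play
audio = generate(
    text="Hello, this voice was cloned by AI!",
    voice="Bella",  # Or a custom cloned voice
    model="eleven_multilingual_v2"
)
play(audio)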
Coqui TTS (Open Source)
pip install TTS
# List available models
tts --list_models
# Generate speech
tts --text "Hello world" --model_name tts_models/en/ljspeech/tacotron2-DDC --out_path output.wav
Variations in TTS
Multilingual TTS
Coverage across many languages (counts change as providers add voices):
- Google Cloud: 40+ languages
- Amazon Polly: 30+ languages
- ElevenLabs: 29 languages
Voice Styles
- Neutral: Factual information
- Conversational: Chatbots
- News: Broadcasting
- Emotional: Joy, sadness, anger
SSML (Speech Synthesis Markup Language)
Controls prosody and emphasis:
<speak>
Hello <break time="500ms"/>
<emphasis level="strong">world</emphasis>!
<prosody rate="slow" pitch="high">
How are you?
</prosody>
</speak>
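When SSML is assembled from user-supplied text, building it with an XML library rather than string concatenation keeps the markup well-formed and escapes characters such as `&`. Here is a minimal sketch using Python's standard library; the function name and its parameters are hypothetical, and the elements mirror the snippet above.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, emphasized: str, question: str) -> str:
    """Build a <speak> document with a pause, an emphasis, and a prosody change."""
    speak = ET.Element("speak")
    speak.text = greeting + " "
    ET.SubElement(speak, "break", time="500ms")
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = emphasized
    emphasis.tail = "! "  # text that follows the closing </emphasis> tag
    prosody = ET.SubElement(speak, "prosody", rate="slow", pitch="high")
    prosody.text = question
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello", "world", "How are you?")
print(ssml)
```

The resulting string can be sent to any engine that accepts SSML input, for example through the `ssml` field of a synthesis request instead of the plain `text` field.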
End-to-End Speech AI Pipeline
A Complete Voice Assistant
import speech_recognition as sr
import pyttsx3
from openai import OpenAI
# Initialize (OpenAI() reads the OPENAI_API_KEY environment variable)
recognizer = sr.Recognizer()
tts_engine = pyttsx3.init()
llm_client = OpenAI()
# 1. Speech Recognition
with sr.Microphone() as source:
    audio = recognizer.listen(source)
text = recognizer.recognize_google(audio)
# 2. Process (LLM), using the openai>=1.0 client interface
response = llm_client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": text}],
)
reply = response.choices[0].message.content
# 3. Text-to-Speech
tts_engine.say(reply)
tts_engine.runAndWait()
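The same three stages can also be wired together as plain functions, which makes the flow testable without a microphone, speakers, or API keys. The sketch below is one possible structure, not the only one: `run_turn` and its parameter names are hypothetical, and the lambdas are stand-ins for the real recognizer, LLM call, and TTS engine.

```python
from typing import Callable


def run_turn(
    listen: Callable[[], str],
    think: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one assistant turn: speech recognition -> LLM -> text-to-speech."""
    heard = listen().strip()
    if heard:
        reply = think(heard)
    else:
        # Nothing was recognized, so skip the LLM call entirely
        reply = "Sorry, I did not catch that."
    speak(reply)
    return reply


# Stand-ins for the real components, so the flow runs anywhere
spoken = []
reply = run_turn(
    listen=lambda: "what is the weather today",
    think=lambda text: f"You asked: {text}",
    speak=spoken.append,
)
print(reply)
```

In a real assistant, `listen` would wrap the recognizer call, `think` the chat completion request, and `speak` the TTS engine.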
Challenges in Speech AI
⚠️ Noise and Environment
Background noise interferes with recognition.
Solutions:
- Noise cancellation algorithms
- Beamforming microphones
- Speaker diarization (separating multiple speakers)
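A much simpler relative of the techniques listed above is energy-based voice activity detection: frames whose loudness stays below a threshold are treated as background and skipped before recognition. The sketch below illustrates that idea on a synthetic signal; the frame length, threshold, and signal are assumptions, and it is not a substitute for real noise cancellation or beamforming.

```python
import math
import random


def frame_rms(samples: list[float], frame_length: int) -> list[float]:
    """Root-mean-square energy of each non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_length]) / frame_length)
        for i in range(0, len(samples) - frame_length + 1, frame_length)
    ]


def detect_speech(
    samples: list[float], frame_length: int = 160, threshold: float = 0.1
) -> list[bool]:
    """Flag frames whose energy exceeds the (assumed) threshold."""
    return [rms > threshold for rms in frame_rms(samples, frame_length)]


random.seed(0)
SAMPLE_RATE = 8000  # Hz (assumed)
noise = [random.uniform(-0.02, 0.02) for _ in range(480)]
# A 200 Hz tone plus light noise stands in for a voiced speech segment
speech = [
    0.5 * math.sin(2.0 * math.pi * 200.0 * t / SAMPLE_RATE) + random.uniform(-0.02, 0.02)
    for t in range(480)
]
signal = noise + speech + noise
flags = detect_speech(signal)
print(flags)
```

A fixed threshold breaks down when the background is loud or changes over time, which is why practical systems estimate the noise level adaptively or use a trained detector.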
⚠️ Accent and Dialect
Models are often trained mostly on a “standard accent”.
Solutions:
- Diverse training data
- Accent adaptation
- User-specific fine-tuning
⚠️ Code-Switching
Mixing languages within one sentence: “Let’s meeting di kantor jam 3 sore” (English and Indonesian, roughly “Let’s meet at the office at 3 pm”)
Solutions:
- Multilingual models
- Language identification
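To show what word-level language identification looks like on the example sentence, here is a deliberately naive sketch based on lookup tables. The two word lists are tiny and hypothetical; real systems use trained classifiers and have to cope with words that exist in both languages (such as "sore").

```python
# Tiny hypothetical word lists; a real system would use a trained classifier
ENGLISH_WORDS = {"let's", "meeting", "the", "at", "office"}
INDONESIAN_WORDS = {"di", "kantor", "jam", "sore", "pagi"}


def tag_languages(sentence: str) -> list[tuple[str, str]]:
    """Label each word as 'en', 'id', or 'unk' using the lookup tables."""
    tagged = []
    # Normalize case and curly apostrophes before looking words up
    for token in sentence.lower().replace("’", "'").split():
        word = token.strip(".,!?")
        if word in ENGLISH_WORDS:
            label = "en"
        elif word in INDONESIAN_WORDS:
            label = "id"
        else:
            label = "unk"
        tagged.append((word, label))
    return tagged


for word, label in tag_languages("Let’s meeting di kantor jam 3 sore"):
    print(f"{word:10s} {label}")
```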
⚠️ Latency
Real-time interaction generally needs end-to-end latency below roughly 200 ms.
Solutions:
- Model quantization
- Edge deployment
- Streaming inference
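Streaming inference means feeding the model small chunks as they arrive rather than waiting for the whole recording. The sketch below shows the chunking side, together with the real-time factor (processing time divided by audio duration) that is commonly used to check whether a model keeps up. The chunk size and sample rate are assumptions for illustration.

```python
from typing import Iterator


def stream_chunks(
    samples: list[float], sample_rate: int, chunk_ms: int
) -> Iterator[list[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    chunk_size = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means the model processes audio faster than it arrives."""
    return processing_seconds / audio_seconds


audio = [0.0] * 16000  # one second of silence at 16 kHz (assumed sample rate)
chunks = list(stream_chunks(audio, sample_rate=16000, chunk_ms=100))
print(len(chunks), "chunks of", len(chunks[0]), "samples")
print("RTF:", real_time_factor(processing_seconds=0.05, audio_seconds=0.1))
```

Smaller chunks reduce the delay before the first result but give the model less context per step, so the chunk size is a trade-off between latency and accuracy.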
⚠️ Privacy
Voice is sensitive biometric data.
Solutions:
- On-device processing (Edge AI)
- Federated learning
- Data encryption
Futuristic Applications of Speech AI
🔮 Real-Time Translation
Meta’s Universal Speech Translator:
- Hokkien (a primarily oral language) → English in real time
- Breaks down language barriers
🔮 Voice Cloning
Cloning a voice from as little as about 3 seconds of sample audio, as some research systems report:
- Personalized audiobooks
- Voice preservation (for ALS patients)
- Film dubbing in the original actor’s voice
🔮 Emotion-Aware TTS
Speech that adapts to context:
- Empathetic customer service
- Expressive storytelling
🔮 Silent Speech Interface
Decodes intended speech from muscle movements rather than from sound:
- Speaking without making a sound
- Subvocalization detection
Tools and Resources
Speech Recognition
| Tool | Type | Pricing |
|---|---|---|
| Whisper (OpenAI) | Open source | Free |
| Google Speech-to-Text | Cloud API | Pay-per-use |
| AWS Transcribe | Cloud API | Pay-per-use |
| Azure Speech Services | Cloud API | Pay-per-use |
| DeepSpeech (Mozilla) | Open source | Free |
Text-to-Speech
| Tool | Type | Pricing |
|---|---|---|
| Coqui TTS | Open source | Free |
| ElevenLabs | API/Platform | Freemium |
| Google Cloud TTS | Cloud API | Pay-per-use |
| Amazon Polly | Cloud API | Pay-per-use |
| Microsoft Azure TTS | Cloud API | Pay-per-use |
Conclusion
Speech AI has changed how we interact with technology. From automatic transcription to natural-sounding voice assistants, all of it is made possible by advances in deep learning.
Key takeaways:
- ASR: Speech → MFCC → Acoustic Model → Language Model → Text
- TTS: Text → Linguistic Features → Acoustic Model → Vocoder → Speech
- Modern models: Transformer-based (Whisper), Neural TTS (WaveNet, Tacotron)
- Applications: Virtual assistants, transcription, accessibility, translation
- Challenges: Noise, accent, latency, privacy
Next step: Try Whisper for audio transcription, or ElevenLabs to generate natural-sounding AI speech!
Have you tried voice AI? Or do you have a cool app idea that uses speech recognition? Share your experience!