Soniox Pro SDK - Professional Python SDK for Speech-to-Text

Features

⚡ Blazing Fast
Optimised HTTP client with connection pooling and async support

🔒 Type Safe
Full Pydantic v2 models with strict mypy compliance

🎯 Production Ready
Comprehensive error handling, retries, and logging

🌐 Real-time
WebSocket support for streaming transcription

🌍 60+ Languages
Support for multilingual transcription and translation

🎤 Speaker Diarization
Identify and separate multiple speakers

Quick Start

# Install with uv (recommended)
uv add soniox-pro-sdk

# Or with pip
pip install soniox-pro-sdk

from soniox import SonioxClient

# Initialize client
client = SonioxClient(api_key="your-api-key")

# Upload audio file
file_id = client.files.upload("audio.mp3")

# Create transcription
transcription = client.transcriptions.create(
    file_id=file_id,
    model="stt-async-v3",  # file-based jobs use an async model; "stt-rt-v3" is the real-time one
    enable_speaker_diarization=True,
)

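# Get results (the job runs asynchronously, so repeat this call until
# the status is "completed" before reading the text)
result = client.transcriptions.get(transcription.id)
print(result.text)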

SDK Architecture

graph TB
    subgraph "Client Layer"
        Client[SonioxClient]
        AsyncClient[SonioxAsyncClient]
        RTClient[SonioxRealtimeClient]
    end

    subgraph "API Resources"
        Files[FilesAPI]
        Trans[TranscriptionsAPI]
        Models[ModelsAPI]
        Auth[AuthAPI]
    end

    subgraph "Core Components"
        Config[SonioxConfig]
        Types[Pydantic Types]
        Errors[Custom Exceptions]
        Utils[Utilities]
    end

    subgraph "Transport Layer"
        HTTP[httpx Client]
        WS[WebSocket Client]
        Pool[Connection Pool]
    end

    Client --> Files
    Client --> Trans
    Client --> Models
    Client --> Auth

    AsyncClient -.-> Files
    AsyncClient -.-> Trans

    RTClient --> WS

    Files --> HTTP
    Trans --> HTTP
    Models --> HTTP
    Auth --> HTTP

    HTTP --> Pool

    Client --> Config
    Client --> Types
    Client --> Errors
    Client --> Utils

    style Client fill:#3b82f6,stroke:#2563eb,color:#fff
    style AsyncClient fill:#3b82f6,stroke:#2563eb,color:#fff
    style RTClient fill:#3b82f6,stroke:#2563eb,color:#fff
    style HTTP fill:#10b981,stroke:#059669,color:#fff
    style WS fill:#10b981,stroke:#059669,color:#fff

REST API Flow

sequenceDiagram
    participant App as Your Application
    participant SDK as Soniox SDK
    participant API as Soniox API

    App->>SDK: client.files.upload(audio)
    SDK->>API: POST /files
    API-->>SDK: {file_id, status}
    SDK-->>App: file_id

    App->>SDK: client.transcriptions.create(file_id)
    SDK->>API: POST /transcriptions
    API-->>SDK: {id, status: "processing"}
    SDK-->>App: Transcription object

    loop Poll for completion
        App->>SDK: client.transcriptions.get(id)
        SDK->>API: GET /transcriptions/{id}
        API-->>SDK: {id, status, text?, ...}
        SDK-->>App: Transcription object
    end

    Note over App,API: Status changes to "completed"

    App->>SDK: result.text
    SDK-->>App: Transcribed text with metadata
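
The polling loop in the flow above can be sketched as a small helper. This is a minimal sketch, not part of the SDK: `poll_until_done`, its parameters, and the backoff values are hypothetical, and the only status names taken from the diagram are "processing" and "completed". Pass any additional terminal status (for example a failure status) through `done_statuses`.

```python
import time
from typing import Any, Callable, Tuple


def poll_until_done(
    fetch: Callable[[], Any],
    *,
    done_statuses: Tuple[str, ...] = ("completed",),
    interval_s: float = 1.0,
    max_interval_s: float = 10.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fetch() until its result reaches a terminal status.

    fetch is any zero-argument callable returning an object with a
    .status attribute (hypothetical helper, not an SDK function).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result.status in done_statuses:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"status is still {result.status!r} after {timeout_s} s"
            )
        sleep(interval_s)
        # Double the wait each round, capped, so long jobs do not hammer the API.
        interval_s = min(interval_s * 2, max_interval_s)
```

With the client from the Quick Start, usage would look like `result = poll_until_done(lambda: client.transcriptions.get(transcription.id))`.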

Real-time WebSocket Flow

sequenceDiagram
    participant App as Your Application
    participant SDK as Soniox SDK
    participant WS as WebSocket Server

    App->>SDK: client = SonioxRealtimeClient(config)
    App->>SDK: with client.stream() as stream
    SDK->>WS: WebSocket Connect
    WS-->>SDK: Connection Established

    SDK->>WS: Send Config Message
    WS-->>SDK: Acknowledge

    loop Stream audio chunks
        App->>SDK: stream.send_audio(chunk)
        SDK->>WS: Binary Audio Data
        WS-->>SDK: RealtimeToken (interim)
        SDK-->>App: Yield token
    end

    App->>SDK: stream.finalize()
    SDK->>WS: Finalize Message
    WS-->>SDK: RealtimeToken (final)
    SDK-->>App: Final tokens

    App->>SDK: Exit context manager
    SDK->>WS: Close Connection
    WS-->>SDK: Connection Closed
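
The "Stream audio chunks" loop above needs the audio cut into pieces before each `stream.send_audio(chunk)` call. A minimal sketch of that step for raw PCM follows. The helper name `iter_pcm_chunks` and the defaults (16 kHz, 16-bit, mono, 100 ms chunks) are assumptions for illustration, not values documented by the SDK.

```python
from typing import Iterator


def iter_pcm_chunks(
    audio: bytes,
    chunk_ms: int = 100,
    sample_rate: int = 16_000,
    sample_width: int = 2,
    channels: int = 1,
) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM, each about chunk_ms long.

    Chunk boundaries fall on whole frames so that no sample is split.
    """
    frames_per_chunk = sample_rate * chunk_ms // 1000
    chunk_size = frames_per_chunk * sample_width * channels
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(audio), chunk_size):
        # The last slice may be shorter than chunk_size.
        yield audio[start:start + chunk_size]
```

Inside `with client.stream() as stream:`, each yielded chunk would be passed to `stream.send_audio(chunk)`, followed by a single `stream.finalize()` once the iterator is exhausted.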

Type System

classDiagram
    class Token {
        +text string
        +start_ms int
        +end_ms int
        +confidence float
        +is_final bool
        +speaker string
        +language string
    }

    class Transcription {
        +id string
        +status TranscriptionStatus
        +text string
        +tokens List
        +model string
        +duration_ms float
        +created_at datetime
    }

    class TranscriptionConfig {
        +model string
        +enable_speaker_diarization bool
        +enable_streaming bool
        +translation TranslationConfig
        +context ContextConfig
    }

    class TranslationConfig {
        <<interface>>
    }

    class OneWayTranslationConfig {
        +type string
        +target_language string
    }

    class TwoWayTranslationConfig {
        +type string
        +language_a string
        +language_b string
    }

    class ContextConfig {
        +general List
        +text string
        +terms List
    }

    TranslationConfig <|-- OneWayTranslationConfig
    TranslationConfig <|-- TwoWayTranslationConfig
    Transcription --> Token
    TranscriptionConfig --> TranslationConfig
    TranscriptionConfig --> ContextConfig
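
The `TranslationConfig` relationship above is a tagged union: the `type` field selects the concrete class. The SDK models are Pydantic v2, where this would typically be a discriminated union. The sketch below uses stdlib dataclasses as a stand-in so it runs without dependencies. The discriminator values `"one_way"` and `"two_way"` and the `parse_translation` helper are assumptions, not confirmed by the diagram.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class OneWayTranslationConfig:
    target_language: str
    type: str = "one_way"  # assumed discriminator value


@dataclass(frozen=True)
class TwoWayTranslationConfig:
    language_a: str
    language_b: str
    type: str = "two_way"  # assumed discriminator value


TranslationConfig = Union[OneWayTranslationConfig, TwoWayTranslationConfig]


def parse_translation(data: Dict[str, Any]) -> TranslationConfig:
    """Pick the concrete config class from the "type" field (hypothetical helper)."""
    kind = data.get("type")
    if kind == "one_way":
        return OneWayTranslationConfig(target_language=data["target_language"])
    if kind == "two_way":
        return TwoWayTranslationConfig(
            language_a=data["language_a"],
            language_b=data["language_b"],
        )
    raise ValueError(f"unknown translation type: {kind!r}")
```

Dispatching on one explicit field keeps the two shapes from being confused when a payload is incomplete: a missing `language_b` raises a `KeyError` instead of silently producing a one-way config.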

Error Handling Hierarchy

graph TB
    Base[SonioxError]

    API[SonioxAPIError]
    Auth[SonioxAuthenticationError]
    NotFound[SonioxNotFoundError]
    Rate[SonioxRateLimitError]
    Validation[SonioxValidationError]
    Timeout[SonioxTimeoutError]
    Network[SonioxNetworkError]
    Config[SonioxConfigError]

    Base --> API
    Base --> Config
    Base --> Network
    Base --> Timeout

    API --> Auth
    API --> NotFound
    API --> Rate
    API --> Validation

    style Base fill:#ef4444,stroke:#dc2626,color:#fff
    style API fill:#f59e0b,stroke:#d97706,color:#fff
    style Auth fill:#f59e0b,stroke:#d97706,color:#fff
    style NotFound fill:#f59e0b,stroke:#d97706,color:#fff
    style Rate fill:#f59e0b,stroke:#d97706,color:#fff
    style Validation fill:#f59e0b,stroke:#d97706,color:#fff
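
Because the exceptions form a tree, the order of `except` clauses matters: the most specific class has to come first. The sketch below defines local stand-in classes with the names and parent links from the diagram (the real import path is not shown in this document, so none is assumed) and a hypothetical `describe` function that maps each branch to a handling decision.

```python
class SonioxError(Exception):
    """Root of the hierarchy."""


class SonioxAPIError(SonioxError):
    """Error reported by the API."""


class SonioxAuthenticationError(SonioxAPIError):
    pass


class SonioxNotFoundError(SonioxAPIError):
    pass


class SonioxRateLimitError(SonioxAPIError):
    pass


class SonioxValidationError(SonioxAPIError):
    pass


class SonioxTimeoutError(SonioxError):
    pass


class SonioxNetworkError(SonioxError):
    pass


class SonioxConfigError(SonioxError):
    pass


def describe(exc: SonioxError) -> str:
    """Map an exception to a handling decision, most specific class first."""
    try:
        raise exc
    except SonioxRateLimitError:
        return "back off and retry"
    except SonioxAPIError:
        return "api error, do not retry"
    except (SonioxTimeoutError, SonioxNetworkError):
        return "transport problem, retry"
    except SonioxError:
        return "other sdk error"
```

If the `SonioxAPIError` clause were listed before `SonioxRateLimitError`, the rate-limit branch would never run, since a rate-limit error is also an API error.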

Capabilities

Transcription

✓ Async file-based transcription
✓ Real-time streaming transcription
✓ Word-level timestamps
✓ Confidence scores per token
✓ Multiple audio formats (MP3, WAV, PCM)

Speaker Features

✓ Speaker diarization (who spoke when)
✓ Multi-speaker support
✓ Speaker identification
✓ Speaker change detection
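
Diarization output arrives as per-token speaker labels, so "who spoke when" is usually rebuilt by merging consecutive tokens that share a speaker. A minimal sketch follows. The local `Token` dataclass mirrors a subset of the fields in the Type System diagram, `speaker_turns` is a hypothetical helper, and the sketch assumes each token's text carries its own leading whitespace.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, List, Tuple


@dataclass
class Token:
    # Subset of the Token fields from the Type System diagram.
    text: str
    start_ms: int
    end_ms: int
    speaker: str


def speaker_turns(tokens: Iterable[Token]) -> List[Tuple[str, int, int, str]]:
    """Merge consecutive same-speaker tokens into (speaker, start_ms, end_ms, text)."""
    turns = []
    # groupby starts a new group every time the speaker label changes.
    for speaker, group in groupby(tokens, key=lambda t: t.speaker):
        run = list(group)
        text = "".join(t.text for t in run).strip()
        turns.append((speaker, run[0].start_ms, run[-1].end_ms, text))
    return turns
```

A speaker who talks, is interrupted, and resumes produces two separate turns, which preserves the order of the conversation.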

Translation

✓ One-way translation (source → target)
✓ Two-way translation (bidirectional)
✓ Real-time translation
✓ 60+ language support
✓ Language auto-detection

Customisation

✓ Custom vocabulary
✓ Domain-specific context
✓ Custom terminology
✓ Context hints for accuracy
✓ Endpoint detection