Soniox Pro SDK

Professional Python SDK for the Soniox Speech-to-Text API

Features

Blazing Fast

Optimised HTTP client with connection pooling and async support

🔒

Type Safe

Full Pydantic v2 models with strict mypy compliance

🎯

Production Ready

Comprehensive error handling, retries, and logging

🌐

Real-time

WebSocket support for streaming transcription

🌍

60+ Languages

Support for multilingual transcription and translation

🎤

Speaker Diarization

Identify and separate multiple speakers
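
The "Production Ready" card mentions retries without saying how they are scheduled. As a rough illustration only, here is a generic exponential-backoff helper. `TransientError`, `with_retries`, and the default delays are hypothetical names and values chosen for this sketch; they are not taken from the SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure (timeout, HTTP 429/5xx)."""


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The base delay doubles on every attempt; random jitter keeps
            # many clients from retrying at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.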

Quick Start

# Install with uv (recommended)
uv add soniox-pro-sdk

# Or with pip
pip install soniox-pro-sdk

from soniox import SonioxClient

# Initialize client
client = SonioxClient(api_key="your-api-key")

# Upload audio file
file_id = client.files.upload("audio.mp3")

# Create transcription
transcription = client.transcriptions.create(
    file_id=file_id,
    model="stt-rt-v3",
    enable_speaker_diarization=True
)

# Get results (transcription is asynchronous: call get() again until status is "completed")
result = client.transcriptions.get(transcription.id)
print(result.text)

SDK Architecture

graph TB
    subgraph "Client Layer"
        Client[SonioxClient]
        AsyncClient[SonioxAsyncClient]
        RTClient[SonioxRealtimeClient]
    end

    subgraph "API Resources"
        Files[FilesAPI]
        Trans[TranscriptionsAPI]
        Models[ModelsAPI]
        Auth[AuthAPI]
    end

    subgraph "Core Components"
        Config[SonioxConfig]
        Types[Pydantic Types]
        Errors[Custom Exceptions]
        Utils[Utilities]
    end

    subgraph "Transport Layer"
        HTTP[httpx Client]
        WS[WebSocket Client]
        Pool[Connection Pool]
    end

    Client --> Files
    Client --> Trans
    Client --> Models
    Client --> Auth

    AsyncClient -.-> Files
    AsyncClient -.-> Trans

    RTClient --> WS

    Files --> HTTP
    Trans --> HTTP
    Models --> HTTP
    Auth --> HTTP

    HTTP --> Pool

    Client --> Config
    Client --> Types
    Client --> Errors
    Client --> Utils

    style Client fill:#3b82f6,stroke:#2563eb,color:#fff
    style AsyncClient fill:#3b82f6,stroke:#2563eb,color:#fff
    style RTClient fill:#3b82f6,stroke:#2563eb,color:#fff
    style HTTP fill:#10b981,stroke:#059669,color:#fff
    style WS fill:#10b981,stroke:#059669,color:#fff
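
One way to read the diagram above: the client is a thin facade whose resource objects all share a single transport. The sketch below illustrates that shape with stand-in classes. `RecordingTransport` and `SketchClient` are hypothetical, and the `FilesAPI` and `TranscriptionsAPI` bodies here are local stubs, not the SDK's implementations.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RecordingTransport:
    """Stand-in for the pooled HTTP layer: records requests instead of sending them."""

    requests: list[tuple[str, str]] = field(default_factory=list)

    def request(self, method: str, path: str) -> dict[str, Any]:
        self.requests.append((method, path))
        return {"method": method, "path": path}


class FilesAPI:
    """Local stub of a resource object; only knows its own endpoints."""

    def __init__(self, transport: RecordingTransport) -> None:
        self._transport = transport

    def upload(self, filename: str) -> dict[str, Any]:
        return self._transport.request("POST", "/files")


class TranscriptionsAPI:
    """Local stub of a second resource object using the same transport."""

    def __init__(self, transport: RecordingTransport) -> None:
        self._transport = transport

    def get(self, transcription_id: str) -> dict[str, Any]:
        return self._transport.request("GET", f"/transcriptions/{transcription_id}")


class SketchClient:
    """Facade: builds one transport and hands it to every resource object."""

    def __init__(self) -> None:
        self.transport = RecordingTransport()
        self.files = FilesAPI(self.transport)
        self.transcriptions = TranscriptionsAPI(self.transport)
```

Sharing one transport is what lets connection pooling benefit every resource, and swapping in a recording transport makes the client testable offline.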
                

REST API Flow

sequenceDiagram
    participant App as Your Application
    participant SDK as Soniox SDK
    participant API as Soniox API

    App->>SDK: client.files.upload(audio)
    SDK->>API: POST /files
    API-->>SDK: {file_id, status}
    SDK-->>App: file_id

    App->>SDK: client.transcriptions.create(file_id)
    SDK->>API: POST /transcriptions
    API-->>SDK: {id, status: "processing"}
    SDK-->>App: Transcription object

    loop Poll for completion
        App->>SDK: client.transcriptions.get(id)
        SDK->>API: GET /transcriptions/{id}
        API-->>SDK: {id, status, text?, ...}
        SDK-->>App: Transcription object
    end

    Note over App,API: Status changes to "completed"

    App->>SDK: result.text
    SDK-->>App: Transcribed text with metadata
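
The polling loop in the diagram above could be written roughly as follows. This is a sketch under stated assumptions: `wait_for_completion` is a hypothetical helper, not an SDK function, and `"error"` as a failure status is a guess (only `"processing"` and `"completed"` appear in the flow above).

```python
import time
from typing import Any, Callable


def wait_for_completion(
    fetch: Callable[[], Any],
    interval: float = 1.0,
    timeout: float = 300.0,
    failed_statuses: tuple[str, ...] = ("error",),  # assumed failure value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Call `fetch` until the returned object reports status "completed"."""
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result.status == "completed":
            return result
        if result.status in failed_statuses:
            raise RuntimeError(f"transcription failed with status {result.status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"not completed after {timeout} seconds")
        sleep(interval)  # wait before the next GET
```

With the Quick Start objects this would be called as `wait_for_completion(lambda: client.transcriptions.get(transcription.id))`.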
                

Real-time WebSocket Flow

sequenceDiagram
    participant App as Your Application
    participant SDK as Soniox SDK
    participant WS as WebSocket Server

    App->>SDK: client = SonioxRealtimeClient(config)
    App->>SDK: with client.stream() as stream
    SDK->>WS: WebSocket Connect
    WS-->>SDK: Connection Established

    SDK->>WS: Send Config Message
    WS-->>SDK: Acknowledge

    loop Stream audio chunks
        App->>SDK: stream.send_audio(chunk)
        SDK->>WS: Binary Audio Data
        WS-->>SDK: RealtimeToken (interim)
        SDK-->>App: Yield token
    end

    App->>SDK: stream.finalize()
    SDK->>WS: Finalize Message
    WS-->>SDK: RealtimeToken (final)
    SDK-->>App: Final tokens

    App->>SDK: Exit context manager
    SDK->>WS: Close Connection
    WS-->>SDK: Connection Closed
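
To make the ordering of the steps above concrete, here is an in-memory simulation of the session lifecycle. Nothing in it talks to a real server: `FakeStream`, `SketchToken`, `open_stream`, and `chunked` are hypothetical stand-ins, and only the method names `send_audio` and `finalize` come from the diagram.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SketchToken:
    text: str
    is_final: bool


class FakeStream:
    """In-memory stand-in for a WebSocket session; logs each protocol step."""

    def __init__(self) -> None:
        # Connecting and sending the config message happen before any audio.
        self.log: list[str] = ["connect", "config"]
        self._chunks = 0

    def send_audio(self, chunk: bytes) -> SketchToken:
        self._chunks += 1
        self.log.append(f"audio:{len(chunk)}")
        return SketchToken(text=f"chunk-{self._chunks}", is_final=False)

    def finalize(self) -> list[SketchToken]:
        self.log.append("finalize")
        return [SketchToken(text=f"{self._chunks} chunks", is_final=True)]

    def close(self) -> None:
        self.log.append("close")


@contextmanager
def open_stream() -> Iterator[FakeStream]:
    stream = FakeStream()
    try:
        yield stream
    finally:
        stream.close()  # runs even if the body raises


def chunked(data: bytes, size: int) -> Iterator[bytes]:
    """Split raw audio bytes into fixed-size chunks (the last may be shorter)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


with open_stream() as stream:
    interim = [stream.send_audio(chunk) for chunk in chunked(b"\x00" * 10, 4)]
    final = stream.finalize()
```

The context manager guarantees the close step, which is the main reason the flow above wraps the stream in a `with` block.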
                

Type System

classDiagram
    class Token {
        +text string
        +start_ms int
        +end_ms int
        +confidence float
        +is_final bool
        +speaker string
        +language string
    }

    class Transcription {
        +id string
        +status TranscriptionStatus
        +text string
        +tokens List
        +model string
        +duration_ms float
        +created_at datetime
    }

    class TranscriptionConfig {
        +model string
        +enable_speaker_diarization bool
        +enable_streaming bool
        +translation TranslationConfig
        +context ContextConfig
    }

    class TranslationConfig {
        <<interface>>
    }

    class OneWayTranslationConfig {
        +type string
        +target_language string
    }

    class TwoWayTranslationConfig {
        +type string
        +language_a string
        +language_b string
    }

    class ContextConfig {
        +general List
        +text string
        +terms List
    }

    TranslationConfig <|-- OneWayTranslationConfig
    TranslationConfig <|-- TwoWayTranslationConfig
    Transcription --> Token
    TranscriptionConfig --> TranslationConfig
    TranscriptionConfig --> ContextConfig
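
A compact way to picture part of the class diagram above is with plain dataclasses. The SDK itself uses Pydantic v2 models, so treat this as a stdlib stand-in: the field names follow the diagram, while the defaults, the validation rules, the `describe` helper, and the `"one_way"` / `"two_way"` discriminator values are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Token:
    text: str
    start_ms: int
    end_ms: int
    confidence: float
    is_final: bool = True
    speaker: Optional[str] = None
    language: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed sanity checks, similar in spirit to model validation.
        if self.end_ms < self.start_ms:
            raise ValueError("end_ms must not precede start_ms")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


@dataclass(frozen=True)
class OneWayTranslationConfig:
    target_language: str
    type: str = "one_way"  # assumed discriminator value


@dataclass(frozen=True)
class TwoWayTranslationConfig:
    language_a: str
    language_b: str
    type: str = "two_way"  # assumed discriminator value


# The "interface" in the diagram becomes a union of the two concrete shapes.
TranslationConfig = Union[OneWayTranslationConfig, TwoWayTranslationConfig]


@dataclass(frozen=True)
class TranscriptionConfig:
    model: str
    enable_speaker_diarization: bool = False
    translation: Optional[TranslationConfig] = None


def describe(translation: Optional[TranslationConfig]) -> str:
    """Summarise a translation setting by dispatching on its concrete type."""
    if translation is None:
        return "no translation"
    if isinstance(translation, OneWayTranslationConfig):
        return f"any -> {translation.target_language}"
    return f"{translation.language_a} <-> {translation.language_b}"
```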
                

Error Handling Hierarchy

graph TB
    Base[SonioxError]

    API[SonioxAPIError]
    Auth[SonioxAuthenticationError]
    NotFound[SonioxNotFoundError]
    Rate[SonioxRateLimitError]
    Validation[SonioxValidationError]
    Timeout[SonioxTimeoutError]
    Network[SonioxNetworkError]
    Config[SonioxConfigError]

    Base --> API
    Base --> Config
    Base --> Network
    Base --> Timeout

    API --> Auth
    API --> NotFound
    API --> Rate
    API --> Validation

    style Base fill:#ef4444,stroke:#dc2626,color:#fff
    style API fill:#f59e0b,stroke:#d97706,color:#fff
    style Auth fill:#f59e0b,stroke:#d97706,color:#fff
    style NotFound fill:#f59e0b,stroke:#d97706,color:#fff
    style Rate fill:#f59e0b,stroke:#d97706,color:#fff
    style Validation fill:#f59e0b,stroke:#d97706,color:#fff
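
The hierarchy above matters mostly for the order of `except` clauses: more specific classes must be caught first. The sketch below redefines the classes locally so it runs without the SDK. The class names mirror the diagram, but the constructor signature, the `status_code` attribute, and the HTTP status mapping are assumptions.

```python
class SonioxError(Exception):
    """Base class for every error in the diagram."""


class SonioxAPIError(SonioxError):
    def __init__(self, message: str, status_code: int) -> None:  # assumed signature
        super().__init__(message)
        self.status_code = status_code


class SonioxAuthenticationError(SonioxAPIError): ...
class SonioxNotFoundError(SonioxAPIError): ...
class SonioxRateLimitError(SonioxAPIError): ...
class SonioxValidationError(SonioxAPIError): ...
class SonioxTimeoutError(SonioxError): ...
class SonioxNetworkError(SonioxError): ...
class SonioxConfigError(SonioxError): ...


# Assumed mapping from HTTP status to exception class.
_STATUS_TO_ERROR = {
    400: SonioxValidationError,
    401: SonioxAuthenticationError,
    404: SonioxNotFoundError,
    429: SonioxRateLimitError,
}


def error_for_status(status_code: int, message: str) -> SonioxAPIError:
    """Pick the most specific API error for a status, else the generic one."""
    cls = _STATUS_TO_ERROR.get(status_code, SonioxAPIError)
    return cls(message, status_code)


def classify(exc: Exception) -> str:
    """Show handler ordering: subclasses first, the base class last."""
    try:
        raise exc
    except SonioxRateLimitError:
        return "back off and retry"
    except SonioxAPIError as err:
        return f"API error {err.status_code}"
    except SonioxError:
        return "client-side failure"
```

Catching `SonioxError` last gives one fallback for timeouts, network failures, and configuration problems alike.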
                

Capabilities

Transcription

  • ✓ Async file-based transcription
  • ✓ Real-time streaming transcription
  • ✓ Word-level timestamps
  • ✓ Confidence scores per token
  • ✓ Multiple audio formats (MP3, WAV, PCM)

Speaker Features

  • ✓ Speaker diarization (who spoke when)
  • ✓ Multi-speaker support
  • ✓ Speaker identification
  • ✓ Speaker change detection
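
"Who spoke when" can be derived from a diarized token list by merging consecutive tokens that share a speaker label. The following is a minimal sketch: `Tok` and `speaker_turns` are hypothetical names, and it assumes tokens arrive in time order and carry their own leading whitespace.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Tok:
    """Minimal token stand-in with only the fields this sketch needs."""

    text: str
    start_ms: int
    end_ms: int
    speaker: str


def speaker_turns(tokens: list[Tok]) -> list[tuple[str, int, int, str]]:
    """Merge consecutive same-speaker tokens into (speaker, start, end, text) turns."""
    turns = []
    # groupby only merges adjacent items, so a speaker who returns later
    # starts a new turn instead of being folded into an earlier one.
    for speaker, group in groupby(tokens, key=lambda t: t.speaker):
        items = list(group)
        text = "".join(t.text for t in items).strip()
        turns.append((speaker, items[0].start_ms, items[-1].end_ms, text))
    return turns
```

Each boundary between two turns is also a speaker change, so the same pass covers basic change detection.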

Translation

  • ✓ One-way translation (source → target)
  • ✓ Two-way translation (bidirectional)
  • ✓ Real-time translation
  • ✓ 60+ language support
  • ✓ Language auto-detection

Customisation

  • ✓ Custom vocabulary
  • ✓ Domain-specific context
  • ✓ Custom terminology
  • ✓ Context hints for accuracy
  • ✓ Endpoint detection