
Realtime API Reference

GET

wss://eu2.rt.speechmatics.com/v2/

Protocol overview

A basic Realtime session will have the following message exchanges:

  1. The client opens a WebSocket connection and sends StartRecognition.
  2. The server replies with RecognitionStarted.
  3. The client streams binary AddAudio chunks; the server acknowledges each with AudioAdded and emits AddPartialTranscript and AddTranscript messages as results become available.
  4. The client sends EndOfStream once all audio has been sent.
  5. The server sends any remaining transcripts, followed by EndOfTranscript, after which the connection can be closed.

WARNING

Browser-based transcription

When starting a Realtime transcription session in the browser, use temporary keys to avoid exposing your long-lived API key.

Because of a browser limitation (the WebSocket API cannot set custom headers such as Authorization), you must provide the temporary key as a query parameter. For example:

 wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
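As a sketch, the URL can be assembled as below. A browser session would use JavaScript's WebSocket API; Python is used here to match the handshake example later on this page. realtime_url is an illustrative helper, not part of any SDK.

```python
# Sketch: build the Realtime endpoint URL with a temporary key (JWT) passed
# as the `jwt` query parameter, as required for browser-based sessions.
from urllib.parse import urlencode

def realtime_url(temporary_key: str,
                 base: str = "wss://eu2.rt.speechmatics.com/v2") -> str:
    """Return the WebSocket URL with the temporary key URL-encoded into it."""
    return f"{base}?{urlencode({'jwt': temporary_key})}"

url = realtime_url("<temporary-key>")  # placeholder key, not a real value
```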

Handshake Responses

Successful Response

  • 101 Switching Protocols - Switch to WebSocket protocol

Here is an example for a successful WebSocket handshake:

GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1

A successful response should look like:

HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=

Malformed Request

A malformed handshake request will result in one of the following HTTP responses:

  • 400 Bad Request
  • 401 Unauthorized - when the API key is not valid
  • 405 Method Not Allowed - when the request method is not GET

Client Retry

Following a successful handshake and switch to the WebSocket protocol, the client may still receive an immediate error message and WebSocket close handshake from the server. For the following close codes only, we recommend that clients retry after an interval of at least 5-10 seconds:

  • 4005 quota_exceeded
  • 4013 job_error
  • 1011 internal_error
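The retry recommendation above can be sketched as follows. RETRYABLE_CLOSE_CODES and retry_delay are illustrative names; the jittered 5-10 second delay follows the recommendation in this section.

```python
# Sketch: decide whether (and how long) to wait before reconnecting after a
# WebSocket close. Only the close codes listed above warrant a retry.
import random
from typing import Optional

RETRYABLE_CLOSE_CODES = {4005, 4013, 1011}  # quota_exceeded, job_error, internal_error

def retry_delay(close_code: int) -> Optional[float]:
    """Return a delay in seconds before retrying, or None if not retryable."""
    if close_code in RETRYABLE_CLOSE_CODES:
        return random.uniform(5.0, 10.0)  # at least 5-10s, with jitter
    return None
```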

Message Handling

Each message that the Server accepts is a stringified JSON object with the following fields:

  • message (String): The name of the message being sent. All other fields depend on the value of message and are described below.

The messages sent by the Server to a Client are stringified JSON objects as well.

The only exception is AddAudio: a binary message sent from the Client to the Server containing a chunk of audio.
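As a sketch of this wire format, the helpers below (illustrative, not from any SDK) serialize a control message and reject binary frames when parsing, since binary frames are audio rather than JSON:

```python
# Sketch: every control message is stringified JSON with a "message" field;
# AddAudio alone is sent as a raw binary WebSocket frame.
import json

def encode_message(name: str, **fields) -> str:
    """Serialize a control message to the stringified-JSON wire format."""
    return json.dumps({"message": name, **fields})

def decode_message(payload) -> dict:
    """Parse a text frame from the server; binary frames are not JSON."""
    if isinstance(payload, (bytes, bytearray)):
        raise TypeError("binary frames carry audio, not JSON messages")
    return json.loads(payload)
```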

The following values of the message field are supported:

Sent messages

StartRecognition

Initiates a new recognition session.

  • message (string, required). Constant value: StartRecognition
  • audio_format (object, required). oneOf:
      • raw audio:
          • type (required). Constant value: raw
          • encoding (string, required). Possible values: [pcm_f32le, pcm_s16le, mulaw]
          • sample_rate (integer, required)
  • transcription_config (object, required)
      • language (string, required)
      • domain (string). Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical".
      • output_locale (string). Possible values: non-empty
      • additional_vocab (object[]). Array of (oneOf): string. Possible values: non-empty
      • diarization (string). Possible values: [none, speaker]
      • max_delay (number). Possible values: >= 0
      • max_delay_mode (string). Possible values: [flexible, fixed]
      • speaker_diarization_config (object)
          • max_speakers (integer). Possible values: >= 2 and <= 100
          • prefer_current_speaker (boolean)
          • speaker_sensitivity (float). Possible values: >= 0 and <= 1
      • audio_filtering_config (object)
          • volume_threshold (float). Possible values: >= 0 and <= 100
      • transcript_filtering_config (object)
          • remove_disfluencies (boolean)
          • replacements (array[]). A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
      • enable_partials (boolean). Default value: false
      • enable_entities (boolean). Default value: true
      • operating_point (string). Possible values: [standard, enhanced]
      • punctuation_overrides (object)
          • permitted_marks (string[]). The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process. Possible values: must match the regular expression ^(.|all)$
          • sensitivity (float). Ranges between zero and one; higher values will produce more punctuation. Default value: 0.5. Possible values: >= 0 and <= 1
      • conversation_config (object). This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
          • end_of_utterance_silence_trigger (float). Possible values: >= 0 and <= 2. Default value: 0
  • translation_config (object)
      • target_languages (string[], required)
      • enable_partials (boolean). Default value: false
  • audio_events_config (object)
      • types (string[])
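Putting the schema together, a minimal StartRecognition message might look like the following sketch. The language, encoding, and sample-rate values are illustrative choices, not defaults.

```python
# Sketch: a minimal StartRecognition message, sent as stringified JSON
# immediately after the WebSocket handshake.
import json

start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",   # one of pcm_f32le, pcm_s16le, mulaw
        "sample_rate": 16000,
    },
    "transcription_config": {
        "language": "en",
        "operating_point": "enhanced",
        "enable_partials": True,
        "max_delay": 2.0,
    },
}
payload = json.dumps(start_recognition)  # the string actually written to the socket
```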

AddAudio

A binary chunk of audio. The server confirms receipt by sending an AudioAdded message.

  • payload: string (binary). The raw audio bytes, with no JSON envelope.

EndOfStream

Declares that the client has no more audio to send.

  • message (required). Constant value: EndOfStream
  • last_seq_no (integer, required)
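The sequence-number bookkeeping can be sketched like this. AudioSender is an illustrative helper, and send stands for any function that writes one WebSocket frame; last_seq_no must equal the number of AddAudio frames sent.

```python
# Sketch: count outgoing AddAudio frames so that EndOfStream can report an
# accurate last_seq_no.
import json

class AudioSender:
    def __init__(self, send):
        self._send = send   # callable that writes one WebSocket frame
        self.seq_no = 0     # AddAudio frames sent so far

    def add_audio(self, chunk: bytes) -> None:
        """Send one binary AddAudio frame."""
        self._send(chunk)
        self.seq_no += 1

    def end_of_stream(self) -> None:
        """Declare that no more audio will follow."""
        self._send(json.dumps({"message": "EndOfStream",
                               "last_seq_no": self.seq_no}))
```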

SetRecognitionConfig

Allows the client to re-configure the recognition session.

  • message (required). Constant value: SetRecognitionConfig
  • transcription_config (object, required). Accepts the same fields as the transcription_config object of StartRecognition, described above.
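A minimal SetRecognitionConfig message might look like this sketch; the values shown are illustrative. Note that language is a required field of transcription_config even when it is unchanged.

```python
# Sketch: re-configure a live session mid-stream with SetRecognitionConfig.
import json

set_config = json.dumps({
    "message": "SetRecognitionConfig",
    "transcription_config": {
        "language": "en",        # required even when unchanged
        "max_delay": 1.0,
        "enable_partials": False,
    },
})
```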

Received messages

RecognitionStarted

Server response to StartRecognition, acknowledging that a recognition session has started.

  • message (required). Constant value: RecognitionStarted
  • orchestrator_version (string)
  • id (string)

AudioAdded

Server response to AddAudio, indicating that audio has been added successfully.

  • message (required). Constant value: AudioAdded
  • seq_no (integer, required)
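One possible use of these acknowledgements is simple flow control: compare the seq_no reported by AudioAdded against the number of chunks sent, and bound the gap. The helpers below are illustrative.

```python
# Sketch: track AudioAdded acknowledgements so unconfirmed audio is bounded.
def handle_audio_added(msg: dict, state: dict) -> None:
    """Record the seq_no from an AudioAdded acknowledgement."""
    if msg.get("message") == "AudioAdded":
        state["acked"] = msg["seq_no"]

def unacknowledged(sent_seq_no: int, acked_seq_no: int) -> int:
    """AddAudio frames sent but not yet confirmed by the server."""
    return sent_seq_no - acked_seq_no
```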

AddPartialTranscript

Contains a work-in-progress transcript of a part of the audio that the client has sent.

  • message (required). Constant value: AddPartialTranscript
  • format (string). Speechmatics JSON output format version number. Example: 2.1
  • metadata (object, required)
      • start_time (float, required)
      • end_time (float, required)
      • transcript (string, required)
  • results (object[], required). Array of:
      • type (string, required). Possible values: [word, punctuation]
      • start_time (float, required)
      • end_time (float, required)
      • channel (string)
      • attaches_to (string). Possible values: [next, previous, none, both]
      • is_eos (boolean)
      • alternatives (object[]). Array of:
          • content (string, required)
          • confidence (float, required)
          • language (string)
          • display (object)
              • direction (string, required). Possible values: [ltr, rtl]
          • speaker (string)
      • score (float). Possible values: >= 0 and <= 1
      • volume (float). Possible values: >= 0 and <= 100
AddTranscript

Contains the final transcript of a part of the audio that the client has sent.

  • message (required). Constant value: AddTranscript
  • format (string). Speechmatics JSON output format version number. Example: 2.1
  • metadata (object, required) and results (object[], required): same structure as in AddPartialTranscript, described above.
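A sketch of consuming an AddTranscript payload: metadata.transcript already carries the joined text, while the results array gives per-word timings. The sample message below is hand-written to match the schema, not actual server output.

```python
# Sketch: extract (word, start_time, end_time) tuples from an AddTranscript
# message, using the first alternative for each word result.
sample = {
    "message": "AddTranscript",
    "metadata": {"start_time": 0.0, "end_time": 1.0, "transcript": "Hello."},
    "results": [
        {"type": "word", "start_time": 0.1, "end_time": 0.5,
         "alternatives": [{"content": "Hello", "confidence": 0.98}]},
        {"type": "punctuation", "start_time": 0.5, "end_time": 0.5,
         "is_eos": True,
         "alternatives": [{"content": ".", "confidence": 1.0}]},
    ],
}

def words_with_times(add_transcript: dict) -> list:
    """Return (content, start_time, end_time) for each word result."""
    out = []
    for r in add_transcript["results"]:
        if r["type"] == "word" and r.get("alternatives"):
            alt = r["alternatives"][0]
            out.append((alt["content"], r["start_time"], r["end_time"]))
    return out
```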
AddPartialTranslation

Contains a work-in-progress translation of a part of the audio that the client has sent.

  • message (required). Constant value: AddPartialTranslation
  • format (string). Speechmatics JSON output format version number. Example: 2.1
  • language (string, required)
  • results (object[], required). Array of:
      • content (string, required)
      • start_time (float, required)
      • end_time (float, required)
      • speaker (string)

AddTranslation

Contains the final translation of a part of the audio that the client has sent.

  • message (required). Constant value: AddTranslation
  • format (string). Speechmatics JSON output format version number. Example: 2.1
  • language (string, required)
  • results (object[], required). Array of:
      • content (string, required)
      • start_time (float, required)
      • end_time (float, required)
      • speaker (string)
EndOfTranscript

Server response to EndOfStream, sent after the server has finished sending all AddTranscript messages.

  • message (required). Constant value: EndOfTranscript

AudioEventStarted

Signals the start of a detected audio event.

  • message (required). Constant value: AudioEventStarted
  • event (object, required)
      • type (string, required)
      • start_time (float, required)
      • confidence (float, required)

AudioEventEnded

Signals the end of a detected audio event.

  • message (required). Constant value: AudioEventEnded
  • event (object, required)
      • type (string, required)
      • end_time (float, required)

EndOfUtterance

Indicates the end of an utterance, triggered by a configurable period of non-speech.

  • message (required). Constant value: EndOfUtterance
  • metadata (object, required)
      • start_time (float)
      • end_time (float)
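One way an application might use EndOfUtterance (emitted when end_of_utterance_silence_trigger is set above 0) is to flush a caption line built from partial transcripts. This sketch assumes each partial supersedes the previous one; CaptionBuffer is an illustrative helper.

```python
# Sketch: turn partial transcripts plus EndOfUtterance into caption lines.
class CaptionBuffer:
    def __init__(self):
        self.current = ""   # latest partial for the in-progress utterance
        self.lines = []     # completed caption lines

    def on_partial(self, transcript: str) -> None:
        """Each AddPartialTranscript replaces the current line (assumption)."""
        self.current = transcript

    def on_end_of_utterance(self) -> None:
        """EndOfUtterance finalizes the current line."""
        if self.current:
            self.lines.append(self.current)
            self.current = ""
```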

Info

Additional information sent from the server to the client.

  • message (required). Constant value: Info
  • type (string, required). Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
  • reason (string, required)
  • code (integer)
  • seq_no (integer)
  • quality (string)
  • usage (number)
  • quota (number)
  • last_updated (string)

Warning

Warning messages sent from the server to the client.

  • message (required). Constant value: Warning
  • type (string, required). Possible values: [duration_limit_exceeded]
  • reason (string, required)
  • code (integer)
  • seq_no (integer)
  • duration_limit (number)

Error

Error messages sent from the server to the client.

  • message (required). Constant value: Error
  • type (string, required). Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, timelimit_exceeded, quota_exceeded, unknown_error]
  • reason (string, required)
  • code (integer)
  • seq_no (integer)
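Since every received message carries a message field, a receive loop can dispatch on it. The sketch below is illustrative; only two handlers are shown, and unknown messages are ignored.

```python
# Sketch: minimal receive-side dispatch keyed on the "message" field.
import json

def dispatch(raw: str, handlers: dict) -> None:
    """Parse one text frame and route it to a handler, if one is registered."""
    msg = json.loads(raw)
    handler = handlers.get(msg.get("message"))
    if handler:
        handler(msg)

events = []
handlers = {
    "Error": lambda m: events.append(("error", m["type"])),
    "EndOfTranscript": lambda m: events.append(("end", None)),
}
```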