Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges: the client opens a WebSocket connection and sends StartRecognition; the server replies with RecognitionStarted; the client streams binary AddAudio chunks, each acknowledged by an AudioAdded message, while the server emits AddPartialTranscript and AddTranscript messages; finally the client sends EndOfStream and the server replies with EndOfTranscript before the connection is closed.
Browser-based transcription
When starting a Realtime transcription session in the browser, use temporary keys to avoid exposing your long-lived API key.
Because browsers cannot set custom headers on a WebSocket connection, the temporary key must be provided as a query parameter. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
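For example, a client might build the connection URL like this (`realtime_url` is a hypothetical helper; only the endpoint and the `jwt` parameter come from this page):

```python
from urllib.parse import urlencode

def realtime_url(temporary_key: str, base: str = "wss://eu2.rt.speechmatics.com/v2") -> str:
    # URL-encode the temporary key and attach it as the `jwt` query parameter.
    return f"{base}?{urlencode({'jwt': temporary_key})}"
```

URL-encoding matters because temporary keys are JWTs and may contain characters that are not query-string safe.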
Handshake Responses
Successful Response
101 Switching Protocols - switch to the WebSocket protocol
Here is an example of a successful WebSocket handshake request:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
Malformed Request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client may receive an immediate error message and WebSocket close handshake from the server. For the following close codes only, we recommend a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message we are sending. Any other fields depend on the value of message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio, which is referred to as AddAudio.
The following values of the message field are supported:
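A minimal sketch of dispatching on the message field; the AddTranscript and Error field names used here are the ones described later in this reference, and the handler bodies are placeholders:

```python
import json

def handle_server_message(raw: str) -> str:
    """Parse a server message and dispatch on its `message` field."""
    msg = json.loads(raw)
    kind = msg["message"]
    if kind == "AddTranscript":
        # metadata.transcript carries the final text for this segment.
        return msg["metadata"]["transcript"]
    if kind == "Error":
        raise RuntimeError(f"{msg.get('type')}: {msg.get('reason')}")
    return kind
```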
Sent messages
StartRecognition
audio_format object required
type - Possible values: [raw, file]
encoding (for type: raw) - Possible values: [pcm_f32le, pcm_s16le, mulaw]
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[] - each entry is either a plain string or an object
content - Possible values: non-empty
sounds_like - Possible values: non-empty, >= 1 items
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
speaker_sensitivity - Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
replacements - A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
Default: false
Default: true
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
permitted_marks - The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process. Possible values: each value must match the regular expression ^(.|all)$
sensitivity - Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5. Possible values: >= 0 and <= 1
conversation_config object
end_of_utterance_silence_trigger - This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature. Possible values: >= 0 and <= 2. Default: 0
translation_config object
enable_partials - Default: false
audio_events_config object
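Putting the fields above together, a minimal StartRecognition payload might look like this. The specific values, and the language and sample_rate fields shown, are illustrative rather than copied from the schema fragments above:

```python
import json

# Illustrative StartRecognition message; values are examples, not defaults.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 16000,  # assumed field for raw audio
    },
    "transcription_config": {
        "language": "en",  # assumed field name
        "max_delay": 2,
    },
}

payload = json.dumps(start_recognition)  # sent as a text WebSocket frame
```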
AddAudio
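AddAudio itself is a binary WebSocket frame containing a chunk of audio in the declared audio_format. As a sketch, assuming raw pcm_s16le audio, float samples could be packed like this:

```python
import struct

def to_pcm_s16le(samples: list[float]) -> bytes:
    """Pack floats in [-1.0, 1.0] as little-endian signed 16-bit PCM,
    matching the pcm_s16le encoding declared in audio_format."""
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

chunk = to_pcm_s16le([0.0, 0.5, -0.5])  # send as a binary WebSocket frame
```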
EndOfStream
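A sketch of building this message. The last_seq_no field is assumed here to carry the number of AddAudio chunks sent, so the server can verify it received all audio before finalizing:

```python
import json

def end_of_stream(chunks_sent: int) -> str:
    # last_seq_no: count of AddAudio binary frames sent in this session.
    return json.dumps({"message": "EndOfStream", "last_seq_no": chunks_sent})
```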
SetRecognitionConfig
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[] - each entry is either a plain string or an object
content - Possible values: non-empty
sounds_like - Possible values: non-empty, >= 1 items
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
speaker_sensitivity - Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
replacements - A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
Default: false
Default: true
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
permitted_marks - The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process. Possible values: each value must match the regular expression ^(.|all)$
sensitivity - Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5. Possible values: >= 0 and <= 1
conversation_config object
end_of_utterance_silence_trigger - The time in seconds of silence after which the server will assume the speaker has finished speaking and emit an EndOfUtterance message. A value of 0 disables the feature. Possible values: >= 0 and <= 2. Default: 0
Received messages
RecognitionStarted
AudioAdded
AddPartialTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[] required
type - Possible values: [word, punctuation]
attaches_to - Possible values: [next, previous, none, both]
alternatives object[]
display object
direction - Possible values: [ltr, rtl]
confidence - Possible values: >= 0 and <= 1
volume - Possible values: >= 0 and <= 100
AddTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[] required
type - Possible values: [word, punctuation]
attaches_to - Possible values: [next, previous, none, both]
alternatives object[]
display object
direction - Possible values: [ltr, rtl]
confidence - Possible values: >= 0 and <= 1
volume - Possible values: >= 0 and <= 100
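As a sketch of consuming these messages, the results array can be flattened into plain text by taking each result's top alternative. The field names follow the schema above; the rule that punctuation joins the previous word without a leading space is an assumption based on attaches_to:

```python
import json

def transcript_text(add_transcript: dict) -> str:
    """Join the top alternative of each result into a plain string."""
    parts: list[str] = []
    for result in add_transcript["results"]:
        content = result["alternatives"][0]["content"]
        if result["type"] == "punctuation":
            # Attach punctuation to the preceding word, no space before it.
            parts[-1] = parts[-1] + content if parts else content
        else:
            parts.append(content)
    return " ".join(parts)
```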
AddPartialTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
AddTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
EndOfTranscript
AudioEventStarted
event object required
AudioEventEnded
event object required
EndOfUtterance
metadata object required
Info
type - Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Warning
type - Possible values: [duration_limit_exceeded]
Error
type - Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, timelimit_exceeded, quota_exceeded, unknown_error]