Audio Endpoints
Fluency Assessment
Fluency API Request (POST)
curl https://api.kidsmart.ai/v1/audio/fluency \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | The app key used to authenticate with the Web Service. |
file | field | The audio file to be analyzed. Audio files can be in the format wav, mp3, m4a, webm, ogg, or mp4, and should be less than 15 minutes in duration. |
reference_text | field | The reference text against which the speech contained in the audio file should be analyzed. This should be a plain text file using UTF-8 encoding. |
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human readable alphanumeric identifier (such as a UUID) that has meaning to you but not Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
model_id | field | Model ID (from Kid Smart AI). |
If successful, this will return a JSON response similar to the following:
{
"id":"abc123",
"soapbox_format":"https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/soapbox_format",
"kidsmart_format":"https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/kidsmart_format"
}
Extract one of the returned result URLs (soapbox_format or kidsmart_format) and use it to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.
curl https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/{format}
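The retry-on-404 pattern above can be sketched as follows. To keep the sketch runnable without a live endpoint, `fetch_result` is a hypothetical callable standing in for the HTTP GET against the result URL; in a real client it would issue that request and return the status code and parsed body.

```python
import time

def poll_result(fetch_result, max_attempts=10, delay_seconds=2.0):
    """Poll for a result, retrying while the service returns HTTP 404.

    `fetch_result` is any callable returning (status_code, body); swap in a
    real HTTP call that GETs the result URL from the initial response.
    """
    for _ in range(max_attempts):
        status, body = fetch_result()
        if status == 200:
            return body
        if status == 404:  # result not ready yet: wait, then retry
            time.sleep(delay_seconds)
            continue
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise TimeoutError("result not available after polling")
```

For example, a stub that returns 404 twice before succeeding exercises the retry path without any network traffic.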
Fluency API Response Structure: 2 Formats Available
- Kid Smart AI Format
- SoapBox Labs Compatible Format
If the request is successful, the Fluency Web Service will return a JSON response containing the Fluency analysis. At the root of the results object are the following fields:
Field | Description |
---|---|
user_id | The user_id specified in the request. |
language_code | The language contained in the audio file being analyzed. |
result_id | A unique identifier for the request. |
time | The UTC time the request was processed at. |
results | The Fluency results object returned (see the results object section below). |
audio_duration | The duration of the audio file in seconds. |
assessStartTime | Time of the first correct word uttered, in seconds. |
assessEndTime | Time of the last correct word uttered, in seconds. |
WPM | Words per minute: correct words / (assessEndTime - assessStartTime). |
accuracy_score | 100 * (correct words / total words). |
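The two derived metrics can be checked locally with the values from the example response below (5 correct words out of 7, assessed span 1.5-6.7 s). Note the service returns both values as integer strings; the example figures ("57", "71") are consistent with truncating the fractional part, though that is an inference from the sample, not a documented guarantee.

```python
def wpm(correct_count, assess_start, assess_end):
    """Words per minute: correct words over the assessed span (in seconds)."""
    return correct_count / ((assess_end - assess_start) / 60.0)

def accuracy_score(correct_count, word_count):
    """Percentage of reference words the child said correctly."""
    return 100.0 * correct_count / word_count
```

With the example values, wpm(5, 1.5, 6.7) is about 57.7 and accuracy_score(5, 7) is about 71.4, matching the "57" and "71" in the sample response once truncated.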
The following is an example of the JSON structure you can expect from Fluency. In the example, the reference text is “I love tigers. I have two cats.” while the child says “i like tigers. i have cats.” in the audio file.
Reference Text | Child Says |
---|---|
I love tigers. I have two cats. | “i like tigers. i have cats.” |
{
"audio_duration": 10,
"user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369",
"language_code": "",
"result_id": "3dbda1c0-1f12-4014-a9c4-2f1a814baa0e",
"time": "2024-06-12T12:06:47.912Z",
"assessStartTime": "1.5",
"assessEndTime": "6.7",
"results": {
"num_differences": 2,
"substitution_count": 1,
"insertion_count": 0,
"correct_count": 5,
"deletion_count": 1,
"repetition_count": 0,
"reference_text": "",
"transcription": "",
"transcription_confidence": "",
"word_count": 7,
"WPM": "57",
"accuracy_score": "71",
"text_score": [
{
"reference_text": "love",
"reference_index": 1,
"transcription_words": [
"like"
],
"alignment_type": "SUBSTITUTION",
"transcription_indices": [
1
],
"transcription_details": {
"start": 2,
"end": 4,
"phone_breakdown": []
}
},
{
"reference_text": "two",
"reference_index": 5,
"transcription_words": [],
"alignment_type": "DELETION",
"transcription_indices": [],
"transcription_details": {
"start": 0,
"end": 0,
"phone_breakdown": []
}
}
]
}
}
The following are snippets from the full JSON response above with some additional information for each key.
Results object
Within the JSON response, the results node/object contains the analysis of the audio file including data such as the number of insertions, deletions, and substitutions found and a text_score node/object that contains a breakdown of each token analyzed in the request.
"results": {
"num_differences": 2,
"substitution_count": 1,
"insertion_count": 0,
"correct_count": 5,
"deletion_count": 1,
"repetition_count": 0,
"reference_text": "",
"transcription": "",
"transcription_confidence": "",
"word_count": 7,
"WPM": "57",
"accuracy_score": "71",
"text_score": []
}
Field | Description |
---|---|
num_differences | The number of differences between the reference and transcription text (total number of substitutions + deletions + insertions). |
substitution_count | The number of times a word has been substituted for another (i.e., the number of times the child said a different word than what was in the reference text). |
insertion_count | The number of times a word has been inserted (i.e., the number of times the child said a word that was not present in the reference text). |
correct_count | The number of times the child correctly said a word from the reference text. |
transcription_confidence | The confidence of the transcription text transcribed from the audio file. |
deletion_count | The number of times a word from the reference text was not said (i.e., the number of times the child omitted a word that's present in the reference text). Depending on the last_word_type selected, the deletion_count will vary. |
reference_text | The text that is expected to be read. |
text_score | A breakdown of the individual words. See text_score for more information. |
transcription | The computed transcription from the audio file. |
word_count | The number of words in the reference_text. |
repetition_count | The number of times the child repeated a word. (For example, if the child correctly said “stripes” but immediately afterwards said “stripes” again, the second stripes would be flagged as an insertion and repetition.) |
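Since num_differences is defined above as the sum of substitutions, deletions, and insertions, and each text_score entry carries an alignment_type, a client can cross-check a parsed results object in a few lines. This is an illustrative consistency check, not part of the API:

```python
from collections import Counter

def summarize_text_score(results):
    """Tally alignment types in text_score and verify that num_differences
    equals substitutions + insertions + deletions, as documented."""
    tally = Counter(entry["alignment_type"] for entry in results["text_score"])
    expected = (results["substitution_count"]
                + results["insertion_count"]
                + results["deletion_count"])
    assert results["num_differences"] == expected, "counts are inconsistent"
    return dict(tally)
```

Run against the example response above, this returns one SUBSTITUTION ("love" → "like") and one DELETION ("two").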
Pronunciation Assessment
Pronunciation API Request (POST)
curl https://api.kidsmart.ai/v1/audio/pronunciation \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | The app key used to authenticate with the Web Service. |
file | field | The audio file to be analyzed. Audio files should be in WAV format and 3-15 seconds in duration. |
reference_text | field | The reference phonemes against which the speech contained in the audio file should be analyzed. This should be in the phonetic alphabet of the model specified by the model id. The default phonetic alphabet is Arpabet, but we can customize to your specific phonetic alphabet on demand. |
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human readable alphanumeric identifier (such as a UUID) that has meaning to you but not Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
model_id | field | Model ID (id is given to you by Kid Smart AI). |
If successful, this will return a JSON response similar to the following:
{
"id":"abc123",
"url":"https://api.kidsmart.ai/v1/audio/pronunciation/result/{id}/",
}
Extract the value of the url field and use this to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.
curl https://api.kidsmart.ai/v1/audio/pronunciation/result/{result_id}/
Pronunciation Response Structure
If the request is successful, the Pronunciation Web Service will return a JSON response containing the Pronunciation analysis. At the root of the results object are the following fields:
Field | Description |
---|---|
user_id | The user_id specified in the request. |
assessment_id | The unique identifier for the request. |
analysis_details | The pronunciation results object returned (see the example below). |
audio_duration | The duration of the audio file in seconds. |
is_correct | Boolean indicating whether the audio contained the reference phoneme(s). |
model_id | The id of the model. |
reference_text | The phonemes the audio was tested against. |
feedback | If the audio was marked incorrect, feedback on why it was marked incorrect. |
The following is an example of the JSON structure you can expect from Pronunciation. In the example, the reference text is “D EH M” and the child says "D EH M"
Reference Text (Nonsense word "dem") | Child Says |
---|---|
D EH M | D EH M |
{
"assessment_id": "fd66cf8a-7945-47dc-a146-1938155dc858",
"reference_text": "D EH M",
"is_correct": true,
"feedback": "feedback feature coming soon",
"audio_duration": 6,
"analysis_details": [
{
"phoneme": "D",
"timestamp": 1.57
},
{
"phoneme": "EH",
"timestamp": 3.03
},
{
"phoneme": "M",
"timestamp": 3.63
}
],
"model_id": "segmentation_medium",
"user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369"
}
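A client can recover the recognized phoneme sequence from analysis_details by ordering on timestamp and compare it against the space-separated reference. This local comparison is only illustrative; is_correct is computed server-side and may not reduce to a simple sequence match.

```python
def phoneme_sequence(response):
    """Return the recognized phonemes ordered by their timestamps."""
    details = sorted(response["analysis_details"], key=lambda d: d["timestamp"])
    return [d["phoneme"] for d in details]

def matches_reference(response):
    """Compare recognized phonemes against the space-separated reference text."""
    return phoneme_sequence(response) == response["reference_text"].split()
```

For the example above, the sequence is ["D", "EH", "M"], which matches the reference "D EH M".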
Word Recognition Assessment
Recognition API Request (POST)
curl https://api.kidsmart.ai/v1/audio/recognition \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | Your API authentication key |
file | field | Audio file (WAV format, max 30 seconds duration) |
reference_text | field | Expected word or phrase for recognition |
user_token | field | Unique identifier for the speaker |
model_id | field | Model ID (from Kid Smart AI) |
webhook_url | field | (Optional) URL to receive results via webhook |
Recognition API Response Structure
The API returns a JSON response containing the recognition analysis:
{
"assessment_id": "b2e4df18-fdee-4b07-a687-3ef56abad050",
"reference_text": "plume",
"feedback": "feedback feature coming soon",
"audio_duration": 1.7066666666666668,
"model_id": "literably_word",
"user_id": "here",
"prediction": "plum",
"phoneme_details": [
{"phoneme": "P", "original": null, "timestamp": 1.18},
{"phoneme": "L", "original": null, "timestamp": 1.37},
{"phoneme": "AH", "original": null, "timestamp": 1.43},
{"phoneme": "M", "original": null, "timestamp": 1.63}
],
"correct": false,
"confidence": "High"
}
Field | Description |
---|---|
assessment_id | Unique identifier for this assessment |
reference_text | The expected word/phrase that was tested against |
prediction | The word/phrase that was recognized in the audio |
correct | Boolean indicating if the pronunciation was correct |
confidence | Confidence level of the recognition (High/Medium/Low) |
phoneme_details | Detailed breakdown of recognized phonemes and timing, returned only when the child uttered the word or phrase incorrectly. |
audio_duration | Length of the audio file in seconds |
feedback | Additional feedback about the recognition (if any) |
If correct is true (the child uttered the word or phrase correctly), the phoneme details are not returned.
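Because phoneme_details is conditional on correct, a result handler should branch on that field. A minimal sketch, using only keys shown in the example response:

```python
def recognition_summary(response):
    """Render a one-line summary of a recognition result.

    phoneme_details is only present on incorrect results, so access it
    defensively with .get().
    """
    if response["correct"]:
        return f'"{response["reference_text"]}" recognized correctly'
    phonemes = [d["phoneme"] for d in response.get("phoneme_details", [])]
    return (f'expected "{response["reference_text"]}", '
            f'heard "{response["prediction"]}" ({" ".join(phonemes)})')
```

Applied to the example above, this yields: expected "plume", heard "plum" (P L AH M).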
New Feature: Webhooks
All audio endpoints now support webhooks for asynchronous result delivery. Add the optional webhook_url
parameter to receive results via POST callback instead of polling:
-F "webhook_url=https://your-domain.com/webhook-endpoint"
When a webhook URL is provided, the API response will include a webhook notification:
{
"id": "e09ecf55-36b5-4936-83f4-ff3439223ed4",
"webhook_notification": "Results will be sent to the provided webhook URL upon completion",
"url": "https://api.kidsmart.ai/v1/audio/recognition/result/e09ecf55-36b5-4936-83f4-ff3439223ed4/"
}
Webhook responses are typically delivered within 30 seconds of the initial request. See the webhooks documentation for more details.
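The shape of the webhook POST body is not specified above; assuming it mirrors the polled result JSON, a receiving endpoint mainly needs to parse the body and route it by its identifier. A minimal, framework-free sketch of that handler logic (the identifier key names are taken from the response examples above):

```python
import json

def handle_webhook(body_bytes):
    """Parse a webhook callback body and extract its identifier.

    Assumes the payload mirrors the polled result JSON; tries the identifier
    keys seen in this document's examples (result_id, assessment_id, id).
    A real handler would persist the payload and return HTTP 200 promptly.
    """
    payload = json.loads(body_bytes)
    result_id = (payload.get("result_id")
                 or payload.get("assessment_id")
                 or payload.get("id"))
    if result_id is None:
        raise ValueError("callback payload missing an identifier")
    return result_id, payload
```

This would typically sit behind a small HTTP server route registered at the webhook_url you supplied in the request.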