Audio Endpoints

Fluency Assessment

Fluency API Request (POST)

Step 1
curl https://api.kidsmart.ai/v1/audio/fluency \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
Name | Type | Description
x-api-key | header | The app key used to authenticate with the Web Service (for test purposes).
file | field | The audio file to be analyzed. Audio files can be in wav, mp3, m4a, webm, ogg, or mp4 format and should be less than 15 minutes in duration.
reference_text | field | The reference text against which the speech contained in the audio file should be analyzed. This should be a plain text file using UTF-8 encoding.
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments.
model_id | field | Model ID (provided by Kid Smart AI).

If successful, this will return a JSON response similar to the following:

{
  "id": "abc123",
  "soapbox_format": "https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/soapbox_format",
  "kidsmart_format": "https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/kidsmart_format"
}
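
For reference, here is a sketch of the same Step 1 request in Python. It assumes the third-party requests library; the file names and the speaker UUID are placeholders, and requests adds the multipart Content-Type header (including its boundary) automatically.

# Sketch: submit a Fluency request from Python (assumes the third-party "requests" library).
import os
import requests

with open("reading_sample.wav", "rb") as audio, open("passage.txt", "rb") as reference:
    response = requests.post(
        "https://api.kidsmart.ai/v1/audio/fluency",
        headers={"x-api-key": os.environ["KIDSMART_API_KEY"]},
        files={"file": audio, "reference_text": reference},  # reference text uploaded as a UTF-8 file
        data={
            "user_token": "930fed57-fd29-4d76-86e9-5004ffcb1369",  # your own opaque speaker ID
            "model_id": os.environ["MODEL_ID"],
        },
    )
response.raise_for_status()
print(response.json())  # contains the soapbox_format and kidsmart_format result URLs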
Step 2: Retrieve result (GET)

Extract the result URL for the format you want (soapbox_format or kidsmart_format) and use it to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.

curl https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/{format}
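
For example, a minimal polling loop in Python might look like the following sketch; the retry interval and attempt limit are illustrative choices, not values prescribed by the API.

# Sketch of Step 2: poll the result URL until the analysis is ready (HTTP 404 means "not ready yet").
import time
import requests

def fetch_fluency_result(result_url, interval_seconds=5, max_attempts=24):
    for _ in range(max_attempts):
        response = requests.get(result_url)
        if response.status_code == 404:
            time.sleep(interval_seconds)  # not ready yet; back off before retrying
            continue
        response.raise_for_status()
        return response.json()
    raise TimeoutError("result was not available after polling")

# result_url is the kidsmart_format (or soapbox_format) URL returned in Step 1.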

Fluency API Response Structure: 2 Formats Available

If the request is successful, the Fluency Web Service will return a JSON response containing the Fluency analysis. At the root of the response are the following fields:

Field | Description
user_id | The user_id specified in the request.
language_code | The language contained in the audio file being analyzed.
result_id | A unique identifier for the request.
time | The UTC time at which the request was processed.
results | The Fluency results object (see the Results object breakdown below).
audio_duration | The duration of the audio file in seconds.
assessStartTime | The time of the first correct word uttered.
assessEndTime | The time of the last correct word uttered.
WPM | Words per minute: correct words / ((assessEndTime - assessStartTime) / 60) (see the worked check after this table).
accuracy_score | 100 * (correct words / total words).
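
As a worked check (a sketch, not official reference code), these formulas reproduce the values in the example response below; truncating to whole numbers is an assumption inferred from that example.

# Worked check of the WPM and accuracy_score formulas against the example response below.
correct_count = 5                       # correctly read words (results.correct_count)
word_count = 7                          # words in the reference text (results.word_count)
assess_start, assess_end = 1.5, 6.7     # assessStartTime / assessEndTime, in seconds

wpm = correct_count / ((assess_end - assess_start) / 60)   # ~57.7 words per minute
accuracy = 100 * correct_count / word_count                # ~71.4

print(int(wpm), int(accuracy))  # -> 57 71, matching "WPM": "57" and "accuracy_score": "71"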
API Response with Results

The following is an example of the JSON structure you can expect from Fluency. In the example, the reference text is "I love tigers. I have two cats." while the child says "i like tigers. i have cats." in the audio file.

Reference Text | Child Says
I love tigers. I have two cats. | "i like tigers. i have cats."
{
  "audio_duration": 10,
  "user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369",
  "language_code": "",
  "result_id": "3dbda1c0-1f12-4014-a9c4-2f1a814baa0e",
  "time": "2024-06-12T12:06:47.912Z",
  "assessStartTime": "1.5",
  "assessEndTime": "6.7",
  "results": {
    "num_differences": 2,
    "substitution_count": 1,
    "insertion_count": 0,
    "correct_count": 5,
    "deletion_count": 1,
    "repetition_count": 0,
    "reference_text": "",
    "transcription": "",
    "transcription_confidence": "",
    "word_count": 7,
    "WPM": "57",
    "accuracy_score": "71",
    "text_score": [
      {
        "reference_text": "love",
        "reference_index": 1,
        "transcription_words": [
          "like"
        ],
        "alignment_type": "SUBSTITUTION",
        "transcription_indices": [
          1
        ],
        "transcription_details": {
          "start": 2,
          "end": 4,
          "phone_breakdown": []
        }
      },
      {
        "reference_text": "two",
        "reference_index": 5,
        "transcription_words": [],
        "alignment_type": "DELETION",
        "transcription_indices": [],
        "transcription_details": {
          "start": 0,
          "end": 0,
          "phone_breakdown": []
        }
      }
    ]
  }
}
JSON Breakdown

The following are snippets from the full JSON response above with some additional information for each key.

Results object

Within the JSON response, the results object contains the analysis of the audio file, including the number of insertions, deletions, and substitutions found, and a text_score array with a breakdown of each token analyzed in the request (a small sketch after the field table below walks this array).

"results": {
"num_differences": 2,
"substitution_count": 1,
"insertion_count": 0,
"correct_count": 5,
"deletion_count": 1,
"repetition_count": 0,
"reference_text": "",
"transcription": "",
"transcription_confidence": "",
"word_count": 7,
"WPM": "57",
"accuracy_score": "71",
"text_score": []
}
Field | Description
num_differences | The number of differences between the reference and transcription text (total substitutions + deletions + insertions).
substitution_count | The number of times a word was substituted for another (i.e., the number of times the child said a different word than the one in the reference text).
insertion_count | The number of times a word was inserted (i.e., the number of times the child said a word that was not present in the reference text).
correct_count | The number of times the child correctly said a word from the reference text.
transcription_confidence | The confidence score for the transcription produced from the audio file.
deletion_count | The number of times a word from the reference text was not said (i.e., the number of times the child omitted a word that is present in the reference text). Depending on the last_word_type selected, the deletion_count will vary.
reference_text | The text that is expected to be read.
text_score | A breakdown of the individual words. See text_score for more information.
transcription | The computed transcription from the audio file.
word_count | The number of words in the reference_text.
repetition_count | The number of times the child repeated a word. (For example, if the child correctly said "stripes" but immediately afterwards said "stripes" again, the second "stripes" would be flagged as an insertion and a repetition.)
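
As referenced above, here is a small sketch that walks the text_score array and summarizes each difference; it relies only on the fields shown in the example response.

# Sketch: summarize word-level differences from a Fluency result (fields as in the example above).
def summarize_differences(fluency_result):
    lines = []
    for entry in fluency_result["results"]["text_score"]:
        heard = " ".join(entry["transcription_words"]) or "(nothing)"
        lines.append(f'{entry["alignment_type"]}: reference "{entry["reference_text"]}" -> {heard}')
    return lines

# For the example response this yields:
#   SUBSTITUTION: reference "love" -> like
#   DELETION: reference "two" -> (nothing)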

Pronunciation Assessment

Pronunciation API Request (POST)

Step 1
curl https://api.kidsmart.ai/v1/audio/pronunciation \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
Name | Type | Description
x-api-key | header | The app key used to authenticate with the Web Service (for test purposes).
file | field | The audio file to be analyzed. Audio files should be in WAV format and should be 3-15 seconds in duration.
reference_text | field | The reference phonemes against which the speech contained in the audio file should be analyzed. These should be in the phonetic alphabet of the model specified by the model_id. The default phonetic alphabet is Arpabet, but we can customize to your specific phonetic alphabet on demand.
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments.
model_id | field | Model ID (provided by Kid Smart AI).

If successful, this will return a JSON response similar to the following:

{
  "id": "abc123",
  "url": "https://api.kidsmart.ai/v1/audio/pronunciation/result/{id}/"
}
Step 2: Retrieve result (GET)

Extract the value of the url field and use it to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.

curl https://api.kidsmart.ai/v1/audio/pronunciation/result/{result_id}/

Pronunciation Response Structure

If the request is successful, the Pronunciation Web Service will return a JSON response containing the pronunciation analysis. At the root of the response are the following fields:

Field | Description
user_id | The user_id specified in the request.
assessment_id | The unique identifier for the request.
analysis_details | The pronunciation results object (see the example response below).
audio_duration | The duration of the audio file in seconds.
is_correct | Boolean indicating whether the audio contained the reference phoneme(s).
model_id | The ID of the model used.
reference_text | The phonemes the audio was tested against.
feedback | If the audio was marked incorrect, feedback on why it was marked incorrect.
API Response

The following is an example of the JSON structure you can expect from Pronunciation. In the example, the reference text is "D EH M" and the child says "D EH M".

Reference Text (nonsense word "dem") | Child Says
D EH M | D EH M
{
  "assessment_id": "fd66cf8a-7945-47dc-a146-1938155dc858",
  "reference_text": "D EH M",
  "is_correct": true,
  "feedback": "feedback feature coming soon",
  "audio_duration": 6,
  "analysis_details": [
    {
      "phoneme": "D",
      "timestamp": 1.57
    },
    {
      "phoneme": "EH",
      "timestamp": 3.03
    },
    {
      "phoneme": "M",
      "timestamp": 3.63
    }
  ],
  "model_id": "segmentation_medium",
  "user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369"
}
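
A short sketch of consuming this response, again using only the fields shown above:

# Sketch: act on a Pronunciation result (fields as in the example above).
def report_pronunciation(result):
    if result["is_correct"]:
        print(f'"{result["reference_text"]}" was pronounced correctly.')
    else:
        print(f'"{result["reference_text"]}" was not matched: {result["feedback"]}')
    for detail in result.get("analysis_details", []):
        # each entry records a detected phoneme and the time (in seconds) it was heard
        print(f'  {detail["phoneme"]} at {detail["timestamp"]}s')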

Word Recognition Assessment

Recognition API Request (POST)

curl https://api.kidsmart.ai/v1/audio/recognition \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
Name | Type | Description
x-api-key | header | Your API authentication key.
file | field | Audio file (WAV format, max 30 seconds duration).
reference_text | field | Expected word or phrase for recognition.
user_token | field | Unique identifier for the speaker.
model_id | field | Model ID (from Kid Smart AI).
webhook_url | field | (Optional) URL to receive results via webhook.

Recognition API Response Structure

The API returns a JSON response containing the recognition analysis:

{
  "assessment_id": "b2e4df18-fdee-4b07-a687-3ef56abad050",
  "reference_text": "plume",
  "feedback": "feedback feature coming soon",
  "audio_duration": 1.7066666666666668,
  "model_id": "literably_word",
  "user_id": "here",
  "prediction": "plum",
  "phoneme_details": [
    {"phoneme": "P", "original": null, "timestamp": 1.18},
    {"phoneme": "L", "original": null, "timestamp": 1.37},
    {"phoneme": "AH", "original": null, "timestamp": 1.43},
    {"phoneme": "M", "original": null, "timestamp": 1.63}
  ],
  "correct": false,
  "confidence": "High"
}
Field | Description
assessment_id | Unique identifier for this assessment.
reference_text | The expected word/phrase that was tested against.
prediction | The word/phrase that was recognized in the audio.
correct | Boolean indicating whether the pronunciation was correct.
confidence | Confidence level of the recognition (High/Medium/Low).
phoneme_details | If the child uttered the word or phrase incorrectly, a detailed breakdown of the recognized phonemes and their timing.
audio_duration | Length of the audio file in seconds.
feedback | Additional feedback about the recognition (if any).

If correct is true (the child uttered the word or phrase correctly), phoneme_details is not returned.
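
Client code should therefore treat phoneme_details as optional; a minimal sketch:

# Sketch: handle a Recognition result; phoneme_details is only present when "correct" is false.
def handle_recognition(result):
    if result["correct"]:
        print(f'Correct: heard "{result["prediction"]}" ({result["confidence"]} confidence)')
        return
    print(f'Expected "{result["reference_text"]}" but heard "{result["prediction"]}"')
    for detail in result.get("phoneme_details", []):
        print(f'  {detail["phoneme"]} at {detail["timestamp"]}s')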

New Feature: Webhooks

All audio endpoints now support webhooks for asynchronous result delivery. Add the optional webhook_url parameter to receive results via POST callback instead of polling:

-F "webhook_url=https://your-domain.com/webhook-endpoint"

When a webhook URL is provided, the API response will include a webhook notification:

{
  "id": "e09ecf55-36b5-4936-83f4-ff3439223ed4",
  "webhook_notification": "Results will be sent to the provided webhook URL upon completion",
  "url": "https://api.kidsmart.ai/v1/audio/recognition/result/e09ecf55-36b5-4936-83f4-ff3439223ed4/"
}

Webhook responses are typically delivered within 30 seconds of the initial request. See the webhooks documentation for more details.
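
The exact structure of the callback payload is not documented here, so the sketch below (standard library only, listening on a hypothetical port 8080) simply logs whatever JSON arrives and acknowledges it with a 200 response; adapt it once you have inspected a real delivery.

# Minimal sketch of a webhook receiver using only the Python standard library.
# The callback payload is assumed to be JSON; treat this as a starting point, not a specification.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            payload = {"raw": body.decode("utf-8", "replace")}
        print("Webhook received:", payload)  # replace with your own result handling
        self.send_response(200)              # acknowledge receipt
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), WebhookHandler).serve_forever()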