
Pronunciation Assessment API

Pronunciation Assessment

The Pronunciation Assessment product provides fast, accurate analysis of a child's pronunciation. The child is recorded saying a word or sound, and that recording is sent to the Kid Smart AI API together with the expected phonemes and a model ID.

Inputs:

  • Audio file (WAV format only, 3 to 15 seconds long)
  • Expected phoneme or phonemes (in Arpabet or in your own phonemic alphabet)
    • Contact Kid Smart AI to discuss your phoneme set; we can implement it.
  • Model ID
    • We can build custom model outputs for your use case, or you can select one of our existing models based on confidence (high vs. medium) and phoneme segmentation requirements (none vs. segmented).
    • See the playground examples for the differences between models.
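With the default Arpabet alphabet, the expected phonemes are sent as a space-separated string such as "D EH M". The sketch below is a hypothetical helper, `build_reference_text`, that normalizes a list of phonemes into that form and rejects symbols outside the 39-phoneme CMU Pronouncing Dictionary Arpabet set. Stripping lexical stress digits (EH1 becomes EH) is an assumption based on the stress-free example response on this page; confirm with Kid Smart AI, and skip the check entirely if you use a custom phonemic alphabet.

```python
import re

# The 39 phonemes of the CMU Pronouncing Dictionary's Arpabet set (no stress digits).
ARPABET = frozenset(
    "AA AE AH AO AW AY B CH D DH EH ER EY F G HH IH IY JH K L M N NG "
    "OW OY P R S SH T TH UH UW V W Y Z ZH".split()
)
STRESS = re.compile(r"[012]$")  # Arpabet marks vowel stress with a trailing 0, 1, or 2


def build_reference_text(phonemes, strip_stress=True):
    """Hypothetical helper: join Arpabet phonemes into a reference_text string.

    Assumes the API wants bare symbols such as "D EH M", so stress digits
    are removed unless strip_stress=False.
    """
    symbols = []
    for raw in phonemes:
        symbol = raw.strip().upper()
        if STRESS.sub("", symbol) not in ARPABET:
            raise ValueError(f"not an Arpabet phoneme: {raw!r}")
        symbols.append(STRESS.sub("", symbol) if strip_stress else symbol)
    return " ".join(symbols)


print(build_reference_text(["d", "eh1", "m"]))  # D EH M
```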

Outputs for Phoneme assessment

Output | Details
Is Correct? | Boolean indicating whether the phonemes were correctly pronounced.
Analysis details | Timestamps and the predicted phonemes. If there is uncertainty in the phoneme prediction, multiple phonemes are given.
Feedback | Feedback on any errors that were identified.

Analysis starts once the audio file is submitted and typically completes within 30 seconds.


Pronunciation API Request (POST)

Step 1: Submit the audio (POST)
curl https://api.kidsmart.ai/v1/audio/pronunciation \
  -H "x-api-key: $KIDSMART_API_KEY" \
  -F "file=@$AUDIO_FILE" \
  -F "user_token=$USER_ID" \
  -F "reference_text=@$REFERENCE_TEXT" \
  -F "model_id=$MODEL_ID"
Name | Type | Description
x-api-key | header | For test purposes. The API key used to authenticate with the web service.
file | field | The audio file to be analyzed. Audio files should be in WAV format and 3 to 15 seconds long.
reference_text | field | The reference phonemes against which the speech in the audio file should be analyzed. These should be in the phonetic alphabet of the model specified by model_id. The default phonetic alphabet is Arpabet, but we can customize the model to your own phonetic alphabet on request.
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human-readable alphanumeric identifier (such as a UUID) that has meaning to you but not to Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments.
model_id | field | The model ID provided to you by Kid Smart AI.
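Before uploading, you can check the file requirements locally. The sketch below uses Python's standard `wave` module to confirm the file is a readable WAV between 3 and 15 seconds long, and `uuid.uuid4()` to mint an opaque `user_token`. The helper names are hypothetical, and `wave` only reads uncompressed PCM WAV (plus WAVE_FORMAT_EXTENSIBLE from Python 3.12), so the check may be stricter or looser than what the service actually accepts.

```python
import uuid
import wave
from typing import BinaryIO, Union


def wav_duration_seconds(source: Union[str, BinaryIO]) -> float:
    """Return the duration of a PCM WAV file (path or binary file object) in seconds."""
    with wave.open(source, "rb") as wav:  # raises wave.Error if this is not a WAV file
        return wav.getnframes() / wav.getframerate()


def check_audio(source: Union[str, BinaryIO], min_s: float = 3.0, max_s: float = 15.0) -> float:
    """Hypothetical pre-upload check for the documented 3-15 second limit."""
    duration = wav_duration_seconds(source)
    if not min_s <= duration <= max_s:
        raise ValueError(f"audio is {duration:.2f} s long; expected {min_s}-{max_s} s")
    return duration


def new_user_token() -> str:
    """Random, non-human-readable speaker ID; keep your own mapping to the real user."""
    return str(uuid.uuid4())
```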

If successful, this will return a JSON response similar to the following:

{
  "id": "abc123",
  "url": "https://api.kidsmart.ai/v1/audio/pronunciation/result/{id}/"
}
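For clients that cannot shell out to curl, the same multipart POST can be built with Python's standard library alone. This is a sketch, not an official SDK: `encode_multipart` and `submit_audio` are hypothetical names, and it mirrors the curl command by sending both the audio and the reference text as file parts (curl's `@` syntax). It returns the parsed JSON, whose `id` and `url` fields are used in Step 2.

```python
import json
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://api.kidsmart.ai/v1/audio/pronunciation"


def encode_multipart(fields, files):
    """Encode text fields and (filename, bytes) file parts as multipart/form-data."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(header.encode() + value.encode() + b"\r\n")
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'
        )
        chunks.append(header.encode() + data + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())  # closing delimiter
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def submit_audio(api_key, audio_path, user_token, reference_text, model_id):
    """POST one recording; returns the JSON body, e.g. {"id": ..., "url": ...}."""
    body, content_type = encode_multipart(
        fields={"user_token": user_token, "model_id": model_id},
        files={
            "file": (Path(audio_path).name, Path(audio_path).read_bytes()),
            # Sent as a file part, like curl's -F "reference_text=@..."
            "reference_text": ("reference.txt", reference_text.encode()),
        },
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```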
Step 2: Retrieve the result (GET)

Use the value of the url field from the Step 1 response to retrieve the result. Processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code; in that case, wait before retrying.

curl https://api.kidsmart.ai/v1/audio/pronunciation/result/{id}/
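Because a 404 here means "not ready yet" rather than "not found", clients should poll with a growing delay. The sketch below is one way to do that with the standard library. The function name, the doubling delay (2 s up to 15 s), and the attempt limit are illustrative choices, not documented values, and sending `x-api-key` on the GET is an assumption (the curl example omits it). The `sleep` and `opener` parameters exist so the loop can be tested without network access.

```python
import json
import time
import urllib.error
import urllib.request


def poll_result(url, api_key=None, max_attempts=10, initial_delay=2.0,
                max_delay=15.0, sleep=time.sleep, opener=urllib.request.urlopen):
    """Fetch the result URL, retrying with exponential backoff while it returns 404."""
    headers = {"x-api-key": api_key} if api_key else {}  # assumption: GET may need the key
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        request = urllib.request.Request(url, headers=headers)
        try:
            with opener(request, timeout=30) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise  # only 404 means "not ready yet"; surface other errors
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise TimeoutError(f"result not ready after {max_attempts} attempts: {url}")
```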

Pronunciation Response Structure

If the request is successful, the web service returns a JSON response containing the pronunciation analysis. The root of the results object contains the following fields:

Field | Description
user_id | The user_token value specified in the request.
assessment_id | The unique identifier for the request.
analysis_details | The pronunciation results: the predicted phonemes with their timestamps.
audio_duration | The duration of the audio file in seconds.
is_correct | Boolean indicating whether the audio contained the reference phoneme(s).
model_id | The ID of the model used.
reference_text | The phonemes the audio was tested against.
feedback | If the audio was marked incorrect, feedback on why it was marked incorrect.
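If you consume the result in typed code, the fields above map naturally onto a small data class. In this sketch the class names are hypothetical, and `phoneme` is typed loosely because the overview says multiple phonemes may be returned when the prediction is uncertain but does not show what that looks like; the example response on this page only shows single strings.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class PhonemeDetail:
    phoneme: Any  # a string such as "EH"; the shape for uncertain predictions is undocumented
    timestamp: float  # presumably seconds, matching audio_duration


@dataclass
class PronunciationResult:
    assessment_id: str
    user_id: str
    model_id: str
    reference_text: str
    is_correct: bool
    audio_duration: float
    feedback: str = ""
    analysis_details: List[PhonemeDetail] = field(default_factory=list)

    @classmethod
    def from_json(cls, text: str) -> "PronunciationResult":
        """Parse the result JSON returned by the GET request."""
        data = json.loads(text)
        details = [
            PhonemeDetail(item["phoneme"], float(item["timestamp"]))
            for item in data.get("analysis_details", [])
        ]
        return cls(
            assessment_id=data["assessment_id"],
            user_id=data["user_id"],
            model_id=data["model_id"],
            reference_text=data["reference_text"],
            is_correct=bool(data["is_correct"]),
            audio_duration=float(data["audio_duration"]),
            feedback=data.get("feedback", ""),
            analysis_details=details,
        )
```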
API Response

The following is an example of the JSON structure returned by the Pronunciation API. In this example, the reference text is "D EH M" and the child says "D EH M".

Reference text (nonsense word "dem") | Child says
D EH M | D EH M
{
  "assessment_id": "fd66cf8a-7945-47dc-a146-1938155dc858",
  "reference_text": "D EH M",
  "is_correct": true,
  "feedback": "feedback feature coming soon",
  "audio_duration": 6,
  "analysis_details": [
    {
      "phoneme": "D",
      "timestamp": 1.57
    },
    {
      "phoneme": "EH",
      "timestamp": 3.03
    },
    {
      "phoneme": "M",
      "timestamp": 3.63
    }
  ],
  "model_id": "segmentation_medium",
  "user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369"
}
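The service's verdict is `is_correct`, but for display you may want to show where the predicted phonemes diverge from the reference. The sketch below is a hypothetical client-side helper, not part of the API, that aligns the two sequences with Python's `difflib.SequenceMatcher`. It assumes each `phoneme` value is a single string, as in the example above; the usage input is that example with EH deliberately changed to IH to produce a mismatch.

```python
import difflib


def phoneme_differences(result):
    """List (operation, reference phonemes, predicted phonemes) for each mismatch.

    Operations are difflib's "replace", "delete", and "insert".
    """
    reference = result["reference_text"].split()
    predicted = [item["phoneme"] for item in result.get("analysis_details", [])]
    matcher = difflib.SequenceMatcher(a=reference, b=predicted, autojunk=False)
    return [
        (op, reference[i1:i2], predicted[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


example = {
    "reference_text": "D EH M",
    "analysis_details": [
        {"phoneme": "D", "timestamp": 1.57},
        {"phoneme": "IH", "timestamp": 3.03},  # altered from the example: IH instead of EH
        {"phoneme": "M", "timestamp": 3.63},
    ],
}
print(phoneme_differences(example))  # [('replace', ['EH'], ['IH'])]
```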