Audio Endpoints
Fluency Assessment
Fluency API Request (POST)
curl https://api.kidsmart.ai/v1/audio/fluency \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | The app key used to authenticate with the Web Service. |
file | field | The audio file to be analyzed. Audio files can be in the format wav, mp3, m4a, webm, ogg, or mp4, and should be less than 15 minutes in duration. |
reference_text | field | The reference text against which the speech contained in the audio file should be analyzed. This should be a plain text file using UTF-8 encoding. |
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human readable alphanumeric identifier (such as a UUID) that has meaning to you but not Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
model_id | field | Model ID (from Kid Smart AI). |
If successful, this will return a JSON response similar to the following:
{
"id":"abc123",
"soapbox_format":"https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/soapbox_format",
"kidsmart_format":"https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/kidsmart_format"
}
Extract one of the returned result URLs (soapbox_format or kidsmart_format) and use it to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.
curl https://api.kidsmart.ai/v1/audio/fluency/result/{result_id}/{format}
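The retry-on-404 pattern above can be sketched as follows. To keep the sketch runnable without a live endpoint, `fetch_result` is a hypothetical callable standing in for the HTTP GET against the result URL; in a real client it would issue that request and return the status code and parsed body.

```python
import time

def poll_result(fetch_result, max_attempts=10, delay_seconds=2.0):
    """Poll for a result, retrying while the service returns HTTP 404.

    `fetch_result` is any callable returning (status_code, body); swap in a
    real HTTP call that GETs the result URL from the initial response.
    """
    for _ in range(max_attempts):
        status, body = fetch_result()
        if status == 200:
            return body
        if status == 404:  # result not ready yet: wait, then retry
            time.sleep(delay_seconds)
            continue
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise TimeoutError("result not available after polling")
```

For example, a stub that returns 404 twice before succeeding exercises the retry path without any network traffic.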
Fluency API Response Structure: 2 Formats Available
- Kid Smart AI Format
- SoapBox Labs Compatible Format
If the request is successful, the Fluency Web Service will return a JSON response containing the Fluency analysis. At the root of the results object are the following fields:
Field | Description |
---|---|
user_id | The user_id specified in the request. |
language_code | The language contained in the audio file being analyzed. |
result_id | A unique identifier for the request. |
time | The UTC time the request was processed at. |
results | The Fluency results object returned (see the results object section below). |
audio_duration | The duration of the audio file in seconds. |
assessStartTime | Time of the first correct word uttered, in seconds. |
assessEndTime | Time of the last correct word uttered, in seconds. |
WPM | Words per minute: correct words / (assessEndTime - assessStartTime). |
accuracy_score | 100 * (correct words / total words). |
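The two derived metrics can be checked locally with the values from the example response below (5 correct words out of 7, assessed span 1.5-6.7 s). Note the service returns both values as integer strings; the example figures ("57", "71") are consistent with truncating the fractional part, though that is an inference from the sample, not a documented guarantee.

```python
def wpm(correct_count, assess_start, assess_end):
    """Words per minute: correct words over the assessed span (in seconds)."""
    return correct_count / ((assess_end - assess_start) / 60.0)

def accuracy_score(correct_count, word_count):
    """Percentage of reference words the child said correctly."""
    return 100.0 * correct_count / word_count
```

With the example values, wpm(5, 1.5, 6.7) is about 57.7 and accuracy_score(5, 7) is about 71.4, matching the "57" and "71" in the sample response once truncated.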
The following is an example of the JSON structure you can expect from Fluency. In the example, the reference text is “I love tigers. I have two cats.” while the child says “i like tigers. i have cats.” in the audio file.
Reference Text | Child Says |
---|---|
I love tigers. I have two cats. | “i like tigers. i have cats.” |
{
"audio_duration": 10,
"user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369",
"language_code": "",
"result_id": "3dbda1c0-1f12-4014-a9c4-2f1a814baa0e",
"time": "2024-06-12T12:06:47.912Z",
"assessStartTime": "1.5",
"assessEndTime": "6.7",
"results": {
"num_differences": 2,
"substitution_count": 1,
"insertion_count": 0,
"correct_count": 5,
"deletion_count": 1,
"repetition_count": 0,
"reference_text": "",
"transcription": "",
"transcription_confidence": "",
"word_count": 7,
"WPM": "57",
"accuracy_score": "71",
"text_score": [
{
"reference_text": "love",
"reference_index": 1,
"transcription_words": [
"like"
],
"alignment_type": "SUBSTITUTION",
"transcription_indices": [
1
],
"transcription_details": {
"start": 2,
"end": 4,
"phone_breakdown": []
}
},
{
"reference_text": "two",
"reference_index": 5,
"transcription_words": [],
"alignment_type": "DELETION",
"transcription_indices": [],
"transcription_details": {
"start": 0,
"end": 0,
"phone_breakdown": []
}
}
]
}
}
The following are snippets from the full JSON response above with some additional information for each key.
Results object
Within the JSON response, the results node/object contains the analysis of the audio file including data such as the number of insertions, deletions, and substitutions found and a text_score node/object that contains a breakdown of each token analyzed in the request.
"results": {
"num_differences": 2,
"substitution_count": 1,
"insertion_count": 0,
"correct_count": 5,
"deletion_count": 1,
"repetition_count": 0,
"reference_text": "",
"transcription": "",
"transcription_confidence": "",
"word_count": 7,
"WPM": "57",
"accuracy_score": "71",
"text_score": []
}
Field | Description |
---|---|
num_differences | The number of differences between the reference and transcription text (total number of substitutions + deletions + insertions). |
substitution_count | The number of times a word has been substituted for another (i.e., the number of times the child said a different word than what was in the reference text). |
insertion_count | The number of times a word has been inserted (i.e., the number of times the child said a word that was not present in the reference text). |
correct_count | The number of times the child correctly said a word from the reference text. |
transcription_confidence | The confidence of the transcription text transcribed from the audio file. |
deletion_count | The number of times a word from the reference text was not said (i.e., the number of times the child omitted a word that's present in the reference text). Depending on the last_word_type selected, the deletion_count will vary. |
reference_text | The text that is expected to be read. |
text_score | A breakdown of the individual words. See text_score for more information. |
transcription | The computed transcription from the audio file. |
word_count | The number of words in the reference_text. |
repetition_count | The number of times the child repeated a word. (For example, if the child correctly said “stripes” but immediately afterwards said “stripes” again, the second stripes would be flagged as an insertion and repetition.) |
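Since num_differences is defined above as the sum of substitutions, deletions, and insertions, and each text_score entry carries an alignment_type, a client can cross-check a parsed results object in a few lines. This is an illustrative consistency check, not part of the API:

```python
from collections import Counter

def summarize_text_score(results):
    """Tally alignment types in text_score and verify that num_differences
    equals substitutions + insertions + deletions, as documented."""
    tally = Counter(entry["alignment_type"] for entry in results["text_score"])
    expected = (results["substitution_count"]
                + results["insertion_count"]
                + results["deletion_count"])
    assert results["num_differences"] == expected, "counts are inconsistent"
    return dict(tally)
```

Run against the example response above, this returns one SUBSTITUTION ("love" → "like") and one DELETION ("two").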
Pronunciation Assessment
Pronunciation API Request (POST)
curl https://api.kidsmart.ai/v1/audio/pronunciation \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | The app key used to authenticate with the Web Service. |
file | field | The audio file to be analyzed. Audio files should be in WAV format and 3-15 seconds in duration. |
reference_text | field | The reference phonemes against which the speech contained in the audio file should be analyzed. This should be in the phonetic alphabet of the model specified by the model id. The default phonetic alphabet is Arpabet, but we can customize to your specific phonetic alphabet on demand. |
user_token | field | A unique ID that represents the speaker in the audio file. This should be a non-human readable alphanumeric identifier (such as a UUID) that has meaning to you but not Kid Smart AI. This token can be used to request deletion of a specific user's data in line with our data privacy commitments. |
model_id | field | Model ID (id is given to you by Kid Smart AI). |
If successful, this will return a JSON response similar to the following:
{
"id":"abc123",
"url":"https://api.kidsmart.ai/v1/audio/pronunciation/result/{id}/",
}
Extract the value of the url field and use this to retrieve the result. The processing time depends on the length of the file, its complexity (e.g., audio quality), and connection speed. If the result is not yet available, you will receive an HTTP 404 status code. If you encounter an HTTP 404, wait a period of time before retrying.
curl https://api.kidsmart.ai/v1/audio/pronunciation/result/{result_id}/
Pronunciation Response Structure
If the request is successful, the Pronunciation Web Service will return a JSON response containing the Pronunciation analysis. At the root of the results object are the following fields:
Field | Description |
---|---|
user_id | The user_id specified in the request. |
assessment_id | The unique identifier for the request. |
analysis_details | The pronunciation results object returned (see the example below). |
audio_duration | The duration of the audio file in seconds. |
is_correct | Boolean indicating whether the audio contained the reference phoneme(s). |
model_id | The id of the model. |
reference_text | The phonemes the audio was tested against. |
feedback | If the audio was marked incorrect, feedback on why it was marked incorrect. |
The following is an example of the JSON structure you can expect from Pronunciation. In the example, the reference text is “D EH M” and the child says "D EH M"
Reference Text (Nonsense word "dem") | Child Says |
---|---|
D EH M | D EH M |
{
"assessment_id": "fd66cf8a-7945-47dc-a146-1938155dc858",
"reference_text": "D EH M",
"is_correct": true,
"feedback": "feedback feature coming soon",
"audio_duration": 6,
"analysis_details": [
{
"phoneme": "D",
"timestamp": 1.57
},
{
"phoneme": "EH",
"timestamp": 3.03
},
{
"phoneme": "M",
"timestamp": 3.63
}
],
"model_id": "segmentation_medium",
"user_id": "930fed57-fd29-4d76-86e9-5004ffcb1369"
}
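A client can recover the recognized phoneme sequence from analysis_details by ordering on timestamp and compare it against the space-separated reference. This local comparison is only illustrative; is_correct is computed server-side and may not reduce to a simple sequence match.

```python
def phoneme_sequence(response):
    """Return the recognized phonemes ordered by their timestamps."""
    details = sorted(response["analysis_details"], key=lambda d: d["timestamp"])
    return [d["phoneme"] for d in details]

def matches_reference(response):
    """Compare recognized phonemes against the space-separated reference text."""
    return phoneme_sequence(response) == response["reference_text"].split()
```

For the example above, the sequence is ["D", "EH", "M"], which matches the reference "D EH M".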
Word Recognition Assessment
Recognition API Request (POST)
curl https://api.kidsmart.ai/v1/audio/recognition \
-H "x-api-key:$KIDSMART_API_KEY" \
-F "file=@$AUDIO_FILE" \
-F "user_token=$USER_ID" \
-F "reference_text=@$REFERENCE_TEXT" \
-F "model_id=$MODEL_ID"
-H "Content-Type: multipart/form-data"
Name | Type | Description |
---|---|---|
x-api-key | header | Your API authentication key |
file | field | Audio file (WAV format, max 30 seconds duration) |
reference_text | field | Expected word or phrase for recognition |
user_token | field | Unique identifier for the speaker |
model_id | field | Model ID (from Kid Smart AI) |
webhook_url | field | (Optional) URL to receive results via webhook |
Recognition API Response Structure
The API returns a JSON response containing the recognition analysis:
{
"assessment_id": "b2e4df18-fdee-4b07-a687-3ef56abad050",
"reference_text": "plume",
"feedback": "feedback feature coming soon",
"audio_duration": 1.7066666666666668,
"model_id": "literably_word",
"user_id": "here",
"prediction": "plum",
"phoneme_details": [
{"phoneme": "P", "original": null, "timestamp": 1.18},
{"phoneme": "L", "original": null, "timestamp": 1.37},
{"phoneme": "AH", "original": null, "timestamp": 1.43},
{"phoneme": "M", "original": null, "timestamp": 1.63}
],
"correct": false,
"confidence": "High"
}
Field | Description |
---|---|
assessment_id | Unique identifier for this assessment |
reference_text | The expected word/phrase that was tested against |
prediction | The word/phrase that was recognized in the audio |
correct | Boolean indicating if the pronunciation was correct |
confidence | Confidence level of the recognition (High/Medium/Low) |
phoneme_details | Detailed breakdown of recognized phonemes and timing, returned only when the child uttered the word or phrase incorrectly. |
audio_duration | Length of the audio file in seconds |
feedback | Additional feedback about the recognition (if any) |
If correct is true (the child uttered the word or phrase correctly), the phoneme details are not returned.
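Because phoneme_details is conditional on correct, a result handler should branch on that field. A minimal sketch, using only keys shown in the example response:

```python
def recognition_summary(response):
    """Render a one-line summary of a recognition result.

    phoneme_details is only present on incorrect results, so access it
    defensively with .get().
    """
    if response["correct"]:
        return f'"{response["reference_text"]}" recognized correctly'
    phonemes = [d["phoneme"] for d in response.get("phoneme_details", [])]
    return (f'expected "{response["reference_text"]}", '
            f'heard "{response["prediction"]}" ({" ".join(phonemes)})')
```

Applied to the example above, this yields: expected "plume", heard "plum" (P L AH M).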
New Feature: Webhooks
All audio endpoints now support webhooks for asynchronous result delivery. Add the optional webhook_url
parameter to receive results via POST callback instead of polling:
-F "webhook_url=https://your-domain.com/webhook-endpoint"
When a webhook URL is provided, the API response will include a webhook notification:
{
"id": "e09ecf55-36b5-4936-83f4-ff3439223ed4",
"webhook_notification": "Results will be sent to the provided webhook URL upon completion",
"url": "https://api.kidsmart.ai/v1/audio/recognition/result/e09ecf55-36b5-4936-83f4-ff3439223ed4/"
}
Webhook responses are typically delivered within 30 seconds of the initial request. See the webhooks documentation for more details.
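The shape of the webhook POST body is not specified above; assuming it mirrors the polled result JSON, a receiving endpoint mainly needs to parse the body and route it by its identifier. A minimal, framework-free sketch of that handler logic (the identifier key names are taken from the response examples above):

```python
import json

def handle_webhook(body_bytes):
    """Parse a webhook callback body and extract its identifier.

    Assumes the payload mirrors the polled result JSON; tries the identifier
    keys seen in this document's examples (result_id, assessment_id, id).
    A real handler would persist the payload and return HTTP 200 promptly.
    """
    payload = json.loads(body_bytes)
    result_id = (payload.get("result_id")
                 or payload.get("assessment_id")
                 or payload.get("id"))
    if result_id is None:
        raise ValueError("callback payload missing an identifier")
    return result_id, payload
```

This would typically sit behind a small HTTP server route registered at the webhook_url you supplied in the request.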