Custom Speech Service Guide

Introduction

  • What is the Custom Speech Service?

    The Custom Speech Service enables you to create customized language models and acoustic models tailored to your application and your users. By uploading your specific speech and/or text data to the Custom Speech Service, you can create custom models that can be used in conjunction with Microsoft’s existing state-of-the-art speech models.

    For example, if you’re adding voice interaction to a mobile phone, tablet or PC app, you can create a custom language model that can be combined with Microsoft’s acoustic model to create a speech-to-text endpoint designed especially for your app. If your application is designed for use in a particular environment or by a particular user population, you can also create and deploy a custom acoustic model with this service.

  • How do speech recognition systems work?

    Speech recognition systems are composed of several components that work together. Two of the most important components are the acoustic model and the language model.

    The acoustic model is a classifier that labels short fragments of audio into one of a number of phonemes, or sound units, in a given language. For example, the word “speech” is comprised of four phonemes “s p iy ch”. These classifications are made on the order of 100 times per second.

    The language model is a probability distribution over sequences of words. The language model helps the system decide among sequences of words that sound similar, based on the likelihood of the word sequences themselves. For example, “recognize speech” and “wreck a nice beach” sound alike but the first hypothesis is far more likely to occur, and therefore will be assigned a higher score by the language model.

    Both the acoustic and language models are statistical models learned from training data. As a result, they perform best when the speech they encounter when used in applications is similar to the data observed during training. The acoustic and language models in the Microsoft Speech-To-Text engine have been trained on an enormous collection of speech and text and provide state-of-the-art performance for the most common usage scenarios, such as interacting with Cortana on your smart phone, tablet or PC, searching the web by voice or dictating text messages to a friend.

  • Why customize the speech-to-text engine?

    While the Microsoft Speech-To-Text engine is world-class, it is targeted toward the scenarios described above. However, if you expect voice queries to your application to contain particular vocabulary items, such as product names or jargon that rarely occur in typical speech, it is likely that you can obtain improved performance by customizing the language model.

    For example, if you were building an app to search MSDN by voice, it’s likely that terms like “object-oriented” or “namespace” or “dot net” will appear more frequently than in typical voice applications. Customizing the language model will enable the system to learn this.

    Similarly, customizing the acoustic model can enable the system to learn to do a better job recognizing speech in atypical environments. For example, if you have an app designed to be used by workers in a warehouse or factory, a customized acoustic model can more accurately recognize speech in the presence of the noises found in these environments.


Getting started

  • Samples

    You can find samples that show how to use the service on GitHub.

  • Adding a subscription key

    To get started using the Custom Speech Service, you first need to link your user account to an Azure subscription. Subscriptions to free and paid tiers are available. For information about the tiers, please visit the pricing page.

    Please follow these steps to get a subscription key from the Azure portal:

    • Either go to the Azure portal and add a new Cognitive Services API by searching for Cognitive Services and then selecting Cognitive Services APIs,

    • or go directly to the Cognitive Services APIs blade.
    • Now, fill in the required fields:
      • For the Account name, use a name that works for you. You will need this name to find your Cognitive Services subscription in the all resources list.
      • For Subscription, select one of your Azure subscriptions.
      • As the API type, select 'Custom Speech Service (Preview)'.
      • The Location is currently 'West US'.
      • For the Pricing tier, select the one that works for you. F0 is the free tier, with quotas as explained on the pricing page.

    • Now, you should find either a blade on your dashboard or a service with the provided Account name in your resources list. When you select it, you can see an overview of your service. In the menu on the left side, under Resource Management, there is an item called Keys. Clicking this item brings you to the page showing your subscription keys. Please copy 'KEY 1'.

      This subscription key is required in the next steps.

      One remark: please do not copy the 'Subscription ID' from the overview page. You need the subscription key in the next step.

    • To enter your subscription key, click on your user account in the right side of the top ribbon and click on Subscriptions in the drop-down menu.

    • This brings you to a table of subscriptions, which will be empty the first time.

    • Click on Add new. Enter a name for the subscription and the subscription key. This can either be the 'KEY 1' (primary key) or the 'KEY 2' (secondary key) from your subscription.


Creating a custom acoustic model

  • To customize the acoustic model to a particular domain, a collection of speech data is required. This collection consists of a set of audio files of speech data, and a text file of transcriptions of each audio file. The audio data should be representative of the scenario in which you would like to use the recognizer.

    For example:

    • If you would like to better recognize speech in a noisy factory environment, the audio files should consist of people speaking in a noisy factory.
    • If you are interested in optimizing performance for a single speaker, e.g. you would like to transcribe all of FDR’s Fireside Chats, then the audio files should consist of many examples of that speaker only.

  • Preparing your data to customize the acoustic model

    An acoustic data set for customizing the acoustic model consists of two parts: (1) a set of audio files containing the speech data and (2) a file containing the transcriptions of all audio files.

  • Audio Data Recommendations

    • All audio files in the data set should be stored in the WAV (RIFF) audio format.
    • The audio must have a sampling rate of 8 kHz or 16 kHz and the sample values should be stored as uncompressed PCM 16-bit signed integers (shorts).
    • Only single channel (mono) audio files are supported.
    • The audio files must be between 100ms and 1 minute in length. Each audio file should ideally start and end with at least 100ms of silence, and somewhere between 500ms and 1 second is common.
    • If you have background noise in your data, it is recommended to also have some examples with longer segments of silence, e.g. a few seconds, in your data, before and/or after the speech content.
    • Each audio file should consist of a single utterance, e.g. a single sentence for dictation, a single query, or a single turn of a dialog system.
    • Each audio file in the data set should have a unique filename and the extension “wav”.
    • The set of audio files should be placed in a single folder without subdirectories and the entire set of audio files should be packaged as a single ZIP file archive.

    Note that data imports via the web portal are currently limited to 2 GB, so this is the maximum size of an acoustic data set. This corresponds to approximately 17 hours of audio recorded at 16 kHz or 34 hours of audio recorded at 8 kHz. The main requirements for the audio data are summarized in the following table.

    Property               Value
    File Format            RIFF (WAV)
    Sampling Rate          8000 Hz or 16000 Hz
    Channels               1 (mono)
    Sample Format          PCM, 16-bit integers
    File Duration          0.1 seconds < duration < 60 seconds
    Silence Collar         > 0.1 seconds
    Archive Format         Zip
    Maximum Archive Size   2 GB
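
    If you want to sanity-check a data set against these requirements before uploading it, a short script can help. The following is a minimal C# sketch, not part of the service: the folder name audioFiles and the archive name acousticData.zip are placeholder assumptions, and the header parsing assumes the canonical 44-byte WAV header (it does not check file duration).

        using System;
        using System.IO;
        using System.IO.Compression;

        class AcousticDataPackager
        {
            static void Main()
            {
                const string audioFolder = "audioFiles";        // placeholder: folder of WAV files, no subdirectories
                const string archivePath = "acousticData.zip";  // placeholder: ZIP archive to upload

                foreach (var path in Directory.GetFiles(audioFolder, "*.wav"))
                {
                    using var reader = new BinaryReader(File.OpenRead(path));
                    reader.ReadBytes(22);                        // skip RIFF/WAVE/fmt headers up to the channel count
                    short channels = reader.ReadInt16();         // must be 1 (mono)
                    int sampleRate = reader.ReadInt32();         // must be 8000 or 16000
                    reader.ReadBytes(6);                         // skip byte rate and block align
                    short bitsPerSample = reader.ReadInt16();    // must be 16 (PCM shorts)

                    if (channels != 1 || (sampleRate != 8000 && sampleRate != 16000) || bitsPerSample != 16)
                        Console.WriteLine($"Check {Path.GetFileName(path)}: {channels} ch, {sampleRate} Hz, {bitsPerSample}-bit");
                }

                // Package the folder into a single ZIP archive (must stay under 2 GB).
                ZipFile.CreateFromDirectory(audioFolder, archivePath);
            }
        }
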
  • Transcriptions

    The transcriptions for all WAV files should be contained in a single plain-text file. Each line of the transcription file should have the name of one of the audio files, followed by the corresponding transcription. The file name and transcription should be separated by a tab (\t).

    For example:

    speech01.wav	speech recognition is awesome

    speech02.wav	the quick brown fox jumped all over the place

    speech03.wav	the lazy dog was not amused

    The transcriptions will be text-normalized so they can be processed by the system. However, there are some very important normalizations that must be done by the user prior to uploading the data to the Custom Speech Service. Please consult the section on transcription guidelines for the appropriate language when preparing your transcriptions.
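
    For illustration, the snippet below writes such a tab-separated transcription file. This is only a sketch: the file names and transcription text are the hypothetical examples from above, and the output file name transcriptions.txt is an assumption.

        using System.IO;

        class TranscriptionFileWriter
        {
            static void Main()
            {
                // One line per audio file: "<filename><tab><transcription>".
                var lines = new[]
                {
                    "speech01.wav\tspeech recognition is awesome",
                    "speech02.wav\tthe quick brown fox jumped all over the place",
                    "speech03.wav\tthe lazy dog was not amused"
                };
                File.WriteAllLines("transcriptions.txt", lines);
            }
        }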

  • Importing the acoustic data set

    Once the audio files and transcriptions have been prepared, they are ready to be imported to the service web portal.

    To do so, first ensure you are signed into the system. Then click the “Menu” drop-down menu on the top ribbon and select “Acoustic Data”. If this is your first time uploading data to the Custom Speech Service, you will see an empty table called “Acoustic Data”. The current locale is reflected in the table title. If you would like to import acoustic data of a different language, click on “Change Locale”. Additional information on supported languages can be found in the section on Changing Locale.

    Click the “Import New” button, located directly below the table title and you will be taken to the page for uploading a new data set.


    Enter a Name and Description in the appropriate text boxes. These are useful for keeping track of the various data sets you upload. Next, click “Choose File” for the “Transcription File” and “WAV files” and select your plain-text transcription file and the zip archive of WAV files, respectively. When this is complete, click “Import” to upload your data. For larger data sets, the upload may take several minutes.

    When the upload is complete, you will return to the “Acoustic Data” table and will see an entry that corresponds to your acoustic data set. Notice that it has been assigned a unique id (GUID). The data will also have a status that reflects its current state. Its status will be “Waiting” while it is being queued for processing, “Processing” while it is going through validation, and “Complete” when the data is ready for use.

    Data validation includes a series of checks on the audio files to verify the file format, length, and sampling rate, and on the transcription files to verify the file format and perform some text normalization.

    When the status is “Complete” you can click “View Report” to see the acoustic data verification report. The number of utterances that passed and failed verification will be shown, along with details about the failed utterances. In the example below, two WAV files failed verification because of improper audio format (in this data set, one had an incorrect sampling rate and one was the incorrect file format).


    At some point, if you would like to change the Name or Description of the data set, you can click the “Edit” link and change these entries. Note that you cannot modify the audio files or transcriptions.


  • Creating a custom acoustic model

    Once the status of your acoustic data set is “Complete”, it can be used to create a custom acoustic model. To do so, click “Acoustic Models” in the “Menu” drop-down menu. You will see a table called “Your models” that lists all of your custom acoustic models. This table will be empty if this is your first use. The current locale is shown in the table title. Currently, acoustic models can be created for US English only.

    To create a new model, click “Create New” under the table title. As before, enter a name and description to help you identify this model. For example, the “Description” field can be used to record which starting model and acoustic data set were used to create the model. Next, select a “Base Acoustic Model” from the drop-down menu. The base model is the starting point for your customization. There are two base acoustic models to choose from. The Microsoft Search and Dictation AM is appropriate for speech directed at an application, such as commands, search queries, or dictation. The Microsoft Conversational model is appropriate for recognizing speech spoken in a conversational style. This type of speech is typically directed at another person and occurs in call centers or meetings.

    Next, select the acoustic data you wish to use to perform the customization using the drop-down menu.


    You can optionally choose to perform offline testing of your new model when the processing is complete. This will run a speech-to-text evaluation on a specified acoustic data set using the customized acoustic model and report the results. To perform this testing, select the “Offline Testing” check box. Then select a language model from the drop-down menu. If you have not created any custom language models, only the base language models will be in the drop-down list. Please see the description of the base language models in the guide and select the one that is most appropriate.

    Finally, select the acoustic data set you would like to use to evaluate the custom model. If you perform offline testing, it is important to select an acoustic data set that is different from the one used for model creation, to get a realistic sense of the model’s performance. Also note that offline testing is limited to 1000 utterances. If the acoustic data set used for testing is larger than that, only the first 1000 utterances will be evaluated.

    When you are ready to start running the customization process, press “Create”.

    You will now see a new entry in the acoustic models table corresponding to this new model. The status of the process is reflected in the table. The status states are “Waiting”, “Processing” and “Complete”.


    When the model is “Complete”, it can be used in a deployment. Clicking on the “Edit” link enables you to change the "Name" and "Description" of the model. If you have requested offline testing of your model, clicking “View Result” will show the results.

    Please note that the acoustic model creation process is computationally intensive and takes a significant amount of time. The computation time of model creation is approximately the same as the duration of the acoustic data set. For example, if your acoustic data set is composed of two hours of audio, then expect the model creation processing to take about two hours. If you choose to do offline testing, that adds additional processing time.

    To deploy your new acoustic model to a custom speech-to-text endpoint, see Creating a custom speech-to-text endpoint.



Creating a custom language model

  • The procedure for creating a custom language model is similar to creating an acoustic model except there is no audio data, only text. The text should consist of many examples of queries or utterances you expect users to say or have logged users saying (or typing) in your application.

  • Preparing the data for a custom language model

    In order to create a custom language model for your application, you need to provide a list of example utterances to the system, for example:

    • "He has had urticaria for the past week."
    • "The patient had a well-healed herniorrhaphy scar."

    The sentences do not need to be complete sentences or grammatically correct, but they should accurately reflect the spoken input you expect the system to encounter in deployment. These examples should reflect both the style and content of the task the users will perform with your application.

    The language model data should be written in a plain-text file using either US-ASCII or UTF-8 encoding, depending on the locale. For en-US, both encodings are supported. For zh-CN, only UTF-8 is supported (a BOM is optional). The text file should contain one example (sentence, utterance, or query) per line.

    If you wish some sentences to have a higher weight (importance), you can add them to your data several times. A good number of repetitions is between 10 and 100. If you normalize the most important sentences to 100 repetitions, you can easily weight other sentences relative to them (see the sketch at the end of this section).

    The main requirements for the language data are summarized in the following table.

    Property                   Value
    Text Encoding              en-US: US-ASCII or UTF-8; zh-CN: UTF-8
    # of Utterances per line   1
    Maximum File Size          200 MB
    Remarks                    Avoid repeating characters more than 4 times, e.g. 'aaaaa'.
                               No special characters such as '\t' or any other UTF-8 character above U+00A1 in the Unicode character table.
                               URIs will also be rejected, since there is no unique way to pronounce a URI.

    When the text is imported, it will be text-normalized so it can be processed by the system. However, there are some very important normalizations that must be done by the user prior to uploading the data. Please consult the section on Transcription Guidelines for the appropriate language when preparing your language data.
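
    As an illustration of the weighting approach described earlier, the following sketch writes a language data file in UTF-8 without a BOM, repeating higher-weight sentences to make them more important. The sentences, repetition counts, and file name languageData.txt are hypothetical.

        using System.Collections.Generic;
        using System.IO;
        using System.Text;

        class LanguageDataWriter
        {
            static void Main()
            {
                // Sentence -> relative weight (number of times it is repeated in the data file).
                var sentences = new Dictionary<string, int>
                {
                    ["he has had urticaria for the past week"] = 100,            // most important
                    ["the patient had a well healed herniorrhaphy scar"] = 50,
                    ["schedule a follow up visit in two weeks"] = 10
                };

                // UTF8Encoding(false) writes UTF-8 without a byte order mark.
                using var writer = new StreamWriter("languageData.txt", append: false, new UTF8Encoding(false));
                foreach (var (sentence, weight) in sentences)
                    for (int i = 0; i < weight; i++)
                        writer.WriteLine(sentence);   // one utterance per line
            }
        }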

  • Importing the language data set

    When you are ready to import your language data set, click “Language Data” from the “Menu” drop-down menu. A table called “Language Data” that contains your language data sets is shown. If you have not yet uploaded any language data, the table will be empty. The current locale is reflected in the table title. If you would like to import language data of a different language, click on “Change Locale”. Additional information on supported languages can be found in the section on Changing Locale.

    To import a new data set, click “Import New” under the table title. Enter a Name and Description to help you identify the data set in the future. Next, use the “Choose File” button to locate the text file of language data. After that, click “Import” and the data set will be uploaded. Depending on the size of the data set, this may take several minutes.


    When the import is complete, you will return to the language data table and will see an entry that corresponds to your language data set. Notice that it has been assigned a unique id (GUID). The data will also have a status that reflects its current state. Its status will be “Waiting” while it is being queued for processing, “Processing” while it is going through validation, and “Complete” when the data is ready for use. Data validation performs a series of checks on the text in the file and some text normalization of the data.

    When the status is “Complete” you can click “View Report” to see the language data verification report. The number of utterances that passed and failed verification are shown, along with details about the failed utterances. In the example below, two examples failed verification because of improper characters (in this data set, the first had two emoticons and the second had several characters outside of the ASCII printable character set).


    When the status of the language data set is “Complete”, it can be used to create a custom language model.


  • Creating a custom language model

    Once your language data is ready, click “Language Models” from the “Menu” drop-down menu to start the process of custom language model creation. This page contains a table called “Language Models” with your current custom language models. If you have not yet created any custom language models, the table will be empty. The current locale is reflected in the table title. If you would like to create a language model for a different language, click on “Change Locale”. Additional information on supported languages can be found in the section on Changing Locale. To create a new model, click the “Create New” link below the table title.

    On the "Create Language Model" page, enter a "Name" and "Description" to help you keep track of pertinent information about this model, such as the data set used. Next, select the “Base Language Model” from the drop-down menu. This model will be the starting point for your customization. There are two base language models to choose from. The Microsoft Search and Dictation LM is appropriate for speech directed at an application, such as commands, search queries, or dictation. The Microsoft Conversational LM is appropriate for recognizing speech spoken in a conversational style. This type of speech is typically directed at another person and occurs in call centers or meetings.

    After you have specified the base language model, select the language data set you wish to use for the customization using the “Language Data” drop-down menu.


  • As with the acoustic model creation, you can optionally choose to perform offline testing of your new model when the processing is complete. Note that because this is an evaluation of the speech-to-text performance, offline testing requires an acoustic data set.

    To perform offline testing of your language model, select the check box next to “Offline Testing”. Then select an acoustic model from the drop-down menu. If you have not created any custom acoustic models, the Microsoft base acoustic models will be the only models in the menu. If you have picked a conversational base language model, you need to use a conversational acoustic model here. If you use a search and dictation language model, you have to select a search and dictation acoustic model.

    Finally, select the acoustic data set you would like to use to perform the evaluation.

    When you are ready to start processing, press “Create”. This will return you to the table of language models. There will be a new entry in the table corresponding to this model. The status reflects the model’s state and will go through several states including “Waiting”, “Processing”, and “Complete”.

    When the model has reached the “Complete” state, it can be deployed to an endpoint. Clicking on “View Result” will show the results of offline testing, if performed.

    If you would like to change the "Name" or "Description" of the model at some point, you can use the “Edit” link in the appropriate row of the language models table.


Creating a custom speech-to-text endpoint

  • When you have created custom acoustic models and/or language models, they can be deployed to a custom speech-to-text endpoint. To create a new custom endpoint, click “Deployments” in the “Menu” drop-down menu at the top of the page. This takes you to a table called “Deployments” that lists your current custom endpoints. If you have not yet created any endpoints, the table will be empty. The current locale is reflected in the table title. If you would like to create a deployment for a different language, click on “Change Locale”. Additional information on supported languages can be found in the section on Changing Locale.

    To create a new endpoint, click the “Create New” link under the table title. On the “Create Deployment” screen, enter a “Name” and “Description” for your custom deployment. From the “Acoustic Model” drop-down, select the desired acoustic model, and from the “Language Model” drop-down, select the desired language model. The choices for acoustic and language models always include the base Microsoft models. The selection of the base model limits the possible combinations; you cannot mix conversational base models with search and dictation base models.


    When you have selected your acoustic and language models, click the “Create” button. This will return you to the table of deployments and you will see an entry in the table corresponding to your new endpoint. The endpoint’s status reflects its current state while it is being created. It can take up to 30 minutes to instantiate a new endpoint with your custom models. When the status of the deployment is “Complete”, the endpoint is ready for use.


    You’ll notice that when the deployment is ready, the Name of the deployment becomes a clickable link. Clicking that link shows the URLs of your custom endpoint, for use either with an HTTP request or with the Microsoft Cognitive Services Speech Client Library, which uses web sockets.


Using a custom speech-to-text endpoint

  • Requests can be sent to a Custom Speech Service speech-to-text endpoint in much the same way as to the default Microsoft Cognitive Services speech endpoint. These endpoints are functionally identical to the default endpoints of the Speech API, so the same functionality available via the client library or REST API for the Speech API is also available for your custom endpoint.

    Please note that endpoints created via this service can process different numbers of concurrent requests, depending on the tier the subscription is associated with. If more concurrent requests are made than the tier allows, they will return the error code 429 (Too many requests). For more information, please see the pricing information. In the free tier, there is a monthly limit on the number of requests. If you exceed this limit, the service returns the error code 403 (Forbidden).

    The service assumes that audio data is transmitted in real time. If it is sent faster, the request is considered to be running until the real-time duration of the audio has passed.
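
    One simple way to cope with the 429 response in client code is to retry with a backoff, along the lines of the following sketch. The retry policy, attempt limit, and helper name are assumptions; adjust them to your application.

        using System;
        using System.Net.Http;
        using System.Threading.Tasks;

        static class RecognitionRetry
        {
            // Retries a request while the endpoint replies 429 (Too many requests).
            public static async Task<HttpResponseMessage> SendWithBackoffAsync(
                HttpClient client, Func<HttpRequestMessage> makeRequest, int maxAttempts = 3)
            {
                for (int attempt = 1; ; attempt++)
                {
                    var response = await client.SendAsync(makeRequest());
                    if ((int)response.StatusCode != 429 || attempt == maxAttempts)
                        return response;   // success, a different error, or out of retries
                    await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));   // exponential backoff
                }
            }
        }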

  • Sending requests via the client library

    To send requests to your custom endpoint using the speech client library, instantiate the recognition client using the Client Speech SDK from NuGet (search for "speech recognition" and select Microsoft's speech recognition NuGet package for your platform). Some sample code can be found here. The Client Speech SDK provides a factory class, SpeechRecognitionServiceFactory, which offers four methods:

    • CreateDataClient(...): A data recognition client
    • CreateDataClientWithIntent(...): A data recognition client with intent
    • CreateMicrophoneClient(...): A microphone recognition client
    • CreateMicrophoneClientWithIntent(...): A microphone recognition client with intent

    For detailed documentation please visit the Bing Speech API documentation since the Custom Speech Service endpoints support the same SDK.

    The data recognition client is appropriate for speech recognition from data, such as a file or other audio source. The microphone recognition client is appropriate for speech recognition from the microphone. The use of intent in either client can return structured intent results from the Language Understanding Intelligent Service (LUIS) if you have built a LUIS application for your scenario.

    All four types of clients can be instantiated in two ways. The first will utilize the standard Microsoft Cognitive Services Speech API, and the second allows you to specify a URL that corresponds to your custom endpoint created with the Custom Speech Service.

    For example, a DataRecognitionClient that will send requests to a custom endpoint can be created using the following method.

    public static DataRecognitionClient CreateDataClient(SpeechRecognitionMode speechRecognitionMode, string language, string primaryOrSecondaryKey, string url);

    The values your_subscriptionId and endpointURL refer to the Subscription Key and the Web Sockets URL shown on the Deployment Information page, respectively.

    The authenticationUri is used to receive a token from the authentication service. This Uri must be set separately as provided in the sample code below.

    Here is some sample code showing how to use the client SDK:

    var dataClient = SpeechRecognitionServiceFactory.CreateDataClient(
        SpeechRecognitionMode.LongDictation,
        "en-us",
        "your_subscriptionId",   // subscription key
        "your_subscriptionId",   // provided twice because of the overloading of the Create methods
        "endpointURL");

    // set the authorization Uri
    dataClient.AuthenticationUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";

    Note that when using Create methods in the SDK you must provide the subscription id twice. This is because of overloading of the Create methods.

    Also, note that the Custom Speech Service uses two different URLs for short form and long form recognition, both listed on the "Deployments" page. Please use the correct endpoint URL for the specific form you want to use.

    More details on invoking the various recognition clients with your custom endpoint can be found on the MSDN page describing the SpeechRecognitionServiceFactory class. Note that the documentation on this page refers to Acoustic Model adaptation but it applies to all endpoints created via the Custom Speech Service.
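
    For completeness, here is a rough sketch of streaming an audio file to the endpoint with the data recognition client created above. The member names used here (OnPartialResponseReceived, OnResponseReceived, SendAudio, EndAudio) follow Microsoft's published client-library samples; treat them as assumptions and verify the exact signatures against the SDK documentation for your platform.

        using (var fileStream = File.OpenRead("example.wav"))
        {
            // Print hypotheses as they arrive and note when the final result is in.
            dataClient.OnPartialResponseReceived += (sender, args) =>
                Console.WriteLine("Partial: " + args.PartialResult);
            dataClient.OnResponseReceived += (sender, args) =>
                Console.WriteLine("Final response received.");

            // Stream the audio in small chunks, then signal that the audio is complete.
            var buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) > 0)
                dataClient.SendAudio(buffer, bytesRead);
            dataClient.EndAudio();
        }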

  • Sending requests via HTTP

    Sending a request to your custom endpoint via HTTP post is similar to sending a request by HTTP to the Microsoft Cognitive Services Bing Speech API, except the URL needs to be modified to reflect the address of your custom deployment.

    Note that there are some restrictions on requests sent via HTTP for both the Microsoft Cognitive Services Speech endpoint and custom endpoints created with this service. The HTTP request cannot return partial results during the recognition process. Additionally, the duration of the requests is limited to 10 seconds for the audio content and 14 seconds overall.

    To create a POST request, the same process used for the Microsoft Cognitive Services Speech API must be followed. It consists of the following two steps:

    1. Obtain an access token using your subscription id. This is required to access the recognition endpoint and can be reused for ten minutes.


      curl -X POST --header "Ocp-Apim-Subscription-Key:<subscriptionId>" --data "" "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken"


      where <subscriptionId> should be set to the Subscription Id you use for this deployment. The response is the plain token you need for the next request.

    2. Post audio to the endpoint using POST again:


      curl -X POST --data-binary @example.wav -H "Authorization: Bearer <token>" -H "Content-Type: application/octet-stream" "<https_endpoint>"


      where <token> is the access token you received from the previous call and <https_endpoint> is the full address of your custom speech-to-text endpoint shown on the Deployment Information page.

    Please refer to documentation on the Microsoft Cognitive Services Bing Speech HTTP API for more information about HTTP post parameters and the response format.
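
    If you prefer to issue these two calls from code rather than curl, a minimal C# sketch using HttpClient could look like the following. The subscription key, endpoint URL, and audio file name are placeholders to fill in from your own deployment.

        using System;
        using System.IO;
        using System.Net.Http;
        using System.Net.Http.Headers;
        using System.Threading.Tasks;

        class CustomEndpointClient
        {
            static async Task Main()
            {
                const string subscriptionKey = "<subscriptionId>";   // key from the Azure portal
                const string tokenUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";
                const string endpointUri = "<https_endpoint>";       // custom endpoint from the Deployments page

                using var client = new HttpClient();

                // Step 1: exchange the subscription key for an access token (reusable for ten minutes).
                client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
                var tokenResponse = await client.PostAsync(tokenUri, new StringContent(""));
                var token = await tokenResponse.Content.ReadAsStringAsync();

                // Step 2: post the audio to the custom endpoint with the bearer token.
                using var request = new HttpRequestMessage(HttpMethod.Post, endpointUri);
                request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", token);
                request.Content = new ByteArrayContent(File.ReadAllBytes("example.wav"));
                request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

                var response = await client.SendAsync(request);
                Console.WriteLine(await response.Content.ReadAsStringAsync());
            }
        }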


Changing Locale

  • The Custom Speech Service currently supports customization of models in the following locales:

    Acoustic Models US English (en-US)
    Language Models US English (en-US), Chinese (zh-CN)

    Although Acoustic Model customization is only supported in US English, importing Chinese acoustic data is supported for the purposes of offline testing of customized Chinese Language Models.

    The appropriate locale must be selected before taking any action. The current locale is indicated in the table title on all data, model, and deployment pages. To change the locale, click the “Change Locale” button located under the table’s title. This will take you to a locale confirmation page. Click “OK” to return to the table.


Transcription guidelines (en-US)

  • To ensure the best use of your text data for acoustic and language model customization, the following transcription guidelines should be followed.

  • Text format for transcriptions and text data

    Text data uploaded to this service should be written in plain text using only the ASCII printable character set. Each line of the file should contain the text for a single utterance only.

    It is important to avoid the use of Unicode punctuation characters. This can happen inadvertently if preparing the data in a word processing program or scraping data from web pages. Replace these characters with appropriate ASCII substitutions. For example:

    Unicode to avoid                                ASCII substitution
    “Hello world” (open and close double quotes)    "Hello world" (double quotes)
    John’s day (right single quotation mark)        John's day (apostrophe)
  • Text normalization

    For optimal use in the acoustic or language model customization, the text data must be normalized, which means transformed into a standard, unambiguous form readable by the system. This section describes the text normalization performed by the Custom Speech Service when data is imported and the text normalization that the user must perform prior to data import.

    • Text normalization performed by the Custom Speech Service

      The Custom Speech Service performs the following text normalization on data imported as a language data set or as transcriptions for an acoustic data set:

      • Lower-casing all text
      • Removing all punctuation except word-internal apostrophes
      • Expansion of numbers to spoken form, including dollar amounts

      Here are some examples

      Original Text                            After Normalization
      Starbucks Coffee                         starbucks coffee
      “Holy cow!” said Batman.                 holy cow said batman
      “What?” said Batman’s sidekick, Robin.   what said batman’s sidekick robin
      Go get -em!                              go get em
      I’m double-jointed                       i’m double jointed
      104 Main Street                          one oh four main street
      Tune to 102.7                            tune to one oh two point seven
      Pi is about 3.14                         pi is about three point one four
      It costs $3.14                           it costs three fourteen
    • Text normalization required by users

      To ensure the best use of your data, the following normalization rules should be applied to your data prior to importing it.

      • Abbreviations should be written out in words to reflect spoken form
      • Non-standard numeric strings should be written out in words
      • Words with non-alphabetic characters or mixed alphanumeric characters should be transcribed as pronounced
      • Common acronyms can be left as a single entity without periods or spaces between the letters, but all other acronyms should be written out in separate letters, with each letter separated by a single space

      Here are some examples

      Original Text                   After Normalization
      14 NE 3rd Dr.                   fourteen northeast third drive
      Dr. Strangelove                 Doctor Strangelove
      James Bond 007                  james bond double oh seven
      Ke$ha                           Kesha
      How long is the 2x4             How long is the two by four
      The meeting goes from 1-3pm     The meeting goes from one to three pm
      my blood type is O+             My blood type is O positive
      water is H20                    water is H 2 O
      play OU812 by Van Halen         play O U 8 1 2 by Van Halen
    • Special considerations for transcribing acoustic data

      The above text normalization requirements apply to both transcriptions for acoustic data sets and text for language data sets. However, when transcribing recorded speech data, the following additional recommendations should be followed.

      Rule: Transcription should reflect what the user says, not the formal version of the word (if the user spoke “wanna”, transcribe “wanna”, not “want to”).
        Before normalization: Wanna go to the mall?
        After normalization:  wanna go to the mall

      Rule: Ungrammatical utterances should be transcribed as is, not corrected.
        Before normalization: Find a bonds with a ten year maturity date
        After normalization:  find a bonds with a ten year maturity date

      Rule: False starts and hesitations can be transcribed as words or word fragments. Repetitions should be transcribed as spoken.
        Before normalization: Do, uh, you want to, uh, go out sometime? / I I just wanted to say
        After normalization:  do uh you want to uh go out sometime / i i just wanted to say

      Rule: [For acoustic data sets only] Any short, discrete noise, either from the speaker (e.g. cough, sneeze, etc.) or the environment (knock, door slam, beep, etc.), can be transcribed as “_noise_”.
        Before normalization: I’m [COUGH] going outside [DOOR SLAM]
        After normalization:  i’m _noise_ going outside _noise_
    • Inverse text normalization

      The process of converting “raw” unformatted text back to formatted text, i.e. with capitalization and punctuation, is called inverse text normalization (ITN). ITN is performed on results returned by the Microsoft Cognitive Services Speech API. A custom endpoint deployed using the Custom Speech Service uses the same ITN as the Microsoft Cognitive Services Speech API. However, this service does not currently support custom ITN, so terms used in a custom language model will not be formatted in the recognition results unless they also existed in the base language model.


Transcription guidelines (zh-CN)

  • To ensure the best use of your text data for acoustic and language model customization, the following transcription guidelines should be followed.

  • Text format for transcriptions and text data

    Text data uploaded to the Custom Speech Service should be written in plain text using only the UTF-8 encoding (a BOM is optional). Each line of the file should contain the text for a single utterance only.

    It is important to avoid the use of half-width punctuation characters. This can happen inadvertently if preparing the data in a word processing program or scraping data from web pages. Replace these characters with appropriate full-width substitutions. For example:

    Half-width to avoid                       Full-width substitution
    "你好" (half-width double quotes)          “你好” (full-width double quotes)
    需要什么帮助? (half-width question mark)    需要什么帮助？ (full-width question mark)
  • Text normalization

    For optimal use in the acoustic or language model customization, the text data must be normalized, which means transformed into a standard, unambiguous form readable by the system. This section describes the text normalization performed by this service when data is imported and the text normalization that the user must perform prior to data import.

    • Text normalization performed by the Custom Speech Service

      The Custom Speech Service performs the following text normalization on data imported as a language data set or as transcriptions for an acoustic data set:

      • Removing all punctuation
      • Expansion of numbers to spoken form
      • Conversion of full-width letters to half-width letters
      • Upper-casing all English words

      Here are some examples:

      Original Text      After Normalization
      3.1415             三 点 一 四 一 五
      ¥3.5               三 元 五 角
      w f y z            W F Y Z
      1992年8月8日        一 九 九 二 年 八 月 八 日
      你吃饭了吗 ?        你 吃饭 了 吗
      下午5:00的航班      下午 五点 的 航班
      我今年21岁          我 今年 二十 一 岁
    • Text normalization required by users

      To ensure the best use of your data, the following normalization rules should be applied to your data prior to importing it.

      • Abbreviations should be written out in words to reflect spoken form
      • This service doesn’t cover all numeric quantities. It is more reliable to write numeric strings out in spoken form

      Here are some examples

      Original Text After Normalization
      我今年21 我今年二十一
      3号楼504 三号 楼 五 零 四
  • Special considerations for transcribing acoustic data

    The above text normalization requirements apply to both transcriptions for acoustic data sets and text for language data sets. However, when transcribing recorded speech data, the following additional recommendations should be followed.

    Rule: Transcription should reflect what the user says, not the formal version of the word (if the user spoke “今儿”, transcribe “今儿”, not “今天”).
      Before normalization: 你今儿去哪玩了 ?
      After normalization:  你 今儿 去哪 玩 了

    Rule: Ungrammatical utterances should be transcribed as is, not corrected.
      Before normalization: 下雨了吗外面
      After normalization:  下雨 了 吗 外面

    Rule: False starts and hesitations can be transcribed as words or word fragments.
      Before normalization: 嗯，那个，嗯，你有空吗 ?
      After normalization:  嗯 那个 嗯 你 有空 吗

    Rule: [For acoustic data sets only] Any short, discrete noise, either from the speaker (e.g. cough, sneeze, etc.) or the environment (knock, door slam, beep, etc.), can be transcribed as “_noise_”.
      Before normalization: 我 [COUGH] 感冒了 / [DOOR SLAM]
      After normalization:  我 _noise_ 感冒 了 / _noise_

  • Inverse text normalization

    The process of converting “raw” unformatted text back to formatted text, i.e. with capitalization and punctuation, is called inverse text normalization (ITN). ITN is performed on results returned by the Microsoft Cognitive Services Speech API. A custom endpoint deployed using the Custom Speech Service uses the same ITN as the Microsoft Cognitive Services Speech API. However, this service does not currently support custom ITN, so new terms introduced by a custom language model will not be formatted in the recognition results.