Google Summer of Code: Weeks 6 & 7
Hello all! 🙂
This is my 5th blog post in this category (Google Summer of Code 2018). It is a short report on my work during Weeks 6 and 7. These two weeks were mainly about training for the Mandarin language on PaddlePaddle.
Key Steps of training for Mandarin language
This section walks through the key steps of training for the Mandarin language so that you can quickly try the major modules (data preparation, model training, case inference and model evaluation) on a public dataset (Aishell). Reading this section will also help you understand how to make it work with your own data.
Some of the scripts in ./examples are configured with 8 GPUs. If you don’t have 8 GPUs available, please modify CUDA_VISIBLE_DEVICES and --trainer_count. If you don’t have any GPU available, please set --use_gpu to False to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce --batch_size to fit.
1. Go to directory
2. Prepare the data
run_data.sh will download the dataset, generate manifests, collect the normalizer’s statistics and build the vocabulary. Once the data preparation is done, you will find the data (the Aishell dataset) downloaded in ~/.cache/paddle/dataset/speech and the corresponding manifest files generated in ./data/aishell, as well as a mean and stddev file and a vocabulary file. The script has to be run the very first time you use this dataset, and its output is reusable for all further experiments.
After execution, you will see results like:
Skip downloading and unpacking. Data already exists in /mnt/rds/redhen/gallina/Singularity/DeepSpeech2/DeepSpeech/.cache/paddle/dataset/speech/Aishell.
Creating manifest data/aishell/manifest ...
----------- Configuration Arguments -----------
count_threshold: 0
manifest_paths: ['data/aishell/manifest.train', 'data/aishell/manifest.dev']
vocab_path: data/aishell/vocab.txt
------------------------------------------------
----------- Configuration Arguments -----------
manifest_path: data/aishell/manifest.train
num_samples: 2000
output_path: data/aishell/mean_std.npz
specgram_type: linear
------------------------------------------------
Aishell data preparation done.
3. Case inference with an existing model
run_infer_golden.sh will show some speech-to-text decoding results for several (default: 10) samples with the well-trained model.
4. Evaluate an existing model
run_test_golden.sh will evaluate the model with the Word Error Rate (or Character Error Rate) measurement.
More detailed information will be provided in the following blogs. Wish you a happy journey with the DeepSpeech2 on PaddlePaddle ASR engine!
Code Walkthrough
Details in run_data.sh
The code in run_data.sh can mainly be divided into 3 parts:
- download data, generate manifests (aishell.py)
- build vocabulary (build_vocab.py)
- compute mean and stddev for normalizer (compute_mean_std.py)
We will walk through these 3 parts in this section.
1. aishell.py
DeepSpeech2 on PaddlePaddle accepts a textual manifest file as its dataset interface. A manifest file summarizes a set of speech data, with each line containing the metadata (e.g. filepath, transcription, duration) of one audio clip in JSON format, such as:
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
To use your custom data, you only need to generate such manifest files to summarize the dataset. Given these manifests, training, inference and all other modules know where to access the audio files, as well as their metadata, including the transcription labels.
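As a rough illustration of what generating a manifest for your own data could look like, here is a minimal sketch using only the Python standard library. It is not the code from aishell.py: the helper names (`wav_duration_seconds`, `write_manifest`, `read_manifest`) are hypothetical, and it assumes the clips are plain PCM WAV files. Only the three JSON keys come from the example lines above.

```python
import json
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def write_manifest(entries, manifest_path):
    """Write one JSON object per line, using the keys from the example above.

    `entries` is an iterable of (audio_filepath, transcription) pairs.
    Hypothetical helper, not part of the DeepSpeech2 code base.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for audio_filepath, text in entries:
            record = {
                "audio_filepath": audio_filepath,
                "duration": wav_duration_seconds(audio_filepath),
                "text": text,
            }
            # ensure_ascii=False keeps Mandarin characters readable in the file.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_manifest(manifest_path):
    """Load a manifest back into a list of dicts, one per audio clip."""
    with open(manifest_path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]
```

Calling `write_manifest([("clip.wav", "你好")], "manifest.train")` would produce a one-line manifest in the same shape as the examples above.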
As an example of how to generate such manifest files, aishell.py will download the data and generate manifest files for the Aishell dataset.
2. build_vocab.py
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to text again. Such a character-based vocabulary can be built with build_vocab.py.
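To make the idea concrete, the following is a minimal sketch of a character-level vocabulary builder. It is an illustration and not the actual build_vocab.py: the function names are hypothetical, and the assumption that characters are kept when their count is at least `count_threshold` is mine (the log above only shows that such an argument exists and defaults to 0).

```python
import json
from collections import Counter


def build_vocab(manifest_paths, count_threshold=0):
    """Collect characters from the "text" field of every manifest line.

    Assumption: a character is kept when its count >= count_threshold.
    """
    counter = Counter()
    for path in manifest_paths:
        with open(path, encoding="utf-8") as src:
            for line in src:
                if line.strip():
                    counter.update(json.loads(line)["text"])
    # Most frequent first; ties are broken by code point for determinism.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, count in ordered if count >= count_threshold]


def text_to_indices(text, vocab):
    """Convert a transcription into a list of token indices."""
    index_of = {ch: i for i, ch in enumerate(vocab)}
    return [index_of[ch] for ch in text]


def indices_to_text(indices, vocab):
    """Convert a list of token indices back to text."""
    return "".join(vocab[i] for i in indices)
```

The two small conversion helpers show the round trip described above: text to indices for training, and indices back to text for decoding.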
3. compute_mean_std.py
To perform z-score normalization (zero-mean, unit stddev) upon audio features, we have to estimate in advance the mean and standard deviation of the features, with some training samples. compute_mean_std.py will compute the mean and standard deviation of the power spectrum feature with 2000 randomly sampled audio clips listed in data/aishell/manifest.train and save the results to data/aishell/mean_std.npz for further usage.
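The arithmetic behind z-score normalization can be sketched in a few lines of plain Python. This is an illustration of the technique only, not the code in compute_mean_std.py: feature frames are represented as simple lists of floats, and the function names and the small `eps` guard against division by zero are my own assumptions.

```python
import math
import random


def sample_clips(manifest_entries, num_samples, seed=0):
    """Pick up to num_samples random entries (e.g. 2000) from a manifest."""
    rng = random.Random(seed)
    return rng.sample(manifest_entries, min(num_samples, len(manifest_entries)))


def compute_mean_std(feature_frames):
    """Per-dimension mean and standard deviation over all frames.

    feature_frames: non-empty list of equal-length lists of floats.
    """
    n = len(feature_frames)
    dim = len(feature_frames[0])
    mean = [sum(frame[d] for frame in feature_frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((frame[d] - mean[d]) ** 2 for frame in feature_frames) / n)
        for d in range(dim)
    ]
    return mean, std


def normalize(frame, mean, std, eps=1e-20):
    """Apply z-score normalization to one frame: (x - mean) / std."""
    return [(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
```

After normalization with statistics estimated this way, each feature dimension has roughly zero mean and unit standard deviation over the sampled clips.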
Details in run_infer_golden.sh
The code in run_infer_golden.sh can mainly be divided into 3 parts:
- download language model (download_lm_ch.sh)
- download well-trained model (download_model.sh)
- infer (infer.py)
We will walk through these 3 parts in this section.
1. download_lm_ch.sh
In this example, we use a small language model for a quick test. In the future, we can use 70.4 GB Mandarin LM Large instead to improve the performance of our system.
2. download_model.sh
In this example, we use Aishell Model for a quick test. In the future, we can use BaiduCN1.2k Model instead to improve the performance of our system.
3. infer.py
An inference module caller, infer.py, is provided to infer, decode and visualize speech-to-text results for several given audio clips. It can help to give an intuitive and qualitative evaluation of the ASR model’s performance.
Two types of CTC decoders are provided: a CTC greedy decoder and a CTC beam search decoder. The CTC greedy decoder is an implementation of the simple best-path decoding algorithm: it selects the most likely token at each timestep, and is thus greedy and locally optimal. The CTC beam search decoder, in contrast, uses a heuristic breadth-first graph search to reach a near-globally-optimal result; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument --decoding_method.
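The best-path idea is simple enough to sketch in a few lines. The function below is a minimal illustration of CTC greedy decoding, not the decoder shipped with DeepSpeech2; the function name and the convention that the blank token is the index right after the last vocabulary entry are assumptions made for this example.

```python
def ctc_greedy_decode(probs_seq, vocabulary, blank_index=None):
    """Best-path CTC decoding.

    probs_seq: list of per-timestep probability lists, each of length
               len(vocabulary) + 1 (the extra entry is the blank token).
    blank_index: assumed to be len(vocabulary) unless given explicitly.
    """
    if blank_index is None:
        blank_index = len(vocabulary)
    # 1. Pick the most likely token index at each timestep.
    best_path = [
        max(range(len(probs)), key=probs.__getitem__) for probs in probs_seq
    ]
    # 2. Merge consecutive repeats of the same index.
    merged = [
        idx for i, idx in enumerate(best_path)
        if i == 0 or idx != best_path[i - 1]
    ]
    # 3. Drop blanks and map the remaining indices to characters.
    return "".join(vocabulary[idx] for idx in merged if idx != blank_index)
```

Note that a blank between two identical tokens keeps them separate, which is how CTC can output doubled characters.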
Details in run_test_golden.sh
The code in run_test_golden.sh can mainly be divided into 3 parts:
- download language model (download_lm_ch.sh)
- download well-trained model (download_model.sh)
- evaluate model (test.py)
Since the first two parts are the same as those in run_infer_golden.sh, we will only walk through the last part, test.py, in this section.
test.py
With test.py, we are able to evaluate a model’s performance quantitatively. The error rate (default: word error rate; can be set with --error_rate_type) will be printed.
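For readers unfamiliar with these metrics, both error rates are the edit (Levenshtein) distance between the reference and the hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The sketch below illustrates that definition; the function names are hypothetical and this is not the code used by test.py. Removing spaces before computing CER is also an assumption of this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (0 if tokens match)
            ))
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)
```

For Mandarin, where text is not separated into words by spaces, the character error rate is usually the more meaningful of the two.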
Conclusion
That’s all from the 6th and 7th weeks. The next post will be about the 8th and 9th weeks. Thank you for reading. 😀