How to limit a dataset with Nkululeko

In some cases you don't want to use the whole dataset for training or testing, but filter it in some way. There are several filter options in Nkululeko:

limit_samples: limit the total number of samples, selected at random
limit_samples_per_speaker: maximum number of samples per speaker (to balance data where some speakers have a very large number of samples)
min_duration_of_sample: keep only samples of at least this duration (in seconds)
max_duration_of_sample: keep only samples of at most this duration (in seconds)
filter: use only the samples that have a given value in a given column, specified as [col, val]
You can specify several filters at once, e.g.
```
[DATA]
filter = [['sex', 'female'], ['style', 'reading']]
```
This would use only the data where sex is female and style is reading.
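
To make the combined filter semantics concrete, here is a minimal plain-Python sketch. It is not Nkululeko's own code: the function name apply_filters and the toy rows are made up for illustration, and it only assumes what is described above, namely that a sample is kept when it matches every [col, val] pair.

```python
import ast


def apply_filters(rows, filter_value):
    """Keep only the rows that match every [column, value] pair.

    Hypothetical helper for illustration, not part of Nkululeko.
    """
    # The config value is a string that looks like a Python list of lists.
    filters = ast.literal_eval(filter_value)
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in filters)
    ]


# Toy metadata standing in for a database's sample table.
rows = [
    {"file": "a.wav", "sex": "female", "style": "reading"},
    {"file": "b.wav", "sex": "male", "style": "reading"},
    {"file": "c.wav", "sex": "female", "style": "spontaneous"},
]

selected = apply_filters(rows, "[['sex', 'female'], ['style', 'reading']]")
print([row["file"] for row in selected])  # ['a.wav']
```

Only a.wav survives, because it is the only row that satisfies both conditions.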

These can be specified per database:

```
[DATA]
databases = ['d1']
# force a specific feature to be present, e.g. gender labels (when not all data has gender values)
d1.required = gender
# limit the absolute sample number
d1.limit_samples = 500
# limit the number of samples per speaker
d1.limit_samples_per_speaker = 20
```
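
As a rough illustration of what the two limits do to a database, here is a small Python sketch. It is an assumption-laden toy, not Nkululeko's implementation: the helper names, the fixed seed, and the order in which the two limits are applied are all my own choices.

```python
import random
from collections import defaultdict


def limit_per_speaker(samples, max_per_speaker, seed=0):
    """Randomly keep at most max_per_speaker samples for each speaker."""
    rng = random.Random(seed)  # fixed seed only to make the toy repeatable
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample["speaker"]].append(sample)
    kept = []
    for speaker_samples in by_speaker.values():
        if len(speaker_samples) > max_per_speaker:
            speaker_samples = rng.sample(speaker_samples, max_per_speaker)
        kept.extend(speaker_samples)
    return kept


def limit_total(samples, max_samples, seed=0):
    """Randomly keep at most max_samples samples overall."""
    if len(samples) <= max_samples:
        return list(samples)
    return random.Random(seed).sample(samples, max_samples)


# Toy database: speaker s1 dominates with 50 samples, s2 has only 5.
samples = [{"file": f"s1_{i}.wav", "speaker": "s1"} for i in range(50)]
samples += [{"file": f"s2_{i}.wav", "speaker": "s2"} for i in range(5)]

leveled = limit_per_speaker(samples, max_per_speaker=20)
print(len(leveled))  # 25: 20 from s1 plus all 5 from s2

limited = limit_total(leveled, max_samples=10)
print(len(limited))  # 10
```

The per-speaker limit is what levels out the dominant speaker; the absolute limit simply caps the overall size.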

Alternatively, the filters can be applied to all samples, or only to the train or test split:

```
[DATA]
# only filter the training split:
filter.sample_selection = train
# specify a minimum duration for train samples (in seconds)
min_duration_of_sample = 3.5
# use only samples where gender is female
filter = [['gender', 'female']]
```
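
The effect of restricting a filter to one split can be sketched like this. Again this is only an illustration under assumptions: filter_min_duration is a made-up helper, the splits are toy data, and I assume that a sample exactly at the minimum duration is kept, which may differ from what Nkululeko does.

```python
def filter_min_duration(splits, min_duration, sample_selection="all"):
    """Drop samples shorter than min_duration from the selected split(s).

    splits maps a split name ("train", "test") to a list of samples;
    sample_selection is "all", "train" or "test".
    """
    targets = list(splits) if sample_selection == "all" else [sample_selection]
    result = {}
    for name, samples in splits.items():
        if name in targets:
            # Assumption: the boundary is inclusive.
            samples = [s for s in samples if s["duration"] >= min_duration]
        result[name] = list(samples)
    return result


splits = {
    "train": [
        {"file": "a.wav", "duration": 2.0},
        {"file": "b.wav", "duration": 4.2},
    ],
    "test": [{"file": "c.wav", "duration": 1.5}],
}

filtered = filter_min_duration(splits, 3.5, sample_selection="train")
print([s["file"] for s in filtered["train"]])  # ['b.wav']
print([s["file"] for s in filtered["test"]])   # ['c.wav'], test is untouched
```

The short test sample stays in, because only the train split was selected for filtering.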

speechsurfer
