In some cases you don't want to use the whole dataset for training or testing, but only a filtered part of it. nkuluoleko offers several filter options:
- limit_samples: limit the number of samples, randomly selected
- limit_samples_per_speaker: maximum number of samples per speaker (to balance data where some speakers have a large number of samples)
- min_duration_of_sample: limit the samples to a minimum length (in seconds)
- max_duration_of_sample: limit the samples to a maximum length (in seconds)
- filter: use only the samples that have a selected value in a column: [col, val].
You can combine several filters in one, e.g.
[DATA]
filter = [['sex', 'female'], ['style', 'reading']]
would use only the data where sex is female and style is reading.
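To illustrate how several [col, val] pairs combine, here is a minimal sketch in plain Python. It is not nkuluoleko's actual implementation: the function name apply_filters, the sample records, and the string comparison are assumptions made for illustration only. It shows that a sample is kept only if it matches every pair.

```python
import ast


def apply_filters(rows, filters):
    """Keep only rows that match every [column, value] pair (logical AND).

    Hypothetical helper for illustration, not part of the toolkit.
    """
    return [
        row
        for row in rows
        if all(str(row.get(col)) == str(val) for col, val in filters)
    ]


# The filter value as written in the config file, parsed into a Python list.
filters = ast.literal_eval("[['sex', 'female'], ['style', 'reading']]")

# Made-up sample records standing in for a database's metadata.
samples = [
    {"file": "a.wav", "sex": "female", "style": "reading"},
    {"file": "b.wav", "sex": "female", "style": "spontaneous"},
    {"file": "c.wav", "sex": "male", "style": "reading"},
]

selected = apply_filters(samples, filters)
print([row["file"] for row in selected])  # ['a.wav']
```

Only a.wav satisfies both conditions, so the two filters narrow the data further than either one alone.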
The filters can be specified per database:
[DATA]
databases = ['d1']
# force a specific feature to be present, e.g. gender labels (when not all data has gender values)
d1.required = gender
# limit the absolute sample number
d1.limit_samples = 500
# limit the number of samples per speaker
d1.limit_samples_per_speaker = 20
Or for all samples, or only for the test and/or train splits:
[DATA]
# only filter the training split:
filter.sample_selection = train
# specify a minimum duration for train samples (in seconds)
min_duration_of_sample = 3.5
# use only samples where gender is female
filter = [['gender', 'female']]
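The effect of the duration limits and of limit_samples_per_speaker can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the toolkit's code: the helper names, the record layout with "speaker" and "duration" keys, the inclusive duration bounds, and the random selection per speaker are all hypothetical.

```python
import random
from collections import defaultdict


def filter_by_duration(rows, min_duration=None, max_duration=None):
    """Keep rows whose duration (in seconds) lies within the given bounds.

    Bounds are treated as inclusive here, which is an assumption.
    """
    return [
        row
        for row in rows
        if (min_duration is None or row["duration"] >= min_duration)
        and (max_duration is None or row["duration"] <= max_duration)
    ]


def limit_per_speaker(rows, max_per_speaker, seed=0):
    """Keep at most max_per_speaker rows for each speaker.

    Surplus rows are dropped by random selection (an assumption).
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["speaker"]].append(row)
    kept = []
    for speaker_rows in by_speaker.values():
        if len(speaker_rows) > max_per_speaker:
            speaker_rows = rng.sample(speaker_rows, max_per_speaker)
        kept.extend(speaker_rows)
    return kept


# Made-up data: speaker s1 has five samples, speaker s2 has two.
samples = [{"speaker": "s1", "duration": d} for d in (1.0, 2.0, 4.0, 5.0, 6.0)]
samples += [{"speaker": "s2", "duration": d} for d in (3.0, 7.0)]

long_enough = filter_by_duration(samples, min_duration=3.5)
balanced = limit_per_speaker(samples, max_per_speaker=2)
print(len(long_enough), len(balanced))  # 4 4
```

With a limit of 2 per speaker, s1 is cut from five samples to two while s2 is left unchanged, which is the balancing effect the option is meant to achieve.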