In some cases you don't want to use the whole dataset for training or testing, but only a filtered part of it. nkuluoleko offers several filter options:
- limit_samples: limit the number of samples, randomly selected
- limit_samples_per_speaker: maximum number of samples per speaker (for balancing data where some speakers have a large number of samples)
- min_duration_of_sample: limit the samples to a minimum length (in seconds)
- max_duration_of_sample: limit the samples to a maximum length (in seconds)
- filter: don't use all the data but only selected values from columns: [col, val].
You can specify several filters in one setting, e.g.
[DATA]
filter = [['sex', 'female'], ['style', 'reading']]
would use only the data where sex is female and style is reading.
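To illustrate how several [col, val] pairs combine, here is a minimal Python sketch. It is not nkuluoleko's own code: the function name apply_filters and the sample rows are made up for this example. The only thing taken from the text above is that a sample is kept when it matches every pair.

```python
def apply_filters(rows, filters):
    """Keep only rows that match every [column, value] pair (logical AND).

    Hypothetical helper for illustration, not part of nkuluoleko.
    """
    return [
        row for row in rows
        if all(row.get(col) == val for col, val in filters)
    ]


# Made-up sample metadata
rows = [
    {"file": "a.wav", "sex": "female", "style": "reading"},
    {"file": "b.wav", "sex": "female", "style": "spontaneous"},
    {"file": "c.wav", "sex": "male", "style": "reading"},
]

selected = apply_filters(rows, [["sex", "female"], ["style", "reading"]])
print([row["file"] for row in selected])  # ['a.wav']
```

Only a.wav survives, because it is the only sample that satisfies both conditions at once.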
These can be specified per database:
[DATA]
databases = ['d1']
# force a specific feature to be present, e.g. gender labels (when not all data has gender values)
d1.required = gender
# limit the number of samples per speaker
d1.max_samples_per_speaker = 20
Or for all samples, or only for the test and/or train splits:
[DATA]
# only filter the training split:
filter.sample_selection = train
# specify a minimum duration for train samples (in seconds)
min_duration_of_sample = 3.5
# use only samples where gender is female
filter = [['gender', 'female']]
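The duration and per-speaker limits can be pictured with another small Python sketch. Again, this is not how nkuluoleko implements them: the helper names, the sample data, the inclusive duration threshold, and the random choice of which samples to keep per speaker are all assumptions made for the example. Applying the helpers only to a train list mirrors the idea of restricting filtering to the training split.

```python
import random


def min_duration(rows, seconds):
    """Keep samples at least `seconds` long (inclusive threshold is an assumption)."""
    return [row for row in rows if row["duration"] >= seconds]


def limit_per_speaker(rows, max_samples, seed=0):
    """Keep at most `max_samples` per speaker.

    Which samples are kept is chosen at random here; that is an assumption
    for this sketch, not documented behaviour.
    """
    rng = random.Random(seed)
    by_speaker = {}
    for row in rows:
        by_speaker.setdefault(row["speaker"], []).append(row)
    kept = []
    for samples in by_speaker.values():
        if len(samples) > max_samples:
            samples = rng.sample(samples, max_samples)
        kept.extend(samples)
    return kept


# Made-up training metadata; a test split would simply not be passed through
train = [
    {"file": "a.wav", "speaker": "s1", "duration": 4.0},
    {"file": "b.wav", "speaker": "s1", "duration": 2.0},
    {"file": "c.wav", "speaker": "s1", "duration": 5.5},
    {"file": "d.wav", "speaker": "s1", "duration": 6.0},
    {"file": "e.wav", "speaker": "s2", "duration": 3.5},
]

long_enough = min_duration(train, 3.5)
capped = limit_per_speaker(long_enough, 2)
print(len(long_enough), len(capped))
```

Filtering by duration first and capping afterwards means the per-speaker budget is spent only on samples that are long enough; the reverse order could leave a speaker with fewer usable samples than the cap allows.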