If you just want to see how your data is distributed with respect to the target in nkululeko, you can create a value_counts plot with the explore module.
In your config, you would specify it like this:
[EXPL]
# all samples, or only test or train split?
sample_selection = all
# activate the plot
value_counts = [['age'], ['gender'], ['duration'], ['duration', 'age']]
Then run it with the explore module:
python -m nkululeko.explore --config myconfig.ini
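Before starting the run, it can help to check that the value_counts entry is well formed. The following is a minimal sketch using only the Python standard library, not part of nkululeko. It assumes that the value is written as a Python literal (a list of lists of property names), as in the example above. The function name parse_value_counts is hypothetical, and the fragment only contains the [EXPL] section, whereas a real config file needs further sections.

```python
import ast
import configparser

# The [EXPL] fragment from above; a real config needs further sections as well.
EXPL_FRAGMENT = """
[EXPL]
# all samples, or only test or train split?
sample_selection = all
# activate the plot
value_counts = [['age'], ['gender'], ['duration'], ['duration', 'age']]
"""


def parse_value_counts(config_text):
    """Return the value_counts entry as a list of lists of property names."""
    config = configparser.ConfigParser()
    config.read_string(config_text)
    # Assumption: the value is a Python literal (list of lists of strings).
    plots = ast.literal_eval(config["EXPL"]["value_counts"])
    for plot in plots:
        is_name_list = (
            isinstance(plot, list)
            and len(plot) > 0
            and all(isinstance(name, str) for name in plot)
        )
        if not is_name_list:
            raise ValueError(f"expected a list of property names, got {plot!r}")
    return plots


if __name__ == "__main__":
    # Print one line per requested plot, e.g. the properties it combines.
    for plot in parse_value_counts(EXPL_FRAGMENT):
        print(" x ".join(plot))
```

A typo such as a missing bracket then shows up as an error from ast.literal_eval right away, instead of somewhere inside the explore run.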
The results, for a data set with target=depression, look similar to this for all samples:
and this for the speakers (if there is a speaker annotation):
If you prefer a kernel density estimation over a histogram, you can select this with:
[EXPL]
dist_type = kde
which, for duration, would result in:
Nkululeko distinguishes between categorical and continuous properties; this would be the output for gender:
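For intuition about what the two settings compute, here is a minimal standard-library sketch of a histogram and a Gaussian kernel density estimate. It is not how nkululeko draws its plots. The duration values are made up, and the bin count and bandwidth are arbitrary choices for illustration.

```python
import math

# Made-up utterance durations in seconds, for illustration only.
durations = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 3.0, 3.3, 4.8, 5.1]


def histogram(values, n_bins, low, high):
    """Count values per equal-width bin; values must lie in [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that v == high lands in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density as a mean of Gaussian bumps."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


if __name__ == "__main__":
    # Histogram: discrete counts per one-second bin.
    print(histogram(durations, n_bins=5, low=1.0, high=6.0))
    # KDE: a smooth curve that can be evaluated at any duration.
    density = gaussian_kde(durations, bandwidth=0.5)
    for x in (1.0, 2.0, 3.0, 4.0, 5.0):
        print(f"{x:.1f} s -> {density(x):.3f}")
```

The histogram depends on where the bin edges fall, while the kernel density estimate depends on the bandwidth; a smaller bandwidth follows the individual samples more closely.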
You can show the distribution of two sample properties at once, by using a scatter plot:
In addition, this module will automatically plot the distribution of samples per speaker, per gender (if annotated):
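What is being counted for this plot can be sketched in a few lines of Python. This is only an illustration of the idea, not nkululeko's code. The records and the function name samples_per_speaker are hypothetical and do not reflect nkululeko's internal data format.

```python
from collections import Counter

# Hypothetical sample records (speaker id, gender), for illustration only.
samples = [
    ("spk1", "female"), ("spk1", "female"), ("spk1", "female"),
    ("spk2", "male"), ("spk2", "male"),
    ("spk3", "female"),
]


def samples_per_speaker(records):
    """Map each gender to a Counter of sample counts per speaker."""
    per_gender = {}
    for speaker, gender in records:
        per_gender.setdefault(gender, Counter())[speaker] += 1
    return per_gender


if __name__ == "__main__":
    for gender, counts in samples_per_speaker(samples).items():
        # The list of per-speaker counts is what would be shown as a distribution.
        print(gender, sorted(counts.values()))
```

A plot of these per-speaker counts shows at a glance whether a few speakers contribute most of the samples.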
Load your data: Import your dataset into nkululeko. This can be done by reading your data from a file or connecting to a database.
no, database import connectors, e.g. mysql, are not implemented yet.