Haplo Prediction Documentation

Predicts a sample's haplogroup using models trained on Y-STR (Y-chromosome short tandem repeat) data.

Licensed under Creative Commons BY-NC-SA 3.0.

Questions or comments? Contact Joseph Schlecht.


Use 'haplo-train' to train the models.

usage: haplo-train OPTIONS [data-fname | <stdin>]
 -h, --help                Prints program usage.
 -v, --version             Prints program version.
     --header-in           The input data contains a header (descriptive) line,
                           which should be discarded.
     --header-out          Write a header (descriptive) line to the first line
                           of the output results.
     --options=ARG         File containing program options. Any options
                           appearing on the command line following this option
                           take precedence over those in the options file.
     --seed=ARG            Random seed.
     --input-format=ARG    Input file format. Must be one of {txt, csv, xml}. If
                           the input is XML, it must conform to the XML DTD
                           haplo-input.dtd.
     --input-dtd=ARG       If the input format is XML, validate it with this
                           DTD.
     --output-format=ARG   Output file format. Must be one of {txt, csv, xml}.
     --labels=ARG          XML file containing the organization and listing of
                           possible haplogroup labels for the samples. Must
                           conform to the XML DTD haplo-labels.dtd.
     --labels-dtd=ARG      Validate the XML labels file with this DTD.
     --id-cols=ARG         Comma-separated ordered list of columns to use for
                           sample identification. Prefixes the output of each
                           sample. Count begins with 1 at the first column of
                           the file. Set to zero to ignore the id column.
     --label-col=ARG       Column containing the haplogroup labels. Count
                           begins with 1 at the first column of the file. Set
                           to zero to ignore the label column.
     --1st-marker-col=ARG  Column containing the first marker. Use in
                           conjunction with num-markers to specify the markers
                           for reading. All other markers are assumed to follow
                           this one. Count begins with 1.
     --num-markers=ARG     Number of markers to read. Use in conjunction with
                           1st-marker-col to specify the markers for reading.
     --marker-cols=ARG     Comma-separated ordered list of marker columns to
                           use for training. Use instead of 1st-marker-col and
                           num-markers. Count begins with 1 at the first column
                           of the file.
     --model-dir=ARG       Directory to put trained models in.
     --data-out-dir=ARG    Directory to put the generated training data in for
                           each model. The name of the model is used, so if
                           this directory is set as the same as the model-dir,
                           the models could be overwritten. The default is not
                           to output the data.
     --nb-freq=ARG         Naive Bayes non-parametric frequency model tree
                           information.
     --nb-freq-dtd=ARG     Validate the naive Bayes non-parametric frequency
                           model tree information XML file with this DTD.
     --nb-gauss=ARG        Naive Bayes Gaussian model tree information.
     --nb-gauss-dtd=ARG    Validate the naive Bayes Gaussian model tree
                           information XML file with this DTD.
     --nb-gmm=ARG          Naive Bayes Gaussian mixture model tree information.
     --nb-gmm-dtd=ARG      Validate the naive Bayes Gaussian mixture model tree
                           information XML file with this DTD.
     --mv-gmm=ARG          Multivariate Gaussian mixture model tree information.
     --mv-gmm-dtd=ARG      Validate the multivariate Gaussian mixture model tree
                           information XML file with this DTD.
     --mv-mmm=ARG          Multivariate multinomial mixture model tree
                           information.
     --mv-mmm-dtd=ARG      Validate the multivariate multinomial mixture model
                           tree information XML file with this DTD.
     --svm=ARG             SVM model tree information.
     --svm-dtd=ARG         Validate the SVM model tree information XML file with
                           this DTD.
     --weka-j48=ARG        Weka J48 model tree information.
     --weka-part=ARG       Weka PART model tree information.
     --weka-jar=ARG        Weka Java archive file. Required for using the Weka
                           algorithms.
     --weka-dtd=ARG        Validate the Weka model tree information XML files
                           with this DTD.
     --nearest=ARG         Nearest neighbor model information.
     --nearest-dtd=ARG     Validate the nearest neighbor model information XML
                           file with this DTD. 
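
The column options above all count from 1. As a rough illustration of that convention, the Python sketch below shows how --1st-marker-col with --num-markers and an explicit --marker-cols list could select the same fields. It is not the program's own code: the helper names marker_columns and select, and the sample row, are made up for this example.

```python
def marker_columns(first_marker_col=None, num_markers=None, marker_cols=None):
    """Return the 1-based marker columns implied by the options.

    Hypothetical helper: it mirrors the documented meaning of
    --1st-marker-col/--num-markers and --marker-cols, nothing more.
    """
    if marker_cols is not None:
        # --marker-cols: comma-separated, ordered, 1-based column list.
        return [int(c) for c in marker_cols.split(",")]
    # --1st-marker-col + --num-markers: a contiguous run of columns.
    return list(range(first_marker_col, first_marker_col + num_markers))


def select(row, columns):
    """Pick fields from a row using 1-based column numbers."""
    return [row[c - 1] for c in columns]


# Made-up sample row: id, haplogroup label, then four marker values.
row = ["sample-01", "R1b", "13", "24", "14", "11"]

by_range = marker_columns(first_marker_col=3, num_markers=4)
by_list = marker_columns(marker_cols="3,4,5,6")

print(by_range)               # [3, 4, 5, 6]
print(select(row, by_range))  # ['13', '24', '14', '11']
print(by_range == by_list)    # True
```

With this layout, --id-cols=1 and --label-col=2 would refer to the first two fields of the row.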


Use 'haplo-predict' to predict haplogroups with the trained models.

usage: haplo-predict OPTIONS [data-fname | <stdin>]
 -h, --help                Prints program usage.
 -v, --version             Prints program version.
     --header-in           The input data contains a header (descriptive) line,
                           which should be discarded.
     --header-out          Write a header (descriptive) line to the first line
                           of the output results.
     --exclude-one         When performing the tandem prediction decision,
                           exclude at most one prediction from the set of
                           classification algorithms. There must be three or
                           more algorithms in play for this to take effect.
     --options=ARG         File containing program options. Any options
                           appearing on the command line following this option
                           take precedence over those in the options file.
     --seed=ARG            Random seed.
     --input-format=ARG    Input file format. Must be one of {txt, csv, xml}. If
                           the input is XML, it must conform to the XML DTD
                           haplo-input.dtd.
     --input-dtd=ARG       If the input format is XML, validate it with this
                           DTD.
     --output-format=ARG   Output file format. Must be one of {txt, csv, xml}.
     --labels=ARG          XML file containing the organization and listing of
                           possible haplogroup labels for the samples. Must
                           conform to the XML DTD haplo-labels.dtd.
     --labels-dtd=ARG      Validate the XML labels file with this DTD.
     --id-cols=ARG         Comma-separated ordered list of columns to use for
                           sample identification. Prefixes the output of each
                           sample. Count begins with 1 at the first column of
                           the file. Set to zero to ignore the id column.
     --label-col=ARG       Column containing the haplogroup labels. Count
                           begins with 1 at the first column of the file. Set
                           to zero to ignore the label column.
     --1st-marker-col=ARG  Column containing the first marker. Use in
                           conjunction with num-markers to specify the markers
                           for reading. All other markers are assumed to follow
                           this one. Count begins with 1.
     --num-markers=ARG     Number of markers to read. Use in conjunction with
                           1st-marker-col to specify the markers for reading.
     --marker-cols=ARG     Comma-separated ordered list of marker columns to
                           use for prediction. Use instead of 1st-marker-col
                           and num-markers. Count begins with 1 at the first
                           column of the file.
     --output=ARG          File to output the predictions to. The default is
                           stdout.
     --model-dir=ARG       Directory containing the trained models.
     --nb-freq=ARG         Naive Bayes non-parametric frequency model tree
                           information.
     --nb-freq-dtd=ARG     Validate the naive Bayes non-parametric frequency
                           model tree information XML file with this DTD.
     --nb-gauss=ARG        Naive Bayes Gaussian model tree information.
     --nb-gauss-dtd=ARG    Validate the naive Bayes Gaussian model tree
                           information XML file with this DTD.
     --nb-gmm=ARG          Naive Bayes Gaussian mixture model tree information.
     --nb-gmm-dtd=ARG      Validate the naive Bayes Gaussian mixture model tree
                           information XML file with this DTD.
     --mv-gmm=ARG          Multivariate Gaussian mixture model tree information.
     --mv-gmm-dtd=ARG      Validate the multivariate Gaussian mixture model tree
                           information XML file with this DTD.
     --svm=ARG             SVM model tree information.
     --svm-dtd=ARG         Validate the SVM model tree information XML file with
                           this DTD.
     --weka-j48=ARG        Weka J48 model tree information.
     --weka-part=ARG       Weka PART model tree information.
     --weka-jar=ARG        Weka Java archive file. Required for using the Weka
                           algorithms.
     --weka-dtd=ARG        Validate the Weka model tree information XML files
                           with this DTD.
     --nearest=ARG         Nearest neighbor model information.
     --nearest-dtd=ARG     Validate the nearest neighbor model information XML
                           file with this DTD.
     --nearest-max-d=ARG   Maximum distance allowed for a nearest neighbor
                           classification.
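
Putting the two programs together, a session would typically train the models first and then predict with them. The Python sketch below only assembles and prints the two command lines; it does not run anything. The flags are the ones listed above, but every file and directory name (train.csv, unknown.csv, labels.xml, nb-freq.xml, models, predictions.csv) and both column layouts are placeholders assumed for illustration. Which options are actually required is not stated here, so treat the selection as a guess.

```python
import shlex

# Model-related options passed to both programs (file names are placeholders).
models = [
    "--labels=labels.xml",
    "--model-dir=models",
    "--nb-freq=nb-freq.xml",
]

# Assumed training layout: column 1 = sample id, column 2 = haplogroup
# label, columns 3..14 = twelve marker values.
train_cmd = [
    "haplo-train",
    "--header-in",
    "--input-format=csv",
    "--id-cols=1",
    "--label-col=2",
    "--1st-marker-col=3",
    "--num-markers=12",
    *models,
    "train.csv",
]

# Assumed prediction layout: column 1 = sample id, no label column
# (--label-col=0 ignores it), columns 2..13 = the same twelve markers.
predict_cmd = [
    "haplo-predict",
    "--header-in",
    "--header-out",
    "--input-format=csv",
    "--output-format=csv",
    "--id-cols=1",
    "--label-col=0",
    "--1st-marker-col=2",
    "--num-markers=12",
    *models,
    "--output=predictions.csv",
    "unknown.csv",
]

for cmd in (train_cmd, predict_cmd):
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Keeping the commands as argument lists means they can be handed to subprocess.run unchanged once the programs and input files are in place.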