In a nutshell: fit on trainingData (train a model), then transform testData (predict with the model)
- Transformer: DataFrame => DataFrame
- Estimator: DataFrame => Transformer
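The two contracts above can be sketched without Spark. This is a plain-Python illustration only: a "DataFrame" is assumed to be a list of dicts, and `MeanCenterer` / `MeanCenterModel` are hypothetical names, not Spark classes.

```python
# Plain-Python stand-ins for the two contracts; not Spark's classes.
# A "DataFrame" here is just a list of dicts.

class MeanCenterModel:
    """Transformer: DataFrame => DataFrame."""

    def __init__(self, col, mean):
        self.col = col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an added column; the input rows are not modified.
        return [dict(r, centered=r[self.col] - self.mean) for r in rows]


class MeanCenterer:
    """Estimator: DataFrame => Transformer."""

    def __init__(self, col):
        self.col = col

    def fit(self, rows):
        # Learn a parameter (the column mean) and hand back a Transformer.
        mean = sum(r[self.col] for r in rows) / len(rows)
        return MeanCenterModel(self.col, mean)


training_data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
test_data = [{"x": 5.0}]

model = MeanCenterer("x").fit(training_data)  # fit trainingData
predictions = model.transform(test_data)      # transform testData
print(predictions)
```

The learned state (here, the mean) lives in the object returned by `fit`, which is why the same model can be reused on any later data.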
#Transformers:
- Tokenizer: sentence => words
- RegexTokenizer: sentence => words - setPattern
- HashingTF: terms => feature vectors based on frequency - setNumFeatures
- StopWordsRemover: filter - setStopWords
- NGram: tokens => sequence of n-grams (n consecutive tokens) - setN
- Binarizer: number => 0/1 threshold - setThreshold
- PCA: statistical dimensionality reduction; projects features onto the top k principal components (uncorrelated directions of greatest variance) - setK (Estimator: fit first)
- PolynomialExpansion: feature set dimensionality expansion (~ Taylor Series) - setDegree
- DCT: time series => frequencies (via cosine waves) - setInverse
- StringIndexer: strings => label indices ordered by frequency (most frequent => 0) (Estimator: fit first)
- IndexToString: inverse of StringIndexer (indices => original strings)
- OneHotEncoder: category index => one-hot vector
- VectorIndexer: automatically detects and indexes categorical features in the feature vector - setMaxCategories (Estimator: fit first)
- Normalizer: scales each feature vector to unit p-norm - setP
- StandardScaler: features to z-scores - setWithStd, setWithMean (Estimator: fit first)
- MinMaxScaler: rescales each feature to a range, [0, 1] by default - setMin, setMax (Estimator: fit first)
- Bucketizer: continuous to discrete - setSplits
- ElementwiseProduct: apply weights to vector features - setScalingVec
- SQLTransformer: SQL over featureset ! - setStatement
- VectorAssembler: combine multi-columns into a single vector column
- QuantileDiscretizer: continuous to discrete, with splits chosen from quantiles - setNumBuckets (Estimator: fit first)
- VectorSlicer: select subset of featureset - setIndices, setNames
- RFormula: specify labelled point dependent / independent variables - setFormula("y ~ x1 + x2"), setFeaturesCol, setLabelCol (Estimator: fit first)
- ChiSqSelector: select features with most predictive power - setNumTopFeatures, setFeaturesCol, setLabelCol (Estimator: fit first)
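A few of the simpler transformers above, sketched as plain-Python functions to show what each one computes per row. These are not Spark calls: the function names are made up for illustration, `crc32` stands in for Spark's own hash function, and the `\W+` default pattern is an assumption.

```python
import re
import zlib
from bisect import bisect_right


def tokenize(sentence):
    # Tokenizer: lowercase, then split on whitespace.
    return sentence.lower().split()


def regex_tokenize(sentence, pattern=r"\W+"):
    # RegexTokenizer: here the pattern matches the gaps between tokens.
    return [t for t in re.split(pattern, sentence.lower()) if t]


def hashing_tf(terms, num_features=16):
    # HashingTF: bucket = hash(term) mod numFeatures, value = term count.
    # crc32 is a stand-in; distinct terms may collide in the same bucket.
    vec = [0] * num_features
    for t in terms:
        vec[zlib.crc32(t.encode("utf-8")) % num_features] += 1
    return vec


def ngrams(tokens, n=2):
    # NGram: every run of n consecutive tokens, joined by a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def binarize(x, threshold=0.0):
    # Binarizer: 1.0 if strictly above the threshold, else 0.0.
    return 1.0 if x > threshold else 0.0


def bucketize(x, splits):
    # Bucketizer: index i of the bucket [splits[i], splits[i+1]) holding x;
    # the last bucket also includes its upper bound.
    return min(bisect_right(splits, x) - 1, len(splits) - 2)


tokens = tokenize("Hi I heard about Spark")
print(tokens, ngrams(tokens), hashing_tf(tokens))
print(binarize(0.6, 0.5), bucketize(5.0, [float("-inf"), 0.0, 10.0, float("inf")]))
```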
#Estimators:
- IDF: down-weights terms that appear in many documents - setMinDocFreq
- Word2Vec: words => dense vectors; document => average of its word vectors - setVectorSize, setMinCount
- CountVectorizer: document => token count - setVocabSize, setMinDF
- LogisticRegression - setMaxIter, setRegParam, setElasticNetParam, setTol, setFitIntercept
- DecisionTreeClassifier
- RandomForestClassifier - setNumTrees
- GBTClassifier - setMaxIter
- MultilayerPerceptronClassifier - setLayers, setBlockSize, setSeed, setMaxIter
- OneVsRest - setClassifier
- DecisionTreeRegressor
- RandomForestRegressor
- GBTRegressor
- AFTSurvivalRegression - setQuantileProbabilities, setQuantilesCol
- KMeans - setK
- LDA - setK, setMaxIter
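As a concrete example of the fit-then-transform split for an estimator, here is a rough plain-Python sketch of IDF. It assumes the smoothed formula idf = log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term; `fit_idf` and `transform_tfidf` are hypothetical helper names, not Spark methods.

```python
import math


def fit_idf(docs):
    # docs: term-frequency vectors of equal length (e.g. HashingTF output).
    m = len(docs)
    n = len(docs[0])
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for d in docs if d[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df_j + 1)) for df_j in df]


def transform_tfidf(doc, idf):
    # Scale each term frequency by its learned IDF weight.
    return [tf * w for tf, w in zip(doc, idf)]


docs = [[1, 2, 0], [1, 0, 0], [1, 0, 3]]
idf = fit_idf(docs)                      # "fit": learn weights from the corpus
tfidf = transform_tfidf(docs[0], idf)    # "transform": apply them to a document
print(idf, tfidf)
```

The first term appears in every document, so its weight is log(4/4) = 0 and it drops out of the TF-IDF vector.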
#Models:
- CountVectorizerModel
- LogisticRegressionModel - coefficients, intercept, setThreshold, summary
- DecisionTreeClassificationModel
- RandomForestClassificationModel
- GBTClassificationModel
- DecisionTreeRegressionModel
- RandomForestRegressionModel
- GBTRegressionModel
- LDAModel - logLikelihood, logPerplexity
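How the LogisticRegressionModel fields fit together for a binary prediction, as a minimal plain-Python sketch. The `predict` helper is made up for illustration; it assumes the usual sigmoid of the linear margin, with the positive class chosen when the probability exceeds the threshold.

```python
import math


def predict(features, coefficients, intercept, threshold=0.5):
    # margin = w . x + b
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    # Sigmoid turns the margin into a probability of the positive class.
    probability = 1.0 / (1.0 + math.exp(-margin))
    label = 1.0 if probability > threshold else 0.0
    return probability, label


prob, label = predict([2.0], coefficients=[1.0], intercept=0.0)
print(prob, label)
```

Raising the threshold trades recall for precision: the same row with `threshold=0.9` would be labelled 0.0.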
#Evaluators:
- BinaryLogisticRegressionSummary - fMeasureByThreshold, areaUnderROC, roc
- BinaryClassificationEvaluator - default metric name: "areaUnderROC"
- MulticlassClassificationEvaluator - default metric name: "f1"
- MulticlassMetrics - confusionMatrix, falsePositiveRate
- RegressionEvaluator - default metric name: "rmse"
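For reference, two of the metrics named above written out in plain Python. This is a sketch of the definitions, not of Spark's implementation; the confusion matrix is assumed to have actual classes as rows and predicted classes as columns, with integer class labels starting at 0.

```python
import math


def rmse(labels, predictions):
    # Root mean squared error: sqrt(mean((y - y_hat)^2)).
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def confusion_matrix(labels, predictions, num_classes):
    # cm[actual][predicted] counts how often each pair occurs.
    cm = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, predictions):
        cm[y][p] += 1
    return cm


print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```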