Evaluation Metrics - RDD-based API
spark.mllib comes with a number of machine learning algorithms that can be used to learn from and make predictions on data. When these algorithms are applied to build machine learning models, there is a need to evaluate the performance of the model on some criteria, which depends on the application and its requirements. spark.mllib also provides a suite of metrics for the purpose of evaluating the performance of machine learning models.
Specific machine learning algorithms fall under broader types of machine learning applications like classification, regression, clustering, etc. Each of these types has well-established metrics for performance evaluation, and the metrics that are currently available in spark.mllib are detailed in this section.
Classification model evaluation
While there are many different types of classification algorithms, the evaluation of classification models all share similar principles. In a supervised classification problem, there exists a true output and a model-generated predicted output for each data point. For this reason, the results for each data point can be assigned to one of four categories:
- True Positive (TP) - label is positive and prediction is also positive
- True Negative (TN) - label is negative and prediction is also negative
- False Positive (FP) - label is negative but prediction is positive
- False Negative (FN) - label is positive but prediction is negative
These four numbers are the building blocks for most classifier evaluation metrics. A fundamental point when considering classifier evaluation is that pure accuracy (i.e. was the prediction correct or incorrect) is generally not a good metric, because a dataset may be highly unbalanced. For example, if a model is designed to predict fraud from a dataset where 95% of the data points are not fraud and 5% of the data points are fraud, then a naive classifier that predicts not fraud, regardless of input, will be 95% accurate. For this reason, metrics like precision and recall are typically used because they take into account the type of error. In most applications there is some desired balance between precision and recall, which can be captured by combining the two into a single metric, called the F-measure.
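To make these definitions concrete, here is a minimal sketch in plain Python (independent of Spark; the four counts are invented for illustration) that computes accuracy, precision, recall, and the F-measure from the four categories above:
tp, tn, fp, fn = 90, 9000, 910, 10  # hypothetical counts for an unbalanced dataset
accuracy = (tp + tn) / (tp + tn + fp + fn)  # ~0.91, inflated by the majority class
precision = tp / (tp + fp)                  # PPV, ~0.09
recall = tp / (tp + fn)                     # TPR, 0.9
beta = 1.0
f_measure = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
print(accuracy, precision, recall, f_measure)
Despite roughly 91% accuracy, precision is only about 0.09, which is exactly why accuracy alone misleads on unbalanced data.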
Binary classification
Binary classifiers are used to separate the elements of a given dataset into one of two possible groups (e.g. fraud or not fraud) and are a special case of multiclass classification. Most binary classification metrics can be generalized to multiclass classification metrics.
Threshold tuning
It is important to understand that many classification models actually output a "score" (often a probability) for each class, where a higher score indicates higher likelihood. In the binary case, the model may output a probability for each class: $P(Y=1|X)$ and $P(Y=0|X)$. Instead of simply taking the higher probability, there may be cases where the model needs to be tuned so that it only predicts a class when the probability is very high (e.g. only block a credit card transaction if the model predicts fraud with >90% probability). Therefore, there is a prediction threshold which determines what the predicted class will be, based on the probabilities that the model outputs.
Tuning the prediction threshold will change the precision and recall of the model and is an important part of model optimization. In order to visualize how precision, recall, and other metrics change as a function of the threshold, it is common practice to plot competing metrics against one another, parameterized by threshold. A P-R curve plots (precision, recall) points for different threshold values, while a receiver operating characteristic, or ROC, curve plots (recall, false positive rate) points.
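As a small illustration of threshold tuning, the following sketch (plain Python with invented scores and labels, not part of the spark.mllib API) sweeps a threshold over predicted probabilities and prints the (precision, recall) points that a P-R curve would plot:
scored = [(0.95, 1), (0.85, 1), (0.80, 1), (0.60, 1), (0.40, 0), (0.30, 1), (0.10, 0)]  # (probability, label)
for threshold in (0.25, 0.5, 0.75):
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    print("threshold=%.2f precision=%.2f recall=%.2f"
          % (threshold, tp / (tp + fp), tp / (tp + fn)))
On this toy data, raising the threshold increases precision while lowering recall, which is the trade-off the threshold parameterizes.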
Available metrics
Metric | Definition |
---|---|
Precision (Positive Predictive Value) | $PPV=\frac{TP}{TP + FP}$ |
Recall (True Positive Rate) | $TPR=\frac{TP}{P}=\frac{TP}{TP + FN}$ |
F-measure | $F(\beta) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV \cdot TPR} {\beta^2 \cdot PPV + TPR}\right)$ |
Receiver Operating Characteristic (ROC) | $FPR(T)=\int^\infty_{T} P_0(T)\,dT \\ TPR(T)=\int^\infty_{T} P_1(T)\,dT$ |
Area Under ROC Curve | $AUROC=\int^1_{0} \frac{TP}{P} d\left(\frac{FP}{N}\right)$ |
Area Under Precision-Recall Curve | $AUPRC=\int^1_{0} \frac{TP}{TP+FP} d\left(\frac{TP}{P}\right)$ |
Examples
Refer to the BinaryClassificationMetrics Python docs and the LogisticRegressionWithLBFGS Python docs for more details on the API.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.mllib.util import MLUtils
# Several of the methods available in scala are currently missing from pyspark
# Load training data in LIBSVM format
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
# Split data into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=11)
training.cache()
# Run training algorithm to build the model
model = LogisticRegressionWithLBFGS.train(training)
# Clear the prediction threshold so the model will return probabilities
# (mirrors the Scala/Java examples; without this, predict() returns 0/1 labels
# and the ROC/PR curves degenerate to a single point)
model.clearThreshold()
# Compute raw scores on the test set
predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))
# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)
# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)
# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)
Refer to the LogisticRegressionWithLBFGS Scala docs and the BinaryClassificationMetrics Scala docs for details on the API.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")
// Split data into training (60%) and test (40%)
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(2)
.run(training)
// Clear the prediction threshold so the model will return probabilities
model.clearThreshold
// Compute raw scores on the test set
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
}
// Instantiate metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
// Precision by threshold
val precision = metrics.precisionByThreshold
precision.collect.foreach { case (t, p) =>
println(s"Threshold: $t, Precision: $p")
}
// Recall by threshold
val recall = metrics.recallByThreshold
recall.collect.foreach { case (t, r) =>
println(s"Threshold: $t, Recall: $r")
}
// Precision-Recall Curve
val PRC = metrics.pr
// F-measure
val f1Score = metrics.fMeasureByThreshold
f1Score.collect.foreach { case (t, f) =>
println(s"Threshold: $t, F-score: $f, Beta = 1")
}
val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.collect.foreach { case (t, f) =>
println(s"Threshold: $t, F-score: $f, Beta = 0.5")
}
// AUPRC
val auPRC = metrics.areaUnderPR
println(s"Area under precision-recall curve = $auPRC")
// Compute thresholds used in ROC and PR curves
val thresholds = precision.map(_._1)
// ROC Curve
val roc = metrics.roc
// AUROC
val auROC = metrics.areaUnderROC
println(s"Area under ROC = $auROC")
Refer to the LogisticRegressionModel Java docs and the LogisticRegressionWithLBFGS Java docs for details on the API.
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
String path = "data/mllib/sample_binary_classification_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
// Split initial RDD into two... [60% training data, 40% testing data].
JavaRDD<LabeledPoint>[] splits =
data.randomSplit(new double[]{0.6, 0.4}, 11L);
JavaRDD<LabeledPoint> training = splits[0].cache();
JavaRDD<LabeledPoint> test = splits[1];
// Run training algorithm to build the model.
LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
.setNumClasses(2)
.run(training.rdd());
// Clear the prediction threshold so the model will return probabilities
model.clearThreshold();
// Compute raw scores on the test set.
JavaPairRDD<Object, Object> predictionAndLabels = test.mapToPair(p ->
new Tuple2<>(model.predict(p.features()), p.label()));
// Get evaluation metrics.
BinaryClassificationMetrics metrics =
new BinaryClassificationMetrics(predictionAndLabels.rdd());
// Precision by threshold
JavaRDD<Tuple2<Object, Object>> precision = metrics.precisionByThreshold().toJavaRDD();
System.out.println("Precision by threshold: " + precision.collect());
// Recall by threshold
JavaRDD<?> recall = metrics.recallByThreshold().toJavaRDD();
System.out.println("Recall by threshold: " + recall.collect());
// F Score by threshold
JavaRDD<?> f1Score = metrics.fMeasureByThreshold().toJavaRDD();
System.out.println("F1 Score by threshold: " + f1Score.collect());
JavaRDD<?> f2Score = metrics.fMeasureByThreshold(2.0).toJavaRDD();
System.out.println("F2 Score by threshold: " + f2Score.collect());
// Precision-recall curve
JavaRDD<?> prc = metrics.pr().toJavaRDD();
System.out.println("Precision-recall curve: " + prc.collect());
// Thresholds
JavaRDD<Double> thresholds = precision.map(t -> Double.parseDouble(t._1().toString()));
// ROC Curve
JavaRDD<?> roc = metrics.roc().toJavaRDD();
System.out.println("ROC curve: " + roc.collect());
// AUPRC
System.out.println("Area under precision-recall curve = " + metrics.areaUnderPR());
// AUROC
System.out.println("Area under ROC = " + metrics.areaUnderROC());
// Save and load model
model.save(sc, "target/tmp/LogisticRegressionModel");
LogisticRegressionModel.load(sc, "target/tmp/LogisticRegressionModel");
Multiclass classification
A multiclass classification describes a classification problem where there are $M \gt 2$ possible labels for each data point (the case where $M=2$ is the binary classification problem). For example, classifying handwriting samples into the digits 0 to 9 has 10 possible classes.
For multiclass metrics, the notion of positives and negatives is slightly different. Predictions and labels can still be positive or negative, but they must be considered in the context of a particular class. Each label and prediction take on the value of one of the multiple classes and so they are said to be positive for their particular class and negative for all other classes. So, a true positive occurs whenever the prediction and the label match, while a true negative occurs when neither the prediction nor the label take on the value of a given class. By this convention, there can be multiple true negatives for a given data sample. The extension of false negatives and false positives from the former definitions of positive and negative labels is straightforward.
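The following minimal Python sketch (the predictions and labels are invented for illustration) tallies the per-class counts implied by this convention from a list of (prediction, label) pairs:
pairs = [(0, 0), (1, 1), (2, 1), (0, 2), (1, 1), (2, 2)]  # hypothetical (prediction, label) pairs
for c in (0, 1, 2):
    tp = sum(1 for pred, lab in pairs if pred == c and lab == c)
    fp = sum(1 for pred, lab in pairs if pred == c and lab != c)
    fn = sum(1 for pred, lab in pairs if pred != c and lab == c)
    tn = len(pairs) - tp - fp - fn  # neither prediction nor label is class c
    print("class %d: TP=%d FP=%d FN=%d TN=%d" % (c, tp, fp, fn, tn))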
Label based metrics
Opposed to binary classification, where there are only two possible labels, multiclass classification problems have many possible labels, and so the concept of label-based metrics is introduced. Accuracy measures precision across all labels - the number of times any class was predicted correctly (true positives) normalized by the number of data points. Precision by label considers only one class, and measures the number of times a specific label was predicted correctly normalized by the number of times that label appears in the output.
Available metrics
Define the class, or label, set as
\[L = \{\ell_0, \ell_1, \ldots, \ell_{M-1} \}\]The true output vector $\mathbf{y}$ consists of $N$ elements
\[\mathbf{y}_0, \mathbf{y}_1, \ldots, \mathbf{y}_{N-1} \in L\]A multiclass prediction algorithm generates a prediction vector $\hat{\mathbf{y}}$ of $N$ elements
\[\hat{\mathbf{y}}_0, \hat{\mathbf{y}}_1, \ldots, \hat{\mathbf{y}}_{N-1} \in L\]For this section, a modified delta function $\hat{\delta}(x)$ will prove useful
\[\hat{\delta}(x) = \begin{cases}1 & \text{if $x = 0$}, \\ 0 & \text{otherwise}.\end{cases}\]Metric | Definition |
---|---|
Confusion Matrix | $C_{ij} = \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_i) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_j)\\ \\ \left( \begin{array}{ccc} \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots & \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_0) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1}) \\ \vdots & \ddots & \vdots \\ \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_0) & \ldots & \sum_{k=0}^{N-1} \hat{\delta}(\mathbf{y}_k-\ell_{M-1}) \cdot \hat{\delta}(\hat{\mathbf{y}}_k - \ell_{M-1}) \end{array} \right)$ |
Accuracy | $ACC = \frac{TP}{TP + FP} = \frac{1}{N}\sum_{i=0}^{N-1} \hat{\delta}\left(\hat{\mathbf{y}}_i - \mathbf{y}_i\right)$ |
Precision by label | $PPV(\ell) = \frac{TP}{TP + FP} = \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)} {\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell)}$ |
Recall by label | $TPR(\ell)=\frac{TP}{P} = \frac{\sum_{i=0}^{N-1} \hat{\delta}(\hat{\mathbf{y}}_i - \ell) \cdot \hat{\delta}(\mathbf{y}_i - \ell)} {\sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i - \ell)}$ |
F-measure by label | $F(\beta, \ell) = \left(1 + \beta^2\right) \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)} {\beta^2 \cdot PPV(\ell) + TPR(\ell)}\right)$ |
Weighted precision | $PPV_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} PPV(\ell) \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$ |
Weighted recall | $TPR_{w}= \frac{1}{N} \sum\nolimits_{\ell \in L} TPR(\ell) \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$ |
Weighted F-measure | $F_{w}(\beta)= \frac{1}{N} \sum\nolimits_{\ell \in L} F(\beta, \ell) \cdot \sum_{i=0}^{N-1} \hat{\delta}(\mathbf{y}_i-\ell)$ |
Examples
Refer to the MulticlassMetrics Python docs for more details on the API.
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.util import MLUtils
from pyspark.mllib.evaluation import MulticlassMetrics
# Load training data in LIBSVM format
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
# Split data into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=11)
training.cache()
# Run training algorithm to build the model
model = LogisticRegressionWithLBFGS.train(training, numClasses=3)
# Compute raw scores on the test set
predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp.label))
# Instantiate metrics object
metrics = MulticlassMetrics(predictionAndLabels)
# Statistics for label 1.0 (precision, recall, and F-measure are per-label in multiclass)
precision = metrics.precision(1.0)
recall = metrics.recall(1.0)
f1Score = metrics.fMeasure(1.0)
print("Summary Stats")
print("Precision = %s" % precision)
print("Recall = %s" % recall)
print("F1 Score = %s" % f1Score)
# Statistics by class
labels = data.map(lambda lp: lp.label).distinct().collect()
for label in sorted(labels):
print("Class %s precision = %s" % (label, metrics.precision(label)))
print("Class %s recall = %s" % (label, metrics.recall(label)))
print("Class %s F1 Measure = %s" % (label, metrics.fMeasure(label, beta=1.0)))
# Weighted stats
print("Weighted recall = %s" % metrics.weightedRecall)
print("Weighted precision = %s" % metrics.weightedPrecision)
print("Weighted F(1) Score = %s" % metrics.weightedFMeasure())
print("Weighted F(0.5) Score = %s" % metrics.weightedFMeasure(beta=0.5))
print("Weighted false positive rate = %s" % metrics.weightedFalsePositiveRate)
Refer to the MulticlassMetrics Scala docs for details on the API.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
// Load training data in LIBSVM format
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt")
// Split data into training (60%) and test (40%)
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(3)
.run(training)
// Compute raw scores on the test set
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
}
// Instantiate metrics object
val metrics = new MulticlassMetrics(predictionAndLabels)
// Confusion matrix
println("Confusion matrix:")
println(metrics.confusionMatrix)
// Overall Statistics
val accuracy = metrics.accuracy
println("Summary Statistics")
println(s"Accuracy = $accuracy")
// Precision by label
val labels = metrics.labels
labels.foreach { l =>
println(s"Precision($l) = " + metrics.precision(l))
}
// Recall by label
labels.foreach { l =>
println(s"Recall($l) = " + metrics.recall(l))
}
// False positive rate by label
labels.foreach { l =>
println(s"FPR($l) = " + metrics.falsePositiveRate(l))
}
// F-measure by label
labels.foreach { l =>
println(s"F1-Score($l) = " + metrics.fMeasure(l))
}
// Weighted stats
println(s"Weighted precision: ${metrics.weightedPrecision}")
println(s"Weighted recall: ${metrics.weightedRecall}")
println(s"Weighted F1 score: ${metrics.weightedFMeasure}")
println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}")
Refer to the MulticlassMetrics Java docs for details on the API.
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.classification.LogisticRegressionModel;
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS;
import org.apache.spark.mllib.evaluation.MulticlassMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.mllib.linalg.Matrix;
String path = "data/mllib/sample_multiclass_classification_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, path).toJavaRDD();
// Split initial RDD into two... [60% training data, 40% testing data].
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.6, 0.4}, 11L);
JavaRDD<LabeledPoint> training = splits[0].cache();
JavaRDD<LabeledPoint> test = splits[1];
// Run training algorithm to build the model.
LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
.setNumClasses(3)
.run(training.rdd());
// Compute raw scores on the test set.
JavaPairRDD<Object, Object> predictionAndLabels = test.mapToPair(p ->
new Tuple2<>(model.predict(p.features()), p.label()));
// Get evaluation metrics.
MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd());
// Confusion matrix
Matrix confusion = metrics.confusionMatrix();
System.out.println("Confusion matrix: \n" + confusion);
// Overall statistics
System.out.println("Accuracy = " + metrics.accuracy());
// Stats by labels
for (int i = 0; i < metrics.labels().length; i++) {
System.out.format("Class %f precision = %f\n", metrics.labels()[i],metrics.precision(
metrics.labels()[i]));
System.out.format("Class %f recall = %f\n", metrics.labels()[i], metrics.recall(
metrics.labels()[i]));
System.out.format("Class %f F1 score = %f\n", metrics.labels()[i], metrics.fMeasure(
metrics.labels()[i]));
}
//Weighted stats
System.out.format("Weighted precision = %f\n", metrics.weightedPrecision());
System.out.format("Weighted recall = %f\n", metrics.weightedRecall());
System.out.format("Weighted F1 score = %f\n", metrics.weightedFMeasure());
System.out.format("Weighted false positive rate = %f\n", metrics.weightedFalsePositiveRate());
// Save and load model
model.save(sc, "target/tmp/LogisticRegressionModel");
LogisticRegressionModel sameModel = LogisticRegressionModel.load(sc,
"target/tmp/LogisticRegressionModel");
Multilabel classification
A multilabel classification problem involves mapping each sample in a dataset to a set of class labels. In this type of classification problem, the labels are not mutually exclusive. For example, when classifying a set of news articles into topics, a single article might be both science and politics.
Because the labels are not mutually exclusive, the predictions and true labels are now vectors of label sets, rather than vectors of labels. Multilabel metrics, therefore, extend the fundamental ideas of precision, recall, etc. to operations on sets. For example, a true positive for a given class now occurs when that class exists in the predicted set and it exists in the true label set, for a specific data point.
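Before the formal definitions below, here is a minimal Python sketch (the prediction and label sets are invented for illustration; it mirrors the set operations rather than the MultilabelMetrics API) of the document-averaged precision, recall, and accuracy:
predicted = [{0.0, 1.0}, {0.0, 2.0}, set()]  # hypothetical prediction sets P_i
actual = [{0.0, 2.0}, {0.0, 1.0}, {0.0}]     # hypothetical label sets L_i
n = len(predicted)
precision = sum(len(p & a) / len(p) for p, a in zip(predicted, actual) if p) / n  # empty P_i contributes 0 here
recall = sum(len(p & a) / len(a) for p, a in zip(predicted, actual)) / n
accuracy = sum(len(p & a) / len(p | a) for p, a in zip(predicted, actual)) / n
print(precision, recall, accuracy)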
Available metrics
Here we define a set $D$ of $N$ documents
\[D = \left\{d_0, d_1, ..., d_{N-1}\right\}\]Define $L_0, L_1, ..., L_{N-1}$ to be a family of label sets and $P_0, P_1, ..., P_{N-1}$ to be a family of prediction sets, where $L_i$ and $P_i$ are the label set and prediction set, respectively, that correspond to document $d_i$.
The set of all unique labels is given by
\[L = \bigcup_{k=0}^{N-1} L_k\]The following definition of the indicator function $I_A(x)$ on a set $A$ will be necessary
\[I_A(x) = \begin{cases}1 & \text{if $x \in A$}, \\ 0 & \text{otherwise}.\end{cases}\]Metric | Definition |
---|---|
Precision | $\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|P_i \cap L_i\right|}{\left|P_i\right|}$ |
Recall | $\frac{1}{N} \sum_{i=0}^{N-1} \frac{\left|L_i \cap P_i\right|}{\left|L_i\right|}$ |
Accuracy | $\frac{1}{N} \sum_{i=0}^{N - 1} \frac{\left|L_i \cap P_i \right|} {\left|L_i\right| + \left|P_i\right| - \left|L_i \cap P_i \right|}$ |
Precision by label | $PPV(\ell)=\frac{TP}{TP + FP}= \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)} {\sum_{i=0}^{N-1} I_{P_i}(\ell)}$ |
Recall by label | $TPR(\ell)=\frac{TP}{P}= \frac{\sum_{i=0}^{N-1} I_{P_i}(\ell) \cdot I_{L_i}(\ell)} {\sum_{i=0}^{N-1} I_{L_i}(\ell)}$ |
F1-measure by label | $F1(\ell) = 2 \cdot \left(\frac{PPV(\ell) \cdot TPR(\ell)} {PPV(\ell) + TPR(\ell)}\right)$ |
Hamming Loss | $\frac{1}{N \cdot \left|L\right|} \sum_{i=0}^{N - 1} \left|L_i\right| + \left|P_i\right| - 2\left|L_i \cap P_i\right|$ |
Subset Accuracy | $\frac{1}{N} \sum_{i=0}^{N-1} I_{\{L_i\}}(P_i)$ |
F1 Measure | $\frac{1}{N} \sum_{i=0}^{N-1} 2 \frac{\left|P_i \cap L_i\right|}{\left|P_i\right| + \left|L_i\right|}$ |
Micro precision | $\frac{TP}{TP + FP}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|} {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$ |
Micro recall | $\frac{TP}{TP + FN}=\frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|} {\sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right|}$ |
Micro F1 Measure | $2 \cdot \frac{TP}{2 \cdot TP + FP + FN}=2 \cdot \frac{\sum_{i=0}^{N-1} \left|P_i \cap L_i\right|}{2 \cdot \sum_{i=0}^{N-1} \left|P_i \cap L_i\right| + \sum_{i=0}^{N-1} \left|L_i - P_i\right| + \sum_{i=0}^{N-1} \left|P_i - L_i\right|}$ |
Examples
The following code snippets illustrate how to evaluate the performance of a multilabel classifier. The examples use the fake prediction and label data for multilabel classification shown below.
Document predictions:
- Doc 0 - predict 0, 1 - class 0, 2
- Doc 1 - predict 0, 2 - class 0, 1
- Doc 2 - predict none - class 0
- Doc 3 - predict 2 - class 2
- Doc 4 - predict 2, 0 - class 2, 0
- Doc 5 - predict 0, 1, 2 - class 0, 1
- Doc 6 - predict 1 - class 1, 2
Predicted classes:
- Class 0 - docs 0, 1, 4, 5 (total 4)
- Class 1 - docs 0, 5, 6 (total 3)
- Class 2 - docs 1, 3, 4, 5 (total 4)
True classes:
- Class 0 - docs 0, 1, 2, 4, 5 (total 5)
- Class 1 - docs 1, 5, 6 (total 3)
- Class 2 - docs 0, 3, 4, 6 (total 4)
Refer to the MultilabelMetrics Python docs for more details on the API.
from pyspark.mllib.evaluation import MultilabelMetrics
scoreAndLabels = sc.parallelize([
([0.0, 1.0], [0.0, 2.0]),
([0.0, 2.0], [0.0, 1.0]),
([], [0.0]),
([2.0], [2.0]),
([2.0, 0.0], [2.0, 0.0]),
([0.0, 1.0, 2.0], [0.0, 1.0]),
([1.0], [1.0, 2.0])])
# Instantiate metrics object
metrics = MultilabelMetrics(scoreAndLabels)
# Summary stats
print("Recall = %s" % metrics.recall())
print("Precision = %s" % metrics.precision())
print("F1 measure = %s" % metrics.f1Measure())
print("Accuracy = %s" % metrics.accuracy)
# Individual label stats
labels = scoreAndLabels.flatMap(lambda x: x[1]).distinct().collect()
for label in labels:
print("Class %s precision = %s" % (label, metrics.precision(label)))
print("Class %s recall = %s" % (label, metrics.recall(label)))
print("Class %s F1 Measure = %s" % (label, metrics.f1Measure(label)))
# Micro stats
print("Micro precision = %s" % metrics.microPrecision)
print("Micro recall = %s" % metrics.microRecall)
print("Micro F1 measure = %s" % metrics.microF1Measure)
# Hamming loss
print("Hamming loss = %s" % metrics.hammingLoss)
# Subset accuracy
print("Subset accuracy = %s" % metrics.subsetAccuracy)
Refer to the MultilabelMetrics Scala docs for details on the API.
import org.apache.spark.mllib.evaluation.MultilabelMetrics
import org.apache.spark.rdd.RDD
val scoreAndLabels: RDD[(Array[Double], Array[Double])] = sc.parallelize(
Seq((Array(0.0, 1.0), Array(0.0, 2.0)),
(Array(0.0, 2.0), Array(0.0, 1.0)),
(Array.empty[Double], Array(0.0)),
(Array(2.0), Array(2.0)),
(Array(2.0, 0.0), Array(2.0, 0.0)),
(Array(0.0, 1.0, 2.0), Array(0.0, 1.0)),
(Array(1.0), Array(1.0, 2.0))), 2)
// Instantiate metrics object
val metrics = new MultilabelMetrics(scoreAndLabels)
// Summary stats
println(s"Recall = ${metrics.recall}")
println(s"Precision = ${metrics.precision}")
println(s"F1 measure = ${metrics.f1Measure}")
println(s"Accuracy = ${metrics.accuracy}")
// Individual label stats
metrics.labels.foreach(label =>
println(s"Class $label precision = ${metrics.precision(label)}"))
metrics.labels.foreach(label => println(s"Class $label recall = ${metrics.recall(label)}"))
metrics.labels.foreach(label => println(s"Class $label F1-score = ${metrics.f1Measure(label)}"))
// Micro stats
println(s"Micro recall = ${metrics.microRecall}")
println(s"Micro precision = ${metrics.microPrecision}")
println(s"Micro F1 measure = ${metrics.microF1Measure}")
// Hamming loss
println(s"Hamming loss = ${metrics.hammingLoss}")
// Subset accuracy
println(s"Subset accuracy = ${metrics.subsetAccuracy}")
Refer to the MultilabelMetrics Java docs for details on the API.
import java.util.Arrays;
import java.util.List;
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.evaluation.MultilabelMetrics;
import org.apache.spark.SparkConf;
List<Tuple2<double[], double[]>> data = Arrays.asList(
new Tuple2<>(new double[]{0.0, 1.0}, new double[]{0.0, 2.0}),
new Tuple2<>(new double[]{0.0, 2.0}, new double[]{0.0, 1.0}),
new Tuple2<>(new double[]{}, new double[]{0.0}),
new Tuple2<>(new double[]{2.0}, new double[]{2.0}),
new Tuple2<>(new double[]{2.0, 0.0}, new double[]{2.0, 0.0}),
new Tuple2<>(new double[]{0.0, 1.0, 2.0}, new double[]{0.0, 1.0}),
new Tuple2<>(new double[]{1.0}, new double[]{1.0, 2.0})
);
JavaRDD<Tuple2<double[], double[]>> scoreAndLabels = sc.parallelize(data);
// Instantiate metrics object
MultilabelMetrics metrics = new MultilabelMetrics(scoreAndLabels.rdd());
// Summary stats
System.out.format("Recall = %f\n", metrics.recall());
System.out.format("Precision = %f\n", metrics.precision());
System.out.format("F1 measure = %f\n", metrics.f1Measure());
System.out.format("Accuracy = %f\n", metrics.accuracy());
// Stats by labels
for (int i = 0; i < metrics.labels().length - 1; i++) {
System.out.format("Class %1.1f precision = %f\n", metrics.labels()[i], metrics.precision(
metrics.labels()[i]));
System.out.format("Class %1.1f recall = %f\n", metrics.labels()[i], metrics.recall(
metrics.labels()[i]));
System.out.format("Class %1.1f F1 score = %f\n", metrics.labels()[i], metrics.f1Measure(
metrics.labels()[i]));
}
// Micro stats
System.out.format("Micro recall = %f\n", metrics.microRecall());
System.out.format("Micro precision = %f\n", metrics.microPrecision());
System.out.format("Micro F1 measure = %f\n", metrics.microF1Measure());
// Hamming loss
System.out.format("Hamming loss = %f\n", metrics.hammingLoss());
// Subset accuracy
System.out.format("Subset accuracy = %f\n", metrics.subsetAccuracy());
Ranking systems
The role of a ranking algorithm (often thought of as a recommender system) is to return to the user a set of relevant items or documents based on some training data. The definition of relevance may vary and is usually application specific. Ranking system metrics aim to quantify the effectiveness of these rankings or recommendations in various contexts. Some metrics compare a set of recommended documents to a ground truth set of relevant documents, while other metrics may incorporate numerical ratings explicitly.
Available metrics
A ranking system usually deals with a set of $M$ users
\[U = \left\{u_0, u_1, ..., u_{M-1}\right\}\]Each user ($u_i$) having a set of $N_i$ ground truth relevant documents
\[D_i = \left\{d_0, d_1, ..., d_{N_i-1}\right\}\]And a list of $Q_i$ recommended documents, in order of decreasing relevance
\[R_i = \left[r_0, r_1, ..., r_{Q_i-1}\right]\]The goal of the ranking system is to produce the most relevant set of documents for each user. The relevance of the sets and the effectiveness of the algorithms can be measured using the metrics listed below.
It is necessary to define a function which, provided a recommended document and a set of ground truth relevant documents, returns a relevance score for the recommended document.
\[rel_D(r) = \begin{cases}1 & \text{if $r \in D$}, \\ 0 & \text{otherwise}.\end{cases}\]Metric | Definition | Notes |
---|---|---|
Precision at k | $p(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{k} \sum_{j=0}^{\text{min}(Q_i, k) - 1} rel_{D_i}(R_i(j))}$ | Precision at k is a measure of how many of the first k recommended documents are in the set of true relevant documents, averaged across all users. In this metric, the order of the recommendations is not taken into account. |
Mean Average Precision | $MAP=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{N_i} \sum_{j=0}^{Q_i-1} \frac{rel_{D_i}(R_i(j))}{j + 1}}$ | MAP is a measure of how many of the recommended documents are in the set of true relevant documents, where the order of the recommendations is taken into account (i.e. the penalty for missing highly relevant documents is higher). |
Normalized Discounted Cumulative Gain | $NDCG(k)=\frac{1}{M} \sum_{i=0}^{M-1} {\frac{1}{IDCG(D_i, k)}\sum_{j=0}^{n-1} \frac{rel_{D_i}(R_i(j))}{\text{log}(j+2)}} \\ \text{where} \\ \hspace{5 mm} n = \text{min}\left(\text{max}\left(Q_i, N_i\right),k\right) \\ \hspace{5 mm} IDCG(D, k) = \sum_{j=0}^{\text{min}(\left|D\right|, k) - 1} \frac{1}{\text{log}(j+2)}$ | NDCG at k is a measure of how many of the first k recommended documents are in the set of true relevant documents, averaged across all users. In contrast to precision at k, this metric takes into account the order of the recommendations (documents are assumed to be in order of decreasing relevance). |
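To ground these formulas, the following sketch (plain Python; the recommendation list and relevant set are invented, and the log base cancels in the DCG/IDCG ratio) computes precision at k and NDCG at k for a single user by hand:
import math
recommended = [7, 3, 9, 5]  # R_i, in order of decreasing relevance (hypothetical)
relevant = {3, 5, 6}        # D_i, ground truth relevant documents (hypothetical)
k = 3
rel = [1 if r in relevant else 0 for r in recommended[:k]]
precision_at_k = sum(rel) / k  # 1/3: order is ignored
dcg = sum(r / math.log(j + 2) for j, r in enumerate(rel))
idcg = sum(1 / math.log(j + 2) for j in range(min(len(relevant), k)))
print(precision_at_k, dcg / idcg)  # NDCG discounts the single hit, which sits at rank 2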
Examples
The following code snippets illustrate how to load a sample dataset, train an alternating least squares recommendation model on the data, and evaluate the performance of the recommender by several ranking metrics. A brief summary of the methodology is provided below.
MovieLens ratings are on a scale of 1-5:
- 5: Must see
- 4: Will enjoy
- 3: It's okay
- 2: Fairly bad
- 1: Awful
So we should not recommend a movie if the predicted rating is less than 3. To map ratings to confidence scores, we use:
- 5 -> 2.5
- 4 -> 1.5
- 3 -> 0.5
- 2 -> -0.5
- 1 -> -1.5.
This mapping means unobserved entries are generally between "It's okay" and "Fairly bad". The semantics of 0 in this expanded world of non-positive weights are "the same as never having interacted at all".
Refer to the RegressionMetrics Python docs and the RankingMetrics Python docs for more details on the API.
from pyspark.mllib.recommendation import ALS, Rating
from pyspark.mllib.evaluation import RegressionMetrics
# Read in the ratings data
lines = sc.textFile("data/mllib/sample_movielens_data.txt")
def parseLine(line):
fields = line.split("::")
return Rating(int(fields[0]), int(fields[1]), float(fields[2]) - 2.5)
ratings = lines.map(lambda r: parseLine(r))
# Train a model on to predict user-product ratings
model = ALS.train(ratings, 10, 10, 0.01)
# Get predicted ratings on all existing user-product pairs
testData = ratings.map(lambda p: (p.user, p.product))
predictions = model.predictAll(testData).map(lambda r: ((r.user, r.product), r.rating))
ratingsTuple = ratings.map(lambda r: ((r.user, r.product), r.rating))
scoreAndLabels = predictions.join(ratingsTuple).map(lambda tup: tup[1])
# Instantiate regression metrics to compare predicted and actual ratings
metrics = RegressionMetrics(scoreAndLabels)
# Root mean squared error
print("RMSE = %s" % metrics.rootMeanSquaredError)
# R-squared
print("R-squared = %s" % metrics.r2)
Refer to the RegressionMetrics Scala docs and the RankingMetrics Scala docs for details on the API.
import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
// Read in the ratings data
val ratings = spark.read.textFile("data/mllib/sample_movielens_data.txt").rdd.map { line =>
val fields = line.split("::")
Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble - 2.5)
}.cache()
// Map ratings to 1 or 0, 1 indicating a movie that should be recommended
val binarizedRatings = ratings.map(r => Rating(r.user, r.product,
if (r.rating > 0) 1.0 else 0.0)).cache()
// Summarize ratings
val numRatings = ratings.count()
val numUsers = ratings.map(_.user).distinct().count()
val numMovies = ratings.map(_.product).distinct().count()
println(s"Got $numRatings ratings from $numUsers users on $numMovies movies.")
// Build the model
val numIterations = 10
val rank = 10
val lambda = 0.01
val model = ALS.train(ratings, rank, numIterations, lambda)
// Define a function to scale ratings from 0 to 1
def scaledRating(r: Rating): Rating = {
val scaledRating = math.max(math.min(r.rating, 1.0), 0.0)
Rating(r.user, r.product, scaledRating)
}
// Get sorted top ten predictions for each user and then scale from [0, 1]
val userRecommended = model.recommendProductsForUsers(10).map { case (user, recs) =>
(user, recs.map(scaledRating))
}
// Assume that any movie a user rated 3 or higher (which maps to a 1) is a relevant document
// Compare with top ten most relevant documents
val userMovies = binarizedRatings.groupBy(_.user)
val relevantDocuments = userMovies.join(userRecommended).map { case (user, (actual,
predictions)) =>
(predictions.map(_.product), actual.filter(_.rating > 0.0).map(_.product).toArray)
}
// Instantiate metrics object
val metrics = new RankingMetrics(relevantDocuments)
// Precision at K
Array(1, 3, 5).foreach { k =>
println(s"Precision at $k = ${metrics.precisionAt(k)}")
}
// Mean average precision
println(s"Mean average precision = ${metrics.meanAveragePrecision}")
// Mean average precision at k
println(s"Mean average precision at 2 = ${metrics.meanAveragePrecisionAt(2)}")
// Normalized discounted cumulative gain
Array(1, 3, 5).foreach { k =>
println(s"NDCG at $k = ${metrics.ndcgAt(k)}")
}
// Recall at K
Array(1, 3, 5).foreach { k =>
println(s"Recall at $k = ${metrics.recallAt(k)}")
}
// Get predictions for each data point
val allPredictions = model.predict(ratings.map(r => (r.user, r.product))).map(r => ((r.user,
r.product), r.rating))
val allRatings = ratings.map(r => ((r.user, r.product), r.rating))
val predictionsAndLabels = allPredictions.join(allRatings).map { case ((user, product),
(predicted, actual)) =>
(predicted, actual)
}
// Get the RMSE using regression metrics
val regressionMetrics = new RegressionMetrics(predictionsAndLabels)
println(s"RMSE = ${regressionMetrics.rootMeanSquaredError}")
// R-squared
println(s"R-squared = ${regressionMetrics.r2}")
Refer to the RegressionMetrics Java docs and the RankingMetrics Java docs for details on the API.
import java.util.*;
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.mllib.evaluation.RegressionMetrics;
import org.apache.spark.mllib.evaluation.RankingMetrics;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
String path = "data/mllib/sample_movielens_data.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Rating> ratings = data.map(line -> {
String[] parts = line.split("::");
return new Rating(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), Double
.parseDouble(parts[2]) - 2.5);
});
ratings.cache();
// Train an ALS model
MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);
// Get top 10 recommendations for every user and scale ratings from 0 to 1
JavaRDD<Tuple2<Object, Rating[]>> userRecs = model.recommendProductsForUsers(10).toJavaRDD();
JavaRDD<Tuple2<Object, Rating[]>> userRecsScaled = userRecs.map(t -> {
Rating[] scaledRatings = new Rating[t._2().length];
for (int i = 0; i < scaledRatings.length; i++) {
double newRating = Math.max(Math.min(t._2()[i].rating(), 1.0), 0.0);
scaledRatings[i] = new Rating(t._2()[i].user(), t._2()[i].product(), newRating);
}
return new Tuple2<>(t._1(), scaledRatings);
});
JavaPairRDD<Object, Rating[]> userRecommended = JavaPairRDD.fromJavaRDD(userRecsScaled);
// Map ratings to 1 or 0, 1 indicating a movie that should be recommended
JavaRDD<Rating> binarizedRatings = ratings.map(r -> {
double binaryRating;
if (r.rating() > 0.0) {
binaryRating = 1.0;
} else {
binaryRating = 0.0;
}
return new Rating(r.user(), r.product(), binaryRating);
});
// Group ratings by common user
JavaPairRDD<Object, Iterable<Rating>> userMovies = binarizedRatings.groupBy(Rating::user);
// Get true relevant documents from all user ratings
JavaPairRDD<Object, List<Integer>> userMoviesList = userMovies.mapValues(docs -> {
List<Integer> products = new ArrayList<>();
for (Rating r : docs) {
if (r.rating() > 0.0) {
products.add(r.product());
}
}
return products;
});
// Extract the product id from each recommendation
JavaPairRDD<Object, List<Integer>> userRecommendedList = userRecommended.mapValues(docs -> {
List<Integer> products = new ArrayList<>();
for (Rating r : docs) {
products.add(r.product());
}
return products;
});
JavaRDD<Tuple2<List<Integer>, List<Integer>>> relevantDocs = userMoviesList.join(
userRecommendedList).values();
// Instantiate the metrics object
RankingMetrics<Integer> metrics = RankingMetrics.of(relevantDocs);
// Precision, NDCG and Recall at k
Integer[] kVector = {1, 3, 5};
for (Integer k : kVector) {
System.out.format("Precision at %d = %f\n", k, metrics.precisionAt(k));
System.out.format("NDCG at %d = %f\n", k, metrics.ndcgAt(k));
System.out.format("Recall at %d = %f\n", k, metrics.recallAt(k));
}
// Mean average precision
System.out.format("Mean average precision = %f\n", metrics.meanAveragePrecision());
//Mean average precision at k
System.out.format("Mean average precision at 2 = %f\n", metrics.meanAveragePrecisionAt(2));
// Evaluate the model using numerical ratings and regression metrics
JavaRDD<Tuple2<Object, Object>> userProducts =
ratings.map(r -> new Tuple2<>(r.user(), r.product()));
JavaPairRDD<Tuple2<Integer, Integer>, Object> predictions = JavaPairRDD.fromJavaRDD(
model.predict(JavaRDD.toRDD(userProducts)).toJavaRDD().map(r ->
new Tuple2<>(new Tuple2<>(r.user(), r.product()), r.rating())));
JavaRDD<Tuple2<Object, Object>> ratesAndPreds =
JavaPairRDD.fromJavaRDD(ratings.map(r ->
new Tuple2<Tuple2<Integer, Integer>, Object>(
new Tuple2<>(r.user(), r.product()),
r.rating())
)).join(predictions).values();
// Create regression metrics object
RegressionMetrics regressionMetrics = new RegressionMetrics(ratesAndPreds.rdd());
// Root mean squared error
System.out.format("RMSE = %f\n", regressionMetrics.rootMeanSquaredError());
// R-squared
System.out.format("R-squared = %f\n", regressionMetrics.r2());
Regression model evaluation
Regression analysis is used when predicting a continuous output variable from a number of independent variables.
Available metrics
Metric | Definition |
---|---|
Mean Squared Error (MSE) | $MSE = \frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}$ |
Root Mean Squared Error (RMSE) | $RMSE = \sqrt{\frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{N}}$ |
Mean Absolute Error (MAE) | $MAE=\frac{1}{N}\sum_{i=0}^{N-1} \left|\mathbf{y}_i - \hat{\mathbf{y}}_i\right|$ |
Coefficient of Determination $(R^2)$ | $R^2=1 - \frac{MSE}{\text{VAR}(\mathbf{y}) \cdot (N-1)}=1-\frac{\sum_{i=0}^{N-1} (\mathbf{y}_i - \hat{\mathbf{y}}_i)^2}{\sum_{i=0}^{N-1}(\mathbf{y}_i-\bar{\mathbf{y}})^2}$ |
Explained Variance | $1 - \frac{\text{VAR}(\mathbf{y} - \mathbf{\hat{y}})}{\text{VAR}(\mathbf{y})}$ |
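Examples
The following is a minimal sketch of computing these metrics with spark.mllib in Python. It assumes, like the earlier examples, a SparkContext named sc and a LIBSVM-format regression dataset; the dataset path and the training hyperparameters are illustrative assumptions.
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.evaluation import RegressionMetrics
from pyspark.mllib.util import MLUtils
# Load regression data in LIBSVM format (path is an assumption)
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_linear_regression_data.txt")
# Train a simple linear model; the step size here is illustrative only
model = LinearRegressionWithSGD.train(data, iterations=100, step=1e-8)
# Pair each prediction with its observed value
valuesAndPreds = data.map(lambda p: (float(model.predict(p.features)), p.label))
# Instantiate metrics object
metrics = RegressionMetrics(valuesAndPreds)
print("MSE = %s" % metrics.meanSquaredError)
print("RMSE = %s" % metrics.rootMeanSquaredError)
print("MAE = %s" % metrics.meanAbsoluteError)
print("R-squared = %s" % metrics.r2)
print("Explained variance = %s" % metrics.explainedVariance)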