Ensembles - RDD-based API

An ensemble method is a learning algorithm which creates a model composed of a set of other base models. spark.mllib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models.

Gradient-Boosted Trees vs. Random Forests

Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:

- GBTs train one tree at a time, so they can take longer to train than random forests, which can train multiple trees in parallel. On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
- Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases it. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
- Random Forests can be easier to tune since performance improves monotonically with the number of trees, whereas performance for GBTs can start to decrease if the number of trees grows too large.

In short, both algorithms can be effective, and the choice should be based on the particular dataset.

Random Forests

Random forests are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. spark.mllib implements random forests using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

Basic algorithm

Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.

Training

The randomness injected into the training process includes:

- Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
- Considering different random subsets of features to split on at each tree node.

Apart from these randomizations, decision tree training is done in the same way as for individual decision trees; a rough sketch of the two randomizations follows.
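As a conceptual illustration only (this is not the spark.mllib internals; the RDD name data, the seed, and the feature count are assumptions for the example), the two randomizations could look like this in Scala:

import scala.util.Random

// Bootstrapping: sample the training set with replacement, once per tree.
// `data` is assumed to be an RDD[LabeledPoint], as in the examples below.
val bootstrapSample = data.sample(withReplacement = true, fraction = 1.0, seed = 42)

// Feature subsetting: at each node, only a random subset of features is
// considered as split candidates.
val numFeatures = 10                                  // hypothetical total feature count
val subsetSize = math.sqrt(numFeatures).ceil.toInt    // e.g. the "sqrt" strategy
val candidateFeatures = Random.shuffle((0 until numFeatures).toList).take(subsetSize)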

Prediction

In order to make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.

Classification: Majority vote. Each tree's prediction is counted as a vote for one class. The label is predicted to be the class which receives the most votes.

Regression: Averaging. Each tree predicts a real value. The label is predicted to be the average of the tree predictions.
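A minimal sketch of the two aggregation rules, assuming made-up per-tree outputs for a single new instance:

// Hypothetical per-tree outputs, for illustration only.
val classVotes = Seq(1.0, 0.0, 1.0)  // classification: each tree votes for a class
val treeValues = Seq(2.5, 3.0, 2.0)  // regression: each tree predicts a real value

// Classification: majority vote -- the class with the most votes wins.
val predictedClass = classVotes.groupBy(identity).maxBy(_._2.size)._1

// Regression: averaging -- the mean of the tree predictions.
val predictedValue = treeValues.sum / treeValues.size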

Usage tips

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

The first two parameters we mention are the most important, and tuning them can often improve performance:

- numTrees: Number of trees in the forest. Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy. Training time increases roughly linearly in the number of trees.
- maxDepth: Maximum depth of each tree in the forest. Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting. In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree: one tree is more likely to overfit than a random forest, because of the variance reduction from averaging multiple trees.

The next two parameters generally do not require tuning. However, they can be tuned to speed up training; a short sketch follows the list.

- subsamplingRate: Specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
- featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node, specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
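A hedged sketch of setting these two speed-oriented parameters; it assumes the Strategy-based trainClassifier overload and an RDD[LabeledPoint] named trainingData (carried over from the examples below), with illustrative values:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.configuration.Strategy

// Start from a default classification strategy and trade accuracy for speed.
val strategy = Strategy.defaultStrategy("Classification")
strategy.numClasses = 2
strategy.subsamplingRate = 0.5 // train each tree on half of the data

val model = RandomForest.trainClassifier(
  trainingData,                   // assumed RDD[LabeledPoint]
  strategy,
  numTrees = 10,
  featureSubsetStrategy = "sqrt", // fewer candidate features per split
  seed = 12345)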

Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using a random forest. The test error is computed to measure the algorithm's accuracy.

Refer to the RandomForest Python docs and the RandomForestModel Python docs for more details on the API.

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
Find the full example code at "examples/src/main/python/mllib/random_forest_classification_example.py" in the Spark repo.

Refer to the RandomForest Scala docs and the RandomForestModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")
Find the full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala" in the Spark repo.

Refer to the RandomForest Java docs and the RandomForestModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestClassificationExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
int numClasses = 2;
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numTrees = 3; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose.
String impurity = "gini";
int maxDepth = 5;
int maxBins = 32;
int seed = 12345;

RandomForestModel model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
  seed);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testErr =
  predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / (double) testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification forest model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myRandomForestClassificationModel");
RandomForestModel sameModel = RandomForestModel.load(jsc.sc(),
  "target/tmp/myRandomForestClassificationModel");
Find the full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestClassificationExample.java" in the Spark repo.

Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using a random forest. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

Refer to the RandomForest Python docs and the RandomForestModel Python docs for more details on the API.

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda lp: (lp[0] - lp[1]) * (lp[0] - lp[1])).sum() /\
    float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")
Find the full example code at "examples/src/main/python/mllib/random_forest_regression_example.py" in the Spark repo.

Refer to the RandomForest Scala docs and the RandomForestModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")
Find the full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala" in the Spark repo.

Refer to the RandomForest Java docs and the RandomForestModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;

SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestRegressionExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Set parameters.
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numTrees = 3; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose.
String impurity = "variance";
int maxDepth = 4;
int maxBins = 32;
int seed = 12345;
// Train a RandomForest model.
RandomForestModel model = RandomForest.trainRegressor(trainingData,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testMSE = predictionAndLabel.mapToDouble(pl -> {
  double diff = pl._1() - pl._2();
  return diff * diff;
}).mean();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression forest model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myRandomForestRegressionModel");
RandomForestModel sameModel = RandomForestModel.load(jsc.sc(),
  "target/tmp/myRandomForestRegressionModel");
Find the full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestRegressionExample.java" in the Spark repo.

Gradient-Boosted Trees (GBTs)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. spark.mllib implements GBTs using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests.

Basic algorithm

Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the decision tree will help correct for previous mistakes.

The specific mechanism for re-labeling instances is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data.
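As a concrete, simplified illustration of the re-labeling idea (not the spark.mllib internals; the arrays are made-up values): for the Squared Error loss, fitting the next tree to the "re-labeled" dataset amounts to fitting it to the residuals of the current ensemble.

// Hypothetical true labels y_i and current ensemble predictions F(x_i).
val labels      = Array(1.0, 3.0, 2.0)
val predictions = Array(0.5, 2.0, 2.5)

// For L2 loss, the "new labels" for the next iteration are the residuals
// y_i - F(x_i): poorly predicted instances get large targets, so the next
// tree focuses on correcting the previous mistakes.
val residuals = labels.zip(predictions).map { case (y, f) => y - f }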

Losses

The table below lists the losses currently supported by GBTs in spark.mllib. Note that each loss is applicable to one of classification or regression, not both.

Notation: $N$ = number of instances. $y_i$ = label of instance $i$. $x_i$ = features of instance $i$. $F(x_i)$ = the model's predicted label for instance $i$.

- Log Loss (classification): $2 \sum_{i=1}^{N} \log(1+\exp(-2 y_i F(x_i)))$. Twice the binomial negative log likelihood.
- Squared Error (regression): $\sum_{i=1}^{N} (y_i - F(x_i))^2$. Also called L2 loss. The default loss for regression tasks.
- Absolute Error (regression): $\sum_{i=1}^{N} \lvert y_i - F(x_i) \rvert$. Also called L1 loss. Can be more robust to outliers than Squared Error.
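A small sketch that evaluates the three formulas directly, assuming made-up labels and predictions (for Log Loss, labels are taken in $\{-1, +1\}$ and $F(x_i)$ is the ensemble margin):

import scala.math.{abs, exp, log, pow}

// Hypothetical labels y_i and model outputs F(x_i), for illustration only.
val y = Array(1.0, -1.0, 1.0)
val F = Array(0.8, -0.3, -0.1)

// Log Loss: 2 * sum_i log(1 + exp(-2 * y_i * F(x_i)))
val logLoss = 2.0 * y.zip(F).map { case (yi, fi) => log(1 + exp(-2 * yi * fi)) }.sum

// Squared Error (L2): sum_i (y_i - F(x_i))^2
val squaredError = y.zip(F).map { case (yi, fi) => pow(yi - fi, 2) }.sum

// Absolute Error (L1): sum_i |y_i - F(x_i)|
val absoluteError = y.zip(F).map { case (yi, fi) => abs(yi - fi) }.sum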

Usage tips

We include a few guidelines for using GBTs by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

- loss: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset.
- numIterations: Sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy. However, test-time accuracy may suffer if it is too large.
- learningRate: This parameter should not need to be tuned. If the algorithm's behavior seems unstable, decreasing this value may improve stability.
- algo: The algorithm or task (classification vs. regression) is set using the tree Strategy parameter.
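These knobs are set on BoostingStrategy, as the examples below show; a minimal sketch with illustrative values:

import org.apache.spark.mllib.tree.configuration.BoostingStrategy

// Start from the regression defaults (SquaredError loss) and adjust.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 50          // number of trees in the ensemble
boostingStrategy.learningRate = 0.1          // usually fine at its default
boostingStrategy.treeStrategy.maxDepth = 5   // a decision tree parameter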

Validation while running

Gradient boosting can overfit when trained with more trees. In order to prevent overfitting, it is useful to validate while training. The method runWithValidation has been provided to make use of this option. It takes a pair of RDDs as arguments: the first one is the training dataset and the second is the validation dataset.

Training is stopped when the improvement in the validation error falls below a certain tolerance (supplied by the validationTol parameter in BoostingStrategy). In practice, the validation error decreases initially and later increases. There might be cases in which the validation error does not change monotonically; the user is advised to set a large enough negative tolerance and examine the validation curve using evaluateEachIteration (which gives the error or loss per iteration) to tune the number of iterations. A sketch of this workflow follows.
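A hedged Scala sketch of validation while training, assuming trainingData and validationData are RDD[LabeledPoint]s (e.g. produced by randomSplit as in the examples below); the numeric values are illustrative:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.LogLoss

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 100
boostingStrategy.validationTol = 0.001 // stop once validation-error improvement drops below this

// Train with early stopping against the held-out validation set.
val model = new GradientBoostedTrees(boostingStrategy)
  .runWithValidation(trainingData, validationData)

// Inspect the validation curve: error/loss after each iteration.
val errorPerIteration = model.evaluateEachIteration(validationData, LogLoss)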

Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using Gradient-Boosted Trees with log loss. The test error is computed to measure the algorithm's accuracy.

Refer to the GradientBoostedTrees Python docs and the GradientBoostedTreesModel Python docs for more details on the API.

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
                                             categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myGradientBoostingClassificationModel")
sameModel = GradientBoostedTreesModel.load(sc,
                                           "target/tmp/myGradientBoostingClassificationModel")
Find the full example code at "examples/src/main/python/mllib/gradient_boosting_classification_example.py" in the Spark repo.

Refer to the GradientBoostedTrees Scala docs and the GradientBoostedTreesModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingClassificationModel")
val sameModel = GradientBoostedTreesModel.load(sc,
  "target/tmp/myGradientBoostingClassificationModel")
Find the full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala" in the Spark repo.

Refer to the GradientBoostedTrees Java docs and the GradientBoostedTreesModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.GradientBoostedTrees;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf()
  .setAppName("JavaGradientBoostedTreesClassificationExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(5);
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);

GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, boostingStrategy);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testErr =
  predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / (double) testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification GBT model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myGradientBoostingClassificationModel");
GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(jsc.sc(),
  "target/tmp/myGradientBoostingClassificationModel");
Find the full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostingClassificationExample.java" in the Spark repo.

Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using Gradient-Boosted Trees with Squared Error as the loss. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

Refer to the GradientBoostedTrees Python docs and the GradientBoostedTreesModel Python docs for more details on the API.

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainRegressor(trainingData,
                                            categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda lp: (lp[0] - lp[1]) * (lp[0] - lp[1])).sum() /\
    float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myGradientBoostingRegressionModel")
sameModel = GradientBoostedTreesModel.load(sc, "target/tmp/myGradientBoostingRegressionModel")
Find the full example code at "examples/src/main/python/mllib/gradient_boosting_regression_example.py" in the Spark repo.

Refer to the GradientBoostedTrees Scala docs and the GradientBoostedTreesModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Regression use SquaredError by default.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingRegressionModel")
val sameModel = GradientBoostedTreesModel.load(sc,
  "target/tmp/myGradientBoostingRegressionModel")
Find the full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala" in the Spark repo.

Refer to the GradientBoostedTrees Java docs and the GradientBoostedTreesModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.GradientBoostedTrees;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf()
  .setAppName("JavaGradientBoostedTreesRegressionExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a GradientBoostedTrees model.
// The defaultParams for Regression use SquaredError by default.
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Regression");
boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
boostingStrategy.getTreeStrategy().setMaxDepth(5);
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);

GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, boostingStrategy);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testMSE = predictionAndLabel.mapToDouble(pl -> {
  double diff = pl._1() - pl._2();
  return diff * diff;
}).mean();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression GBT model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myGradientBoostingRegressionModel");
GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(jsc.sc(),
  "target/tmp/myGradientBoostingRegressionModel");
Find the full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostingRegressionExample.java" in the Spark repo.