GROUP BY 子句
說明
GROUP BY
子句用於根據一組指定的群組表達式對列進行分組,並根據一個或多個指定的聚合函數對列群組計算聚合。Spark 也支援進階聚合,透過 GROUPING SETS
、CUBE
、ROLLUP
子句針對同一個輸入記錄集進行多重聚合。群組表達式與進階聚合可以在 GROUP BY
子句中混合使用,並在 GROUPING SETS
子句中巢狀。在 Mixed/Nested Grouping Analytics
區段中查看更多詳細資料。當 FILTER 子句附加到聚合函數時,只有符合條件的列會傳遞給該函數。
語法
GROUP BY group_expression [ , group_expression [ , ... ] ] [ WITH { ROLLUP | CUBE } ]
GROUP BY { group_expression | { ROLLUP | CUBE | GROUPING SETS } (grouping_set [ , ...]) } [ , ... ]
聚合函數定義為
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
參數
-
group_expression
指定列根據其分組在一起的準則。列的分組是根據群組表達式的結果值執行的。群組表達式可以是欄位名稱,例如
GROUP BY a
,欄位位置,例如GROUP BY 0
,或表達式,例如GROUP BY a + b
。 -
grouping_set
群組集是由括號中的零個或多個逗號分隔的表達式指定的。當群組集只有一個元素時,可以省略括號。例如,
GROUPING SETS ((a), (b))
與GROUPING SETS (a, b)
相同。語法:
{ ( [ expression [ , ... ] ] ) | expression }
-
GROUPING SETS
為 GROUPING SETS 之後指定的每個群組集分組列。例如,
GROUP BY GROUPING SETS ((warehouse), (product))
在語意上等於GROUP BY warehouse
和GROUP BY product
結果的聯集。此子句是UNION ALL
的簡寫,其中UNION ALL
算子的每一段執行GROUPING SETS
子句中指定的每個群組集的聚合。類似地,GROUP BY GROUPING SETS ((warehouse, product), (product), ())
在語意上等於GROUP BY warehouse, product
、GROUP BY product
和全域聚合的結果聯集。注意:為了相容於 Hive,Spark 允許
GROUP BY ... GROUPING SETS (...)
。GROUP BY 表達式通常會被忽略,但如果它包含比 GROUPING SETS 表達式更多的表達式,額外的表達式將會包含在分組表達式中,而值永遠為 null。例如,SELECT a, b, c FROM ... GROUP BY a, b, c GROUPING SETS (a, b)
,欄位c
的輸出永遠為 null。 -
ROLLUP
在單一陳述式中指定多個層級的聚合。此子句用於根據多個分組集計算聚合。
ROLLUP
是GROUPING SETS
的簡寫。例如,GROUP BY warehouse, product WITH ROLLUP
或GROUP BY ROLLUP(warehouse, product)
等同於GROUP BY GROUPING SETS((warehouse, product), (warehouse), ())
。GROUP BY ROLLUP(warehouse, product, (warehouse, location))
等同於GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse), ())
。ROLLUP
規格的 N 個元素會產生 N+1 個GROUPING SETS
。 -
CUBE
CUBE
子句用於根據GROUP BY
子句中指定的分組欄位組合執行聚合。CUBE
是GROUPING SETS
的簡寫。例如,GROUP BY warehouse, product WITH CUBE
或GROUP BY CUBE(warehouse, product)
等同於GROUP BY GROUPING SETS((warehouse, product), (warehouse), (product), ())
。GROUP BY CUBE(warehouse, product, (warehouse, location))
等同於GROUP BY GROUPING SETS((warehouse, product, location), (warehouse, product), (warehouse, location), (product, warehouse, location), (warehouse), (product), (warehouse, product), ())
。CUBE
規格的 N 個元素會產生 2^N 個GROUPING SETS
。 -
混合/巢狀分組分析
GROUP BY 子句可以包含多個
group_expression
和多個CUBE|ROLLUP|GROUPING SETS
。GROUPING SETS
也可以有巢狀的CUBE|ROLLUP|GROUPING SETS
子句,例如GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))
、GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))
。CUBE|ROLLUP
只是GROUPING SETS
的語法糖,請參閱上述部分,了解如何將CUBE|ROLLUP
轉換為GROUPING SETS
。group_expression
在此上下文中可以視為單一組的GROUPING SETS
。對於GROUP BY
子句中的多個GROUPING SETS
,我們會透過對原始GROUPING SETS
進行交叉運算來產生單一的GROUPING SETS
。對於GROUPING SETS
子句中的巢狀GROUPING SETS
,我們只需取得其分組集並將其剝離。例如,GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())
和GROUP BY warehouse, ROLLUP(product), CUBE(location, size)
等同於GROUP BY GROUPING SETS( (warehouse, product, location, size), (warehouse, product, location), (warehouse, product, size), (warehouse, product), (warehouse, location, size), (warehouse, location), (warehouse, size), (warehouse))
。GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))
等同於GROUP BY GROUPING SETS((warehouse), (warehouse, product))
。 -
aggregate_name
指定聚合函數名稱 (MIN、MAX、COUNT、SUM、AVG 等)。
-
DISTINCT
在將重複資料傳遞至聚合函數之前,將其從輸入列中移除。
-
FILTER
篩選輸入列,其中
WHERE
子句中的boolean_expression
評估為 true 的列會傳遞至聚合函數;其他列會被捨棄。
範例
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
INSERT INTO dealer VALUES
(100, 'Fremont', 'Honda Civic', 10),
(100, 'Fremont', 'Honda Accord', 15),
(100, 'Fremont', 'Honda CRV', 7),
(200, 'Dublin', 'Honda Civic', 20),
(200, 'Dublin', 'Honda Accord', 10),
(200, 'Dublin', 'Honda CRV', 3),
(300, 'San Jose', 'Honda Civic', 5),
(300, 'San Jose', 'Honda Accord', 8);
-- Sum of quantity per dealership. Group by `id`.
SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- Use column position in GROUP by clause.
SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 32|
|200| 33|
|300| 13|
+---+-------------+
-- Multiple aggregations.
-- 1. Sum of quantity per dealership.
-- 2. Max quantity per dealership.
SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id ORDER BY id;
+---+---+---+
| id|sum|max|
+---+---+---+
|100| 32| 15|
|200| 33| 20|
|300| 13| 8|
+---+---+---+
-- Count the number of distinct dealer cities per car_model.
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+------------+-----+
| car_model|count|
+------------+-----+
| Honda Civic| 3|
| Honda CRV| 2|
|Honda Accord| 3|
+------------+-----+
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
SELECT id, sum(quantity) FILTER (
WHERE car_model IN ('Honda Civic', 'Honda CRV')
) AS `sum(quantity)` FROM dealer
GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 17|
|200| 23|
|300| 5|
+---+-------------+
-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model
-- 2. city
-- 3. car_model
-- 4. Empty grouping set. Returns quantities for all city and car models.
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
ORDER BY city;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- Group by processing with `ROLLUP` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH ROLLUP
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
-- Group by processing with `CUBE` clause.
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
SELECT city, car_model, sum(quantity) AS sum FROM dealer
GROUP BY city, car_model WITH CUBE
ORDER BY city, car_model;
+---------+------------+---+
| city| car_model|sum|
+---------+------------+---+
| null| null| 78|
| null| HondaAccord| 33|
| null| HondaCRV| 10|
| null| HondaCivic| 35|
| Dublin| null| 33|
| Dublin| HondaAccord| 10|
| Dublin| HondaCRV| 3|
| Dublin| HondaCivic| 20|
| Fremont| null| 32|
| Fremont| HondaAccord| 15|
| Fremont| HondaCRV| 7|
| Fremont| HondaCivic| 10|
| San Jose| null| 13|
| San Jose| HondaAccord| 8|
| San Jose| HondaCivic| 5|
+---------+------------+---+
--Prepare data for ignore nulls example
CREATE TABLE person (id INT, name STRING, age INT);
INSERT INTO person VALUES
(100, 'Mary', NULL),
(200, 'John', 30),
(300, 'Mike', 80),
(400, 'Dan', 50);
--Select the first row in column age
SELECT FIRST(age) FROM person;
+--------------------+
| first(age, false) |
+--------------------+
| NULL |
+--------------------+
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
+-------------------+------------------+----------+
| first(age, true) | last(id, false) | sum(id) |
+-------------------+------------------+----------+
| 30 | 400 | 1000 |
+-------------------+------------------+----------+