NULL 語意

說明

表格由一組列組成,而每一列都包含一組欄位。欄位與資料類型相關聯,並表示實體的特定屬性(例如,age 是稱為 person 的實體的欄位)。有時候,特定於某一列的欄位值在該列產生時並不知道。在 SQL 中,此類值表示為 NULL。本節詳細說明 NULL 值在各種運算子、運算式和其他 SQL 建構中的處理語意。

  1. 比較運算子中的 Null 處理
  2. 邏輯運算子中的 Null 處理
  3. 運算式中的 Null 處理
    1. 不允許 Null 的運算式中的 Null 處理
    2. 可處理 Null 值運算元的運算式中的 Null 處理
    3. 內建聚合運算式中的 Null 處理
  4. WHERE、HAVING 和 JOIN 條件中的 Null 處理
  5. GROUP BY 和 DISTINCT 中的 Null 處理
  6. ORDER BY 中的 Null 處理
  7. UNION、INTERSECT、EXCEPT 中的 Null 處理
  8. EXISTS 和 NOT EXISTS 子查詢中的 Null 處理
  9. IN 和 NOT IN 子查詢中的 Null 處理

下列說明名為 person 的表格的結構配置和資料。資料在 age 欄位中包含 NULL 值,而此表格將用於下列各節中的各種範例。
表格:person

Id Name Age
100 Joe 30
200 Marry NULL
300 Mike 18
400 Fred 50
500 Albert NULL
600 Michelle 30
700 Dan 50

比較運算子

Apache Spark 支援標準的比較運算子,例如「>」、「>=」、「=」、「<」和「<=」。當運算元之一或兩個運算元都是未知或 NULL 時,這些運算子的結果為未知或 NULL。為了比較 NULL 值是否相等,Spark 提供了一個空值安全相等運算子(「<=>」),當一個運算元為 NULL 時傳回 False,當兩個運算元都是 NULL 時傳回 True。下表說明當一個或兩個運算元為 NULL` 時,比較運算子的行為

左運算元 右運算元 > >= = < <= <=>
NULL 任何值 NULL NULL NULL NULL NULL False
任何值 NULL NULL NULL NULL NULL NULL False
NULL NULL NULL NULL NULL NULL NULL True

範例

-- Normal comparison operators return `NULL` when one of the operand is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator return `False` when one of the operand is `NULL`
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- Null-safe equal operator return `True` when one of the operand is `NULL`
SELECT NULL <=> NULL;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

邏輯運算子

Spark 支援標準的邏輯運算子,例如 ANDORNOT。這些運算子以 布林 敘述作為引數,並傳回 布林 值。

下表說明當一個或兩個運算元為 NULL 時,邏輯運算子的行為。

左運算元 右運算元 OR AND
True NULL True NULL
False NULL NULL False
NULL True True NULL
NULL False NULL False
NULL NULL NULL NULL
運算元 NOT
NULL NULL

範例

-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Normal comparison operators return `NULL` when both the operands are `NULL`.
SELECT (null OR false) AS expression_output
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator returns `False` when one of the operands is `NULL`
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

敘述

在 Spark 中,比較運算子和邏輯運算子被視為敘述。除了這兩種敘述之外,Spark 還支援其他形式的敘述,例如函數敘述、轉型敘述等。Spark 中的敘述可廣泛分類為

不允許 Null 的敘述

當敘述的一個或多個引數為 NULL 時,不允許 Null 的敘述會傳回 NULL,而且大多數的敘述都屬於此類別。

範例
SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

可處理 Null 值運算元的表達式

此類表達式旨在處理 NULL 值。表達式的結果取決於表達式本身。舉例來說,函數表達式 isnull 會在 null 輸入時傳回 true,而在非 null 輸入時傳回 false,而函數 coalesce 會在其運算元清單中傳回第一個非 NULL 值。不過,當 coalesce 的所有運算元都是 NULL 時,它會傳回 NULL。以下是此類別中不完整的表達式清單。

範例
SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of non `NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`. 
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

內建聚合表達式

聚合函數會透過處理一組輸入列來計算單一結果。以下是 NULL 值由聚合函數處理的方式規則。

範例

-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person where 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person where 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+

WHERE、HAVING 和 JOIN 子句中的條件表達式

WHEREHAVING 運算子會根據使用者指定的條件來篩選列。 JOIN 運算子用於根據聯結條件來結合兩個表格中的列。對於這三個運算子來說,條件表達式都是布林表達式,且可以傳回 True、False 或 Unknown (NULL)。如果條件的結果為 True,則表示它們「符合條件」。

範例

-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- `IS NULL` expression is used in disjunction to select the persons
-- with unknown (`NULL`) records.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- Person with unknown(`NULL`) ages are skipped from processing.
SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
    WHERE p1.age = p2.age
    AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age column from both legs of join are compared using null-safe equal which
-- is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
    WHERE p1.age <=> p2.age
    AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+

聚合運算子 (GROUP BY、DISTINCT)

如前一節 比較運算子 中所討論,兩個 NULL 值不相等。然而,為了分組和區別處理,兩個或更多具有 NULL 資料 的值會被分組到同一個區段中。此行為符合 SQL 標準和其他企業資料庫管理系統。

範例

-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+

排序運算子 (ORDER BY 子句)

Spark SQL 支援在 ORDER BY 子句中指定 NULL 排序。Spark 透過將所有 NULL 值放在最前面或最後面來處理 ORDER BY 子句,具體取決於 NULL 排序規格。預設情況下,所有 NULL 值都會放在最前面。

範例

-- `NULL` values are shown at first and other values
-- are sorted in ascending way.
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- way and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Columns other than `NULL` values are sorted in descending
-- and `NULL` values are shown at the last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+

設定運算子 (UNION、INTERSECT、EXCEPT)

在設定運算的背景下,NULL 值會以空值安全的方式進行相等性比較。這表示在比較列時,兩個 NULL 值會被視為相等,這與一般的 EqualTo(=) 運算子不同。

範例

CREATE VIEW unknown_age SELECT * FROM person WHERE age IS NULL;

-- Only common rows between two legs of `INTERSECT` are in the 
-- result set. The comparison between columns of the row are done
-- in a null-safe manner.
SELECT name, age FROM person
    INTERSECT
    SELECT name, age from unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- `NULL` values from two legs of the `EXCEPT` are not in output. 
-- This basically shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
    EXCEPT
    SELECT age FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs `UNION` operation between two sets of data. 
-- The comparison between columns of the row ae done in
-- null-safe manner.
SELECT name, age FROM person
    UNION 
    SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+

EXISTS/NOT EXISTS 子查詢

在 Spark 中,WHERE 子句內允許使用 EXISTS 和 NOT EXISTS 函數。這些是布林函數,會傳回 TRUEFALSE。換句話說,EXISTS 是成員條件,當它所引用的子查詢傳回一或多列時,會傳回 TRUE。類似地,NOT EXISTS 是非成員條件,當子查詢沒有傳回任何列或傳回零列時,會傳回 TRUE。

這兩個函數不受子查詢結果中是否存在 NULL 的影響。它們通常較快,因為它們可以在沒有特別考量空值感知的情況下轉換為半聯結/反半聯結。

範例

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when
-- subquery produces no rows. In this case, it returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

IN/NOT IN 子查詢

在 Spark 中,查詢的 WHERE 子句內允許使用 INNOT IN 函數。與 EXISTS 函數不同,IN 函數可以傳回 TRUEFALSEUNKNOWN (NULL) 值。從概念上來說,IN 函數在語意上等同於一組相等條件,這些條件以析取運算子 (OR) 分隔。例如,c1 IN (1, 2, 3) 在語意上等同於 (C1 = 1 OR c1 = 2 OR c1 = 3)

就處理 NULL 值而言,語意可以從比較運算子 (=) 和邏輯運算子 (OR) 中的 NULL 值處理中推論出來。總之,以下是計算 IN 函數結果的規則。

當清單包含 NULL 時,NOT IN 總是會傳回 UNKNOWN,與輸入值無關。這是因為如果值不在包含 NULL 的清單中,則 IN 會傳回 UNKNOWN,而 NOT UNKNOWN 又是 UNKNOWN。

範例

-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has `NULL` value in the result set as well as a valid 
-- value `50`. Rows with age = 50 are returned. 
SELECT * FROM person
    WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since subquery has `NULL` value in the result set, the `NOT IN`
-- predicate would return UNKNOWN. Hence, no rows are
-- qualified for this query.
SELECT * FROM person
    WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+