JSON Files

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.

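Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. For more information, see the JSON Lines text format, also called newline-delimited JSON.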
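To make the line-by-line requirement concrete, here is a minimal sketch that uses only Python's standard `json` module rather than Spark. The two records are illustrative data, not the contents of people.json. It shows that every line is a valid JSON object on its own, while the text as a whole is not a single JSON document.

```python
import json

# Illustrative JSON Lines text: one complete JSON object per line.
json_lines_text = '{"name": "Andy", "age": 30}\n{"name": "Justin", "age": 19}\n'

# Each non-empty line parses on its own.
records = [json.loads(line) for line in json_lines_text.splitlines() if line.strip()]

# Parsing the whole text as one JSON document fails, because a second
# object follows the first one.
try:
    json.loads(json_lines_text)
    whole_text_is_valid_json = True
except json.JSONDecodeError:
    whole_text_is_valid_json = False

print(records)
print(whole_text_is_valid_json)
```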

For a regular multi-line JSON file, set the multiLine parameter to True.

# spark is from the previous example.
sc = spark.sparkContext

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+

Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.

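Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String] or a JSON file.

For a regular multi-line JSON file, set the multiLine option to true.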

// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.

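Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>. This conversion can be done using SparkSession.read().json() on either a Dataset<String> or a JSON file.

For a regular multi-line JSON file, set the multiLine option to true.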

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");

// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
        "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+

Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.

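Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. Use the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object.

For a regular multi-line JSON file, set the named parameter multiLine to TRUE.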

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)

# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
## root
##  |-- age: long (nullable = true)
##  |-- name: string (nullable = true)

# Register this DataFrame as a table.
createOrReplaceTempView(people, "people")

# SQL statements can be run by using the sql methods.
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
##     name
## 1 Justin

Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.

CREATE TEMPORARY VIEW jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)

SELECT * FROM jsonTable

Data Source Option

Data source options of JSON can be set via:

the .option/.options methods of
- DataFrameReader
- DataFrameWriter
- DataStreamReader
- DataStreamWriter
the built-in functions below
- from_json
- to_json
- schema_of_json
the OPTIONS clause at CREATE TABLE USING DATA_SOURCE

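Property Name	Default	Meaning	Scope
`timeZone`	(value of the `spark.sql.session.timeZone` configuration)	Sets the string that indicates a time zone ID to be used to format timestamps in the JSON data sources or partition values. The following formats of `timeZone` are supported: Region-based zone ID: it should have the form "area/city", such as "America/Los_Angeles". Zone offset: it should be in the format "(+|-)HH:mm", for example "-08:00" or "+01:00". "UTC" and "Z" are also supported as aliases of "+00:00". Other short names such as "CST" are not recommended because they can be ambiguous.	read/write
`primitivesAsString`	`false`	Infers all primitive values as a string type.	read
`prefersDecimal`	`false`	Infers all floating-point values as a decimal type. If the values do not fit in decimal, it infers them as doubles.	read
`allowComments`	`false`	Ignores Java/C++ style comments in JSON records.	read
`allowUnquotedFieldNames`	`false`	Allows unquoted JSON field names.	read
`allowSingleQuotes`	`true`	Allows single quotes in addition to double quotes.	read
`allowNumericLeadingZeros`	`false`	Allows leading zeros in numbers (e.g. 00012).	read
`allowBackslashEscapingAnyCharacter`	`false`	Allows quoting of any character using the backslash quoting mechanism.	read
`mode`	`PERMISSIVE`	Allows a mode for dealing with corrupt records during parsing. `PERMISSIVE`: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` field to the output schema. `DROPMALFORMED`: ignores the whole corrupted record. This mode is unsupported in the JSON built-in functions. `FAILFAST`: throws an exception when it meets a corrupted record.	read
`columnNameOfCorruptRecord`	(value of the `spark.sql.columnNameOfCorruptRecord` configuration)	Allows renaming the new field, created by `PERMISSIVE` mode, that holds the malformed string. This overrides `spark.sql.columnNameOfCorruptRecord`.	read
`dateFormat`	`yyyy-MM-dd`	Sets the string that indicates a date format. Custom date formats follow the formats at Datetime Patterns. This applies to date type.	read/write
`timestampFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`	Sets the string that indicates a timestamp format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp type.	read/write
`timestampNTZFormat`	`yyyy-MM-dd'T'HH:mm:ss[.SSS]`	Sets the string that indicates a timestamp without timezone format. Custom date formats follow the formats at Datetime Patterns. This applies to timestamp without timezone type. Note that zone-offset and time-zone components are not supported when writing or reading this data type.	read/write
`enableDateTimeParsingFallback`	Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided.	Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.	read
`multiLine`	`false`	Parses one record, which may span multiple lines, per file. JSON built-in functions ignore this option.	read
`allowUnquotedControlChars`	`false`	Controls whether JSON strings may contain unquoted control characters (ASCII characters with a value less than 32, including tab and line feed characters).	read
`encoding`	Detected automatically when `multiLine` is set to `true` (for reading), `UTF-8` (for writing)	For reading, allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. For writing, specifies the encoding (charset) of saved JSON files. JSON built-in functions ignore this option.	read/write
`lineSep`	`\r`, `\r\n`, `\n` (for reading), `\n` (for writing)	Defines the line separator that should be used for parsing. JSON built-in functions ignore this option.	read/write
`samplingRatio`	`1.0`	Defines the fraction of input JSON objects used for schema inference.	read
`dropFieldIfAllNull`	`false`	Whether to ignore columns of all null values or empty arrays during schema inference.	read
`locale`	`en-US`	Sets a locale as a language tag in IETF BCP 47 format. For instance, `locale` is used while parsing dates and timestamps.	read
`allowNonNumericNumbers`	`true`	Allows the JSON parser to recognize a set of "Not-a-Number" (NaN) tokens as legal floating-point number values. `+INF`: for positive infinity, as well as the aliases `+Infinity` and `Infinity`. `-INF`: for negative infinity, alias `-Infinity`. `NaN`: for other not-a-numbers, such as the result of division by zero.	read
`compression`	(none)	Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). JSON built-in functions ignore this option.	write
`ignoreNullFields`	(value of the `spark.sql.jsonGenerator.ignoreNullFields` configuration)	Whether to ignore null fields when generating JSON objects.	write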

Other generic options can be found in Generic File Source Options.
