Text Files

Spark SQL provides spark.read().text("file_name") to read a text file or a directory of text files into a Spark DataFrame, and dataframe.write().text("path") to write to a text file. When reading a text file, each line becomes a row with a single string column named "value" by default. The line separator can be changed, as shown in the examples below. The option() function can be used to customize the read or write behavior, such as the line separator, compression, and so on.

# spark is from the previous example
sc = spark.sparkContext

# A text dataset is pointed to by path.
# The path can be either a single text file or a directory of text files
path = "examples/src/main/resources/people.txt"

df1 = spark.read.text(path)
df1.show()
# +-----------+
# |      value|
# +-----------+
# |Michael, 29|
# |   Andy, 30|
# | Justin, 19|
# +-----------+

# You can use 'lineSep' option to define the line separator.
# The line separator handles all `\r`, `\r\n` and `\n` by default.
df2 = spark.read.text(path, lineSep=",")
df2.show()
# +-----------+
# |      value|
# +-----------+
# |    Michael|
# |   29\nAndy|
# | 30\nJustin|
# |       19\n|
# +-----------+

# You can also use 'wholetext' option to read each input file as a single row.
df3 = spark.read.text(path, wholetext=True)
df3.show()
# +--------------------+
# |               value|
# +--------------------+
# |Michael, 29\nAndy...|
# +--------------------+

# "output" is a folder which contains multiple text files and a _SUCCESS file.
df1.write.text("output")

# You can specify the compression format using the 'compression' option.
df1.write.text("output_compressed", compression="gzip")

Find the full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
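
As a small supplementary sketch (not part of the repository example above): the 'lineSep' option also applies when writing, where the default separator is \n. This assumes the df1 DataFrame from the Python example and a hypothetical output folder name.

# A sketch, assuming df1 from the example above.
# 'lineSep' on write overrides the default \n separator.
df1.write.text("output_custom_sep", lineSep="|")
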
// A text dataset is pointed to by path.
// The path can be either a single text file or a directory of text files
val path = "examples/src/main/resources/people.txt"

val df1 = spark.read.text(path)
df1.show()
// +-----------+
// |      value|
// +-----------+
// |Michael, 29|
// |   Andy, 30|
// | Justin, 19|
// +-----------+

// You can use 'lineSep' option to define the line separator.
// The line separator handles all `\r`, `\r\n` and `\n` by default.
val df2 = spark.read.option("lineSep", ",").text(path)
df2.show()
// +-----------+
// |      value|
// +-----------+
// |    Michael|
// |   29\nAndy|
// | 30\nJustin|
// |       19\n|
// +-----------+

// You can also use 'wholetext' option to read each input file as a single row.
val df3 = spark.read.option("wholetext", true).text(path)
df3.show()
// +--------------------+
// |               value|
// +--------------------+
// |Michael, 29\nAndy...|
// +--------------------+

// "output" is a folder which contains multiple text files and a _SUCCESS file.
df1.write.text("output")

// You can specify the compression format using the 'compression' option.
df1.write.option("compression", "gzip").text("output_compressed")

Find the full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// A text dataset is pointed to by path.
// The path can be either a single text file or a directory of text files
String path = "examples/src/main/resources/people.txt";

Dataset<Row> df1 = spark.read().text(path);
df1.show();
// +-----------+
// |      value|
// +-----------+
// |Michael, 29|
// |   Andy, 30|
// | Justin, 19|
// +-----------+

// You can use 'lineSep' option to define the line separator.
// The line separator handles all `\r`, `\r\n` and `\n` by default.
Dataset<Row> df2 = spark.read().option("lineSep", ",").text(path);
df2.show();
// +-----------+
// |      value|
// +-----------+
// |    Michael|
// |   29\nAndy|
// | 30\nJustin|
// |       19\n|
// +-----------+

// You can also use 'wholetext' option to read each input file as a single row.
Dataset<Row> df3 = spark.read().option("wholetext", "true").text(path);
df3.show();
// +--------------------+
// |               value|
// +--------------------+
// |Michael, 29\nAndy...|
// +--------------------+

// "output" is a folder which contains multiple text files and a _SUCCESS file.
df1.write().text("output");

// You can specify the compression format using the 'compression' option.
df1.write().option("compression", "gzip").text("output_compressed");

Find the full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.

Data Source Option

Data source options of text can be set via:

- the .option()/.options() methods of DataFrameReader, DataFrameWriter, DataStreamReader and DataStreamWriter
- the OPTIONS clause at CREATE TABLE USING DATA_SOURCE

Property Name | Default | Meaning | Scope
wholetext | false | If true, read each file from the input path(s) as a single row. | read
lineSep | \r, \r\n, \n (for reading); \n (for writing) | Defines the line separator that should be used for reading or writing. | read/write
compression | (none) | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). | write

Other generic options can be found in Generic File Source Options.
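
An illustrative sketch (assuming the `spark` session and `path` variable from the Python example above): the same 'lineSep' option can be passed as a keyword argument, through the .option()/.options() builder methods, or in the OPTIONS clause of CREATE TABLE USING.

# A sketch, assuming `spark` and `path` from the Python example above.
# The three reads below are equivalent ways to set the 'lineSep' option.
df_a = spark.read.text(path, lineSep=",")
df_b = spark.read.option("lineSep", ",").text(path)
df_c = spark.read.options(lineSep=",").text(path)

# The same option set via the OPTIONS clause in SQL (table name is hypothetical).
spark.sql(
    "CREATE TABLE people_txt USING text "
    "OPTIONS (path 'examples/src/main/resources/people.txt', lineSep ',')")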