建立資料來源表格

描述

CREATE TABLE 陳述式使用資料來源定義新表格。

語法

CREATE TABLE [ IF NOT EXISTS ] table_identifier
    [ ( col_name1 col_type1 [ COMMENT col_comment1 ], ... ) ]
    USING data_source
    [ OPTIONS ( key1=val1, key2=val2, ... ) ]
    [ PARTITIONED BY ( col_name1, col_name2, ... ) ]
    [ CLUSTERED BY ( col_name3, col_name4, ... ) 
        [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] 
        INTO num_buckets BUCKETS ]
    [ LOCATION path ]
    [ COMMENT table_comment ]
    [ TBLPROPERTIES ( key1=val1, key2=val2, ... ) ]
    [ AS select_statement ]

請注意,USING 子句和 AS SELECT 子句之間的子句可以按照任何順序出現。例如,您可以在 TBLPROPERTIES 之後撰寫 COMMENT table_comment。

參數

資料來源互動

資料來源表格就像指向基礎資料來源的指標。例如,您可以使用 JDBC 資料來源在 Spark 中建立一個指向 MySQL 中「bar」表格的「foo」表格。當您讀取/寫入「foo」表格時,您實際上是讀取/寫入「bar」表格。

一般而言,CREATE TABLE 會建立一個「指標」,而您需要確定它指向現有項目。檔案來源(例如 parquet、json)例外。如果您未指定 LOCATION,Spark 會為您建立預設的資料表位置。

對於 CREATE TABLE AS SELECT with LOCATION,如果給定的位置存在非空目錄,Spark 會擲回分析例外。如果 spark.sql.legacy.allowNonEmptyLocationInCTAS 設為 true,Spark 會使用輸入查詢的資料覆寫基礎資料來源,以確保建立的資料表包含與輸入查詢完全相同的資料。

範例


--Use data source
CREATE TABLE student (id INT, name STRING, age INT) USING CSV;

--Use data from another table
CREATE TABLE student_copy USING CSV
    AS SELECT * FROM student;
  
--Omit the USING clause, which uses the default data source (parquet by default)
CREATE TABLE student (id INT, name STRING, age INT);

--Use parquet data source with parquet storage options
--The columns 'id' and 'name' enable the bloom filter during writing parquet file,
--column 'age' does not enable
CREATE TABLE student_parquet(id INT, name STRING, age INT) USING PARQUET
    OPTIONS (
      'parquet.bloom.filter.enabled'='true',
      'parquet.bloom.filter.enabled#age'='false'
    );

--Specify table comment and properties
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

--Specify table comment and properties with different clauses order
CREATE TABLE student (id INT, name STRING, age INT) USING CSV
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

--Create partitioned and bucketed table
CREATE TABLE student (id INT, name STRING, age INT)
    USING CSV
    PARTITIONED BY (age)
    CLUSTERED BY (Id) INTO 4 buckets;

--Create partitioned and bucketed table through CTAS
CREATE TABLE student_partition_bucket
    USING parquet
    PARTITIONED BY (age)
    CLUSTERED BY (id) INTO 4 buckets
    AS SELECT * FROM student;

--Create bucketed table through CTAS and CTE
CREATE TABLE student_bucket
    USING parquet
    CLUSTERED BY (id) INTO 4 buckets (
    WITH tmpTable AS (
        SELECT * FROM student WHERE id > 100
    )
    SELECT * FROM tmpTable
);