Tomasrohr.org

Merge in Hive and Impala

Suppose you want to load data into an already populated Hive table and you want to apply data as update of existing records and insert of new records. Neither Hive nor Impala query language includes MERGE command (or UPSERT known from Teradata). Also it is a good practice to ovewrite a whole table partition at once. Here is the command which performs what you would achieve in RDBMS using MERGE or UPSERT command.

Assumptions:

  1. MY_SOURCE_TABLE has the same primary key as MY_TARGET_TABLE
  2. MY_SOURCE_TABLE may be a subquery. In this case resulting data must not violate target table's primary key.
-- on Hive
set hive.exec.dynamic.partition.mode=true
set hive.exec.dynamic.partition.mode=nonstrict;
-- on Hive


INSERT OVERWRITE TABLE MY_TARGET_TABLE PARTITION (PARTITION_COLUMN)
SELECT 
    COALESCE(SRC.COLUMN_1, TGT.COLUMN_1)  AS COLUMN_1,
    COALESCE(SRC.COLUMN_2, TGT.COLUMN_2)  AS COLUMN_2,
    COALESCE(SRC.COLUMN_3, TGT.COLUMN_3)  AS COLUMN_3,
    COALESCE(SRC.PARTITION_COLUMN, TGT.PARTITION_COLUMN) AS PARTITION_COLUMN
FROM MY_SOURCE_TABLE SRC
FULL OUTER JOIN  MY_TARGET_TABLE TGT 
  ON  TGT.KEY_COLUMN_1 = SRC.KEY_COLUMN_1         
  AND TGT.KEY_COLUMN_2 = SRC.KEY_COLUMN_2 
WHERE coalesce(SRC.COLUMN_3, TGT.COLUMN_3) = 'something'  /* filter condition */
;

On Hive you must set properties:
set hive.exec.dynamic.partition.mode=true
set hive.exec.dynamic.partition.mode=nonstrict