Tomasrohr.org

Incremental insert in Hive and Impala

Task: To insert data set into a table, but if a record already exists in the target table, it must not be inserted. And you must overwrite whole partitions (which is a good practice on Hive). The source data set may be a table or a query result.
Solution: You identify those records which do not exist in the target table in terms of primary key. That makes the data increment. Now, you must union that with existing data in target partitions and then ovewrite those partitions.

Assumptions:

  1. MY_SOURCE_TABLE has the same primary key as MY_TARGET_TABLE
  2. MY_SOURCE_TABLE may be a subquery. In this case resulting data must not violate target table's primary key.
INSERT OVERWRITE TABLE MY_TARGET_TABLE PARTITION (PARTITION_COLUMN)
SELECT
  SRC.KEY_COLUMN_1,
  SRC.KEY_COLUMN_2,
  SRC.COLUMN_3,
  SRC.COLUMN_4,
  PARTITION_COLUMN
FROM
  MY_SOURCE_TABLE SRC
  LEFT OUTER JOIN MY_TARGET_TABLE TGT
    ON TGT.KEY_COLUMN_1 = SRC.KEY_COLUMN_1
    AND TGT.KEY_COLUMN_2 = SRC.KEY_COLUMN_2
WHERE
  TGT.KEY_COLUMN_1 IS NULL  /* only records not existing in the target */
UNION ALL
SELECT
  TGT.KEY_COLUMN_1,
  TGT.KEY_COLUMN_2,
  TGT.COLUMN_3,
  TGT.COLUMN_4,
  PARTITION_COLUMN
FROM
  MY_TARGET_TABLE T
  JOIN (   
    SELECT DISTINCT
      PARTITION_COLUMN
    FROM
      MY_SOURCE_TABLE PI
  ) P
    ON T.PARTITION_COLUMN = P.PARTITION_COLUMN
;

On Hive you must set properties:
set hive.exec.dynamic.partition.mode=true
set hive.exec.dynamic.partition.mode=nonstrict

Why is there that join in subquery afer union all? Because we want to union with only those partitions which we are going to ovewrite.
And of cource there can be more than one partition columns.