====== Incremental insert in Hive and Impala ====== Task: To insert data set into a table, but if a record already exists in the target table, it must not be inserted. And you must overwrite whole partitions (which is a good practice on Hive). The source data set may be a table or a query result. \\ Solution: You identify those records which do not exist in the target table in terms of primary key. That makes the data increment. Now, you must union that with existing data in target partitions and then ovewrite those partitions. Assumptions: - MY_SOURCE_TABLE has the same primary key as MY_TARGET_TABLE - MY_SOURCE_TABLE may be a subquery. In this case resulting data must not violate target table's primary key. INSERT OVERWRITE TABLE MY_TARGET_TABLE PARTITION (PARTITION_COLUMN) SELECT SRC.KEY_COLUMN_1, SRC.KEY_COLUMN_2, SRC.COLUMN_3, SRC.COLUMN_4, PARTITION_COLUMN FROM MY_SOURCE_TABLE SRC LEFT OUTER JOIN MY_TARGET_TABLE TGT ON TGT.KEY_COLUMN_1 = SRC.KEY_COLUMN_1 AND TGT.KEY_COLUMN_2 = SRC.KEY_COLUMN_2 WHERE TGT.KEY_COLUMN_1 IS NULL /* only records not existing in the target */ UNION ALL SELECT TGT.KEY_COLUMN_1, TGT.KEY_COLUMN_2, TGT.COLUMN_3, TGT.COLUMN_4, PARTITION_COLUMN FROM MY_TARGET_TABLE T JOIN ( SELECT DISTINCT PARTITION_COLUMN FROM MY_SOURCE_TABLE PI ) P ON T.PARTITION_COLUMN = P.PARTITION_COLUMN ; On Hive you must set properties:\\ ''set hive.exec.dynamic.partition.mode=true''\\ ''set hive.exec.dynamic.partition.mode=nonstrict''\\ Why is there that join in subquery afer ''union all''? Because we want to union with only those partitions which we are going to ovewrite. \\ And of cource there can be more than one partition columns.