Check duplicate record in Hive

Today, I will discuss about ” How to automate the process where in you can check entire row duplicate record in hive”. As I have mentioned in all Automation blogs, I will share the pseudo code.
STEP1: In hive , use “desc table_name” , this command will give you column names along with datatype and data length. Store the output of this command in a file , say HIVE_TABLE_DDL.txt

STEP2 : Read the file HIVE_TABLE_DDL.txt using “cat” command.

cat HIVE_TABLE_DDL.txt | awk '{print $1}' ORS=',' | sed 's/,$//'

* awk'{print $1}’ -> This will only print column names
* ORS=’,’-> Transponse the data into a row with comma separated.
* sed ‘s/,$//’ -> This will remove the last comma in the above transposed data.

Store the output of above mentioned code into a variable say COL_NAMES.

STEP3: Now , run the below query which has group by clause with all column names with having clause

select count(1) from table_name group by $COL_NAMES having count(1)>1

If count is greater than 1 , that means entire record is duplicate .

STEP4: Repeat above steps for all tables using config file and read that file one by one.

NOTE: You can use same steps for any database, just you need to change the STEP1.
SQL server -> Use information_schema.columns table
IBM DB2 -> Use syscat.columns table
Oracle -> Use USER_TAB_COLUMNS


Related posts