Hive works on a file-reading mechanism: a Hive table is a schema mapped over data files stored in the Hadoop file system (HDFS). A prerequisite for this walkthrough is basic knowledge of Hive.
STEP-1 Copy the HDFS files of the particular partition of the Hive object to your local server using the get command. For example:
hdfs dfs -get /hadoop-server-details/path-of-the-file/partition-column-name=value /home/user1/
The assumption here is that the files mapped to the Hive object are not plain text files (say, Avro files), so it is recommended to copy them to your local (Linux) server first rather than editing them in place.
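If you are not sure which files sit under the partition, you can list them before and after copying. A quick check, using the same placeholder paths as above:
# list the data files under the HDFS partition directory
hdfs dfs -ls /hadoop-server-details/path-of-the-file/partition-column-name=value
# after the get, confirm the files arrived on the local server
ls -l /home/user1/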
STEP-2 Once the file is on the Linux server, go to /home/user1/ using the cd command. Now suppose you want to replace one value with another, for example, Chicago with Los Angeles. You can achieve this with the sed command:
sed -i 's/Chicago/Los Angeles/g' filename.avro
Now you have a new file with a different city. That is how the new data set is made ready.
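One caution: Avro is a binary format, so an in-place sed works reliably only when the replacement does not disturb the binary framing; replacing Chicago with the longer Los Angeles can corrupt the length prefixes inside the file. A safer variant is to round-trip through JSON with the avro-tools utility. A minimal sketch, assuming avro-tools is available on the server (the jar version here is a placeholder):
# dump the Avro records to plain JSON text
java -jar avro-tools-1.11.1.jar tojson filename.avro > filename.json
# do the replacement on the JSON, where sed is safe
sed -i 's/Chicago/Los Angeles/g' filename.json
# extract the original schema, then rebuild a valid Avro file
java -jar avro-tools-1.11.1.jar getschema filename.avro > schema.avsc
java -jar avro-tools-1.11.1.jar fromjson --schema-file schema.avsc filename.json > filename_new.avro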
STEP-3 Now you have two different data sets, and hence there are two scenarios:
1. Keep the existing partition for the Hive object and copy the new data set to a different partition.
2. Remove the existing data from the Hadoop file system and copy the new data set to the same existing partition.
Scenario 1: First, create a new partition directory for the same object using the mkdir command (in HDFS):
hdfs dfs -mkdir /Hadoop-path/file-name/date=2019-12-31
Once the new partition directory is created, copy the new data set into it using the command below:
hdfs dfs -put /home/user1/filename.avro /Hadoop-path/file-name/date=2019-12-31
Now we need to alter the Hive object as well, so that the metastore knows about the new partition:
alter table schema_name.hiveobject1 add partition (date='2019-12-31');
The next step is to run the msck repair command for that object:
msck repair table hiveobject1;
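To confirm the new partition is in place, you can check both the HDFS side and the Hive side. A quick verification, reusing the hypothetical names from this scenario:
# confirm the file landed in the new partition directory
hdfs dfs -ls /Hadoop-path/file-name/date=2019-12-31
Then, from the Hive shell:
show partitions schema_name.hiveobject1;
select count(*) from schema_name.hiveobject1 where date='2019-12-31';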
Scenario 2: Here, you first need to remove the data from the existing partition:
hdfs dfs -rm -r /Hadoop-path/file-name/date=2019-12-30/*
Then copy the new data set to the existing partition using the command below:
hdfs dfs -put /home/user1/filename.avro /Hadoop-path/file-name/date=2019-12-30/
Run the msck repair command to sync the partition metadata:
msck repair table hiveobject1;
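Note that in this scenario the partition already exists in the metastore, so msck repair acts mainly as a consistency check; Hive picks up the replaced files as soon as they land in the directory. To verify, reusing the same placeholder names:
# confirm the old files are gone and the new file is present
hdfs dfs -ls /Hadoop-path/file-name/date=2019-12-30/
select count(*) from schema_name.hiveobject1 where date='2019-12-30';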