Today I will discuss how to use the DataCleaner plugin in Pentaho 7.0.0.
First, go to Tools -> Marketplace and search for “DataCleaner”. See the screenshot below.
As you can see, it shows as installed because I have already installed it on my local machine.
Once you install this plugin, go to <PATH>/data-integration/plugins/, where you will see the folder “kettle6-profiling-datacleaner”.
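If you prefer to confirm the installation from a script rather than by browsing, a minimal sketch like the one below checks for that folder. The function name `plugin_installed` and the example path are my own illustrations, not part of Pentaho; only the folder name comes from the step above.

```python
from pathlib import Path

# Folder name as given in the step above.
PLUGIN_FOLDER = "kettle6-profiling-datacleaner"


def plugin_installed(pdi_home):
    """Return True if the plugin folder exists under <pdi_home>/plugins."""
    return (Path(pdi_home) / "plugins" / PLUGIN_FOLDER).is_dir()


if __name__ == "__main__":
    # Hypothetical location; replace it with your own data-integration path.
    print(plugin_installed(r"C:\pentaho\data-integration"))
```

This only checks that the folder is present; it does not verify that the plugin loads correctly in Spoon.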
Now you need to download DataCleaner from the URL below.
I have downloaded the latest version, DataCleaner 5.1.5.
Once the download is complete, unzip the archive. The extracted “DataCleaner-windows” folder contains a DataCleaner directory.
Double-click spoon.bat to open the Pentaho DI workspace. Go to Tools -> DataCleaner Configuration, click Browse, and give the path of the DataCleaner folder, which is
Click OK. As soon as you click OK, a configuration file named datacleaner-configuration.txt is generated inside the folder <Data integration Path>\plugins\kettle6-profiling-datacleaner. If you open this file, you will see the DataCleaner path that you specified earlier.
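To check the generated file without opening it by hand, something like the following sketch can print its contents. I am not assuming anything about the file's internal format here; the code simply returns the raw text. The function name `read_datacleaner_config` and the example path are hypothetical.

```python
from pathlib import Path

# File and folder names as given in the step above.
PLUGIN_FOLDER = "kettle6-profiling-datacleaner"
CONFIG_NAME = "datacleaner-configuration.txt"


def read_datacleaner_config(pdi_home):
    """Return the text of the generated configuration file, or None if it is missing."""
    config = Path(pdi_home) / "plugins" / PLUGIN_FOLDER / CONFIG_NAME
    if not config.is_file():
        return None
    return config.read_text(encoding="utf-8", errors="replace").strip()


if __name__ == "__main__":
    # Hypothetical location; replace it with your own data-integration path.
    print(read_datacleaner_config(r"C:\pentaho\data-integration"))
```

If the function returns None, the configuration step in Spoon has probably not been completed yet.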
If you are using Pentaho Data Integration version 7.0 or 7.1, there is one additional step to complete this configuration.
Close Spoon now and make sure DataCleaner is not running.
Open Windows Explorer and navigate to the \data-integration\lib folder.
Copy the file commons-vfs2-2.1-20150824.jar and paste the copy into the
Rename the original commons-vfs2-2.0.jar, which is present in the DataCleaner folder, to commons-vfs2-2.0.OLD so that it will not be loaded when DataCleaner launches.
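The copy-and-rename step can also be scripted. The sketch below is only an illustration: the function name `swap_vfs_jar` is hypothetical, and it assumes the new jar goes into the same DataCleaner folder that holds commons-vfs2-2.0.jar, which you must pass in yourself as `target_dir`. Run it only while Spoon and DataCleaner are closed.

```python
import shutil
from pathlib import Path

# Jar names as given in the steps above.
NEW_JAR = "commons-vfs2-2.1-20150824.jar"
OLD_JAR = "commons-vfs2-2.0.jar"


def swap_vfs_jar(pdi_lib, target_dir):
    """Copy the newer jar from pdi_lib into target_dir and disable the old jar.

    target_dir is an assumption: the DataCleaner folder that contains OLD_JAR.
    Returns the path of the renamed old jar, or None if it was not found.
    """
    pdi_lib, target_dir = Path(pdi_lib), Path(target_dir)
    # Copy (not move) so the original stays in data-integration\lib.
    shutil.copy2(pdi_lib / NEW_JAR, target_dir / NEW_JAR)
    old = target_dir / OLD_JAR
    if not old.exists():
        return None
    # commons-vfs2-2.0.jar -> commons-vfs2-2.0.OLD
    return old.rename(old.with_suffix(".OLD"))
```

Renaming instead of deleting keeps the change easy to undo: rename the .OLD file back to .jar and remove the copied jar.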
At this point everything should be set up and configured. The steps to create a data profiling job follow.
Now open Spoon again. Open any transformation, right-click any step, such as CSV file input, and go to Profile with DataCleaner -> Generate Sample job from Metadata. See the screenshot.
Then run the transformation. As soon as you run it, a DataCleaner popup appears; click Continue. See the image below.
Click the Execute button at the top right. See the screenshot.
Here I have used a file with PHONE NUMBER and NAME columns. After you click Execute, the data below appears.
Here you can see: total count -> 4, NULL values -> 1, unique -> 1, non-unique -> 2.
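To make these four numbers concrete, here is a small sketch that reproduces them for a made-up column. The sample values are hypothetical, not the contents of my file, and the definitions used (unique = values occurring exactly once, non-unique = rows whose value occurs more than once) are my reading of the result, not DataCleaner's documented implementation.

```python
from collections import Counter


def profile_column(values):
    """Count total, null, unique, and non-unique entries in one column."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "total": len(values),
        "null": len(values) - len(non_null),
        # Values that appear exactly once.
        "unique": sum(1 for c in counts.values() if c == 1),
        # Rows whose value appears more than once.
        "non_unique": sum(c for c in counts.values() if c > 1),
    }


# Hypothetical phone-number column: one null, one unique value, one duplicated value.
sample = ["9000000001", "9000000002", "9000000002", None]
print(profile_column(sample))
# {'total': 4, 'null': 1, 'unique': 1, 'non_unique': 2}
```

With this reading, null, unique, and non-unique always add up to the total row count, which matches 1 + 1 + 2 = 4 above.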
In this way, you can profile your data and clean it effectively.
You can also click Number Analyser and String Analyser to see details such as min value, max value, and NULL values. See the screenshot.
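As a rough illustration of the kind of summary these analysers show, the sketch below computes the min, max, and NULL count for a numeric column. The function name `number_summary` and the sample data are hypothetical, and the real analysers report more measures than these three.

```python
def number_summary(values):
    """Return the null count, minimum, and maximum of a numeric column."""
    present = [v for v in values if v is not None]
    return {
        "null": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


# Hypothetical numeric column with one missing value.
print(number_summary([12, 7, None, 30]))
# {'null': 1, 'min': 7, 'max': 30}
```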