Execute a job using DataCleaner in Pentaho

In my last post, I explained how to create different data sources in DataCleaner. Today, I will use the same data source, a CSV input file, to design a job in the DataCleaner tool. Once the job is complete, we will use the DataCleaner component in Pentaho to execute the same job from Pentaho.

First, open the DataCleaner tool from Pentaho as described in my previous post, then click New -> Build New Job. See the screenshot below.

Then select the datastore, which is the EMP_DETAILS.csv file. The screen below will appear in the DataCleaner tool.

Here, you can see that all the field names of the file are listed in the left pane and the datastore is shown in the workspace area. Now we will design a job using this datastore.

Go to the Analyze section and drag the Completeness Analyzer component to the workspace area. Right-click EMP_DETAILS.csv, click Link, then drag the line from EMP_DETAILS.csv to the Completeness Analyzer. In the same way, drag the String Analyzer component and connect a hop between EMP_DETAILS.csv and the String Analyzer. See the screenshot below.

Click the Execute button at the top right. Once the job has executed, the image below will appear in the tool.

Here, the green arrows are hyperlinks. If you click the one in the EMP_LAST_NAME (blank count) row, it will take you to the rows where EMP_LAST_NAME is blank. See the screenshot below.
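To make the blank count and its drill-down concrete, here is a minimal Python sketch of the same idea. It is not how DataCleaner implements the Completeness Analyzer. The sample rows, the EMP_ID and EMP_FIRST_NAME columns, and the blank_rows helper are made up for illustration, and treating whitespace-only values as blank is my assumption.

```python
import csv
import io

# Hypothetical sample standing in for EMP_DETAILS.csv; the real file's
# rows and its other columns are not shown in this post.
SAMPLE = '''EMP_ID,EMP_FIRST_NAME,EMP_LAST_NAME
1,Asha,Rao
2,Vikram,
3,Meera,Nair
4,John," "
'''


def blank_rows(csv_text, column):
    """Return the rows whose value in `column` is missing or only whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Assumption: a whitespace-only value counts as blank.
    return [row for row in reader if not (row.get(column) or "").strip()]


rows = blank_rows(SAMPLE, "EMP_LAST_NAME")
print(len(rows))                    # the blank count for the column
print([r["EMP_ID"] for r in rows])  # the rows behind the green arrow
```

The count is what the result screen shows in the blank count row, and the list of rows is what you land on when you follow the green arrow.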

Now save this job to <PATH>\data-integration\plugins\kettle6-profiling-datacleaner\jobs and name it emp_Details. Behind the scenes, the file will be saved as emp_Details.analysis.xml.

Now we will design a job in Pentaho. Here, we will use the “Execute DataCleaner Job” entry. See the image below.

Run the job, then look at the result.html file generated at <PATH>\data-integration\plugins\kettle6-profiling-datacleaner\result.html. This file also contains hyperlinks.
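If you would rather open the report from a script than browse to it by hand, something like the following Python sketch could work. The function names and the pdi_home parameter are hypothetical. pdi_home stands for whatever <PATH> is on your machine, so nothing here assumes a specific install location.

```python
from pathlib import Path
import webbrowser


def result_report(pdi_home):
    """Build the report path under the Pentaho install folder (the post's <PATH>)."""
    return (Path(pdi_home) / "data-integration" / "plugins"
            / "kettle6-profiling-datacleaner" / "result.html")


def open_report(pdi_home):
    """Open result.html in the default browser; return False if it is not there."""
    report = result_report(pdi_home)
    if not report.is_file():
        # The job has not run yet, or pdi_home points to the wrong folder.
        return False
    return webbrowser.open(report.resolve().as_uri())
```

Call open_report with your own install folder after the Pentaho job finishes. A False return usually means the job did not write the report where expected.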

Once you click a green arrow (hyperlink), it will show the rows where EMP_LAST_NAME has null values, provided you clicked the green arrow next to EMP_LAST_NAME and the null count.
