Lead Image © Artsem Martysiuk, 123RF.com

An introduction to a special tool for transforming data formats

Data Lego

Article from ADMIN 38/2017
The Jaspersoft ETL tool creates complex constructs from simple components by scrubbing, filtering, converting, and importing data into a database.

The process of preparing data is called "extract, transform, and load" (ETL), and specialty tools like Jaspersoft ETL help you carry out these tasks. In this article, I introduce the community version 6.0.1 of Jaspersoft ETL, which is available as a free download [1].

Data Gold

Panning for gold is tedious. Nuggets of precious metal don't simply sit around on a riverbed. Instead, the prospector has to sift through multiple pans of sand and stones, often retrieving just a few flakes for all of the trouble expended. Data is the equivalent of gold in today's world. Given this reality, you must tediously filter huge volumes of data to extract tiny particles of information that have real value.

When looking for gold, a prospector first has to get a fix on the location of a deposit and then gain access to it. The same can be said for data: Preparing a promising source involves steps like harmonizing character encodings, converting values and data formats, and importing results into databases. These preparatory steps need to be performed before you can effectively pan for gold in the form of data.

Converting Files

One of the easiest exercises for learning the ETL tool involves converting an input file to a different format. A simple example of a text-based book list should suffice for presenting the first practical steps (Listing 1). The columns are separated by semicolons and padded to fixed widths; the lines are in no particular order.

Listing 1

Book List (Excerpt)

Year;Number;Author                        ;Title                                             ;Publisher                ;Pp  ;Price
2015;06;    Stepanov, Rose                ;From Mathematics to Generic Programming           ;Edison Wesley            ;0320;28
2015;09;    Whithurst, Jim                ;The Open Organization                             ;Harvard Business Review  ;0227;16
2015;03;    Kofler, Kühnast, Scherbeck    ;Raspberry Pi: Das umfassende Handbuch             ;Galileo                  ;1064;40
2016;04;    Heinle, Stefan                ;Heimautomatisierung MIT KNX, DALI, 1-Wire und Co. ;Rheinwerk                ;1260;50
2016;09;    Blasband, Darius              ;The rise and fall of software recipes             ;Reality Bites Publishing ;0360;25
2016;11;    Luckhaus, Stefan              ;Produktivitäts- und Leistungsmessung              ;Pass Consulting          ;0125;20
2016;01;    Troche, Christoph             ;FRITZ!Box                                         ;Mitp                     ;0240;20
2015;02;    Will, Torsten T.              ;Einführung in C++                                 ;Galileo                  ;0520;25
2015;10;    Keller, Müller                ;Pentaho und Jedox                                 ;Hanser                   ;0435;60
2015;03;    Ristic, Ivan                  ;Bulletproof SSL and TLS                           ;Feisty Duck              ;0507;46

To convert to a different format, you must first set up a "job," which is accomplished by starting the TIBCO Jaspersoft ETL tool and selecting the Create Job option from the context menu for Job Designs in the left pane. After entering a name for the job in the dialog box, a four-pane window opens (Figure 1):

Figure 1: The jobs are assembled in the center grid.
  • Top left is a file structure for the jobs and their components.
  • The center section has an area with a gridded background in which to assemble jobs from components.
  • Top right is a menu of components.
  • The bottom window has several tabs used to configure the components and start the jobs.

Now it is time to acquire the data to be processed. In this example, the data source is the text file in Listing 1. You need to use a component of type tFileInputDelimited, which opens the comma-separated values (CSV) file, reads it in line by line, separates the fields according to a defined separator and schema, and passes on the fields to the next component for further processing.

Drag the tFileInputDelimited input component from the menu on the right to the center of the gridded assembly area. Double-clicking on the component opens a new dialog box in which you should define the properties of the component (Figure 2). The definitions include the path to the input file and the separators. With a single click on the Edit schema button, you will come to a window in which you can set up the data type for each field of the CSV file.

Figure 2: The first component reads the input file and then sends the data chunks on their way.
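Conceptually, the input component behaves like a typed CSV reader: split each line on the separator, strip the padding, and cast every field according to the schema. The following sketch models that behavior in Python; the `SCHEMA` mapping is a hypothetical stand-in for the types you would set in the Edit schema dialog.

```python
import csv
import io

# Hypothetical schema mirroring the Edit schema dialog: column name -> type.
SCHEMA = {"Year": int, "Number": int, "Author": str, "Title": str,
          "Publisher": str, "Pp": int, "Price": int}

def read_delimited(stream, delimiter=";"):
    """Sketch of what tFileInputDelimited does: read the file line by
    line, split the fields on the separator, strip the padding, and
    cast each field per the schema before passing the row on."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    for raw in reader:
        yield {key.strip(): SCHEMA[key.strip()](value.strip())
               for key, value in raw.items()}

# Two lines from Listing 1, condensed:
sample = io.StringIO(
    "Year;Number;Author;Title;Publisher;Pp;Price\n"
    "2015;09;Whithurst, Jim;The Open Organization;"
    "Harvard Business Review;0227;16\n"
)
rows = list(read_delimited(sample))
print(rows[0]["Year"], rows[0]["Pp"])  # → 2015 227
```

Note how the schema turns the zero-padded page count `0227` into the integer 227; in the real tool, this typing happens when the schema assigns a numeric type to the column.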

From the right-hand column, select the component to be used to present the result and drag it onto the assembly area. This example uses a component of type tFileOutputExcel. After double-clicking on this component, you can define the name and path of the output file as above before connecting the components from left to right by clicking on the handle of one icon and pulling a line to the next icon (Figure 3). Now the only thing left to do is start the job, and voilà, an Excel spreadsheet containing the comma-separated data appears as specified.

Figure 3: The completed converter generates an Excel file from a CSV file.
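The wiring in Figure 3 amounts to function composition: each component consumes the rows its predecessor emits. The sketch below models this with two generatively chained functions; a real Excel writer would need a third-party library such as openpyxl, so a plain CSV writer stands in for `tFileOutputExcel` to keep the example self-contained.

```python
import csv
import io

def input_component(stream, delimiter=";"):
    # Stand-in for tFileInputDelimited: parse the delimited input into rows.
    yield from csv.reader(stream, delimiter=delimiter)

def output_component(rows, stream, delimiter=","):
    # Stand-in for tFileOutputExcel: writes the rows it receives.
    writer = csv.writer(stream, delimiter=delimiter)
    for row in rows:
        writer.writerow([field.strip() for field in row])

src = io.StringIO("Year;Number;Author\n2015;06;Stepanov, Rose\n")
dst = io.StringIO()
# Connecting the components left to right, as drawn in Figure 3:
output_component(input_component(src), dst)
print(dst.getvalue())
```

The point of the design is that components only agree on the shape of the rows passing between them, which is why the tool can freely recombine inputs, processors, and outputs.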

Sorting Files

The next level of complexity for a job like this is to insert, between the input and output components, a component that processes the data before it is output. Sorting is a very simple example of such processing.

In the example at hand, try sorting the unordered list of books according to year and issue number. To do this, you should place the tSortRow component between the input and output modules. The component is found in the right-hand menu under the Processing option.

Afterward, you should connect the components as before (Figure 4). This sets up the path for the data from the input CSV file to the sorter and from there to the output file. Now you can double-click to configure the sorting criteria in the columns of the schema.

Figure 4: Placing a sorting component between input and output components produces an ordered list.

Once the sorting criteria have been defined, you can start the job. After a few seconds, you will see a short notification in the assembly area about the number of processed lines and the elapsed processing time.
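What tSortRow does here is an ordinary multi-key sort: order the rows by year first, then by issue number within each year. A minimal sketch, assuming rows arrive as dictionaries like those the input component would deliver:

```python
# Hypothetical rows, as the input component would deliver them:
rows = [
    {"Year": 2016, "Number": 4, "Title": "Heimautomatisierung"},
    {"Year": 2015, "Number": 9, "Title": "The Open Organization"},
    {"Year": 2015, "Number": 2, "Title": "Einführung in C++"},
]

# Stand-in for tSortRow: sort by year, then by issue number.
rows.sort(key=lambda row: (row["Year"], row["Number"]))
print([row["Number"] for row in rows])  # → [2, 9, 4]
```

The tuple key makes the second criterion act as a tie-breaker, which matches configuring two sort columns in the component's schema.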
