Lead Image © Danila Krylov, 123RF.com

Lead Image © Danila Krylov, 123RF.com

HTML to database with a Perl script

Clean Start

Article from ADMIN 60/2020
A Perl script strips HTML markup, creating text files, and makes each file an entry in a database.

At work, I was tasked to come up with a solution to a problem presented by a web developer who needed some 1,200 static HTML files stripped of the HTML markup, leaving behind only text. Because 1,200 files introduce too much complexity and are generally cumbersome to keep on a web server, a database-driven approach was necessary. Only a few Perl scripts would be needed to serve the pages. Although 1,200 files is perhaps an attractive approach to a low-tech solution, its simple bulk is unruly. Also, editing the text as database entries would be much simpler, employing only a few more fairly simple-to-implement Perl scripts.

In the solution described in this article, a Perl script strips the HTML markup with a simple regular expression and creates the text files that are put into a database.

The Solution

Initially, an HTML file employing three iframes is created (Listing 1) [1]. Each window is given a name with the name attribute (line 7). A simple target attribute within an anchor element sends the data to the specified iframe. The first iframe, designated menu, is a list of reference categories. The links from the menu iframe are targeted to the iframe below it, list (line 11), which contains the commands for the categories representing the language that was identified in the menu iframe.

Listing 1

HTML iframes

01 <table border="0" cellpadding="0" cellspacing="0" align="center">
02 <tr>
03   <td align="left" valign="top">
04   <table border="0" cellpadding="0" cellspacing="0" align="center">
05   <tr>
06     <td align="left" valign="top">
07     <iframe width="150" height="249" src="menu.html" name="menu"></iframe>
08     </td>
09   </tr><tr>
10     <td align="left" valign="top">
11     <iframe width="150" height="249" src="list.html" name="list"></iframe>
12     </td>
13   </tr>
14   </table>
15   <td align="left" valign="top">
16   <iframe width="400" height="500" src="data.html" name="data"> </iframe>
17   </td>
18 </tr>
19 </table>

Once the list of commands is loaded, the target attribute is again used in a list of links to display the language reference for one command. Designated data (line 16), this HTML file is the end of the decision tree – an intuitively obvious interface.

Originally, the data was displayed in a static HTML file. The Perl scripts employed once the flat files are inserted into the database are simple lists, as well, but with the capability to introduce features like hit counts, revision notes, and other cool and interesting features.

The decision was made that each line of each file should be a table row, with the content of each file being a table. The name of the file is also the table name, which is added to a lookup table for easy access later.

The beauty and flexibility of Perl is demonstrated very well in this small construction. The entire process took about a minute to perform on my little server and resulted in an easily referenced database, owing to the simplicity of the database model.

By nesting three foreach loops (Listing 2, lines 51, 55, 65), you can perform the step-by-step operation with a minimum of fuss, cycling first through @dirlist for the directory names, and then through @filelist for each file, which is then opened and read. The file data is split on the newlines and placed into the @split_text array. For each $filelist[$filelist_count], you add a table and populate it if the if statement evaluates to true.

Listing 2

The Master Database

001 #!/usr/bin/perl
003 use CGI;
004 use DBI;
005 use CGI::Carp (fatalsToBrowser);
007 $tv = new CGI;
009 ### make the directory list
010 @dirlist = (CssBackgroundProperties,
011   CssBorderProperties,
012   CssClassificationProperties,
013   CssFontProperties,
014   CssClassificationProperties,
015   CssFontProperties,
016   CssMarginProperties,
017   CssPaddingProperties,
018   CssTextProperties,
019   DomCollectionsAndArrays,
020   DomEventHandlers,
021   DomMethods,
022   DomObjects,
023   DomProperties,
024   HtmlAttributes,
025   HtmlEventHandlers,
026   HtmlTags,
027   JavaScriptControlStatements,
028   JavaScriptGlobalFunctions,
029   JavaScriptObjectMethods,
030   JavaScriptObjectProperties,
031   JavaScriptObjects,
032   JavaScriptOperators,
033   JavaScriptStatements);
035 ### connect to the database
036 $dbh = DBI->connect('DBI:mysql:host=mysql.domain.com;database=databasename', 'username', 'password', {'RaiseError' => 1}) or die "Cannot Connect to Database";
038 ### create the master table
039 $query = qq{CREATE TABLE master (
041     tablename VARCHAR (200) NOT NULL,
042     category VARCHAR (200)
043     )};
044 $sth = $dbh->prepare($query);
045 $sth->execute();
047 $count = "0";
048 $dirlist_count = "0";
049 $filelist_count = "0";
050 $type = "text";
051 foreach (@dirlist) {
052   while ($thisfile = <$dirlist[$dirlist_count]/*.txt>) {
053     push @filelist, $thisfile;
054   }
055   foreach (@filelist) {
056     open (FILEHANDLE, "$dirlist[$dirlist_count]/$filelist[$filelist_count]");
057     $text = "";
058     $newtext = "";
059     while (read (FILEHANDLE, $newtext, 1)) {
060       $text .= $newtext;
061     }
062     close FILEHANDLE;
063     @split_text = split "\n", $text;
064     $split_text_count = "0";
065     foreach (@split_text) {
066       $split_text[$split_text_count] =~ s/\n+//g;
067       $this_split = $split_text[$split_text_count];
068       $this_split =~ s/[ ]+//g;
069       if ($this_split != "") {
070         $query = qq{CREATE TABLE $filelist[$filelist_count] (
071           type VARCHAR (200) NOT NULL,
072           text BLOB
073           )};
074         $sth = $dbh->prepare($query);
075         $sth->execute();
077         $query = qq{INSERT INTO $filelist[$filelist_count] (
078           type,
079           text
080           ) VALUES (
081           '$type',
082           '$split_text[$split_text_count]'
083           )};
084         $sth = $dbh->prepare($query);
085         $sth->execute();
087         $query = qq{INSERT INTO master (
088           tablename,
089           category
090           ) VALUES (
091           '$filelist[$filelist_count]',
092           '$dirlist[$dirlist[$count]'
093           )};
094         $sth = $dbh->prepare($query);
095         $sth->execute();
097         ++$split_text_count;
098         ++$count;
099       }
100     ++$filelist_count;
101     }
102   ++$dirlist_count;
103   }
105 ### print da shtuff
106 print qq{Content-type: text/html\n\n};
107 print qq{I'm Done!<br>};
108 print qq{I have created $count tables};
109 exit;

The @dirlist array holds the names of the directories in which the files are placed, and each file within the directory is glob'd for the filenames, which are placed in @filelist (lines 52, 53). A foreach loop then opens and reads each file, byte-by-byte, the content of which is placed in the aptly name scalar $text.

Next, the code splits $text on the newlines and adds each new line of text into @split_text, removes the newline characters (line 66), and proceeds to make a copy of each item within $this_split to later test for whether the array item has no data (lines 63-68). Because each original text file has several blank lines, so will the @split_text array have empty items (file rows) that you won't want to see as part of the database.

Every space is removed from the copied array item. If the copied item is empty, the script will not do anything but continue on to the next item in @split_text through the use of the aforementioned foreach loop.

If $this_split != "" evaluates to true, the program creates a table using the filename as the table name and populates it with rows, each row of the text file being a row in the table. Each table name is then placed in the master lookup table.


Replacing the many HTML files that were used to display the menu items with one Perl script is a great solution. The only change you would have to make to the iframe is to the src attributes, which will now be .cgi files, of course designating a Perl script.

As touched on earlier, you can add features like hit counts, notes, and credits for those that work on the online reference just created. A very wide range of options like colors, fonts, and general overall look can be tailored to each individual user through the use of maybe one or two new tables or a more involved table column alteration.

The approach of looping through a multitiered decision tree made this project fun and was completed in one morning. The entire reference, now in a database, is served in an intuitively obvious way which makes it a joy to work with for users. Since this project was completed, many new features have been employed.

One of the greatest new capabilities that has become possible and very popular recently is a simple and easy-to-use language translator engine. Real-time language translators can translate the original English language entries on the fly into pretty much any language spoken today. These language translators can be accessed from a browser or a service, such as an application employed from a Perl script.

Happy coding!

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy ADMIN Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

comments powered by Disqus