Create JSON from HTML in Perl

Assumption: your HTML file is not pretty-printed, i.e. all of its content sits on a single line.

Code:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use JSON;
use autodie;

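# Assumes files.html is UTF-8 encoded; decoding on read keeps encode_json from double-encoding non-ASCII text.
open my $fh1, '<:encoding(UTF-8)', 'files.html' or die "Cannot open files.html: $!\n";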

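# Only the first line is read, which is why the whole document must sit on one line.
my $html = <$fh1> // '';
chomp $html;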

$html =~ s/<html><head><title>JSON Viewer<\/title><\/head><body><table border="1"><thead><tr><th width="300">Keys<\/th><th width="500">Values<\/th><\/tr><\/thead><tbody>//g;
$html =~ s/<\/tbody><\/table><\/body><\/html>//g;

my @rows = split m{</tr>}, $html;

my %data;

foreach my $row (@rows)
{
    # The first cell holds the key, the second the <br/>-separated values.
    my ($key, $cell) = split m{</td>}, $row;
    next unless defined $key and defined $cell;

    $key  =~ s/<tr><td>//;
    $cell =~ s/<td>//;

    # Cells are told apart by position, so a cell with a single value
    # (and therefore no <br/>) is no longer mistaken for a key.
    push @{ $data{$key} }, split m{<br/>}, $cell;
}

close($fh1) or die "Cannot close files.html: $!\n";

my $json = encode_json \%data;
open my $fh2, '>', 'files.json' or die "open failed <output: files.json>: $!\n";
print $fh2 $json or die "print failed <output: files.json>: $!\n";
close $fh2 or die "close failed <output: files.json>: $!\n";

This Perl script reads files.html, extracts each key and its list of values from the table rows, and writes the result to files.json using the JSON module. It relies on the whole HTML document sitting on a single line.
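If the file is pretty-printed, one possible workaround is to collapse it onto one line before applying the same substitutions. The sketch below is only an illustration: flatten_html is a hypothetical helper, not part of the script above, and it assumes the markup has no whitespace-sensitive content such as <pre> blocks.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse pretty-printed HTML onto a single line.
# Assumes there is no whitespace-sensitive content (e.g. <pre> blocks).
sub flatten_html {
    my ($html) = @_;
    $html =~ s/>\s+</></g;     # drop whitespace between adjacent tags
    $html =~ s/\R/ /g;         # turn any remaining line breaks into spaces
    $html =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    return $html;
}

my $pretty = "<tr>\n  <td>fruits</td>\n  <td>apple<br/>banana</td>\n</tr>\n";
print flatten_html($pretty), "\n";
```

To use it with the script, the file would have to be read in full (for example with local $/; before the read) rather than one line at a time.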

Converting markup into structured JSON is useful when feeding scraped or exported content into other tools. For messier, multi-line HTML, a dedicated parser such as HTML::TreeBuilder gives more reliable results than string substitution.
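As a rough sketch of the parser-based approach, the same key/values table could be read with HTML::TreeBuilder along these lines. The helper name table_to_hash and the sample markup are made up for illustration, and the code assumes each data row has exactly two <td> cells, with values separated by <br/> as in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use JSON;

# Hypothetical helper: map each row's first cell to the text pieces of its second cell.
sub table_to_hash {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my %data;

    for my $row ($tree->look_down(_tag => 'tr')) {
        my ($key_cell, $value_cell) = $row->look_down(_tag => 'td');
        next unless $key_cell and $value_cell;   # header rows use <th>, so skip them

        # In content_list, plain strings are text nodes; <br> appears as an element object.
        my @values = grep { !ref } $value_cell->content_list;
        s/^\s+|\s+$//g for @values;              # trim whitespace left by indentation
        $data{ $key_cell->as_trimmed_text } = [ grep { length } @values ];
    }

    $tree->delete;   # release the parse tree
    return \%data;
}

# Made-up sample input, spread over several lines on purpose.
my $html = <<'HTML';
<table>
  <thead><tr><th>Keys</th><th>Values</th></tr></thead>
  <tbody>
    <tr><td>fruits</td><td>apple<br/>banana</td></tr>
    <tr><td>colors</td><td>red</td></tr>
  </tbody>
</table>
HTML

print JSON->new->canonical->pretty->encode(table_to_hash($html)), "\n";
```

Because the parser works on the document tree rather than on literal strings, line breaks, indentation, and attribute changes in the surrounding markup should not affect the result.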
