BLOODHOUNDBLOG.COM

There’s always something to howl about

Speaking in tongues: Parsing structured data on the fly

This is not ProjectBloodhound material, at least not first semester stuff. But if you find yourself running into highly structured data — such as the reports from a spreadsheet or a database application — you have the ability to easily manipulate that data in PHP.

This is a simple example, but you don’t have to limit yourself to doing simple things. Imagine a data structure like this:

Name[tab]Phone Number
Cathleen Collins[tab]602-369-9275
Greg Swann[tab]602-740-7531

In the file the code shown here as “[tab]” would be an actual tab character, and this kind of data goes by the arcane name of: A tab-delimited file.

Most programming languages were written by exacting people with abstract and elegant reasons for everything they did. PHP was written by overbooked programmers who needed to pound out new web pages as quickly as possible.

In consequence, PHP is optimized for dealing with highly structured data. Here is a short program that will take a tab-delimited phone number file as input and output reformatted phone numbers into the HTML stream. In other words, this code could produce a dynamically-updated phone list in what might otherwise be a static web page:

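<?PHP
// Recognize old Mac (\r) line endings; this setting is deprecated as of PHP 8.1
ini_set ("auto_detect_line_endings", true);

$fi = fopen ("PhoneNums.txt", "r");
$line = fgets ($fi, 4096); // throw away fieldDef line

echo ("<b>Phone Numbers</b><br>");

// fgets returns false at end of file, which ends the loop
while (($line = fgets ($fi, 4096)) !== false)
    {
    $line = rtrim ($line, "\r\n"); // strip the line ending

    if (strpos ($line, "\t") === false)
        {
        continue; // skip blank or malformed lines
        }

    list ($Name, $Phone_Number) = explode ("\t", $line);

    if ($Name)
        {
        echo ("$Phone_Number <i>($Name)</i><br>");
        }
    }

fclose ($fi);
?>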

There is one line that makes all the difference for this kind of work:

    list ($Name, $Phone_Number) = explode ("\t", $line);

The names between the parentheses are our known field names, and we’re using them as variable names for clarity’s sake. The explode function will create an array of separate fields from the text stored in the $line variable, splitting the fields on the tab character. The list construct then takes the array just created by explode and assigns each field, in order, to the matching field name variable. We only have two fields in this case, but I have a variation on these ideas that parses an MLS database that contains 213 fields per line of text.
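To see the two steps separately, here is a minimal sketch that runs explode and list on a single record from the sample data above. The intermediate $fields variable is introduced only for illustration; the one-line version does the same thing without it.

```php
<?php
// One record from the sample data, with a real tab character ("\t") between fields
$line = "Greg Swann\t602-740-7531";

// explode splits the text on tabs and returns an array of fields
$fields = explode ("\t", $line);

// list assigns the array elements, in order, to the named variables
list ($Name, $Phone_Number) = $fields;

echo ("$Phone_Number <i>($Name)</i><br>");
```

The order of the variables inside list has to match the order of the columns in the file, because the assignment is purely positional.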

Once we have the fields assigned to the right variables, it’s duck soup to represent the data in whatever format we wish. Alternatively, we could write a new file out to disk. The routine that parses the MLS database writes XML files to disk using a few dozen of the available fields — and throwing the rest away.

In fact, from here it’s very easy to write XML files, such as those used by Realty.bots. Say so if you want to see a demonstration.
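As a rough idea of what writing XML from parsed fields might look like, here is a sketch using the phone list. The element names (phonelist, entry, name, phone) and the output location are made up for this example, and it is not the MLS routine described above.

```php
<?php
// Sample records in the post's tab-delimited format (field definition line first)
$lines = array (
    "Name\tPhone Number",
    "Cathleen Collins\t602-369-9275",
    "Greg Swann\t602-740-7531",
);

array_shift ($lines); // throw away fieldDef line

// Element names are hypothetical, chosen only for this sketch
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<phonelist>\n";

foreach ($lines as $line)
    {
    list ($Name, $Phone_Number) = explode ("\t", $line);

    // escape &, < and > so the field text cannot break the XML
    $Name = htmlspecialchars ($Name, ENT_XML1, "UTF-8");
    $Phone_Number = htmlspecialchars ($Phone_Number, ENT_XML1, "UTF-8");

    $xml .= "  <entry><name>$Name</name><phone>$Phone_Number</phone></entry>\n";
    }

$xml .= "</phonelist>\n";

// Hypothetical output location; any writable path would do
$outFile = sys_get_temp_dir () . "/PhoneNums.xml";
file_put_contents ($outFile, $xml);
```

Building the XML as a string keeps the sketch close to the echo-based code above; for larger jobs, PHP's XML extensions may be a better fit.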

But there is a lot more that you can do with software like this. It’s common, when you get data that is almost what you want, to try to edit it in word processors or text editors. A parsing tool like this enables you to take complete control over the data, echoing it back as perfectly-formatted HTML or writing a formatted file out to disk.

    15 Comments so far

    1. Cheryl Johnson June 20th, 2008 3:41 am

      Greg,

      Sorry, some of us (me included) still need the version with training wheels. :-)

      So… I copy and paste the phone number data into a text editor and name the file PhoneNums.txt. OK, so far.

      Then I copy and paste the PHP code into … what? Into an HTML document? A blank document? And name it RunPhones.php?

      Then upload both files to a PHP-enabled server, and when I go to mysite.com/RunPhones.php, the “exploded” display will appear?

    2. Teri Lussier June 20th, 2008 5:16 am

      Cheryl-

      *That’s* training wheels?!?! Oh dear. I need the trike version. I’m off to Da Blog Mother archives.

    3. Cheryl Johnson June 20th, 2008 6:14 am

      Another question: Suppose I pasted the PHP code into a WP sidebar.php file… Would I need to change the PHP fopen line to the absolute file path, the complete URL of where PhoneNums.txt is located?

      (Obviously I haven’t yet got it working either way, or else I wouldn’t be asking :-) )

    4. Cheryl Johnson June 20th, 2008 6:20 am

      BINGO! http://www.nelanews.info/
      There, now ~I~ can write the training wheels version. Though I ‘pose, I could change PhoneNums.txt to my phone numbers,

    5. Cheryl Johnson June 20th, 2008 6:21 am

      Except it’s displaying the brackets and the word “tab”

    6. Greg Swann June 20th, 2008 7:24 am

      > Except its displaying the brackets and the word “tab”

      You’re almost there. You have to change [tab] to a real tab character, which I can’t show in a weblog post.

      Watch this:

      Date[tab] Time[tab] Time Zone[tab] Name[tab] Type[tab] Status[tab] Gross[tab] Fee[tab] Net[tab] From Email Address[tab] To Email Address[tab] Transaction ID[tab] Counterparty Status[tab] Address Status[tab] Item Title[tab] Item ID[tab] Shipping and Handling Amount[tab] Insurance Amount[tab] Sales Tax[tab] Option 1 Name[tab] Option 1 Value[tab] Option 2 Name[tab] Option 2 Value[tab] Auction Site[tab] Buyer ID[tab] Item URL[tab] Closing Date[tab] Escrow Id[tab] Invoice Id[tab] Reference Txn ID[tab] Invoice Number[tab] Custom Number[tab] Receipt ID[tab] Balance[tab] Address Line 1[tab] Address Line 2/District[tab] Town/City[tab] State/Province/Region/County/Territory/Prefecture/Republic[tab] Zip/Postal Code[tab] Country[tab] Contact Phone Number[tab]

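      That’s the field definition line if you download your activity report from PayPal. To name these as fields in PHP, we would put a dollar sign in front of each name, replace the spaces and slashes with underscores (shortening the longest name while we’re at it), then convert the tabs into comma-space, like this: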

      list ($Date, $Time, $Time_Zone, $Name, $Type,
      $Status, $Gross, $Fee, $Net, $From_Email_Address,
      $To_Email_Address, $Transaction_ID, $Counterparty_Status,
      $Address_Status, $Item_Title, $Item_ID,
      $Shipping_and_Handling_Amount, $Insurance_Amount,
      $Sales_Tax, $Option_1_Name, $Option_1_Value,
      $Option_2_Name, $Option_2_Value, $Auction_Site,
      $Buyer_ID, $Item_URL, $Closing_Date, $Escrow_Id,
      $Invoice_Id, $Reference_Txn_ID, $Invoice_Number,
      $Custom_Number, $Receipt_ID, $Balance, $Address_Line_1,
      $Address_Line_2_District, $Town_City, $State_Province,
      $Zip_Postal_Code, $Country, $Contact_Phone_Number) =
      explode ("\t", $line);

      From there, you have the ability to parse that report any way you want it.

      This

      list ($Name, $Artist, $Composer, $Album, $Grouping,
      $Genre, $tSize, $tTime, $Disc_Number, $Disc_Count,
      $Track_Number, $Track_Count, $tYear, $Date_Modified,
      $Date_Added, $Bit_Rate, $Sample_Rate, $Volume_Adjustment,
      $Kind, $Equalizer, $Comments, $Play_Count, $Last_Played,
      $Skip_Count, $Last_Skipped, $My_Rating, $Location) =
      explode ("\t", $line);

      will parse your iTunes library.

      Not all data is well-structured — Microsoft products, as you might expect, introduce dysfunctional crap into everything — but a lot of the data you’re going to run into on the web will come to you just this way — tab- or comma-delimited with a field definition line. A parser like the one shown in this post is an easy way to manipulate that data to any other purpose: Formatted display on the web, edited or reorganized in another file or rendered as XML for another piece of software to devour.
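When a file has dozens of columns, one alternative to typing out a long list is to let the field definition line name the fields for you. The sketch below is an assumption-laden variation, not the approach used in the post: it uses in-memory sample lines instead of a file, and the $fieldNames and $records variables are names invented here.

```php
<?php
// Sample data standing in for a downloaded report; a real file would be read with fgets
$lines = array (
    "Name\tPhone Number",
    "Cathleen Collins\t602-369-9275",
    "Greg Swann\t602-740-7531",
);

// Turn the field definition line into array keys: "Phone Number" becomes "Phone_Number"
$fieldNames = array ();
foreach (explode ("\t", array_shift ($lines)) as $label)
    {
    $fieldNames[] = preg_replace ("/[^A-Za-z0-9]+/", "_", trim ($label));
    }

$records = array ();
foreach ($lines as $line)
    {
    $fields = explode ("\t", $line);

    // array_combine needs the same number of keys and values, so skip short lines
    if (count ($fields) === count ($fieldNames))
        {
        $records[] = array_combine ($fieldNames, $fields);
        }
    }

echo ($records[1]["Phone_Number"] . "\n");
```

Each record is then an array keyed by field name, so a column can be referenced as $record["Phone_Number"] without counting positions.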

    7. Greg Swann June 20th, 2008 7:43 am

      > Then I copy and paste the PHP code into … what? Into an HTML document? A blank document? And name it RunPhones.php?

      Yes, or you could paste it in as a part of a standing PHP page. Inlookers: If you intend for PHP to “see” and process your code, the web page has to be named MyPageName.php. It’s okay if the page contains nothing but HTML, but the PHP parser will not act on any PHP within the page if it is named MyPageName.htm or MyPageName.html. It’s good practice to name all your new pages MyNewPage.php. That way PHP will be available to you now — even if you don’t need it — or later — when you might.

      Cheryl, here’s another way of thinking of this: You’re building “PhoneNums.txt” so that you or anyone — or a piece of software — can update the phone numbers without messing with the code. What if you were to isolate your call to the code, so that it can be changed without your having to change each file that references it? Instead of pasting in the code, you could save it to a separate file, then include it wherever you want it:

      <?PHP include ("http://MyServer.com/RunPhones.php"); ?>

      When you edit “RunPhones.php”, the changes will be reflected instantly everywhere you have “included” it. And, yes, in this circumstance, I would use an absolute path, as I’m showing here.
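A hedged note on the include: pulling in a file by URL only works where the server's allow_url_include setting is switched on, and what arrives is the output of the remote page rather than its source. Where both files live on the same server, including by filesystem path avoids that dependency. The sketch below writes a stand-in RunPhones.php to a temporary directory purely so the example can run on its own; on a real site the file would already exist.

```php
<?php
// Hypothetical stand-in for RunPhones.php, created here only to make the demo runnable
$shared = sys_get_temp_dir () . "/RunPhones.php";
file_put_contents ($shared, '<?php echo ("<b>Phone Numbers</b><br>"); ?>');

// Include by absolute filesystem path; output buffering captures what the file echoes
ob_start ();
include ($shared);
$html = ob_get_clean ();

echo ($html);
```

Any page that includes the shared file this way picks up edits to it on the next request.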

    8. Cheryl Johnson June 20th, 2008 7:43 am

      Oh. Duh on the tab thing.

      Here’s the part I don’t know nuthin’ about yet: If I want that PHP code to run somewhere other than just a WordPress sidebar … if I want it to do things locally just on my own computer …. I’m going to need to install a web server app on my machine, right?

    9. Greg Swann June 20th, 2008 8:09 am

      > If I want that PHP code to run somewhere other than just a WordPress sidebar …

      You can run PHP on any Apache web server, if the file is named FileName.php. A PHP file can contain any valid HTML, plus any valid PHP. The real purpose of PHP is to produce valid HTML at runtime. In other words, if you View Source in a PHP page (lots of them on my sites), you will never see anything except valid HTML, even though some, most or all of it will have been rendered at runtime by software. Revisit my discussion of a contributor’s blogroll as an example.

      > if I want it to do things locally just on my own computer …. I’m going to need to install a web server app on my machine, right?

      That’s right. I don’t know how to do this in the Windows world. It’s baked in the cake on any OS X Macintosh — you have to download PHP, but every Mac is an Apache web server out of the box. Even so, I almost never use localhost, not even for testing. I either edit locally and FTP to a server (we have one we use for testing) or just edit directly on the server. For this, you need an FTP client that integrates with a text editor that will in its turn open and save files directly from a file server. In a year or two, the integration between onsite and offsite storage will be complete and you will FTP in and out of your file servers just as if they were hard disks mounted on your desktop.

      It seems a little weird to go to a file server for everything, but I have three reasons for working this way. 1. I’ve lived most of my adult life using desktop compilers, and, while they are a lot more robust, they’re a big pain in the ass to work with, where quick-and-dirty PHP might be pretty damned dirty, but it’s pretty damned quick. 2. Almost everything I do now is bound for the web anyway, so there’s no reason to solve problems locally, just to solve them again on the web server. 3. Anything I do in PHP I can share with anyone else on Apache servers, without worrying about the pestilential virus known as Microsoft Windows.

      Cameron keeps telling me that I need to learn Ruby on Rails, and probably I do. PHP was written by recovering C programmers, so it slides right into my mind with no effort, which means I can punch things out without having to puzzle them out. For web-based programming, it seems like an optimal solution for me, right now: I can do anything I want — including something as complex as engenu, which is written entirely in PHP — and I can pound out bread-and-butter stuff with alacrity. I think PHP is worth knowing, and I don’t think there has ever been a better time for ordinary (non-geek) people to learn to write software.

    10. CJ, Broker in NELA, CA June 20th, 2008 10:18 am

      Aside: Years ago a C programmer told me that C was a garden of delight, then C++ came along and crapped on the flowers….

    11. CJ, Broker in NELA, CA June 20th, 2008 11:19 am

      Re MyPage.php …. OK … Got that working http://www.nelanews.info/testrun.php
      I haven’t tried yet, but I’m supposing I could specify a stylesheet in the header of that page, and the page would then echo that design?

    12. CJ, Broker in NELA, CA June 20th, 2008 11:31 am

      OK. Got it running as a single page. http://www.nelanews.info/testrun.php

      (No, I haven’t fixed the tab thing yet.) I haven’t tried yet, but I’m supposing if I specify a stylesheet in the header, the page will then echo that style, just like a regular html page?

    13. Greg Swann June 20th, 2008 11:45 am

      Check. By the time the http handler sees it, it’s all HTML. The contributor’s blogroll in the sidebar is inheriting the sidebar’s CSS. If I were to call that same routine from a post, it would look like a list in a post instead.

    14. Cheryl Johnson June 20th, 2008 7:02 pm

      Oh. My. Goodness. I’m slowly getting there:
      http://www.nelanews.info/SuperBowlHistory.php
      (no stylesheet yet – just working on concept)

      What if I wanted the data to populate a table?

    15. Greg Swann June 20th, 2008 7:25 pm

      You’ve got it. The CSS part is easy. HTML is HTML.

      > What if I wanted the data to populate a table?

      Easily done. Remember that the HTML surrounding your variables is just HTML. You would format a table the same way you would do it manually, but you only have to do the job once.

      For something like this, I would use a function:

      function TableRow ($Win, $WinScore, $Los, $LosScore, $Year)
          {
          echo ("<tr>");
          echo ("<td><strong>$Win</strong></td>");
          echo ("<td><strong>$WinScore</strong></td>");
          echo ("<td>$Los</td>");
          echo ("<td>$LosScore</td>");
          echo ("<td><em>($Year)</em></td>");
          echo ("</tr>\n");
          }
      

      This is more from the K&R world. When you call explode, you’re calling a function; it’s just built into PHP. (list looks like a function call, but it is part of the language itself.) We can create our own functions for isolating and simplifying repetitive jobs.

      Now instead of doing your echoes in your main loop, you would call

      TableRow ($Win, $WinScore, $Los, $LosScore, $Year);

      Pushing an ugly job like this off to the function makes the code easier to maintain and more self-documenting.
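Putting the pieces together, a main loop that fills a table might look roughly like this. The sketch assumes a tab-delimited layout of winner, winning score, loser, losing score and year, uses two sample lines in place of a file, and swaps in a hypothetical TableRowHtml that returns the row instead of echoing it so the whole table can be collected in one string.

```php
<?php
// Variant of TableRow that returns the row rather than echoing it
// (an assumption made for this sketch)
function TableRowHtml ($Win, $WinScore, $Los, $LosScore, $Year)
    {
    return "<tr>"
        . "<td><strong>$Win</strong></td>"
        . "<td><strong>$WinScore</strong></td>"
        . "<td>$Los</td>"
        . "<td>$LosScore</td>"
        . "<td><em>($Year)</em></td>"
        . "</tr>\n";
    }

// Sample lines; the field order is assumed, not taken from the actual data file
$lines = array (
    "Green Bay Packers\t35\tKansas City Chiefs\t10\t1967",
    "Green Bay Packers\t33\tOakland Raiders\t14\t1968",
);

$html = "<table>\n";
foreach ($lines as $line)
    {
    list ($Win, $WinScore, $Los, $LosScore, $Year) = explode ("\t", $line);
    $html .= TableRowHtml ($Win, $WinScore, $Los, $LosScore, $Year);
    }
$html .= "</table>\n";

echo ($html);
```

The table tags are written once outside the loop, and the function is called once per line of data.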