More Spider Script Examples

Movie Listings, Anyone?

Here's our next example. MTW Cinema and Cafe is a make-believe movie theatre. We're going to use it to demonstrate how easy it is to collect data from a website. In this example we are going to introduce the BEGIN INFORMATION section and the @n variables. Go grab a cup of coffee.

MTW Cinema and Cafe

Visit the MTW Cinema and Cafe website and view the HTML source. We are going to retrieve the list of shows playing at the cinema. To do this, we have to find out where in the HTML source the movie listings appear. This is almost always a recurring "pattern" of HTML code. As we saw in the previous Yahoo! Quotes example, the market quotes were encapsulated within a repeating pattern of <tr>'s. See if you can identify the pattern in this example. The pattern is:
        <TR bgColor=#......>
          <TD></TD>
          <TD><B><FONT color=#befefe>Movie Name</FONT></B> Movie Length<BR>
            Show Times</TD>

The values that differ for each occurrence of the pattern are the bgColor value, Movie Name, Movie Length and Show Times. The first pattern occurs right after
      <TABLE class=normal cellSpacing=0 width="100%" align=center border=0>
        <TBODY>
and the last pattern is immediately followed by
</TR></TBODY></TABLE><BR></FONT></TR></TBODY></TABLE></TD></TR><TR><TD colspan="2">
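
To picture what these two boundaries do, here is a rough Python sketch (not Mine The Web's actual implementation) that cuts out the region between a start marker and an end marker. The function name slice_between and the tiny sample page are made up for illustration, and using the first occurrence of each marker is an assumption.

```python
def slice_between(html: str, start_after: str, end_at: str) -> str:
    """Return the text after the first start_after marker and before the next end_at marker."""
    # Move past the start marker itself; str.index raises ValueError if it is missing.
    start = html.index(start_after) + len(start_after)
    # Look for the end marker only after the start position.
    end = html.index(end_at, start)
    return html[start:end]


# Hypothetical miniature page, much simpler than the real cinema page.
page = "<html><TABLE><TBODY>\n<TR>row 1</TR>\n<TR>row 2</TR>\n</TBODY></TABLE><P>footer</P></html>"
region = slice_between(page, "<TABLE><TBODY>", "</TBODY></TABLE>")
print(region)
```

Only the text inside the region is then searched for the repeating pattern, which keeps look-alike HTML elsewhere on the page from being matched by accident.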

So far, our spider script looks like this:
BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/cinema/index.htm

BEGIN INFORMATION

BEGIN ACTION
startafter:
      <TABLE class=normal cellSpacing=0 width="100%" align=center border=0>
        <TBODY>
|
endat:
</TR></TBODY></TABLE><BR></FONT></TR></TBODY></TABLE></TD></TR><TR><TD colspan="2">|

BEGIN DO
Adding in the pattern: and definition: fields will result in:
BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/cinema/index.htm

BEGIN INFORMATION

BEGIN ACTION
startafter:
      <TABLE class=normal cellSpacing=0 width="100%" align=center border=0>
        <TBODY>
|
endat:
</TR></TBODY></TABLE><BR></FONT></TR></TBODY></TABLE></TD></TR><TR><TD colspan="2">|

pattern:
$1        <TR bgColor=\#$2>
          <TD></TD>
          <TD><B><FONT color=\#befefe>$3</FONT></B> $4<BR>
            $5</TD>
|
definition:
$1:
$2:
$3:TITLE:TEXT
$4:MOVIELENGTH:TEXT
$5:SHOWTIMES:TEXT:StripHTMLTags():StripTrailingWhitespace()

BEGIN DO
A little bit of explanation is necessary for the above listing. Why do the $1 and $2 lines not have field names? Well, if you specify a variable without a field name, Mine The Web will not store it in the database. This is useful when a piece of information in a pattern changes from one repetition to the next but you're not interested in keeping it. For example, the background color of each row ($2) is not always the same, but we have no use for it, so $2 gets no field name. The title of the movie also changes each time the pattern repeats, but we do want to store it in the database, so we give the $3 variable a field name (TITLE). If, however, you decided that you did not want the movie names to be stored in the database, you would not give the $3 variable any name.

There is another use for unnamed variables. Look at each pattern in the MTW Cinema HTML source, and you may notice that there isn't always the same number of empty lines between each pattern. Putting an unnamed variable $1 at the start of each pattern will "eat up" those empty lines. Otherwise, Mine The Web may not know where to start each pattern.
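
If it helps to see the idea in a general-purpose language, the following Python sketch mimics how a pattern with $n variables and a definition list could work. It is only an approximation under stated assumptions: the helper pattern_to_regex, the simplified pattern (no \# escapes) and the sample HTML are all invented for illustration, and Mine The Web's real matching rules may differ.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a pattern containing $1, $2, ... into a regex with one non-greedy group per variable."""
    # re.split with a capturing group alternates: literal, variable number, literal, ...
    parts = re.split(r"\$(\d+)", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))          # literal HTML, matched exactly
        else:
            pieces.append(f"(?P<v{part}>.*?)")      # variable, matches as little as possible
    return re.compile("".join(pieces), re.DOTALL)


# Simplified, hypothetical pattern: $1 soaks up blank lines, $2 is the row color.
pattern = "$1<TR bgColor=$2>\n<TD><B>$3</B> $4<BR>\n$5</TD>\n"
# Only named variables are kept; $1 and $2 are unnamed and therefore dropped.
names = {"3": "TITLE", "4": "MOVIELENGTH", "5": "SHOWTIMES"}

html = (
    "\n\n<TR bgColor=red>\n<TD><B>Alpha</B> 90 min<BR>\n1:00 3:00</TD>\n"
    "\n<TR bgColor=blue>\n<TD><B>Beta</B> 120 min<BR>\n2:00</TD>\n"
)

regex = pattern_to_regex(pattern)
records = [
    {name: match.group(f"v{number}") for number, name in names.items()}
    for match in regex.finditer(html)
]
print(records)
```

Note that the two sample rows are preceded by a different number of empty lines, yet both are matched, because the unnamed first variable absorbs whatever sits in front of each row.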

After the name and data type fields, you can optionally include up to 3 modifier functions. What are modifier functions? They process a variable's data before it is stored in the database. Looking again at the MTW Cinema HTML source, you'll see that the showtimes of each movie contain plenty of <i> tags. We don't want these tags stored in our database; we just want the raw showtimes text. So we use the StripHTMLTags() modifier function to strip off all HTML tags before the value is stored. The StripTrailingWhitespace() modifier function strips off any trailing whitespace.
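
As a loose analogy, here is what a chain of two modifier functions might look like in Python. The functions strip_html_tags and strip_trailing_whitespace are stand-ins written for this sketch; they are not Mine The Web's own code, and the order of application (left to right) is an assumption. The sample showtimes string is made up.

```python
import re


def strip_html_tags(value: str) -> str:
    """Remove anything that looks like an HTML tag, e.g. <i> or </i>."""
    return re.sub(r"<[^>]+>", "", value)


def strip_trailing_whitespace(value: str) -> str:
    """Remove spaces, tabs and newlines from the end of the value."""
    return value.rstrip()


def apply_modifiers(value, modifiers):
    """Apply each modifier in turn, feeding the output of one into the next."""
    for modify in modifiers:
        value = modify(value)
    return value


raw = "<i>1:00</i> <i>3:30</i> <i>7:15</i>  \n"
showtimes = apply_modifiers(raw, [strip_html_tags, strip_trailing_whitespace])
print(repr(showtimes))
```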

Here's a list of all the modifier functions that are available in Mine The Web's spider scripting language:

BEGIN INFORMATION

What if for every database record that the spider creates, we also want to store the name of the cinema and its location? This information is not present within the pattern. No problem, we can use the BEGIN INFORMATION section to achieve this. Let's extend our spider script to include some spider script in the BEGIN INFORMATION section:

BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/cinema/index.htm

BEGIN INFORMATION
CINEMA:TEXT:MTW Cinema and Cafe
LOCATION:TEXT:Boston

BEGIN ACTION
startafter:
      <TABLE class=normal cellSpacing=0 width="100%" align=center border=0>
        <TBODY>
|
endat:
</TR></TBODY></TABLE><BR></FONT></TR></TBODY></TABLE></TD></TR><TR><TD colspan="2">|

pattern:
$1        <TR bgColor=\#$2>
          <TD></TD>
          <TD><B><FONT color=\#befefe>$3</FONT></B> $4<BR>
            $5</TD>
|
definition:
$1:
$2:
$3:TITLE:TEXT
$4:MOVIELENGTH:TEXT
$5:SHOWTIMES:TEXT:StripHTMLTags():StripTrailingWhitespace()

BEGIN DO
This tells Mine The Web to create 2 additional fields for every record stored, called CINEMA and LOCATION, which will both be stored as text. For each occurrence of the pattern defined in the BEGIN ACTION section, Mine The Web also stores the values 'MTW Cinema and Cafe' and 'Boston' in the fields CINEMA and LOCATION.
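
In Python terms, the effect is roughly that of merging a fixed set of fields into every record. This is only a sketch of the outcome, not of how Mine The Web works internally; the movie entries are placeholders.

```python
# Constant fields, as they might come from the BEGIN INFORMATION section.
information = {"CINEMA": "MTW Cinema and Cafe", "LOCATION": "Boston"}

# Placeholder records, one per occurrence of the pattern.
matches = [
    {"TITLE": "Example Movie A", "MOVIELENGTH": "90 min", "SHOWTIMES": "1:00 3:00"},
    {"TITLE": "Example Movie B", "MOVIELENGTH": "120 min", "SHOWTIMES": "2:00"},
]

# Every stored record carries the same CINEMA and LOCATION values.
records = [{**information, **match} for match in matches]
for record in records:
    print(record)
```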

Now, what if we want to store the movie listing's expiry date in each record? This information is shown at the top of the MTW Cinema page right above the movie listings. This piece of information is the same for each repetition of the pattern, just like the BEGIN INFORMATION example above. However, the difference is that we do not know the value for this piece of information in advance. It can change from week to week. In the previous example, we knew in advance what the name of the cinema is and where it is located (there's only one MTW Cinema in the world). We don't have the same luxury in this case. So what do we do? Simple, use defpatterns.

Think of defpatterns as miniature BEGIN ACTION sections, or sub BEGIN ACTION sections. defpatterns are defined within the @ and ^@ characters. Take a look at this script:
BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/cinema/index.htm

BEGIN INFORMATION
CINEMA:TEXT:MTW Cinema and Cafe
LOCATION:TEXT:Boston
SHOWINGUNTIL:TEXT:@1

BEGIN ACTION
startafter:
      <TABLE class=normal cellSpacing=0 width="100%" align=center border=0>
        <TBODY>
|
endat:
</TR></TBODY></TABLE><BR></FONT></TR></TBODY></TABLE></TD></TR><TR><TD colspan="2">|

pattern:
$1        <TR bgColor=\#$2>
          <TD></TD>
          <TD><B><FONT color=\#befefe>$3</FONT></B> $4<BR>
            $5</TD>
|
definition:
$1:
$2:
$3:TITLE:TEXT
$4:MOVIELENGTH:TEXT
$5:SHOWTIMES:TEXT:StripHTMLTags():StripTrailingWhitespace()
@1:
defstartafter:
Valid Until : <B>|
defendat:
</B><BR>|
^@

BEGIN DO

The SHOWINGUNTIL line in the BEGIN INFORMATION section tells Mine The Web that the value for that field is determined by the @1 defpattern. The @1 defpattern itself is defined in the BEGIN ACTION section between the @1: and ^@ lines.
defstartafter:
Valid Until : <B>|
defendat:
</B><BR>|
Can you figure out what's going on there? Very simple. All the above 4 lines say is that the value for the SHOWINGUNTIL (@1) field is between the "Valid Until : <B>" and "</B><BR>" text in the HTML source, which is where you'll find the listing's expiry date.
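
The same lookup can be imitated in a few lines of Python. This is a hedged sketch only: extract_between and resolve_information are hypothetical helpers, the page fragment and the SAMPLE DATE value are invented, and treating a field value of "@1" as a reference to a defpattern is this sketch's reading of the script above.

```python
def extract_between(html, start_after, end_at):
    """Return the text between the two markers, or None if either marker is missing."""
    start = html.find(start_after)
    if start == -1:
        return None
    start += len(start_after)
    end = html.find(end_at, start)
    if end == -1:
        return None
    return html[start:end]


def resolve_information(information, defpatterns, html):
    """Replace values such as "@1" with the text found by the matching defpattern."""
    resolved = {}
    for field, value in information.items():
        if value in defpatterns:
            start_after, end_at = defpatterns[value]
            resolved[field] = extract_between(html, start_after, end_at)
        else:
            resolved[field] = value  # plain constant, kept as-is
    return resolved


# Invented page fragment; the real page holds an actual date here.
page = "<P>Valid Until : <B>SAMPLE DATE</B><BR></P>"
information = {"CINEMA": "MTW Cinema and Cafe", "LOCATION": "Boston", "SHOWINGUNTIL": "@1"}
defpatterns = {"@1": ("Valid Until : <B>", "</B><BR>")}

fields = resolve_information(information, defpatterns, page)
print(fields)
```

The value is looked up once per page and then repeated in every record, just like the constant CINEMA and LOCATION fields.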

That's all there is to it. You now have a very good understanding of Mine The Web's spider scripting language. The next pages of this tutorial will cover the rest of Mine The Web's spider scripting language.
