Next Page >>      More Spider Script Examples

How To Write Spiders for Mine The Web

This document is an easy-to-follow tutorial that will teach you how to create spiders for Mine The Web. Spiders are what Mine The Web uses to retrieve content for you. Mine The Web has a very flexible spider scripting language that lets you retrieve content from websites regardless of where that content sits within a webpage. By the end of this tutorial, you will see that content you once thought was difficult or even impossible to structure and retrieve from a website can be retrieved easily using Mine The Web.

Your Very First Mine The Web Spider

The key to Mine The Web's flexibility and ease of use is its spiders, and the key to creating spiders for Mine The Web is in knowing the spider scripting language. You will see from this first example that creating a spider that does a whole lot is actually extremely easy and straightforward using the spider scripting language.

For our first spider, go ahead and point your browser to the Mock Yahoo Finance page which shows archived data borrowed from Yahoo Finance. Take a look at the Market Summary column. Yup, you guessed it. In this example, we are going to retrieve the Dow, Nasdaq and S&P 500 quotes and store them in our own database. We are going to achieve this without using any XML at all, which is why Mine The Web is so powerful. It does not require data sources to be in XML format. Without further ado, here's the spider script that we'll need to retrieve those market quotes.
BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/quotes/index.htm

BEGIN INFORMATION

BEGIN ACTION
startafter:
bgcolor="white" border="0" cellpadding="0" cellspacing="0" width="100%"><tbody>|
endat:
              <!-- Others -->|
pattern:
              <tr align="right">
                <td class="yfnc_mktsumtxt" colspan="1" align="left" nowrap="nowrap"><a href="http://finance.yahoo.com/q?s=%5E$1">$2</a></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$3</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$4</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$5</td>
              </tr>
|
definition:
$1:SYMBOL:TEXT
$2:INDEX:TEXT
$3:CLOSING:TEXT
$4:CHANGE:TEXT:StripHTMLTags()
$5:CHANGEPCT:TEXT:StripHTMLTags()

BEGIN DO

Looks too simple to be true? It really is that simple. But before we explain what's going on in the script, we'll show you how to input this spider into Mine The Web so that Mine The Web can run it. For this, go to our demo page if you aren't already there. Go to the Spiders page and click on Add Spider. Give your spider (actually, it's OUR spider since we came up with this example :) a name in the Spider Name field, and copy and paste the above script into the Spider Script field. In the Destination Table field, enter the name of the database table where you would like Mine The Web to store the retrieved data. Make sure the name is a valid database table name. Then click the Submit button. You will be returned to the main Spiders page.

Now click on the Execute link next to the spider you just created, and voilà! You will see a page with the Yahoo! market quotes. Want to store that data in a database table for future use? Just click on the "Store in table ..." button. Remember that you do not have to create a new spider every time you want market updates from Yahoo!: just execute the same spider again tomorrow to get tomorrow's quotes. There's even a way to make the updates run automatically! More on that later.

Holy Crap, That Was Easy!

Now let's get to the down and dirty stuff, and the most exciting part of Mine The Web: the spider scripting language.

As you may have noticed from the script example above, a spider is made up of 4 sections: BEGIN HEADER, BEGIN INFORMATION, BEGIN ACTION, and BEGIN DO. Let's start with the first section, BEGIN HEADER.

All you really need to tell Mine The Web in the BEGIN HEADER section is the URL of the page or site that you want to retrieve content/data from. In the above example, we told Mine The Web that we want to retrieve content/data from http://shuetech.com/minetheweb/demo/docs/mocksites/quotes/index.htm, which is the Mock Yahoo Finance page.

We'll leave the BEGIN INFORMATION section empty for the moment.
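
To make the four-section layout concrete, here is a rough Python sketch of how a script could be split into its BEGIN sections and how the source: line could be read from the header. This is only our illustration, not how Mine The Web itself is implemented; the function names split_sections and source_url are made up for this example, and the sketch assumes that section headings always start a line with the word BEGIN.

```python
def split_sections(script):
    """Group the lines of a spider script under their BEGIN <NAME> headings."""
    sections = {}
    current = None
    for line in script.splitlines():
        if line.startswith("BEGIN "):
            # Assumption: a line starting with "BEGIN " always opens a section.
            current = line[len("BEGIN "):].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def source_url(sections):
    """Return the value of the source: line in the HEADER section, if any."""
    for line in sections.get("HEADER", []):
        if line.startswith("source:"):
            return line[len("source:"):].strip()
    return None


script = """BEGIN HEADER
source:http://shuetech.com/minetheweb/demo/docs/mocksites/quotes/index.htm

BEGIN INFORMATION

BEGIN ACTION
startafter:
<BODY>|
endat:
</BODY>|

BEGIN DO
"""

sections = split_sections(script)
print(list(sections))        # section names, in the order they appear
print(source_url(sections))  # the URL given in BEGIN HEADER
```

A sketch like this would misread a pattern line that happens to begin with the word BEGIN, so treat it as a reading aid rather than a real parser.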

The BEGIN ACTION section is where all of the fun stuff happens. To understand this section, we'll need you to view the HTML source of the Mock Yahoo Finance page. You usually do this by clicking on View->Source from your browser. Go ahead and do it. The BEGIN ACTION section consists of several instructions, such as startafter:, endat:, pattern:, and definition:. The startafter: and endat: instructions tell Mine The Web where in the source page's HTML the data you are interested in sits (the source page in this example is the Mock Yahoo Finance page). For example, if your source page looks like this:
<HTML>
<BODY>
Data line 1
Data line 2
Data line 3
</BODY>
</HTML>

and you want to retrieve the data lines between the opening <BODY> and closing </BODY> tags, then your startafter: and endat: instructions should look like this:

startafter:
<BODY>|
endat:
</BODY>|

The '|' character is used to terminate an instruction.
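
If it helps to think in code, startafter: and endat: behave much like cutting a slice out of a string. The Python sketch below is our own analogy, not Mine The Web's implementation. It assumes the first occurrence of each marker is the one that counts, which the tutorial does not state, and slice_between is a hypothetical helper name.

```python
def slice_between(html, start_after, end_at):
    """Return the text after the first start_after marker and before the next end_at marker."""
    start = html.find(start_after)
    if start == -1:
        return None  # start marker not found on the page
    start += len(start_after)
    end = html.find(end_at, start)
    if end == -1:
        return None  # end marker not found after the start marker
    return html[start:end]


page = "<HTML>\n<BODY>\nData line 1\nData line 2\nData line 3\n</BODY>\n</HTML>\n"

region = slice_between(page, "<BODY>", "</BODY>")
print(region.strip().splitlines())
# ['Data line 1', 'Data line 2', 'Data line 3']
```

Everything outside the two markers, including the markers themselves, is dropped, which is the behavior the startafter: and endat: example above describes.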

Now let's go back to the Yahoo Finance example. Look at the source of the http://shuetech.com/minetheweb/demo/docs/mocksites/quotes/index.htm page. With a little bit of sleuthing, you'll discover that the data we are interested in starts right after
bgcolor="white" border="0" cellpadding="0" cellspacing="0" width="100%"><tbody>
and ends right before
              <!-- Others -->
So to tell Mine The Web to look only at the HTML contents between those 2 strings and ignore everything else, we say
startafter:
bgcolor="white" border="0" cellpadding="0" cellspacing="0" width="100%"><tbody>|
endat:
              <!-- Others -->|

Make sense? No rocket science involved here. Mine The Web now knows that it should only be concerned with the contents between those 2 strings, which are
              <tr align="right">
                <td class="yfnc_mktsumtxt" colspan="1" align="left" nowrap="nowrap"><a href="http://finance.yahoo.com/q?s=%5EDJI">Dow</a></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">10,402.77</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">+172.82</span></td>

                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">(+1.69%)</span></td>
              </tr>
              <tr align="right">
                <td class="yfnc_mktsumtxt" colspan="1" align="left" nowrap="nowrap"><a href="http://finance.yahoo.com/q?s=%5EIXIC">Nasdaq</a></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">2,089.88</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">+26.07</span></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">(+1.26%)</span></td>

              </tr>
              <tr align="right">
                <td class="yfnc_mktsumtxt" colspan="1" align="left" nowrap="nowrap"><a href="http://finance.yahoo.com/q?s=%5EGSPC">S&amp;P 500</a></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">1,198.41</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">+19.51</span></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap"><span class="pos">(+1.65%)</span></td>

              </tr>
Notice that each <tr> holds one market quote. We want to extract the useful information within each <tr>...</tr> and discard the contents that we don't want such as HTML tags. The scripting language has a very flexible way of telling Mine The Web what it is that you want to keep and what you don't want to keep.
pattern:
              <tr align="right">
                <td class="yfnc_mktsumtxt" colspan="1" align="left" nowrap="nowrap"><a href="http://finance.yahoo.com/q?s=%5E$1">$2</a></td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$3</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$4</td>
                <td class="yfnc_mktsumtxt" nowrap="nowrap">$5</td>
              </tr>
|

The above tells Mine The Web to capture five pieces of each row: the text between %5E and the closing " of the link URL goes into $1, the link text inside the <a> tag goes into $2, and the contents of the three remaining <td> cells go into $3, $4 and $5. The definition: instruction then associates the variables $1 to $5 with useful names that will be used as field names when the data is stored in the database.
definition:
$1:SYMBOL:TEXT
$2:INDEX:TEXT
$3:CLOSING:FLOAT
$4:CHANGE:FLOAT:StripHTMLTags()
$5:CHANGEPCT:FLOAT:StripHTMLTags()

The definition instruction above is telling Mine The Web to store variable $1 in a table field named SYMBOL as a string data type, variable $2 in INDEX as a string, variable $3 in CLOSING as a fractional number, and so on. A variable name must be exactly one character long after the $, so $7 and $a, for example, are valid variable names, but $10 and $FOO are not.
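
To see how a pattern and its definition work together, here is a small Python sketch that turns a pattern with $ variables into a regular expression, runs it over some rows, and labels each captured value with its field name. It is an analogy under our own assumptions, not Mine The Web's actual matching engine: we assume each variable captures as little text as possible, and that a run of whitespace in the pattern matches any run of whitespace on the page. The rows are shortened versions of the tutorial's markup, and compile_pattern is a made-up helper name.

```python
import re


def compile_pattern(pattern):
    """Translate a pattern containing $x placeholders into a regex plus the variable list."""
    pieces = re.split(r"\$([0-9A-Za-z])", pattern)
    regex_parts, variables = [], []
    for index, piece in enumerate(pieces):
        if index % 2 == 1:
            # Odd positions hold the single character captured after '$'.
            variables.append("$" + piece)
            regex_parts.append("(.*?)")
        else:
            # Even positions hold literal markup. Assumption: whitespace in the
            # pattern matches any run of whitespace on the page.
            for token in re.split(r"(\s+)", piece):
                if token:
                    regex_parts.append(r"\s+" if token.isspace() else re.escape(token))
    return re.compile("".join(regex_parts), re.DOTALL), variables


pattern = '<tr>\n  <td><a href="q?s=%5E$1">$2</a></td>\n  <td>$3</td>\n</tr>'
region = (
    '<tr>\n  <td><a href="q?s=%5EDJI">Dow</a></td>\n\n  <td>10,402.77</td>\n</tr>\n'
    '<tr>\n  <td><a href="q?s=%5EIXIC">Nasdaq</a></td>\n  <td>2,089.88</td>\n</tr>\n'
)
definition = {"$1": "SYMBOL", "$2": "INDEX", "$3": "CLOSING"}

regex, variables = compile_pattern(pattern)
rows = [
    {definition[var]: value for var, value in zip(variables, match.groups())}
    for match in regex.finditer(region)
]
for row in rows:
    print(row)
# {'SYMBOL': 'DJI', 'INDEX': 'Dow', 'CLOSING': '10,402.77'}
# {'SYMBOL': 'IXIC', 'INDEX': 'Nasdaq', 'CLOSING': '2,089.88'}
```

Each match of the pattern yields one row, which mirrors the way each <tr> in the Market Summary becomes one record in the destination table.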

StripHTMLTags() is a modifier function that is executed for the CHANGE and CHANGEPCT fields to remove any HTML tags that may be in those variables before they are stored in the database. More on that later.
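
As a rough picture of what a modifier like StripHTMLTags() does to a captured value, here is a Python sketch that removes anything shaped like an HTML tag. It is our approximation only. Whether the real modifier also trims whitespace or decodes entities such as &amp; is not covered in this tutorial, and strip_html_tags is a hypothetical name.

```python
import re


def strip_html_tags(value):
    """Remove anything that looks like an HTML tag, then trim surrounding whitespace."""
    # Assumption: a tag is any run of text from '<' up to the next '>'.
    return re.sub(r"<[^>]*>", "", value).strip()


print(strip_html_tags('<span class="pos">+172.82</span>'))   # +172.82
print(strip_html_tags('<span class="pos">(+1.69%)</span>'))  # (+1.69%)
```

Without this step, the CHANGE and CHANGEPCT fields in our example would be stored with their <span> markup still attached.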

You can store data as text or fractional numbers as shown above using the TEXT and FLOAT keywords, or as round numbers using the keyword INT. Blobs are also supported (using the keyword BLOB). TEXT fields are limited to 255 characters. If you want to store anything more than 255 characters, blobs can come in handy. There is also a DATE data type for storing date fields.
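
The rules for definition lines can be summed up in a short Python sketch that reads one line, checks that the variable name is one character long, and checks that the data type is one of the keywords listed above. This is our own reading of those rules, not Mine The Web's parser; parse_definition_line and fits_text_field are made-up names, and we assume the 255-character limit counts characters exactly as Python's len() does.

```python
DATA_TYPES = {"TEXT", "FLOAT", "INT", "BLOB", "DATE"}
TEXT_LIMIT = 255  # maximum length of a TEXT field, per the tutorial


def parse_definition_line(line):
    """Split '$x:FIELD:TYPE[:Modifier()]' into its parts and validate them."""
    parts = line.strip().split(":", 3)
    if len(parts) < 3:
        raise ValueError("expected $variable:FIELD:TYPE, got %r" % line)
    variable, field, data_type = parts[:3]
    modifier = parts[3] if len(parts) == 4 else None
    if len(variable) != 2 or not variable.startswith("$"):
        raise ValueError("variable names are '$' plus exactly one character: %r" % variable)
    if data_type not in DATA_TYPES:
        raise ValueError("unknown data type: %r" % data_type)
    return {"variable": variable, "field": field, "type": data_type, "modifier": modifier}


def fits_text_field(value):
    """True if the value is short enough for a TEXT field; longer values need a BLOB."""
    return len(value) <= TEXT_LIMIT


print(parse_definition_line("$4:CHANGE:FLOAT:StripHTMLTags()"))
print(fits_text_field("x" * 256))  # False
```

Passing a line such as $10:FOO:TEXT to parse_definition_line raises a ValueError, matching the one-character rule for variable names.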

Lastly, the BEGIN DO line in the example simply tells Mine The Web to execute the script.

That's it. You've just created your very first spider script to retrieve specific content from within a chaos of HTML code. Play around with this example for a while and we'll move on to another example.
