ImageHost Grabber Host File Editor Manual
Written By: Matthew McMullen
Revised: January 09, 2010
CONTENTS
Introduction
Regular Expressions
    Special Characters in Regular Expressions
    Applying the Concepts
    Further Instruction in Regular Expressions
The Interface
    URL Pattern
    Search Pattern
    The “ID” Directive
    The “REPLACE” Directive
    Using the Regular Expressions
    Using a Function to Handle the Search
INTRODUCTION
The purpose of this manual is to give you enough understanding of regular expressions and of my host file editor to get you on the road to adding your own hosts. The following section is a brief introduction to regular expressions. You will not become a master simply by reading through that section, which is why I provide external links for further instruction. What the section will do is help you understand what regular expressions are and why they are used. The section after that explains the interface of the host file editor. It covers, in depth, what each field is and how to format it.
REGULAR EXPRESSIONS
“A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids” (Regular-Expressions.info). When dealing with web pages, regular expressions are particularly useful for finding an element that follows a pattern but has a few changing parts. These changes may occur with each page load, or with each successive instance of that element. For example, consider the following set of URLs:
http://img16.imagevenue.com/img.php?image=99340_DSC01455_122_790lo.JPG
http://img177.imagevenue.com/img.php?image=99348_DSC01424_122_441lo.JPG
http://img232.imagevenue.com/img.php?image=99352_DSC01433_122_70lo.JPG
http://img109.imagevenue.com/img.php?image=99356_DSC01434_122_656lo.JPG
http://img20.imagevenue.com/img.php?image=99365_DSC01426_122_645lo.JPG
The element in this case is the URL, and different parts of it change from one URL to the next. Another way to think of it is that the URL changes for each picture that the links point to. The dynamic (changing) parts are the number that follows img and the file name that follows image=, whereas the rest of the URL is static (unchanging). If you were going to write a script to do something with these URLs (like visit each URL and download the target image), you would have two options:
1. Write your own code to handle the searching; or
2. Use regular expressions
Using regular expressions is the better choice because they offer a very powerful method for pattern searching. Any code you wrote yourself would have to be very sophisticated to match the abilities of regular expressions.
Consider now the same example from above, but this time we’ll start with a simpler task. Let’s say you want to find only the URLs above and nothing else. First, pretend that the URLs are located in a web page where the HTML for the web page may look something like:
Logically, you want to limit your search to fit a specific pattern, including both static and dynamic parts. This may look something like:
http://img*.imagevenue.com/img.php?image=*
where the * represents the dynamic part. For the actual search, what you would want to tell the search engine is: “find everything that starts with http://img followed by a dynamic part followed by .imagevenue.com/img.php?image= followed by a dynamic part.” This would ensure that you didn’t get any of the other HTML code or any of the URLs to the thumbnails.
Now, regular expressions work on this kind of thinking, but not using the same notation that I used above. Regular expressions allow you to specify what type or types of character the dynamic part is, and how long the dynamic part is. It is for this reason that a simple * will not suffice as the dynamic representation.
SPECIAL CHARACTERS IN REGULAR EXPRESSIONS
As mentioned earlier, regular expressions allow you to specify the type or types of character the dynamic part is, and how long the dynamic part is. In the world of regular expressions, the dynamic parts are known as special characters and will be referred to as such. A complete list of the special characters used in regular expressions is shown below in Table 1.
Character
Meaning
\
Either of the following:
For characters that are usually treated literally, indicates that the next character is special and not to be interpreted literally. For example, /b/ matches the character 'b'. By placing a backslash in front of b, that is by using /\b/, the character becomes special to mean match a word boundary. For characters that are usually treated specially, indicates that the next character is not special and should be interpreted literally. For example, * is a special character that means 0 or more occurrences of the preceding item should be matched; for example, /a*/ means match 0 or more a's. To match * literally, precede it with a backslash; for example, /a\*/ matches 'a*'.
^
Matches beginning of input. If the multiline flag is set to true, also matches immediately after a line break character. For example, /^A/ does not match the 'A' in "an A", but does match the first 'A' in "An A".
$
Matches end of input. If the multiline flag is set to true, also matches immediately before a line break character. For example, /t$/ does not match the 't' in "eater", but does match it in "eat".
*
Matches the preceding character 0 or more times. For example, /bo*/ matches 'boooo' in "A ghost booooed" and 'b' in "A bird warbled", but nothing in "A goat grunted".
+
Matches the preceding character 1 or more times. Equivalent to {1,}. For example, /a+/ matches the 'a' in "candy" and all the a's in "caaaaaaandy".
?
Matches the preceding character 0 or 1 time. For example, /e?le?/ matches the 'el' in "angel" and the 'le' in "angle." If used immediately after any of the quantifiers *, +, ?, or {}, makes the quantifier non-greedy (matching the minimum number of times), as opposed to the default, which is greedy (matching the maximum number of times). For example, applying /\d+/ (without the global flag) to "123abc" returns "123", whereas /\d+?/ matches only "1". Also used in lookahead assertions, described under x(?=y) and x(?!y) in this table.
.
(The decimal point) matches any single character except the newline character. For example, /.n/ matches 'an' and 'on' in "nay, an apple is on the tree", but not 'nay'.
(x)
Matches 'x' and remembers the match. These are called capturing parentheses. For example, /(foo)/ matches and remembers 'foo' in "foo bar." The matched substring can be recalled from the resulting array's elements [1], ..., [n].
(?:x)
Matches 'x' but does not remember the match. These are called non-capturing parentheses. The matched substring can not be recalled from the resulting array's elements [1], ..., [n].
x(?=y)
Matches 'x' only if 'x' is followed by 'y'. For example, /Jack(?=Sprat)/ matches 'Jack' only if it is followed by 'Sprat'. /Jack(?=Sprat|Frost)/ matches 'Jack' only if it is followed by 'Sprat' or 'Frost'. However, neither 'Sprat' nor 'Frost' is part of the match results.
x(?!y)
Matches 'x' only if 'x' is not followed by 'y'. For example, /\d+(?!\.)/ matches a number only if it is not followed by a decimal point. The regular expression /\d+(?!\.)/.exec("3.141") matches '141' but not '3.141'.
x|y
Matches either 'x' or 'y'. For example, /green|red/ matches 'green' in "green apple" and 'red' in "red apple."
{n}
Where n is a positive integer. Matches exactly n occurrences of the preceding character. For example, /a{2}/ doesn't match the 'a' in "candy," but it matches all of the a's in "caandy," and the first two a's in "caaandy."
{n,}
Where n is a positive integer. Matches at least n occurrences of the preceding character. For example, /a{2,}/ doesn't match the 'a' in "candy", but matches all of the a's in "caandy" and in "caaaaaaandy."
{n,m}
Where n and m are positive integers. Matches at least n and at most m occurrences of the preceding character. For example, /a{1,3}/ matches nothing in "cndy", the 'a' in "candy," the first two a's in "caandy," and the first three a's in "caaaaaaandy" Notice that when matching "caaaaaaandy", the match is "aaa", even though the original string had more a's in it.
[xyz]
A character set. Matches any one of the enclosed characters. You can specify a range of characters by using a hyphen. For example, [abcd] is the same as [a-d]. They match the 'b' in "brisket" and the 'c' in "ache".
[^xyz]
A negated or complemented character set. That is, it matches anything that is not enclosed in the brackets. You can specify a range of characters by using a hyphen. For example, [^abc] is the same as [^a-c]. They initially match 'r' in "brisket" and 'h' in "chop."
[\b]
Matches a backspace. (Not to be confused with \b.)
\b
Matches a word boundary, such as a space or a newline character. (Not to be confused with [\b].) For example, /\bn\w/ matches the 'no' in "noonday";/\wy\b/ matches the 'ly' in "possibly yesterday."
\B
Matches a non-word boundary. For example, /\w\Bn/ matches 'on' in "noonday", and /y\B\w/ matches 'ye' in "possibly yesterday."
\cX
Where X is a control character. Matches a control character in a string. For example, /\cM/ matches controlM in a string.
\d
Matches a digit character. Equivalent to [0-9]. For example, /\d/ or /[0-9]/ matches '2' in "B2 is the suite number."
\D
Matches any non-digit character. Equivalent to [^0-9]. For example, /\D/ or /[^0-9]/ matches 'B' in "B2 is the suite number."
\f
Matches a form-feed.
\n
Matches a linefeed.
\r
Matches a carriage return.
\s
Matches a single white space character, including space, tab, form feed, line feed. Equivalent to [ \f\n\r\t\v\u00A0\u2028\u2029]. For example, /\s\w*/ matches ' bar' in "foo bar."
\S
Matches a single character other than white space. Equivalent to [^ \f\n\r\t\v\u00A0\u2028\u2029]. For example, /\S\w*/ matches 'foo' in "foo bar."
\t
Matches a tab.
\v
Matches a vertical tab.
\w
Matches any alphanumeric character including the underscore. Equivalent to [A-Za-z0-9_]. For example, /\w/ matches 'a' in "apple," '5' in "$5.28," and '3' in "3D."
\W
Matches any non-word character. Equivalent to [^A-Za-z0-9_]. For example, /\W/ or /[^A-Za-z0-9_]/ matches '%' in "50%."
\n
Where n is a positive integer. A back reference to the last substring matching the n parenthetical in the regular expression (counting left parentheses). For example, /apple(,)\sorange\1/ matches 'apple, orange,' in "apple, orange, cherry, peach."
\0
Matches a NUL character. Do not follow this with another digit.
\xhh
Matches the character with the code hh (two hexadecimal digits)
\uhhhh
Matches the character with the code hhhh (four hexadecimal digits).
Table 1 – Special Characters in Regular Expressions (adapted from the Mozilla Developer Center)
Although it may be confusing what all these special characters mean at this point, just realize that regular expressions give you the flexibility to define precisely how the search pattern should look. Further explanation of how to use these special characters will come later.
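A few of the entries in Table 1 can be tried directly in any JavaScript console. The following is only a small sketch built from the table's own examples; the variable names are made up for illustration.

```javascript
// Greedy versus non-greedy quantifier (see the "?" entry).
const greedy = "123abc".match(/\d+/)[0];   // takes as many digits as possible
const lazy = "123abc".match(/\d+?/)[0];    // takes as few digits as possible

// Negative lookahead (see the "x(?!y)" entry).
const notBeforeDot = /\d+(?!\.)/.exec("3.141")[0];

// Capturing parentheses and a back reference (see the "(x)" and "\n" entries).
const backRef = /apple(,)\sorange\1/.exec("apple, orange, cherry, peach.");

console.log(greedy, lazy, notBeforeDot);
console.log(backRef[0], "| remembered:", backRef[1]);
```

Here backRef[1] holds the text remembered by the first pair of parentheses, which is the same text that \1 refers to inside the pattern.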
One very important thing to note at this point is that some of the special characters are commonly used characters, especially in URLs. For instance, the period is used to separate hosts from domains in the first part of the URL. In regular expressions, the period represents a single character to be matched in the search pattern. In the previous example, you would be interested in matching the literal character instead of the regular expression interpretation. In order to do this, you have to escape the character.
Escaping a character simply means telling the search engine that you intend to search for the literal character instead of having it interpret the special meaning. To escape a character, you precede it with a backslash (\). So, following the previous example where you are searching for the literal period, you would put “\.” in place of the period.
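The difference an escaped period makes can be seen in a short sketch. The test strings here are invented for the example.

```javascript
const unescaped = /img.php/;  // "." stands for any single character
const escaped = /img\.php/;   // "\." stands for a literal period only

const looseHit = unescaped.test("imgXphp");  // the "X" satisfies the "."
const strictMiss = escaped.test("imgXphp");  // no literal period, so no match
const strictHit = escaped.test("img.php");

console.log(looseHit, strictMiss, strictHit);
```

The unescaped pattern still matches real URLs, but it also matches text that you did not intend, which is why escaping is the safer habit.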
APPLYING THE CONCEPTS
Referring back to the original example, the task now is to replace the simplified search pattern,
http://img*.imagevenue.com/img.php?image=*
with the regular expression equivalent. The first thing to note is that there are characters that will need to be escaped: the forward slash (/), the period (.), and the question mark (?). Secondly, you need to determine how you are going to define the special characters. What you will need to know is what kind of characters they are and how long they should be.
For the first part, the characters appear to be digits. It’s not known exactly how many digits there may be, but you can assume that there will be at least one. So how do we handle this? Well, the plus sign (+) is exactly what this is used for. The plus sign matches the preceding character (or special character) one or more times. If you put a special character such as [0-9] before the plus sign, it will match a run of that type of character, like digits, and the match then continues with whatever follows the plus sign in the pattern, like the period.
For the last part, it appears that the type and number of the characters vary. To handle this, we’ll use a period followed by a plus sign. Recall that the period matches any single character except a newline character. So what this amounts to is that it will match any run of one or more characters. Putting it all together, the following is the regular expression equivalent.
http:\/\/img[0-9]+\.imagevenue\.com\/img\.php\?image=.+
What the regular expression is telling the search engine is:
“Find all matches for the search pattern of http: followed by two literal forward slashes followed by img followed by one or more digits 0 through 9 followed by a literal period followed by imagevenue followed by a literal period followed by com followed by a literal forward slash followed by img followed by a literal period followed by php followed by a literal question mark followed by image= followed by a set of characters of any length”
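To check the finished expression, it can be run against a few sample URLs. This is a minimal sketch: the first two URLs come from the example list earlier in this manual, while the third is a thumbnail-style URL made up here to show a non-match.

```javascript
const urlPattern = /http:\/\/img[0-9]+\.imagevenue\.com\/img\.php\?image=.+/;

const samples = [
  "http://img16.imagevenue.com/img.php?image=99340_DSC01455_122_790lo.JPG",
  "http://img232.imagevenue.com/img.php?image=99352_DSC01433_122_70lo.JPG",
  // Hypothetical thumbnail URL, invented for this example.
  "http://img16.imagevenue.com/loc790/th_99340_DSC01455_122_790lo.JPG"
];

// true for every URL that fits the pattern, false otherwise
const matches = samples.map(function (url) {
  return urlPattern.test(url);
});

console.log(matches);
```

The first two entries fit the pattern and the invented thumbnail URL does not, because it lacks the /img.php?image= part.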
FURTHER INSTRUCTION IN REGULAR EXPRESSIONS
Thus far, the goal was to familiarize you with regular expressions. However, I probably did not provide you with enough information and instruction to begin using regular expressions on your own. Therefore, I will now provide you with two very useful links to further your instruction. The first one is the very good tutorial I used when I started writing ImageHost Grabber.
http://www.regular-expressions.info/
It gives a very in-depth walk-through of regular expressions. The knowledge you obtain will assist you in programming and non-programming environments. To begin, look for the menu located at the top-left part of the page. Click on the link titled “Tutorial.” You won’t have to go through the entire tutorial to begin using regular expressions. There is even a link to a nice text editor that uses regular expressions to assist you in the learning process.
The second link is something you should look at after you finish the tutorial. It provides a complete reference to the regular expressions used in Mozilla.
https://developer.mozilla.org/en/Core_JavaScript_1.5_Guide/Regular_Expressions
It assumes that you already have a basic understanding of JavaScript and regular expressions. However, you don’t need to understand JavaScript to utilize the reference.
THE INTERFACE
Perhaps the easiest way to get started is to look at the existing entries and see how they work. Figure 1 points out the three important parts of the interface.
Figure 1 – Host File Editor Main Window (callouts: Host Label, URL Pattern, Search Pattern)
The “Host Label” is a unique identifying label for the entry. Something needs to be there, but it doesn’t matter what, although it is a good idea to use an appropriate label for the entry. The “URL Pattern” is the pattern that the links to the image host follow. Typically these are the links that are provided in a forum. A link is usually associated with a thumbnail, although it doesn’t have to be for the pattern to work. The “Search Pattern” is the search pattern used to find the actual image. There are additional methods provided besides regular expressions to obtain the image source. These methods will be discussed later.
URL PATTERN
The URL Pattern describes the pattern for a particular host. The pattern will be in the form of a regular expression. Simply stated, it is the pattern that the image host uses for the links to the pages where the full size images are located. You can find this in several ways. First, you could look at the source for the current page you are looking at that contains the thumbnails. Then scroll down to find the actual link that the thumbnail points to. The following example illustrates this method.
Figure 2, shown below, shows what the web page might look like:
Figure 2 – Example Web Page
The following is the HTML source for the web page shown in Figure 2:




The actual link is highlighted in the above HTML source. All links start with the “a” tag and the actual URL is encased within the “href” attribute. If you’re working with a large page with a lot of content, this is probably the worst way to find the URL pattern.
The second method is to look at the link properties. In Firefox, you can look at the link properties by right-clicking on the thumbnail (or link) and choosing “Properties.” You will then be presented with the properties window as shown below in Figure 3. You will probably have to resize it to see the full link. The part you are interested in is the address, which is where the link takes you.
Figure 3 – Element Properties Window
The third method is to obtain the link from the address bar after visiting the link, as demonstrated in Figure 4. Be cautious of this method because the original link located on the original page (i.e. the forum) may re-direct you to another page. In this case, the link you obtain from the address bar may not match the original link and thus, the URL pattern you construct may not find anything on the original page. To get around this, you can use the first two methods described above.
Figure 4 – Address Bar Example
Once the link has been obtained, it is just a matter of constructing the regular expression to work with the URL pattern. It is a good idea to take a look at the existing entries to see how they work. Play around with it a little and see what you can come up with.
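As a rough illustration of what a URL pattern has to accomplish, the sketch below pulls matching links out of a small piece of forum HTML. The HTML is invented for this example, and running the pattern against raw HTML is an assumption made here; ImageHost Grabber may apply the pattern to the links in a different way.

```javascript
// Hypothetical forum HTML; the thumbnail path and the second link are invented.
const pageHtml =
  '<a href="http://img16.imagevenue.com/img.php?image=99340_DSC01455_122_790lo.JPG">' +
  '<img src="http://img16.imagevenue.com/loc790/th_99340.JPG"></a>' +
  '<a href="http://www.example.com/other.html">unrelated link</a>';

// [^"]+ is used in place of .+ so that each match stops at the closing quote
// of the href attribute instead of running on through the rest of the HTML.
const urlPattern = /http:\/\/img[0-9]+\.imagevenue\.com\/img\.php\?image=[^"]+/g;

// With the g flag, match() returns every matching substring (or null if none).
const foundLinks = pageHtml.match(urlPattern) || [];

console.log(foundLinks);
```

Only the link to the full-size image page is found; the thumbnail source and the unrelated link are ignored.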
SEARCH PATTERN
This part is the trickiest of the process. To begin, you need to understand a little bit of how the program works to know what your limitations are. First, the program visits each link (found using the URL pattern) and saves the page data for each link in memory. Then, for each link, the program performs one of four actions based on the type of the search pattern. The types of search patterns are outlined below. One important note before continuing: the first three types of search patterns have to be enclosed in double quotes. The quotes signify that ImageHost Grabber should interpret the search pattern rather than execute it. If you leave out the quotes, ImageHost Grabber assumes that you are specifying a function.
1. The search pattern is an “ID” directive
2. The search pattern is a “REPLACE” directive
3. The search pattern is a regular expression
4. The search pattern is a JavaScript function
Each search pattern is associated with its own method of handling the page data for each link. For the “ID” directive, the program searches within the page data for an image tag with an “id” attribute. The value associated with the “id” attribute is specified in the search pattern. For the “REPLACE” directive, the program replaces a part of the original link with a replacement part. The part to be replaced and the replacement part are specified in the search pattern.
For the regular expression, the program parses the page data to find every image tag. Every image tag is then searched for a match for the regular expression given in the search pattern. Note that it searches the entire tag, not just the attribute values. Finally, for the JavaScript function, the program executes the function given in the search pattern. This function should return an object to the program, and based on the contents of that object, the program will perform one of four actions. The next few sections explain in detail how to specify and use the four types of search patterns.
THE “ID” DIRECTIVE
The format of using the “ID” directive is as follows:
“ID: some_id”
The program is very specific on how the directive is specified. The first part must be capitalized, followed by a colon and a single space. Then the id is specified. So at this point, perhaps you are wondering what this whole id business is. First, a little background knowledge is needed.
When programming in javascript (or any other language), it is possible to manipulate a web page from the program side of things. To do that, however, a reference is needed as to where the element of interest is. The simplest method is to give an element a unique id. It is given in the form of an attribute called the id attribute. It is the quickest method to gain reference to an element that you are trying to manipulate.
In the image host side of things, typically this looks like the following:
onClick="showOnclick()" onLoad="scaleImg()" SRC="aAfkjfp01fo1i27778/loc790/99340_DSC01455_122_790lo.JPG" >
This snippet was taken directly from the example that has been ongoing through this manual. Notice, though, the part that says id=”thepic”. In the case of adding a new host, this is the easiest way to gain a direct reference to the image source. So what I would put into the “Search Pattern” box is the following:
“ID: thepic”
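To make the idea concrete, here is a minimal sketch of how a program could locate a tag by its id and read the image source from it. The helper findSrcById and the sample page fragment are hypothetical and are not taken from ImageHost Grabber itself, which may do this differently.

```javascript
// Hypothetical helper, not part of ImageHost Grabber.
// Assumes the id contains no characters that are special in regular expressions.
function findSrcById(pageData, id) {
  // Find a tag whose id attribute has the requested value.
  const tagPattern = new RegExp("<[^>]*\\bid\\s*=\\s*([\"'])" + id + "\\1[^>]*>", "i");
  const tag = pageData.match(tagPattern);
  if (!tag) return null;

  // Read the value of the src attribute from the tag that was found.
  const src = tag[0].match(/\bsrc\s*=\s*(["'])(.+?)\1/i);
  return src ? src[2] : null;
}

// Invented page fragment modeled on the image tag described above.
const page = '<p>text</p><img id="thepic" onLoad="scaleImg()" ' +
  'SRC="loc790/99340_DSC01455_122_790lo.JPG">';

const imageSrc = findSrcById(page, "thepic");
const noSuchId = findSrcById(page, "otherpic");

console.log(imageSrc, noSuchId);
```

When no tag carries the requested id, the helper returns null, which is the situation where a different kind of search pattern would be needed.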
THE “REPLACE” DIRECTIVE
The format of using the “REPLACE” directive is as follows:
“REPLACE: ‘reg exp’, ‘replacement’”
Once again, the program is very picky on how it receives the directive. It must start with capitalized REPLACE followed by a colon followed by a space followed by a regular expression encompassed in single quotes followed by a comma followed by a space followed by the replacement text (not a regular expression) encompassed in single quotes.
So why do I have this option in my program? The option is there because some sites will host the actual image in a location easily determined by removing or adding parts to the original URL. Take for example the host “10pix.com.” It does exactly what I just explained.
Here is the URL to an image that is hosted on “10pix.com”:
http://www.10pix.com/show.php/183068_MILF.KAMA.SUTRA.picture.001.jpg.html
The actual image source is:
http://www.10pix.com/out.php/i183068_MILF.KAMA.SUTRA.picture.001.jpg
Now notice how they just change things around a little bit? So why bother going through the process of constructing a regular expression when we can just replace a few things here and there. Well, here’s what we would do. First, replace show.php with out.php. Then remove the .html at the end. Then add an i before the filename. No problem. Here’s what that looks like:
"REPLACE: 'show\.php\/(.+)\.html' , 'out.php/i$1'"
The parentheses tell the regular expression to store the text they match in a variable, and $1 is the variable that the text was stored in. This is read as:
“Replace show.php/some_stuff.html with out.php/isome_stuff”
To get a more in-depth understanding of using the variables, read up on the regular expression reference at the mozilla developer center.
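The effect of the directive can be reproduced with JavaScript's own replace method. This is a sketch only: the page URL uses a made-up file name in the 10pix.com format, and it is an assumption that ImageHost Grabber performs the replacement in the same way.

```javascript
// Hypothetical page URL in the 10pix.com format; the file name is invented.
const pageUrl = "http://www.10pix.com/show.php/183068_example.jpg.html";

// The two quoted parts of the directive: a regular expression and plain
// replacement text. Backslashes are doubled because the pattern is written
// as a JavaScript string here.
const pattern = new RegExp("show\\.php\\/(.+)\\.html");
const replacement = "out.php/i$1";

// $1 is filled in with whatever the parentheses matched.
const imageUrl = pageUrl.replace(pattern, replacement);

console.log(imageUrl);
```

The part of the URL in front of show.php is not matched by the pattern, so it is carried over unchanged.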
USING THE REGULAR EXPRESSIONS
For most image hosts, no id is provided and there is no simple way to replace some stuff to get to the image source. In this case, it is left to you to determine a good enough search pattern to find the image tag that contains the image source. The best way to explain this is by example.
Consider the following host “dumparump.com” where an image was recently uploaded. The URL to the page is:
http://www.dumparump.com/view.php?id=06zebzT
The corresponding image tag for the full-size image that we are looking for is:
There are plenty of unique search criteria available to construct a good regular expression. Take a look at the source. That alone would be good enough. The “alt” attribute provides even more unique text to construct a proper regular expression. I will demonstrate regular expressions using both as a unique search pattern.
“http:\/\/image\.dumparump\.com\/[0-9]+\/.+”
or
“Hosted by dumpArump\.com”
Whichever you pick doesn’t matter. The point is that the program needs something to match the right image tag. I personally picked the second one because it was the easiest at the time. Note that it is also possible to include the attribute names in the regular expression. For instance, you could construct the regular expression to look something like:
“src=\”http:\/\/image\.dumparump\.com\/[0-9]+\/.+\””
Typically this is unnecessary, but you may find a case where including the attribute name proves to be useful. One important thing to note is that the quotes had to be escaped. This is not because the quote is a special character in regular expressions. It is because JavaScript would otherwise treat the first inner quote as the end of the string that the starting quote began. Quotes in JavaScript denote a string value, and if you want a quote to be part of the string value, then you must escape the quote.
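The process described in this section can be sketched in a few lines: collect every image tag, then test each whole tag against the search pattern. The page data below is invented, and the way the tags are collected is an assumption rather than ImageHost Grabber's actual code. Note that in a standalone JavaScript string every regular expression backslash is doubled, in addition to the escaped quotes.

```javascript
// Invented page data; the real image tag from dumparump.com is not shown here.
const pageData =
  '<img src="http://www.dumparump.com/logo.gif" alt="logo">' +
  '<img src="http://image.dumparump.com/12/photo.jpg" alt="Hosted by dumpArump.com">';

// Step 1: find every image tag in the page data.
const tags = pageData.match(/<img[^>]*>/gi) || [];

// Step 2: build the search pattern from a string. The inner quotes are
// escaped as \" so they become part of the string value.
const searchPattern = new RegExp(
  "src=\"http:\\/\\/image\\.dumparump\\.com\\/[0-9]+\\/.+\""
);

// Step 3: test the entire tag, not just its attribute values.
const hit = tags.find(function (tag) {
  return searchPattern.test(tag);
});

console.log(hit);
```

Only the second tag is returned, since the first one points at a different host name.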
USING A FUNCTION TO HANDLE THE SEARCH
Although this section may be beyond the level of understanding for many readers, it is presented here for those who are already knowledgeable. This section will provide an explanation of how the function should work, and what the function should return. This section will also cover the four actions that ImageHost Grabber (IHG) will take based on the data returned. What this section is not is a JavaScript tutorial.
The basic idea of the function is to tell IHG what the URL is that requires action, and what that action is. Two string objects are passed to the function where the first object is the page data and the second object is the page URL. It is your responsibility as the programmer to use these objects to find a URL and determine an appropriate action. How you implement the process of determining these two criteria is solely an exercise of style (or skill). However, there are some basic requirements that the function must implement in order for it to work.
The first and most important requirement is that the function return an object with at least (but not limited to) two string object members: imgUrl and status. The members must be defined using these names because IHG will use these member names to evaluate the values. Other object members can be added to the returned object if, for example, the host file is to be used on another program. With that said, allow me to touch on one important topic before continuing any further.
One thing left out so far is the fact that the host file has been designed to be used by any program, not just IHG. This is important because some functions included in the host file may perform actions specific to IHG. In order to maintain compatibility with other programs, the function uses a conditional test on the global variable ihg_Globals.appName. For IHG, the value of ihg_Globals.appName is “ImageHost Grabber”. This allows each program that uses the host file to specify actions unique to the program without causing incompatibility issues. This is helpful if a particular host poses a condition requiring specific action to be taken, such as countermeasures taken against leeching. Writing program-specific code should be avoided at all costs, as the goal of the host file is to be as generic as possible. If you have to utilize a function, library, variable, or anything else that is only defined in your program, then put it in a conditional statement where the condition is the global variable ihg_Globals.appName.
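A bare-bones host function that follows these two points might look like the sketch below. The ihg_Globals object and the starting value of maxThreads are stand-ins declared only so the example can run by itself, the typeof check is an extra safeguard added here, and the body of the function is placeholder logic.

```javascript
// Stand-ins for globals that the host program is expected to provide.
// They are declared here only so this sketch can run on its own.
var ihg_Globals = { appName: "ImageHost Grabber" };
var maxThreads = 4; // invented starting value

var hostFunc = function (pageData, pageUrl) {
  // Program-specific code sits inside the conditional so that other
  // programs using the host file simply skip it.
  if (typeof ihg_Globals !== "undefined" &&
      ihg_Globals.appName == "ImageHost Grabber") {
    maxThreads = 1;
  }

  // The returned object carries the two required members.
  var retVal = new Object();
  retVal.imgUrl = pageUrl; // placeholder: treat the page URL as the image URL
  retVal.status = "OK";
  return retVal;
};

var result = hostFunc("", "http://www.example.com/image.jpg");
console.log(result.status, result.imgUrl, maxThreads);
```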
The second requirement that the function should have is in the expected values of the member object status. The value of this member defines what course of action should be taken on the value of the imgUrl member. The values for status are case sensitive and the accepted values are: OK, ABORT, RETRY, and REQUEUE. If you are using my host file in your program, then it is up to you what these actions will do in your program. However, in IHG, these values are fairly self-explanatory, with maybe the exception of REQUEUE.
Assigning the value of OK to status signifies that the value of imgUrl is ready to be downloaded. The status value of ABORT signifies that the particular download should not be performed. The status value of RETRY signifies that the page data should be re-downloaded and the new page data be sent back to the function for re-evaluation. Note that in IHG, it is not required to assign a value to imgUrl for a status value of ABORT or RETRY.
For the status value of REQUEUE, ImageHost Grabber will use the value in imgUrl to determine a new host definition and re-queue the download process using the new host definition. This is particularly important when dealing with encapsulating or redirecting image hosts such as “usercash” and “clb1”. In using the REQUEUE feature of IHG, the host function for “usercash” finds the target URL, or the encapsulated URL, and assigns that target URL to imgUrl. IHG will then examine imgUrl to determine which image host was encapsulated and assign the corresponding host definition. Finally, the new page data for imgUrl is downloaded and examined using the search pattern (from the host definition).
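For anyone using the host file in another program, the four status values could be handled along the lines of the following sketch. The function describeAction is hypothetical and only returns a description of each action; it does not reflect IHG's internal code.

```javascript
// Hypothetical dispatcher; it describes each action rather than performing it.
function describeAction(result) {
  switch (result.status) {
    case "OK":
      return "download " + result.imgUrl;
    case "ABORT":
      return "skip this download";
    case "RETRY":
      return "fetch the page again and re-run the function";
    case "REQUEUE":
      return "find a host definition for " + result.imgUrl + " and queue it again";
    default:
      // The values are case sensitive, so "ok" or "Ok" end up here.
      return "unknown status: " + result.status;
  }
}

const okAction = describeAction({ status: "OK", imgUrl: "http://www.example.com/a.jpg" });
const abortAction = describeAction({ status: "ABORT", imgUrl: null });
const wrongCase = describeAction({ status: "ok", imgUrl: "http://www.example.com/a.jpg" });

console.log(okAction);
console.log(abortAction);
console.log(wrongCase);
```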
Thus far, the explanation of the functions used in IHG has been very abstract. To get a good idea of how a function should be constructed, take a look at the host file and look at the existing functions. A few examples taken from the host file will be presented to aid in understanding. One last thing should be noted before continuing. Functions do not have to be explicitly named. That is, they can be anonymous functions. This works because IHG assigns the anonymous function to a member of an object, where the object is an instance of a class.
In the first example shown below, several of the main concepts are utilized. The function is an anonymous function that takes two arguments, pageData and pageUrl. These arguments correspond to the page data and the page URL, respectively. This function also makes use of IHG-specific code in the conditional statement on ihg_Globals.appName, which changes the global variable maxThreads to a value of 1. The object retVal is created and the two required members are added in the if/else statement. A third member, fileName, is added to the object. This third member is not required, but it does assist in retaining the original file name if it is provided.
function(pageData, pageUrl) {
    // IHG-specific setting, guarded so other programs skip it
    if (ihg_Globals.appName == "ImageHost Grabber") maxThreads = 1;

    // Find the tag with the wanted id, then read its src attribute
    var theId = "picture";
    var searchPat = new RegExp("<.+id(?:\\s+)?=(?:\\s+)?(\"|')?" + theId + "\\1.+?>");
    var tmpMatch = pageData.match(searchPat);
    if (tmpMatch) {
        tmpMatch = tmpMatch[0];
        var the_url = tmpMatch.match(/src(?:\s+)?=(?:\s+)?("|')?(.+?)\1/);
        if (the_url) the_url = the_url[2];
    }

    // Fill in the two required members
    var retVal = new Object();
    if (!the_url) {
        retVal.imgUrl = null;
        retVal.status = "ABORT";
    } else {
        retVal.imgUrl = the_url;
        retVal.status = "OK";
    }

    // Optional member: keep the original file name if the page provides it
    try {
        retVal.fileName = pageData.match(/» \w*<\/a> » (.+\.\w+)\s\(views: \w*\)/)[1];
    } catch(e) {
        retVal.fileName = Math.random().toString().substring(2) + ".jpg";
    }
    return retVal;
}
In the next example, the host is actually a redirect service. The function finds the target URL, or the URL that is being redirected to, and stores it in the variable newPageUrl. The value of newPageUrl is then assigned to the member imgUrl and the status of “REQUEUE” is assigned to the member status. ImageHost Grabber will then take the target URL and determine what host definition should be applied. Finally, if a defined host is found, IHG will re-queue the download with the new host definition for the target URL.
function(pageData, pageUrl) {
    var retVal = new Object();

    // Find the iframe that holds the redirect target
    var theId = "redirectframe";
    var someMatch = new RegExp("<iframe.+?id=('|\")" + theId + "\\1.+?>");
    var frameElem = pageData.match(someMatch);
    if (frameElem) {
        var srcMatch = frameElem.toString().match(/src=("|')(.+?)\1/);
        if (srcMatch) var newPageUrl = srcMatch[2];
    }

    // Hand the target URL back to the program for re-queuing
    if (!newPageUrl) {
        retVal.status = "ABORT";
        retVal.imgUrl = null;
    } else {
        retVal.status = "REQUEUE";
        retVal.imgUrl = newPageUrl;
    }
    return retVal;
}
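One way to try out a host function before adding it to the host file is to assign it to a variable and call it by hand with sample page data. The sketch below does this for the redirect example above; the variable name redirectHost, the sample pages, and the redirect URL are all invented for the test.

```javascript
// The redirect example from above, assigned to a variable so it can be called.
var redirectHost = function (pageData, pageUrl) {
  var retVal = new Object();
  var theId = "redirectframe";
  var someMatch = new RegExp("<iframe.+?id=('|\")" + theId + "\\1.+?>");
  var frameElem = pageData.match(someMatch);
  if (frameElem) {
    var srcMatch = frameElem.toString().match(/src=("|')(.+?)\1/);
    if (srcMatch) var newPageUrl = srcMatch[2];
  }
  if (!newPageUrl) {
    retVal.status = "ABORT";
    retVal.imgUrl = null;
  } else {
    retVal.status = "REQUEUE";
    retVal.imgUrl = newPageUrl;
  }
  return retVal;
};

// Invented page data for a redirect page, kept on one line because "." in
// the pattern does not match line breaks.
var samplePage = '<html><body><iframe name="f" id="redirectframe" ' +
  'src="http://img16.imagevenue.com/img.php?image=99340_DSC01455_122_790lo.JPG">' +
  '</iframe></body></html>';

var found = redirectHost(samplePage, "http://redirect.example.com/abc");
var missing = redirectHost("<html><body>no frame here</body></html>",
  "http://redirect.example.com/abc");

console.log(found.status, found.imgUrl);
console.log(missing.status, missing.imgUrl);
```

The first call finds the iframe and asks for a re-queue with the target URL, while the second call finds nothing and aborts.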
This concludes the rather brief description of writing host functions in IHG. If you have questions regarding the implementation of the host file for your program, or how to get a function to work with IHG, feel free to email me with your questions.