
The Studio of Eric Valosin

Tuesday, February 24, 2015

Resources for Building a Link to a Random Website

Consider this a "for-dummies" resource base written by a former dummy who wants to save some other dummies a headache or two.

Background:


Several of my past projects had used a QR code that linked to a random website. I had taken the easy route at the time, linking to a random website generator someone else had already created:
Some Examples of Existing Generators:
But when the link I used went dead, I needed to start figuring out how to build and host this myself. Hopefully this can prove helpful to anyone who might be trying to do something similar, starting from as little a knowledge base as I did.

I knew I essentially needed to compile an array of URLs and then randomly pick one to redirect the user's browser to. However, I didn't want to have to compile and meticulously update an exhaustive list of all viable URLs on the internet by hand, so I decided to try to tap into a reliable pre-existing database.
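
For reference, the hand-compiled version of that idea fits in a few lines. This is only a minimal sketch: the three URLs are placeholders, and the function name `pickRandomUrl` is made up for illustration.

```php
<?php
// Hypothetical hand-maintained list; these URLs are placeholders only.
$urls = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
];

// Returns one randomly chosen entry from a non-empty array of URLs.
function pickRandomUrl(array $urls): string {
    return $urls[array_rand($urls)];
}

$destination = pickRandomUrl($urls);
// On a live page you would now send: header("Location: " . $destination);
echo $destination . "\n";
```

The weakness is the list itself: it only ever contains what you typed in, which is why a ready-made directory is more appealing.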

That's when I discovered the DMOZ Open Directory Project. It is an extensive and longstanding project, owned by AOL and maintained by volunteer editors, that attempts to catalog the entire internet. It also happens to be the resource that many of these generators use anyway.
Some Other Useful Attempts to Catalog the Internet:  

 The Solution: 


I endeavored to design a web crawler using PHP which would pore over the DMOZ directory and randomly extract a web address to follow.

THE CRAWLER

For the code for a good, super simple web crawler that scrapes a site for its links, I suggest you look here, as I did. The example script there became the basis for my crawler. It has 2 basic parts (follow the link above for the visual):

  • The first part (the first 2 lines of code at the link above) scrapes all the data from the given page and strips it of all HTML tags except those pertaining to links. The stripped file can be printed out using the commented-out "debug" lines of code to get an idea of just what's going on.
  • The second part (the "preg_match_all..." line) uses what's called a "regular expression," a code that looks through that stripped version of the page and finds everything that matches a given pattern (in this case, everything that is a link) and stores the matches as entries in an array.
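
To see the two parts working without touching the network, here is a minimal sketch that runs them on a made-up snippet of HTML. The sample markup and the function name `extractLinks` are assumptions for illustration; the regex is a slightly simplified form of the one in the example script.

```php
<?php
// Sample markup standing in for a downloaded page (invented for this example).
$html = '<div><h1>Links</h1><p>Visit <a href="http://example.com/">Example</a> or '
      . '<a class="nav" href="/Arts/">Arts</a>.</p></div>';

// Returns every href found in the <a> tags of an HTML string.
function extractLinks(string $html): array {
    // Part 1: strip every tag except <a>, leaving only link tags and plain text.
    $stripped = strip_tags($html, "<a>");
    // Part 2: capture the contents of each href="..." into group 1.
    preg_match_all('~<a[^>]*href="([^"]*)"[^>]*>[^<]*</a>~is', $stripped, $matches);
    return $matches[1];
}

print_r(extractLinks($html));
```

Swapping `$html` for the result of `file_get_contents()` on a real page gives you the basic scraper.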
REGEX - VERY IMPORTANT NOTE: regular expressions (or "regex") are stupidly opaque and very difficult to learn (think The Imitation Game), but they're SUPER useful.  Here are the resources I used in order to build the guts of my crawler:

Applying this to DMOZ

My crawler uses this as its basis, but because the DMOZ website has layers of sub-categorizations before you find any external links, it has to iterate through the scraping process several times. My crawler looks at the home page of DMOZ and uses a regex to create an array of all the directory's category links. It then chooses one element from that array at random and follows it to Level 2. At this second level, it repeats the process, scraping and searching for all links, iterating deeper and deeper until it terminates at an external URL. I customized the regex to weed out irrelevant links, so that each match is either an external URL or the next level of depth within the DMOZ categorization.

If the link it randomly selects from the created array is internal, it follows that link and repeats the process again. If the link is external, it stops there and redirects to that link. I'm still tweaking the regex to get this as clean as possible, but it is essentially functional.
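
The internal-versus-external loop described above can be sketched on its own, separate from the scraping. In this rough example the `$pages` array is a hypothetical stand-in for the directory (each internal path maps to the links found on that page), and `isInternal` and `drillDown` are names invented for illustration. The depth limit is an added safety net, not something the description above requires.

```php
<?php
// Hypothetical stand-in for the directory: internal path => links on that page.
$pages = [
    "/"           => ["/Arts", "/Science"],
    "/Arts"       => ["/Arts/Music", "http://example.com/"],
    "/Arts/Music" => ["http://example.org/"],
    "/Science"    => ["http://example.net/"],
];

// Treats a link as internal if it starts with "/" followed by a capital letter.
function isInternal(string $link): bool {
    return preg_match('~^/[A-Z]~', $link) === 1;
}

// Follows random links from $start until it reaches an external URL.
// Returns null if it hits a page with no links or exceeds $maxDepth hops.
function drillDown(array $pages, string $start, int $maxDepth = 10): ?string {
    $current = $start;
    for ($depth = 0; $depth < $maxDepth; $depth++) {
        $links = $pages[$current] ?? [];
        if (empty($links)) {
            return null;
        }
        $current = $links[array_rand($links)];
        if (!isInternal($current)) {
            return $current;   // external link: stop here
        }
    }
    return null;
}

echo drillDown($pages, "/") . "\n";
```

Anchoring the check to the start of the link matters: an external address such as http://example.com/About also contains a slash followed by a capital letter, and should not count as internal.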

The redirect is a simple PHP header redirect to this final external URL, as you'll see in the code.
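
As a small illustration of the redirect step, here is a sketch that checks the destination before sending it. The helper name `buildRedirectHeader` and the example URL are assumptions; the only built-in pieces are PHP's `filter_var()`, `preg_match()`, and `header()`.

```php
<?php
// Returns the Location header line for $url,
// or null if $url is not an absolute http(s) URL.
function buildRedirectHeader(string $url): ?string {
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return null;
    }
    if (preg_match('~^https?://~i', $url) !== 1) {
        return null;
    }
    return "Location: " . $url;
}

$target = "http://example.com/";   // placeholder destination
$line = buildRedirectHeader($target);

if ($line !== null && PHP_SAPI !== "cli") {
    header($line);   // must be sent before any other output
    exit;            // stop so nothing else runs after the redirect
}

// On the command line, just show what would have been sent.
echo ($line ?? "no redirect") . "\n";
```

Calling `exit` right after `header()` is a common habit, since the rest of the script would otherwise keep running after the redirect has been sent.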

Below is my code, which you can feel free to use and experiment with. I simply ask that you give any credit where credit is due, and pass on the help to others (random website generators are cool. Plagiarism and selfishness aren't).

I don't claim this to be flawless or even pretty. In fact, I'm pretty sure I could find a way to condense the second and third level iterations into one block of code, and I need to tweak the regex to stop favoring MusicMoz, Wikipedia, and other external directories that don't technically terminate the search for a URL. This was actually my first time using PHP at all, so a better programmer could surely find a more elegant and robust solution, but nonetheless, it's a starting point. Enjoy!

<?php

// Crawls the DMOZ directory for all relevant links, picks one at random to follow, and repeats
// that process until it reaches an external link, which it then redirects to.
// The original code and regular expression from which this was adapted:
//     $original_file = file_get_contents("http://www.domain.com");
//     $stripped_file = strip_tags($original_file, "<a>");
//     preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
// If DMOZ ever goes down, change $base and adjust the regexes to suit the new site.

$base = "http://www.dmoz.org";

// Fetches a page and strips all tags except <a>; stops the script if the page can't be loaded.
function fetchStripped($url) {
    $file = @file_get_contents($url);
    if ($file === false) {
        http_response_code(502);
        exit("Could not load the directory page.");
    }
    return strip_tags($file, "<a>");
}

// Returns a random element of $links; stops the script if the regex matched nothing.
function pickRandom($links) {
    if (empty($links)) {
        http_response_code(502);
        exit("No matching links found.");
    }
    return $links[array_rand($links)];
}

//---- HOMEPAGE LAYER ---------------

$stripped_file = fetchStripped($base);
preg_match_all('~<a[^>]*href="(/(?!World)[A-Z]\w+)/"[^>]*>[^<]*</a>~s', $stripped_file, $homepageLinks);
    // finds all top-level category links except "/World"; full matches land in [0], the paths (e.g. "/Arts") in [1]

$randomFromHomepage = pickRandom($homepageLinks[1]);
    // selects a random element from the $homepageLinks array

        //-- DEBUG --
        // header("Content-type: text/plain");   // makes the print_r output readable in a browser
        // print_r($homepageLinks);
        // echo "link followed = " . $randomFromHomepage . "\n";

//----- LEVEL 2 -------------------

// Matches external links (http...) and internal category links (/Category...), skipping "/World"
// and links ending in "dmoz.org/". It only matches URLs that end in a non-word character
// (usually "/"), and group [1] holds the URL without that final character.
$linkPattern = '~<a(?:[^>]*)href="((http|/(?!World)[A-Z]\w+)[^"]*)[^\w*\?q](?<!dmoz|dmoz\.org/)"(?:[^>]*)>(?:[^<]*)</a>~s';

$stripped_level2 = fetchStripped($base . $randomFromHomepage);
preg_match_all($linkPattern, $stripped_level2, $level2links);

$newRandomLink = pickRandom($level2links[1]);

        //-- DEBUG --
        // header("Content-type: text/plain");
        // print_r($level2links);
        // echo "link followed = " . $randomFromHomepage . "\n";
        // echo "next link = " . $newRandomLink . "\n";

//----- DEEPER ITERATIONS ---------- TODO: FILTER OUT MOZILLA, MUSICMOZ, EN.WIKIPEDIA

$internalPattern = '~^/[A-Z]~';   // internal links start with "/" and a capital letter
$internal = preg_match($internalPattern, $newRandomLink);
$depth = 0;

while ($internal === 1 && $depth < 20) {   // the depth cap guards against endless crawling
    $stripped_nextFile = fetchStripped($base . $newRandomLink);
    preg_match_all($linkPattern, $stripped_nextFile, $linksArray);

        //-- DEBUG --
        // header("Content-type: text/plain");
        // print $stripped_nextFile;
        // print_r($linksArray);
        // print "link followed = " . $newRandomLink . "\n";

    $newRandomLink = pickRandom($linksArray[1]);
    $internal = preg_match($internalPattern, $newRandomLink);   // 0 once the link is external, which ends the loop
    $depth++;
}

if ($internal === 1) {
    http_response_code(502);
    exit("Gave up before reaching an external link.");
}

//----- REDIRECT -------------

        // echo "newRandomLink = " . $newRandomLink;   // shows ultimate redirect destination
header("Location: " . $newRandomLink);
exit;

?>








