Art: (b)Logos: February 2015

Tuesday, February 24, 2015

(Web) Crawling Out From Between a QR and a Hard Place

So you want to make a QR code that points to a randomized website, eh?

So did I, and when I was in grad school for my MFA I hacked together a number of somewhat unstable solutions. I recently overhauled my approach to this challenge, when bits of my artwork started falling apart around me and I found I needed to take matters into my own hands. Disclaimer: I'm not a programmer by trade AT ALL, but just a quick learner who gets obsessed easily and can share what knowledge I've accumulated. Here's a bit of a narrative tutorial that can help you learn from my trial and error.

----

Now then,

If you're looking to make a random QR code there are 3 things to consider:

the destination url: how to get a single url to redirect to another randomly selected url (the hard part)
the short link to that url: needed to make the destination url short enough to conveniently fit within a QR code's character limits
the QR code itself: translating that url into the pattern of black and white squares (the easy part, sort of)

There are a few ways to do this, depending how hacked and how proprietary you want the solution to be.

THE EASY WAY... (aka learn from my mistakes)

...is simply to find someone else's pre-existing website that has a link to a random website, and copy their link address into a QR code generator, and there you go. When I first created Meditation 1.1 that's what I did. It works excellently in the short term but it ended up giving me a lot of issues in the long run because I didn't have any control over the back end of things.

Problem 1: Last summer, during an exhibition, it came to my attention that my piece was leading people to error 404 messages. As it turned out, the QR code generator I had used (sparqcode.com) was bought out by another company and then migrated and dissolved. The code itself was still scannable but because the QR code generator automatically shortened my destination url to its own proprietary short link (to get it to fit within the dimensions of their QR code), when the company went down it took all its short links with it. Even though my destination url (the randomizing link) was still functioning, the QR code no longer pointed people to it.

Solution: I ended up having to choose a different QR code generator (qrstuff.com) and remake all my QR codes (a huge headache when those codes are hand drawn into artwork and published in print...)

Problem 2: Once I reconnected my codes to the destination url, the person who ran the website whose randomizing link I used (the now defunct randomwebsite.com) decided to shut down his website, so that link went dead (in the middle of another exhibition of mine, I might add!). My QR code still functioned, and the short link encoded into it still worked, but this time when viewers landed on the destination url that was supposed to redirect them to a random website, they'd get another error 404 message.

Solution: I found a new randomizing destination link (like uroulette.com, for example) and started all over: new QR code, new short link, etc.

Problem 3: This wouldn't be so bad if you could simply adjust the destination url, keeping the short link the same (that way the link encoded into the QR code doesn't change and you don't have to redraw the code, it just points somewhere different.) However most free QR code generators online will automatically shorten the link you enter to make it fit into the QR code, but they don't let you adjust the destination url that the short link points to. When my randomizing destination url went dead, I had no way of changing it, thus I had to entirely remake the QR code every time.

Solution: I switched from relying on a QR code generator's proprietary link shortener, to using the goo.gl link shortener, which lets you adjust the destination url each short link points to. That way if randomwebsite.com goes down, I can switch the link to uroulette.com without changing the QR code. I also figured Google's going to stick around longer than, say, bit.ly or tinyurl.

THE HARD WAY (aka, the real way to do it)

That's ok if you don't mind some instability, but with all these things falling apart, I really wanted to take all of this into my own hands to get rid of the proprietary uncertainty. So I decided to code my own randomizing url, link shortener, and QR code generator from scratch and host them on my own website, so nobody can mess me up but me. This is the solution I now use:

QR Code Generator: Coding this is the easy one, because you don't need to do it! In my research I've learned that a QR code is a QR code is a QR code, and it doesn't matter who makes it. Once it's made it will always be readable even if the generator goes out of business (like a poem will always be readable even if the author dies). Besides, actually translating the url into the black and white pixels yourself involves a ridiculous amount of advanced math (the error correction alone is built on Reed-Solomon codes and finite-field arithmetic). Seriously. I tried, and it's not worth it. (If you're curious, you can learn about it here.) So just use someone else's QR code generator, but don't let them shorten your link (do that part yourself).

Link Shortener: This one's not too bad either. There are open source programs you can download, like the PHP/MySQL based YOURLS. This works great for most people and puts all the control in your hands, letting you adjust anything you want, host it on your own website, and even make it public for other people to use (creating your own bit.ly etc.) if you want. For some reason it didn't work for me though (some unidentifiable configuration discrepancy, I don't know).

My workaround: The maximum character limit for a QR code varies by the dimensions of the code and its error correction level, but for what I wanted it just so happened that the domain I already had was short enough that I could use it with an extension of up to 3 other characters and still have it fit within a 25 x 25 QR code grid (i.e. http://mydomain.com/123 does not need to be shortened further).
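To make that size check concrete, here is a rough sketch. It assumes the generator encodes the url in byte mode (lowercase letters rule out the more compact alphanumeric mode) and uses the published byte-mode capacities for a version 2 (25 x 25) code. The function name levelsThatFit is made up for illustration, and a given generator may behave differently.

```php
<?php
// Byte-mode capacity of a version 2 (25 x 25 module) QR code,
// keyed by error correction level (L = lowest, H = highest).
const QR_V2_BYTE_CAPACITY = ["L" => 32, "M" => 26, "Q" => 20, "H" => 14];

// Hypothetical helper: returns the error correction levels at which
// $url still fits inside a version 2 code.
function levelsThatFit(string $url): array {
    $length = strlen($url); // byte mode stores one byte per ASCII character
    $fits = [];
    foreach (QR_V2_BYTE_CAPACITY as $level => $capacity) {
        if ($length <= $capacity) {
            $fits[] = $level;
        }
    }
    return $fits;
}

print_r(levelsThatFit("http://mydomain.com/123"));     // 23 characters
print_r(levelsThatFit("http://mydomain.com/123.php")); // 27 characters
```

Under those assumptions the 23-character link fits at levels L and M, while adding ".php" pushes it to 27 characters, which only fits at level L. That is one reason trimming the file extension can matter.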

So I created a subdomain on my website that would be used specifically for this hacked link shortener, and filled it with pages whose names are no more than 3 characters long, each holding a PHP header redirect script that redirects to whatever destination url I want. (I also had to add an .htaccess file on my site that gets rid of the ".php" in the url to make sure it didn't get too long: "http://mydomain.com/123" instead of "http://mydomain.com/123.php".)
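A minimal sketch of what one of these short-link pages might look like is below. The file name 123.php, the placeholder destination, and the sendRedirect helper are all assumptions for illustration, not the exact files described above.

```php
<?php
// Hypothetical short-link page, saved as something like "123.php".
// To repoint the short link, edit this one value.
$destination = "http://example.com/random.php"; // placeholder destination

// Sends a temporary (302) redirect and stops the script.
// A temporary redirect is used so browsers are less likely to cache
// the old destination after the value above is changed.
function sendRedirect(string $url): void {
    header("Location: " . $url, true, 302);
    exit;
}

// Only redirect when served by a web server, so the file can also be
// run from the command line without exiting early.
if (PHP_SAPI !== "cli") {
    sendRedirect($destination);
}
```

The QR code only ever encodes the address of this page, so changing $destination changes where scanners end up without touching the printed code.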

If I want to change where the short link points, all I have to do is open the file and adjust the redirect code. I designated one of these "short links" on my site to where I had stored the following web scraper that does the heavy lifting...

Random Website: This was the hard part, really. One possibility is to create a program that essentially compiles random letters and numbers until it stumbles onto a valid url (the monkeys typing Shakespeare approach), but that hardly seemed practical and exceeded my knowledge of coding. So, unless you want to hand compile an exhaustive list of all possible extant websites and enter them into your array of possible sites to redirect to randomly, you have to rely on preexisting lists. There's no such thing as an exhaustive list of valid urls, but I've found a few useful, legit websites that are trying their darnedest to index the internet. The DMOZ project has been around for well over a decade and seems pretty stable, and it catalogs thousands and thousands of sites and organizes them by category. In fact, I discovered that uroulette.com actually uses their database to point viewers to the random site. There's also the Internet Archive (which is like a time capsule of the internet) and The Alphabetical Web Directory (which is a little easier to search but seemed less extensive).

So I decided to go straight to the source and build a php web crawler that scrapes all the data off the DMOZ home page and looks for links. It picks one of them at random and uses a regular expression to decide whether that link is external (to a site outside of the DMOZ directory) or internal (another subcategory within the DMOZ directory). If it's external, it redirects the browser to that site (ta-daa!!). If it's internal, it follows that link to the next page and scrapes it all over again looking for the next randomly selected link, repeating the process further down the rabbit hole until it eventually lands on some random external link and redirects there (ta-daaaa!!!). I still need to tweak the code a bit, but it works pretty well for the most part.
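The internal versus external decision can be sketched with two tiny regex checks. This assumes internal directory links are site-relative paths that start with a capitalized category (like "/Arts/Music") and external links are absolute http or https urls. The function names are made up for illustration.

```php
<?php
// Assumption: internal directory links look like "/Arts/Music".
function isInternalLink(string $link): bool {
    return preg_match('~^/[A-Z]~', $link) === 1;
}

// Assumption: external links are absolute http:// or https:// urls.
function isExternalLink(string $link): bool {
    return preg_match('~^https?://~i', $link) === 1;
}

var_dump(isInternalLink("/Arts/Music"));          // internal category path
var_dump(isExternalLink("http://example.com/"));  // external site
var_dump(isInternalLink("http://example.com/A")); // not internal, despite the "/A"
```

Anchoring both patterns to the start of the string with ^ keeps an external url that happens to contain "/A" somewhere in its path from being mistaken for a directory category.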

FOR DETAILS ON HOW I BUILT THIS WEB CRAWLER, CLICK HERE

And there you have it! If any part of any of this breaks on me now (if DMOZ shuts down, or if my short link goes dead, etc.) I can adjust any part of this process inside the code on my website without affecting any of the QR codes pointing to them!

SO:

If you want less of a headache, go with the easy route (good luck finding one though). But if you want something more stable or are a more competent programmer than me, I advise the hard way.

GOOD LUCK!

Resources for Building a Link to a Random Website

Consider this a "for-dummies" resource base written by a former dummy who wants to save some other dummies a headache or two.

Background:

Several of my past projects had used a QR code that linked to a random website. I had taken the easy route at the time, linking to a random website generator someone else had already created:

Some Examples of Existing Generators:

URouLette.com

RandomWebsiteDotCom

Mangle

List of Random Websites

But when the link I used went dead, I needed to start figuring out how to build and host this myself. Hopefully this can prove helpful to anyone who might be trying to do something similar, starting from as little a knowledge base as I did.

I knew I essentially needed to compile an array of URLs and then randomly pick one to redirect the user's browser to. However, I didn't want to have to compile and meticulously update an exhaustive list of all viable urls on the internet by hand, so I decided to try to tap into a reliable pre-existing database.
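For reference, the hand-compiled version of that idea is only a few lines. This is a minimal sketch with placeholder urls and a made-up function name, not the approach I ended up using:

```php
<?php
// Hand-compiled list of candidate destinations (placeholder urls).
$sites = [
    "http://example.com/",
    "http://example.org/",
    "http://example.net/",
];

// Hypothetical helper: returns one randomly chosen url from the list.
function pickRandomUrl(array $urls): string {
    if (count($urls) === 0) {
        throw new InvalidArgumentException("URL list is empty");
    }
    return $urls[array_rand($urls)];
}

$target = pickRandomUrl($sites);
// On a web server you would follow this with:
// header("Location: " . $target); exit;
echo $target . "\n";
```

The catch is exactly the one described above: the list is only as good as whoever keeps it up to date.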

That's when I discovered the DMOZ Open Directory Project. It is an extensive and longstanding project backed by AOL and a number of other reputable organizations, and it attempts to catalog the entire internet. It also happens to be the resource that many of these generators use anyway.

Some Other Useful Attempts to Catalog the Internet:

DMOZ

The Internet Archive

Yowl

The Solution:

I endeavored to design a web crawler using PHP which would pore over the DMOZ directory and randomly extract a web address to follow.

THE CRAWLER

For the code for a good, super simple, web crawler that scrapes a site for its links I suggest you look here, as I did. The example script here became the basis for my crawler. It has 2 basic parts (follow the link above for the visual):

The first part (first 2 lines of code at the link above) scrapes all the data from the given page and strips it of all html tags except those tags pertaining to links. This file can be printed out using the commented out "debug" lines of code to get an idea of just what's going on.
The second part (the "preg_match_all..." line) uses what's called a "regular expression," a code that looks through that stripped version of the page and finds everything that matches a given pattern (in this case, everything that is a link) and stores the matches as entries in an array.
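Here are those two parts run against a small hard-coded page so you can try them without hitting a real site. The sample HTML is made up. The strip_tags and preg_match_all lines are the same ones quoted in the comments at the top of my code below.

```php
<?php
// Made-up sample page standing in for file_get_contents("http://www.domain.com")
$original_file = '<html><body><h1>Links</h1>'
    . '<p><a href="http://example.com/">Example</a></p>'
    . '<div><a class="nav" href="/Arts/">Arts</a></div>'
    . '</body></html>';

// Part 1: strip every tag except <a>
$stripped_file = strip_tags($original_file, "<a>");

// Part 2: collect every link; $matches[0] holds the full <a>...</a> tags,
// $matches[1] holds just the captured href values
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

print_r($matches[1]);
```

For this sample, $matches[1] ends up holding the two href values, "http://example.com/" and "/Arts/".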

REGEX - VERY IMPORTANT NOTE: regular expressions (or "regex") are stupidly opaque and very difficult to learn (think The Imitation Game), but they're SUPER useful. Here are the resources I used in order to build the guts of my crawler:

Here's a dynamite resource for learning regex, with tutorials and tests.
Once you've got the hang of the concept, here's a great place to test out and experiment with regular expressions. It only works with preg_match [which finds only the first match and then stops], which is not so useful for your crawler, but it color codes the syntax, which is REALLY helpful in demystifying why something works or doesn't work.
And here's a second, equally useful test sandbox. This one does preg_match_all [which finds every match instead of just the first one], which is what you'll want in your final product.
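If the difference between the two functions is still fuzzy, this small comparison may help. The sample HTML and the simplified pattern are made up for illustration:

```php
<?php
// Made-up snippet of HTML with three links
$html = '<a href="/Arts/">Arts</a> <a href="/Games/">Games</a> <a href="/Science/">Science</a>';
$pattern = '~href="([^"]*)"~';

// preg_match stops after the first match
$firstCount = preg_match($pattern, $html, $first);

// preg_match_all keeps going and collects every match
$allCount = preg_match_all($pattern, $html, $all);

echo $firstCount, ": ", $first[1], "\n";
echo $allCount, ": ", implode(", ", $all[1]), "\n";
```

Here preg_match reports 1 match and captures only "/Arts/", while preg_match_all reports 3 and captures all three paths.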

Applying this to DMOZ

My crawler uses this as its basis, but because the DMOZ website has layers of sub-categorizations before you find any external links, it has to iterate through the scraping process several times. My crawler looks at the home page of DMOZ and uses a Regex to create an array of all the directory's category links. It then chooses one element from that array at random and follows it to Level 2. At this second level, it repeats the process, scraping and searching for all links, iterating deeper and deeper until it terminates at an external URL. I customized the regex to weed out irrelevant links, trying to get either to an external url, or to the next level of depth within the categorization of the DMOZ directory.

If the link it randomly selects from the created array is internal, it follows that link and repeats the process again. If the link is external, it stops there and redirects to that link. I'm still tweaking the regex to get this as clean as possible, but it is essentially functional.
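The iteration can be tried out without touching the network by swapping the real directory for a toy one. Everything in this sketch is an assumption for illustration: the $pages array, the randomWalk function name, and the depth limit (which the real crawler does not have).

```php
<?php
// Toy stand-in for the directory: each internal path maps to the links
// that would be scraped from that page.
$pages = [
    "/" => ["/Arts", "/Science"],
    "/Arts" => ["/Arts/Music", "http://example.com/gallery"],
    "/Arts/Music" => ["http://example.org/band"],
    "/Science" => ["http://example.net/lab", "/Science/Physics"],
    "/Science/Physics" => ["http://example.com/physics"],
];

// Follows random links until one is external; returns null on a dead end
// or if $maxDepth levels pass without finding an external link.
function randomWalk(array $pages, string $start = "/", int $maxDepth = 10): ?string {
    $current = $start;
    for ($depth = 0; $depth < $maxDepth; $depth++) {
        $links = $pages[$current] ?? [];
        if (count($links) === 0) {
            return null; // nothing to follow on this page
        }
        $next = $links[array_rand($links)];
        if (preg_match('~^https?://~', $next) === 1) {
            return $next; // external link: this is where we would redirect
        }
        $current = $next; // internal link: go one level deeper
    }
    return null;
}

echo randomWalk($pages), "\n";
```

Written as a single loop like this, the homepage, second level, and deeper iterations all share one block of code.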

The Redirect is a simple PHP header redirect to this final external URL, as you'll see in the code.

Below is my CODE...

which you can feel free to use and experiment with. I simply ask that you give any credit where credit is due, and pass on the help to others (random website generators are cool. Plagiarism and selfishness aren't).

I don't claim this to be flawless or even pretty (in fact I'm pretty sure I could find a way to condense the second and third level iterations into one block of code, and I need to tweak the regex to stop favoring musicMoz, wikipedia, and other external directories that don't technically terminate the search for a URL). This was actually my first time using PHP at all, so a better programmer could surely find a more elegant and robust solution, but nonetheless, it's a starting point. Enjoy!

<?php

// Crawls DMOZ directory for all relevant links, picks one at random to follow, and repeats that process until terminating at an external link, which it then redirects to.
// The original code and regular expression off of which this was adapted was as follows:
//     $original_file = file_get_contents("http://www.domain.com");
//     $stripped_file = strip_tags($original_file, "<a>");
//     preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
// If DMOZ ever crashes, simply change $directory below and adjust the regex for something appropriate for the new link.

 $directory = "http://www.dmoz.org";

 function pickRandomLink($links) {
     // array_rand() fails on an empty array, so stop with a message instead
     if (empty($links)) {
         exit("No matching links found. The page may be unreachable or the regex may need adjusting.");
     }
     return $links[array_rand($links)];
 }

//---- HOMEPAGE LAYER ---------------

 $original_file = file_get_contents($directory);
     // gets the contents of the dmoz website homepage
 $stripped_file = strip_tags($original_file, "<a>");
     // strips file of all tags except <a>
 preg_match_all("~<a(?:[^>]*)href=\"(\/(?!World)[A-Z]\w+)\/\">[^\"]*\"(?:[^>]*)>(?:[^<]*)<\/a>~s", $stripped_file, $homepageLinks);
     // searches content for all directory related links except for "/World" in [0], and captures the relevant parts of url in [1]

 $randomFromHomepage = pickRandomLink($homepageLinks[1]);
     // selects a random element from the $homepageLinks array

          //-- DEBUG --
          // header("Content-type: text/plain");
               // organizes $homepageLinks array into something readable
          // print_r($homepageLinks);
               // returns the organized $homepageLinks array
          // echo "link followed = " . $randomFromHomepage . "\n";
               // returns randomly generated link

//----- LEVEL 2 -------------------

 $level2 = file_get_contents($directory . $randomFromHomepage);
 $stripped_level2 = strip_tags($level2, "<a>");
 preg_match_all("~<a(?:[^>]*)href=\"((http|\/(?!World)[A-Z]\w+)[^\"]*)[^\w*\?q](?<!dmoz|dmoz\.org\/)\"(?:[^>]*)>(?:[^<]*)<\/a>~s", $stripped_level2, $level2links);

 $newRandomLink = pickRandomLink($level2links[1]);

          //-- DEBUG --
          // header("Content-type: text/plain");
          // print_r($level2links);
          // echo "link followed = " . $randomFromHomepage . "\n";
          // echo "next link = " . $newRandomLink . "\n";

//----- DEEPER ITERATIONS ---------- TRY TO GET RID OF MOZILLA, MUSICMOZ, EN.WIKIPEDIA

 $internal = preg_match("~^\/[A-Z]~", $newRandomLink);
     // 1 if the link is a directory path like "/Arts", 0 if it is an external url
          // print $internal;

 while ($internal === 1) {
  $nextFile = file_get_contents($directory . $newRandomLink);
  $stripped_nextFile = strip_tags($nextFile, "<a>");
  preg_match_all("~<a(?:[^>]*)href=\"((http|\/(?!World)[A-Z]\w+)[^\"]*)[^\w*\?q](?<!dmoz|dmoz\.org\/)\"(?:[^>]*)>(?:[^<]*)<\/a>~s", $stripped_nextFile, $linksArray);

          //-- DEBUG --
          // header("Content-type: text/plain");
          // print $nextFile;
          // print_r($linksArray);
          // print "link followed = " . $newRandomLink . "\n";

  $newRandomLink = pickRandomLink($linksArray[1]);
  $internal = preg_match("~^\/[A-Z]~", $newRandomLink);

  if (preg_match("~^https?~", $newRandomLink)) {
   break;
       // external link found, stop crawling
  }
 }

//----- REDIRECT -------------

 header("Location: $newRandomLink");
 exit;
     // stops the script once the redirect header is sent
          // echo "newRandomLink = " . $newRandomLink; //Shows ultimate redirect destination (move above the redirect to use)

?>

Monday, February 2, 2015

Opening into Infinity

Here are a few shots from the opening of Getting to Infinity at Seton Hall University's Walsh Gallery. I am proud to have been in very strong company. Any one of the artists could have stolen the show, except that the show had already been stolen by Jessica Angel's stellar immersive installation (literally, an immersive, diagrammatic installation of the visible cosmos, oriented to be precisely aligned with real space at 6 pm on the day of the opening.)

Jessica Angel's installation

Gianluca Bianchino, The Singularity

Julie Oldham, In the Beginning there was Nothing (still)

Travis LeRoy Southworth, The Continuous Work Drawings, line drawings made with mouse tracking software on his work computer as he edits photos and images.

Foreground: man posing with Chad Stayrook's Ejection Seat #3, Background: Man standing near my Dissolvation

Which brings me to my own piece of this show, Dissolvation. For the backstory and an explanation, you can follow the link below to the first post about the project.

I want to thank everyone who came out. I had a great time talking with you all and sharing in the excellent work!

<< PREVIOUS POST FIRST POST ON THIS PROJECT

Subscribe to Eric Valosin's mailing list
