URL shortening – How do they do it

URL shortening services had experienced a big push in the age of social networks rising, such as Twitter, where big urls collide with the 140 characters user input restriction.

So there was this need of magically transform URLs like this one http://raduboncea.ro/2009/04/01/twitter-autofollow-and-dm-using-python-twitter-imaplib-and-gmail/ into something like this http://tinyurl.com/cralfk.

Many would be tempted to think there is some kind of compression algorithm that would shorten a big URL and make it fit into a unique combination of 4,5 or 6 characters and then decompressed upon user request. Well it doesn”t work like this, mathematics prohibit it.

So how does it actually work?
A URL shortening service takes the big URL and saves it into a database along with an auto incremented numerical value, an id or sequence, so we would have the first big URL saved with ID=1, the second with ID=2 and so on. When someone would ask for http://tinyurl.com/1/ we would get from the database the big URL identified by ID=1 and redirect the user to that URL.

But there is a problem: having numerical ids does not shorten urls too much. For the URL number 1 million we would have 7 numerical characters. So the next step is to change base numerotation from 10 to a bigger one, lets say 36, so we include all a-z small latin letters or, why note, 62 which includes also the capital chars A-Z.
In 36 base we can save as much as 1.679.615 URLs into 4 characters. On base 62 we have 14.776.335 URLs fit into 4 chars, almost 10 times as much as base32 and 14.776 times base10.

Some security issues.
Many shortening services stopped to the algorithm above without digging into a very important security issue which is related to how redirects work. It is critical, before redirecting users to the URL they requested to check what’s behind that URL, to check if the URL is not yet another redirect which may result into a recursive loop and may disrupt the service.

A malicious user may save a URL to a page which will later modify to redirect back to the service URL and thus attempting to generate an infinite loop which could make the interpreter to hang.

To avoid this kind of DoS vulnerability, we must not allow redirects to our own links and to limit redirects to outside to a smaller finite number of depths analysis.

Beneath a small model-view example in django, using also urllib to check for redirects and the base10-base62 conversion algorithm similar to the base36 which is included in django.









5 Responses to “URL shortening – How do they do it”
name Posted on June 18, 2009 at 7:57 pm

Your Site Is Great,

Radu Boncea Weblog » Microsoft Fail Posted on January 16, 2010 at 9:32 am

[…] via techcrunchUn intro despre ce este si cum functioneaza un serviciu de scurtare URL. […]

Ali Posted on May 4, 2010 at 11:06 am

Good post. One thing more that I would like to point out that is that site related links such as “admin” etc. should also be checked for before assigning any short-URL to any link.

Hindi Sms Posted on June 4, 2010 at 5:54 am

I can guess the hard work it must have been needed to research for this post.All what i can say is just keep Publishing such post we all love it.And just to bring something to your notice,I have seen several blog providng your blog as source for this information.

Nelis Elorm Duhadzi Posted on August 27, 2010 at 9:35 pm

This is a URL shortening function i wrote in PHP. I also have Javascript and MySql versions. I have been tweeting about this code for sometime but am now building a website on it to help people who might need it.
Follow me on Twitter (@myNelis) or add me on Facebook and get the updates.
// this first function shortens any URL into a 6 character string
// @param int $num: the autoincrement ID from the database table
function int2key ($num) {
$str = ”;
$map = array(
‘TtzK9WLgqkhlHrXM4VyNs63YdCnJEpDwv1SAjReibGxu852QfcaoB7UPIFZm’,  // x 1
‘krLZzDXfem7IagnPc4H6V9EWBYpiuNqjJMTGtyswKRACxvFoU18dSb32Qhl5′,  // x 60
‘RgK2DspoSFhJYy4cBztre8ivEXPC6jnG9NluLfbUa3Zk7xMVAm1TW5dHIqwQ’,  // x 3600
‘Aa5lWyL9jP8hMTcKFNpSY73ZGkgei2vHsoXRBV1mJx4EfCqtwDnz6bIrUQud’,  // x 216000
‘KpAGgEjRmzMDHbUydrqC6lIXu1ksfviWFZcVS3oN7e9xJ5hT24Q8aLPYwnBt’,  // x 12960000
‘QU4gnd9Ima8ACK2rb5SyGTZepJtkVvfYXc1hiFMu7LzPxNDRjW3q6BwoHEsl’   // x 777600000
);

$key_len = strlen($map[0]);   

foreach ($map as $index => $key) {           
$offset = ($num%$key_len);           
$str .= substr($key, $offset, 1);           
$num = floor($num/$key_len);
}

return $str;
}

// this part converts the 6 character code back into a its original ID number
function key2int ($str) {       
$num = 0;
$map = array(
‘TtzK9WLgqkhlHrXM4VyNs63YdCnJEpDwv1SAjReibGxu852QfcaoB7UPIFZm’,  // x 1
‘krLZzDXfem7IagnPc4H6V9EWBYpiuNqjJMTGtyswKRACxvFoU18dSb32Qhl5′,  // x 60
‘RgK2DspoSFhJYy4cBztre8ivEXPC6jnG9NluLfbUa3Zk7xMVAm1TW5dHIqwQ’,  // x 3600
‘Aa5lWyL9jP8hMTcKFNpSY73ZGkgei2vHsoXRBV1mJx4EfCqtwDnz6bIrUQud’,  // x 216000
‘KpAGgEjRmzMDHbUydrqC6lIXu1ksfviWFZcVS3oN7e9xJ5hT24Q8aLPYwnBt’,  // x 12960000
‘QU4gnd9Ima8ACK2rb5SyGTZepJtkVvfYXc1hiFMu7LzPxNDRjW3q6BwoHEsl’   // x 777600000
);

$key_len = strlen($map[0]);   
if (sizeof($map) > strlen($str)) {
return -1;   
}   

foreach ($map as $index => $key) {
$char = substr($str, $index, 1);
$pos = strpos($key, $char);   

$mult = pow($key_len, $index);
$num += $pos*$mult;
}

return $num;
}

Post a Comment