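While fetching the shop list today, I found that the shop ID could not be obtained for some entries. On inspection, the cause is that those links are encrypted Taobaoke (淘宝客) links: they go through several automatic redirects, so the shop's original URL cannot be read off directly. Below I analyze the redirects to see whether a PHP program can decode a Taobaoke link.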
We start with the initial URL, which has been encrypted:

    URL-1 : http://s.click.taobao.com/t?e=m%3D2%26s%3DmhUBaPBOjKgcQipKwQzePDAVflQIoZepLKpWJ%2Bin0XJRAdhuF14FMWbTPFtu%2Beb25x%2BIUlGKNpVThCboufnbP9Hv2pYZlnKo2aXUa9ZSw4NICmluU1pBFBM7j7IDy3as
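According to an article I consulted, a Taobaoke link goes through several 302 redirects and URL-decoding steps. Below I try to decode the URL with PHP.

Step one is to get the URL after the first redirect. Because the first redirect is an ordinary 302, I use the first method described in a reference article on reading redirects, which gives the following URL:

    URL-2 : http://s.click.taobao.com/t_js?tu=http%3A%2F%2Fs.click.taobao.com%2Ft%3Fe%3Dm%253D2%2526s%253DmhUBaPBOjKgcQipKwQzePDAVflQIoZepLKpWJ%252Bin0XJRAdhuF14FMWbTPFtu%252Beb25x%252BIUlGKNpVThCboufnbP9Hv2pYZlnKo2aXUa9ZSw4NICmluU1pBFBM7j7IDy3as__%26ref%3D%26et%3DlFmOxUWdrHGW4uwYzOcDJzhoOyNq6XyZ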
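As a rough illustration of reading a single 302 hop, the sketch below asks PHP's HTTP wrapper not to follow redirects and then picks the `Location` header out of the response. The helper names `first_location` and `resolve_one_hop` are hypothetical and not part of the original script, which lets `get_headers` follow the chain and takes the last `Location` instead.

```php
<?php
/**
 * Return the first Location value from an associative header array such as
 * the one get_headers($url, true) produces. Header names are compared
 * case-insensitively; a repeated header arrives as an array of values.
 */
function first_location(array $headers): ?string
{
    foreach ($headers as $name => $value) {
        // The status line sits under integer key 0, so skip non-string keys.
        if (is_string($name) && strcasecmp($name, 'Location') === 0) {
            return is_array($value) ? (string) reset($value) : (string) $value;
        }
    }
    return null;
}

/**
 * Resolve exactly one redirect hop (hypothetical helper).
 * Returns the input URL unchanged when there is no redirect or the request fails.
 */
function resolve_one_hop(string $url): string
{
    // follow_location = 0 stops the HTTP wrapper from chasing the chain itself.
    $context = stream_context_create(['http' => ['follow_location' => 0]]);
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return $url;
    }
    return first_location($headers) ?? $url;
}
```

Calling `resolve_one_hop($url)` on URL-1 would be expected to return something shaped like URL-2, assuming the server still behaves as described here.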
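Next comes the second redirect. This one is more complicated: it is a JavaScript redirect, so the URL has to be decoded first. I use an unescape function taken from another article to simulate the JavaScript decoding, which gives the following URL:

    URL-3 : http://s.click.taobao.com/t_js?tu=http://s.click.taobao.com/t?e=m%3D2%26s%3DmhUBaPBOjKgcQipKwQzePDAVflQIoZepLKpWJ%2Bin0XJRAdhuF14FMWbTPFtu%2Beb25x%2BIUlGKNpVThCboufnbP9Hv2pYZlnKo2aXUa9ZSw4NICmluU1pBFBM7j7IDy3as__&ref=&et=lFmOxUWdrHGW4uwYzOcDJzhoOyNq6XyZ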
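For a URL that contains only ASCII percent-escapes, as these do, the effect of the decoding step can be reproduced with one pass of PHP's built-in `rawurldecode`. The sketch below uses an abbreviated stand-in for URL-2 in which `TOKEN` and `ET` replace the long encrypted values; the function name `decode_once` is hypothetical.

```php
<?php
/**
 * Apply exactly one percent-decoding pass.
 * "%3A" becomes ":", "%2F" becomes "/", and the double-encoded "%253D"
 * becomes "%3D" (it is not decoded all the way down to "=").
 */
function decode_once(string $url): string
{
    return rawurldecode($url);
}

// Abbreviated stand-in for URL-2 (TOKEN and ET are placeholders).
$url2 = 'http://s.click.taobao.com/t_js?tu=http%3A%2F%2Fs.click.taobao.com'
      . '%2Ft%3Fe%3Dm%253D2%2526s%253DTOKEN__%26ref%3D%26et%3DET';

$url3 = decode_once($url2);
echo $url3, "\n";
```

Only one pass is applied on purpose: a second pass would also decode the inner `e` value, whereas URL-4 as listed in this article keeps that value percent-encoded.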
Analyzing the link shows that everything after the tu query parameter is the redirect target we need:

    URL-4 : http://s.click.taobao.com/t?e=m%3D2%26s%3DmhUBaPBOjKgcQipKwQzePDAVflQIoZepLKpWJ%2Bin0XJRAdhuF14FMWbTPFtu%2Beb25x%2BIUlGKNpVThCboufnbP9Hv2pYZlnKo2aXUa9ZSw4NICmluU1pBFBM7j7IDy3as__&ref=&et=lFmOxUWdrHGW4uwYzOcDJzhoOyNq6XyZ
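One way to take "everything after tu" without depending on a fixed character offset is to search for the parameter marker. This is a minimal sketch with a hypothetical helper name, `extract_tu_target`, again using an abbreviated URL with placeholder values. It deliberately avoids `parse_str` on the already-decoded URL, because the bare `&ref=` and `&et=` would then be split off as separate parameters.

```php
<?php
/**
 * Return the part of an already-decoded t_js URL that follows "?tu=",
 * or null when the marker is absent.
 */
function extract_tu_target(string $decodedUrl): ?string
{
    $marker = '?tu=';
    $pos = strpos($decodedUrl, $marker);
    if ($pos === false) {
        return null;
    }
    return substr($decodedUrl, $pos + strlen($marker));
}

// Abbreviated stand-in for URL-3 (TOKEN and ET are placeholders).
$url3 = 'http://s.click.taobao.com/t_js?tu=http://s.click.taobao.com/t?e=m%3D2%26s%3DTOKEN__&ref=&et=ET';
$url4 = extract_tu_target($url3);
echo $url4, "\n";
```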
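However, opening this URL directly in a browser is refused, because the request headers lack a Referer. So I use cURL to forge the header and simulate a real redirect. With the Referer set to URL-2, the request returns the following URL:

    URL-5 : http://store.taobao.com/shop/view_shop.htm?user_number_id=11927771&ali_trackid=2:mm_34573471_3460241_15078369:1394245150_6k2_446582491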
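The Referer step can also be sketched without parsing raw header text, by letting cURL report the redirect target itself. This is an illustrative variant rather than the code used in this article; the function name `fetch_redirect_with_referer` is hypothetical, and whether the server still answers this way is an assumption.

```php
<?php
/**
 * Request $url with a forged Referer and return the redirect target that
 * cURL reports, or null when the response is not a redirect or the
 * request fails. Requires the cURL extension.
 */
function fetch_redirect_with_referer(string $url, string $referer): ?string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_REFERER        => $referer, // the header the server checks
        CURLOPT_FOLLOWLOCATION => false,    // stop after the first hop
        CURLOPT_RETURNTRANSFER => true,     // do not print the response
        CURLOPT_NOBODY         => true,     // headers are enough
        CURLOPT_TIMEOUT        => 30,
    ]);
    curl_exec($curl);
    // With FOLLOWLOCATION off, this holds the URL that should be requested next.
    $target = curl_getinfo($curl, CURLINFO_REDIRECT_URL);
    return is_string($target) && $target !== '' ? $target : null;
}
```

A call such as `fetch_redirect_with_referer($url_4, $url_2)` mirrors the step above: URL-4 is requested with URL-2 as the Referer.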
This URL already contains the shop ID and the Alimama ID, and it opens normally in a browser, but one last redirect remains. Using the plain 302 method from step one gives the following URL:

    URL-6 : http://afeyesonme.taobao.com/shop/view_shop.htm?user_number_id=11927771&ali_trackid=2:mm_34573471_3460241_15078369:1394245150_6k2_446582491_
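Once the final URL is known, the identifiers can be read from its query string with PHP's built-in `parse_url` and `parse_str`. The sketch below assumes that `user_number_id` is the value this article calls the shop ID; the helper name `extract_shop_params` is hypothetical.

```php
<?php
/**
 * Return the query parameters of a shop URL as an associative array,
 * or an empty array when the URL has no query string.
 */
function extract_shop_params(string $shopUrl): array
{
    $query = parse_url($shopUrl, PHP_URL_QUERY);
    if (!is_string($query)) {
        return [];
    }
    parse_str($query, $params);
    return $params;
}

$url6 = 'http://afeyesonme.taobao.com/shop/view_shop.htm?user_number_id=11927771'
      . '&ali_trackid=2:mm_34573471_3460241_15078369:1394245150_6k2_446582491_';

$params = extract_shop_params($url6);
echo $params['user_number_id'] ?? '(missing)', "\n"; // prints 11927771
```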
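That gives the final URL, reached after six stages in total (URL-1 through URL-6). We can now modify and use it.

The PHP implementation is as follows: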
    <head></head>
    <body>
    <form name="url" method="get" action="">
        <label for="url"></label>
        <input type="text" name="url" id="url" size="150">
        <input type="submit" value="Submit">
    </form>
    <?php
    /**
     * Created by PhpStorm.
     * User: Jason
     * Date: 14-3-7
     * Time: 9:22 PM
     */

    if (!empty($_GET['url'])) {
        $url = $_GET['url'];
    } else {
        $url = "http://s.click.taobao.com/t?e=m%3D2%26s%3DmhUBaPBOjKgcQipKwQzePDAVflQIoZepLKpWJ%2Bin0XJRAdhuF14FMWbTPFtu%2Beb25x%2BIUlGKNpVThCboufnbP9Hv2pYZlnKo2aXUa9ZSw4NICmluU1pBFBM7j7IDy3as";
    }

    // Print one labelled step, escaping the value because it may come from user input
    function show($label, $value)
    {
        echo $label . ' : ' . htmlspecialchars((string) $value) . "<br>";
    }

    show('URL-1', $url);

    // Simulate an ordinary 301/302 redirect
    function get_redirect_url($url)
    {
        $header = @get_headers($url, 1);
        if ($header === false) {
            return $url; // request failed; leave the URL unchanged
        }
        if (strpos($header[0], '301') !== false || strpos($header[0], '302') !== false) {
            if (is_array($header['Location'])) {
                // get_headers follows the chain, so take the last Location seen
                return $header['Location'][count($header['Location']) - 1];
            } else {
                return $header['Location'];
            }
        } else {
            return $url;
        }
    }

    $url_2 = get_redirect_url($url);
    show('URL-2', $url_2);

    // Simulate JavaScript's unescape() decoding
    function unescape($str)
    {
        $str = rawurldecode($str);
        preg_match_all('/%u.{4}|&#x.{4};|&#\d+;|.+/U', $str, $r);
        $ar = $r[0];
        foreach ($ar as $k => $v) {
            // pack() emits big-endian bytes, so the source encoding is named explicitly as UCS-2BE
            if (substr($v, 0, 2) == "%u") {
                $ar[$k] = iconv("UCS-2BE", "GBK", pack("H4", substr($v, -4)));
            } elseif (substr($v, 0, 3) == "&#x") {
                $ar[$k] = iconv("UCS-2BE", "GBK", pack("H4", substr($v, 3, -1)));
            } elseif (substr($v, 0, 2) == "&#") {
                $ar[$k] = iconv("UCS-2BE", "GBK", pack("n", substr($v, 2, -1)));
            }
        }
        return join("", $ar);
    }

    $url_3 = unescape($url_2);
    show('URL-3', $url_3);

    // parse_str($url_3, $arr);
    // Drop the 34-character prefix; what remains is the value of tu
    $url_4 = substr($url_3, strlen('http://s.click.taobao.com/t_js?tu='));
    show('URL-4', $url_4);

    // Simulate a redirect that requires a Referer header
    function curl_get_redirects($url, $refer)
    {
        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_FAILONERROR, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($curl, CURLOPT_REFERER, $refer);
        curl_setopt($curl, CURLOPT_HEADER, true);
        curl_setopt($curl, CURLOPT_NOBODY, true);
        curl_setopt($curl, CURLOPT_TIMEOUT, 30);
        $result = curl_exec($curl);
        curl_close($curl);
        // Take the first Location header; trim() removes the trailing "\r"
        if (is_string($result) && preg_match('!^Location:\s*(.*)$!mi', $result, $matches)) {
            $target = trim($matches[1]);
            echo ": redirects to " . htmlspecialchars($target) . "\n" . "<br>";
            return $target;
        } else {
            echo ": no redirection\n" . "<br>";
            return false;
        }
    }

    $refer = $url_2;
    $url_5 = curl_get_redirects($url_4, $refer);
    if ($url_5 === false) {
        exit('URL-5 could not be resolved');
    }
    show('URL-5', $url_5);

    $url_6 = get_redirect_url($url_5);
    show('URL-6', $url_6);

    // Function that splits a URL into its components
    /*
    The capture groups are as follows
    $1 = http:
    $2 = http
    $3 = //www.nowamagic.net
    $4 = www.nowamagic.net
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Gonn
    $9 = Gonn
    */
    function url_format($url)
    {
        $search = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~i';
        $url = trim($url);
        preg_match_all($search, $url, $rr);
        // var_dump($rr);
        return $rr;
    }

    // Group 4 is the host name of the shop
    $url_7 = url_format($url_6);
    show('URL-7', $url_7[4][0]);

    // Build the shop's search page URL, sorted by newest items first
    $url_8 = $url_7[4][0];
    $url_8 = 'http://' . $url_8 . '/search.htm?search=y&orderType=newOn_desc';
    show('URL-8', $url_8);
    ?>
    </body>
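The last two steps (take the host, then build the search URL) can also be written with PHP's built-in `parse_url` instead of a hand-written regular expression. This is an alternative sketch, not the approach used in the script above; the function name `shop_search_url` is hypothetical, and the search path and parameters are copied from the script.

```php
<?php
/**
 * Build the shop's search page URL from any URL on the shop's host.
 * Returns null when no host can be extracted.
 */
function shop_search_url(string $shopUrl): ?string
{
    $host = parse_url(trim($shopUrl), PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    return 'http://' . $host . '/search.htm?search=y&orderType=newOn_desc';
}

$url6 = 'http://afeyesonme.taobao.com/shop/view_shop.htm?user_number_id=11927771'
      . '&ali_trackid=2:mm_34573471_3460241_15078369:1394245150_6k2_446582491_';

echo shop_search_url($url6), "\n";
```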