RESEARCHUT -- Minds With Innovations

Python Crawler

Script to download data based on a specific pattern from an Apache directory listing

Sunday 16 April 2006 at 4:43 pm.

This is, IMO, one of the dirtiest ways I have used so far to accomplish a requirement. But it does the job I want. :D

 DISCLAIMER: I'm a learner. There must be a better, smarter and easier way to accomplish the same task.

rrs@learner:~/My_Documents/My Books $ cat /home/rrs/devel/eclipse/PythonFun/web_pattern_fetcher.py

#!/usr/bin/env python

"""
This tiny little script crawls an Apache-generated directory listing
and downloads every entry that matches a specific pattern.
I'm using it to download anything that Apache shows as TXT or IMG.
I'm sure others will be able to extend it further.
"""

import urllib, urllib2

url = "http://www.wuppy.net.ru/Fun/"
req = urllib2.Request(url)
handle = urllib2.urlopen(req)

# Read the listing line by line; readline() returns '' at end of file.
while True:
    line = handle.readline()
    if line == '':
        break
    # Apache marks text and image entries with [TXT] and [IMG].
    if "[TXT]" in line or "[IMG]" in line:
        word_list = line.split(' ')
        # The fifth space-separated token is expected to hold the link.
        if len(word_list) < 5:
            continue
        req_word = word_list[4]
        # Break and take out the relevant data uri:
        # the text between ">" and the closing "</A"
        begin_num = req_word.find(">")
        end_num = req_word.find("</A")
        if begin_num == -1 or end_num == -1:
            continue
        data = req_word[begin_num + 1:end_num]
        real_url = url + data
        # Save the file in the current directory under the same name.
        urllib.urlretrieve(real_url, data)

handle.close()
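
For anyone who wants to extend it: the fragile part is picking the file name out of each line by splitting on spaces. A regular expression on the link itself is probably a bit more robust. What follows is only a rough sketch, not a tested replacement. It assumes Python 3 (so urllib.request instead of urllib2), it assumes each listing row has its marker and its link on one line, and it assumes every href is a plain relative file name. The names extract_names and fetch_all are made up for this example.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

# Assumption: every row carries its link as <A HREF="name">, in any letter case.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"', re.IGNORECASE)
MARKERS = ("[TXT]", "[IMG]")


def extract_names(listing_html):
    """Return the href of every listing row marked [TXT] or [IMG]."""
    names = []
    for line in listing_html.splitlines():
        if any(marker in line for marker in MARKERS):
            match = LINK_RE.search(line)
            if match:
                names.append(match.group(1))
    return names


def fetch_all(url):
    """Download every matching file from the listing at url into the current directory."""
    with urlopen(url) as handle:
        listing = handle.read().decode("utf-8", errors="replace")
    for name in extract_names(listing):
        # Assumption: name is a plain file name, so it is safe as a local path.
        urlretrieve(urljoin(url, name), name)
```

Calling fetch_all("http://www.wuppy.net.ru/Fun/") should then do roughly the same job as the script above. Unlike the script, it takes the name from the href rather than from the link text.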