Hacker Tools: A Multithreaded Crawler Scanner in Python - Alibaba Cloud Developer Community


2017-11-01


The code is as follows:

 
    # -*- coding:utf-8 -*-
    __author__ = "iplaypython.com"

    import os
    import urllib2
    import threading
    import Queue
    import time
    import random

    q = Queue.Queue()  # Queue creates a queue; there are three queue types, FIFO by default
    threading_num = 5  # start 5 threads

    # Scan a local IP or domain name
    domain_name = "http://127.0.0.1"
    # Baiduspider User-Agent
    Baidu_spider = "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
    # File extensions to filter out
    exclude_list = ['.jpg', '.gif', '.css', '.png', '.js', '.scss']

    proxy_list = [  # proxy servers; these may no longer work, replace them with your own
        {'http': '117.28.254.130:8080'},
        {'http': '118.144.177.254:3128'},
        {'http': '113.118.211.152:9797'},
    ]

    # Open the dictionary file and filter out the paths we do not need
    with open("/home/leo/app_txt/wordpress.txt", "r") as lines:
        for line in lines:
            line = line.rstrip()
            if os.path.splitext(line)[1] not in exclude_list:
                q.put(line)  # put line into queue q

    # Scanner function
    def crawler():
        while not q.empty():  # loop until the queue is empty
            path = q.get()  # take line out of queue q
            url = "%s%s" % (domain_name, path)  # build the URL to request

            random_proxy = random.choice(proxy_list)  # pick a proxy server at random
            proxy_support = urllib2.ProxyHandler(random_proxy)
            opener = urllib2.build_opener(proxy_support)
            urllib2.install_opener(opener)

            headers = {}
            headers['User-Agent'] = Baidu_spider  # spider header
            # 玩蛇网 www.iplaypython.com

            request = urllib2.Request(url, headers=headers)

            try:
                response = urllib2.urlopen(request)
                content = response.read()

                if len(content):  # if the body is not empty, print the status code and path
                    print "Status [%s]  - path: %s" % (response.code, path)

                response.close()
                time.sleep(1)  # pause so that too many fast connections do not get the IP banned
            except urllib2.HTTPError as e:
                # print e.code, path
                pass  # exception handling; just pass for now

    if __name__ == '__main__':
        # Create the threads with crawler as the entry point; arguments can be passed in later
        for i in range(threading_num):
            t = threading.Thread(target=crawler)
            t.start()
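One thing to watch in the worker loop of the script: checking `q.empty()` and then calling a blocking `q.get()` are two separate steps, so another thread can take the last item in between and leave this thread waiting forever. The following is a minimal Python 3 sketch of one way to avoid that, not the original author's code. The names `drain` and `check` are hypothetical; `check` stands in for the HTTP request so the sketch runs without a network.

```python
import queue
import threading


def drain(paths, check, num_threads=5):
    """Run check(path) for every path using num_threads worker threads.

    `check` is a hypothetical stand-in for the HTTP request in the script;
    it receives one path and returns a result, or None to skip the path.
    """
    q = queue.Queue()
    for p in paths:
        q.put(p)

    results = []
    lock = threading.Lock()  # guards `results`

    def worker():
        while True:
            try:
                path = q.get_nowait()  # never blocks
            except queue.Empty:
                return  # queue drained, end this thread
            outcome = check(path)
            if outcome is not None:
                with lock:
                    results.append((path, outcome))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers before returning
    return results
```

Because the threads finish in no fixed order, the returned list should be sorted or treated as unordered by the caller.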

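The code above imports six modules, all of which the script uses. os is used to filter out paths whose file extensions we do not want to scan, urllib2 handles the fetching, and threading is Python's multithreading module. This time we also need Queue, a queue module that guarantees thread safety. The other two are simple: random is the random-selection module and time is the time module.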
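The extension filter relies on `os.path.splitext`, which splits a path into a `(root, ext)` pair. Here is a small Python 3 sketch of that step in isolation; the function name `keep_path` and the sample paths in the usage note are illustrative and not from the original article.

```python
import os

# Same extensions as exclude_list in the script
EXCLUDE = {'.jpg', '.gif', '.css', '.png', '.js', '.scss'}


def keep_path(line, exclude=EXCLUDE):
    """Return True if the path's extension is not in the exclude set."""
    ext = os.path.splitext(line.rstrip())[1]
    return ext not in exclude
```

With this check, `keep_path("/wp-login.php")` is true and `keep_path("/wp-content/style.css")` is false. The comparison is case-sensitive and looks at everything after the last dot, so a dictionary entry such as `/a.JPG` or `/a.js?ver=1` would still pass the filter.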

Reposted from: http://www.iplaypython.com/crawler/multithreading-crawler-scanner.html

This article is reposted from 027ryan's 51CTO blog. Original link: http://blog.51cto.com/ucode/1871010. To repost it, please contact the original author.
