Before crawling a website, we should first get a sense of its size and structure. The site's own robots.txt and Sitemap files are very helpful for this.
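As a minimal sketch of how robots.txt can be read from Python, the example below uses the standard library's urllib.robotparser. The robots.txt content, the crawler names, and the example.com URLs are hypothetical and are parsed from a string so that the sketch runs without network access; for a real site you would typically call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap

Sitemap: http://example.com/sitemap.xml
"""


def load_rules(robots_txt):
    """Parse robots.txt text and return a RobotFileParser holding its rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(SAMPLE_ROBOTS_TXT)

# A crawler that is banned outright, and one that falls under the "*" rules.
print(rules.can_fetch("BadCrawler", "http://example.com/index.html"))
print(rules.can_fetch("GoodCrawler", "http://example.com/index.html"))
print(rules.can_fetch("GoodCrawler", "http://example.com/trap"))

# Requested delay (in seconds) between requests, and the advertised Sitemap URLs.
print(rules.crawl_delay("GoodCrawler"))
print(rules.site_maps())
```

Checking can_fetch() before each request, and honoring the crawl delay, is one simple way to stay within what the site owner allows.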
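1. Estimating the size of a website:
A search engine can give a rough estimate of a website's size: searching Google or Baidu with the site keyword returns the pages that engine has indexed for a domain, for example site:www.oschina.net. Appending a URL path after the domain filters the results to only that part of the site. This method is not very precise; treat the result count as a rough reference only.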
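The site keyword and the path filter described above can be sketched as a small helper that builds the search term and a search URL. The function names and the /news path are illustrative assumptions; the code only constructs strings and does not contact a search engine.

```python
from urllib.parse import urlencode


def site_query(domain, path=""):
    """Build a `site:` search term, optionally narrowed to a URL path."""
    return "site:" + domain + path


def google_search_url(query):
    """Return a Google search URL for the given query string."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Whole site, then only pages under a (hypothetical) /news path.
print(site_query("www.oschina.net"))
print(site_query("www.oschina.net", "/news"))
print(google_search_url(site_query("www.oschina.net", "/news")))
```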
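2. Identifying the technology a website uses
We can use the builtwith module to identify which technologies the target website is built with. Install the module: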
pip install builtwith
Then load the module and have it download and analyze the URL:
>>> import builtwith
>>> builtwith.parse('http://www.oschina.net/')
{u'javascript-frameworks': [u'jQuery', u'Vue.js'], u'web-servers': [u'Tengine']}
>>> builtwith.parse('http://www.douban.com')
{u'javascript-frameworks': [u'jQuery'], u'tag-managers': [u'Google Tag Manager'], u'analytics': [u'Piwik']}
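Because builtwith.parse returns a dictionary that maps each category to a list of technology names, a small helper can flatten it for reporting. This is only a sketch: detect_technologies and flatten_technologies are hypothetical names, and the sample dictionary is copied from the output shown above so that the example runs without network access or the package installed.

```python
def detect_technologies(url):
    """Download and analyze a URL with builtwith (requires `pip install builtwith`)."""
    # Imported lazily so flatten_technologies() works without the package.
    import builtwith
    return builtwith.parse(url)


def flatten_technologies(result):
    """Turn {category: [names]} into a sorted list of (category, name) pairs."""
    return sorted(
        (category, name)
        for category, names in result.items()
        for name in names
    )


# Sample result copied from the session above.
sample = {
    "javascript-frameworks": ["jQuery", "Vue.js"],
    "web-servers": ["Tengine"],
}

for category, name in flatten_technologies(sample):
    print(f"{category}: {name}")
```

Knowing the stack up front helps you choose a crawling approach: a page rendered by a JavaScript framework may need different handling than plain server-rendered HTML.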
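3. Finding the owner of a domain
There are several ways to look up a domain's owner: you can use one of the lookup services that various websites provide, or query directly from Python with a WHOIS wrapper library: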
pip install python-whois
>>> import whois
>>> print(whois.whois('oschina.com'))
{
  "updated_date": [ "2017-01-06 00:00:00", "2017-01-06 06:00:23" ],
  "status": [
    "clientDeleteProhibited https://icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://icann.org/epp#clientTransferProhibited",
    "clientDeleteProhibited https://www.icann.org/epp#clientDeleteProhibited",
    "clientTransferProhibited https://www.icann.org/epp#clientTransferProhibited"
  ],
  "name": "zheng jin wei",
  "dnssec": "unsigned",
  "city": "bei jing",
  "expiration_date": [ "2018-07-25 00:00:00", "2018-07-25 13:05:23" ],
  "zipcode": "100000",
  "domain_name": [ "OSCHINA.COM", "oschina.com" ],
  "country": "CN",
  "whois_server": "whois.ename.com",
  "state": "bei jing",
  "registrar": "eName Technology Co.,Ltd.",
  "referral_url": "http://www.ename.net",
  "address": "hai dian qu zhong guan cun nan da jie",
  "name_servers": [
    "DNS1.IIDNS.COM", "DNS2.IIDNS.COM", "DNS3.IIDNS.COM",
    "DNS4.IIDNS.COM", "DNS5.IIDNS.COM", "DNS6.IIDNS.COM",
    "dns1.iidns.com", "dns2.iidns.com", "dns3.iidns.com",
    "dns4.iidns.com", "dns5.iidns.com", "dns6.iidns.com"
  ],
  "org": "zheng jin wei",
  "creation_date": [ "2007-07-25 00:00:00", "2007-07-25 13:05:23" ],
  "emails": [ "abuse@ename.com", "d_7665@163.com" ]
}
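As the output above shows, several WHOIS fields come back as lists with near-duplicate entries, and name servers appear in both upper and lower case. The sketch below condenses such a record into a few fields. The names lookup and summarize are hypothetical, the code assumes the object returned by whois.whois behaves like a dictionary, and the sample record is an abridged copy of the output above so that the example runs offline.

```python
def lookup(domain):
    """Query WHOIS for a domain (requires `pip install python-whois`)."""
    # Imported lazily so summarize() works without the package installed.
    import whois
    return whois.whois(domain)


def first(value):
    """Return the first item of a list-like field, or the value itself."""
    if isinstance(value, (list, tuple)):
        return value[0] if value else None
    return value


def summarize(record):
    """Condense a WHOIS record into a few normalized fields."""
    name_servers = record.get("name_servers") or []
    return {
        "registrar": first(record.get("registrar")),
        "created": first(record.get("creation_date")),
        "expires": first(record.get("expiration_date")),
        # Lower-case and de-duplicate the name servers.
        "name_servers": sorted({ns.lower() for ns in name_servers}),
    }


# Abridged sample copied from the session above.
sample = {
    "registrar": "eName Technology Co.,Ltd.",
    "creation_date": ["2007-07-25 00:00:00", "2007-07-25 13:05:23"],
    "expiration_date": ["2018-07-25 00:00:00", "2018-07-25 13:05:23"],
    "name_servers": ["DNS1.IIDNS.COM", "DNS2.IIDNS.COM", "dns1.iidns.com", "dns2.iidns.com"],
}

for key, value in summarize(sample).items():
    print(f"{key}: {value}")
```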