An Analysis of 5 Anti-Crawler Techniques
Data collection and analysis now play a key role in how enterprises grow and set policy, so companies use crawler technology to gather data. A crawler's purpose is to obtain data at scale over a long period. However, if a single IP is always used to crawl a website, that kind of large-scale, concentrated access may eventually be rejected by the server, or the site may start demanding a verification code; even rotating through multiple accounts does not always avoid the CAPTCHA prompt. Below, Tomatoip's engineers analyze five anti-crawler techniques and show how to solve and avoid these problems.
Technique 1: Set the download waiting time / download frequency
Large-scale, concentrated access puts heavy pressure on a server; a crawler can spike the server load in a short time. The key point is to keep the download waiting time within a sensible range: if the wait is too long, the crawler cannot fetch at scale quickly enough; if it is too short, access is likely to be denied.
1. In the earlier "Get HTML from URL" method, a socket timeout and a connect timeout were set on the httpGet configuration. The exact values are not absolute; they mainly depend on how tightly the target website polices crawlers.
2. In addition, the scrapy crawler framework has a dedicated parameter for the download waiting time, download_delay. It can be set in settings.py, or on the spider itself.
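As a minimal sketch, the scrapy settings above can look like this in settings.py (the delay value here is illustrative; pick one appropriate to the target site):

```python
# settings.py (scrapy) -- a minimal sketch

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 2

# With this enabled (the scrapy default), the actual wait is randomized
# between 0.5x and 1.5x DOWNLOAD_DELAY, so requests are less bursty
# and less obviously machine-timed.
RANDOMIZE_DOWNLOAD_DELAY = True
```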
Technique 2: Modify the User-Agent
The most common approach is to disguise the crawler as a browser by modifying the User-Agent.
The User-Agent is a string containing browser information, operating-system information, and so on, sent as part of the request headers. The server uses it to determine whether the current client is a browser, a mail client, or a web crawler. You can view it in request.headers; how to inspect a data packet and read its User-Agent and other fields was covered in a previous article.
Concretely, set the User-Agent to a browser-like value, or go further and build a User-Agent pool (a list, array, or dictionary all work) that stores several "browsers"; on each crawl, randomly pick one User-Agent from the pool for the request. The User-Agent then keeps changing, which helps prevent the crawler from being blocked.
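A minimal sketch of such a User-Agent pool, using only the standard library (the User-Agent strings and the fetch helper are illustrative):

```python
import random
import urllib.request

# A small hand-picked User-Agent pool; real pools are usually much larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a different "browser" for every request, so the User-Agent
    # seen by the server keeps changing.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    req = urllib.request.Request(url, headers=headers)
    return urllib.request.urlopen(req, timeout=10)
```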
Technique 3: Set cookies
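Many sites issue a session cookie and reject clients that never send it back, so a crawler should carry cookies across requests the way a browser does. A minimal standard-library sketch (the fetch helper is illustrative):

```python
import http.cookiejar
import urllib.request

# A cookie jar stores whatever Set-Cookie headers the server sends;
# the opener replays them automatically on later requests.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

def fetch(url):
    # The first request receives the session cookie; subsequent
    # requests through the same opener include it automatically.
    return opener.open(url, timeout=10)
```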
Technique 4: Distributed crawling
There are also many GitHub repos for distributed crawling. The principle is to maintain a distributed queue that all machines in the cluster can effectively share.
Distributed crawling serves another purpose as well: for large-scale crawls, a single machine is heavily loaded and very slow, while multiple machines can be arranged with one master managing several slaves that crawl simultaneously.
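The master/worker idea can be sketched on one machine with a shared queue; in a real cluster the queue would live in an external store such as Redis so every machine can reach it. Everything below (the URLs, the worker count) is illustrative:

```python
import queue
import threading

# The shared queue stands in for a distributed queue; "slaves" drain it.
url_queue = queue.Queue()
results = []
lock = threading.Lock()

def slave():
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return  # queue drained, worker exits
        # A real worker would download and parse the page here.
        with lock:
            results.append(url)
        url_queue.task_done()

# The master seeds the queue with hypothetical URLs.
for i in range(10):
    url_queue.put(f"https://example.com/page/{i}")

# Three workers crawl "simultaneously".
workers = [threading.Thread(target=slave) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # prints 10: every URL processed exactly once
```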
Technique 5: Change the IP
In fact, what a site like Facebook recognizes is the IP, not the account. In other words, when a large amount of data must be captured continuously, a simulated login is pointless: from the same IP, switching accounts changes nothing. What matters is changing the IP.
One strategy web servers use against crawlers is to block the offending IP, or an entire IP range, outright. When an IP is banned, switch to another IP and continue. Common methods: proxy IPs, or a local IP database (an IP pool).
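Rotating through a proxy pool can be sketched as below; the proxy addresses are placeholders, and in practice they would come from a proxy provider or a local IP database:

```python
import random
import urllib.request

# Hypothetical proxy pool (203.0.113.0/24 is a documentation-only range).
PROXY_POOL = [
    "203.0.113.10:8080",
    "203.0.113.11:8080",
    "203.0.113.12:8080",
]

def fetch_via_proxy(url):
    # Each request goes out through a randomly chosen proxy,
    # so the exit IP the server sees keeps changing.
    proxy = random.choice(PROXY_POOL)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=10)
```

A banned proxy can simply be dropped from the pool while the rest keep working.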
Those are the top 5 anti-crawler techniques; we hope they help you crawl plenty of data.