An Analysis of 5 Anti-Crawler Techniques

  Nowadays, data collection and data analysis play a key role in enterprise development and policy formulation, so many companies use crawler technology to gather data. The goal of a crawler is to obtain data on a large scale and over a long period of time. However, if a single IP is always used to crawl a website, large-scale centralized access may eventually be rejected by the server, or the site may demand a verification code after the crawler has been running for a while. Even when multiple accounts are rotated, the verification-code prompt can still appear. Below, the Tomatoip engineers analyze 5 anti-crawler techniques and explain how to solve and avoid these problems.

  Technique 1: Set the download waiting time / download frequency

  Large-scale centralized access puts significant pressure on the server; a crawler can drive up the server load in a short time. The key point is to control the range of the download waiting time: if the wait is too long, it cannot satisfy large-scale crawling within a short period; if it is too short, access is very likely to be denied.

  1. In the earlier “Get HTML from URL” method, the socket timeout and connect timeout were set in the httpGet configuration. The exact durations are not absolute; they mainly depend on how strictly the target website polices crawlers.

  2. In addition, the scrapy crawler framework provides a dedicated download-waiting-time parameter, download_delay, which can be set either in settings.py or directly on the spider.
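
  For example, a minimal settings.py sketch for the scrapy case (the concrete values below are illustrative assumptions and should be tuned to the target site):

      # settings.py (scrapy) -- illustrative values, adjust for the target site
      DOWNLOAD_DELAY = 2                   # wait roughly 2 seconds between requests
      RANDOMIZE_DOWNLOAD_DELAY = True      # vary each delay between 0.5x and 1.5x of DOWNLOAD_DELAY
      DOWNLOAD_TIMEOUT = 15                # give up on a request after 15 seconds
      CONCURRENT_REQUESTS_PER_DOMAIN = 1   # avoid bursts against a single domain

      # Alternatively, set the delay per spider:
      # class MySpider(scrapy.Spider):
      #     name = "my_spider"
      #     download_delay = 2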

  Technique 2: Modify User-Agent

  The most common approach is to disguise the crawler as a browser by modifying the User-Agent.

  The User-Agent is a string containing browser information, operating system information, and so on, sent as part of the HTTP request headers. The server uses it to determine whether the current visitor is a browser, a mail client, or a web crawler. The User-Agent can be viewed in request.headers; how to analyze the data packet and inspect its User-Agent and other information was covered in a previous article.

  Specifically, change the value of the User-Agent to that of a real browser. You can even set up a User-Agent pool (a list, array, or dictionary will do) that stores multiple “browsers” and randomly pick one User-Agent for each request, so that the User-Agent keeps changing and the crawler avoids being blocked.
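
  A minimal sketch of such a User-Agent pool using Python and the requests library (the User-Agent strings and the target URL are placeholders):

      import random
      import requests

      # User-Agent pool: several "browsers" to rotate through (placeholder strings)
      USER_AGENTS = [
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
          "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
          "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
      ]

      def fetch(url):
          # pick a random User-Agent for every request so it keeps changing
          headers = {"User-Agent": random.choice(USER_AGENTS)}
          return requests.get(url, headers=headers, timeout=10)

      response = fetch("https://example.com")   # hypothetical target URL
      print(response.status_code)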

  Technique 3: Set cookies

  Cookies are pieces of (often encrypted) data stored on the user's device, and some websites use them to identify users. If the same visitor keeps sending requests very frequently, the website is likely to notice and suspect a crawler; it can then identify the visitor through the cookie and refuse further access. You can either customize the cookie policy (to handle the cookie-rejected problem by refusing to write cookies) or disable cookies altogether.

  1. Customize the cookie policy (to handle the cookie-rejected problem by refusing to write cookies). The idea of the setting is the same as before, but because the HttpClient 4.3.1 component differs from earlier versions, the way it is written also differs.

  2. Disable cookies. By disabling cookies, the client actively prevents the server from writing them, which keeps websites that may use cookies to identify crawlers from blocking us. In scrapy, set COOKIES_ENABLED = False, which leaves the cookies middleware disabled so no cookies are sent to the web server.
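
  A minimal sketch of the corresponding scrapy settings:

      # settings.py (scrapy)
      COOKIES_ENABLED = False   # disable the cookies middleware; no cookies are stored or sent

      # Per-request alternative inside a spider: skip cookie handling for a single request
      # yield scrapy.Request(url, meta={"dont_merge_cookies": True})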

  Technique 4: Distributed crawling

  There are also many GitHub repos for distributed crawling. The principle is to maintain a distributed queue that all machines in the cluster can share effectively.

  Distributed crawling also serves another purpose: in large-scale crawling, a single machine carries a heavy load and runs very slowly, whereas multiple machines can be organized with one master managing several slaves that crawl at the same time.
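
  As a rough sketch of the shared-queue idea, here is a minimal master/slave pair built on a Redis list that every machine in the cluster connects to (the Redis host, queue name, and fetch logic are assumptions for illustration; in practice many projects reach for ready-made frameworks such as scrapy-redis):

      import redis
      import requests

      # Every machine in the cluster connects to the same Redis instance (host is an assumption)
      r = redis.Redis(host="redis.example.internal", port=6379, db=0)

      QUEUE_KEY = "crawl:queue"   # hypothetical name of the shared URL queue

      def master_seed(urls):
          # The master pushes the start URLs onto the shared queue
          for url in urls:
              r.lpush(QUEUE_KEY, url)

      def slave_worker():
          # Each slave blocks until a URL is available, then fetches it
          while True:
              _, url = r.brpop(QUEUE_KEY)
              resp = requests.get(url.decode(), timeout=10)
              print(url.decode(), resp.status_code)
              # Newly discovered links could be pushed back with r.lpush(QUEUE_KEY, link)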

  Technique 5: Modify IP

  In fact, what a site like Facebook recognizes is the IP, not the account. In other words, when a large amount of data needs to be captured continuously, simulated login is meaningless: as long as the requests come from the same IP, switching accounts changes nothing. The key is to change the IP.

  One of a web server's strategies against crawlers is to block the IP, or even the entire IP range, outright. When an IP is banned, switch to another IP and continue. Common methods: proxy IPs, or a local IP database (an IP pool).
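
  A minimal sketch of rotating proxy IPs with the requests library (the proxy addresses are placeholders; in practice they would come from a proxy provider or a local IP pool):

      import random
      import requests

      # Proxy pool: placeholder addresses, normally filled from a proxy provider or a local IP database
      PROXY_POOL = [
          "http://203.0.113.10:8080",
          "http://203.0.113.11:8080",
          "http://203.0.113.12:8080",
      ]

      def fetch_via_proxy(url):
          # Route the request through a randomly chosen proxy so the visible IP keeps changing
          proxy = random.choice(PROXY_POOL)
          proxies = {"http": proxy, "https": proxy}
          try:
              return requests.get(url, proxies=proxies, timeout=10)
          except requests.RequestException:
              # If the proxy is banned or unreachable, drop it and try another one
              return None

      response = fetch_via_proxy("https://example.com")   # hypothetical target URL
      if response is not None:
          print(response.status_code)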

  The above are the top 5 anti-crawler techniques; I hope they help you crawl large amounts of data smoothly.