
Pyramid of efficient scraping

· 8 min read
Gregory Komissarov
Engineering Enthusiast

We will go step by step through different approaches to retrieving data from the public internet, starting with the options that require more work from you but less cash investment, and ending with the opposite.

Intro

In the real world, to get data of the quality and quantity you need for business decisions, you will have to mix different approaches to keep it cost-efficient. So I recommend using the lower layers as much as possible and only moving to the higher ones when it's needed and still works for you from a margin perspective.

Speaking of pyramids, triangles, and signs: if you are an Illuminati humanoid reptile - your commercial could be here ;)

Layer 0 - Free data

This might come as a surprise, but some resources intentionally do not protect their data from scraping and even provide an API to make it easier. I can name a few reasons for such behavior:

  • They gain value from it. For e-commerce, sharing data about the SKUs (their goods) means somebody might get interested, mention the company, or convert into a client.

  • It might be too costly to handle all those HTTP requests, which are less efficient than a data API because they fetch more than needed on an unpredictable schedule; this can even lead to an outage. Also, the company would need to spend engineering time integrating 3rd-party solutions like reCAPTCHA, challenges, etc., which lowers the conversion of visitors into clients. Building a high-quality in-house solution would cost no less.

So, in such a case, just go and get the provided data using your favorite programming language or no-code automation.
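
For illustration, here is a minimal Python sketch of this "free data" case; the endpoint and JSON shape are placeholders, since every public API documents its own:

```python
# Minimal sketch: fetching data from a resource that exposes a public API.
# The endpoint and response shape below are placeholders - replace them
# with whatever the target resource actually documents.
import requests

API_URL = "https://api.example.com/products"  # placeholder endpoint

def fetch_products(page: int = 1) -> list[dict]:
    resp = requests.get(API_URL, params={"page": page}, timeout=30)
    resp.raise_for_status()          # fail loudly on HTTP errors
    return resp.json()               # assume the API returns JSON

if __name__ == "__main__":
    for item in fetch_products():
        print(item)
```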

Layer 1 - Dedicated server w/o proxy

The target resource (let's use this name for the website we want to scrape) does not provide free access to the data and counts the number of requests per minute coming from the same IP address. In this case, you need a big enough pool of unique ip_addr. We assume that at this stage the quality of the IPs does not matter. If the quantity of ip_addr you need is modest (tens to hundreds) and the target resource resets its counter daily - you can build a solution yourself:

  • Rent a generic VPS (a Linux virtual machine, even a mini PC would be enough) and rent ip_addr from the hosting provider or, if the provider can do static routing for you, rent ip_addr on an exchange like IPXO.
  • Install your favorite free and open-source proxy server like Squid or 3proxy.
  • You can assign a few hundred ip_addr to a single network interface on a Unix system and that will work fine. Then you can make Squid use the ip_addr that accepted the connection from the proxy client as the outgoing ip_addr - see the sketch below.

Here we go: if you use these ip_addr often, it should be cost-efficient, because with a proper hosting choice you get flat or extremely affordable traffic prices. Be aware that popular IaaS/PaaS cloud providers like AWS, GCP and Azure have very high traffic prices, so you'd better take a look at something from the dedicated-server epoch like hetzner or servers.com.
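
As an illustration of the last bullet, here is a minimal Python sketch that prints the `ip addr add` commands and a Squid config fragment mapping each incoming ip_addr to the same outgoing ip_addr. The interface name, netmask, and addresses are placeholders, and a real setup will need your own ACL and authentication rules on top:

```python
# Minimal sketch: generate the shell commands and the Squid config fragment
# that bind a list of rented ip_addr to one interface and make Squid reply
# from the same ip_addr the client connected to.
# The interface name, netmask and IP list are placeholders.
IFACE = "eth0"
IPS = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]  # your rented ip_addr

def ip_commands() -> str:
    # one "ip addr add" per address; run once at boot (e.g. from a systemd unit)
    return "\n".join(f"ip addr add {ip}/24 dev {IFACE}" for ip in IPS)

def squid_fragment() -> str:
    lines = ["http_port 3128"]
    for n, ip in enumerate(IPS):
        lines.append(f"acl myip_{n} myip {ip}")              # which local IP accepted the client
        lines.append(f"tcp_outgoing_address {ip} myip_{n}")  # reuse it as the outgoing IP
    return "\n".join(lines)

if __name__ == "__main__":
    print(ip_commands())
    print(squid_fragment())
```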

Layer 2 - DC proxy

The target resource requires more ip_addr than you are ready to maintain yourself and/or you have a specific GEO requirement. Some resources are available only in particular countries, or show a specific language and currency only to users from certain countries. Yes, they determine your country from the public internet ip_addr of your client. In that case you can rent a DC (datacenter) proxy. This is the cheapest proxy type. Usually, vendors price it per number of unique ip_addr per month, with replacement as an option. If you need to download media content - images, audio, video - a DC proxy is a good option because it gives a lower price per gigabyte, 24/7 availability, and stable bandwidth (download speed).
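
Using a rented proxy from code usually boils down to one setting; here is a minimal Python sketch where the proxy host, port, and credentials are placeholders you take from your vendor's dashboard:

```python
# Minimal sketch: sending requests through a rented DC proxy with the
# `requests` library. Host, port, and credentials are placeholders.
import requests

PROXY = "http://user:password@dc-proxy.example.com:8000"  # placeholder

session = requests.Session()
session.proxies = {"http": PROXY, "https": PROXY}

resp = session.get("https://httpbin.org/ip", timeout=30)
print(resp.json())  # should show the proxy's ip_addr, not yours
```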

Layer 3 - ISP and Resi proxies

The target resource checks the client's ip_addr characteristics and adjusts the allowed amount/rate of requests and/or shows a captcha. You need to improve the quality of your ip_addr. There are services, let's call them IP DBs, that keep records of all known addresses and provide you with some subjective characteristics for them (TODO: write an article about IP DBs). You can't change the characteristics of the ip_addr you got from a proxy provider, but you can get ip_addr that are more effective (have a higher Success Rate) than DC ones:

  • ISP Proxy - Vendors say they physically place their servers in home ISP networks, so the traffic really comes from the same networks real people use. Of course, there are techniques to make DC ip_addr look like ISP ones, which is cheaper and easier for vendors, but some IP DBs can catch it.
  • Static Residential Proxy - Vendors say these are real devices belonging to people: TVs, set-top boxes, washing machines, refrigerators. You get a fixed number of static IPs, and rotation is optional for an extra fee. Again, it might just be ISP.
  • Rotating Residential Proxy - These are real phones, laptops, PCs, etc. belonging to individuals who have installed some app. It can be anything from a VPN client to a computer game. Usually, this fact is mentioned in the app's terms of use, but nobody reads them carefully.
  • Mobile Proxy - Pretty much the same as rotating residential, but the ip_addr belongs to cell provider networks like T-Mobile, AT&T, etc. This means the device is connected to the internet via a 3G/4G/5G mobile data connection. As a result, the connection is unstable (bandwidth is poor and fluctuating) and the ip_addr session length is short.

The target resource can't really ban residential ip_addr, especially for a significant time, because they are shared among people and the real person/device behind them changes over time. That's why target resources have to use smart/agile limits and captchas instead of bans.
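
Since the choice between these proxy types usually comes down to Success Rate versus price, here is a minimal Python sketch of how you might measure SR for a couple of pools; the proxy URLs and the target URL are placeholders, and "success" here is simply an HTTP 200:

```python
# Minimal sketch: measuring Success Rate (SR) of proxy pools against a target.
# Proxy URLs and the target URL are placeholders; a real check would also
# detect captcha pages and soft blocks, not just the HTTP status code.
import requests

TARGET = "https://example.com/some-page"            # placeholder target
PROXIES = [
    "http://user:pass@isp-proxy.example.com:8000",   # placeholder proxies
    "http://user:pass@resi-proxy.example.com:8000",
]

def success_rate(proxy: str, attempts: int = 20) -> float:
    ok = 0
    for _ in range(attempts):
        try:
            r = requests.get(TARGET, proxies={"http": proxy, "https": proxy}, timeout=15)
            ok += r.status_code == 200
        except requests.RequestException:
            pass  # timeouts and connection errors count as failures
    return ok / attempts

for p in PROXIES:
    print(p, success_rate(p))
```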

Talking about proxy providers, here are the popular ones (mentioning only those I have used myself and am sure about the quality): soax, oxylabs, smartproxy, brightdata, rampageproxies.

Layer 4 - Resi + Scriptable WEB browser

The target resource uses client fingerprints and requires JavaScript rendering. Scrapers have been rotating ip_addr for a pretty long time, so target resources have adapted and nowadays use not only data about the client (the IP address) but metadata as well (TODO: write about fingerprints):

  • TCP fingerprint - allows identifying the OS of the TCP client (the proxy node).
  • SSL fingerprint - allows identifying the app/library making the request.
  • Web browser fingerprint - headers, JS browser APIs, fonts, video card, etc.
  • Timezone

So you have to use a scriptable WEB browser (controlled by an app/script you wrote), hide that it's managed programmatically (TODO: write about scriptable WEB browsers and their APIs), and make your fingerprints aligned and believable. This is the most time- and effort-consuming case: you do all the engineering yourself, starting with the scraper runtime, proxy choice and management, antibot protection bypass, and data organization and storage.
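
One common tool for this is Playwright. Here is a minimal sketch of fetching a rendered page through a residential proxy; the proxy address, locale, and timezone are placeholders, and this alone does not hide the automation - stealth tweaks (webdriver flags, canvas, etc.) are extra work on top:

```python
# Minimal sketch: fetching a page with a scriptable browser (Playwright),
# routed through a residential proxy, with locale/timezone aligned to the
# proxy's GEO. Proxy address and values are placeholders.
# Note: this does NOT hide automation by itself.
from playwright.sync_api import sync_playwright

PROXY = {"server": "http://resi-proxy.example.com:8000",
         "username": "user", "password": "pass"}   # placeholders

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True, proxy=PROXY)
    context = browser.new_context(
        locale="en-US",                 # keep these consistent with the exit IP
        timezone_id="America/New_York",
    )
    page = context.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()               # fully rendered DOM, after JS execution
    print(len(html))
    browser.close()
```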

Also, it might be too costly, because a WEB browser is slower (startup, shutdown, DOM rendering, etc.) and consumes more resources (CPU, MEM, etc.). If it's too costly or too slow for you, I see two options:

  • Combine the scriptable WEB browser with a basic HTTP client: fetch the first page with the WEB browser to get the auth token/cookies, then do all further requests with the basic HTTP client (see the sketch below).
  • Reverse-engineer the JS and reimplement its logic in your favorite programming language, calling the target with a plain HTTP client.
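
Here is a minimal sketch of the first option: warm up the session in Playwright, then hand its cookies over to `requests` for the cheap bulk requests. The URLs are placeholders, and some sites will also require matching headers (User-Agent, etc.) between the browser and the HTTP client:

```python
# Minimal sketch of the hybrid approach: get auth cookies in the scriptable
# browser, then reuse them from a plain HTTP client. URLs are placeholders.
import requests
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/landing")   # JS sets auth cookies here
    cookies = context.cookies()
    browser.close()

session = requests.Session()
for c in cookies:
    session.cookies.set(c["name"], c["value"], domain=c["domain"])

# all further (cheap, fast) requests go through the plain HTTP client
resp = session.get("https://example.com/api/items?page=1", timeout=30)
print(resp.status_code)
```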

Layer 5 - Scraping API

You found that it's too costly or too tough to bypass the target's antibot protection. But you still need the content.

You can purchase a generic Scraping API; it might be named Unblocker or Unlocker depending on the vendor. The client passes the target URL to it and might adjust some options in the form of HTTP params; the vendor does the rest - fingerprints, JS rendering, a proper ip_addr pool, etc. - and returns the HTML document to you. Product examples: Web Unlocker from Bright, Web Unblocker from Oxy, ScrapingBee API, ScrapingOps API, etc. It's more costly than the previous layers, but you don't have to care about scrapers, proxies, and bypassing. You just need to organize and store the data properly.
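
Most of these products boil down to a single HTTP call; here is a minimal sketch, where the endpoint, parameter names, and auth scheme are hypothetical - check your vendor's docs for the real ones:

```python
# Minimal sketch of calling a generic Scraping API / Unblocker.
# The endpoint, parameter names and auth scheme are hypothetical placeholders;
# every vendor documents its own (URL params vs POST body, proxy-style auth, etc.).
import requests

SCRAPING_API = "https://api.scraping-vendor.example.com/v1/fetch"   # hypothetical
API_KEY = "YOUR_API_KEY"

resp = requests.get(
    SCRAPING_API,
    params={
        "url": "https://example.com/product/123",  # the target page you need
        "render_js": "true",                       # hypothetical option names
        "country": "us",
    },
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=120,   # unblockers can be slow: they run browsers behind the scenes
)
resp.raise_for_status()
html = resp.text   # the vendor returns the rendered HTML document
print(html[:200])
```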

Layer 6 - Data API

The generic Unblocker/API isn't good enough because of its speed or SR (Success Rate). Also, you might want not the pages themselves but extracted, structured data (which might require fetching the data with different approaches and joining it). There are target-focused products for such cases; sometimes they are called Data APIs.

Layer 7 - Data itself or delegation

Data gathering might not be the core business of your company, and you may want to delegate it. Then you find companies that sell already scraped and organized data, companies that do custom development/scraping, or freelancers.