Scriptable and headless WEB browsers
I want to clarify the meaning and value of Scriptable, Headless, and Antidetect/Stealth WEB browsers for data management and retrieval tasks.
… No, you must believe me. It was a horseman, a dead one. Headless!
I have respect and deep appreciation for the developers of the solutions discussed below. It's very cool that such solutions exist, that they are open source, and that we have the opportunity to choose. Any criticism is my subjective experience.
TL;DR
If you are just getting started with Selenium, Puppeteer, and Playwright and you don't have deep knowledge of or a large code base with one of them - pick Playwright for its simplicity, speed, and support for multiple browsers and languages.
History
WEB browsers were dominant for a long time as the main tool to interact with the WEB; nowadays mobile apps compete with them, but WEB browsers are still popular.
At some point WEB sites turned into complicated applications with AJAX, user-generated content, and SaaS federated apps. To do end-to-end tests, humanity needed to run them in an automated way, faster and cheaper than human beings can. The Selenium framework appeared and allowed managing popular WEB browsers programmatically with Java and Python.
- 2004 according to the Selenium repo. It is hard to accept how old it is and the fact that the codebase was using CVS, then SVN, and then Git. Google Chrome did not exist in those days and Selenium initially supported only Firefox.
- There was no way to control the browser directly from outside; there was no interface or browser API for that (excluding universal desktop automation and testing tools). Selenium was the first solution.
- Selenium launched a middleman Java library, which launched the browser, loaded extensions into it, and listened on an HTTP interface to accept commands from the client app (the tester).
- Over time the protocol was standardized - JSONWire.
- Support for other engines was added: ChromeDriver, OperaDriver.
- There were feature gaps between browser engines.
- The WebDriver standard appeared, with built-in support in WEB browsers. From this point you can operate a popular WEB browser directly via a single standard HTTP API without additional libraries and adaptors. In fact, Selenium was still used on the client side.
- Aug 2010 first mention in Chromium git history
- 2017 The WebDriver protocol was added to Firefox ver. 52
- 2018 Selenium 4 was introduced (not stable) with WebDriver support
- 2019 First draft version of the standard was published
- CDP (Chrome DevTools Protocol) appeared as a specification and interface in the Chromium engine; this time it's a WebSocket-based protocol (a minimal raw-CDP sketch follows at the end of this section).
- May 2017 Puppeteer opensource lib was created - maintained by the Chrome DevTools team
- Nov 2019 Playwright opensource lib was created - maintained by Microsoft
Summarising, the mentioned solutions allow controlling a WEB browser from another application and making it follow a predefined scenario (algorithm/script) to test WEB services or the browser itself, or to scrape public data. Such WEB browsers, controlled by applications rather than by a human, are called scriptable WEB browsers.
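To make the CDP part less abstract, here is a minimal sketch of driving Chromium over the raw DevTools WebSocket, without any client library on top. It assumes Chromium is already running with --remote-debugging-port=9222 and uses the third-party websockets package; the target URL is just a placeholder.
import asyncio
import json
import urllib.request
import websockets  # third-party: pip install websockets
async def navigate_via_cdp():
    # Chromium must already be running, e.g.: chromium --headless --remote-debugging-port=9222
    # /json/list returns the debuggable targets together with their WebSocket URLs.
    targets = json.load(urllib.request.urlopen("http://127.0.0.1:9222/json/list"))
    ws_url = targets[0]["webSocketDebuggerUrl"]
    async with websockets.connect(ws_url) as ws:
        # Every CDP command is a JSON message with an id, a method and params.
        await ws.send(json.dumps({"id": 1, "method": "Page.navigate",
                                  "params": {"url": "https://example.com"}}))
        print(await ws.recv())  # the reply carries the same id
asyncio.run(navigate_via_cdp())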
What is headless?
The way to launch a WEB browser without a GUI (graphical user interface). The browser can load pages, render a DOM tree, and interpret JS while showing nothing on a display. It has some advantages (a minimal launch example follows the list):
- Works faster because some steps, like rendering to the screen, are skipped.
- It consumes fewer resources (CPU and memory).
- You don't need OS capabilities for GUI apps (Xvfb was used to emulate a display for Selenium tests in headful mode on Linux servers). As a result, you can run tests on typical Linux boxes and in Docker using only command line and network interfaces.
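As a minimal illustration (assuming Playwright and its browsers are installed), a headless launch with the sync API looks roughly like this; the URL is a placeholder:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # no window, nothing drawn on a display
    page = browser.new_page()
    page.goto("https://example.com")  # the page loads, the DOM is built, JS runs
    print(page.title())
    browser.close()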
What is a stealth/antidetect browser?
This is a patched version of a popular open-source WEB browser engine with support for changeable fingerprints and profiles that can be shared among multiple users. These give you the next benefits and usage patterns (a small Playwright sketch follows the list):
- You can manage multiple accounts from a single WEB browser in services that don’t like it.
- You can share an account among a group of people and use it simultaneously. WEB service will treat you as one person.
- When you are using a popular generic WEB browser, sites can easily identify you using a composite fingerprint even if you disallow cookies, use incognito, etc. An antidetect browser helps to avoid that. Warning: antidetect/stealth WEB browsers can't guarantee that your identity will not be leaked; they just help you and decrease the chances.
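Real antidetect browsers patch the engine itself, but a few surface-level tweaks are possible from plain automation code too. Below is a hedged sketch with Playwright that only overrides a couple of JS-visible properties (user agent, locale, navigator.webdriver); it is nowhere near a complete stealth setup, just a taste of the idea.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Each context is an isolated "profile": separate cookies, storage, cache.
    context = browser.new_context(
        user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        locale="en-US",
        viewport={"width": 1366, "height": 768},
    )
    # Runs before any page script and hides the most obvious automation flag.
    context.add_init_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    page = context.new_page()
    page.goto("https://example.com")
    browser.close()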
Runtime
If, to solve a business problem, you need to run dozens or hundreds of browsers, you need a solution for that. Keeping them running constantly can be redundant and expensive, and from my experience there can be memory leaks and unpredictable browser behavior after long uptime.
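One simple mitigation is to recycle the browser process after a fixed number of tasks instead of keeping it alive forever; a rough sketch (the per-browser limit is arbitrary):
from playwright.sync_api import sync_playwright
TASKS_PER_BROWSER = 50  # arbitrary limit, tune it for your workload
def run_tasks(urls):
    with sync_playwright() as p:
        browser, used = None, 0
        for url in urls:
            # Restart the browser after N tasks to drop leaked memory and drifted state.
            if browser is None or used >= TASKS_PER_BROWSER:
                if browser:
                    browser.close()
                browser = p.chromium.launch(headless=True)
                used = 0
            page = browser.new_page()
            page.goto(url)
            # ... scrape / assert here ...
            page.close()
            used += 1
        if browser:
            browser.close()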
WebDriver solutions
I don't want to spend your time mentioning all the limitations and bugs of Selenium Grid/Hub, because at the beginning it was the only way to split the workload among browsers, but not the best one. After k8s gained popularity, open source (Callisto, Selenoid) and proprietary (Moon) solutions appeared for running browsers in it and balancing the load.
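Whichever runtime you pick (Grid, Selenoid, Callisto, Moon), from the client side it usually looks like a single remote WebDriver endpoint. A minimal sketch (the hub URL is a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--headless=new")
# The hub/balancer decides on which node the actual browser process is started.
driver = webdriver.Remote(
    command_executor="http://selenium-hub.example.internal:4444/wd/hub",
    options=options,
)
driver.get("https://example.com")
print(driver.title)
driver.quit()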
CDP
I don't know of service-level balancers and k8s operators for the CDP protocol; if you know of any - please let me know. My no-brainer is to write a simple k8s operator which will control WEB browsers' lifecycle and load balance the HTTP/WebSocket protocol.
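The consumer side of such a setup is already covered by the libraries: both Puppeteer and Playwright can attach to an already running browser by its CDP endpoint, so a hypothetical operator/balancer only has to hand out endpoints. A sketch with Playwright (the endpoint URL is a placeholder):
import asyncio
from playwright.async_api import async_playwright
async def main():
    async with async_playwright() as p:
        # The browser itself runs elsewhere (e.g. in a pod behind the balancer);
        # we only attach to the DevTools endpoint it exposes.
        browser = await p.chromium.connect_over_cdp("http://cdp-balancer.example.internal:9222")
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()
asyncio.run(main())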
SaaS
If you don't want to care about the runtime, or it works for you from a cost efficiency perspective, try a SaaS which provides you managed WEB browsers, like https://www.browserstack.com/
Comparison matrix
Selenium
- Protocol: WebDriver
- Web browsers: Chromium based(Chrome, Opera, Edge), Firefox, Safari, Internet Explorer.
- Official languages support: Java first. Java, Python, C#, JS/TS, Ruby, Perl, PHP.
- Runtime cooking help: 3rd party webdriver_manager to install browser binaries.
- Dev experience: Sync API, no network waiters. To introspect network traffic a 3rd party MITM proxy is needed. Good integration with TestNG.
Puppeteer
- Protocol: CDP
- Web browsers: Chromium based(Chrome, Opera, Edge) + Firefox and Webkit - patched.
- Official languages support: JS/TS.
- Runtime cooking help: Built-in install method to get a compatible binary.
- Dev experience: Async API, built-in network introspection.
Playwright
- Protocol: CDP for Chromium; patched Firefox and WebKit builds are driven via Playwright's own similar protocols.
- Web browsers: Chrome, Webkit, Firefox
- Official languages support: JS first. Java, Python, C#, JS/TS.
- Runtime cooking help: Built-in install method to get a compatible binary and dependencies.
- Dev experience: Sync and async API, built-in network introspection, good integration with pytest.
Demo
To demonstrate the difference I'll show simple and dirty (not production ready) code for Selenium and Playwright solving the following real task: we want to scrape a site which uses JavaScript and cryptography on the client side to create access tokens. We don't want to reverse this JS code, so we will use a WEB browser to get enough tokens and reuse them in a basic HTTP client to get the data concurrently.
Selenium and sync Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import typing as t
import time
import asyncio
import datetime
import threading
from multiprocessing import JoinableQueue, Array, Value, Process, Pool # noqa
from seleniumwire import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import TimeoutException, WebDriverException
from seleniumwire.thirdparty.mitmproxy.exceptions import TcpDisconnect
from iterators import TimeoutIterator
from furl import furl
import pprint
pp = pprint.PrettyPrinter(indent=4).pprint
class ScrappingTask:
    """Stub: the real class stores the URL template captured from the WEB browser."""
    def __init__(self, *args):
        self.args = args
class APITokensMiner:
"""Simplified version of the user type that is responsible for the secure tokens gathering.
In real code there would be more boilerplate with selenium manager, chrome cmd options and exceptions handling
"""
def __init__(self, wd):
self.wd = wd
def get_api_token(self):
try:
self.wd.get('https://target.we.want')
except (WebDriverException, TcpDisconnect, TimeoutException) as e:
print("get_token_and_fprints / WD exception: " + str(e))
return False
        # Here would be a sleep which will block the Python process no matter what kind it would be:
# * time.sleep
# * driver.manage().timeouts().implicitlyWait
time.sleep(1)
        # We are able to access HTTP requests and responses because of seleniumwire - a MITM proxy.
        # The negative consequence - it's much slower than DevTools and breaks our TLS session.
for request in self.wd.requests:
if request.response and request.url.startswith('https://pattern.we.need'):
return ScrappingTask(
furl(request.url).remove('xyz'),
)
return None
def token_consumer_async(tasks_it, items_to_scrape, lcp_constr, term_flag):
""" This function will be called in the main thread. So we can't block here
for a long time, that's why I'm running "proc_butch_of_items" in a pthreads.
"""
buffer_wait_threshold = 35 # in seconds
    buffer_size_threshold = 600  # in amount of coroutines == HTTP calls
req_per_token = 29
lookup_service = lcp_constr[0](*lcp_constr[1], **lcp_constr[2])
it = TimeoutIterator(tasks_it, timeout=5)
ts_it_wait_start = datetime.datetime.utcnow()
tasks = []
subproc = []
threads = []
for scrape_task in it:
t_wait_sec = (datetime.datetime.utcnow() - ts_it_wait_start).seconds
        if scrape_task != it.get_sentinel():  # a real value instead of the timeout sentinel
for _ in range(req_per_token):
try:
item = items_to_scrape.pop()
except IndexError:
term_flag.value = 1 # noqa
for p in subproc:
p.join()
for tr in threads:
tr.join()
return
tasks.append((item, scrape_task))
if t_wait_sec > buffer_wait_threshold or len(tasks) >= buffer_size_threshold:
thread = threading.Thread(target=proc_butch_of_items, args=(
tasks.copy(),
lookup_service
))
threads.append(thread)
thread.start()
tasks = []
ts_it_wait_start = datetime.datetime.utcnow()
def proc_butch_of_items(scrape_tasks, lookup_service):
"""Thanks to asyncio.gather we will be blocked till concurently all the
tassk will finish.
"""
async_tasks = []
new_loop = asyncio.new_event_loop()
asyncio.set_event_loop(new_loop)
for scrape_task in scrape_tasks:
async_tasks.append(lookup_service.lookup_item(scrape_task))
rv = new_loop.run_until_complete(asyncio.gather(*async_tasks))
for result in rv:
print(result)
return
def tokens_producer(lookup_service_constr, term_flag):
    # Instantiate the miner inside the child process and keep mining tokens
    # until the consumer raises the termination flag.
    lookup_service = lookup_service_constr[0](*lookup_service_constr[1], **lookup_service_constr[2])
    while not term_flag.value:
        scrape_task = lookup_service.get_api_token()
        if scrape_task:
            yield scrape_task
def producer_wrap(queue, lookup_service_constr, term_flag):
"""Wrapper fn to use queue for IPC for multiple POSIX process.
"""
for fp in tokens_producer(lookup_service_constr, term_flag):
queue.put(fp)
def consumer_wrap_async(queue, items_to_scrape: t.List[str], lookup_service_constr, term_flag):
"""Wrapper fn to use queue for IPC for multiple POSIX process.
"""
def queue_iter():
while True:
rv = queue.get()
yield rv
token_consumer_async(queue_iter(), items_to_scrape, lookup_service_constr, term_flag)
def job_multiproc(items_to_scrape: t.List[str],
lookup_service_constr, term_flag=None, browserproc=8):
"""Here we are mixing parllel/multiroc execution of producsers - which
mining secure tokens with Selenium and WEB browsers.
We accumulating tokens with the queue and run scrappers awating group of
concurrent HTTP reqs.
* We need term_flag to gracefuly shutdown shutfown all the process.
* I pass lookup_service_constr to initiate it in child process. AFAIK you
can't pass initiated user types in python with multiprocessing.
I hope there is better and smart way to do this ).
"""
threds_num = browserproc
producers = []
term_flag = term_flag or Value('i', 0)
queue = JoinableQueue()
for _ in range(threds_num):
p = Process(target=producer_wrap, args=(
queue,
lookup_service_constr,
term_flag))
producers.append(p)
p.start()
consumer_wrap_async(
queue,
items_to_scrape,
lookup_service_constr,
term_flag)
for p in producers:
p.join()
return
def main():
    # This is simplified code. In a real project you need to build Chrome cmd opts to make it work headless + proxy + other tweaks.
    # And I constantly have issues with ChromeDriverManager
items_to_scrape = []
driver = webdriver.Chrome(ChromeDriverManager().install())
lookup_service_constr = [APITokensMiner, [], {"wd": driver}]
job_multiproc(items_to_scrape, lookup_service_constr, term_flag=None)
if __name__ == '__main__':
main()
Playwright and async Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import typing as t
import uuid
import asyncio
from playwright.async_api import async_playwright, Playwright, TimeoutError as PlaywrightTimeoutError
from furl import furl
import pprint
pp = pprint.PrettyPrinter(indent=4).pprint
PROXY = {
"server": "http://proxy.soax.com:5000",
"username": "package-123-sessionid-",
"password": "iddqd",
}
class ScrappingTask:
    """Stub: the real class also stores the captured URL template, headers and proxy."""
    is_collected = False
    def update(self, *args):
        self.is_collected = True
class ScrappingResult:
    """Stub for the scraped item data."""
    pass
def get_items_to_scrape():
""" Item data obj
"""
pass
def cmpl_proxy_url(task):
""" Return furl URL
"""
pass
async def scrape_target_item(item, url_tmpl: str, proxy_url: str, headers: t.Dict[str, str]) -> ScrappingResult:
""" Let's assume we got url_tmpl, proxy_url and headers from the WEB
browser. Here we will do the heavy lifting and fetch needed data repeating
    ip_addr and headers from the WEB browser but with higher concurrency.
return result
"""
pass
async def get_api_token(playwright: Playwright, playwright_proxy) -> ScrappingTask:
    scrp_task = ScrappingTask()  # mutated by the request handler below
def handle_req(req):
if req.url.startswith('https://pattern.we.need'):
scrp_task.update(
furl(req.url).remove('xyz'),
{k: req.headers[k] for k in req.headers.keys() - {'connection', 'proxy-authorization'}},
playwright_proxy,
True
)
wb = await playwright.firefox.launch(proxy=playwright_proxy, headless=True)
    # To use another WEB browser engine you need to change only the launch line above.
    # Stealth tweaks might be applied here if needed.
context = await wb.new_context()
page = await context.new_page()
page.on("request", handle_req)
try:
await page.goto("https://target.we.want", wait_until="networkidle")
except PlaywrightTimeoutError:
        if not scrp_task.is_collected:
print('PlaywrightTimeoutError w/o gathered API token')
except Exception as e:
print('Playwright failed to load target page:', str(e))
await wb.close()
return scrp_task
async def tokens_gen_wrap(proxy_tmpl, queue):
async with async_playwright() as playwright:
while True:
proxy_playwright = proxy_tmpl.copy()
proxy_playwright['username'] += uuid.uuid4().hex # get unique session id
t = await get_api_token(playwright, proxy_playwright)
if t.is_collected:
await queue.put(t)
async def proc_items(items, scrp_t_queue, chunk_size=29):
for i in range(0, len(items), chunk_size):
chunk = items[i:i+chunk_size]
t = await scrp_t_queue.get()
proxy_url = cmpl_proxy_url(t)
        tasks = [scrape_target_item(item, t['url'], proxy_url, t['headers']) for item in chunk]
rv = await asyncio.gather(*tasks)
for result in rv:
print(f"Here the result data:{result}")
async def launch_job(items_to_scrape: t.List[str], pool_size=2):
scrp_t_queue = asyncio.Queue()
tokens_gen_workers = [asyncio.create_task(tokens_gen_wrap(PROXY, scrp_t_queue)) for _ in range(pool_size)]
token_consume_worker = asyncio.create_task(proc_items(items_to_scrape, scrp_t_queue))
await asyncio.gather(*tokens_gen_workers, token_consume_worker)
async def main():
items_to_scrape = get_items_to_scrape()
await launch_job(items_to_scrape)
asyncio.run(main())