We will find out why HTTP proxies are called forward proxies, why the CONNECT method is not always used, and how to get unique DEBUG info from a proxy.
Plain HTTP explanation
In the article, we have already discussed how the TLS handshake goes after the HTTP CONNECT request. But let me show how you can talk to an HTTP server via a proxy without an additional CONNECT request.
$ curl -v http://ifconfig.co/json -x "http://deadbeef:;;;;@proxy.soax.com:9000"
* Host proxy.soax.com:9000 was resolved.
* IPv6: (none)
* IPv4: 23.109.148.44, 23.109.157.92, 23.109.90.116
* Trying 23.109.148.44:9000...
* Connected to proxy.soax.com (23.109.148.44) port 9000
* using HTTP/1.x
* Proxy auth using Basic with user 'deadbeef'
> GET http://ifconfig.co/json HTTP/1.1
> Host: ifconfig.co
> Proxy-Authorization: Basic bm90LXRoaXMtdGltZTp3aWZpO3VzOzs7
> User-Agent: curl/8.10.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* Request completely sent off
< HTTP/1.1 200 OK
... the rest is the target server's reply
Pay attention to the third line from the bottom: * Request completely sent off. There is no CONNECT call and no reply to it anywhere in the output. This is correct behavior according to the specification (RFC 7231, section 4.3.6, CONNECT). The traffic here is not encrypted in transit with TLS, so the proxy server can simply forward the HTTP request as it is and pass the reply back. This is exactly how it worked decades ago: proxy servers could reject requests based on their content and serve replies from a cache for acceleration, without any tweaks or consent on the client side.
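To make the difference visible, here is a minimal Python sketch of the request a client hands to a forward proxy for a plain http:// URL. It only builds the bytes; nothing is sent. The function name build_forward_request and the header set are my own illustration, not something taken from curl or from a particular proxy provider.

```python
import base64
from urllib.parse import urlsplit


def build_forward_request(url: str, user: str, password: str) -> bytes:
    """Build a plain-HTTP GET in absolute-form, as a forward proxy expects it."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        # https:// targets go through a CONNECT tunnel instead.
        raise ValueError("this sketch only covers http:// URLs")
    # Basic proxy auth is base64("user:password").
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    lines = [
        f"GET {url} HTTP/1.1",  # full URL in the request line, not just the path
        f"Host: {parts.netloc}",
        f"Proxy-Authorization: Basic {token}",
        "Accept: */*",
    ]
    # A blank line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    print(build_forward_request("http://ifconfig.co/json", "user", "pass").decode())
```

The request line carries the whole URL, which is what lets the proxy see, filter, or cache the request without any tunnel.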
Nowadays we use TLS, and the traffic is encrypted. This forces the proxy client to first send a CONNECT request to the proxy server to create a tunnel, instead of just sending the HTTP request for the target. The nice thing is that we can use this step for our own needs.
To complete the picture, there are also NGFWs (next-generation firewalls) and CASBs (cloud access security brokers), which decrypt employees' traffic to protect the business. Some proxy providers decrypt your traffic as well (this requires you to install their root certificate); they need it to adjust your requests and raise the success rate of bypassing antibot solutions.
Custom HTTP CONNECT request and response headers
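For comparison with the plain HTTP case, here is a rough Python sketch of what a CONNECT request looks like on the wire. The helper name build_connect_request and its extra_headers parameter are hypothetical. Real clients such as curl add a few more headers, so treat this as an illustration of the shape rather than an exact copy.

```python
import base64


def build_connect_request(host, port, user, password, extra_headers=None):
    """Build the CONNECT request that asks a proxy to open a tunnel to host:port."""
    authority = f"{host}:{port}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    lines = [
        f"CONNECT {authority} HTTP/1.1",  # target is host:port, no scheme or path
        f"Host: {authority}",
        f"Proxy-Authorization: Basic {token}",
    ]
    # Optional headers addressed to the proxy itself, passed through as given.
    for name, value in (extra_headers or {}).items():
        lines.append(f"{name}: {value}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    print(build_connect_request("ifconfig.co", 443, "user", "pass").decode())
```

Once the proxy answers with a 2xx status, the client starts the TLS handshake with the target over the same TCP connection.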
Working with rotating residential proxy services is a bit tricky from the client's perspective. You never know which proxy node you got after a successful connection. Of course, it should match the GEO filters you provided, but what exact IP address and ASN did you get? Also, if the target resource uses GEO DNS, you don't know the exact IP address and location of the server that handles your request. You can query one of the IP checkers to find the proxy node's IP address, but that will be a new connection and a new node. Sticky sessions help in such cases, but they are not the ultimate solution, because the node might go offline and the proxy service will find a new one for you (some proxy providers have sticky IP features to prevent this, but let's cover that another time). So it is beneficial to know which IP address you got right after connecting, to be able to skip the node if needed, and to know the resolved target IP address. A proxy service provider can solve both problems, because it has both answers.
You might ask how all this is related to the HTTP CONNECT request. The HTTP reply to this request is the only interaction step at which the proxy provider has already found the node and the node has successfully connected to the target resource (so all the metadata is collected). At the same time, all traffic after this reply in this TCP session goes between the proxy client and the target resource without any changes by the proxy service. The only thing the proxy service can still do is terminate the TCP session.
Some proxies let you define which data you want to get back from the proxy; they also send you a Request-ID so you can easily find your requests in the logs when debugging. Here is an example with SOAX and cURL:
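Since the CONNECT reply is the one place where the proxy can talk to us, it helps to see how a client could separate that reply from the tunnel traffic that follows. Below is a small Python sketch, assuming the raw bytes have already been read from the socket; parse_connect_reply is a hypothetical helper name and the parsing is deliberately simplistic (no folded headers, no repeated header names).

```python
def parse_connect_reply(raw: bytes):
    """Split a proxy's CONNECT reply into (status, headers, leftover bytes)."""
    # The header block ends at the first empty line; everything after it
    # already belongs to the tunnel and must not be interpreted as HTTP.
    head, sep, leftover = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("reply header block is incomplete")
    lines = head.decode("iso-8859-1").split("\r\n")
    # Status line looks like: HTTP/1.1 200 OK
    _version, status, *_reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return int(status), headers, leftover


if __name__ == "__main__":
    sample = b"HTTP/1.1 200 OK\r\nNode-Ip: 196.179.13.115\r\n\r\n"
    print(parse_connect_reply(sample))
```

With a TLS target the leftover part is normally empty, because the target waits for the client's handshake before sending anything.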
$ curl -v https://ifconfig.co/json -x "http://package-123:deadbeef@proxy.soax.com:5000" --proxy-header "Respond-With: uid,ip,country,region,city,isp"
... skipping data sent to the proxy
< HTTP/1.1 200 OK
< Node-Asn: 37693
< Node-City: sousse
< Node-Country: tn
< Node-Ip: 196.179.13.115
< Node-Isp: ooredoo tunisia
< Node-Region: sousse governorate
< Node-Uuid: 25974425-7bd0-4342-918b-9a835096ce5c
< Request-Uid: c2323d91-3790-bf67-460c-2a5daa5c9d14
As you can see, the proxy node's IP address, GEO, and ASN are all available right after the connection. So you can decide whether the node is suitable faster, without additional requests. The pair of operations "find out information about a node" and "make a request through a node with these properties" is now atomic.
I've created examples of how to send these custom headers and read the replies in programming languages popular for scraping: Golang, Python, and JavaScript. You can find them in the GitHub examples repo. That was the original motivation for writing this article: to save it somewhere so I don't have to search for it next time :)
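As a sketch of how such a decision might look in code, here is a small Python check. It assumes the CONNECT reply headers were already collected into a dict with lower-cased names; the header names follow the sample output above, while node_is_suitable and the allow/block lists are purely hypothetical policy, not part of any provider's API.

```python
def node_is_suitable(headers, allowed_countries, blocked_asns=frozenset()):
    """Decide from CONNECT reply headers whether to keep using this node."""
    country = headers.get("node-country", "").lower()
    asn = headers.get("node-asn", "")
    if country not in allowed_countries:
        return False  # node is outside the GEO we asked for
    if asn in blocked_asns:
        return False  # we decided to avoid this network
    return True


if __name__ == "__main__":
    reply_headers = {
        "node-asn": "37693",
        "node-country": "tn",
        "node-ip": "196.179.13.115",
    }
    if node_is_suitable(reply_headers, allowed_countries={"tn"}):
        print("keep the tunnel and send the request")
    else:
        print("close the TCP session and reconnect for another node")
```

If the check fails, the client can close the connection before sending a single byte to the target.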