This article provides 6 questions you must ask to understand your proxy service costs, and why list price only tells you a fraction of the story.
By: Omri Orgad
Web data harvesting is primarily done through a proxy service, which routes your requests through multiple IPs for anonymity. Judging a proxy service by its list price alone greatly underestimates its true cost.
Let's say you are comparing two proxy services: one priced at $1,000/month, the other at $3,000/month. You go for the $1,000 bargain, only to find that your fail rate is 50%, forcing you to allocate a full-time developer to IP rotation.
Furthermore, you find that in 30% of your requests the website is sending back misleading data (i.e. your deception rate is 30%), which reduces your products' profitability. These hidden costs run to thousands of dollars: so much for the bargain…
To make better decisions you should be looking at TCO (total cost of ownership). TCO is much harder to evaluate, so to help you we've laid out the 6 key questions you must ask when comparing proxy services and data harvesting solutions, along with a quick reference table comparing your alternatives.
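The bargain scenario above can be put into rough numbers. This is a back-of-the-envelope sketch: the developer cost and deception loss figures are illustrative assumptions, not vendor benchmarks.

```python
def monthly_tco(list_price, rotation_labor, deception_loss):
    """TCO = list price + labor spent on IP rotation + profit lost
    to misleading data (a deliberately simplified model)."""
    return list_price + rotation_labor + deception_loss

# "Bargain" service: $1,000 list price, but a 50% fail rate keeps a
# full-time developer busy on IP rotation (assume ~$8,000/month), and
# a 30% deception rate costs an assumed $4,000/month in lost profit.
bargain = monthly_tco(1_000, rotation_labor=8_000, deception_loss=4_000)

# Pricier service: $3,000 list price, negligible fail/deception rates.
premium = monthly_tco(3_000, rotation_labor=0, deception_loss=0)

print(bargain, premium)  # 13000 3000 -- the "bargain" costs 4x more
```

Under these (assumed) numbers, the $1,000 list price turns into a $13,000 monthly TCO, more than four times the "expensive" alternative.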
Proxy Service Alternatives
The primary approaches for setting up web data harvesting are:
1. Signing up to a data harvesting service
Web data harvesting services, or web scraping services, provide the software and the underlying infrastructure for data collection. They may include proxy servers, data management and structuring, and a management layer for geo-location and IP rotation. Consider signing up to a data harvesting service if your project is simple, is not expected to scale much, or if you lack the know-how to develop data harvesting software.
2. Developing the software and licensing the infrastructure
In this case you would develop the data harvesting software yourself, either in-house, or by outsourcing, and would license the infrastructure that will allow you to route your requests through IPs in the target geo-locations.
Infrastructure licensing breaks down into 3 alternatives:
Renting cloud infrastructure:
This includes services that offer cloud infrastructure in various locations around the world and allow you to route your requests through these locations. Geographical distribution is typically limited. For example, the DigitalOcean cloud infrastructure service was available in 7 cities as of December 1, 2015.
Using a traditional proxy service:
Traditional proxy services may provide thousands of IPs in multiple geo-locations as well as a management layer for IP rotation, IP allocation, and geo-location selection.
Because proxy IPs are widely known, they are frequently blocked by websites or fed misleading data meant to jeopardize your research. This problem applies to all of the above approaches, including cloud infrastructure and data harvesting services, because they all use identifiable IPs.
Using a peer-to-peer proxy network:
P2P proxy networks are very large networks of residential IPs. Unlike the other alternatives, the IPs are identified as personal and unlikely to be blocked or deceived. P2P networks can offer millions of exit nodes in every city in the world. A vendor currently offering this approach is Luminati, by Hola.
The 6 Questions You Must Ask to Understand Your TCO
1. What is the cost of resources for integration?
Based on the expected integration complexity you should evaluate the total cost of developers, project managers, IT and QA for the integration phase.
Consider whether you should develop the data harvesting software in-house or license it as a service. Developing in-house increases your integration costs but reduces your ongoing costs. Review the API: services offering a convenient API will reduce your integration time and costs.
2. What fail rate can I expect?
Fail rate is the percentage of requests blocked by the website you are researching. Fail rate has the most dramatic impact on your TCO and is the single most underestimated TCO component. Once an IP is blocked you must switch (rotate) to a new IP. The process of IP rotation is resource-intensive and may require one or more developers at full capacity. Increased fail rates can bump up your TCO by thousands of dollars each month.
Because traditional proxies and data harvesting services use known IPs, they are easily identified and blocked by target websites. A P2P proxy network, by contrast, uses residential IPs, so its fail rate is minuscule.
Fail rate is also affected by the type of data you collect: harvesting a social or ecommerce website will produce higher fail rates than harvesting a weather server. Combining a high fail-rate solution with high fail-rate data will blow up your budget. Also bear in mind that fail rate impacts your project timeline and data freshness.
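The rotation logic itself is simple in outline; the engineering cost lies in maintaining large pools, detecting blocks reliably, and tuning back-off at scale. A minimal sketch follows, with the proxy addresses, the block-detection rule (HTTP 403/429), and the `fetch` callable all being illustrative assumptions:

```python
import itertools

def fetch_with_rotation(url, proxies, fetch, max_attempts=5):
    """Send `url` through proxies in turn; `fetch(url, proxy)` returns
    (status, body). Rotate to the next proxy whenever the target site
    blocks the request (crudely detected here as HTTP 403 or 429)."""
    pool = itertools.cycle(proxies)
    for _ in range(max_attempts):
        proxy = next(pool)
        status, body = fetch(url, proxy)
        if status in (403, 429):      # blocked -> rotate and retry
            continue
        return proxy, body
    raise RuntimeError("all attempts were blocked")

# Simulated run: the first two (hypothetical) proxy IPs are already
# blocked by the target site; the third gets through.
BLOCKED = {"203.0.113.10:8080", "203.0.113.11:8080"}

def fake_fetch(url, proxy):
    return (403, "") if proxy in BLOCKED else (200, "<html>...</html>")

proxies = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]
winner, page = fetch_with_rotation("http://example.com", proxies, fake_fetch)
print(winner)  # 203.0.113.12:8080
```

In production this loop has to cope with pool exhaustion, per-site block signatures, and concurrency, which is where the full-time developer cost comes from.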
3. What deception rate can I expect?
Deception rate is the percentage of misleading data you are getting during your data harvesting. Because proxy IPs are easy to identify, websites that are interested in jeopardizing your data harvesting will reply to your requests with misleading data. This can affect your profitability and damage your brand. This TCO risk applies to any solution that uses known IPs.
For example: a distributor monitors its retailers to make sure its premium brand products are not offered below a minimum price. Some retailers are actually violating their agreements and selling below that minimum. However, when the distributor monitors them, they identify the proxy IP and send back misleading, seemingly legitimate prices. As a result, the distributor's profitability is reduced and, furthermore, its brand positioning is damaged.
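Deception can sometimes be caught by sanity-checking scraped values against an independently verified reference. The sketch below is a crude heuristic with assumed retailer names, prices, and tolerance threshold; note that a retailer feeding back a plausible fake price (retailer B here) evades it, which is exactly why undetectable residential IPs matter.

```python
def flag_suspect_prices(scraped, reference, tolerance=0.25):
    """Flag scraped prices deviating from an independently verified
    reference price by more than `tolerance` (25% is an assumed
    threshold) -- a cheap heuristic for spotting deceptive responses."""
    return {retailer: price for retailer, price in scraped.items()
            if abs(price - reference) / reference > tolerance}

# Hypothetical monitoring run: the distributor's reference price is $90.
# Retailer B is actually selling at $60 but, having identified the proxy
# IP, reports a plausible $95 -- only a request from an unidentifiable
# (e.g. residential) IP would expose it.
scraped = {"retailer_a": 92.0, "retailer_b": 95.0, "retailer_c": 40.0}
print(flag_suspect_prices(scraped, reference=90.0))  # {'retailer_c': 40.0}
```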
4. Will I be receiving clean and structured data?
Data harvesting services provide "cookie-cutter" software that does not fit all types of websites. When using these services, expect to get redundant and unstructured data, and prepare to allocate resources for data cleansing and structuring.
For long term projects consider developing your software in-house to avoid recurring data cleansing work. For short term projects this may not be worth your investment.
Consider a data harvesting service for simple websites. For complex projects develop your own, custom software so that you get the data precisely in the format you want it.
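To make the cleansing work concrete, here is a small sketch of turning one messy scraped listing into a typed record. The field names and raw formats are assumptions for illustration; real pipelines need a rule like this per site and per field.

```python
import re

def structure_listing(raw):
    """Turn one messy scraped listing (field names assumed for
    illustration) into a clean, typed record."""
    price = float(re.sub(r"[^\d.]", "", raw["price"]))    # "$1,299.00" -> 1299.0
    name = " ".join(raw["title"].split())                 # collapse stray whitespace
    in_stock = raw.get("availability", "").strip().lower() == "in stock"
    return {"name": name, "price": price, "in_stock": in_stock}

raw = {"title": "  Acme   Widget\n Pro ", "price": "$1,299.00",
       "availability": " In Stock "}
print(structure_listing(raw))
# {'name': 'Acme Widget Pro', 'price': 1299.0, 'in_stock': True}
```

Multiply this per-field effort across every target website, and the recurring cleansing cost of generic tooling becomes clear.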
5. How fast can I scale my operation up and down?
Data harvesting needs often scale in both directions. For example, price monitoring activity can fluctuate significantly by season, so make sure you understand how long it will take your vendor to scale its service up or down.
When you need to scale up and increase the number of IPs, it might take your vendor anywhere from a few hours to several days to configure and integrate new servers. This extends your project timeline and, as a result, your time to revenue. Furthermore, your staff is already allocated to the project and will sit idle until the scaling process is completed.
On the other hand, when you need to scale down and are being delayed by your vendor, you are paying for resources you do not need.
Overall, every hour of scaling delay adds to your cost.
6. Does the price plan allow me to pay for what I actually use?
If you’ve signed up to a pricing tier that allows you 1 million requests/month, and used only 500K requests out of the 1 million, you’ve wasted money. Look for a flexible pricing model that allows you to scale up and down according to your actual volume.
If you expect your traffic to be dynamic you should prefer pricing models that offer unlimited activity.
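The waste in the fixed-tier scenario above is easy to quantify as an effective cost per request. The tier fee below is an illustrative assumption; the model also ignores overage charges for simplicity.

```python
def effective_cost_per_request(monthly_fee, included, used):
    """Cost per request actually served under a fixed tier: you pay
    the full fee no matter how much of the allowance you consume.
    (Simplified: ignores overage pricing when used > included.)"""
    return monthly_fee / min(used, included)

# The scenario above: a 1M-request tier, half used (assume a $500 fee).
tier_fee = 500.0
print(effective_cost_per_request(tier_fee, included=1_000_000, used=500_000))
# 0.001 -- double the $0.0005/request the tier advertises
```

At 50% utilization you are paying twice the advertised per-request rate, which is exactly the gap a usage-based or unlimited plan closes.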
Use this table as a quick reference to compare the alternatives' TCO. It is indicative; comparative costs are rated high, medium, or low.
Understanding TCO is the most critical, yet greatly underestimated, topic for companies involved in web data collection. While there are many considerations when comparing data harvesting tools, asking these 6 questions during your evaluation is the crucial first step toward making better decisions and controlling your budget.