
Best Web Scraping or Web Crawling Ethics to Follow

Blog: NASSCOM Official Blog

Many of us wonder which best practices to follow when undertaking a web scraping project. Although there have been no major legal hurdles to scraping publicly available data (other than the one-off Ryanair case), it is advisable to follow a few steps that will keep you on the right side of the law.

1. Never swamp the targeted site to the extent of denying access to other legitimate users. You can avoid this by limiting your access to the site's off-peak hours, ramping up from evening until dawn, on weekends, and on public holidays. Some popular sites such as Google, Yahoo, Amazon, and Facebook warn you if you access content too fast. Treat that warning as a signal to slow your scraper down.

2. Never download the same content more than once; doing so just wastes the site's bandwidth. Try to download all the content to your local machine in one go and then do the processing there.

3. Try not to be the #1 user of the targeted site. If the owners ever get around to checking their log files, you do not want to be at the top of the list. You may use proxy IPs to conceal your activity so that you do not appear as the #1 user of the site.

4. Ask the client whether they have the necessary permission from the site owner to download the data. If the site owner finds value in sharing the data and gives permission, that is a huge plus for the scraping project.

5. If the targeted site requires you to create an account (paid or free) to access the data, do not use aliases. Use actual information and inform the client upfront, or ask the client to provide access to the website.

6. If the site sends a warning email, respect it. Immediately stop scraping, delete all the data, and end the project. The client will understand.
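
Tips 1 and 2 are the most mechanical of the list, so here is one way they might look in code. This is a minimal sketch, not a definitive implementation: the class name `PoliteFetcher`, the helper `urllib_fetch`, the delay values, and the choice of HTTP 429 and 503 as "slow down" signals are all assumptions made for illustration. The cache is an in-memory dictionary; files on local disk would serve the same purpose.

```python
import time
import urllib.error
import urllib.request

# Assumption: these status codes are treated as the site's "slow down" warning.
SLOW_DOWN_STATUSES = {429, 503}


def urllib_fetch(url):
    """Fetch a URL with the standard library; return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; report the code instead.
        return err.code, b""


class PoliteFetcher:
    """Hypothetical helper: waits between requests, backs off on warnings,
    and never downloads the same URL twice."""

    def __init__(self, fetch=urllib_fetch, min_delay=2.0, max_delay=60.0,
                 sleep=time.sleep):
        self.fetch = fetch          # callable returning (status, body)
        self.delay = min_delay      # current pause before each request
        self.max_delay = max_delay  # upper bound for the backoff
        self.sleep = sleep          # injectable so tests need not wait
        self.cache = {}             # url -> body already downloaded

    def get(self, url, max_attempts=5):
        # Tip 2: serve repeated requests from the local copy.
        if url in self.cache:
            return self.cache[url]
        for _ in range(max_attempts):
            # Tip 1: always pause before touching the site.
            self.sleep(self.delay)
            status, body = self.fetch(url)
            if status in SLOW_DOWN_STATUSES:
                # The site asked us to slow down: double the pause and retry.
                self.delay = min(self.delay * 2, self.max_delay)
                continue
            if status != 200:
                raise RuntimeError(f"unexpected status {status} for {url}")
            self.cache[url] = body
            return body
        raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```

Calling `PoliteFetcher().get(url)` in a loop over your URL list keeps a pause between requests and lengthens it whenever the site pushes back. Restricting runs to off-peak hours is left to whatever scheduler launches the script.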

I hope these tips are useful. Feel free to share your thoughts and experiences on best practices for web scraping projects.

The post Best Web Scraping or Web Crawling Ethics to Follow appeared first on NASSCOM Community | The Official Community of Indian IT Industry.
