Exclude or include URLs from your crawl

If you want to take a closer look at a specific part of your domain, or exclude certain areas from the crawl, you can use one of the methods described here. For all methods, first open the "project settings" in the top right corner and go to the "Advanced Crawl" tab.

 

Overview

1. Subfolder only

2. Blacklist/Whitelist

3. Test your settings

 

1. Subfolder only

cs1.png

With these settings, your crawl will only contain data from a certain subfolder (e.g. "/wiki/").
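As a rough illustration of what a subfolder restriction means, the following sketch checks whether a URL's path begins with a given subfolder. The function name `in_subfolder` and the example.com URLs are made up for this example, and the crawler's actual matching logic is not documented here, so treat this as an approximation only.

```python
from urllib.parse import urlparse


def in_subfolder(url: str, subfolder: str = "/wiki/") -> bool:
    """Return True if the path of `url` starts with `subfolder`.

    Hypothetical helper: it approximates a "subfolder only" crawl,
    it is not the crawler's real implementation.
    """
    return urlparse(url).path.startswith(subfolder)


# Example URLs are placeholders.
print(in_subfolder("https://www.example.com/wiki/seo-basics"))   # True
print(in_subfolder("https://www.example.com/magazine/issue-1"))  # False
```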

 

2. Blacklist/Whitelist

cs2.png

You can use the blacklist to exclude URLs from the analysis, either by putting the URLs directly in the blacklist or by defining rules that target specific URLs. The whitelist works the same way, except that it defines which URLs should be analysed. Any URLs that don’t match the rules set in the whitelist won’t be analysed.

Please use regex for all blacklist or whitelist rules.

For more information on the blacklist and whitelist as well as examples for regex rules, please see here.
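Before entering rules in the project settings, it can help to try the regular expressions locally. The sketch below uses Python's `re` module with made-up patterns and example.com URLs. It assumes that a rule applies when the pattern is found anywhere in the URL; the crawler's regex flavour and matching behaviour may differ, so the built-in test described in section 3 remains the authoritative check.

```python
import re

# Hypothetical rules, for illustration only.
WHITELIST = [r"/(wiki|magazine)/"]          # only the wiki or the magazine
BLACKLIST = [r"\.pdf$", r"[?&]sessionid="]  # PDFs and session-ID URLs


def matches_any(url: str, patterns: list[str]) -> bool:
    """Return True if at least one pattern is found anywhere in the URL.

    Assumption: rules use "search" semantics, not a full-string match.
    """
    return any(re.search(pattern, url) for pattern in patterns)


urls = [
    "https://www.example.com/wiki/seo-basics",
    "https://www.example.com/shop/item-1",
    "https://www.example.com/wiki/guide.pdf",
]

for url in urls:
    print(url,
          "| whitelist match:", matches_any(url, WHITELIST),
          "| blacklist match:", matches_any(url, BLACKLIST))
```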

 

3. Test your settings

Running a crawl with untested rules can cost a lot of time if you only discover after the crawl has finished that the rules did not work. Please test your settings first; this also helps you get familiar with the syntax.

 

cs4.png

Here we are testing a whitelist rule. Only URLs from either the wiki or the magazine should be crawled.

If we enter a URL that should be excluded and the response status is 9xx (in this example 950), our settings are fine. If the status is still 200, the rules did not work.

We can also test in the other direction: enter a URL that matches the whitelist rules to test-crawl the included content.

 

Testing a URL that should return 200 only makes sense if we have tested an excluded URL first! Otherwise, we can't be sure that we entered the whitelist rule correctly.

 

cs5.png

If we applied the rules successfully, the test will respond with a status of 200. If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test to avoid cached results.

 

Important status codes for testing:
200: OK
950: blocked by whitelist
951: blocked by blacklist
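
To make the status codes concrete, here is a small sketch that predicts which code a test URL should produce for a given set of rules. The function `expected_status`, the patterns and the URLs are hypothetical. The sketch also assumes that the whitelist is checked before the blacklist, which is not confirmed by this article, so use it only as a way to reason about your rules before running the real test.

```python
import re

STATUS_MEANINGS = {
    200: "OK",
    950: "blocked by whitelist",
    951: "blocked by blacklist",
}


def expected_status(url: str, whitelist: list[str], blacklist: list[str]) -> int:
    """Predict the test status for `url`.

    Assumptions (not confirmed by the tool): an empty whitelist allows
    everything, and the whitelist is evaluated before the blacklist.
    """
    if whitelist and not any(re.search(p, url) for p in whitelist):
        return 950
    if any(re.search(p, url) for p in blacklist):
        return 951
    return 200


whitelist = [r"/(wiki|magazine)/"]  # hypothetical rule: wiki or magazine only
blacklist = [r"\.pdf$"]             # hypothetical rule: exclude PDFs

for url in [
    "https://www.example.com/shop/item-1",
    "https://www.example.com/wiki/guide.pdf",
    "https://www.example.com/magazine/issue-1",
]:
    status = expected_status(url, whitelist, blacklist)
    print(url, status, STATUS_MEANINGS[status])
```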

 

You can apply all rules at once (whitelist/blacklist, subfolder, subdomain, etc.). Please make sure that they don't cancel each other out!
