If you need to take a closer look at a specific part of your domain, or if you want to exclude certain areas, you can use the following methods:
Open the "Project Settings" in the top right corner and switch to the "Advanced Crawl" tab.
1. Subfolder only
With this setting, your crawl will only contain data from a certain subfolder (e.g. "/wiki/").
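The "subfolder only" restriction can be thought of as keeping just the URLs whose path starts with the chosen folder. A minimal sketch (the helper name `in_subfolder` is illustrative, not part of the product):

```python
from urllib.parse import urlparse

def in_subfolder(url: str, subfolder: str = "/wiki/") -> bool:
    # Keep only URLs whose path begins with the given subfolder.
    return urlparse(url).path.startswith(subfolder)

print(in_subfolder("https://en.ryte.com/wiki/SEO"))       # True
print(in_subfolder("https://en.ryte.com/magazine/post"))  # False
```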
2. Blacklist and whitelist
You can exclude URLs from your crawl by adding blacklist rules. In this example we want to exclude the Magazine and our Wiki, which we achieve by blacklisting their subfolders. The rules should look like this:
This rule will exclude all URLs containing /wiki/. Please note that this does not take the folder hierarchy into account: a URL like domain.com/wiki/ will be excluded, but so will domain.com/subfolder/wiki/.
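The substring behavior described above can be sketched in Python (domain.com and the helper name `is_blacklisted` are illustrative, not part of the product):

```python
def is_blacklisted(url: str, rules: list[str]) -> bool:
    # A URL is excluded as soon as it contains any blacklisted
    # pattern, no matter how deep that folder sits in the path.
    return any(rule in url for rule in rules)

rules = ["/wiki/"]
print(is_blacklisted("https://domain.com/wiki/page", rules))            # True
print(is_blacklisted("https://domain.com/subfolder/wiki/page", rules))  # True
print(is_blacklisted("https://domain.com/blog/post", rules))            # False
```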
You can apply a rule at any depth if you want specific pages to be excluded, e.g.:
You can add as many rules as you like.
Let's move on to the whitelist. The whitelist offers the same functionality as the blacklist, but works the opposite way: if you want to apply "include only" rules, use our whitelist feature.
In this example we want to crawl ONLY our Magazine and Wiki. We achieve this by whitelisting each subfolder:
Please note that this will also whitelist domain.com/subfolder/wiki/; you may need to restrict the rule to a specific depth (regex: https://en.ryte.com/wiki/).
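The difference between a plain substring rule and an anchored regex rule can be sketched with Python's `re` module (assuming, as illustrated here, that the regex is matched from the start of the URL):

```python
import re

url_top = "https://en.ryte.com/wiki/SEO"
url_deep = "https://en.ryte.com/subfolder/wiki/SEO"

# Plain substring rule: matches /wiki/ anywhere in the URL.
print("/wiki/" in url_top, "/wiki/" in url_deep)  # True True

# Regex rule: re.match anchors at the start of the string,
# so only the top-level wiki folder matches.
pattern = re.compile(r"https://en\.ryte\.com/wiki/")
print(bool(pattern.match(url_top)), bool(pattern.match(url_deep)))  # True False
```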
3. Test your settings
It can be very time consuming to run your rules without testing, only to see that they did not work once the crawl has finished. Please test your settings first in order to get familiar with the syntax.
Here we are testing with our whitelist rules from above, so only URLs from either the Wiki or the Magazine should be crawled.
If we enter a URL that should be excluded and the response status is 9xx (in this example 950), our settings are fine; if the status is still 200, the rules did not work.
We can also test in the other direction by entering a URL that matches the whitelist rules:
If we applied the rules successfully, the test responds with a status of 200. This is only meaningful if we tested the excluded URLs first!
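The test logic described above can be sketched as follows (a hypothetical simulation, not the product's actual implementation): a response of 200 means the URL would be crawled, while 950 (one of the internal 9xx statuses) means a rule filtered it out.

```python
def simulate_test(url: str, whitelist: list[str]) -> int:
    # 200 = URL matches a whitelist rule and would be crawled;
    # 950 = URL is filtered out by the rules.
    return 200 if any(rule in url for rule in whitelist) else 950

whitelist = ["/wiki/", "/magazine/"]
print(simulate_test("https://en.ryte.com/wiki/SEO", whitelist))   # 200
print(simulate_test("https://en.ryte.com/shop/item", whitelist))  # 950
```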
Important: If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test in order to avoid cache discrepancies.
You can apply all rules at once (whitelist/blacklist, subfolder, subdomain, etc.). Please make sure that they don't cancel each other out!