Using the blacklist and whitelist (ignore or include URLs)

You can define which subpages Ryte should crawl and which it should skip. This is helpful if, for example, you only want to crawl a certain country-specific directory.

You can include or exclude URLs in the project settings. Click on your project in the upper right corner, then on the gear symbol in the drop-down menu, to open the project settings.

Click on - advanced analysis -.

The ignore/include URLs option is listed here.

 

Exclude URLs:

You can exclude URLs by adding blacklist rules.

If, for example, we want to exclude our magazine and wiki pages, the rules would be as follows:

regex:\/wiki\/
regex:\/magazine\/

Please note that this will also affect domain.com/subfolder/wiki/. You may need to make the rule more specific (e.g. regex:https:\/\/en.ryte.com\/wiki\/).
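To see the difference between the broad and the more specific rule, you can try both patterns against sample URLs. The following is a minimal sketch in Python, not Ryte's own implementation. It assumes that a rule applies whenever the pattern is found anywhere in the URL, and that Python's regex dialect behaves like Ryte's for these simple patterns. The sample URLs are illustrative only.

```python
import re

# Patterns are the rule text without the "regex:" prefix.
broad = re.compile(r"\/wiki\/")
narrow = re.compile(r"https:\/\/en.ryte.com\/wiki\/")

# Illustrative sample URLs (the second one is hypothetical).
urls = [
    "https://en.ryte.com/wiki/Main_Page",
    "https://domain.com/subfolder/wiki/page",
]

# Assumption: a rule applies when the pattern occurs anywhere in the URL.
broad_hits = [u for u in urls if broad.search(u)]
narrow_hits = [u for u in urls if narrow.search(u)]

print("broad rule matches: ", broad_hits)
print("narrow rule matches:", narrow_hits)
```

The broad rule catches both URLs, while the narrow rule only catches the one on en.ryte.com. Note that the unescaped dots in the narrow pattern match any character; writing them as \. would make the rule stricter.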

You can define further rules with regular expressions, such as:

Certain URL:

regex:https:\/\/en.ryte.com\/magazine\/onpage-becomes-ryte

Certain string within a URL:

regex:urlpart

Filetype:

regex:^.*\.(jpg|JPG|gif|GIF|doc|DOC|pdf|PDF|js)$
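As a rough check of what the filetype rule catches, the sketch below runs the pattern against a few made-up URLs on example.com. It uses Python's `re` module, which is an assumption about the regex dialect; Ryte's matching may differ in detail.

```python
import re

# The filetype rule from above, without the "regex:" prefix.
filetype = re.compile(r"^.*\.(jpg|JPG|gif|GIF|doc|DOC|pdf|PDF|js)$")

# Hypothetical URLs mapped to whether the rule is expected to match.
samples = {
    "https://example.com/images/logo.jpg": True,
    "https://example.com/files/report.PDF": True,
    "https://example.com/app.js": True,
    "https://example.com/page.html": False,
    "https://example.com/file.pdf?download=1": False,
}

results = {url: bool(filetype.search(url)) for url in samples}

for url, matched in results.items():
    print(f"{matched!s:5}  {url}")
```

Because the pattern ends with $, a URL with a query string after the file extension is not matched. Mixed-case extensions such as .Pdf are not covered either, since only the all-lowercase and all-uppercase variants are listed.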


To check whether your rule has the right syntax, you can run it through a regex validator (enter the pattern without the "regex:" prefix).
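If you prefer a local check over an online validator, a small script can do the same job. The sketch below is one possible approach: the helper name `check_rule` is hypothetical, and it validates against Python's regex dialect, which may not be identical to the one Ryte uses.

```python
import re

def check_rule(line):
    """Return (True, "") if the pattern compiles, else (False, error message)."""
    prefix = "regex:"
    # Strip the "regex:" prefix, as you would for an online validator.
    pattern = line[len(prefix):] if line.startswith(prefix) else line
    try:
        re.compile(pattern)
    except re.error as exc:
        return False, str(exc)
    return True, ""

print(check_rule(r"regex:\/wiki\/"))
print(check_rule(r"regex:^.*\.(jpg|pdf$"))  # missing closing parenthesis
```

A rule that compiles is only syntactically valid; it can still match more or fewer URLs than intended, so testing it against sample URLs remains useful.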

Handy metacharacters:

^   asserts position at the start of the string.

$   asserts position at the end of the string.

*   quantifier: matches the preceding element between zero and unlimited times, as many times as possible, giving back as needed.

.   matches any single character.
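The metacharacters above can be tried out in a few lines. This is an illustrative sketch using Python's `re` module, assumed to behave like Ryte's regex engine for these basics. The hostname en-ryte.example is made up for the demonstration.

```python
import re

url = "https://en.ryte.com/magazine/onpage-becomes-ryte"

# ^ anchors the match to the start of the URL.
starts_with_https = bool(re.search(r"^https", url))
starts_with_ryte = bool(re.search(r"^ryte", url))

# $ anchors the match to the end of the URL.
ends_with_ryte = bool(re.search(r"ryte$", url))

# . matches any single character, so an unescaped dot also matches a hyphen.
dot_matches_hyphen = bool(re.search(r"en.ryte", "https://en-ryte.example/"))

# * is greedy: .* takes as much as it can, then gives back characters
# until the final \/ can still match.
greedy_match = re.search(r"\/.*\/", "/a/b/c").group()

print(starts_with_https, starts_with_ryte, ends_with_ryte)
print(dot_matches_hyphen, greedy_match)
```

To match a literal dot, such as the one before a file extension, escape it as \. as the filetype rule does.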

 

The whitelist works the opposite way ("analyze only"). If you want to analyze a specific part of your domain or narrow the analysis down by certain criteria, you can do that with the whitelist.
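The "analyze only" behavior can be pictured as a filter that keeps matching URLs instead of dropping them. The sketch below is a simplified illustration, not Ryte's actual crawler logic: it assumes a single include rule that is searched for anywhere in the URL, and the URL list is only an example.

```python
import re

# A single include rule, written without the "regex:" prefix.
include_rule = re.compile(r"\/magazine\/")

# Example URLs a crawler might discover.
discovered = [
    "https://en.ryte.com/magazine/onpage-becomes-ryte",
    "https://en.ryte.com/wiki/Main_Page",
    "https://en.ryte.com/",
]

# Assumption: with a whitelist, only URLs matching the rule are analyzed.
analyzed = [u for u in discovered if include_rule.search(u)]

print(analyzed)
```

With the same pattern used as a blacklist rule, the result would be the inverse: the magazine URL would be skipped and the other two analyzed.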

Important: If you want to crawl a certain subfolder only, you don't need to do this via include URLs. In this case, you can simply use the - Crawl subfolder - option in the project settings.

 
