To access the crawl settings, open the Project Settings and navigate to Advanced Crawl.
1. How to crawl
2. What to crawl
3. Ignore / Include URLs
1. How to crawl
1.1 How many URLs do you want to crawl?
Define how many URLs you want to crawl in this project. If you have several projects, you can distribute your URL capacity among them. To look up the overall URL limit of your account and its current usage, go to the Account Settings under Packages.
1.2 Crawl Speed
To increase the crawl speed by raising the number of parallel requests, you must first verify your page. To verify it, open the Project Settings and download the authentication file. Then upload the file to your root folder and click Check authentication.
Once your page is verified, you can raise the number of parallel requests up to 100. Please note that this will require more server resources!
1.3 Accept Cookies
Enable this feature if your site requires cookies to work properly. By default, this option is deactivated so that issues caused by rejecting cookies, such as session ID bugs or cloaking, can be identified.
Browsers usually accept cookies; search engine crawlers usually don't.
1.4 Login Data
Is your website under construction and protected with a password? That's no problem: you can add your .htaccess authentication credentials in the Project Settings.
Once you have entered the login credentials, you can run a test at Test settings to check whether the crawler is able to access your domain. Now you are good to go.
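For orientation, here is a minimal sketch of what a password-protected request typically looks like. It assumes your .htaccess protection uses standard HTTP Basic authentication; the function name and the credentials are hypothetical and not part of Ryte.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication.

    Assumption: the .htaccess protection is configured as "AuthType Basic".
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
print(basic_auth_header("user", "pass"))
```

If the test at Test settings fails, checking that your server really expects this kind of header is a reasonable first step.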
1.5 URL normalization
With URL normalization enabled, we normalize URLs the way search engines would. If you deactivate this option, you see URLs exactly as they appear in your source code. Examples:
- www.domain.com:80 => www.domain.com
- www.dOmAiN.cOM:80 => www.domain.com
- HTTP://www.domain.com => http://www.domain.com
- hTtpS://www.domain.com:443 => https://www.domain.com
- www.domain.com/test//file.php => www.domain.com/test/file.php
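The examples above can be reproduced with a few lines of Python. This is only a sketch of the general idea (lowercase scheme and host, drop default ports, collapse duplicate slashes), not the crawler's actual implementation; the function name normalize_url is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Normalize a URL roughly the way the examples above describe."""
    # Some examples have no scheme; parse them as network-path references
    # so that no scheme is invented.
    has_scheme = "://" in url
    parts = urlsplit(url if has_scheme else "//" + url)

    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only if it is not the default one.
    # Assumption: a URL without a scheme is treated like http (port 80).
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme or "http"):
        host = f"{host}:{port}"

    # Collapse repeated slashes in the path.
    path = re.sub(r"/{2,}", "/", parts.path)

    normalized = urlunsplit((scheme, host, path, parts.query, parts.fragment))
    return normalized if has_scheme else normalized[2:]


for example in (
    "www.domain.com:80",
    "HTTP://www.domain.com",
    "hTtpS://www.domain.com:443",
    "www.domain.com/test//file.php",
):
    print(example, "=>", normalize_url(example))
```

The sketch ignores user information in the URL and does not touch the query string.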
1.6 Robots.txt Behavior
Robots.txt behavior options:
- Crawl everything but create disallow statistics based on the robots.txt file
- Only crawl pages that aren't blocked by robots.txt
- Crawl everything but create disallow statistics based on your custom robots.txt
- Only crawl pages that aren't blocked by your custom robots.txt
You can enter a custom robots.txt if needed.
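To get a feeling for which URLs a robots.txt (your live one or a custom one) would block, you can check them locally. The sketch below uses Python's standard urllib.robotparser; the robots.txt content, the URLs, and the function name robots_report are made-up examples, and the crawler's own matching may differ in details.

```python
from urllib.robotparser import RobotFileParser


def robots_report(robots_txt: str, urls: list[str], user_agent: str = "*"):
    """Split URLs into those allowed and those disallowed by a robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    disallowed = [u for u in urls if not parser.can_fetch(user_agent, u)]
    return allowed, disallowed


# Hypothetical custom robots.txt and URLs.
CUSTOM_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

allowed, disallowed = robots_report(
    CUSTOM_ROBOTS,
    [
        "https://www.domain.com/public.html",
        "https://www.domain.com/private/a.html",
    ],
)
print("would be crawled in 'only crawl' mode:", allowed)
print("would count towards the disallow statistics:", disallowed)
```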
1.7 Crawler country
The "Crawler country" option changes the server location from which the crawler analyzes your site.
1.8 Crawler user-agent
The "Crawler user-agent" option determines the name the crawler identifies itself with. For example, you can enter firefox as the user-agent. If you want to whitelist or track only our service, you can give the crawler an individual name (e.g. crawler123abc).
1.9 Additional Request Header
Define additional request headers, if necessary.
2. What to crawl
2.1 Homepage URL
The Homepage URL defines what Ryte treats as the homepage of your domain and gives the crawler its starting point. If your homepage is not indexable, please have a look here.
2.2 Crawl subfolder
With this setting, your crawl will only contain data from a certain subfolder (e.g. "/wiki/").
2.3 Analyze images
Do you want to crawl images? If you disable this feature, some reports will not be created, but you will have more resources for crawling HTML documents.
2.4 Crawl Subdomains
Analyze all found subdomains and show them as part of the main domain in reports. If you deactivate this feature, they will be treated as "external links".
2.5 Analyze Sitemap.xml
Analyze your sitemap.xml for errors and optimization potential. If you use many sitemap.xml files (20+), you can deactivate this function to speed up the overall crawl.
2.6 Sitemap URLs
By default, our crawler looks for the sitemap.xml in the root folder (domain.com/sitemap.xml). If your sitemap is located elsewhere or has a different name, you can add its URL to the settings so that the crawler will follow it.
You'll find this option in the Project Settings -> Advanced Crawl.
You can add as many sitemaps as needed (one per line).
If your sitemap_index.xml is not recognized correctly, please add each sitemap's URL from the index.
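If you need to add each sitemap from an index by hand, you can extract the URLs with a short script instead of copying them one by one. This is a sketch that assumes your index follows the standard sitemaps.org format; the sample index and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_index(index_xml: str) -> list[str]:
    """Return the <loc> URL of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap index, for illustration only.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.domain.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://www.domain.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>"""

# One URL per line, ready to paste into the Sitemap URLs field.
print("\n".join(sitemap_urls_from_index(SAMPLE_INDEX)))
```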
2.7 Sort GET-Parameters
Sort the parameters in URLs alphabetically. This can reduce the amount of duplicate content.
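To illustrate why sorting helps: two URLs that differ only in the order of their parameters become identical after sorting. The sketch below shows the idea in Python; it is not the crawler's implementation, and the function name sort_query and the sample URLs are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Return the URL with its GET parameters sorted alphabetically."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


first = sort_query("https://www.domain.com/shop?size=m&color=red")
second = sort_query("https://www.domain.com/shop?color=red&size=m")
print(first)
print(first == second)  # both variants collapse into one URL
```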
2.8 Ignore GET Parameters
Define GET parameters here that will be automatically removed from URLs found on your website. This is useful to avoid unnecessary URL variations caused by session IDs or tracking parameters. Downside: issues like duplicate content might not be discovered.
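The effect of ignoring parameters can be sketched as follows. The parameter names sessionid and utm_source are only examples of typical session and tracking parameters, and strip_params is a hypothetical helper, not part of Ryte.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: set[str]) -> str:
    """Remove the given GET parameters from a URL and keep all others."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Example parameter names; use the ones that occur on your own site.
IGNORED = {"sessionid", "utm_source"}

print(strip_params(
    "https://www.domain.com/shop?color=red&sessionid=abc123&utm_source=mail",
    IGNORED,
))
```

Note the downside mentioned above: once the variations are collapsed like this, the crawl can no longer report them as duplicates.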
3. Ignore / Include URLs
3.1 Exclude URLs (blacklist)
You can exclude URLs from your crawl by adding blacklist rules. In this example, we want to exclude the Magazine and our Wiki, and we do that by blacklisting their "subfolders".
A rule for /wiki/ will exclude all URLs containing /wiki/. Please note that this does not take the folder hierarchy into account: a URL such as domain.com/wiki/ will be excluded as well as domain.com/subfolder/wiki/.
You can apply a rule at any depth if you want only certain pages to be excluded.
You can add as many rules as you like.
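The "containing" behavior described above can be tried out locally. The sketch below assumes that a rule matches when its pattern is found anywhere in the URL; the plain patterns used here are for illustration only and are not the exact rule syntax of the settings form.

```python
import re


def is_blacklisted(url: str, patterns: list[str]) -> bool:
    """Return True if any pattern occurs anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Illustrative pattern only, not the exact syntax of the settings form.
rules = ["/wiki/"]

for url in (
    "https://domain.com/wiki/",
    "https://domain.com/subfolder/wiki/",
    "https://domain.com/product/",
):
    print(url, "excluded" if is_blacklisted(url, rules) else "crawled")
```

Because the pattern is not tied to a position in the URL, both /wiki/ and /subfolder/wiki/ are excluded.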
3.2 Include URLs (whitelist)
The whitelist has the same functionality as the blacklist, but it works the opposite way. If you need to apply rules on an "include only" basis, you can use our whitelist feature.
In this example, we want to crawl ONLY our Magazine and Wiki. We do that by whitelisting each "subfolder".
Please note that this will also whitelist domain.com/subfolder/wiki/; you may need to make the rule more specific (regex:https://en.ryte.com/wiki/).
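The difference between a loose and a more specific whitelist rule can be sketched like this. The assumption is again that a rule matches when its pattern is found anywhere in the URL; the escaped dots in the second pattern are ordinary regex hygiene and not something the settings form is known to require, and the page URL /wiki/example is hypothetical.

```python
import re


def is_whitelisted(url: str, patterns: list[str]) -> bool:
    """Return True if any pattern occurs anywhere in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


loose = ["/wiki/"]
specific = [r"https://en\.ryte\.com/wiki/"]

nested = "https://en.ryte.com/subfolder/wiki/"
top_level = "https://en.ryte.com/wiki/example"  # hypothetical page

print(is_whitelisted(nested, loose))        # loose rule also includes the nested folder
print(is_whitelisted(nested, specific))     # specific rule leaves it out
print(is_whitelisted(top_level, specific))  # the actual Wiki is still included
```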
3.3 Test Blacklist/Whitelist Settings
Running a crawl with untested rules can cost a lot of time if you only find out after the crawl has finished that the rules did not work. Please test your settings first to get familiar with the syntax.
Here we are testing our whitelist rules from above, so only URLs from either the Wiki or the Magazine should be crawled.
If we enter a URL that should be excluded and the response status is 9xx (in this example 950), our settings are fine. If the status is still 200, the rules did not work.
We can also test in the other direction by entering a URL that is covered by the whitelist rules.
If we applied the rules successfully, the test will respond with a status of 200. This only makes sense if we tested excluded URLs first!
Test settings status codes:
200 - OK
950 - blocked by whitelist
951 - blocked by blacklist
Important: If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test to avoid cache discrepancies.
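If you want to reason about the expected status before running a test, the three codes can be modeled in a few lines. This is a simplified sketch: the pattern matching and the order in which blacklist and whitelist are evaluated are assumptions, and simulate_test_status is a hypothetical helper, not a Ryte function.

```python
import re

OK = 200
BLOCKED_BY_WHITELIST = 950
BLOCKED_BY_BLACKLIST = 951


def simulate_test_status(url: str, whitelist: list[str], blacklist: list[str]) -> int:
    """Predict the Test settings status for a URL under the given rules."""
    # Assumption: the blacklist is checked first.
    if any(re.search(pattern, url) for pattern in blacklist):
        return BLOCKED_BY_BLACKLIST
    # An empty whitelist means "no include-only restriction".
    if whitelist and not any(re.search(pattern, url) for pattern in whitelist):
        return BLOCKED_BY_WHITELIST
    return OK


# Illustrative whitelist for the Wiki and Magazine example.
whitelist = ["/wiki/", "/magazine/"]

print(simulate_test_status("https://en.ryte.com/product/", whitelist, []))
print(simulate_test_status("https://en.ryte.com/wiki/example", whitelist, []))
```

A URL outside the whitelist yields 950 and a URL inside it yields 200, which mirrors the two test directions described above.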
4. General settings
In the General Settings, you can change the project's Name and Slug.
Name: The name of your project as it will be displayed within your Ryte account (default: domain).
Slug: The short notation of your project as it is displayed within the URL.
ATTENTION: The project domain will remain unaffected. Changing the slug will only change the display URL of the project. This feature is important if you analyze the same domain multiple times in your Ryte account.