To set up the analysis, open the Project Settings and navigate to Advanced analysis.
1. How to analyze
2. What to analyze
3. Ignore / Include URLs
1. How to analyze
1.1 How many URLs should be analyzed?
Define how many URLs you want to crawl in this project. If you have several projects, you can distribute your URL capacity among them. To look up the overall URL limit of your account and its current usage, open the Account Settings (click on your profile picture in the Dashboard) and go to Packages.
1.2 How fast should the analysis be?
To increase the crawl speed by raising the number of parallel requests, you must verify your page first. To verify your page, open the Project Settings and download the authentication file. Then upload the file to your root folder and click Check authentication.
Once your page is verified, you can raise the number of parallel requests up to 100. Please note that this will require more server resources!
1.3 Accept Cookies
Enable this feature if your site requires cookies to work properly. By default, this option is deactivated so that issues caused by rejecting cookies, such as session ID bugs or cloaking, can be identified.
Browsers usually accept cookies, search engine crawlers usually don't.
1.4 Login Data
Is your website under construction and protected with a password? That's not a problem: you can add your .htaccess authentication credentials in the Project Settings.
Once you have entered the login credentials, you can run a test at Test settings to check whether the crawler is able to access your domain. Now you are good to go!
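As a rough illustration of what such credentials turn into on the wire: .htaccess password protection commonly relies on HTTP Basic authentication. The sketch below assumes that scheme; the function name and the sample credentials are hypothetical and do not describe how the Ryte crawler is implemented.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build an HTTP Basic Authorization header (assumed auth scheme)."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical staging credentials, for illustration only.
header = basic_auth_header("staging", "secret")
print(header)
```

A request carrying this header is what a password-protected staging server would need to see before it serves the page.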
1.5 URL normalization
With URL normalization enabled, URLs are normalized the way search engines do it. If you deactivate this option, you see URLs exactly as they appear in your source code. Examples:
- www.domain.com:80 => www.domain.com
- www.dOmAiN.cOM:80 => www.domain.com
- HTTP://www.domain.com => http://www.domain.com
- hTtpS://www.domain.com:443 => https://www.domain.com
- www.domain.com/test//file.php => www.domain.com/test/file.php
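The examples above can be approximated with a small function. This is a minimal sketch that covers only the listed cases (lowercase scheme and host, drop default ports, collapse repeated slashes in the path); `normalize_url` is a hypothetical name, and treating port 80 as the default for scheme-less URLs is an assumption. It is not the crawler's actual implementation.

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Normalize a URL along the lines of the examples in this section."""
    has_scheme = "://" in url
    # Prefix scheme-less input with "//" so the host is parsed as netloc.
    parts = urlsplit(url if has_scheme else "//" + url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # userinfo is dropped (sketch only)
    # Assumption: scheme-less URLs are treated like http for port handling.
    default_port = DEFAULT_PORTS.get(scheme or "http")
    port = parts.port
    netloc = host if port is None or port == default_port else f"{host}:{port}"
    path = re.sub(r"/{2,}", "/", parts.path)  # collapse "//" inside the path
    result = urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
    return result if has_scheme else result[2:]  # strip the helper "//"


for raw in ["www.domain.com:80", "HTTP://www.domain.com",
            "hTtpS://www.domain.com:443", "www.domain.com/test//file.php"]:
    print(raw, "=>", normalize_url(raw))
```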
1.6 Robots.txt Behaviour
Robots.txt behaviour options:
- Crawl everything but create disallow statistics based on the robots.txt file
- Only crawl pages that aren't blocked by robots.txt
- Crawl everything but create disallow statistics based on the custom robots.txt
- Only crawl pages that aren't blocked by your custom robots.txt
You can enter a custom robots.txt if needed.
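To get a feeling for the difference between the two modes, the sketch below uses Python's standard `urllib.robotparser` to evaluate a custom robots.txt. The helper names, the `crawl_everything` flag, and the sample URLs are illustrative assumptions; the crawler's own matching may differ in details.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt content disallows the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def plan_crawl(urls, robots_txt, crawl_everything):
    """Split URLs into (to_crawl, disallowed) for the chosen mode."""
    disallowed = [u for u in urls if is_blocked(robots_txt, u)]
    if crawl_everything:
        # Mode "crawl everything": blocked URLs only feed the statistics.
        return list(urls), disallowed
    # Mode "only crawl pages that aren't blocked".
    return [u for u in urls if u not in disallowed], disallowed


custom_robots = "User-agent: *\nDisallow: /wiki/\n"
urls = ["https://domain.com/wiki/page", "https://domain.com/magazine/"]
print(plan_crawl(urls, custom_robots, crawl_everything=True))
print(plan_crawl(urls, custom_robots, crawl_everything=False))
```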
1.7 Analysis country
The "Crawler country" option changes the server location from which the crawler analyzes your site.
1.8 Analysis user-agent
The user-agent option determines the name the crawler identifies itself with; for example, you can enter firefox as the user-agent. If you want to whitelist or track only our service, you can give the crawler an individual name (e.g. crawler123abc).
1.9 Additional Request Header
Define additional request headers, if necessary.
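For orientation, this is roughly what a request with a custom user-agent and one additional header looks like when built with Python's standard library. The header name `X-Crawl-Token` and its value are made up for the example, and no request is actually sent.

```python
from urllib.request import Request


def build_request(url, user_agent, extra_headers=None):
    """Create a request object with a custom user-agent and extra headers."""
    headers = {"User-Agent": user_agent}
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


# "X-Crawl-Token" is a hypothetical header used only for illustration.
req = build_request("https://www.domain.com/", "crawler123abc",
                    {"X-Crawl-Token": "example"})
print(req.header_items())
```

On the server side, an individual user-agent name or an extra header like this is what you would match on to whitelist or track the crawler's requests.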
2. What to analyze
2.1 Homepage URL
The homepage URL defines what Ryte takes as the homepage of your domain. It also gives the crawler a starting point. If your homepage is not indexable, please have a look here.
2.2 Analyze subfolder
With this setting, your crawl will only contain data from a certain subfolder (e.g. "/wiki/").
2.3 Analyze images
Do you want to crawl images? By disabling this feature, some reports will not be created – but you have more resources for crawling HTML documents.
2.4 Analyze subdomains
Analyze all found subdomains and show them as part of the main domain in reports. If you deactivate this feature, they will be treated as "external links".
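The effect of this switch can be pictured as a simple classification of each link. The sketch below is an assumption about the general idea, not the crawler's code; `is_internal` and the sample hosts are hypothetical.

```python
from urllib.parse import urlsplit


def is_internal(link: str, main_domain: str, analyze_subdomains: bool) -> bool:
    """Decide whether a link counts as part of the analyzed domain."""
    host = (urlsplit(link).hostname or "").lower()
    main = main_domain.lower()
    if host == main:
        return True
    # Subdomains are internal only while the feature is enabled.
    return analyze_subdomains and host.endswith("." + main)


print(is_internal("https://blog.domain.com/post", "domain.com", True))
print(is_internal("https://blog.domain.com/post", "domain.com", False))
```

With the feature disabled, the second call classifies the same link as external.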
2.5 Analyze sitemap.xml
Analyze your sitemap.xml for errors and optimization potential. If you use many sitemap.xml files (20+) you can deactivate this function to speed up the overall crawling.
2.6 Sitemap URLs
By default, our crawler looks for the sitemap.xml in the root folder (domain.com/sitemap.xml). If your sitemap is located elsewhere or has a different name, you can add its URL to the settings so the crawler will follow it.
You'll find this option in the Project Settings -> Advanced Crawl.
You can add as many sitemaps as needed (one per line).
If your sitemap_index.xml is not recognized correctly, please add each sitemap's URL from the index.
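If you need to copy the individual sitemap URLs out of an index file, a short script can list them one per line. This sketch assumes a sitemap index that follows the sitemaps.org schema; the sample index content and URLs are invented for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_index(index_xml: str) -> list[str]:
    """Extract the <loc> of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml.strip())
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# Hypothetical index content for illustration.
index_xml = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://domain.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://domain.com/sitemap-wiki.xml</loc></sitemap>
</sitemapindex>
"""

# One URL per line, ready to paste into the sitemap settings.
print("\n".join(sitemap_urls_from_index(index_xml)))
```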
2.7 Sort GET-Parameters
Sort the parameters in URLs alphabetically. This can reduce the amount of duplicate content.
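A minimal sketch of the idea: two URLs that differ only in parameter order become identical once the parameters are sorted. The function name `sort_query` and the sample URLs are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url: str) -> str:
    """Return the URL with its GET parameters sorted alphabetically."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


a = sort_query("https://domain.com/shop?size=m&color=red")
b = sort_query("https://domain.com/shop?color=red&size=m")
print(a)        # https://domain.com/shop?color=red&size=m
print(a == b)   # True
```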
2.8 Ignore GET Parameters
Define GET parameters that will be automatically removed from URLs found on your website. Useful to avoid unnecessary URL variations from session-IDs or tracking parameters. Downside: Issues like duplicate content might not be discovered.
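The removal step can be sketched as follows. The parameter names `sessionid` and `utm_source` are typical examples chosen for illustration, not a list taken from the tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: set[str]) -> str:
    """Remove the ignored GET parameters from a URL."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))


url = "https://domain.com/shop?color=red&sessionid=abc123&utm_source=news"
print(strip_params(url, {"sessionid", "utm_source"}))
# https://domain.com/shop?color=red
```

The example also shows the downside mentioned above: once a parameter is stripped, pages that differ only in that parameter can no longer be told apart in the reports.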
3. Ignore / Include URLs
3.1 Exclude URLs (blacklist)
You can exclude URLs from your crawl by adding blacklist rules. In this example, we want to exclude the Magazine and our Wiki, which we do by blacklisting the corresponding "subfolders". The rules should look like this:
This rule excludes all URLs containing /wiki/. Please note that it does not take the folder hierarchy into account: domain.com/wiki/ is excluded, and so is domain.com/subfolder/wiki/.
You can apply a rule at any depth if you want to exclude only certain pages, e.g.:
You can add as many rules as you like.
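The "contains" behaviour described above can be modelled with a plain substring check. This is only a model of the matching logic under that assumption; it does not reproduce the rule syntax used in the Project Settings.

```python
def is_blacklisted(url: str, rules: list[str]) -> bool:
    """Return True if the URL contains any of the blacklist fragments."""
    return any(rule in url for rule in rules)


rules = ["/wiki/", "/magazine/"]
for url in ["domain.com/wiki/", "domain.com/subfolder/wiki/", "domain.com/about/"]:
    print(url, "=>", "excluded" if is_blacklisted(url, rules) else "crawled")
```

Because the check ignores the folder hierarchy, both domain.com/wiki/ and domain.com/subfolder/wiki/ are excluded, while domain.com/about/ is still crawled.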
3.2 Include URLs (whitelist)
The whitelist has the same functionality as the blacklist, but it works the opposite way. If you need to apply rules on an "include only" basis, you can use our whitelist feature.
In this example, we want to crawl ONLY our magazine and wiki. We do that by whitelisting each "subfolder":
Please note that this will also whitelist domain.com/subfolder/wiki/. You may need to adjust the depth of the rule (regex:https://en.ryte.com/wiki/).
3.3 Test Blacklist/whitelist Settings
Running your rules without testing can cost a lot of time: you may only find out that they did not work once the crawl has finished. Please test your settings first to get familiar with the syntax.
Here we are testing with our whitelist rules from above, so only URLs from either the wiki or the magazine should be crawled.
If we enter a URL that should be excluded and the response status is 9xx (950 in this example), our settings are fine. If the status is still 200, the rules did not work.
We can also test in the other direction by test-crawling included content, that is, by entering a URL that matches the whitelist rules:
If we applied the rules successfully, the test responds with a status of 200. This check is only meaningful if we tested excluded URLs first!
Test settings Status Codes:
200 - OK
950 - blocked by whitelist
951 - blocked by blacklist
Important: If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test to avoid stale cached results.
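To reason about which status code to expect before running a test, the three outcomes can be modelled as below. This is a simplified sketch: it assumes substring matching and that the blacklist is checked before the whitelist, neither of which is confirmed by this guide, and `simulate_test_status` is a hypothetical helper.

```python
def simulate_test_status(url: str, whitelist: list[str], blacklist: list[str]) -> int:
    """Return the expected test status: 200, 950 or 951."""
    if any(rule in url for rule in blacklist):
        return 951  # blocked by blacklist
    if whitelist and not any(rule in url for rule in whitelist):
        return 950  # blocked by whitelist (URL matches no include rule)
    return 200  # OK


whitelist = ["/wiki/", "/magazine/"]
print(simulate_test_status("https://domain.com/pricing/", whitelist, []))   # 950
print(simulate_test_status("https://domain.com/wiki/seo", whitelist, []))   # 200
print(simulate_test_status("https://domain.com/wiki/seo", [], ["/wiki/"]))  # 951
```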
4. General settings
In the General Settings, you can change the project name and slug.
Project name: the name of your project as it will be displayed within your Ryte account (default: domain).
Slug: the short notation of your project as it is displayed within the URL.
ATTENTION: The project domain will remain unaffected. This will only change the display URL of the project. This feature is important if you analyze the same domain multiple times in your Ryte account.