Overview:

1. How to analyze
2. What to analyze
3. Ignore / Include URLs
- 3.1 Exclude URLs (blacklist)
- 3.2 Include URLs (whitelist)
4. General settings

To set up the analysis, open the Project Settings and navigate to Advanced analysis.

1. How to analyze

1.1 How many URLs should be analyzed?

Define how many URLs you want to crawl in this project. If you have several projects, you can distribute your URL capacity among them.

To look up the overall account URL limit and its usage, have a look at your Projects in your Account Settings.

1.2 How fast should the analysis be?

You can define your crawling speed by adjusting the number of parallel requests (from 1 to 10).

To raise the number of parallel requests beyond 10 and speed up the crawl further, you must first verify your page. To do so, download the authentication file, upload it to your root folder, and click Check authentication.

Once your page is verified, you can raise the number of parallel requests up to 100. Please note that more parallel requests require more server resources, which could lead to 5xx status codes. Fewer parallel requests require fewer server resources.
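To illustrate what the parallel requests setting controls, here is a minimal Python sketch of a client that keeps at most a fixed number of requests in flight. It is not the crawler's actual implementation; the function names crawl and fake_fetch and the example.com URLs are hypothetical, and fake_fetch only stands in for a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def crawl(urls: Iterable[str],
          fetch: Callable[[str], int],
          parallel_requests: int = 10) -> list[int]:
    """Fetch all URLs with at most `parallel_requests` requests in flight."""
    # 1 to 100 mirrors the range described above (100 only after verification).
    if not 1 <= parallel_requests <= 100:
        raise ValueError("parallel_requests must be between 1 and 100")
    with ThreadPoolExecutor(max_workers=parallel_requests) as pool:
        # pool.map returns the results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


def fake_fetch(url: str) -> int:
    """Hypothetical stand-in for a real HTTP request; always reports 200."""
    return 200


urls = [f"https://example.com/page-{i}" for i in range(5)]
statuses = crawl(urls, fake_fetch, parallel_requests=2)
```

The higher the value, the more requests hit the server at the same moment, which is why a large setting can surface 5xx responses on a weak server.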

1.3 Login Data

Is your website under construction and protected with a password? That's no problem: you can add your .htaccess authentication credentials in the Project Settings.

Once you have entered the login credentials, you can run a test at Test settings to check if the crawler is able to access your domain. Now you are good to go.
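For background, .htaccess password protection is most commonly HTTP Basic authentication. Assuming that scheme, the following Python sketch shows how a client turns a username and password into the Authorization header it sends with each request. The function name and the credentials are hypothetical and only serve as an illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header used by HTTP Basic authentication."""
    # Basic auth sends "username:password", Base64-encoded (not encrypted).
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical credentials, for illustration only.
headers = basic_auth_header("staging-user", "s3cret")
```

Because the credentials are only encoded, not encrypted, a protected staging site should be served over HTTPS.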

1.4 Robots.txt Behaviour

Robots.txt behaviour options:

1.5 Analysis user-agent

The user-agent option determines the name of the crawler. You can enter RyteBot or GoogleBot as the user-agent; RyteBot is the default. If you want to make sure that you whitelist or track only our service, you can give the crawler an individual name (e.g., crawler123abc).
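One place where an individual crawler name can be whitelisted is your robots.txt. The Python sketch below uses the standard library's robots.txt parser to check which user-agents a given file admits. The robots.txt content, the example.com URL, and the OtherBot name are assumptions for illustration; whether the crawler follows robots.txt at all depends on the behaviour selected in 1.4.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the individually named crawler may fetch pages.
ROBOTS_TXT = """\
User-agent: crawler123abc
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


custom_ok = is_allowed(ROBOTS_TXT, "crawler123abc", "https://example.com/page")
other_ok = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/page")
```

Here custom_ok is True and other_ok is False, because every other user-agent falls back to the wildcard group.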

2. What to analyze

2.1 Homepage URL

The Homepage URL defines what RYTE takes as the homepage of your domain and gives the crawler its starting point. If your homepage is not indexable, please have a look here.

2.2 Analyze Subfolder

Use this setting if you want to analyze a specific subfolder. Define the folder relative to the root of the domain (e.g., /wiki/). Attention: depending on the URL structure of your subfolder, you may leave out the trailing slash.

2.3 Analyze Subdomains

Analyze all found subdomains and show them as part of the main domain in reports. If you deactivate this feature, they will be treated as "external links".
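The distinction between "part of the main domain" and "external link" can be sketched in Python as follows. This is a simplified illustration, not the crawler's actual logic; the function name is_internal and the example.com hosts are hypothetical.

```python
from urllib.parse import urlsplit


def is_internal(url: str, main_domain: str, include_subdomains: bool) -> bool:
    """Classify a link as internal (True) or external (False)."""
    host = (urlsplit(url).hostname or "").lower()
    main = main_domain.lower()
    if host == main:
        return True
    # The leading dot avoids treating notexample.com as a subdomain
    # of example.com.
    return include_subdomains and host.endswith("." + main)


link = "https://blog.example.com/post"
with_subdomains = is_internal(link, "example.com", include_subdomains=True)
without_subdomains = is_internal(link, "example.com", include_subdomains=False)
```

With the feature active the blog link counts as internal; with it deactivated the same link is treated as external.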

2.4 Analyze Sitemap.xml

Analyze your sitemap.xml for errors and optimization potential. If you use many sitemap.xml files (20+), you could deactivate this function to speed up the overall crawl.

2.5 Sitemap URLs

By default, our crawler looks for the sitemap.xml in the root folder (domain.com/sitemap.xml). If your sitemap is located elsewhere or has a different name, you can add its URL to the settings so the crawler will follow it.

You can add as many sitemaps as needed (one per line).

YOAST Users:

If your sitemap_index.xml is not recognized correctly, please add each sitemap's URL from the index (e.g., .../page-sitemap.xml).
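If the index lists many sitemaps, you could collect their URLs with a short script instead of copying them by hand. The Python sketch below assumes the index follows the standard sitemaps.org format; the function name child_sitemaps, the sample index, and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index, shortened for illustration.
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the URL of every sitemap listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


# One URL per line, the format expected by the Sitemap URLs setting.
one_per_line = "\n".join(child_sitemaps(INDEX_XML))
```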

2.6 Ignore GET Parameters

Define GET parameters here that will be automatically removed from URLs found on your website. This is useful to avoid unnecessary URL variations caused by session IDs or tracking parameters. Downside: issues like duplicate content might not be discovered.
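The effect of this setting can be sketched in Python as follows. It is an illustration of the idea, not the crawler's implementation; the function name strip_params, the parameter names sessionid and utm_source, and the example.com URLs are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, ignored: set[str]) -> str:
    """Remove the ignored GET parameters from a URL, keeping all others."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


IGNORED = {"sessionid", "utm_source"}  # hypothetical parameter names

a = strip_params("https://example.com/shop?color=red&sessionid=abc123", IGNORED)
b = strip_params("https://example.com/shop?color=red&utm_source=mail", IGNORED)
```

Both a and b become https://example.com/shop?color=red, so the two variants are crawled as a single URL. This is also why the downside exists: if the variants really do serve duplicate content, the crawl no longer sees them as separate pages.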

3. Ignore / Include URLs

3.1 Exclude URLs (blacklist)

You can exclude URLs from your crawl by adding blacklist rules. In this example, we want to exclude the Magazine and our Wiki, which we achieve by blacklisting the "subfolders". The rules should look like this:

These rules will exclude all URLs where the path starts with /wiki/ or /magazine/.

You can apply a rule at any depth if you want certain pages to be excluded, e.g.:

https://en.ryte.com/product-insights/whitelist-blacklist-feature

You can add as many rules as you like.
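The matching described above ("the path starts with ...") can be sketched in Python as follows. This models the rules as plain path prefixes, which is an assumption for illustration and not the rule syntax you enter in the Project Settings; the function name is_excluded is hypothetical.

```python
from urllib.parse import urlsplit

# Path prefixes standing in for the blacklist rules (illustrative only).
BLACKLIST_PREFIXES = ("/wiki/", "/magazine/")


def is_excluded(url: str, prefixes: tuple[str, ...] = BLACKLIST_PREFIXES) -> bool:
    """Return True if the URL path starts with any blacklisted prefix."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return urlsplit(url).path.startswith(prefixes)


wiki_page = is_excluded("https://en.ryte.com/wiki/Robots.txt")
product_page = is_excluded("https://en.ryte.com/product-insights/")

# A deeper rule that targets a single page rather than a whole subfolder.
deep_rule = ("/product-insights/whitelist-blacklist-feature",)
single_page = is_excluded(
    "https://en.ryte.com/product-insights/whitelist-blacklist-feature", deep_rule
)
```

Here wiki_page and single_page are True, while product_page is False because no subfolder rule covers it.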

3.2 Include URLs (whitelist)

The whitelist has the same functionality as the blacklist, but it works the opposite way. If you need to apply "include only" rules, you can use our whitelist feature.

In this example, we want to crawl ONLY our Magazine and Wiki. We achieve this by whitelisting each "subfolder":

This rule will include all URLs where the path starts with /wiki/ or /magazine/.
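As a counterpart, the "include only" behaviour can be sketched in Python like this. Again the rules are modelled as plain path prefixes, which is an assumption for illustration rather than the actual rule syntax; the function names and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Path prefixes standing in for the whitelist rules (illustrative only).
WHITELIST_PREFIXES = ("/wiki/", "/magazine/")


def is_included(url: str, prefixes: tuple[str, ...] = WHITELIST_PREFIXES) -> bool:
    """Return True if the URL path starts with any whitelisted prefix."""
    return urlsplit(url).path.startswith(prefixes)


def filter_urls(urls: list[str]) -> list[str]:
    """Keep only the URLs that match the whitelist."""
    return [url for url in urls if is_included(url)]


found = [
    "https://en.ryte.com/wiki/Robots.txt",
    "https://en.ryte.com/magazine/seo-basics",
    "https://en.ryte.com/product-insights/",
]
to_crawl = filter_urls(found)
```

Only the Wiki and Magazine URLs remain in to_crawl; everything else found on the site is skipped.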