Crawl Settings Manual

To access the crawl settings, open the Project Settings and navigate to Advanced Crawl.

1. How to crawl

1.1 How many URLs do you want to crawl?

1.2 Crawl Speed

1.3 Accept Cookies

1.4 Login Data

1.5 URL normalization

1.6 Robots.txt Behaviour

1.7 Crawler country

1.8 Crawler user-agent

1.9 Additional Request Header

2. What to crawl

2.1 Homepage URL

2.2 Crawl subfolder

2.3 Analyze images

2.4 Crawl Subdomains

2.5 Analyze Sitemap.xml

2.6 Sitemap URLs

2.7 Sort GET-Parameters

2.8 Ignore GET Parameters

3. Ignore / Include URLs

3.1 Exclude URLs (blacklist)

3.2 Include URLs (whitelist)

3.3 Test Blacklist/whitelist Settings

4. General settings

 

 

 

1. How to crawl

1.1 How many URLs do you want to crawl?

Define how many URLs you want to crawl in this project. If you have several projects, you can distribute your URL capacity across them. To look up the overall account URL limit and its current usage, open the Account Settings under Packages.

 

1.2 Crawl Speed

To increase the crawl speed by raising the number of parallel requests, you must verify your website first. To verify it, open the Project Settings and download the authentication file, then upload the file to your root folder and click Check authentication.

[Screenshot: Check authentication in the Project Settings]

Once your website is verified, you can raise the number of parallel requests up to 100. Please note that this requires more server resources!

 

1.3 Accept Cookies

Enable this feature if your site requires cookies to work properly. By default, this option is deactivated so that issues caused by rejecting cookies, such as session ID bugs or cloaking, can be identified.
Browsers usually accept cookies; search engine crawlers usually don't.

 

1.4 Login Data

Is your website under construction and protected with a password? That's no problem: you can add your .htaccess authentication credentials in the Project Settings:

[Screenshot: .htaccess login credentials in the Project Settings]

 

Once you have entered the login credentials, run Test settings to check whether the crawler can access your domain. After that, you are good to go.
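.htaccess protection is standard HTTP Basic Authentication, so the crawler simply sends your credentials with every request. As a rough illustration (not Ryte's own implementation; the URL and credentials below are placeholders), this is what such an authenticated request looks like in Python:

# Illustration only: .htaccess protection is plain HTTP Basic Authentication.
# The URL and credentials are placeholders, not real values.
import requests

url = "https://staging.example.com/"  # hypothetical password-protected site
response = requests.get(url, auth=("username", "password"))  # same credentials as in your .htpasswd

# 200 means the page is reachable with these credentials (what "Test settings" verifies),
# 401 means they were rejected.
print(response.status_code)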

 

1.5 URL normalization

With URL normalization, we normalize URLs the way search engines do. If you deactivate this option, you see URLs exactly as they appear in your source code. Examples (a short sketch of the idea follows the list):

  • www.domain.com:80 => www.domain.com
  • www.dOmAiN.cOM:80 => www.domain.com
  • HTTP://www.domain.com => http://www.domain.com
  • hTtpS://www.domain.com:443 => https://www.domain.com
  • www.domain.com/test//file.php => www.domain.com/test/file.php
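A minimal sketch of these rules, assuming standard URL semantics; it illustrates the idea and is not the crawler's exact implementation:

# Sketch of the normalization rules listed above (lowercase scheme and host,
# drop default ports, collapse duplicate slashes). Illustration only.
import re
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                             # HTTP:// -> http://
    host = parts.hostname.lower() if parts.hostname else ""   # dOmAiN.cOM -> domain.com
    port = parts.port
    # keep the port only if it is not the default for the scheme
    if port and not ((scheme == "http" and port == 80) or (scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    path = re.sub(r"/{2,}", "/", parts.path)                  # /test//file.php -> /test/file.php
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

print(normalize("hTtpS://www.dOmAiN.cOM:443/test//file.php"))
# -> https://www.domain.com/test/file.php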

 

1.6 Robots.txt Behaviour

Robots.txt behavior options:

  • Crawl everything, but create Disallow statistics based on the robots.txt file
  • Only crawl pages that aren't blocked by robots.txt
  • Crawl everything, but create Disallow statistics based on the custom robots.txt
  • Only crawl pages that aren't blocked by your custom robots.txt

You can enter a custom robots.txt if needed.
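For reference, this is how robots.txt Disallow rules are evaluated in general. The sketch uses Python's standard library and a made-up custom robots.txt; it is a generic illustration of robots.txt semantics, not the crawler's own implementation:

# Generic illustration of robots.txt Disallow handling; the rules and URLs are examples.
from urllib.robotparser import RobotFileParser

custom_rules = """
User-agent: *
Disallow: /internal/
"""

parser = RobotFileParser()
parser.parse(custom_rules.splitlines())

print(parser.can_fetch("*", "https://www.example.com/internal/report"))  # False -> would be blocked
print(parser.can_fetch("*", "https://www.example.com/wiki/"))            # True  -> would be crawled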

 

1.7 Crawler country

The "Crawler country" option changes the server location from which the crawler analyzes your website.

 

1.8 Crawler user-agent

The user-agent option determines the name the crawler identifies itself with; you can enter Firefox as the user-agent, for example. If you want to whitelist or track our service specifically, you can give the crawler an individual name (e.g. crawler123abc).

 

1.9 Additional Request Header

Define additional request headers, if necessary.
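Both the user-agent (1.8) and any additional request headers (1.9) are sent as plain HTTP headers with every request the crawler makes. A rough illustration (not the crawler's own code; header names and values are placeholders):

# What a custom user-agent (1.8) plus an additional request header (1.9) look like on the wire.
# The header names and values are placeholders.
import requests

headers = {
    "User-Agent": "crawler123abc",        # individual crawler name, e.g. for whitelisting in server logs
    "X-Example-Token": "staging-access",  # hypothetical additional request header
}

response = requests.get("https://www.example.com/", headers=headers)
print(response.status_code)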

 

 

2. What to crawl

 

2.1 Homepage URL

The Homepage URL defines what Ryte treats as the homepage of your domain; it also gives the crawler its starting point. If your homepage is not indexable, please have a look here.

 

2.2 Crawl subfolder

[Screenshot: Crawl subfolder setting]

With this setting, your crawl will only contain data from a certain subfolder (e.g. "/wiki/").

 

2.3 Analyze images

Do you want to crawl images? If you disable this feature, some reports will not be created, but more crawl resources are left for HTML documents.

 

2.4 Crawl Subdomains

Analyze all found subdomains and show them as part of the main domain in reports. If you deactivate this feature, they will be treated as "external links".

 

2.5 Analyze Sitemap.xml

Analyze your sitemap.xml for errors and optimization potential. If you use many sitemap.xml files (20+), you can deactivate this function to speed up the overall crawl.

 

2.6 Sitemap URLs

By default, our crawler looks for the sitemap.xml in the root folder (domain.com/sitemap.xml). If your sitemap is located elsewhere or has a different name, you can add its URL to the settings so the crawler will follow it.

You'll find this option in the Project Settings -> Advanced Crawl

[Screenshot: Sitemap URLs setting in Advanced Crawl]

You can add as many sitemaps as needed (one per line).

 

Yoast users:

If your sitemap_index.xml is not recognized correctly, please add each sitemap URL from the index.

(e.g.: .../page-sitemap.xml)
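The individual sitemap URLs can be read directly from the index file. The sketch below only illustrates the standard sitemap index format (the index URL is a placeholder); it is a small convenience script, not a Ryte feature:

# List the <loc> entries of a standard sitemap index so the individual
# sitemap URLs can be pasted into the settings (one per line).
# The index URL is a placeholder.
import urllib.request
import xml.etree.ElementTree as ET

INDEX_URL = "https://www.example.com/sitemap_index.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(INDEX_URL) as response:
    tree = ET.parse(response)

for loc in tree.findall(".//sm:sitemap/sm:loc", NS):
    print(loc.text.strip())   # e.g. .../page-sitemap.xml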

 

 

2.7 Sort GET-Parameters

Sort the GET parameters in URLs alphabetically; this can reduce the amount of duplicate content.
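The idea: two URLs that differ only in parameter order collapse into one. A minimal sketch of the concept (not the crawler's own code; the example URLs are made up):

# Two URLs that differ only in parameter order normalize to the same URL.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def sort_params(url: str) -> str:
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))   # sort parameters alphabetically
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))

print(sort_params("https://www.example.com/shop?color=red&brand=acme"))
print(sort_params("https://www.example.com/shop?brand=acme&color=red"))
# Both print: https://www.example.com/shop?brand=acme&color=red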

 

2.8 Ignore GET Parameters

Define GET parameters here that will be automatically removed from URLs found on your website. This is useful to avoid unnecessary URL variations caused by session IDs or tracking parameters. Downside: issues like duplicate content might not be discovered.
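A minimal sketch of the concept, using made-up parameter names (sessionid, utm_source) as examples of what you might configure here:

# Strip the configured parameters before the URL is counted; the parameter
# names below are hypothetical examples.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED = {"sessionid", "utm_source"}

def drop_params(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))

print(drop_params("https://www.example.com/shop?page=2&sessionid=abc123&utm_source=newsletter"))
# -> https://www.example.com/shop?page=2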

 

 

 

3. Ignore / Include URLs

3.1 Exclude URLs (blacklist)

[Screenshot: Exclude URLs (blacklist) setting]

You can exclude URLs from your crawl by adding blacklist rules. In this example, we want to exclude the Magazine and our Wiki, which we achieve by blacklisting the "subfolders". The rules should look like this:

regex:/wiki/

This rule will exclude all URLs containing /wiki/. Please note that this does not take the folder hierarchy into account: the URL domain.com/wiki/ will be excluded, and so will domain.com/subfolder/wiki/ (see the sketch at the end of this section).

You can also apply a rule at any depth if you want specific pages to be excluded, e.g.:

regex:https://en.ryte.com/magazine/onpage-becomes-ryte

 

You can add as many rules as you like.
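To see why regex:/wiki/ also excludes domain.com/subfolder/wiki/: an unanchored rule matches anywhere in the URL, regardless of folder depth. The following sketch only demonstrates that matching behavior and is not the crawler's exact matcher:

# Unanchored regex rules match anywhere in the URL string.
import re

blacklist = [re.compile(r"/wiki/")]   # equivalent in spirit to the rule "regex:/wiki/"

def is_excluded(url: str) -> bool:
    return any(rule.search(url) for rule in blacklist)

print(is_excluded("https://www.example.com/wiki/page"))            # True  -> excluded
print(is_excluded("https://www.example.com/subfolder/wiki/page"))  # True  -> excluded as well
print(is_excluded("https://www.example.com/magazine/article"))     # False -> crawled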

 

3.2 Include URLs (whitelist)

The whitelist has the same functionality as the blacklist, but works the opposite way. If you need to apply "include only" rules, you can use the whitelist feature.

[Screenshot: Include URLs (whitelist) setting]

In this example, we want to crawl ONLY our Magazine and Wiki. We achieve this by whitelisting each "subfolder":

regex:/wiki/

regex:/magazine/

Please note that this will also whitelist domain.com/subfolder/wiki/; you may need to anchor the rule more precisely (e.g. regex:https://en.ryte.com/wiki/).

 

 

3.3 Test Blacklist/whitelist Settings

It can be very time-consuming to run your rules without testing, only to find out that they did not work once the crawl has finished. Please test your settings first in order to get familiar with the syntax.

[Screenshot: Test Blacklist/whitelist settings]

 

Here we are testing with our whitelist rules from above, so only URLs from either the Wiki or the Magazine should be crawled.

If we enter a URL that should be excluded and the response status is 9xx (in this example 950), our settings are fine. If the status is still 200, the rules did not work.

We can also test in the other direction by entering a URL that is covered by the whitelist rules:

[Screenshot: Test result for a whitelisted URL]

If the rules were applied successfully, the test will respond with a status of 200. This only makes sense if we tested excluded URLs first!

Test settings status codes:

  • 200 - OK
  • 950 - blocked by whitelist
  • 951 - blocked by blacklist

 

Important: If your test was not successful and you run it again with the same test URL, the result might be cached. Please enter a different URL each time you run a test to avoid cache discrepancies.

 

 

 

4. General settings

In the General Settings, you can change the project's name and slug.

Project Name

The name of your project as it will be displayed within your Ryte account (default: the domain).

Slug

This is the short notation of your project and how it is displayed within the URL.
ATTENTION: The project's domain remains unaffected; this only changes the display URL of the project. This is important if you analyze the same domain multiple times in your Ryte account.

 

 

 
