To find out why a URL isn't indexable, take a look at the indexability report. A URL can be non-indexable for several reasons; the "Indexability" column in the report shows which one applies to each of your URLs. The possible reasons are described below.
Disallow via robots.txt
The robots.txt file tells bots which parts of your website they should or shouldn’t visit. It is only a guideline, but reputable bots follow its directives. If a URL or directory is set to "disallow" in the robots.txt file, search engine bots won’t crawl it when they visit your website. Note, however, that a disallowed URL might still be indexed if external links point to it.
The robots.txt has to follow certain rules. For more information, please check the Google specifications.
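To illustrate how a disallow rule is evaluated, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the example.com URLs and the helper name `is_crawl_allowed` are assumptions made up for this example. The standard-library parser also does not support every extension that individual search engines accept (such as wildcards), so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt allows user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = is_crawl_allowed(
    ROBOTS_TXT, "Googlebot", "https://www.example.com/private/page.html"
)
allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://www.example.com/blog/")
print(blocked, allowed)  # False True
```

A URL for which this check returns `False` will not be crawled by bots that respect the file, although it can still end up in the index through external links.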
To avoid duplicate content issues, search engines introduced the canonical tag. With this tag, website owners can define which pages are canonical and which aren’t. Search engines generally index only canonical URLs, so if your page should be indexable but isn’t, make sure it has a self-referential canonical tag. If the canonical tag points to a different URL, only that URL will be indexed. Common pitfalls with the canonical tag are missing trailing slashes, protocol mix-ups (HTTP rather than HTTPS) and slip-ups with the subdomain (URL with or without "www"). Make sure your canonical URL exactly matches the URL you want to see indexed. You can check your URL in the single page analysis to see whether your canonical tag has been set correctly.
The canonical tag can be set either in the <head> section of the source code, as a <link> element with rel="canonical", or in the HTTP header, as a Link header with rel="canonical".
The canonical tag is only a suggestion and can be disregarded by search engines. This can happen if your canonical tags contradict each other or are broken. Make sure your canonicals are set correctly and consistently across your domain to minimise this risk.
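The exact-match requirement and the "contradicting canonicals" case can be sketched as follows. This is a simplified illustration, not how Ryte or any search engine evaluates canonicals: the class `CanonicalFinder`, the function `canonical_status`, the status labels and the example.com URLs are all hypothetical, and only canonicals in the HTML source are considered (not the HTTP header).

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_status(page_url: str, html: str) -> str:
    """Classify the canonical setup of a page with a plain string comparison."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "missing"
    if len(set(finder.canonicals)) > 1:
        return "conflicting"
    if finder.canonicals[0] == page_url:
        return "self-referential"
    return "points elsewhere"


# Hypothetical page whose canonical lacks the trailing slash.
HTML = '<html><head><link rel="canonical" href="https://www.example.com/shoes"></head></html>'

print(canonical_status("https://www.example.com/shoes/", HTML))  # points elsewhere
print(canonical_status("https://www.example.com/shoes", HTML))   # self-referential
```

The comparison is deliberately strict: a missing trailing slash, a different protocol or a different subdomain all count as a different URL.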
Ideally, only URLs that answer with the status code 200 are indexed. If your page answers with a different status code, such as a 3xx redirect, it will eventually be removed from the index. If you are using temporary redirects (302), make sure that your URLs answer with the status code 200 again once the temporary issue has passed. If the redirect is permanent (301), fix the internal links so that they point to the new URL. You can check your redirects in Website Success > Indexability > Redirects > Status Codes.
Your internal links should point directly to the target URLs. Avoid linking to redirects if possible. If the Redirects report mentioned above shows links pointing to redirects, correct them so that they point directly to the actual target URL.
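Finding the "actual target URL" for a redirecting link means following the redirect chain to its end. Here is a minimal sketch that works on already collected crawl data instead of live requests. The `CRAWL` dictionary, its URLs and the function `final_target` are hypothetical and only show the idea.

```python
# Hypothetical crawl results: URL -> (status code, redirect target or None).
CRAWL = {
    "http://example.com/old": (301, "https://example.com/old"),
    "https://example.com/old": (301, "https://example.com/new"),
    "https://example.com/new": (200, None),
}


def final_target(url, crawl, max_hops=10):
    """Follow 3xx entries until a non-redirect answer; return (final_url, hops)."""
    seen = set()
    hops = 0
    while True:
        status, location = crawl.get(url, (None, None))
        is_redirect = status is not None and 300 <= status < 400
        if not is_redirect or location is None:
            return url, hops
        if url in seen or hops >= max_hops:
            raise ValueError("redirect loop or chain too long: " + url)
        seen.add(url)
        url = location
        hops += 1


target, hops = final_target("http://example.com/old", CRAWL)
print(target, hops)  # https://example.com/new 2
```

Any internal link for which `hops` is greater than zero is a candidate for being rewritten to `target`.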
noindex via robots
By setting the meta tag "robots" to "noindex", you can tell search engines that your page should not be indexed. The meta tag "robots" is placed in the <head> section of the source code, but you can also send "noindex" via the HTTP header (X-Robots-Tag).
If a URL is set to "noindex", crawling of that URL must be allowed in the robots.txt. Search engine crawlers cannot read the noindex directive if they are disallowed from crawling the URL.
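The two places where "noindex" can appear, and the conflict with a robots.txt disallow, can be sketched like this. The helper names `has_noindex` and `noindex_is_unreadable` are hypothetical. The sketch assumes the header variant is sent as `X-Robots-Tag`, and it ignores bot-specific forms (for example a value prefixed with a user agent name) for brevity.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str, headers: dict) -> bool:
    """True if "noindex" is set in the robots meta tag or the X-Robots-Tag header."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.update(part.strip().lower() for part in value.split(","))
    return "noindex" in finder.directives or "noindex" in header_directives


def noindex_is_unreadable(noindex: bool, crawl_allowed: bool) -> bool:
    """A noindex directive cannot be read if the URL is disallowed from crawling."""
    return noindex and not crawl_allowed


# Hypothetical page and response headers.
HTML = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(HTML, {}))                                         # True
print(has_noindex("<html></html>", {"X-Robots-Tag": "noindex"}))     # True
print(noindex_is_unreadable(has_noindex(HTML, {}), crawl_allowed=False))  # True
```

A `True` result from `noindex_is_unreadable` marks the conflicting setup described above: the page asks not to be indexed, but crawlers are not allowed to fetch it and therefore never see the directive.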
Pages that answer with a 4xx or 5xx HTTP status code also won't get indexed by search engines. If an indexed page suddenly answers with an error status code, search engines will usually keep trying for a while to see whether the page becomes available again. If it doesn't, the page will be removed from the index. Make sure to analyze your website regularly to catch and fix any broken pages right away. In Ryte, you can use either the status code report found under Website Success > Indexability > Status Codes or the Critical Errors report (Website Success > Critical Errors) to check whether you have broken pages.
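As a rough illustration of how status codes map to the cases described in this article, the following sketch groups crawl results by status class and lists the broken pages. The function `status_bucket`, the bucket labels and the `CRAWL_STATUS` data are hypothetical and are not taken from the Ryte reports.

```python
from collections import Counter


def status_bucket(status: int) -> str:
    """Map an HTTP status code to the indexability case it falls under."""
    if status == 200:
        return "indexable (200)"
    if 300 <= status < 400:
        return "redirect (3xx)"
    if 400 <= status < 500:
        return "client error (4xx)"
    if 500 <= status < 600:
        return "server error (5xx)"
    return "other"


# Hypothetical crawl results: URL -> status code.
CRAWL_STATUS = {
    "https://www.example.com/": 200,
    "https://www.example.com/old": 301,
    "https://www.example.com/missing": 404,
    "https://www.example.com/cart": 503,
}

summary = Counter(status_bucket(code) for code in CRAWL_STATUS.values())
broken = sorted(url for url, code in CRAWL_STATUS.items() if code >= 400)

print(dict(summary))
print(broken)
```

Here `broken` contains the URLs that answered with a 4xx or 5xx code and should be fixed first.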
URLs that answered with a 5xx status code might only have been temporarily unavailable. If your server isn't very robust, it might have struggled with the load during the crawl. If you have many 5xx errors in your analysis, try reducing the crawl speed in the project settings. You can also set up automatic crawling outside of your business hours, when your domain receives the least user traffic.