Struggling with your website’s sitemap? You’re not alone!
Have you ever felt lost at sea trying to navigate your website’s structure, specifically when it comes to organising its sitemap? It can feel like a real maze.
Deciding which URLs to include in your sitemap and which to leave out can be tricky, with best practices often being debated.
Before we dive into which pages you should exclude from your sitemap, let’s take a step back and recap why having a sitemap is crucial for your website.
What is an XML sitemap and why should my website have one?
Put simply, a sitemap is a file that outlines your website’s structure and provides information about its pages.
Sitemaps are particularly valuable because they help search engines crawl your site more efficiently. By using a sitemap, you can indicate to Google which pages are important and should be prioritised for indexing.
On top of that, a sitemap includes key details about your website’s pages and files, such as their last update timestamps and any alternative language versions available.
Usually, a sitemap is split into sections, with every section containing a breakdown of the related pages.
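As a rough illustration of what one such section holds, here is a minimal sketch that builds a small sitemap with Python's standard library. The `build_sitemap` helper and the example.com URLs and dates are made up for this example; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML for an iterable of (url, last_modified) pairs.

    Hypothetical helper for illustration only.
    """
    # Serialise the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Example pages are placeholders, not real URLs from any site.
xml_text = build_sitemap([
    ("https://example.com/", "2024-05-01"),
    ("https://example.com/blog/", "2024-05-14"),
])
print(xml_text)
```

In practice an SEO plugin generates this file for you; the sketch is only meant to show that each entry boils down to a URL plus optional details such as the last update date.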
Setting up a sitemap today is easier than ever, thanks to various SEO plugins that automate the process and ensure it stays current.
In simple terms, it’s worth having a sitemap for your website if you don’t already have one.
Should every page be included in your XML sitemap?
While it’s possible to include every page in your sitemap, it’s not usually necessary or practical.
Why? To maximise Google’s crawl efficiency, a sitemap should focus on listing URLs that are indexable, free of errors (aim for “200 OK” response codes), and internally linked within your site.
At this point, you might be wondering which pages you should exclude from your sitemap. Keep reading – we’ve got you covered.
Pages to exclude from your XML sitemap
1. URLs with noindex tags
Let’s get back to basics: your XML sitemap should only list URLs that you want search engines to index.
When you apply a noindex directive, either in a robots <meta> tag or via the X-Robots-Tag HTTP response header, you’re clearly signalling to search engines that you prefer those pages not to be indexed. Therefore, these URLs shouldn’t be included in your sitemap.
Including a URL with a noindex tag in your sitemap sends conflicting signals to search engines, which can confuse their indexing process.
It’s best to exclude URLs like these from your sitemap to maintain consistency in how you communicate your site’s structure to search engines.
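If you want to check this yourself, a minimal sketch along these lines can flag noindex pages before they reach the sitemap. It assumes you already have each page's HTML and response headers to hand; `is_noindex` is a hypothetical helper name, and the matching is deliberately simple (it does not handle crawler-specific meta tags such as `googlebot`).

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the content of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            self.directives.append((attr.get("content") or "").lower())


def is_noindex(html, headers):
    """Return True if the page asks not to be indexed (hypothetical helper).

    `html` is the page source; `headers` is a dict of response headers.
    """
    parser = _RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare in lower case.
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value.lower()
    in_meta = any("noindex" in directive for directive in parser.directives)
    return in_meta or "noindex" in header_value


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(is_noindex(page, {}))  # this page should be left out of the sitemap
```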
2. URLs with HTTP status codes 3xx/4xx/5xx
When it comes to your XML sitemap, it’s crucial to remove URLs returning non-200 response codes.
These include redirection status codes (3xx), client error response codes (4xx), and server error response codes (5xx).
We get it – it’s virtually impossible to memorise every single error response code in existence (kudos if you can, we can barely remember what we put on our shopping list two days ago).
To make things easier, here’s a handy list summarising the HTTP response codes you should consider excluding from your XML sitemap:
HTTP Response Code 3xx – Redirection
HTTP Response Code 300 | Multiple Choices |
HTTP Response Code 301 | Moved Permanently |
HTTP Response Code 302 | Found |
HTTP Response Code 303 | See Other |
HTTP Response Code 304 | Not Modified |
HTTP Response Code 307 | Temporary Redirect |
HTTP Response Code 308 | Permanent Redirect |
HTTP Response Code 4xx – Client Error Response
HTTP Response Code 400 | Bad Request |
HTTP Response Code 401 | Unauthorized |
HTTP Response Code 402 | Payment Required |
HTTP Response Code 403 | Forbidden |
HTTP Response Code 404 | Not Found |
HTTP Response Code 405 | Method Not Allowed |
HTTP Response Code 406 | Not Acceptable |
HTTP Response Code 407 | Proxy Authentication Required |
HTTP Response Code 408 | Request Timeout |
HTTP Response Code 409 | Conflict |
HTTP Response Code 410 | Gone |
HTTP Response Code 411 | Length Required |
HTTP Response Code 412 | Precondition Failed |
HTTP Response Code 413 | Content Too Large |
HTTP Response Code 414 | URI Too Long |
HTTP Response Code 415 | Unsupported Media Type |
HTTP Response Code 416 | Range Not Satisfiable |
HTTP Response Code 417 | Expectation Failed |
HTTP Response Code 421 | Misdirected Request |
HTTP Response Code 422 | Unprocessable Content |
HTTP Response Code 423 | Locked |
HTTP Response Code 424 | Failed Dependency |
HTTP Response Code 425 | Too Early |
HTTP Response Code 426 | Upgrade Required |
HTTP Response Code 428 | Precondition Required |
HTTP Response Code 429 | Too Many Requests |
HTTP Response Code 431 | Request Header Fields Too Large |
HTTP Response Code 451 | Unavailable for Legal Reasons |
HTTP Response Code 5xx – Server Error Response
HTTP Response Code 500 | Internal Server Error |
HTTP Response Code 501 | Not Implemented |
HTTP Response Code 502 | Bad Gateway |
HTTP Response Code 503 | Service Unavailable |
HTTP Response Code 504 | Gateway Timeout |
HTTP Response Code 505 | HTTP Version Not Supported |
HTTP Response Code 506 | Variant Also Negotiates |
HTTP Response Code 507 | Insufficient Storage |
HTTP Response Code 508 | Loop Detected |
HTTP Response Code 511 | Network Authentication Required |
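You don't need to memorise the list to act on it. The sketch below assumes you have already crawled your site and recorded a status code for each URL (the `statuses` dictionary and the `split_by_status` helper are invented for this example). It keeps only the URLs that return 200 and labels the rest by group.

```python
from http import HTTPStatus


def status_group(code):
    """Name the group a non-200 status code belongs to."""
    if 300 <= code < 400:
        return "redirection"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


def split_by_status(url_statuses):
    """Split {url: status_code} into URLs to keep and URLs to drop.

    Hypothetical helper: only "200 OK" URLs stay in the sitemap.
    """
    keep = [url for url, code in url_statuses.items() if code == HTTPStatus.OK]
    drop = {
        url: status_group(code)
        for url, code in url_statuses.items()
        if code != HTTPStatus.OK
    }
    return keep, drop


# Placeholder crawl results, not real data.
statuses = {
    "https://example.com/": 200,
    "https://example.com/old-page": 301,
    "https://example.com/missing": 404,
    "https://example.com/broken": 503,
}
keep, drop = split_by_status(statuses)
print(keep)
print(drop)
```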
3. Orphan pages with no value
Orphan pages are those elusive pages that are virtually invisible unless you have a direct link to them. This is because no other page or section of your site links to them.
Because orphan pages lack internal links, search engines may struggle to discover them. This can result in these pages having very little page authority. In some cases, search engines might even decide to remove them from their index altogether.
Before removing an orphan page from your sitemap
Before deciding whether to remove an orphan page from your sitemap or include it in the first place, there are a few key questions to consider:
- Is the page important and valuable to the website? If not, and it doesn’t have a place in your sitemap, it’s best to remove it.
- Is the page ranking for any keywords or receiving traffic (perhaps through social shares)? If yes, make sure it’s linked and easily accessible within your website’s structure. If not, consider removing it from the sitemap.
- Is the page valuable, optimised, and well designed? If so, keep it and focus on making it more accessible. If not, removing it from your sitemap may be the best move.
Routine SEO audits can help identify orphan pages and prevent them from slipping through the cracks.
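The detection part of such an audit is simple set logic. Here is a minimal sketch, assuming you can export your sitemap URLs and a map of internal links from a crawl; `find_orphans` and the sample data are hypothetical.

```python
def find_orphans(sitemap_urls, internal_links):
    """Return sitemap URLs that no other page links to.

    `internal_links` maps each page URL to the URLs it links to.
    Hypothetical helper; a link from a page to itself does not count.
    """
    linked = set()
    for source, targets in internal_links.items():
        linked.update(target for target in targets if target != source)
    return sorted(set(sitemap_urls) - linked)


# Placeholder crawl export, not real data.
sitemap_urls = [
    "https://example.com/",
    "https://example.com/services/",
    "https://example.com/old-landing-page/",
]
internal_links = {
    "https://example.com/": ["https://example.com/services/"],
    "https://example.com/services/": ["https://example.com/"],
    "https://example.com/old-landing-page/": ["https://example.com/"],
}
print(find_orphans(sitemap_urls, internal_links))
```

Each URL this returns is a candidate for the questions above: either link to it from somewhere sensible, or take it out of the sitemap.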
4. URLs disallowed in robots.txt
When a URL is marked as disallowed in your robots.txt file (which guides search engine crawlers on which URLs they can access), including it in a sitemap can create conflicting signals for search engines.
Any section of your website that’s restricted by a disallow directive in robots.txt should not be included in your sitemap.
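Python's standard library includes a robots.txt parser, so this check is easy to automate. The sketch below parses an example robots.txt from a string rather than fetching one; the rules, URLs, and the `allowed_in_sitemap` helper are placeholders. Note that the standard-library parser follows the original robots.txt conventions, so its handling of wildcard rules may differ from Google's.

```python
from urllib.robotparser import RobotFileParser

# Example rules for illustration; substitute your own robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""


def allowed_in_sitemap(urls, robots_txt, user_agent="*"):
    """Return the URLs that robots.txt does not disallow (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "https://example.com/blog/",
    "https://example.com/admin/login",
    "https://example.com/cart/checkout",
]
print(allowed_in_sitemap(candidates, ROBOTS_TXT))
```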
5. Canonicals that are not self-referencing
If one URL points to another URL as its canonical version (a canonicalised URL), it shouldn’t be included in your sitemap, as it signals to search engines that you prefer the canonical URL to be indexed.
Non-self-referencing canonical URLs should be excluded from your sitemap, but self-referencing ones can be included.
Canonical URLs and sitemaps:
When we talk about canonical URLs in SEO, we’re referring to a way of indicating to search engines the preferred version of a webpage when there are multiple versions of the same content accessible via different URLs.
This can happen due to parameters in the URL, session IDs, or similar reasons.
Self-referencing canonical URLs:
A self-referencing canonical URL is one where the canonical tag points to the page’s own URL.
For example, if you have a page at <https://example.com/page> and you set its canonical tag to <https://example.com/page>, this is self-referencing. In this case, you’re essentially telling search engines that this is the preferred and definitive version of the page.
Non-self-referencing canonical URLs:
On the other hand, a non-self-referencing canonical URL is when the canonical tag points to a different URL than the one where the tag is placed.
For instance, if <https://example.com/page> has a canonical tag pointing to <https://example.com/canonical-page>, you’re indicating that <https://example.com/canonical-page> should be considered the authoritative version.
For effective SEO
Excluding non-self-referencing canonical URLs from your XML sitemap is crucial to avoid confusing search engines.
Sitemaps are designed to provide a clear indication of the pages you want indexed, but if a page’s canonical URL points elsewhere, it suggests that the original URL isn’t the preferred version for indexing.
This inconsistency can lead to conflicting signals for search engines, potentially harming your SEO efforts. Conversely, self-referencing canonical URLs, where the canonical tag points to the same URL, should be included in your sitemap.
By doing so, you explicitly highlight to search engines that this is the primary URL you want indexed, helping to enhance your site’s visibility and search performance.
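To make the distinction concrete, here is a minimal sketch that reads the canonical tag from a page's HTML and compares it with the page's own URL. The helper names are hypothetical, and two simplifications are assumptions of this example rather than rules: URLs are compared as plain strings apart from a trailing slash, and a page with no canonical tag is treated as fine to include.

```python
from html.parser import HTMLParser


class _CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_of(html):
    """Return the canonical URL declared in the HTML, or None."""
    parser = _CanonicalParser()
    parser.feed(html)
    return parser.canonical


def belongs_in_sitemap(page_url, html):
    """True if the canonical is self-referencing (or absent, by assumption)."""
    canonical = canonical_of(html)
    if canonical is None:
        return True  # assumption: no canonical tag means no conflicting signal
    # Simplified comparison: ignore only a trailing slash.
    return canonical.rstrip("/") == page_url.rstrip("/")


self_ref = '<head><link rel="canonical" href="https://example.com/page"></head>'
other = '<head><link rel="canonical" href="https://example.com/canonical-page"></head>'
print(belongs_in_sitemap("https://example.com/page", self_ref))
print(belongs_in_sitemap("https://example.com/page", other))
```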
6. Paginated URLs that are not “view all”
Paginated URLs are technically indexable, so you could choose to include them in your sitemap.
However, it’s important to note that they aren’t typically a priority for crawl budget efficiency. Therefore, we recommend excluding them from your XML sitemap in most cases, as they’re not intended to rank independently.
One exception: if pagination is implemented with a “view all” page instead of rel="next" and rel="prev" attributes, include the “view all” pages in your XML sitemap.
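How you spot paginated URLs depends entirely on how your site builds them. As a rough sketch, the example below assumes pagination uses a `page` query parameter (for instance `?page=2`) and that the "view all" page has no such parameter; both are assumptions for illustration, so adjust the check to your own URL pattern.

```python
from urllib.parse import parse_qs, urlparse


def is_paginated(url, param="page"):
    """True if the URL carries the pagination query parameter.

    Assumes pagination looks like ?page=2; the parameter name is a guess
    you should replace with whatever your site actually uses.
    """
    query = parse_qs(urlparse(url).query)
    return param in query


def drop_paginated(urls, param="page"):
    """Return the URLs that are not paginated (hypothetical helper)."""
    return [url for url in urls if not is_paginated(url, param)]


# Placeholder URLs, not taken from a real site.
urls = [
    "https://example.com/shoes/",
    "https://example.com/shoes/?page=2",
    "https://example.com/shoes/?page=3",
    "https://example.com/shoes/?view=all",
]
print(drop_paginated(urls))
```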
To wrap it up…
Remember that sitemaps are important, but a clever SEO strategy, valuable and optimised content, and well-structured site architecture can make the difference in getting to the top of Google’s search results page.
The great news is that you don’t have to figure everything out alone. At Logic Digital we’re passionate about all things SEO and web design, and we’ve successfully supported numerous businesses over the years.