Have you ever felt lost at sea navigating around your website structure, specifically when it comes to deciding how to organise its sitemap? You’re definitely not alone, and yes – it can be a bit of a maze out there.
When it comes to deciding which URLs should be added to your website sitemap and which should be left out, things can quickly become tricky and best practices are often up for debate.
Before we dive into our closer look at what pages you should not include in your sitemap, let’s take a step back and recap why it is important to have a sitemap for your website.
What is an XML sitemap and why should my website have one?
Simply put, a sitemap is a file that tells crawlers how your website is structured and holds information about its pages. A sitemap can help search engines crawl your site more efficiently, as it tells Google which pages of your website you think are important and should be prioritised for crawling.
On top of that, a sitemap holds valuable information about the pages and files on your website, such as when they were last updated and any alternative language versions of a page that might be available.
Usually, a sitemap is split into sections, with every section containing a breakdown of the related pages.
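As a rough illustration of what sits inside one of those files, here is a minimal Python sketch that writes a sitemap with a last-updated date and an alternative language version for each page. The `build_sitemap` helper and the example.com URLs are our own placeholders, not the output of any particular SEO plugin.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_sitemap(pages):
    """Return sitemap XML (as a string) for a list of page dicts.

    Each dict needs "loc". "lastmod" (YYYY-MM-DD) and "alternates"
    (a mapping of language code to URL) are optional.
    """
    # Serialise the sitemap namespace as the default one and xhtml with a prefix.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page["loc"]
        if page.get("lastmod"):
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = page["lastmod"]
        # One xhtml:link element per alternative language version.
        for lang, href in page.get("alternates", {}).items():
            ET.SubElement(
                url,
                f"{{{XHTML_NS}}}link",
                {"rel": "alternate", "hreflang": lang, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")


# Placeholder data for illustration only.
example = build_sitemap(
    [
        {
            "loc": "https://example.com/services/",
            "lastmod": "2024-01-15",
            "alternates": {"fr": "https://example.com/fr/services/"},
        }
    ]
)
```

In practice a plugin generates and updates this file for you; the sketch is only meant to show which pieces of information a sitemap entry carries.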
Setting up a sitemap for your website is straightforward these days, thanks to several SEO plugins that do the legwork for you and keep it up to date (although the nitty-gritty details of building and submitting a sitemap to Google are also available if you are interested in a deep dive). If you don’t already have one, it’s worth setting one up.
Now, to the main question: should you include every single page of your website in your XML sitemap?
Well, if you want to – you could. But in practical terms, probably better not.
Why, we hear you ask? Simply put, to make the most of your crawl budget, a sitemap should only contain URLs that are indexable and ready to be served in search results, that return a “200 OK” response code (your gold standard here) rather than a redirect or an error, and that are linked from within your website.
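If you like to see rules written down as code, the three conditions above can be sketched in a few lines of Python. The `PageRecord` fields are hypothetical names for data you would typically get from a crawl export; they are an assumption for illustration, not a standard format.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    status_code: int       # HTTP status returned when the URL was fetched
    indexable: bool        # False if the page is noindexed, blocked, etc.
    internal_inlinks: int  # number of internal links pointing at the URL


def belongs_in_sitemap(page: PageRecord) -> bool:
    """True only if the page returns 200, is indexable and is linked internally."""
    return page.status_code == 200 and page.indexable and page.internal_inlinks > 0


def sitemap_candidates(pages):
    """Return the URLs that pass all three checks, in their original order."""
    return [page.url for page in pages if belongs_in_sitemap(page)]
```

Each of the sections below is really just a closer look at one way a page can fail these checks.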
At this point, you might be asking yourself which pages you should exclude from your sitemap. Keep on reading, we’ve got you covered.
Here is a brief list of which pages you should exclude from your XML sitemap.
1. URLs with noindex tags
Going back to basics, your XML sitemap should only contain URLs you want to be indexed by search engines.
When you mark a URL as noindex within a <meta> tag or an HTTP response header, you are sending a clear signal to search engines that you do not want that page to be indexed, and as such it should not be included in your sitemap.
Including a URL tagged as noindex in a sitemap sends conflicting signals to search engines and should be avoided.
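To check for this at scale, a small script can look for noindex in both places. The sketch below is a simplified assumption of how you might do it: it reads the robots (or googlebot) <meta> tag and the X-Robots-Tag header, treats the `none` directive as noindex, and ignores header values scoped to a specific crawler. The `is_noindex` name is ours.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives.update(d.strip().lower() for d in content.split(","))


def is_noindex(html: str, headers: dict) -> bool:
    """True if the page is noindexed via a meta tag or the X-Robots-Tag header."""
    parser = _RobotsMetaParser()
    parser.feed(html)

    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.update(d.strip().lower() for d in value.split(","))

    found = parser.directives | header_directives
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in found or "none" in found
```

Any URL for which this returns True is a candidate for removal from the sitemap.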
2. URLs with HTTP status codes 3xx / 4xx / 5xx
URLs returning non-200 response codes should be excluded from your XML sitemap. That covers redirection status codes (3xx), client error response codes (4xx) and server error response codes (5xx).
We understand that it is virtually impossible to memorise every single error response code in existence (kudos to you if you can, and do let us know your secret trick, as we can barely remember what we put on our shopping list two days ago), so we’ve got a handy list that summarises the HTTP response codes you might want to exclude from your XML sitemap:
HTTP Response code 3xx – Redirection
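|HTTP Response Code 300|Multiple Choices|
|HTTP Response Code 301|Moved Permanently|
|HTTP Response Code 302|Found|
|HTTP Response Code 303|See Other|
|HTTP Response Code 304|Not Modified|
|HTTP Response Code 307|Temporary Redirect|
|HTTP Response Code 308|Permanent Redirect|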
HTTP Response code 4xx – Client Error Response
|HTTP Response Code 400|Bad Request|
|HTTP Response Code 401|Unauthorized|
|HTTP Response Code 402|Payment Required|
|HTTP Response Code 403|Forbidden|
|HTTP Response Code 404|Not Found|
|HTTP Response Code 405|Method Not Allowed|
|HTTP Response Code 406|Not Acceptable|
|HTTP Response Code 407|Proxy Authentication Required|
|HTTP Response Code 408|Request Timeout|
|HTTP Response Code 409|Conflict|
|HTTP Response Code 410|Gone|
|HTTP Response Code 411|Length Required|
|HTTP Response Code 412|Precondition Failed|
|HTTP Response Code 413|Content Too Large|
|HTTP Response Code 414|URI Too Long|
|HTTP Response Code 415|Unsupported Media Type|
|HTTP Response Code 416|Range Not Satisfiable|
|HTTP Response Code 417|Expectation Failed|
|HTTP Response Code 421|Misdirected Request|
|HTTP Response Code 422|Unprocessable Content|
|HTTP Response Code 423|Locked|
|HTTP Response Code 424|Failed Dependency|
|HTTP Response Code 425|Too Early|
|HTTP Response Code 426|Upgrade Required|
|HTTP Response Code 428|Precondition Required|
|HTTP Response Code 429|Too Many Requests|
|HTTP Response Code 431|Request Header Fields Too Large|
|HTTP Response Code 451|Unavailable for Legal Reasons|
HTTP Response Code 5xx – Server Error Response
|HTTP Response Code 500|Internal Server Error|
|HTTP Response Code 501|Not Implemented|
|HTTP Response Code 502|Bad Gateway|
|HTTP Response Code 503|Service Unavailable|
|HTTP Response Code 504|Gateway Timeout|
|HTTP Response Code 505|HTTP Version Not Supported|
|HTTP Response Code 506|Variant Also Negotiates|
|HTTP Response Code 507|Insufficient Storage|
|HTTP Response Code 508|Loop Detected|
|HTTP Response Code 511|Network Authentication Required|
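The good news is that you don’t need to memorise the list to act on it, because the first digit of a code tells you which family it belongs to. Here is a small Python sketch, assuming you already have a mapping of URL to status code from a crawl (the `split_by_status` name and its input format are our own invention):

```python
from collections import defaultdict

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code: int) -> str:
    """Return the family a status code belongs to, based on its first digit."""
    return STATUS_CLASSES.get(code // 100, "other")


def split_by_status(crawl_results):
    """Split a {url: status_code} mapping into URLs to keep and URLs to drop.

    Only URLs answering exactly 200 are kept. Everything else is grouped
    by status class so it can be reviewed and fixed.
    """
    keep = []
    drop = defaultdict(list)
    for url, code in crawl_results.items():
        if code == 200:
            keep.append(url)
        else:
            drop[status_class(code)].append(url)
    return keep, dict(drop)
```

Grouping the dropped URLs by family makes the follow-up easier: redirects usually mean the sitemap should list the destination instead, while 4xx and 5xx URLs need fixing or removing.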
3. Orphan Pages with no value
Orphan pages are pages that are virtually inaccessible unless you have a direct link to them, because they are not linked to from any other page or section of your site.
Search engines may find it tricky to discover orphan pages because they have no internal links from anywhere else on your website.
A page that has no links pointing to it will carry very little page authority, and search engines may opt to remove it from the index entirely.
However, there are a couple of questions to consider before removing an orphan page from a sitemap or not including it from the get go:
- Is the page important and valuable for the website? If not and it does not have a place in your sitemap, you should remove it.
- Is the page ranking for one or more keywords? Is it getting traffic (perhaps from being shared on social media or similar)? If so, ensure that it is linked and accessible within your website architecture. If not, remove it from the sitemap.
- Is the page valuable and optimised and does it have a good design? If so, keep it and improve it by making it accessible. If this is not the case, remove it from your sitemap.
Finding orphan pages can be made a part of routine SEO audits, to avoid any stragglers falling off your radar.
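For a sense of how that audit step works under the hood, here is a minimal sketch: take every URL you know about (from the sitemap or a CMS export) and subtract every URL that receives at least one internal link from another page. The `find_orphans` function and its inputs are hypothetical; most crawling tools produce an equivalent report for you.

```python
def find_orphans(known_urls, internal_links):
    """Return known URLs that no other page links to.

    known_urls: iterable of URLs you expect to exist.
    internal_links: dict mapping a page URL to the list of URLs it links to.
    """
    linked_to = set()
    for source, targets in internal_links.items():
        # A page linking to itself does not rescue it from being an orphan.
        linked_to.update(target for target in targets if target != source)
    return sorted(set(known_urls) - linked_to)
```

Note that a homepage nothing links back to would also show up here, so treat the result as a list to review rather than a list to delete.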
4. URLs disallowed in robots.txt
If a URL is disallowed in robots.txt (the file that tells search engine crawlers which URLs the crawler can access on your site) but included in a sitemap, it provides conflicting information to search engines.
Any section of your website covered by a disallow directive in robots.txt should not be added to the sitemap.
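Python’s standard library ships a robots.txt parser, so cross-checking a sitemap against your disallow rules takes only a few lines. A sketch, with one caveat: `urllib.robotparser` follows the original robots.txt conventions and may not interpret wildcard patterns the way Google does, so treat it as a first pass. The `allowed_by_robots` helper is our own name.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, urls, user_agent: str = "*"):
    """Return only the URLs that the given robots.txt allows crawlers to fetch."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Placeholder rules and URLs for illustration.
rules = "User-agent: *\nDisallow: /admin/\n"
sitemap_urls = ["https://example.com/blog/", "https://example.com/admin/login"]
safe_urls = allowed_by_robots(rules, sitemap_urls)
```

Anything filtered out here is a URL you are asking search engines to crawl in one file and telling them to stay away from in another.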
5. Canonicals that are not self-referencing
A URL pointing to another as its canonical version (a canonicalised URL) should not be included in your sitemap, as the canonical tag signals to search engines that you don’t want that URL to be indexed.
URLs whose canonical points elsewhere should not be added to the sitemap, but URLs with a self-referencing canonical can be included.
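Checking this by hand gets tedious quickly, so here is a rough Python sketch that reads a page’s canonical tag and compares it with the page’s own URL. Two assumptions are baked in and worth adjusting to your own setup: a page with no canonical tag at all is treated as fine, and the comparison is an exact string match after resolving relative links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")


def is_self_canonical(page_url: str, html: str) -> bool:
    """True if the page's canonical tag points back at the page itself."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        # Assumption: no canonical tag means nothing points elsewhere.
        return True
    # Resolve relative hrefs against the page URL before comparing.
    return urljoin(page_url, parser.canonical) == page_url
```

A filtered product URL that points to the main category page as its canonical would return False here, which is your cue to leave it out of the sitemap.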
6. Paginated URLs that are not “view all”
Paginated URLs are technically indexable, so if you wish to do so you could include them in your sitemap.
However, they are not a priority for crawl budget, so we recommend leaving them out of your XML sitemap to improve crawling efficiency.
In most cases, paginated pages are not pages you want to rank, so there is no need for them to be in your sitemap.
An exception: if pagination is implemented with a “view all” page instead of rel="next" and rel="prev" attributes, the “view all” page should be included in your XML sitemap.
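How you spot a paginated URL depends entirely on how your site builds them. As a sketch, the snippet below assumes two common patterns, `/page/2/` in the path and `?page=2` in the query string. Both patterns are assumptions, so swap in whatever your own CMS produces. Anything without a page number, including a “view all” URL, is kept.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed path pattern, e.g. https://example.com/blog/page/2/
PAGE_PATH = re.compile(r"/page/(\d+)/?$")


def page_number(url: str):
    """Return the page number encoded in the URL, or None if there is none."""
    parsed = urlparse(url)
    match = PAGE_PATH.search(parsed.path)
    if match:
        return int(match.group(1))
    # Assumed query pattern, e.g. https://example.com/shop?page=3
    values = parse_qs(parsed.query).get("page", [])
    if values and values[0].isdecimal():
        return int(values[0])
    return None


def drop_paginated(urls):
    """Keep URLs with no page number (or page 1) and drop page 2 onwards."""
    return [url for url in urls if (page_number(url) or 1) == 1]
```

Running your sitemap URLs through a filter like this before publishing keeps page 2, page 3 and friends out without touching the listing pages themselves.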
To wrap it up…
Remember that sitemaps are important, but they are only one piece of the puzzle: a well-crafted SEO strategy, valuable and optimised content and an effective site architecture are what really make the difference and help your website climb Google’s search results pages.
The good news is that you don’t have to do it alone: here at Logic Digital we love all things SEO and web design and we’d love to help you, just like we’ve helped a lot of businesses through the years. If you need a hand with search engine optimisation services for your business, get in touch today to have a chat and discover how we can help you be found online by the right customers, make more conversions and smash your targets.