Jim Posted August 9

Google lists 1000 example URLs under "Crawled - currently not indexed". (There are approximately 20k in total, but only 1000 URLs can be listed in the report.) Most of these are in the format:

....../get_file/0/b04c151f4453927a9b37d6774df63e17bb57891a1b/0/20/screenshots/1.jpg/
...../get_file/6/93d0af9c5eccc35429879aae59d9155863ae321c7b/64000/64707/64707_720p.mp4/?rnd=1722643200142

These types of URLs are not intended to be indexed. Is it possible to prevent Google from crawling these types of pages?

I don't know whether Google discovering/crawling these sorts of URLs is harmful to the overall assessment of the site and the indexing of other pages - does anyone have experience of this?
Tech Support Posted August 10

To prevent Google from indexing something, you are expected to use robots.txt. I think our default robots.txt doesn't block these files, as they are the content files for videos.
Jim (Author) Posted August 19

Why is the default robots.txt set to allow robots to crawl these page content URLs (files)? There doesn't appear to be any SEO purpose to allowing robots to crawl these pages - if there is an intended purpose, please could you explain it?

If there is no operational/SEO benefit to allowing robots to crawl (and inspect) these page content URLs (files), then it would seem appropriate to block robots from them. One way might be to add lines to the robots.txt file such as:

User-agent: *
Disallow: /get_file/

Would this block robots from the page content URL examples listed above, without otherwise affecting or stopping the robots from crawling the page URLs that are intended to be indexed?
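As a quick offline sanity check, a proposed rule like this can be tested with Python's standard urllib.robotparser before deploying it. This is a minimal sketch; example.com and the /videos/ page URL are placeholders I've made up, while the blocked URL is the first example from the opening post:

# Minimal sketch: test a proposed robots.txt rule with Python's
# standard-library parser. example.com and the /videos/ URL below
# are placeholders, not paths from the actual site.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /get_file/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# File-content URLs like the examples above should be blocked...
print(rp.can_fetch("Googlebot",
    "https://example.com/get_file/0/b04c151f4453927a9b37d6774df63e17bb57891a1b/0/20/screenshots/1.jpg/"))  # False

# ...while ordinary pages intended for indexing remain crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/videos/example-page/"))  # True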
Tech Support Posted August 20

For indexing videos Google needs to download the files.

https://developers.google.com/search/docs/crawling-indexing/sitemaps/video-sitemaps

Quote:
Additionally, the following requirements apply to video sitemaps specifically:
- Don't list videos that are unrelated to the content of the host page. For example, a video that is a small addendum to the page, or unrelated to the main text content.
- All files referenced in the video sitemap must be accessible to Googlebot. This means that all URLs in the video sitemap:
  - must not be disallowed for crawling by robots.txt rules,
  - must be accessible without metafiles and without logging in,
  - must not be blocked by firewalls or similar mechanisms, and
  - must be accessible on a supported protocol: HTTP and FTP (streaming protocols are not supported).
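For context, a minimal video sitemap entry looks like the sketch below (based on the documentation linked above; the hostname, title, description and the truncated "..." paths are placeholders, not values from the actual site). The <video:thumbnail_loc> and <video:content_loc> URLs are exactly the kind of get_file URLs that must remain crawlable:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <!-- The host page that is intended to be indexed -->
    <loc>https://example.com/videos/64707/</loc>
    <video:video>
      <!-- Thumbnail and video file: both must be accessible to Googlebot -->
      <video:thumbnail_loc>https://example.com/get_file/.../screenshots/1.jpg/</video:thumbnail_loc>
      <video:title>Example video title (placeholder)</video:title>
      <video:description>Placeholder description of the video.</video:description>
      <video:content_loc>https://example.com/get_file/.../64707_720p.mp4/</video:content_loc>
    </video:video>
  </url>
</urlset>

So blocking /get_file/ in robots.txt would tidy up the crawl report, but it would also make the video and thumbnail files ineligible for video indexing.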
Jim (Author) Posted August 20

Ok, thank you - it looks like there is no choice but to keep these URLs accessible, even if they fill up the Google report with content URLs that are not intended to be indexed.