19 comments

  • andrethegiant 6 hours ago

    > there's typically a 5-7 day gap between updating the robots.txt file and crawlers processing it

    You could try moving your favicon to another dir, or root dir, for the time being, and update your HTML to match. That way it would be allowed according to the version that Google still has cached. Also, I think browsers look for a favicon at /favicon.ico regardless, so it might be worth making a copy there too.

    • throwaway2016a 4 hours ago

      /favicon.ico is the default, and it will be loaded if your page does not specify a different path in the metadata. In my experience, though, most clients respect the metadata and, for HTML content, won't try to fetch the default path until after the <head> section of the page has loaded.

      But non-HTML content has no choice but to use the default so it's generally a good idea to make sure the default path resolves.
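A minimal sketch of why the copy-to-root workaround can help while a stale robots.txt is still cached. It uses Python's standard `urllib.robotparser`, which is not Google's parser and may match rules differently. The `Disallow: /uploads/` rule and both icon paths are assumptions for illustration; the thread does not show the actual cached file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stale robots.txt, as a crawler might still have it cached.
CACHED_ROBOTS_TXT = """\
User-agent: *
Disallow: /uploads/
"""


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# The icon in the blocked directory versus a copy at the default path.
old_icon_allowed = is_allowed(CACHED_ROBOTS_TXT, "/uploads/favicon.ico")
root_icon_allowed = is_allowed(CACHED_ROBOTS_TXT, "/favicon.ico")
print(old_icon_allowed, root_icon_allowed)
```

Under these assumed rules the copy at /favicon.ico is fetchable even before the crawler picks up the updated file.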

    • ms7892 4 hours ago

      Thanks for sharing, I didn’t know that browsers look for a favicon at /favicon.ico. Thanks again.

  • dazc 5 days ago

    Use X-Robots-Tag: noindex to prevent files from being indexed, and let Google determine for itself how to crawl your site.

    A nightmare scenario can result, otherwise, where you have content indexed but don't allow googlebot to crawl it. This does not end well.

    https://developers.google.com/search/docs/crawling-indexing/...
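One way the header approach could look, as a rough sketch rather than a recommended setup: a tiny WSGI app that attaches `X-Robots-Tag: noindex` to its response. The app name, the `/profile/alice` path, and the `response_headers` helper are all hypothetical, and the helper calls the app directly instead of starting a real server.

```python
from wsgiref.util import setup_testing_defaults


def noindex_app(environ, start_response):
    """Hypothetical WSGI app: serves a page but asks crawlers not to index it."""
    body = b"user profile placeholder\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
        ("X-Robots-Tag", "noindex"),  # the header discussed above
    ]
    start_response("200 OK", headers)
    return [body]


def response_headers(app, path):
    """Call a WSGI app in-process and return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal fake request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = response_headers(noindex_app, "/profile/alice")
print(status, headers["X-Robots-Tag"])
```

In a real deployment the same header is often set in the web server config instead; the point here is only that the crawler has to be able to fetch the response to see it.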

    • csiegert 3 hours ago

      I’ve got two questions:

      1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

      2. The linked page says to avoid Disallow in robots.txt and to rely on the noindex tag. But how can I prevent googlebot from crawling all user profiles to avoid database hits, bandwidth, etc. without an entry in robots.txt? With noindex, googlebot must visit each user profile page to see that it is not supposed to be indexed.

      • seanwilson 2 hours ago

        https://developers.google.com/search/docs/crawling-indexing/...

           "Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it."
        
        It's counterintuitive but if you want a page to never appear on Google search, you need to flag it as noindex, and not block it via robots.txt.

        > 1. What does it look like for a page to be indexed when googlebot is not allowed to crawl it? What is shown in search results (since googlebot has not seen its content)?

        It'll usually list the URL with a description like "No information is available for this page". This can happen for example if the page has a lot of backlinks, it's blocked via robots.txt, and it's missing the noindex flag.
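The interaction described above can be written as a small truth table. This is a simplified model of the quoted documentation, not Google's actual logic, and both function names are made up for illustration.

```python
def crawler_sees_noindex(blocked_by_robots_txt: bool, has_noindex: bool) -> bool:
    """A crawler can only honor noindex on a page it is allowed to fetch."""
    return has_noindex and not blocked_by_robots_txt


def can_still_be_listed(blocked_by_robots_txt: bool, has_noindex: bool) -> bool:
    """Simplified: the URL can appear in results unless noindex was actually seen."""
    return not crawler_sees_noindex(blocked_by_robots_txt, has_noindex)


# Print all four combinations of the two flags.
for blocked in (False, True):
    for noindex in (False, True):
        listed = can_still_be_listed(blocked, noindex)
        print(f"blocked={blocked!s:5} noindex={noindex!s:5} can_be_listed={listed}")
```

The only combination that keeps the page out of results in this model is noindex set and robots.txt not blocking the page, which is the counterintuitive part.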

  • hk1337 4 hours ago

    It's good information but...

    1. Why is your favicon in the uploads directory? Usually, those would be at the root of your site or in an image directory.

    2. Why is there an uploads directory for a static site hosted on GitHub? I don't believe that is useful on GitHub, is it? You cannot have visitors upload files to it, right?

    • gwd 3 hours ago

      Speaking for myself:

      1. I want nginx to serve static files, and everything else to be reverse proxied to the webapp

      2. The configuration file that allows /favicon.ico (and others) to be a file but / and other paths to be passed to the webapp is kind of ugly. Here's mine:

          location ~* ^/(favicon\.ico|apple-touch-icon\.png)$ {
              root $icons_path;
          }
      
      In my own case I've so far decided to accept the ugly config file, but as you can see, I haven't gotten around to adding even a robots.txt or any of the other files the modern web ecosystem expects; and adding them involves adding them one-by-one. I can see why someone would say, "Why make an ugly hack of an nginx config, when I can just define the favicon location in the metadata to a path easily configured to be files-only?"

  • liendolucas 42 minutes ago

    I'm completely ignorant when it comes to SEO, so what are the consequences of having neither a robots.txt nor a sitemap.xml? Will that be detrimental in a big way?

  • xnx 4 hours ago

    The best SEO advice is to not focus on SEO and make a site that people will like.

    • dewey 2 hours ago

      Technical SEO is still a very valid optimization: making sure you have all the relevant tags, a good structure, fast-loading pages, etc.

  • seanwilson 5 hours ago

    How big is your site? Crawl budget is likely only relevant for huge sites, not personal blogs.

    • moribunda 3 hours ago

      Exactly - this is SEO over-optimisation.

  • dewey 2 hours ago

    If you don’t have millions of pages, the crawl budget limitations will most likely have zero impact.

    Make sure your basic technical SEO factors are all good and that Search Console looks good, and then stop worrying unless you are a huge site that’s living off SEO traffic.

  • maciekpaprocki 2 hours ago

    You don't want to exclude your images. That can very much affect your results: it will remove you from the image tab, and the ranking of articles that contain those images might be affected too.

  • tiffanyh 3 hours ago

    Does anyone have suggestions on what a proper robots.txt would be?

    How about:

      User-agent: *
      Allow: /
      Sitemap: https://example.com/sitemap.xml

    • akira2501 an hour ago

      The recommendation is to use an empty "Disallow:" rule rather than a catch-all "Allow:" rule.

      Otherwise that is the canonical minimal example.

      • tiffanyh 42 minutes ago

        Like this?

          User-agent: *
          Disallow: 
          Sitemap: https://example.com/sitemap.xml

    • bragr 3 hours ago

      That's a valid robots.txt, but "proper" is entirely dependent on what you want to achieve. If you aren't looking to treat different bots differently, and are looking to allow all of your site to be indexed, then that is exactly what you want.
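For anyone who wants to compare the two variants from this subthread, here is a small sketch using Python's standard `urllib.robotparser` (the `site_maps()` call needs Python 3.8 or later). It only shows how that one parser reads the files; other crawlers may behave differently. The `ExampleBot` agent name and the sample paths are made up.

```python
from urllib.robotparser import RobotFileParser

# The catch-all "Allow" variant from the thread.
ALLOW_ALL = """\
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""

# The empty "Disallow" variant from the thread.
EMPTY_DISALLOW = """\
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
"""


def summarize(robots_txt, paths, agent="ExampleBot"):
    """Return ({path: allowed?}, sitemap list) for one robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = {path: parser.can_fetch(agent, path) for path in paths}
    return allowed, parser.site_maps()


SAMPLE_PATHS = ["/", "/blog/post-1", "/uploads/favicon.ico"]
allow_result = summarize(ALLOW_ALL, SAMPLE_PATHS)
disallow_result = summarize(EMPTY_DISALLOW, SAMPLE_PATHS)
print(allow_result)
print(disallow_result)
```

With this parser, both files allow every sample path and report the same sitemap, so the choice between them is mostly about convention.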