Limiting site crawls

#1120326
  • Hi, I want to adjust my robots.txt file so that Google-Extended and Bingbot can crawl everything, but for every other user agent I want to discourage scraping of images. Is the following the correct way to do this?

    User-agent: *
    Disallow: /wp-content/uploads

    User-agent: Bingbot
    Allow: /wp-content/uploads

    User-agent: Google-Extended
    Allow: /wp-content/uploads

    I am assuming that the * wildcard covers all other user agents and that I won’t have to methodically list every possible variant.

    Also, is it feasible to take it back a directory level to /wp-content?

    I am using the LLMS.txt option, and while I want to give every client’s site a good start, many of them are artists who are worried about copyright infringement of their creative work.

    Of course this is a polite approach and scraping can happen anyway, but it’s a start.

  • Hello,

    Thank you for your question regarding the adjustments to your robots.txt file. Your understanding of how the user-agent rules work is correct. The * wildcard does indeed cover all other user agents, so you won’t need to list every possible variant.

    Your proposed rules would look like this:

    User-agent: *
    Disallow: /wp-content/uploads
    
    User-agent: Bingbot
    Allow: /wp-content/uploads
    
    User-agent: Google-Extended
    Allow: /wp-content/uploads

    This setup disallows all other user agents from accessing the /wp-content/uploads directory while allowing Bingbot and Google-Extended to crawl it. Because a crawler follows the most specific user-agent group that matches it, Bingbot and Google-Extended obey their own Allow groups and ignore the * group entirely.

    Regarding your question about taking it back a directory level to /wp-content, you can certainly do that. If you want to disallow access to the entire /wp-content directory for all user agents except Bingbot and Google-Extended, you would adjust your rules as follows:

    User-agent: *
    Disallow: /wp-content/
    
    User-agent: Bingbot
    Allow: /wp-content/
    
    User-agent: Google-Extended
    Allow: /wp-content/

    This will prevent all other user agents from crawling anything in the /wp-content directory while still allowing Bingbot and Google-Extended access.
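
    If you would like to sanity-check the behaviour before publishing the file, here is a minimal sketch using Python’s built-in urllib.robotparser module. The rules are the ones above; the “SomeOtherBot” user agent and the example.com URLs are only placeholders:

    # Quick check of the proposed rules with Python's standard-library robots.txt parser.
    from urllib.robotparser import RobotFileParser

    rules = [
        "User-agent: *",
        "Disallow: /wp-content/",
        "",
        "User-agent: Bingbot",
        "Allow: /wp-content/",
        "",
        "User-agent: Google-Extended",
        "Allow: /wp-content/",
    ]

    rp = RobotFileParser()
    rp.parse(rules)

    # Bingbot and Google-Extended match their own groups and are allowed in.
    print(rp.can_fetch("Bingbot", "https://example.com/wp-content/uploads/photo.jpg"))          # True
    print(rp.can_fetch("Google-Extended", "https://example.com/wp-content/uploads/photo.jpg"))  # True

    # Every other agent falls back to the * group and is blocked from /wp-content/.
    print(rp.can_fetch("SomeOtherBot", "https://example.com/wp-content/uploads/photo.jpg"))     # False
    print(rp.can_fetch("SomeOtherBot", "https://example.com/sample-page/"))                     # True

    Note that Python’s parser is a simplified one (it applies the first matching rule in a group rather than the most specific), but for rules this simple the outcome matches what the major crawlers would do.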

    Keep in mind that while these rules can discourage scraping, they are not foolproof, as some scrapers may ignore the robots.txt directives.

    If you have any further questions or need assistance with anything else, feel free to ask!

    Sources:
    What Are the Best Robots.txt File Rules for Multisite? » Rank Math
    Using Rank Math’s Robots.txt Generator » Rank Math

    This is an automated reply from Rank Math Support Bot. Your ticket has been assigned to a human agent and we will get back to you shortly.

  • Hello,

    You seem to have marked this ticket as resolved.

    We’re just following up to see if this is a concern we can assist you with. Or, if your issue has been resolved, please let us know so we can mark this ticket as closed.

    Thank you for choosing Rank Math.

  • Thanks, Adetayo,

    much appreciated. I’ll be doing more research into this. I need to give the best advice to clients on where to block and allow AI, depending on the content of their sites.

    For example, from SiteGround’s documentation: “We block crawlers intended for AI model training to protect our clients’ website data and intellectual property and save resources.” So there are some nuances.
