urls being included in sitemap and indexed by google issue

#882090
  • Resolved Michael Manley
    Rank Math free

    I have a lot of URLs that shouldn't be in there. I put them in robots.txt, but the issue is still there.

    Below is my robots.txt in Rank Math:
    User-agent: SiteCheckerBotCrawler
    Allow: /

    User-agent: *

    # Disallowed paths
    Disallow: /feed/
    Disallow: /wp-admin/
    Disallow: /images/
    Disallow: /wp-content/plugins/
    Disallow: /readme.html
    Disallow: /refer/
    Disallow: /thank-you/
    Disallow: */?author-reviews-page*
    Disallow: */product*
    Disallow: /?__hstc*
    Disallow: /?__hsfp*
    Disallow: *?boundingbox=*
    Disallow: */members/
    Disallow: */activity/
    Disallow: */reviews/
    Disallow: */?option=getmosociallogin&wp_nonce=*
    Disallow: */?option=oauthredirect&wp_nonce=*
    Disallow: /listing-feature/*
    Disallow: /author*
    Disallow: /tasks*
    Disallow: /?boundingbox*
    Disallow: /s?category*
    Disallow: /?category*
    Disallow: /cart*
    Disallow: /listings/?tax-listing*
    Disallow: /en-au/*
    Disallow: /category*

    # WP Import Export Rule
    Disallow: /wp-content/uploads/wp-import-export-lite/

    # Prevent URLs containing common SQL injection characters
    Disallow: /*?
    Disallow: /*&
    Disallow: /*=*
    Disallow: /*’*
    Disallow: /*”*
    Disallow: /*/*
    Disallow: /*–*
    Disallow: /*#*
    Disallow: /*;*
    Disallow: /*%22*
    Disallow: /*%27*
    Disallow: /*%2F*
    Disallow: /*%5C*
    Disallow: /*%3D*

    # Prevent SQL wildcard characters
    Disallow: /*%25*

    # Prevent URLs containing the semicolon
    Disallow: /*%3B*

    # Prevent URLs with potential SQL comments or queries
    Disallow: /*%20or%20*
    Disallow: /*%20and%20*
    Disallow: /*select%20*
    Disallow: /*union%20*
    Disallow: /*from%20*
    Disallow: /*information_schema*

    # Prevent URLs with encoded apostrophes
    Disallow: /*%60*

Viewing 5 replies - 1 through 5 (of 5 total)
  • Hello,

    We apologize for the issue you are facing with URLs being included in your sitemap and indexed by Google despite your robots.txt settings. It seems like you have a comprehensive list of disallowed paths, but there are a few things to consider:

    1. Robots.txt vs. Sitemap: The robots.txt file is used to instruct search engines which pages they should not crawl, but it does not prevent URLs from being included in the sitemap. If URLs are still appearing in your sitemap, you may need to adjust your sitemap settings in Rank Math.

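    2. Sitemap Settings: Ensure that the URLs you want to exclude are set to "noindex" in your Rank Math settings. You can do this by going to the specific post or page settings and selecting the option to noindex them. This will help prevent them from being indexed by search engines. Keep in mind that a crawler can only read a noindex directive on a page it is allowed to crawl, so a URL that is also disallowed in robots.txt may remain indexed.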

    3. Clearing Cache: If you’ve made changes to your robots.txt or sitemap settings, make sure to clear your website cache and any caching plugins you might be using. This ensures that the latest version of your sitemap is being served.

    4. Testing Robots.txt: You can use the Robots Testing Tool to check if your robots.txt file is correctly blocking the URLs you want to exclude. Enter your URL and select the appropriate user agent to see if it’s being blocked as intended.

    5. Manual Removal: If certain URLs are still indexed, you can manually request their removal through Google Search Console. This can expedite the process of getting unwanted URLs de-indexed.
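
    As a rough illustration of point 4, the sketch below checks a single robots.txt path pattern against a URL path in the way major crawlers are generally documented to match: `*` stands for any run of characters and a trailing `$` anchors the end of the path. The function name `robots_rule_matches` is hypothetical, and the sketch ignores Allow/Disallow precedence, user-agent groups, and empty patterns, so treat it as a quick local check rather than a replacement for a real testing tool.

```php
<?php
// Hypothetical helper, not part of Rank Math or WordPress.
// Returns true when a non-empty robots.txt path pattern matches the URL path.
function robots_rule_matches( string $pattern, string $path ): bool {
    // A trailing "$" anchors the pattern to the end of the path.
    $anchored = '$' === substr( $pattern, -1 );
    if ( $anchored ) {
        $pattern = substr( $pattern, 0, -1 );
    }

    // Escape the literal pieces and join them with ".*" where "*" appeared.
    $parts = array_map(
        function ( $part ) {
            return preg_quote( $part, '#' );
        },
        explode( '*', $pattern )
    );

    // Patterns always match from the start of the path.
    $regex = '#^' . implode( '.*', $parts ) . ( $anchored ? '$' : '' ) . '#';

    return 1 === preg_match( $regex, $path );
}

// Patterns below are taken from the robots.txt in this thread; the paths are made up.
var_dump( robots_rule_matches( '/cart*', '/cart/checkout' ) );    // bool(true)
var_dump( robots_rule_matches( '/cart*', '/blog/' ) );            // bool(false)
var_dump( robots_rule_matches( '/*/*', '/listings/sydney/' ) );   // bool(true)
```

    Under this simplified matching, a broad pattern such as `/*/*` matches any path that contains a second slash, so it is worth testing a few ordinary URLs as well as the unwanted ones.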

    If you continue to experience issues, please reply to this ticket, and a human agent from the Rank Math Support team will assist you further. Thank you for raising this concern!

    Sources:
    Common robots.txt Issues & How to Fix Them » Rank Math
    Rank Math Sitemap Issues and Their Fixes » Rank Math

    OK, there are over 200 pages in the sitemap that shouldn't be there due to SQL injections. It will take too long to remove them manually. How can I filter them out? I have disallowed them in robots.txt, which I think Google is picking up, but I want to remove them from the sitemap as well.

    Hello,

    You should edit the code you have used to create the custom sitemap to remove those entries you have blocked in your robots.txt.

    However, you can try to use and customize this filter on your site to remove those from your sitemap:

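    // Remove specific URLs from the Rank Math sitemap.
    add_filter( 'rank_math/sitemap/entry', function ( $url, $type, $object ) {

        // Full URLs to leave out of the sitemap; replace with your own.
        $urls_to_skip = array(
            'https://www.yoursite.com/information-schema',
        );

        // Returning false drops the entry from the sitemap.
        if ( isset( $url['loc'] ) && in_array( $url['loc'], $urls_to_skip, true ) ) {
            return false;
        }

        return $url;
    }, 10, 3 );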
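
    With a couple of hundred URLs, listing each one may be impractical. A possible variation on the filter above, sketched under assumptions, skips any entry whose URL contains one of a few fragments instead of matching whole URLs. The helper name `my_sitemap_url_is_blocked` and the fragment list are hypothetical; the fragments are borrowed from the robots.txt rules in this thread and may not match how the unwanted URLs actually appear in your sitemap.

```php
<?php
// Hypothetical helper (not part of Rank Math): true when the URL contains
// any of the given fragments, compared case-insensitively.
function my_sitemap_url_is_blocked( string $loc, array $fragments ): bool {
    foreach ( $fragments as $fragment ) {
        if ( false !== stripos( $loc, $fragment ) ) {
            return true;
        }
    }
    return false;
}

// Register the filter only inside WordPress, so the helper can also run on its own.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'rank_math/sitemap/entry', function ( $url, $type, $object ) {
        // Assumed example fragments; adjust them to the URLs you see in the sitemap.
        $fragments = array( 'information_schema', 'select%20', 'union%20', '%27' );

        if ( isset( $url['loc'] ) && my_sitemap_url_is_blocked( $url['loc'], $fragments ) ) {
            return false; // Drop the entry, as the exact-match filter above does.
        }

        return $url;
    }, 10, 3 );
}
```

    Substring matching is broader than the exact list, so check the regenerated sitemap to confirm that no legitimate URLs were dropped. When pasting into an existing PHP file, leave out the opening `<?php` tag.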

    You may refer to this guide on how to add filters to your website:
    https://rankmath.com/kb/wordpress-hooks-actions-filters/

    Looking forward to helping you.

    Thank you!

    Hello,

    We are super happy that this resolved your issue. If you have any other questions in the future, know that we are here to help you.

    If you don’t mind me asking, could you please leave us a review (if you haven’t already) on https://wordpress.org/support/plugin/seo-by-rank-math/reviews/#new-post about your overall experience with Rank Math? We appreciate your time and patience.

    If you do have another question in the future, please feel free to create a new forum topic, and it will be our pleasure to assist you again.

    Thank you.


The ticket ‘urls being included in sitemap and indexed by google issue’ is closed to new replies.