How to Find All Existing and Archived URLs on a Website

There are many good reasons you might want to find all of the URLs on a website, but your specific goal will determine what you're looking for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get that lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
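
If scraping plugins feel fragile, the Wayback Machine also exposes its index through the public CDX API, which you can query directly. Here's a minimal sketch in Python; the domain is a placeholder, and you can adjust the limit or page through results for larger sites:

```python
# Minimal sketch: pull archived URLs for a domain from the Wayback Machine's CDX API.
# Assumes the `requests` package is installed; swap in your own domain.
import requests

def fetch_archived_urls(domain, limit=10000):
    params = {
        "url": f"{domain}/*",   # match everything under the domain
        "output": "json",
        "fl": "original",       # return only the original URL column
        "collapse": "urlkey",   # deduplicate by normalized URL
        "limit": limit,
    }
    resp = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=60)
    resp.raise_for_status()
    rows = resp.json()
    # First row is the header; the rest are single-column rows.
    return [row[0] for row in rows[1:]]

urls = fetch_archived_urls("example.com")
print(f"Retrieved {len(urls)} archived URLs")
```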

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
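
Once you have a large export in hand, a little pandas reduces it to a clean list of unique target URLs. A minimal sketch, assuming a CSV export with a "Target URL" column (the filename and column name are placeholders; match them to your actual export):

```python
# Minimal sketch: reduce an inbound-links export to unique target URLs.
# "moz_links.csv" and the "Target URL" column are placeholder names.
import pandas as pd

links = pd.read_csv("moz_links.csv")
targets = (
    links["Target URL"]
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)
targets.to_csv("moz_target_urls.csv", index=False, header=["Target URL"])
print(f"{len(targets)} unique target URLs")
```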

Google Look for Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
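
For larger properties, a short script can page through the Search Console API beyond the UI export cap. A minimal sketch using google-api-python-client; the credentials file, property URL, and date range are placeholders, and the property must grant access to the service account:

```python
# Minimal sketch: page the Search Console API to collect every URL
# that received impressions in the chosen date range.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    resp = service.searchanalytics().query(
        siteUrl="https://www.example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = resp.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"Collected {len(pages)} URLs with impressions")
```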

Indexing → Pages report:


This section offers exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Better still, you can apply filters to create specific URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
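
If the UI export becomes limiting, the GA4 Data API exposes the same page data programmatically. A minimal sketch using the google-analytics-data client; the property ID and date range are placeholders, and it assumes application-default credentials are configured:

```python
# Minimal sketch: pull page paths from the GA4 Data API instead of the UI export.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="90daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)

paths = [row.dimension_values[0].value for row in response.rows]
print(f"Collected {len(paths)} page paths")
```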

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process; a simple starting point follows below.
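
Even a few lines of Python can pull the unique paths out of a raw access log. A minimal sketch, assuming the common combined log format; the filename is a placeholder, and you should adjust the regex to your server's log configuration:

```python
# Minimal sketch: extract unique request paths from an access log.
# Matches the request section of a combined-format line, e.g. "GET /path HTTP/1.1".
import re

REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1))

print(f"Found {len(paths)} unique paths")
```
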
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list, as in the sketch below.
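
For the Jupyter route, here's a minimal sketch with pandas. The input filenames and the "url" column are placeholders, and the normalization rules are just a starting point to tailor to your site:

```python
# Minimal sketch: combine URL lists from several exports, normalize, deduplicate.
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

def normalize(url):
    """Lowercase the scheme and host, and drop fragments, so duplicates collapse."""
    parts = urlsplit(str(url).strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))

sources = ["archive_org.csv", "search_console.csv", "ga4.csv", "logs.csv"]
frames = [pd.read_csv(name, usecols=["url"]) for name in sources]

combined = pd.concat(frames, ignore_index=True)
combined["url"] = combined["url"].map(normalize)
combined = combined.drop_duplicates(subset="url").sort_values("url")
combined.to_csv("all_urls.csv", index=False)
print(f"{len(combined)} unique URLs written to all_urls.csv")
```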

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
