How to Find All Existing and Archived URLs on a Website

There are several reasons you might need to find every URL on a website, and your exact goal will determine what you’re looking for. For instance, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In every scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.

In this post, I’ll walk you through some tools to build your URL list before deduplicating the data with a spreadsheet or Jupyter Notebook, depending on your website’s size.

Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get that lucky.

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io, or query the Wayback Machine directly, as sketched below. However, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
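If you need more than the web interface allows, the Wayback Machine also exposes a CDX API you can query directly. Here’s a minimal Python sketch using the requests library; the domain is a placeholder, and you may want to add filters (e.g., by status code) for cleaner output:

```python
# Minimal sketch: list archived URLs for a domain via the Wayback CDX API.
# "example.com" is a placeholder for your own site.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # prefix match for everything under the domain
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate equivalent captures
    },
    timeout=60,
)
rows = resp.json()
# The first row is the header ("original"); the rest are URLs.
urls = [row[0] for row in rows[1:]]
print(f"Retrieved {len(urls)} archived URLs")
```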

Moz Pro
Although you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re managing a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
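If you’re working from a Moz Pro CSV export, a short script can pull the unique target URLs out for you. This is a rough sketch assuming pandas and a “Target URL” column; the file name and column headers are assumptions, so check your own export for the exact names:

```python
# Rough sketch: extract unique target URLs from a Moz Pro links CSV export.
# File name and "Target URL" column header are assumptions.
import pandas as pd

df = pd.read_csv("moz_inbound_links.csv")
urls = df["Target URL"].dropna().unique()
pd.Series(sorted(urls)).to_csv("moz_target_urls.csv", index=False, header=["url"])
print(f"{len(urls)} unique target URLs")
```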

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
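For reference, here’s one way the API pull might look in Python with google-api-python-client. The service-account key file, property URL, and date range are all placeholders; this is a sketch, not a definitive implementation:

```python
# Sketch: pull pages with search impressions via the Search Console API.
# Assumes a service-account key file with access to the property.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service_account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="https://example.com/",  # placeholder property
    body={
        "startDate": "2024-01-01",
        "endDate": "2024-03-31",
        "dimensions": ["page"],
        "rowLimit": 25000,  # paginate with "startRow" for more rows
    },
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(f"{len(pages)} pages with impressions")
```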

Indexing → Pages report:


This section provides exports filtered by issue type, although these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
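If you’d rather pull page paths programmatically than export from the UI, the GA4 Data API can produce the same list. Here’s a minimal sketch using the google-analytics-data client library; the property ID is a placeholder, and authentication relies on application-default credentials:

```python
# Minimal sketch: list GA4 page paths via the Data API.
# The property ID is a placeholder.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Metric,
    RunReportRequest,
)

client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```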

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, and many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but many tools are available to simplify the process (a simple parsing sketch follows this list).
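As a starting point, even a few lines of Python can pull the requested paths out of a standard access log. This sketch assumes the common/combined log format and a local file named access.log; production logs may need a sturdier parser:

```python
# Simple sketch: collect unique requested URL paths from an access log
# in the common/combined format. The file name is a placeholder.
import re

pattern = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')
paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            paths.add(match.group(1))
print(f"{len(paths)} unique paths requested")
```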
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
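If you go the Jupyter Notebook route, here’s a small sketch of the normalize-then-deduplicate step using only the standard library. The input and output file names, and the normalization rules, are assumptions to adapt to your site’s URL conventions:

```python
# Sketch: normalize and deduplicate a combined URL list.
# File names and normalization rules are assumptions.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.strip())
    # Lowercase scheme/host, drop fragments, strip trailing slashes on paths.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

with open("all_urls.txt", encoding="utf-8") as f:
    urls = {normalize(line) for line in f if line.strip()}

with open("deduplicated_urls.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(urls)))
```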

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
