Wednesday, December 28, 2016

Extract all internal and external links from a URL

How does it work?
  • This is a Python script that takes one or more complete URLs as command-line arguments.
  • It fetches each URL and parses the returned HTML with BeautifulSoup.
  • All anchor hyper-references (the href values of <a> tags) are extracted from the parsed page and then sorted into two bins: internal and external.
  • If a link contains http or https and the queried URL is part of the link, it is counted as an internal link; otherwise it is counted as an external link.
  • All links starting with / or // are treated as internal links.
  • Other link types, such as javascript, mail (mailto) and telephone (tel) links, are ignored.
  • All in-page jumps starting with # are ignored.
  • The script does not deduplicate links, so a link that appears more than once on the page is counted each time.
  • Different top-level domains and subdomains are treated as different sites, so links to them count as external. For example, if you query www.google.com, links to news.google.com and www.google.co.in are both treated as different domains and counted as external links.
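
The rules above can be sketched roughly as follows. This is not the original script, only a minimal illustration of the sorting logic. The names classify_link, count_links, extract_hrefs and IGNORED_PREFIXES are hypothetical. Two things are assumptions on my part: "the URL is part of the link" is interpreted as an exact host match, and links that fit none of the listed rules (for example a relative link such as page.html) are simply skipped.

```python
from collections import Counter
from urllib.parse import urlparse

# Link types the post says are ignored: page jumps, javascript, mail, telephone.
IGNORED_PREFIXES = ("#", "javascript:", "mailto:", "tel:")


def classify_link(href, base_url):
    """Return "internal", "external", or None (ignored) for a single href."""
    href = href.strip()
    if not href or href.lower().startswith(IGNORED_PREFIXES):
        return None
    # Rule from the post: links starting with / or // count as internal.
    if href.startswith("/"):
        return "internal"
    if href.lower().startswith(("http://", "https://")):
        # Assumption: "the URL is part of the link" is read as "same host".
        base_host = urlparse(base_url).netloc.lower()
        link_host = urlparse(href).netloc.lower()
        return "internal" if link_host == base_host else "external"
    # Assumption: anything not covered by the rules above is skipped.
    return None


def count_links(hrefs, base_url):
    """Count internal and external links; duplicates are counted every time."""
    counts = Counter()
    for href in hrefs:
        kind = classify_link(href, base_url)
        if kind is not None:
            counts[kind] += 1
    return counts


def extract_hrefs(html):
    """Collect the href of every <a> tag using BeautifulSoup."""
    # Imported here so the classifier above works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]
```

To try it on a page, pass the downloaded HTML to extract_hrefs and feed the result, together with the URL you queried, to count_links. Because the comparison is on the exact host, querying http://www.google.com puts news.google.com and www.google.co.in in the external bin, as described in the last bullet.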



Screenshots: