WebScraper

Run the shell script ./install.sh once before first use, then run the scraper with python solution.py

Please ensure that you have root access.

The code was developed on Ubuntu 12.04 with Python 2.7 and the Firefox browser; minor changes will be needed to run it on other systems.
Please ensure that you have a steady internet connection. You may have to run the code several times if a connection error occurs.

Method:

Extracting from a given webpage

The TRAI website had many hidden URLs, so I used urllib to open each page and Beautiful Soup to find the links in it. I then ran into pages driven by JavaScript: the URL stayed the same while the content changed, which posed a new problem.
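The link-finding step can be sketched as follows. This is a minimal illustration rather than the code in solution.py: it assumes Python 3 and, to stay free of third-party packages, uses the standard library's html.parser where the project used Beautiful Soup. The names LinkCollector, extract_links and pdf_links, and the example.com URL, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links such as "docs/report.pdf".
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all anchor targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def pdf_links(links):
    """Keep only links that end in .pdf, ignoring case and query strings."""
    return [link for link in links if link.lower().split("?")[0].endswith(".pdf")]


# Hypothetical sample page; a live page could be read with urllib.request.urlopen.
sample = '<a href="/a/report.PDF">r</a> <a href="page.html">p</a>'
found = extract_links(sample, "http://example.com/dir/index.html")
print(pdf_links(found))
```

Links that are not PDFs can be fed back into the same function to crawl one level deeper.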

Parsing JavaScript

I used Selenium to handle the JavaScript-driven pages, then crawled the newly exposed links for PDFs as well. Selenium opens a browser to run the scripts, so I found a way to hide the browser windows it opened and to close them once the task was done.
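One way to hide the browser and make sure it is closed is sketched below. It assumes Python 3 with a recent Selenium release and uses Firefox's headless mode, which may not be the hiding method the original script used. The names make_headless_firefox and render_page are hypothetical.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (assumes Selenium 4 is installed)."""
    # Imported lazily so the rest of the sketch loads without Selenium.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)


def render_page(url, driver_factory=make_headless_firefox):
    """Load url in a browser and return the page source the browser reports."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned HTML can then be searched for PDF links in the same way as a page fetched with urllib. Content that scripts load after the initial page load may need an explicit wait before page_source is read.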

Downloading files

I saved the PDF URLs in a list and ran a for loop to download the PDFs one by one using urlretrieve. New directories are created according to the details in each URL.
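The download loop might look roughly like this. It is a sketch under assumptions: Python 3, and a directory layout that mirrors each URL's host and path, which is only one reading of "details in each URL". The names local_path_for and download_all, and the downloads root folder, are hypothetical.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve


def local_path_for(url, root="downloads"):
    """Map a URL to a path under root that mirrors the URL's host and path."""
    parsed = urlparse(url)
    # Drop empty and dot segments so the path cannot escape root.
    parts = [p for p in parsed.path.split("/") if p and p not in (".", "..")]
    if not parts:
        parts = ["index.pdf"]
    return os.path.join(root, parsed.netloc, *parts)


def download_all(urls, root="downloads", retrieve=urlretrieve):
    """Download each URL in turn, creating directories as needed."""
    saved = []
    for url in urls:
        target = local_path_for(url, root)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        retrieve(url, target)
        saved.append(target)
    return saved
```

Calling download_all with the list of PDF URLs gathered by the crawler saves each file under the root folder. The retrieve parameter exists only so the loop can be exercised without network access.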

References:

Stack Overflow
