How to download dynamically loaded content using python
Introduction
When you surf online, you occasionally visit websites that show content like videos or audio files which are dynamically loaded. This is basically done using AJAX calls or sessions where the URLs for these files are generated in some way for which you can not save them by normal means. A scenario would be like, for example, you visited a web page and a video is loaded through a video player like jwplayer after 1 or 2 secs. You want to save this but unfortunately you couldn’t do so by right-clicking on it and saving it as the player doesn’t show that option. Even if you use command line tool like wget or youtube-dl, it might be possible but for some reason it just doesn’t work. Another problem is that, there are still some javascript functions that need to be executed and until then, the video url does not get generated in the site. Now the question is, how would you download such files to your computer?
Well, there are different ways to download them. One way is to use a plugin or extension for the browser like Grab Any Media or FlashGot which can handle such requests and might allow you to download it. Now the thing is, lets say you want to download an entire set of videos that are loaded using different AJAX calls. In this case, the plugins or extensions, might work but it takes a long time to manually download each file. A better method would be to write a small script that automates this process. This tutorial aims to teach you guys on how to use the selenium web driver and do simple tasks like downloading dynamically loaded content in a website using python.
Prerequisites
For this tutorial, you need to have atleast some knowledge on how to program in python. If you don’t know anything about it, then I would suggest you to check out this site : http://www.tutorialspoint.com/python/. It is a great starting point to learn a new language and you can quickly learn the basics.
So, before we start, I would like to give an small introduction to the modules that I am going to use in my python script. The system that I’m using is a Ubuntu Studio 14.04. In order to install the modules, you can use python-pip and also you might need to have administrative privileges. Here are the modules as follows :
- Selenium Web Driver : The selenium framework is a suite of tools that can be used to test web applications and also automate the web browser tasks. By using it’s provided API, you can do simple tasks like automating administration work for a website or some website-related maintenance, by sending commands to the browser. It is supported in various programming languages like Python, Java, Javascript, PHP, C#, Perl, Ruby. To install this framework for python, just type the following command in the terminal :
pip install selenium
- BeautifulSoup : The bs4 is a HTML/XML parser that does a great job at screen-scraping elements and getting information like the tag names, attributes, and values. It also has set of methods that allow you do things like, to match certain instances of a text and retrieve all the elements that contain it. You can install this module like this :
pip install beautifulsoup4
- Python-Wget : This module is a python port for the wget command-line program. Its easy to setup and you can quickly download videos or files to your system by using its API. You can also follow the “traditional” method of downloading files, like using the standard
urllib
module or by doing a subprocess call to the wget command-line program, but for this tutorial, I will be using this module to get the job done. Here is how you install it :
pip install wget
- PhantomJS (Optional) : The PhantomJS is a headless browser (doesn’t have a front-end GUI and everything works at the backend) that is used for web page interaction. It is similar to the selenium web driver but the difference is that, it is headless. For this tutorial, I will be using the basic Firefox web driver via selenium, but you can test this out if you do not want a browser to popup every time the script runs. You can download the phantomjs executable from their homepage, but I advise you to use a downgraded version of this as the latest one might not be compatible with selenium module.
The Concept
The idea is basically :
- To get the web page using the selenium web driver.
- Parse and extract the video or audio urls from the html page using BeautifulSoup.
- Download the files to the system using wget.
Step 1
The first step we need to do is import the necessary modules in the python script or shell, and this can be done as shown below :
# The standard library modules
import os
import sys
# The wget module
import wget
# The BeautifulSoup module
from bs4 import BeautifulSoup
# The selenium module
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
From the selenium module, we import the following things :
- webdriver : This submodule has the functionality to initialize the various browsers like Chrome, Firefox, IE, etc.
- Keys : This allows us to send key presses or inputs to the web driver.
- WebDriverWait : The
WebDriverWait
is similar to the sleep function of thetime
module. This function tells the webdriver to wait for "n" seconds. - expected_conditions : There are common conditions that the web driver takes into account like, for example, a condition would be for an element to appears in the browser or when the title of the page is somename. Here is a list of all the available conditions 1 :
- title_is
- title_contains
- presence_of_element_located
- visibility_of_element_located
- visibility_of
- presence_of_all_elements_located
- text_to_be_present_in_element
- text_to_be_present_in_element_value
- frame_to_be_available_and_switch_to_it
- invisibility_of_element_located
- element_to_be_clickable - it is Displayed and Enabled.
- staleness_of
- element_to_be_selected
- element_located_to_be_selected
- element_selection_state_to_be
- element_located_selection_state_to_be
- alert_is_present
- By : The
By
class allows us to select HTML elements in the web driver by class name, id, xpath, name, hyperlinks, etc.
Step 2
Now, according to the concept, for a single video url that is loaded using an AJAX call, we need to get the web page using the selenium webdriver. This is done as follows :
driver = webdriver.Firefox() # if you want to use chrome, replace Firefox() with Chrome()
driver.get("http://www.example.com") # load the web page
When you execute this in the python shell or via the script (after you import the modules), you will observe that, a firefox browser will popup and a page will be loaded into it. If you want to use the PhantomJS and stop the browser from popping up, then just replace the webdriver.Firefox()
with webdriver.PhantomJS(service_args=['--ignore-ssl-errors=true'])
.
Step 3
Here is the tricky part, what you need to do is extract the video urls from the web page. As every website is designed differently, you don’t have an accurate solution. You would need to manually check for a pattern or the video element that is dynamically loaded. This can be done by looking at the browser’s developer console. From the previously mentioned scenario, lets say the video is dynamically loaded using a AJAX call after 1 sec you visit the website. Then you would need to wait till the video is loaded and then get the element. So, for that you can write the script in this manner :
WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "the-element-id"))) # waits till the element with the specific id appears
src = driver.page_source # gets the html source of the page
When you execute these two lines in the python shell, it will tell the browser to wait for 50 seconds by default until the element with the specific id appears or is visible on the screen, and then get the html source. Now if the element doesn’t have an ID, there are other ways you can get the specific element/tag. Lets say the element has a certain class, then you can just replace the By.ID
with By.CLASS_NAME
. Here is the entire list of attributes for the By
class object 2 :
- ID
- XPATH
- LINK_TEXT
- PARTIAL_LINK_TEXT
- NAME
- TAG_NAME
- CLASS_NAME
- CSS_SELECTOR
Step 4
Once you get the HTML source, you would need to parse it and extract the video tag from it. This is done as shown below :
parser = BeautifulSoup(src,"lxml") # initialize the parser and parse the source "src"
list_of_attributes = {"class" : "some-class", "name" : "some-name"} # A list of attributes that you want to check in a tag
tag = parser.findAll('video',attrs=list_of_attributes) # Get the video tag from the source
The list_of_attributes
is a python dictionary, with (key,value) pairs which specify the tags attributes. The parser.findAll()
searches the entire HTML source and gets the video tags with the specific attributes. This generates a multi-dimensional array and is stored in the tag
variable.
Step 5
The next step is to get the url from the video tag and finally download it using wget. We can do this by writing the script in this manner :
n = 0 # Specify the index of video element in the web page
url = tag[n]['src'] # get the src attribute of the video
wget.download(url,out="path/to/output/file") # download the video
Depending on the number of videos loaded in the web page, you can specify which video you want to download. This can by done changing the value of n
.
Step 6
Finally, once the job is done, we close the driver :
driver.close() # closes the driver
The script
Now, when you put all the pieces together, and with some additional functionality to login to a website, you will get something like this :
# The standard library modules
import os
import sys
# The wget module
import wget
# The BeautifulSoup module
from bs4 import BeautifulSoup
# The selenium module
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox() # if you want to use chrome, replace Firefox() with Chrome()
driver.get("http://www.example.com") # load the web page
# for websites that need you to login to access the information
elem = driver.find_element_by_id("email") # Find the email input field of the login form
elem.send_keys("user@example.com") # Send the users email
elem = driver.find_element_by_id("pwd") # Find the password field of the login form
elem.send_keys("userpwd") # send the users password
elem.send_keys(Keys.RETURN) # press the enter key
driver.get("http://www.example.com/path/of/video/page.html") # load the page that has the video
WebDriverWait(driver, 50).until(EC.visibility_of_element_located((By.ID, "the-element-id"))) # waits till the element with the specific id appears
src = driver.page_source # gets the html source of the page
parser = BeautifulSoup(src,"lxml") # initialize the parser and parse the source "src"
list_of_attributes = {"class" : "some-class", "name" : "some-name"} # A list of attributes that you want to check in a tag
tag = parser.findAll('video',attrs=list_of_attributes) # Get the video tag from the source
n = 0 # Specify the index of video element in the web page
url = tag[n]['src'] # get the src attribute of the video
wget.download(url,out="path/to/output/file") # download the video
driver.close() # closes the driver
Conclusion
What we did in this tutorial is, to create a small script that automates the process of downloading a file which is dynamically loaded. The above script works for a single url. If you want to download multiple files, then you would need to manually grab the tags and dynamic content information of each website and store them in json or xml file. Then you would need to read that file and pass it through a for
loop. I created another small script that does this job. Its not full proof but is a good starting point for you guys to get an idea on how to do it. It also accepts command line arguments that allow you to download either a single video file or by taking in a file that contains all the video urls. You can get the script here : https://gitlab.com/snippets/8921.
Please keep in mind that, you will encounter some websites which are so secure, that even though what you do, you just cannot download that video or file. This is because, they are designed in such a way that, the urls for the files are generated with unique id and is embedded into the site.
I hope you find this tutorial useful and learned something new.
Resources
- The selenium web driver documentation : http://selenium-python.readthedocs.org/index.html
- The BeautifulSoup documentation : http://www.crummy.com/software/BeautifulSoup/bs4/doc/
References
I feedback.
Let me know what you think of this article on twitter @dvenkatsagar or leave a comment below!