Web scraping is the process of gathering information from the internet, often to extract large amounts of data from websites. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation. The sheer amount of data on the internet makes it a great resource for research or personal use, but it also makes it hard to find the data that is actually useful to you. In this blog post I’ll go over why you might want to use web scraping and how to do it using Python.
There are many reasons why you might choose to use web scraping. When you need to compile a lot of data, for research or personal use, it's often much easier to write a program to do it for you than to spend loads of time compiling the data by hand. This is particularly common when training machine learning models, which often need a large amount of data to learn from.
Web scraping is also often used as a quick way to get a specific piece of data, especially if the website offers no API (application programming interface) that you could use instead.
One challenge of writing a web scraper is the variety and continuous growth of the World Wide Web. Website technologies are constantly changing and no two websites are the same, so each site needs its own treatment if you want to extract the information that’s relevant to you. Another major challenge is durability. Websites are in active development, and once a site's structure changes your web scraper might stop working. You need to keep an eye on such changes so you can fix any issues quickly by updating your web scraper.
Creating our own web scraping program
We will be creating a basic program that scrapes song lyrics off web pages and writes them to a file. It uses two third-party packages:
requests for performing your HTTP requests
BeautifulSoup4 for handling all of your HTML processing
You can install these dependencies with pip:
pip install requests BeautifulSoup4
At the top of our Python file we need to import the relevant modules that we require in our program.
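The exact import lines are not shown in this post, so the following is a minimal sketch that assumes we only need the GET helper and the base exception class from requests, plus the parser class from BeautifulSoup4 (which is imported under the name bs4):

```python
# HTTP client function used to download pages
from requests import get
# Base class for the errors requests raises (connection failures, timeouts, invalid URLs)
from requests.exceptions import RequestException
# HTML parser interface; the BeautifulSoup4 package is imported as bs4
from bs4 import BeautifulSoup
```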
Next we need to create a function which downloads the web pages. The requests package allows us to do all things HTTP in Python. We have created a function here called getContent() which attempts to get the content at the given URL by making an HTTP GET request. If the response is good, it calls the getLyrics() function, which we will see later; otherwise it prints that the song can't be found. If an error occurs during the request, the logError() function is called with the details of the error.
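A function along those lines might look like the sketch below. It is not necessarily the exact code behind this post: the 10 second timeout and the wording of the messages are assumptions, and the three helper functions are short stand-ins so the snippet runs on its own.

```python
from requests import get
from requests.exceptions import RequestException


def isGoodResponse(resp):
    # Stand-in: treat any 200 response as good so this snippet runs on its own.
    return resp.status_code == 200


def getLyrics(html):
    # Stand-in: the real function parses the HTML; here we only report its size.
    print(f"Received {len(html)} bytes of HTML")


def logError(e):
    # Stand-in: print the error to the console.
    print(e)


def getContent(url):
    """Download the page at `url` and hand its HTML to getLyrics()."""
    try:
        # The timeout (an arbitrary 10 seconds) stops the request hanging forever.
        resp = get(url, timeout=10)
        if isGoodResponse(resp):
            getLyrics(resp.content)
        else:
            print("Song could not be found.")
    except RequestException as e:
        # Covers connection errors, timeouts and malformed URLs alike.
        logError(f"Error during request to {url}: {e}")
```

Catching RequestException, the base class of the errors requests raises, means a dropped connection or a mistyped URL ends up in logError() instead of crashing the program.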
If the response is an HTML response then isGoodResponse returns True, otherwise False is returned.
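One way to write this check is to look at the status code and the Content-Type header of the response. This is a sketch under assumptions: the post only says the response must be HTML, so also requiring a 200 status code is an addition here.

```python
def isGoodResponse(resp):
    """Return True if `resp` looks like a successful HTML response."""
    # Default to "" so a response without a Content-Type header does not raise.
    contentType = resp.headers.get("Content-Type", "").lower()
    # e.g. "text/html; charset=utf-8" passes, "application/json" does not.
    return resp.status_code == 200 and "html" in contentType
```

Checking for the substring "html" rather than an exact match keeps the test working when the server appends a charset to the header.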
If an error occurs while running the HTTP GET request the logError function is run and prints the error to the console.
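Since all this function has to do is print, a minimal version (assuming it receives either an exception or a ready-made message) can be a single line:

```python
def logError(e):
    """Print an error (an exception or a message string) to the console."""
    print(e)
```

Keeping it as a separate function means you can later swap the print for a log file without touching the rest of the program.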
Once we know the response is actually an HTML response we can start parsing it. We can do this using the BeautifulSoup4 module which we imported at the start. After parsing, we can find the specific HTML elements that contain the lyrics on the web page. Once we have the lyrics, we call the writeToFile() function and pass the lyrics to it.
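The parsing step could be sketched as follows. The element being searched for, a div with the class "lyrics", is purely hypothetical: every site marks up its pages differently, so inspect the page you are scraping and adjust the selector. Returning the lyrics and the stand-in writeToFile() are also additions so the snippet can be tried on its own.

```python
from bs4 import BeautifulSoup


def writeToFile(lyrics):
    # Stand-in so this snippet runs on its own; the real function saves to lyrics.txt.
    print(lyrics)


def getLyrics(html):
    """Parse `html`, pull out the lyrics text and pass it to writeToFile()."""
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector: change it to match the site you are scraping.
    container = soup.find("div", class_="lyrics")
    if container is None:
        print("Could not find the lyrics on this page.")
        return None
    # Join the text nodes with newlines so lines split by <br> stay separate.
    lyrics = container.get_text(separator="\n", strip=True)
    writeToFile(lyrics)
    return lyrics
```

For example, passing in `'<div class="lyrics">Line one<br>Line two</div>'` yields the two lines separated by a newline.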
This short function writes the lyrics to a file called lyrics.txt which is located in the same folder as this Python script. If the file does not exist it is created; otherwise its existing contents are replaced with the new lyrics.
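A sketch of such a function is shown below. The optional filename parameter is an assumption added to make the function easier to try out; by default it writes to lyrics.txt. Note that a relative filename is resolved against the directory you run the script from, which is the script's own folder if you start it from there.

```python
def writeToFile(lyrics, filename="lyrics.txt"):
    """Write `lyrics` to `filename`, replacing any existing contents."""
    # Mode "w" creates the file if it is missing and empties it otherwise.
    with open(filename, "w", encoding="utf-8") as f:
        f.write(lyrics)
```

Opening the file in a with block makes sure it is closed again even if the write fails.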
Lastly, we will create a main() function which gets the relevant data from the user before passing it on to the getContent() function. At the end of the code we will call the main() function.
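The post does not say exactly what is asked of the user, so this sketch assumes an artist name and a song title. The buildUrl() helper and the lyrics.example.com address are hypothetical placeholders, since the real URL pattern depends on the site you scrape, and getContent() is a stand-in so the snippet runs on its own.

```python
from urllib.parse import quote


def getContent(url):
    # Stand-in so this snippet runs on its own; the real function downloads the page.
    print(f"Would request {url}")


def buildUrl(artist, song):
    """Build a lyrics page URL from an artist and song name (hypothetical URL scheme)."""
    # Lowercase, drop spaces, and percent-encode anything unsafe in a URL path.
    artistSlug = quote(artist.strip().lower().replace(" ", ""))
    songSlug = quote(song.strip().lower().replace(" ", ""))
    return f"https://lyrics.example.com/{artistSlug}/{songSlug}.html"


def main():
    """Ask the user for a song and start the download."""
    artist = input("Artist: ")
    song = input("Song: ")
    getContent(buildUrl(artist, song))
```

To start the program, call `main()` as the last line of the script, ideally inside an `if __name__ == "__main__":` guard so that importing the file from elsewhere does not prompt for input.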
Hopefully by now you have created a working program which you can actually use to get lyrics for songs! This is just the start of what you can use the BeautifulSoup4 and requests modules for.
Author: Robert Nimmo