

Retrieving Webpages Through Python Programming
Modeling | Tools & Languages | Web Scraping
Posted by ODSC Community, July 22, 2020

This article discusses retrieving web pages through Python programming. The internet, and the World Wide Web (WWW) in particular, is probably the most prominent source of information today, and most of that information is retrievable through HTTP. HTTP was originally invented to share pages of hypertext (hence the name Hypertext Transfer Protocol), which gave rise to the WWW.
This process occurs every time we request a web page through our devices. The exciting part is that we can perform these operations programmatically to automate the retrieval and processing of information. In this article, we will learn how to leverage Python to fetch pages over HTTP. Python has an HTTP client in its standard library, and the excellent requests module makes obtaining web pages even more convenient.
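As a taste of what is to come, here is a minimal sketch of downloading a page with only the standard library (the httpbin URL in the comment is just an illustration; any reachable URL works):

```python
import urllib.request

def fetch(url):
    """Download a URL and return its body decoded as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')

# For example, assuming network access is available:
# body = fetch('https://httpbin.org/get')
```

With requests, the equivalent is simply `requests.get(url).text`, which is why the rest of this recipe uses it.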
This article is an excerpt from the book Python Automation Cookbook, Second Edition by Jaime Buelta, a comprehensive and updated edition that enables you to develop a sharp understanding of the fundamentals required to automate business processes through real-world tasks, such as developing your first web scraping application, analyzing information to generate spreadsheet reports with graphs, and communicating with automatically generated emails.
[Related article: Web Scraping News Articles in Python]
Interacting with forms
A common element present in web pages is forms. Forms are a way of sending values to a web page, for example, to create a new comment on a blog post, or to submit a purchase.
Browsers present forms so you can input values and send them in a single action after pressing the submit or equivalent button. We’ll see how to create this action programmatically in this recipe.
Getting ready
We'll work against the test server https://httpbin.org/forms/post, which lets us submit a test form and returns the submitted information.
The following is an example form to order a pizza:
[Figure 1: The pizza order form]
You can fill the form in manually and see it return the submitted information in JSON format, including extra details such as the browser being used.
[Figure 2: The frontend of the generated web form]
[Figure 3: Returned JSON content]
We need to analyze the HTML of the form to see what data it accepts. Inspecting the source code, check the names of the inputs: custname, custtel, custemail, size (a radio option), topping (a multi-selection checkbox), delivery (a time field), and comments (a textarea).
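Since the page's HTML is not reproduced here, the field-discovery step can be sketched offline against a simplified stand-in form; the markup below only approximates httpbin's real page, and this version uses just the standard library's html.parser:

```python
from html.parser import HTMLParser

# A simplified stand-in for the real httpbin form markup.
FORM_HTML = """
<form method="post" action="/post">
  <input name="custname"><input type="tel" name="custtel">
  <input type="email" name="custemail">
  <input type="radio" name="size" value="small">
  <input type="checkbox" name="topping" value="bacon">
  <input type="time" name="delivery">
  <textarea name="comments"></textarea>
</form>
"""

class FieldCollector(HTMLParser):
    """Collect the name attributes of input and textarea tags."""
    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag in ('input', 'textarea'):
            name = dict(attrs).get('name')
            if name:
                self.names.add(name)

collector = FieldCollector()
collector.feed(FORM_HTML)
print(sorted(collector.names))
# → ['comments', 'custemail', 'custname', 'custtel', 'delivery', 'size', 'topping']
```

The recipe below does the same discovery more concisely with BeautifulSoup against the live page.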
How to do it…
1. Import the requests, BeautifulSoup, and re modules:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> import re
2. Retrieve the form page, parse it, and print the input fields. Check that the posting URL is /post (not /forms/post):

>>> response = requests.get('https://httpbin.org/forms/post')
>>> page = BeautifulSoup(response.text, 'html.parser')
>>> form = page.find('form')
>>> {field.get('name') for field in form.find_all(re.compile('input|textarea'))}
{'delivery', 'topping', 'size', 'custemail', 'comments', 'custtel', 'custname'}
3. Note that textarea is a valid input element defined by the HTML standard. Prepare the data to be posted as a dictionary, checking that the values match those defined in the form:

>>> data = {'custname': "Sean O'Connell", 'custtel': '123-456-789',
...         'custemail': 'sean@oconnell.ie', 'size': 'small',
...         'topping': ['bacon', 'onion'], 'delivery': '20:30',
...         'comments': ''}
4. Post the values and check that the response is the same as the one returned in the browser:

>>> response = requests.post('https://httpbin.org/post', data)
>>> response
<Response [200]>
>>> response.json()
{'args': {}, 'data': '', 'files': {},
 'form': {'comments': '', 'custemail': 'sean@oconnell.ie',
          'custname': "Sean O'Connell", 'custtel': '123-456-789',
          'delivery': '20:30', 'size': 'small',
          'topping': ['bacon', 'onion']},
 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
             'Connection': 'close', 'Content-Length': '140',
             'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'},
 'json': None, 'origin': '89.100.17.159',
 'url': 'https://httpbin.org/post'}
How it works…
Requests directly encodes and sends data in the configured format. By default, it sends POST data in the application/x-www-form-urlencoded format.
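To see what that encoding looks like on the wire, the standard library can reproduce it; note how a list-valued field such as topping becomes repeated key=value pairs (doseq=True handles sequences):

```python
from urllib.parse import urlencode, parse_qs

data = {'custname': "Sean O'Connell", 'size': 'small',
        'topping': ['bacon', 'onion']}

# doseq=True expands list values into repeated keys.
body = urlencode(data, doseq=True)
print(body)
# → custname=Sean+O%27Connell&size=small&topping=bacon&topping=onion

# The server decodes it back into the original structure.
print(parse_qs(body)['topping'])  # → ['bacon', 'onion']
```

This is exactly the body that requests builds and sends when you pass a dictionary as the data argument.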
The key aspect here is to respect the expected format of the form and the allowed values for each field; sending incorrect data typically returns a 400 error, indicating a problem with the client's request.
[Related article: Building a Scraper Using Browser Automation]
There’s more…
Other than following the format of forms and inputting valid values, the main problem when working with forms is the multiple ways of preventing spam and abusive behavior. A common protection is to require that the form be downloaded before it is submitted: the server embeds a unique token in the form that must be sent back with the submission, which prevents submitting the same form multiple times and guards against Cross-Site Request Forgery (CSRF).
To obtain the specific token, you first need to download the form, as shown in the recipe, read the value of the CSRF token, and send it back along with the rest of the data. Note that the token can have different names; this is just an example:

>>> form.find(attrs={'name': 'token'}).get('value')
'ABCEDF12345'
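The token-extraction step can also be sketched offline with the re module; the hidden-field name and value below are invented for illustration, and real sites will vary:

```python
import re

# A hypothetical form containing a hidden CSRF token.
FORM_HTML = '<form><input type="hidden" name="token" value="ABCEDF12345"></form>'

# Pull the value attribute of the field named "token".
match = re.search(r'name="token"\s+value="([^"]+)"', FORM_HTML)
token = match.group(1)
print(token)  # → ABCEDF12345

# The token is then included alongside the rest of the form data on submission.
data = {'token': token, 'custname': "Sean O'Connell"}
```

In practice, parsing with BeautifulSoup (as in the recipe) is more robust than a regular expression, since attribute order and quoting can vary.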
In this article, we learned how to retrieve a web form, parse it, inspect its input fields, and submit it programmatically from Python. We also explored the roles of the requests, Beautiful Soup, and re modules.
About the Author
Jaime Buelta has been a full-time Python developer since 2010 and is a regular speaker at PyCon Ireland. He has been a professional programmer for over two decades, with rich exposure to many different technologies throughout his career. He has developed software for a variety of fields and industries, including aerospace, networking and communications, industrial SCADA systems, video game online services, and financial services.
Editor’s note: Interested in learning more about coding beyond just retrieving webpages through Python? Check out some of these upcoming similar ODSC talks:
ODSC Europe: “Programming with Data: Python and Pandas” – In this training, you will learn how to accelerate your data analyses using the Python language and Pandas, a library specifically designed for tabular data analysis.
ODSC Europe: “Introduction to Linear Algebra for Data Science and Machine Learning With Python” – The goal of this session is to show you that you can start learning the math needed for machine learning and data science using code.