How to do web scraping with BeautifulSoup and Selenium

  • Web scraping allows you to automate data extraction by combining requests and BeautifulSoup for static pages and Selenium for dynamic content.
  • Understanding the difference between original HTML and JavaScript-generated DOM is key to deciding whether static scraping is sufficient or if an automated browser is needed.
  • Selenium makes it easy to log in, handle iframes, menus, and dynamic tables, while BeautifulSoup simplifies final HTML parsing and structured extraction.
  • It is essential to consider robots.txt, terms of use, technical limits, and anti-bot measures to ensure that scraping is sustainable and legal.


If you had to extract data today from multiple web pages full of tables, JavaScript, and forms, doing it manually by copying and pasting would be absolute torture. That's what web scraping is for: automating all that repetitive work and turning it into a script that works for you while you have a coffee.

Within the Python ecosystem, BeautifulSoup and Selenium make a formidable pair for dealing with both static pages and modern sites loaded with dynamic content. Let's look, calmly but thoroughly, at how to combine both tools to scrape everything from simple listings to pages with iframes, drop-down menus, and dynamic tables like those of ESPN, online stores, or Star Wars-style name generators.

What is web scraping and why do so many people use Python?

Web scraping is the process of extracting information from websites automatically. Essentially, your script acts like a browser: it requests pages, interprets them, and keeps only the data you're interested in. Manually copying and pasting would technically also be scraping, but on such a ridiculously small scale that it's not worth it at all.

Instead of doing it manually, we use scripts and libraries that download the HTML, parse it, and extract specific data (products, prices, comments, names, complete tables…). This allows faster collection of large volumes of information and minimizes human error.

Python has become one of the leading languages for web scraping for several very clear reasons: it is easy to read even for beginners, it has specialized libraries such as requests, BeautifulSoup, Selenium, and Scrapy, and it has a huge community that has already dealt with almost every problem you're going to encounter.

When is it worth doing scraping (and when is it not)?

Scraping is especially useful when you need data from a website that doesn't offer a decent API, or whose public data is only accessible through the web interface. It is used to collect prices, reviews, product listings, ranking information, comments embedded via external services, etc.


Even so, it's important to be clear that scraping should be a last resort, for when there isn't a suitable API or pre-made dataset. It has significant drawbacks: websites frequently change their HTML and your script breaks, some websites outright prohibit scraping in their terms of use, and processing many pages can be slow and resource-intensive.

In an ideal world, you would always work with well-documented APIs or ready-to-use datasets. In practice, that API often doesn't exist or is very limited, and then there's no option but to scrape and adapt your code to the actual structure of the site.

Static vs dynamic: how web pages are really generated

To scrape effectively, you need to understand exactly what your script sees. The browser's "View source" option (a classic in any browser since the 90s) displays the original HTML returned by the server. However, the actual content you see on screen may be very different, because JavaScript comes into play.

Think, for example, of a tutorial page with Disqus comments at the end. If you open the source code, you won't see the comments or the Disqus iframe in the initial HTML. The reason is that this iframe is generated dynamically with JavaScript, once the page has already loaded.

In addition, many websites mix different technologies: HTML for structure, CSS for presentation, JavaScript that manipulates the DOM, and content embedded in iframes (such as third-party comments). The result is a clear difference between what the server returns and the final DOM that the browser constructs after executing scripts, loading iframes, and so on.

Static scraping with requests and BeautifulSoup


So-called static scraping is limited to downloading the HTML exactly as the server sends it, without executing JavaScript. In other words, it gets exactly what you would see with "view source". If the information you're looking for is already in that initial HTML, you don't need to complicate things with automated browsers.

The typical combination is to use requests to make the HTTP request and then BeautifulSoup to parse the HTML. Something as simple as r = requests.get(url) returns a response object, with r.ok to check whether everything went well and r.content holding the bytes of the page, which you normally decode as UTF-8 to work with text.

Once you have the HTML, BeautifulSoup turns that string into a navigable DOM tree. You can search for elements by tag, class, attribute, or combinations thereof. For example, if on an author page each tutorial sits inside an <article> tag, you can (as sketched in the code after this list):

  • Locate all the <article> elements.
  • Within each one, find the main link and extract its href attribute.
  • Build a list with the URLs of all the tutorials.
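
A minimal sketch of that flow, assuming a hypothetical author page where each tutorial lives in an <article> tag whose first link is the main one:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/author/tutorials"  # hypothetical author page

r = requests.get(url, timeout=10)
tutorial_urls = []
if r.ok:
    html = r.content.decode("utf-8")  # decode the raw bytes as UTF-8
    soup = BeautifulSoup(html, "html.parser")
    for article in soup.find_all("article"):
        link = article.find("a")  # first link inside the article, assumed to be the main one
        if link and link.get("href"):
            tutorial_urls.append(link["href"])

print(tutorial_urls)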

Similarly, you can scrape a list of Star Wars planets from a Wikia or Wikipedia page: download the page, find the <tr> rows of a table, filter the <td> cells that contain the name, and add them to a list that you then convert into a DataFrame with pandas and save as CSV.
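
A sketch of that idea, assuming (hypothetically) that the planet name sits in the first cell of each row; the real URL and column layout need checking against the actual page:

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Star_Wars_planets_and_moons"  # example page
soup = BeautifulSoup(requests.get(url, timeout=10).content, "html.parser")

planets = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # header rows use <th>, so they are skipped here
        planets.append(cells[0].get_text(strip=True))  # assume the name is in the first cell

pd.DataFrame({"planet": planets}).to_csv("planets.csv", index=False)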

When do you need dynamic scraping with Selenium?

Static scraping fails when the information only appears after running JavaScript: loading an iframe, clicking buttons, or filling out forms. For those cases you need dynamic scraping, that is, automating a real (or headless) browser so the JS runs just like it would for a user.

Tools like Selenium WebDriver were originally designed for automated testing, but they are perfect for controlling the browser and accessing the live DOM. With Selenium you can open a URL, click buttons, fill out forms, switch iframes, wait for elements to appear, and then extract the resulting HTML.

This approach is essential for cases such as:

  • Disqus comments embedded as an iframe generated via JavaScript.
  • Name generators that show results only after pressing a button and dynamically build tables.
  • Apps like ESPN Fantasy with drop-down menus, tables that change depending on the day, and elements that reload without refreshing the entire page.

Install and prepare Selenium (and the WebDriver)

To get started with Selenium in Python you need, on the one hand, to install the selenium package using pip or pipenv and, on the other, to download the browser driver (ChromeDriver, geckodriver for Firefox, etc.).

With pip, something like pip install selenium is usually enough. If you work with virtual environments, pipenv install selenium is also a good option because it isolates dependencies.

On Windows, for example, you download the Firefox or Chrome driver, unzip the file, place the .exe in an appropriate folder, and add that path to the PATH environment variable. This allows the script to invoke the driver without an absolute path. The idea is the same on Ubuntu or any other Linux, swapping in the corresponding binary.

When you want full control, you can explicitly specify the path when creating the driver, something like this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService

service = ChromeService(executable_path="./chromedriver")
driver = webdriver.Chrome(service=service)

In some advanced scenarios, anti-detection browsers or headless solutions with custom fingerprinting are used: for example, launching sessions through tools like Nstbrowser and then connecting Selenium through the remote debugger address returned by that tool. A configuration is defined with user-agent, platform, hardware concurrency, memory, proxies if any, etc.; the browser is launched with a devtools URL, and the debug port is obtained for use as the debuggerAddress in Selenium's ChromeOptions.
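
A minimal sketch of that attachment step, assuming the tool has already launched a browser and reported its debug address (the 127.0.0.1:9222 value here is a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Placeholder debug address; use the host:port that the launcher tool actually returns
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

# Selenium attaches to the already-running browser instead of starting a new one
driver = webdriver.Chrome(options=options)
print(driver.title)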

Automate navigation with Selenium

Once you have the driver, Selenium allows you to reproduce the actions of a real user: opening pages, clicking, filling out forms, selecting options in drop-down menus, switching between tabs or iframes, and scrolling through the page.

For example, suppose you want to log in to a fantasy baseball league on ESPN, go to the scoreboard, and scrape the results tables week by week. The typical workflow with Selenium would be (see the sketch after this list):

  • Open the league URL with driver.get().
  • Wait for a login iframe to appear, locate it by tag or attribute, and do switch_to.frame(iframe).
  • Inside the iframe, wait for the username field, send the username, and press ENTER.
  • Wait for the password field, fill it in, and submit.
  • Return to the main content with switch_to.default_content().
  • Wait until the Scoreboard link becomes clickable and click it.
  • Wait for the day drop-down menu to load (by a class such as dropdown__select), create a Select object, and choose the matchup you want by index.
  • Once the content has refreshed, locate the tables with the class Table and walk through their rows and cells.
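
A sketch of that flow; the URL, iframe id, and field selectors are assumptions for illustration and must be read off the real page with the inspector:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

LEAGUE_URL = "https://fantasy.espn.com/baseball/league?leagueId=123456"  # placeholder id

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 15)
driver.get(LEAGUE_URL)

# Switch into the login iframe (the id here is an assumption)
iframe = wait.until(EC.presence_of_element_located((By.ID, "disneyid-iframe")))
driver.switch_to.frame(iframe)

user = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='email']")))
user.send_keys("my_user@example.com", Keys.ENTER)
pwd = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "input[type='password']")))
pwd.send_keys("my_password", Keys.ENTER)

driver.switch_to.default_content()  # back to the main page

wait.until(EC.element_to_be_clickable((By.LINK_TEXT, "Scoreboard"))).click()

# Choose a matchup from the dropdown, then read the refreshed tables
dropdown = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "dropdown__select")))
Select(dropdown).select_by_index(0)

for table in driver.find_elements(By.CLASS_NAME, "Table"):
    for row in table.find_elements(By.TAG_NAME, "tr"):
        cells = [c.text for c in row.find_elements(By.TAG_NAME, "td")]
        if cells:
            print(cells)

driver.quit()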

During this process it is key to use WebDriverWait and expected_conditions to ensure that elements exist or are clickable before interacting with them. This greatly reduces the synchronization errors typical of pages that load elements asynchronously.

How to avoid StaleElementReferenceException when extracting data?

In settings like ESPN, it's common to stumble upon StaleElementReferenceException. This exception occurs when an element you had located is no longer valid because the DOM has been updated (for example, when changing the day in the dropdown or refreshing dynamic content).

The typical error pattern is:

  • You select an option from the drop-down menu, and the page reloads the content.
  • You try to use references to tables, rows, or cells that you found before the change.
  • The DOM has changed and those Selenium objects no longer point to anything valid, which is why exceptions are thrown.

The solution consists of relocating the elements after each significant change. That is, after selecting a new matchday in the dropdown, you wait for the tables to reload, and only then call driver.find_elements() again for the corresponding classes or XPaths.

In the example code that iterates through scoreboards and rows, there is another important detail: absolute XPaths from the document root should not be used inside a loop that iterates over each table. If within each scoreboard you do scoreboard.find_elements(By.XPATH, '//tr'), you're actually searching the entire DOM (because of the double slashes at the beginning) and bringing in all the rows of the page on each iteration. The correct approach is to use relative XPaths ('.//tr') or methods such as find_elements(By.TAG_NAME, 'tr') on the current element, as in the sketch below.
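
A sketch of the re-locate pattern, assuming a driver that is already on the scoreboard page from the previous example (class names are the same assumptions as before):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 15)  # `driver` comes from the login example above
day_count = len(Select(wait.until(
    EC.presence_of_element_located((By.CLASS_NAME, "dropdown__select")))).options)

for day_index in range(day_count):
    # Re-locate the dropdown on every pass: the old reference may be stale after a refresh
    Select(driver.find_element(By.CLASS_NAME, "dropdown__select")).select_by_index(day_index)

    # Wait for the refreshed tables, then find them again from scratch
    wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "Table")))
    for scoreboard in driver.find_elements(By.CLASS_NAME, "Table"):
        # './/tr' is relative to this table; '//tr' would scan the whole document
        for row in scoreboard.find_elements(By.XPATH, ".//tr"):
            print([c.text for c in row.find_elements(By.TAG_NAME, "td")])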

Combine Selenium and BeautifulSoup for advanced scraping

In many cases, Selenium is used primarily to interact with the page and leave the DOM in the desired state, while BeautifulSoup handles the more convenient extraction of data once you have the final HTML.

A very clear example is that of a Star Wars random name generator which allows you to select gender (male, female, or both) and number of names (up to 100). The workflow with Selenium would be:

  • Open the generator page with webdriver.Firefox() or Chrome.
  • Locate the radio button for 100 names using find_element_by_xpath (or find_element with By.XPATH in Selenium 4), filtering by name="choice" and value="100", and call click().
  • Locate the Generate! button by searching for the input with name="submit" and value="Generate!", and click it.
  • Once the table with the names has been generated, take driver.page_source and hand it to BeautifulSoup using a parser like lxml or html.parser.
  • Within BeautifulSoup, locate the table containing the names, access its rows and cells, and extract the text for each name.
  • Add each new name to a list, avoiding duplicates, until you reach the desired amount (for example, 100,000).

This process may require a loop that repeats the click on Generate! and parses the table again, since the website offers blocks of 100 names per iteration. With pandas you can convert the resulting list into a DataFrame and save it directly as a CSV file, labeling the column as name and adding the date to the file name for version control.
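
A sketch of that loop, with a hypothetical URL and selectors taken from the workflow above; a fixed sleep stands in for a more robust WebDriverWait:

from datetime import date
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com/star-wars-name-generator"  # hypothetical generator page
TARGET = 100_000

driver = webdriver.Firefox()
driver.get(URL)
driver.find_element(By.XPATH, "//input[@name='choice'][@value='100']").click()

names = set()  # a set avoids duplicates automatically
while len(names) < TARGET:
    driver.find_element(By.XPATH, "//input[@name='submit'][@value='Generate!']").click()
    time.sleep(1)  # crude wait for the table to rebuild
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table")  # assume the names land in the first table
    if table:
        names.update(td.get_text(strip=True) for td in table.find_all("td"))

driver.quit()
pd.DataFrame({"name": sorted(names)}).to_csv(f"names_{date.today()}.csv", index=False)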

Scraping online stores with Selenium, BeautifulSoup and pandas

Another very common practical example is that of online stores with dynamic catalogs: for example, a website selling laptops where you want to extract model names and prices. While some of the content might be in the initial HTML, these sites usually include pagination, filters, dynamic menus, and so on.

The typical strategy is:

  • Use Selenium to load each results page, apply filters if necessary, and make sure the products are visible.
  • Inspect the HTML in the browser to locate the div containers for each product and their CSS classes; for example, a div with a cryptic class like _4R01T that houses the laptop's title.
  • Using BeautifulSoup, or directly with Selenium, search for all those containers and extract the relevant fields: name, price, and whatever features share a common structure.
  • Create a list of dictionaries or simple lists with that data and then pass it to pandas.DataFrame.
  • Export everything to a CSV for further analysis (price comparison, historical evolution, market studies, etc.).

This type of script can be programmed to iterate through multiple pages of results by changing the page number in the URL or by clicking the "Next" button using Selenium. Ultimately, you can have a complete dataset with hundreds or thousands of products ready to be analyzed.
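
A sketch of that strategy, where the store URL and class names are placeholders to be replaced with what the inspector shows on the real site:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

BASE_URL = "https://example-store.com/laptops?page={}"  # hypothetical paginated catalog

driver = webdriver.Chrome()
products = []
for page in range(1, 6):  # first five result pages
    driver.get(BASE_URL.format(page))
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for card in soup.find_all("div", class_="product-card"):  # placeholder container class
        name = card.find("div", class_="_4R01T")
        price = card.find("div", class_="price")  # placeholder price class
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})

driver.quit()
pd.DataFrame(products).to_csv("laptops.csv", index=False)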

Common use cases for web scraping

With these techniques, the practical uses of web scraping multiply. Some of the most interesting are:

  • Competitor analysis: monitor products, services, prices and marketing strategies by extracting data from competitor websites.
  • Price comparison: gather prices from different e-commerce platforms to detect bargains, fluctuations or differences between countries.
  • Social media and news monitoring: although many platforms have APIs, you can scrape public pages to analyze the popularity of hashtags, mentions, headlines, or comments.
  • Lead generation: extract public contacts from directories or business listings, always respecting data protection legislation.
  • Sentiment analysis: combining news, blog posts and comments to measure public opinion on a specific brand or topic.
  • Cybersecurity and auditing: see what information from your own company is easily accessible and could be exploited by third parties.

Legal aspects, robots.txt and ethics of scraping

Before you unleash your scraper at full speed, don't forget the less fun part: the legal and ethical implications. Although you may technically be able to access certain public information, that does not mean you can use it without restrictions or overload someone else's server.

A useful first reference is each site's robots.txt file, normally accessible at /robots.txt. For example, if you visit facebook.com/robots.txt you will see rules indicating which sections are open to crawlers and which should be avoided, via Disallow and Allow directives. The User-agent field identifies the type of bot the rules are aimed at, and an asterisk usually means "any agent".
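
The Python standard library can check those rules for you; a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.facebook.com/robots.txt")
rp.read()

# "*" matches any user agent; use your bot's own name if it announces one
print(rp.can_fetch("*", "https://www.facebook.com/some/page"))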

Although robots.txt is not a real security mechanism (it can simply be ignored), deliberately skipping it can leave you on slippery ground, especially if doing so also causes performance problems for the target site. Furthermore, many websites include specific clauses in their terms of use that prohibit scraping or the mass use of their data.

It is also essential to watch the frequency of your requests. An uncontrolled script that launches hundreds of requests per second can crash a server or trigger anti-bot systems, which is not only unethical but potentially illegal if considered a denial-of-service attack.

Anti-bots, CAPTCHAs, and techniques to avoid breaking your scraper

Modern websites typically integrate anti-bot mechanisms: IP blocking after too many requests, JavaScript challenges that verify the browser is "real", CAPTCHAs, frequency limits, etc. All of this makes sustained scraping quite complicated.

To deal with these barriers, some projects use rotating proxies, headless browsers with realistic fingerprints, and anti-detection tools that simulate human behavior. It is also good practice to introduce random pauses, vary navigation patterns, and keep requests per minute low so as not to raise suspicion, as in the throttling sketch below.
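
A minimal throttling sketch with random pauses between requests (the URL list is a placeholder):

import random
import time

import requests

urls = [f"https://example.com/page/{i}" for i in range(1, 11)]  # placeholder URLs

for url in urls:
    r = requests.get(url, timeout=10)
    # ... parse r.content here ...
    time.sleep(random.uniform(2.0, 6.0))  # random pause to keep the request rate low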

Even so, it is worth remembering that not every technical challenge should be solved at all costs. If a site actively protects its content and API, you may not legally have any room to scrape it beyond very limited use.

Alternatives to classic web scraping

Although web scraping is incredibly powerful, it's often worthwhile to explore alternatives first. One of the cleanest options is official APIs: some sites offer stable APIs with quotas, authentication, and structured data in JSON. These are more reliable, generally better supported, and avoid a lot of headaches with HTML changes or blocks.

Another option is to go directly to pre-made datasets, free or paid. Open data platforms, specialized marketplaces, and academic repositories often offer CSV, JSON, or SQL databases with just what you need, although they may not always be a perfect fit for your specific situation.

Even with these alternatives, web scraping remains very popular because it gives you enormous flexibility: you can access almost any data displayed on a page, combine diverse sources, and adapt the process to your needs without waiting for anyone to expose an API.


Overall, mastering Python, BeautifulSoup, and Selenium opens the door to automating everything from everyday tasks like collecting product prices to complex projects that combine iframes, dynamic content, drop-down menus, and filters. By understanding how the DOM is actually generated, solving typical problems such as StaleElementReferenceException, and respecting the legal and technical limits of each site, you can build robust scrapers that turn the web into a huge source of data ready for analysis, machine learning models, or whatever you can think of.