
Part 1: Web Scraping

Rodrigo Ledesma

Updated: May 18, 2022


So, you might be wondering why this person is talking about Disney and Machine Learning. Well, ever since I first went to Walt Disney World in Orlando, Florida, with my parents as a kid, I have been in love with the experience. On that first trip I only visited Disney; the second time, I got to see Universal Studios and the Wizarding World of Harry Potter, and that is where my love for the experience reached its climax. As the years passed, I started planning my imaginary perfect visit to both resorts. But as you can imagine, there are entire companies dedicated to helping you plan such a trip, so it is not a piece of cake. This series of articles will discuss how I combine my passion for analyzing data and building Machine Learning models with the excitement and illusion of going to an amusement park.


Suppose you are like me and start planning your vacations six months in advance, reading every article and watching every video about your destination. In that case, you will understand the importance of this work. For example, when planning my trip (the one I never took because of COVID), I realized that Walt Disney World has four main parks…



On top of that, Universal Studios has two more parks, and in total there are more than 200 different attractions across the six parks. If we want to be mathematical about it: even if we don't eat and take very short bathroom breaks, we will be able to visit maybe 60 different attractions in a week (10 per day). A quick calculation using combinations with repetition gives a little more than 6⁵⁰ possible itineraries to choose from. Of course, this is an exaggeration, and it is extremely difficult (but not impossible) to visit more than three different parks in a single day, but you get my point.
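If you want to see where a number of that magnitude comes from, here is a quick back-of-the-envelope check in Python using the standard formula for combinations with repetition, C(n + k − 1, k); the attraction count n = 200 and the weekly ride budget k = 60 are just the rough estimates from above:

import math

# Choosing k = 60 rides out of n = 200 attractions, allowing repeats:
# combinations with repetition = C(n + k - 1, k)
n, k = 200, 60
print(math.comb(n + k - 1, k))  # roughly a 60-digit number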


At the end of this series, I want to be able to decide which day or date will be the best for me to visit my favorite park and ride my favorite rollercoaster with the minimum amount of waiting time. Or: what will be the best order of rides at Universal, given that I will be visiting on Mother's Day and the forecast is cloudy? Can you imagine the potential of this work? We could even create a startup that helps clients plan their trips with AI and make the most out of their vacations! If you are a headhunter and you like this idea, don't hesitate to get in touch with me ;)


My research aims to find the perfect combination of rollercoaster rides, shows, and parks by treating the plan as an optimization problem. The first problem I will face is how to collect reliable data. That is: how can I get accurate information about how long it takes an average visitor (without a fast pass or any other paid service to skip or shorten lines) from the moment they step into the queue until they get to ride the rollercoaster or see the show?



Throughout these articles, I will tell you how I managed to obtain the information, which variables the dataset contains, and which technologies I chose to use. Once that topic is done, I will describe the results I obtained using a series of Machine Learning (ML) algorithms and different techniques. Finally, I want to clarify that this is my first post and that I am still no expert in any of these topics, but I intend to set a solid base for further research and communicate my passion to a general audience.


My idea so far is to discuss the following topics:


  • The use of APIs to obtain reliable information

  • Storing datasets with an automated Python script

  • Databases to keep and read data

  • Supervised learning for time predictions

  • Unsupervised learning for time predictions

  • Semi-supervised learning for time predictions

  • Recurrent Neural Networks for time series

  • Online learning and the use of Kafka to simulate a continuous flow of information and make real-time predictions

  • Evaluation of variable importance in online learning

I don’t know how far these articles will take me, so let’s start with the first part of my work. Thank you for your interest and time; without further ado, here is the first part of what I feel will be a whole universe of possibilities.


How will I get the information, and which data will be best for my analysis?

Well, this is no trivial question. Both companies (Disney and Universal) offer their guests and the general public near-real-time information on how long the queue is at each ride; you only need to download their app. I did some research, and in reality this information is not 100% accurate, since the calculations are based on statistics and on asking guests to carry a device that keeps track of time while they wait in line. I watched more than 300 YouTube videos by bloggers who go to the parks every day and compared the app’s data with their personal experiences. Generally speaking, the app is about 80% accurate. For my analysis, this is enough, as I am not currently living in the US and it is impossible for me to get more accurate information.


The problem I faced was: how am I supposed to extract this information? I could do some image-to-text recognition or even build something with Convolutional Neural Networks, but would it be worth it? My conclusion was no. So I found a web page that does this job for me.


The web page is called The Laughing Place, and it gives you a table with near-real-time information on how long the queue for each ride or show is.



Now, how can I automate extracting the information from six different parks and repeat the extraction a couple of hundred times a day? The obvious answer was to create a Python script that reads the table, extracts the information, and finally stores my precious data in a secure location. As you can imagine, this is where the blog gets interesting.
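Before we get to the scraping itself, here is a minimal sketch of what that automation loop could look like, assuming the table-reading logic (shown in the next section) is wrapped in a hypothetical function called scrape_magic_kingdom that returns a DataFrame. It polls at a fixed interval and appends each snapshot, with a timestamp, to a CSV file:

import time
from datetime import datetime

POLL_INTERVAL = 300  # seconds between scrapes (~288 snapshots a day)

while True:
    snapshot = scrape_magic_kingdom()        # hypothetical wrapper around the code below
    snapshot['timestamp'] = datetime.now()   # tag every reading with its collection time
    # append each snapshot to a local CSV; a database would work just as well
    snapshot.to_csv('magic_kingdom_waits.csv', mode='a', header=False, index=False)
    time.sleep(POLL_INTERVAL)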


Web Scraping the Laughing Place

I decided to use the Beautiful Soup library to tackle my problem because it is beginner-friendly. In addition, there are tons of YouTube videos where you can learn how to extract information from HTML tables. I like to work in Jupyter Notebooks, so please feel free to take a look at my GitHub page.


# Magic Kingdom: scrape the current wait-times table
import requests
import pandas as pd
from bs4 import BeautifulSoup

source = requests.get('https://www.laughingplace.com/w/p/magic-kingdom-current-wait-times/').text
soup = BeautifulSoup(source, 'lxml')
# soup.prettify() formats the raw HTML nicely if you want to inspect it

df = pd.DataFrame()
table = soup.find('table')                   # the wait-times table
rows = table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]    # clean up each cell's text
    temp = pd.DataFrame([cols])
    df = pd.concat([df, temp])

# keep only the first two columns: ride name and wait time
Magic_kingdom = df.filter([0, 1], axis=1)

I tried to make my code as easy to read as possible because, after six months, code starts to turn into hieroglyphics. Let’s go line by line. First, we create a source object by requesting the URL where the table is published, then build a BeautifulSoup object from it; the prettify function formats the raw HTML nicely if you want to inspect it. With the soup object set up, I look for the table and dive into its rows to extract the name of each ride and the amount of time guests will be waiting.
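One small post-processing step worth adding (not in the original snippet, just a suggestion): give the two columns real names and coerce the wait time into a number so it can be analyzed later. Note that "Closed" is only my guess at what a non-numeric cell might contain:

# Name the columns and turn the wait time into a number;
# non-numeric cells (e.g. "Closed") become NaN
Magic_kingdom.columns = ['ride', 'wait_time']
Magic_kingdom['wait_time'] = pd.to_numeric(
    Magic_kingdom['wait_time'].str.extract(r'(\d+)', expand=False),
    errors='coerce'
)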


I repeated this process for all the parks I was interested in, so I see no need to put all of the code here. For this first article, this is it. The next one will cover how to extract more relevant data from the internet and how to use APIs with Python.
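For completeness, here is one way to avoid the copy-and-paste: wrap the logic in a function and loop over the parks. Only the Magic Kingdom URL appears in this post, so the remaining entries are placeholders you would fill in with the corresponding laughingplace.com pages:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_wait_times(url):
    """Return a DataFrame with the ride name and wait time from one park page."""
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    data = []
    for row in soup.find('table').find_all('tr'):
        cells = [td.text.strip() for td in row.find_all('td')]
        if cells:                    # skip header rows, which contain no <td> cells
            data.append(cells)
    return pd.DataFrame(data).filter([0, 1], axis=1)

parks = {
    'Magic Kingdom': 'https://www.laughingplace.com/w/p/magic-kingdom-current-wait-times/',
    # ...add the other five parks' wait-times URLs here
}
tables = {name: scrape_wait_times(url) for name, url in parks.items()}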


Thank you for reading. I hope you have enjoyed the article.


