I’m a PHP programmer *most of the time*. PHP was the first language I learned, alongside HTML and CSS, and I have coded in it almost daily ever since.
The syntax is elegant, you write code fast and it just works, especially if it is paired with a great framework like Laravel.
But for the last few weeks I have spent my free time building a scraper for the CEAC (Consular Electronic Application Center) to extract Diversity Visa program data.
By the way – a couple of people have questioned if this is legal. Yes, it is. The Department of State has said this data is in the public domain and can be copied and distributed without their permission. Here is their copyright statement.
Anyway, in the DV2024 program there are approximately 41,000 cases to check in order to get updated information about the issuance status and produce statistics like this brilliant guy Xarthisius did here.
The problem is that Xarthisius publishes this data once a week, and I wanted to be able to run it whenever I want, especially when the DV2025 program starts (the one where we won).
So I started building the scraper in Python. To be honest, that was only because I found a related tool for checking non-immigrant visa status, took inspiration from it, and continued with Python. I do not code in Python regularly, but I do not find it hard; it is a fun language to build stuff with.
Here’s a summary of what the scraper has to do:
- Do the initial request to set up the session.
- Before each case check, fetch the captcha image and try to break it using various custom ML models. If all models return the same code, carry on; otherwise refresh the captcha (another request).
- When the modal opens, parse its HTML and put the data into nice Python structures (number of persons, consulate, status, dates, etc.).
- If there are multiple persons in the same case, we have to make another request for each person to get detailed status about that person.
- After all data is parsed, close the modal and carry on with another case number.
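The captcha step in the list above can be sketched as a small consensus loop. This is only an illustration, not the scraper's actual code: `fetch_captcha`, the `Solver` type, and `max_attempts` are hypothetical names, and each solver stands in for one of the ML models.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a solver takes raw image bytes and returns its guess.
Solver = Callable[[bytes], str]


def solve_by_consensus(
    fetch_captcha: Callable[[], bytes],
    solvers: Sequence[Solver],
    max_attempts: int = 5,
) -> Optional[str]:
    """Return a captcha code only when every solver agrees on it."""
    for _ in range(max_attempts):
        image = fetch_captcha()  # one extra request per fresh captcha
        guesses = {solver(image) for solver in solvers}
        if len(guesses) == 1:
            return guesses.pop()
        # The models disagree: loop again, which fetches a new captcha.
    return None  # give up after max_attempts refreshes


if __name__ == "__main__":
    # Stub data: the models disagree on the first image, agree on the second.
    images = iter([b"first", b"second"])
    stub_solvers = [
        lambda img: "AAAA" if img == b"first" else "XK42",
        lambda img: "BBBB" if img == b"first" else "XK42",
    ]
    print(solve_by_consensus(lambda: next(images), stub_solvers))
```

Requiring unanimous agreement trades a few extra captcha requests for fewer wrong submissions, which matters when a firewall is counting your failed attempts.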
This process is more complicated than it seems. From time to time the firewall blocks you, so you have to rotate the VPN server, and you have to run the process in multiple threads to get all the cases in ~1 hour.
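For the multi-threaded part, here is a minimal sketch using Python's standard `concurrent.futures` module. The `fetch_case` callable, the worker count, and the case-number format are assumptions for illustration; the real scraper does the captcha and HTML work inside each call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_all(
    case_numbers: Iterable[str],
    fetch_case: Callable[[str], dict],
    max_workers: int = 16,
) -> Tuple[Dict[str, dict], List[str]]:
    """Fetch many cases concurrently; collect failures for a later retry."""
    results: Dict[str, dict] = {}
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its case number so errors can be attributed.
        futures = {pool.submit(fetch_case, n): n for n in case_numbers}
        for future in as_completed(futures):
            number = futures[future]
            try:
                results[number] = future.result()
            except Exception:
                # For example a firewall block or a network error.
                failed.append(number)
    return results, failed


if __name__ == "__main__":
    def fake_fetch(number: str) -> dict:  # stand-in for the real HTTP work
        return {"case": number, "status": "unknown"}

    # "CASE0", "CASE1", ... are placeholder identifiers, not real case numbers.
    done, failed = fetch_all((f"CASE{i}" for i in range(100)), fake_fetch)
    print(len(done), "fetched,", len(failed), "to retry")
```

Threads fit here because each worker spends most of its time waiting on the network, so the interpreter lock is rarely the bottleneck.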
Once I had perfected the scraper and was able to get around 10k cases in 10 minutes, I thought I could migrate this “little” scraper to PHP, since I am more comfortable with it (no hard feelings, Python, I like you but I code faster in PHP… at least I did before this little project).
The code and project in PHP are more elegant, I have types everywhere, and fetching a case looks like this:
The real problem in the PHP implementation started when I needed real-world scaling. PHP is not commonly used with multiple threads: you need a custom extension for that, it does not work well, threads cannot communicate efficiently with the parent process, and you cannot easily share information between threads. As a result, even though I have a class to fetch CEAC cases in PHP, it takes much longer to fetch 41k cases. So I decided to put the PHP project on hold and keep using Python.
At the end of the day, even if I did hacky things in Python with Futures and shared data between threads and workers, I am able to fetch 1000 cases in 2 minutes, depending on how well CEAC responds.
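One way to share state between worker threads without too much hackiness is a small lock-guarded object. The sketch below is an assumption about how VPN rotation could be coordinated, not the scraper's real code: `SharedVpn` and `switch_server` are hypothetical names. A generation counter ensures that when many workers hit the firewall at once, only the first one actually rotates.

```python
import threading
from typing import Callable


class SharedVpn:
    """Thread-safe holder for the VPN state shared by all workers."""

    def __init__(self, switch_server: Callable[[], None]) -> None:
        self._switch_server = switch_server  # hypothetical: reconnects the VPN
        self._lock = threading.Lock()
        self._generation = 0  # incremented on every rotation

    @property
    def generation(self) -> int:
        with self._lock:
            return self._generation

    def rotate_if_stale(self, seen_generation: int) -> int:
        """Rotate once per block, even if many workers report it together."""
        with self._lock:
            if seen_generation == self._generation:
                # Nobody has rotated since this worker last looked.
                self._switch_server()
                self._generation += 1
            return self._generation


if __name__ == "__main__":
    rotations = []
    vpn = SharedVpn(lambda: rotations.append("switched"))
    seen = vpn.generation
    # Eight workers all report the same block at generation 0.
    workers = [
        threading.Thread(target=vpn.rotate_if_stale, args=(seen,))
        for _ in range(8)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(len(rotations), "rotation(s), generation", vpn.generation)
```

Each worker reads `generation` before a request and passes that value back if the request gets blocked, so a late report about an old server does not trigger a second, pointless rotation.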
There are things I could still try if I decide to continue, such as:
- Batching x amount of cases and checking them in a queue with Laravel Horizon.
- Keeping the current used VPN server in cache and syncing it that way across workers.
- Learning the async/threads extension better.
- Using a Selenium backend with multiple open tabs.
But the moral of the story is… sometimes you have to code in another language because it is simply better for the thing you want to do. Even though I love PHP, I also know Python, Rust, Go, Ruby, and many other programming languages. PHP is a decent language, but when you need concurrency and threads, Python might be better. There might be an even better language for this scraper, but Python does the job and is easy to maintain.
So don’t limit yourself in the tech industry, try to learn other new things and languages. It will only benefit you.