I've been scraping Craigslist for freelance programming jobs since 2006. This post details the various setups I've used over the years, and how I eventually arrived at being able to scrape their 2022 AJAX pages.
So, initially, believe it or not, I was using the CL RSS feeds in Google Reader. And it was glorious!
Then Google Reader shut down, and I used RSSOwl and another feed reader for the next ten years.
Then, I believe what happened was that Craigslist blocked all the popular RSS readers. So, I think the first scraper I wrote simply pulled down and parsed the RSS feeds itself.
Around 2020, Craigslist removed all the RSS feeds. That's when I wrote a scraper that parsed the actual web pages.
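For the curious, that kind of static scraper boils down to something like the sketch below, using requests and BeautifulSoup. The search URL and CSS selectors are placeholders for illustration; the old markup they would have targeted is long gone.

```python
# A minimal static-page scraper: fetch the listing page, parse the HTML,
# and pull out each result's title and link. The URL and the CSS class
# names below are illustrative placeholders, not the real old markup.
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://sfbay.craigslist.org/search/cpg",  # example: computer gigs
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for row in soup.select("li.result-row"):        # placeholder selector
    link = row.select_one("a.result-title")     # placeholder selector
    if link:
        print(link.get_text(strip=True), link["href"])
```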
And that brings us to now: in late 2022, Craigslist slowly rolled out a new AJAX page structure that broke my scraper.
Now, most developers will probably laugh at me, but I started dissecting the front-end code in order to pull the data straight from the AJAX calls. I found the AJAX URL, and its response had most of the data I was after, but some necessary fields were missing.
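The direct approach, roughly, means replaying the request the page itself makes, as in the sketch below. The endpoint path and query parameters are placeholders, since this isn't Craigslist's actual AJAX URL.

```python
# Replaying an AJAX request found in the browser's network tab.
# The endpoint path and query parameters below are placeholders.
import requests

resp = requests.get(
    "https://sfbay.craigslist.org/example/ajax/endpoint",  # placeholder
    params={"query": "programming"},                        # placeholder
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()
# Most of the fields I wanted were in here, but not all of them.
print(data)
```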
I stopped working on it for a number of days, and while I was doing something else I happened to come across Selenium. I had looked into Selenium a couple of years back, but it never really made it into my toolbox. So, when I bumped into it here, it immediately gave me the idea to load the page, wait for the AJAX to finish, and then basically scrape static HTML. And it worked like a charm! I didn't have to learn their AJAX!
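Here's a minimal sketch of that approach, assuming Chrome and Selenium 4. The search URL is just an example, and the CSS selectors are guesses at the new result markup, so they may need adjusting.

```python
# Load the page, wait for the AJAX-rendered results to show up in the
# DOM, then scrape what is by that point effectively static HTML.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://sfbay.craigslist.org/search/cpg")  # example search
    # Block until JavaScript has populated the result list.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, "li.cl-search-result")  # guessed selector
        )
    )
    # From here on, it's ordinary HTML scraping.
    for row in driver.find_elements(By.CSS_SELECTOR, "li.cl-search-result"):
        link = row.find_element(By.CSS_SELECTOR, "a")
        print(link.text, link.get_attribute("href"))
finally:
    driver.quit()
```

The nice thing about the explicit wait is that it doesn't care what the AJAX does under the hood; once the results exist in the DOM, the rest is plain old scraping.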
I just wonder: are the new CL AJAX pages mainly meant to prevent scraping, or did they do it for a different reason?