Meet Antonio, who came to New York from Italy. He is a foodie and a music lover. It is Saturday night, and he wants to find out what’s happening in the city. He opens the event discovery app on his Android phone and finds that Katy Perry, his favorite singer, is performing within walking distance.
Do you know how curated event aggregator platforms (like the one Antonio used) find the data they need?
Event aggregator platforms collect data from other websites to display in their web or mobile apps. Although some sites have APIs, web scraping is the only reliable way of getting the data in most cases. Here are some thoughts on building a backend crawler for an event discovery platform that uses scraped data from 50+ websites.
Choice of Technology: We selected Scrapy, a Python framework, for building the web scrapers. It is a mature, open source framework that we have been using for a while. The CMS used for the project is built on Django, which is also a Python framework. Any popular NoSQL database would be a fine choice for storing the data.
Here are the major steps in building a backend crawler for an event discovery platform.
Standardization: We need to define a standard data structure for event data. The data is much easier to work with once the data from all event sites is unified under one standard.
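To make the idea concrete, here is a minimal sketch of mapping two differently shaped site records onto one standard. The site names, field names, and date formats are all hypothetical and stand in for whatever each real site uses; this is an illustration, not our production code.

```python
from datetime import datetime

# Fields every standardized event record is expected to carry (assumed set).
STANDARD_FIELDS = ("name", "description", "start_time", "venue")

# Hypothetical per-site mappings: standard field -> key used by that site.
SITE_FIELD_MAPS = {
    "site_a": {"name": "title", "description": "summary",
               "start_time": "when", "venue": "place"},
    "site_b": {"name": "event_name", "description": "details",
               "start_time": "starts", "venue": "location"},
}

# Hypothetical date formats, one per site.
SITE_DATE_FORMATS = {
    "site_a": "%d/%m/%Y %H:%M",
    "site_b": "%Y-%m-%dT%H:%M",
}


def standardize(site, raw):
    """Convert one raw scraped record into the shared event structure."""
    field_map = SITE_FIELD_MAPS[site]
    # Rename site-specific keys and trim stray whitespace from the values.
    record = {field: raw[field_map[field]].strip() for field in STANDARD_FIELDS}
    # Parse the site's date string into a real datetime object.
    record["start_time"] = datetime.strptime(
        record["start_time"], SITE_DATE_FORMATS[site]
    )
    record["source"] = site
    return record
```

With a mapping like this, adding a new site means adding one entry to each table rather than touching the code that consumes the events.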
Serialization: Define the data points needed, such as event name, event description and event timings. Data extracted from event sites should be serialized for ease of handling.
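One possible way to do this is a small data class that serializes to JSON. The `Event` class and its fields below are assumptions made for illustration (only name, description and timings come from the list above); in a Scrapy project the same role could be played by an item class.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class Event:
    """Hypothetical event record with the data points named above."""
    name: str
    description: str
    start_time: datetime
    end_time: Optional[datetime] = None

    def to_json(self) -> str:
        # Store datetimes as ISO 8601 strings so the record is plain JSON.
        record = asdict(self)
        for key in ("start_time", "end_time"):
            if record[key] is not None:
                record[key] = record[key].isoformat()
        return json.dumps(record, sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Event":
        # Reverse of to_json: turn the ISO strings back into datetimes.
        record = json.loads(payload)
        for key in ("start_time", "end_time"):
            if record[key] is not None:
                record[key] = datetime.fromisoformat(record[key])
        return cls(**record)
```

A record serialized this way can be written to a queue, a file, or a document store and read back without losing the event timings.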
Building: Build the web scrapers using the Scrapy framework.
Detect pattern change: A change in the pattern of a website can either affect the quality of the data being scraped or crash the scraper. We need systems in place to detect such changes. Think about an event held on Monday but listed in your app as Saturday. Antonio would not be impressed with your app.
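A simple check, sketched below, is to measure how many scraped records are missing required fields and to flag the scraper when that share jumps. The required fields and the 20% threshold are assumptions; a real system would tune them per site and would likely add checks on value formats as well.

```python
# Fields a healthy record is assumed to contain.
REQUIRED_FIELDS = ("name", "description", "start_time")


def missing_ratio(records):
    """Fraction of records with at least one empty or absent required field."""
    if not records:
        # An empty crawl is treated as fully broken.
        return 1.0
    broken = sum(
        1 for record in records
        if any(not record.get(field) for field in REQUIRED_FIELDS)
    )
    return broken / len(records)


def pattern_changed(records, threshold=0.2):
    """Flag a likely site layout change when too many records are incomplete."""
    return missing_ratio(records) > threshold
```

This catches the common failure where a selector silently stops matching and fields come back empty. It does not catch a wrong but well-formed value, such as the Monday event listed as Saturday, which needs additional checks.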
Updated data: New events will be added to the sites, and we need the updated data. Each web scraper should be scheduled intelligently, according to how often its event site is updated.
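As a rough sketch of this kind of scheduling, each site can be given its own crawl interval and a scraper is run only once that interval has elapsed. The site names and intervals here are made up; in practice they would be derived from how often each site is observed to publish new events.

```python
from datetime import datetime, timedelta

# Hypothetical crawl intervals per site.
CRAWL_INTERVALS = {
    "site_a": timedelta(hours=6),
    "site_b": timedelta(days=1),
}


def due_scrapers(last_runs, now):
    """Return the sites whose crawl interval has elapsed, most overdue first."""
    overdue = []
    for site, interval in CRAWL_INTERVALS.items():
        last = last_runs.get(site)
        if last is None:
            # Never crawled before: treat as maximally overdue.
            lateness = timedelta.max
        else:
            lateness = now - (last + interval)
        if lateness >= timedelta(0):
            overdue.append((lateness, site))
    return [site for lateness, site in sorted(overdue, reverse=True)]
```

A scheduler loop would call `due_scrapers` periodically, run the returned scrapers, and record the new run times.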
Maintenance: When a site changes its pattern or a scraper crashes, the web scraper should be moved from the schedule pool to the maintenance section. Once the problem is fixed, it should be brought back to the scheduler queue.
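The bookkeeping for this can be quite small. The `ScraperPool` class below is a hypothetical in-memory sketch; a real deployment would persist this state so it survives restarts.

```python
class ScraperPool:
    """Tracks which scrapers are schedulable and which are under maintenance."""

    def __init__(self, names):
        self.scheduled = set(names)
        self.maintenance = {}  # scraper name -> reason it was pulled

    def mark_broken(self, name, reason):
        # Pull the scraper out of the schedule pool and record why.
        self.scheduled.discard(name)
        self.maintenance[name] = reason

    def mark_fixed(self, name):
        # Return a repaired scraper to the schedule pool.
        if name in self.maintenance:
            del self.maintenance[name]
            self.scheduled.add(name)
```

For example, after `pool.mark_broken("site_a", "pattern change")` the scheduler no longer sees `site_a`, and `pool.mark_fixed("site_a")` brings it back.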
Storing events: Upcoming events are more important than events that have already taken place. Future events and past events should be stored separately, as batches, in the database.
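Before writing to the database, the split itself can be done with a small helper like the one below. It assumes each event is a dictionary with a `start_time` datetime, which is our illustrative convention rather than a fixed schema; the two resulting batches could then go to separate collections.

```python
from datetime import datetime


def partition_events(events, now):
    """Split events into (upcoming, past) batches relative to `now`."""
    upcoming, past = [], []
    for event in events:
        if event["start_time"] >= now:
            upcoming.append(event)
        else:
            past.append(event)
    # Soonest upcoming event first, since those matter most to users.
    upcoming.sort(key=lambda event: event["start_time"])
    return upcoming, past
```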
The seven steps outlined above are the lessons we learned when we built an event discovery platform. You will face similar problems when building a content discovery platform.
Thanks for reading this blog post. Datahut offers affordable data extraction services (DaaS).
If you need help with your web scraping projects, let us know and we will be glad to help.