Author: Julius Cerniauskas
There’s a much-maligned topic in web scraping – data parsing. Building scrapers would be a lot easier if the data presented through HTML weren’t intended for browsers. However, that is the case, which means that the data extraction process has to jump through several hoops before delivering results.
Parsing is part of the process. Unfortunately, it’s one of the most resource-intensive parts of the entire web scraping chain. Developing a parser for a specific website isn’t enough; it also has to be maintained over time. Even then, that might not be the end, as some complex websites might need numerous parsers to extract the data from the source.
The dilemma
Any sufficiently large scraping project has to develop its own parsers. That means dedicating time and resources to a comparatively low-skill task. Most of the time, developing and maintaining parsers is a task for junior developers.
However, junior developers are a highly valuable resource. Spending time writing and maintaining parsers usually does little to improve their skills. In fact, it might even cause a certain level of frustration.
On the other hand, parsing is a critical part of the scraping process. Most of the time, the data acquired is messy and unusable without intervention. Since the end goal of all web scraping, whether for personal or commercial use, is to provide data for analysis, parsing is a necessity.
In short, we have a necessary process that takes up a significant portion of resources and time while being neither particularly challenging nor useful to the developer doing it. In other words, it’s a resource sink. Solving this challenge would free up a lot of highly skilled hands and brains for greater work.
A look towards automation
If you were to approach any sensible CXO or businessperson in general with an idea to save significant time for developers, they would accept the suggestion with open arms. There’s rarely anything better than saving resources through automation.
However, automating parsing isn’t as simple as it may seem. Part of the reason is the frequent maintenance required, which usually arises because websites change their layouts. When they do, the parser breaks.
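To see why, consider what a typical hand-written parser looks like. The sketch below is purely illustrative, with hypothetical markup and class names: it hard-codes a selector against one specific layout, and a routine class rename is enough to break it.

```python
# Minimal sketch of a rule-based parser (hypothetical markup and class names).
from bs4 import BeautifulSoup

OLD_LAYOUT = '<div class="product-price"><span class="amount">19.99</span></div>'
NEW_LAYOUT = '<div class="pdp-pricing"><span data-testid="price">19.99</span></div>'

def parse_price(html: str):
    """Extract the price with a selector written against one specific layout."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one("div.product-price span.amount")
    return node.get_text(strip=True) if node else None

print(parse_price(OLD_LAYOUT))  # "19.99" - works for the layout it was written for
print(parse_price(NEW_LAYOUT))  # None - a routine class rename silently breaks it
```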
Yet, predicting future layout and coding changes is simply impossible. Therefore, no rule-based approach is truly viable. Classical programming is of little help here. Manual work, as mentioned previously, is a huge time and resource sink.
One option remains, and it has built up a lot of hype over the past decade or so: machine learning. Parsing seems to be the perfect way to test the mettle of machine learning engineers.
Since HTML tends to be structured similarly across certain categories of pages, the visual changes between them are decidedly small. Additionally, layout changes aren’t usually massive overhauls of an entire website; they’re mostly incremental UX and UI improvements. While that may add to a developer’s annoyance, it makes parsing a great candidate for a stochastic algorithm that looks for similarities between training data and new data.
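To make that concrete, such an algorithm can lean on structural and textual signals of individual DOM nodes rather than on exact selectors, since those signals tend to survive incremental layout tweaks. A minimal sketch of the idea (the feature set here is illustrative, not the one used in our Adaptive Parser):

```python
# Illustrative only: a tiny feature extractor for DOM nodes.
# Signals like these tend to survive incremental layout tweaks even when
# exact class names and element positions change.
import re
from bs4 import BeautifulSoup

def node_features(node) -> dict:
    text = node.get_text(" ", strip=True)
    return {
        "tag": node.name,                                  # e.g. div, span, h1
        "class_tokens": " ".join(node.get("class", [])),   # e.g. "pdp-pricing"
        "looks_like_money": bool(re.search(r"\d+[.,]\d{2}", text)),
        "text_length": len(text),
        "depth": len(list(node.parents)),                  # position in the DOM tree
    }

html = '<div class="pdp-pricing"><span class="price">19.99</span></div>'
soup = BeautifulSoup(html, "html.parser")
for node in soup.find_all(True):
    print(node.name, node_features(node))
```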
Preparing for adaptive parsing
Before engaging in any machine learning project, at least these questions should be answered:
- What will be the limits of the model?
- What type of learning will be needed?
- What type of data (labeled or unlabeled) will be used?
- How will the data be acquired?
Luckily, for our Adaptive Parser project at Oxylabs, the last three questions were the easiest to answer. Since we already knew what we were looking at and for (data from specific pages), we could use labeled data. That meant supervised learning, one of the most practical and easiest-to-execute approaches, could be used.
However, the true difficulty lies in answering the first question, as the rest, at least partly, depend on it. Since all resources are finite, the machine learning model should be as narrow as required and as wide as possible. For us, that meant looking at how our clients were using our solutions (e.g. Real-Time Crawler) and making a decision based on data.
As we discovered through our research, e-commerce product pages were the most painful ones to parse. Generally, the source can be a bit wonky for parsing purposes. Additionally, there are usually almost identical fields that are only sometimes available (e.g. “new price”/“old price”).
These fields can confuse machine learning models as well due to their similarity. However, answering the question about limits let us set proper expectations for accuracy and the amount of data required. Clearly, we would need quite a bit of labeled data, as there would be at least one problematic field.
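As a toy illustration of why the labeled examples matter here (and nothing like our production model), even a simple supervised classifier can learn to separate two near-identical price fields once contextual cues, such as whether a node sits inside strikethrough markup, are labeled for it:

```python
# Toy sketch, not the production model: separating "new price" from
# "old (crossed-out) price" using hand-labeled contextual features.
from sklearn.tree import DecisionTreeClassifier

# Features per labeled node:
# [inside_strikethrough_markup, has_"was/now"_wording, largest_font_in_block]
X = [
    [0, 0, 1],  # labeled new_price
    [1, 1, 0],  # labeled old_price
    [0, 1, 1],  # labeled new_price ("now only ...")
    [1, 0, 0],  # labeled old_price
    [0, 0, 1],  # labeled new_price
]
y = ["new_price", "old_price", "new_price", "old_price", "new_price"]

model = DecisionTreeClassifier(random_state=0).fit(X, y)
print(model.predict([[1, 1, 0]]))  # -> ['old_price']
```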
Answering the final question was somewhat easier. We already knew where to pick up our examples and, in fact, could quite quickly collect a large number of e-commerce pages. It’s quite easy to get your hands on large amounts of unlabeled data; the strenuous part is labeling it.
Labeling data and training
Every supervised learning dataset has to be labeled. In our case, that meant providing labels for most fields on every e-commerce page, and it had to be done at least partly manually. If it could be automated, someone would have already created an adaptive parser.
In order to save time and in-house resources, we took a two-pronged approach. First, we hired a few helping hands to label fields from our soon-to-be training set. Second, we spent some time developing a GUI-based labeling application to speed up the process. The idea was simple – spend more financial resources on manual, repetitive tasks to free up our machine learning engineers’ time for cognitive ones.
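To make the labeling step concrete, a record coming out of such a tool could look something like the following. The structure and field names are hypothetical, shown only for illustration:

```python
# Hypothetical output of a labeling GUI - one record per labeled page.
# The structure and field names are illustrative, not our internal format.
labeled_example = {
    "page_url": "https://example.com/product/123",
    "html_snapshot": "product_123.html",  # raw HTML stored alongside the labels
    "labels": [
        {"field": "title",     "css_path": "h1.product-title", "value": "Example Widget"},
        {"field": "new_price", "css_path": "span.price",       "value": "19.99"},
        {"field": "old_price", "css_path": "span.price-old",   "value": "24.99"},
    ],
}
```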
Once we had enough labeled data to start training our Adaptive Parser, the process became mostly trial and error with some strategizing peppered in between. Sometimes the model would struggle with specific parts, and some logic-based nudging would be required (or would at least speed up the process).
Many months and hundreds of tests later, we have a solution that automatically parses fields in e-commerce product pages and adapts to changes with reasonable accuracy. Of course, maintenance will now be the challenge, but we have shown that it’s possible to automate parsing.
Conclusion
Automating parsing in web scraping isn’t just about saving resources. It’s also about increasing the speed, efficiency, and accuracy of data over time. All of these factors influence the way businesses engage with external data. Primarily, there’s less time dedicated to working around the data and more time for working with it.
More discussions on pressing web scraping topics, industry trends, and expert tips will be shared at Oxycon, an annual web scraping conference. It will take place online on August 25-26, and registration is free of charge.