about
I am currently a student under Google Summer of Code 2016 working with Loklak in FOSSASIA and the other tools required to make loklak happen. I am an active open source contributor and dedicate myself to the open source community and open software.
Scraping and Feeding IOT Datasets into Loklak - Part 2
22 Aug 2016
As the integrations for the IOT services had begun, there are challenges especially with scraping multiple pages at once, such was the case with the NOAA Alerts and Weather information of the US Government. To scrape this information for the live updates that happen every 5 minutes it was necessary to simplify the process with which this process could be completed without running a complex web scraper work on them every time taking up precious cycles, What’s really interesting about the website is the way in which the data can be modeled into XML for any given page. Using this and leveraging the XML and other data conversion logic implemented previously for such task, I started digging deeper into the total working of the website and realized that appending &y=0 to the alerts URL resulted in XML generation, here’s an example of how this works
https://alerts.weather.gov/cap/wwaatmget.php?x=AKC013&y=0
and
https://alerts.weather.gov/cap/wwaatmget.php?x=AKC013
Equivalent XML being
So extracting this has become quite a challenge because this poses two different challenges , one is how we can efficiently retrieve the information of the counties and how we can construct the alert urls. Perl to the rescue here !
Makes a request to the website and constructs the states list of all the states present in the USA. Now it’s time to construct it’s counties.
There we go voila, we now have a perfect mapping in between every single county and the alert URL requirement for that particular county. The NOAA scraper and parser has been quite a challenge but provides the data in real-time from the loklak server. The information can be passed via the XML Parser written as a service at /api/xml2json.json and the developers can receive the information in their required format.