Hector Yeomans, Engineering Partner at Commit

Targeting transparency with scraping and Cheerio

January 27, 2022 in CHOP

Mexico has several laws that mandate transparency. Those laws are interpreted differently by the federation and by each state. It should be easy for a citizen of this country to know how much is spent across all areas of government. The reality is that you have to know where to look to find the information.

If you work for a state government in Mexico, you get access to social security. That means a fee is deducted from your paycheck every month, and in exchange you get access to healthcare and a retirement plan. My mom was a government employee for 28 years. She retired almost four years ago, and I was in charge of picking up her medicine every month. Half of the time I went to the government drug store, they told me they didn't have the medication: it had run out, and I had to pay for it out of pocket. At some point I stopped going to the government drug store and bought it myself; it was more accessible and less time-consuming. Over those months, a question troubled my mind: how do I know how much the government spends on medicines?

My local government has a website that is essentially a search engine. There you can look up contracts and any public tender. The principal purpose of the site is to fulfill the law; however, the user experience is abysmal. A search for contracts on medications and treatments takes around 30 seconds. From there, you have to understand how government procurement works. All files, paper trails, and signatures are published as PDFs, and instead of containing selectable text, every PDF is a scanned image. I was discouraged from searching for my answer, but I noticed something while browsing the site: the identifiers for public tenders were consecutive in the page URLs, starting at ID=10 and going all the way to ID=95000. I had the idea to write a scraper, and here is where Cheerio and Node come into play.

# Cheerio

Cheerio is a package that parses HTML and allows the extraction of data. It implements a subset of the jQuery model for querying information, and it is fast. I wrote a simple script to visit each public tender page and save the HTML locally. Then, with Cheerio, I parsed and traversed the data to extract what was necessary. I won't dig too deep into the Cheerio API, but I want to present two selectors that made my life easier.

## Child selector

Most of the HTML on this site is riddled with `<table>` tags, and parsing data out of nested `<tr>` and `<td>` elements is unwieldy. Thanks to the child selector, I was able to extract as much information as possible. In the following snippet, I iterate over each table row and then use `first-child` and `last-child` to get all the information I need:

```js
$('table > tbody tr').each((index, tr) => {
  const $firstTd = $(tr).find('td:first-child')
  if ($firstTd.html() === null) {
    return
  }

  // Skip rows whose first cell is not a PDF file name
  if (!$firstTd.text().trim().endsWith('.pdf')) {
    return
  }

  // The link in the last cell looks like ?contratacionId=123&anexosId=456
  const href = $(tr).find('td:last-child a').attr('href')
  const splitted = href.split('=')
  const contratacionId = splitted[1].replace('&anexosId', '')
  const anexosId = splitted[2]
  const name = `contratacionId-${contratacionId}-anexosId-${anexosId}`

  linksToPdf.push({
    id: contratacionId,
    name,
    link: `${baseUrl}${href}`,
  })
})
```

## Sibling selector

I found that one specific API returned partial HTML responses with two or three `<table>` tags at the root level. The sibling selector let me target the second or third table in the response. In the following snippet, I iterate over each table row of the third table at the root of the HTML payload:

```js
$ciudad('table ~ table ~ table > tbody:first')
  .find('tr')
  .each(function () {
    const nombre = formatName($ciudad(this).find('td:first').text())
    regidores.push(nombre)
  })
```

# Conclusion

After digging through many malformed HTML pages, I was able to put together a simple Excel file containing all the information on past contracts. I posted the link on my Twitter account, and people started copying it. It was much easier for people to filter and search. After a couple of months of keeping track of contracts, I got invited to a radio show to talk about transparency and what I had found.

On the radio show, we uncovered multiple signs of possible fraud. Government agencies had bought goods at above-market prices, and we found construction projects that were paid for in full but never finished.

Hector Yeomans is a full-stack developer who loves using his technical skills to solve customer problems. When he is not at the computer, he enjoys reading about Mexican history and politics.

