Companion Card Directory
8 similar datasets, 8 different presentation implementations. Why? Government.
Jacob Mulquin, 11/08/2022
I was cleaning out my Code folder the other day and came across an old Python project titled companion_card_directory.
Oh wow, I'd totally forgotten about this!
This project was intended to unify all the different Companion Card directories from around Australia into a single place. The idea was that if, say, you lived in Wollongong but were going on a holiday up to the Gold Coast, you wouldn't have to traverse another website with a different UI and mechanics. The project would consist of a scraper and a webpage to display the data. It looks like I stopped working on it after the scraper started spitting out JSON datasets.
So I fired up the code:
cd companion_card_directory
python3 companion_card_directory
Things seemed to be going smoothly, until I was greeted with this lovely error:
Downloading: https://www.sa.gov.au/__data/assets/pdf_file/0009/684828/I051-Companion-Card-Affiliate-List-07_2021.pdf
Traceback (most recent call last):
File "companion_card_directory.py", line 9, in <module>
scrape()
File "/home/jacob/Code/companion_card_directory/companion_card_directory/scrape.py", line 465, in sa
pdf = pdfplumber.open(file)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfplumber/pdf.py", line 60, in open
return cls(path_or_fp, **kwargs)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfplumber/pdf.py", line 33, in __init__
self.doc = PDFDocument(PDFParser(stream), password=password)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/pdfparser.py", line 39, in __init__
PSStackParser.__init__(self, fp)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 502, in __init__
PSBaseParser.__init__(self, fp)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 172, in __init__
self.seek(0)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 514, in seek
PSBaseParser.seek(self, pos)
File "/home/jacob/.local/lib/python3.8/site-packages/pdfminer/psparser.py", line 202, in seek
self.fp.seek(pos)
AttributeError: 'bytes' object has no attribute 'seek'
Also, of the 5 states/territories that had been scraped previously, only 2 were still populating results. Thank you ACT and Queensland for not fixing what wasn't broken.
*facepalm* So that's why I abandoned it, because it's a pain in the ass to have to update the scraping method each time one of the state governments does a departmental restructure or decides on a web refresh. This should not be surprising at all, because as with all web-scraping projects, diligent upkeep is required or things fall apart.
Anyway, I wanted to see it in action again, so I decided to make updates and document the process here. Fingers crossed some of the states/territories will come to their senses and make the data available in CSV/JSON/XML, but I'm not feeling hopeful. I will be lodging feedback with each entity to advocate that the data be made available in more accessible formats.
What is the Companion Card?
The Companion Card is a card provided to some people with a disability that enables them to take a support person with them to eligible venues without incurring the cost for that support person. It was introduced because it is discriminatory to expect a person to pay for a companion if that companion is required due to their disability. While it's not compulsory, businesses are encouraged to adopt its usage where the cost would not be prohibitive (e.g. it makes sense for a museum, but not so much for a restaurant).
It was an endeavour introduced by the Victorian government and now all states and territories have implemented the Companion Card program, with each state being responsible for issuing cards. I don't know why there wasn't a push for a federal companion card but thankfully the cards can be used between jurisdictions.
There used to be a National site available at https://companioncard.gov.au but it seems to have been decommissioned from 1 February 2022.
State/territory sites are as follows:
- Australian Capital Territory working :)
- New South Wales broken :(
- Northern Territory broken :(
- Queensland working :)
- South Australia broken :(
- Tasmania broken :(
- Victoria broken :(
- Western Australia broken :(
Fixing what's broken
New South Wales
Ho' boy the NSW government has had a massive overhaul of their digital stuff lately.
For their search results, they are using Elasticsearch and have it exposed at https://www.nsw.gov.au/api/v1/elasticsearch/prod_content/_search. A simple POST request with the correct query in the body and BAM, JSON data with name, description and category. Unfortunately we still need to scrape each individual page to get contact details and address.
Thiiis close NSW, you almost got 5 stars.
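A rough sketch of what such a request might look like, using only the standard library. The endpoint is the one quoted above; the query body (a match on a hypothetical category field, plus the from/size paging values) is an assumption for illustration, not the query the NSW site actually expects.

```python
import json
import urllib.request

# Endpoint quoted above.
SEARCH_URL = "https://www.nsw.gov.au/api/v1/elasticsearch/prod_content/_search"


def build_search_request(offset=0, page_size=50):
    """Build (but don't send) a POST request carrying an Elasticsearch query."""
    # Hypothetical query: the field name and value are placeholders.
    body = {
        "from": offset,
        "size": page_size,
        "query": {"match": {"category": "companion card"}},
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_hits(response_json):
    """Pull the source documents out of a standard Elasticsearch response."""
    hits = response_json.get("hits", {}).get("hits", [])
    return [hit["_source"] for hit in hits]
```

Passing the result of build_search_request() to urllib.request.urlopen would send it. Keeping the request builder separate from the network call makes the paging logic easy to check offline.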
Fixed! :)
Northern Territory
It looks like the NT government has updated their website in the past year. They used to have a very basic 1-page website which listed all the businesses, but now they have a dedicated website.
The result list is all well and good, but for some odd reason they don't include the affiliate "venue type" within the result set itself, so in order to get category data, multiple redundant requests needed to be made.
But we got there, and now we get a nice JSON object of the results instead of HTML.
Through playing around with it I discovered that the site was developed by TropicsNet.
Fixed! :)
South Australia
South Australia is an interesting case because they don't provide the list of affiliates through their webpage, they only offer a PDF file. Obviously it's broken because the PDF I was extracting from previously doesn't exist anymore.
I updated the code to look at the page where the PDF is linked and find the URL to the PDF that way. I need to run the function to get from remote or cache twice but I really can't be bothered to figure out why. I chalk it up to my n00b python state and I'm too lazy to search out why it's not working.
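Finding the PDF link on a page can be sketched roughly like this. It uses the standard library's HTMLParser rather than whatever the project itself uses, and the names PdfLinkFinder and find_pdf_url are made up for illustration. It assumes the first link ending in .pdf is the affiliate list, which may not hold if the page links to other PDFs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that points at a .pdf file."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_links.append(href)


def find_pdf_url(page_html, page_url):
    """Return the absolute URL of the first PDF link on the page, or None."""
    finder = PdfLinkFinder()
    finder.feed(page_html)
    if not finder.pdf_links:
        return None
    # Resolve relative links against the page they were found on.
    return urljoin(page_url, finder.pdf_links[0])
```

Because the link is looked up each run, a renamed PDF (say, a new month in the filename) shouldn't break the scrape as long as the page that links to it stays put.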
Come on South Australia, up your game mate. You only provide a list of business names in a PDF file. There is no contact information, no addresses, no other formats. You can do better, I believe in you.
Fixed! :)
Tasmania
Tasmania's website is interesting, they have a page titled "Tasmanian businesses that accept the Companion Card", but also a "Tasmanian Companion Card directory". Hmmm..
I was originally using the former, but for some reason it has stopped working with an error:
tas
Cached: /home/jacob/Code/companion_card_directory/data/tas/tasmanianbusinessesthatacceptthecompanioncard.html
Traceback (most recent call last):
File "companion_card_directory.py", line 10, in <module>
scrape()
File "/home/jacob/Code/companion_card_directory/companion_card_directory/scrape.py", line 399, in tas
name = strong.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
It is failing because I was originally extracting the text of strong elements within the 8th+ paragraphs within the #main-content div. Yuck. Very finicky... especially since that page looks to be manually curated.
My new strategy is to look at the other page, the "directory".
The URL: https://www.companioncard.communities.tas.gov.au/affiliates/directory/search?queries_region_query_posted=1
Then they appear to have an array variable queries_region_query that is populated like so:
- &queries_region_query[0]=nw
- &queries_region_query[1]=nor
- &queries_region_query[2]=south
- &queries_region_query[3]=sw
- &queries_region_query[4]=nat
Now let's see if we can add them all together and get an output of all affiliates: plz plz plz
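Stringing those parameters together can be sketched like this. The base URL and region codes are the ones listed above; build_directory_url is a made-up helper name, and whether the server is happy to receive all five regions in one request is an assumption.

```python
from urllib.parse import urlencode

BASE = "https://www.companioncard.communities.tas.gov.au/affiliates/directory/search"
REGIONS = ["nw", "nor", "south", "sw", "nat"]


def build_directory_url(regions=REGIONS):
    """Build a search URL that asks for every region at once."""
    params = [("queries_region_query_posted", "1")]
    # One indexed parameter per region: queries_region_query[0]=nw, and so on.
    params += [(f"queries_region_query[{i}]", code) for i, code in enumerate(regions)]
    # safe="[]" keeps the square brackets literal instead of percent-encoding them.
    return BASE + "?" + urlencode(params, safe="[]")
```

Without the safe argument, urlencode would turn the brackets into %5B and %5D. Many servers decode those fine, but keeping them literal matches the URLs the site itself generates.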
This page is so much easier to parse with BeautifulSoup.
But then I scrapped it, and refactored it to grab from each individual category page. Sure, it's not as efficient, but these government departments don't give me much choice since they don't provide easily digestible data formats.
Fixed! :)
Victoria
There's a certain sense of irony in the fact that the state that originally created a Companion Card is one of the only ones still making their data available in PDF only. At least the PDF contains an address and description.
I was hoping they would update since last year, but no luck.
Yep, the PDF still says updated 2016, lol.
This one was actually the easiest to fix: it seems like an idiosyncrasy with the pdfplumber library I was using. Adding a simple parameter and it worked just as it did before.
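For the 'bytes' object has no attribute 'seek' traceback near the top of this post, a common cause is handing raw bytes to something that expects a path or a file-like object. Below is a minimal sketch of one general workaround, which may or may not be the same change made here. The helper name as_file_like and the variable pdf_bytes are hypothetical.

```python
import io


def as_file_like(data):
    """Wrap raw bytes in a seekable buffer; pass anything else through unchanged."""
    if isinstance(data, (bytes, bytearray)):
        return io.BytesIO(data)
    return data


# Hypothetical usage, where pdf_bytes holds the downloaded file contents:
#   pdf = pdfplumber.open(as_file_like(pdf_bytes))
```

The traceback shows pdfplumber.open taking a path_or_fp argument, so a BytesIO buffer gives it something it can seek on.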
Fixed! :)
Western Australia
It's really easy to trigger a few different out-of-memory errors on this page: simply navigate to https://www.wacompanioncard.org.au/directory-affiliates/?_page=1&num=50 directly. All you have to do is navigate to the page, select "Show 50" at the bottom and then refresh the page.
But if you look closely, is it really an out of memory error?
Another one happens when you provide &_ajax_= as an empty parameter. That gives us an error in a different file. They probably could have done with some more testing on this site.
Rather than try to scrape the links to individual affiliates through this search page, I decided to go a different route. Thankfully the site is using the "Yoast SEO" plugin. This plugin automatically generates sitemaps that we can use, including this one: https://www.wacompanioncard.org.au/affiliates_dir_ltg-sitemap.xml. Oh look, a beautiful list of all the affiliate URLs.
Then it's just business-as-usual BeautifulSoup scrapy-fun-times.
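Pulling the URLs out of a sitemap like that can be sketched with the standard library alone. The namespace is the one defined by the sitemaps.org protocol; sitemap_urls is a made-up helper name, and this assumes the file is a plain urlset rather than a sitemap index.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

Each returned URL is then an individual affiliate page to fetch and parse.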
Fixed! :)
Stats
| State | Entries | Time Taken (Cached) | Time Per Entry |
|---|---|---|---|
| nt | 54 | 0.12 | 0.0022 |
| act | 90 | 5.68 | 0.0631 |
| nsw | 1105 | 28.52 | 0.0258 |
| qld | 844 | 2.66 | 0.0032 |
| wa | 589 | 29.37 | 0.0499 |
| sa | 764 | 1.93 | 0.0025 |
| tas | 226 | 0.15 | 0.0007 |
| vic | 661 | 9.72 | 0.0147 |
Of course I'm sure these stats aren't very useful because they're more an indication of how poorly optimized the code is and how the data is sourced.
Merging all the data together
So the fun part begins, how do we put this information together so it is more meaningful and easy to use?
In the NSW scrape, I originally included the "state region" key, which was a region as defined by the NSW government. I thought it was a good idea to have the regions listed, as somebody may be going on a holiday to a particular destination.
To do this, postcode information is taken from the list published by the awesome Matthew Proctor. I've used this list before and it's extremely valuable. Auspost charge you for this information otherwise.
I extracted the SA3 regions into separate state files, organised by postcode, e.g. in postcodes/nsw.json:
"2500": {
"postcode": "2500",
"region": "Wollongong",
"state": "nsw"
}
This makes it easier to look up later.
After battling my way through Python, getting frustrated that I knew exactly how to overcome my problem using PHP, and subsequently reminding myself I'm using Python to stretch myself, I finally got the bits of information together. Each business is now associated with an SA3 Region! There are a few empty records coming from Victoria but that's a task for another time.
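The postcode-to-region lookup can be sketched roughly as below. The dictionary mirrors the postcodes/nsw.json excerpt above; the helper names and the regex approach (take the last standalone 4-digit number in the address) are assumptions for illustration, not necessarily how the project does it.

```python
import re

# Shaped like the postcodes/nsw.json excerpt above.
NSW_POSTCODES = {
    "2500": {"postcode": "2500", "region": "Wollongong", "state": "nsw"},
}

# Australian postcodes are 4 digits. A 4-digit street number in an address
# with no postcode would be a false positive here.
POSTCODE_RE = re.compile(r"\b\d{4}\b")


def find_postcode(address):
    """Return the last standalone 4-digit number in the address, or None."""
    matches = POSTCODE_RE.findall(address)
    return matches[-1] if matches else None


def region_for(address, postcodes):
    """Look up the region for an address; empty string when unknown."""
    postcode = find_postcode(address)
    entry = postcodes.get(postcode) if postcode else None
    return entry["region"] if entry else ""
```

Addresses without a postcode simply come back with an empty region, which lines up with the empty region fields in some of the records.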
Hooray, now anyone who wants to know about Companion Card affiliate businesses around the country can do so :)
I had also considered normalizing the category names, as each state and territory does it slightly differently, but that's a task for another time.
The data and script
- Total Records: 4334
- Filesize (CSV): 497 KiB
- Filesize (JSON): 1.1 MiB
You can download the minimized dataset here (I have not linked because webcrawlers):
- https://mulquin.com/articles/companion-card-directory/all.csv
- https://mulquin.com/articles/companion-card-directory/all.json
You can find the source here: mulquin/companion_card_directory
Here are a few sample records:
{
"address": "255 Keira Street, Wollongong, 2500 NSW",
"category": "Museums and galleries",
"email": "",
"facebook": "",
"instagram": "",
"name": "Project Contemporary Artspace",
"phone": "+61 431 542 309",
"region": "",
"state": "nsw",
"twitter": "",
"website": "http://www.projectgallery.com.au/"
},
{
"address": "PO Box 142, Wonthaggi, 3995",
"category": "",
"email": "",
"facebook": "",
"instagram": "",
"name": "Wonthaggi Agricultural Show Society",
"phone": "",
"region": "Gippsland - South West",
"state": "vic",
"twitter": "",
"website": ""
},
{
"address": "Champions Way, WILLOWBANK",
"category": "Sport and Recreation",
"email": "",
"facebook": "",
"instagram": "",
"name": "Willowbank Raceway",
"phone": "(07) 5461 5461",
"region": "",
"state": "qld",
"twitter": "",
"website": "http://www.willowbankraceway.com.au"
},
{
"address": "",
"category": "",
"email": "",
"facebook": "",
"instagram": "",
"name": "ABC Collinswood Centre, Adelaide",
"phone": "",
"region": "",
"state": "sa",
"twitter": "",
"website": ""
},
{
"address": "",
"category": "Events and Festivals",
"email": "",
"facebook": "",
"instagram": "",
"name": "Darwin Festival",
"phone": "08 8943 4200",
"region": "",
"state": "nt",
"twitter": "",
"website": "http://www.darwinfestival.org.au"
},
{
"address": "Level 5, 2 Kavanagh Street, Southbank VIC 3006",
"category": "",
"email": "",
"facebook": "",
"instagram": "",
"name": "Australian Ballet",
"phone": "1300 369 741",
"region": "",
"state": "act",
"twitter": "",
"website": "http://www.australianballet.com.au/"
},
{
"address": "154 CONNELL RD, WEST END, WA 6530",
"category": "Family activities, Tourist attractions",
"email": "",
"facebook": "",
"instagram": "",
"name": "Abrolhos Adventures",
"phone": "(08) 9942 4515",
"region": "Mid West",
"state": "wa",
"twitter": "",
"website": "https://www.abrolhosadventures.com.au/"
},
{
"address": "",
"category": "Entertainment and the arts",
"email": "",
"facebook": "",
"instagram": "",
"name": "Circus Oz",
"phone": "",
"region": "",
"state": "tas",
"twitter": "",
"website": "http://www.circusoz.com/"
}
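Records shaped like the samples above could be written out to CSV and JSON along these lines. The field list is taken from the sample records; the function names and the choice of compact, key-sorted JSON are assumptions, not necessarily what the project's script does.

```python
import csv
import io
import json

# Field names taken from the sample records above.
FIELDS = [
    "address", "category", "email", "facebook", "instagram", "name",
    "phone", "region", "state", "twitter", "website",
]


def records_to_csv(records):
    """Serialise records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    # restval fills in missing keys; extrasaction drops unexpected ones.
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def records_to_json(records):
    """Serialise records to compact JSON with sorted keys."""
    return json.dumps(records, sort_keys=True, separators=(",", ":"))
```

A fixed column order means every state's records line up in the CSV, even when a scraper only found a name.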
In Summary
I'd like to give each state and territory a score out of 5 for their data and a brief judgement from my perspective.
- Australian Capital Territory [πππππ]
- Didn't fix what's not broken
- Provide pretty decent data coverage
- No search functionality
- Pretty easy to scrape
- New South Wales [πππππ]
- Thank you for the JSON, thank you
- Great data coverage
- Easy-to-use-app
- Would have given 5 stars if the search JSON included contact details/address
- Northern Territory [πππππ]
- Pretty run-of-the-mill
- Would be good to have the venue type inside the search itself
- Queensland [πππππ]
- Didn't fix what's not broken
- Data has good coverage, except not having postcodes on the address (so could not get regions)
- Can't access their website currently because of Error code: SSL_ERROR_UNSAFE_NEGOTIATION
- South Australia [πππππ]
- Wooden spoon award
- Terrible data coverage (name ONLY)
- PDF only, despite it being well formatted
- Tasmania [πππππ]
- Older style CMS means navigation is confusing
- Mediocre data coverage
- Victoria [πππππ]
- PDF only (offering 2 separate files with different ordering)
- PDFs look like they were ripped straight from Excel, just export as CSV!
- Just OK data coverage (no phone number)
- Western Australia [πππππ]
- Strange issues with the search
- Good data coverage
And to all the state and territory employees that are undoubtedly perusing the hotbed of action that is my website, please please please fight to have this data really accessible. There's nothing to gain from being coy with this data: it's publicly available, and both the people with disabilities and the businesses would thank you for making it available. Remember, we want them to connect!
Until next time!