The Longitudinal Australian Business Integrated Intelligence (LABii) DataVault was developed by Moyle, Pandey and Stern (2018). It is a longitudinal panel of every business in Australia since the introduction of the GST in 1999. The primary dataset is the Australian Business Register (ABR), supplied by the Australian Taxation Office (ATO). The DataVault currently holds over 15 million unique ABNs and is growing by almost 1 million records per year. LABii comprises separate public and non-public datasets, including the Australian Business Register, IP Australia's registers of patents, trade marks, designs and plant breeder's rights, mergers and acquisitions data, and Australian Securities Exchange (ASX) listings, amongst others. In 2020 it was integrated with the ABS BLADE. New data becomes available from these and other sources on an ad hoc or periodic basis.
To continue to develop and automate the DataVault and deliver on key projects for the Queensland Government, the LABii Team is seeking a skilled data wrangler to develop and implement a reproducible data processing pipeline. The role involves:
- Automating the periodic download of data from a range of sources, and investigating new sources for inclusion
- Working with the LABii Team and eResearch to trial a number of database structures and determine the most efficient one for sustainable long-term database management. A cleaned and processed DataVault is currently around 30GB. We currently process it in STATA, which holds the whole dataset in memory, so our computers run slowly because of limited RAM. We are trialling alternatives such as SQL, Hadoop and AWS/Microsoft tools to determine whether there is a more efficient way to process and manage the data and reduce run times, particularly as the database grows.
- Developing and documenting the process for importing the downloaded data into the chosen data structure, as well as cleaning, processing, integrating and extracting analysis-ready data (this includes creating derived variables and delivering summary tables to support reporting and analysis). It is very important that your work is reproducible, maintainable, sustainable and extensible. Others will have to run, maintain and extend the systems you develop, so your design, organisation, implementation and documentation practices should support this.
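As an illustration only of the kind of processing described above, the sketch below streams a CSV extract into a database in fixed-size batches, so the full file is never held in memory, and then asks the database engine to build a summary table. The table and column names (`businesses`, `abn`, `state`, `registration_year`) and both function names are hypothetical and do not reflect the DataVault's actual schema. SQLite is used only because it ships with Python, not because it is the structure the team has chosen.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_csv_in_batches(csv_file, conn, batch_size=50_000):
    """Stream rows from an open CSV file into SQLite, one batch at a time.

    Only `batch_size` rows are held in memory at once. Returns rows read.
    """
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS businesses ("
        "abn TEXT PRIMARY KEY, state TEXT, registration_year INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows_read = 0
    while True:
        batch = [
            (row["abn"], row["state"], int(row["registration_year"]))
            for row in islice(reader, batch_size)
        ]
        if not batch:
            break
        # INSERT OR REPLACE keeps re-runs idempotent: an ABN is never duplicated.
        conn.executemany(
            "INSERT OR REPLACE INTO businesses VALUES (?, ?, ?)", batch
        )
        conn.commit()
        rows_read += len(batch)
    return rows_read


def registrations_by_state(conn):
    """Build a summary table inside the database rather than in memory."""
    return conn.execute(
        "SELECT state, COUNT(*) FROM businesses GROUP BY state ORDER BY state"
    ).fetchall()


if __name__ == "__main__":
    # Tiny made-up extract standing in for a downloaded file.
    sample = io.StringIO(
        "abn,state,registration_year\n"
        "111,QLD,2001\n"
        "222,NSW,2005\n"
        "333,QLD,2010\n"
    )
    connection = sqlite3.connect(":memory:")
    load_csv_in_batches(sample, connection, batch_size=2)
    print(registrations_by_state(connection))
```

Because each batch is committed separately and inserts are keyed on the identifier, the same script can be re-run against a fresh download without duplicating records, which is one simple way to keep a periodic import reproducible.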
It is critical that your work aligns well with QUT’s eResearch environment and is carried out in consultation with the eResearch and LABii Team, as well as our external partners. We believe our ideal Data Wrangler will have demonstrable skills and experience in:
- Data wrangling (database construction, management, and processing including knowledge of SQL, Hadoop, AWS or Microsoft Azure)
- Data automation (coding in multiple languages such as Python, STATA and R)
- Data science and reproducible research (including the ability to code and document your work)
- Communicating clearly and working well in a team
If you are interested in working with the LABii Team, please contact: Dr Char-lee Moyle, firstname.lastname@example.org. Applications close 21 May 2021.