Analyzing UofT PEY Co-op Job Postings Part 1: Scraping Job Board Data


May 22, 2024
52 min read

UofT PEY Co-op

Whether you’re a new applicant to engineering/computer science at the University of Toronto (UofT) or someone who’s going through their 1st, 2nd, or even 3rd or 4th years, you’ve probably been curious at one point or another about what 12-16 month job positions are actually posted on the much acclaimed PEY Co-op job board.

Well, as a computer engineering student at UofT who’s just finished their third year (and consequently have been able to access the portal for the entire past two semesters), I thought it would be interesting to do my own data collection on the topic, along with some analysis on recurring patterns (e.g. locations, international opportunities, etc.), and share both my findings and the raw data (thousands of jobs posted across several months) so that future PEY Co-op students can get a better idea of what to expect from the portal.

SELECT * FROM JobPosting LIMIT 5;
| id | postingDate | title | company | companyDivision | companyWebsite | location | locationType | numPositions | salary | startDate | endDate | function | applicationDeadline |
|----|-------------|-------|---------|-----------------|----------------|----------|--------------|--------------|--------|-----------|---------|----------|---------------------|
| 43541 | 2023-09-17 | Business Insights Analyst | Air Canada | Main Division | www.aircanada.com | Brampton | Hybrid: 3 days on-site | 1 | $0.00 hourly for 40.0 hours per week | 09/01/2024 | 08/31/2025 | Operations | Nov 15, 2023 11:59 PM |
| 43902 | 2023-09-17 | Full Stack Engineer, Business Systems | ecobee | Main Office | www.ecobee.com | Toronto, ON | Remote or Hybrid | 2 | $0.00 hourly for 0.0 hours per week | 05/06/2024 | 04/25/2025 | Engineering | Oct 6, 2023 11:59 PM |
| 44010 | 2023-09-17 | Software Engineering Intern | GGY a Moodys Analytics Company | AXIS | moodysanalytics.com | Toronto | On-Site | 1 | Salary Not Available, 40.0 hours per week | 05/06/2024 | 09/08/2025 | Information Technology (IT) | Sep 29, 2023 11:59 PM |
| 44045 | 2023-09-17 | Compute Kernel Development Engineering Intern | Untether AI | Main Office | untether.ai | Toronto | On-Site | 1 | Salary Not Available, 40.0 hours per week | 05/06/2024 | 04/25/2025 | Engineering | Oct 8, 2023 11:59 PM |

Scraping job posting data

The UofT PEY Co-op job board itself is located behind a login portal at https://www.uoftengcareerportal.ca/notLoggedIn. To get in, you need to:

  1. be a student at UofT (to have valid login credentials);
  2. be enrolled in the PEY Co-op program; and
  3. be registered to start your 12-16 month internship sometime between May and September following your current year.

All of which means that unless you’re a keen student in your second year who’s opted in to get access and do your PEY early, you’re either in your third year or you don’t have access to the job portal. As an engineering student who’s just finished their third year (at the time of writing), I’ve had privileged access for 8 months and counting, and I’ve been able to save data on every single job posted on the portal since the very first day I’ve had access (I came prepared).

PEY job board login portal
The PEY job board login portal at https://www.uoftengcareerportal.ca/notLoggedIn

What you can expect from the job board

The landing page for the PEY Co-op job board has the same look as the one for all the cross-institutional job boards at UofT’s Career Learning Network (CLNx), and it uses the same organization of elements and processes for browsing, searching for, and applying to job postings. So if you’re a UofT student who’s used CLNx in the past (e.g. for applying to work study positions), then you already know what using the PEY job board is like (from what I’ve seen on Reddit, I’m pretty sure WaterlooWorks uses the same frontend as well).

PEY Co-op job board viewed jobs
Viewed jobs (Viewed)

Sample job postings

Your average PEY job postings look like the ones below.

Note: The design on the left was used for a couple years, until the most recent redesign in late 2023 (which took place halfway through my fall semester of third year). If you’re an upcoming PEY Co-op student, job postings should look like the image on the right for you (Edit 2024-11-14: the job board has been redesigned again, and now looks unlike either image; check out my post on 2024-2025 UofT Co-op job postings to see what they look like now).

Nothing much to look at, just some basic tables with information about the job and the company, which thankfully are simple to parse.

Getting the posting date for jobs

One thing that’s been important to me since the very start of this project is making sure that the timestamps of job postings are available to view. As someone who took a few months before landing a position I was really satisfied with, there were times when I felt a bit apprehensive about turning down offers for roles that I thought were fine but didn’t feel excited about. I had no data beyond anecdotes from upper years about what’s posted over the course of the fall and winter semesters, so I couldn’t really predict whether that dream role I had in mind was just a couple days from being posted (or whether jobs would start drying up and I should stick with whatever I had at that point in time). That’s why I hope that at least one of the things this project can provide is some reassurance to future PEY students about what jobs are posted and when.

One thing to keep in mind: every year is different, and just because some company posted some number of jobs at some point in time doesn’t mean that they’ll do it again next year. Of course, it also doesn’t mean that they won’t, so take your takeaways from the data with a grain of salt.

There’s just one problem: there’s absolutely no data indicating when a job was posted.

Well, except for one thing: the New Posting Since Last Login button on the landing page.

Whenever you log in to the portal, that New Posting Since Last Login button gets updated with links to all of the jobs posted since your last login. So if you were to check the job board every single day and save the data for all of the job postings shown there each time you log in, well, then you’ve successfully attributed a posting date to every single job.

Which is why that’s exactly what I’ve done for the 263 days that I’ve had access to the portal (from 2023-09-17 to 2024-06-06).

How I’ve been saving posting dates for jobs

It’s all thanks to Gildas Lormeau’s SingleFile web extension, which allows for saving a complete web page into a single HTML file (unlike Chrome1). In addition (and rather importantly for making the whole archival process a not-headache), SingleFile can save the pages of all tabs in the current window in one go. While SingleFile is also usable as a CLI, the tricky navigation on the PEY job board website means that manually navigating to pages and then saving them using the extension is a lot easier.

By Ctrl-clicking on every single job posting shown behind New Posting Since Last Login (so that every new posting opens in a new tab) and then using the SingleFile extension to save the pages of all tabs in one go (via the Save all tabs option), I’m able to condense the whole process of saving new postings for the day to just 1-2 minutes. Each day’s postings go into a timestamped folder (made faster thanks to a handy AutoHotkey script that’s always a keyboard shortcut away), which itself goes into a big folder on my local computer of all PEY job postings collected thus far, and I’ve got myself data on almost 2k job postings just waiting to be analyzed for some insights.

📁PEY POSTINGS ARCHIVE/
├── 📁2023-09-17_20-14-10/
│   ├── ...Job Postings (9_17_2023 8_13_11 PM).html
│   ├── ...Job Postings (9_17_2023 8_13_12 PM).html
│   └── ...
├── 📁2023-09-18_00-51-40/
│   ├── Job ID 43541...(9_18_2023 12_51_48 AM).html
│   ├── Job ID 43554...(9_18_2023 12_52_08 AM).html
│   └── ...
├── 📁2023-09-18_16-08-36/
├── ...
└── 📁2024-05-17_16-25-26/
A snapshot of my local directory structure with HTML pages for saved job postings. You might notice that the naming convention for HTML files are slightly different across the folders shown. That's because my file naming system changed a few times over the course of my two semesters (due to the job board redesign and my reconfiguring of options for SingleFile).
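For the curious, the timestamped folder names follow a YYYY-MM-DD_HH-MM-SS pattern. My AutoHotkey script isn't reproduced here, but a rough Python equivalent for generating a folder with that naming convention (the `make_archive_folder` name and default root are just illustrative) might look like:

```python
from datetime import datetime
from pathlib import Path

def make_archive_folder(root="PEY POSTINGS ARCHIVE", now=None):
    """Create a timestamped subfolder like 2023-09-18_00-51-40 and return its path."""
    now = now or datetime.now()
    folder = Path(root) / now.strftime("%Y-%m-%d_%H-%M-%S")
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

The same `%Y-%m-%d` prefix is what later lets a parser recover the posting date from each folder name.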

Why not write a script to automate saving postings?

Is it possible to automate the whole process of saving data for job postings?

Technically, yes, it’s absolutely feasible, but given how easy it is to manually save data for a couple hundred job postings in a minute or two (with the assistance of a few scripts to make the Ctrl-clicking a lot faster), it’s just not worth the time to make the routine task more efficient (I’d be spending more time than I’d save, as any XKCD reader can relate to).

Storing job postings in a database

HTML is fine for temporary storage purposes2, but I need something that will allow me to view and analyze the data in a fast, efficient, and easy-to-use manner.

Enter the underrated gem of database technologies: SQLite.

Now, I’ve never used SQLite before (just PostgreSQL, with a bit of Python on the side via psycopg2), but thanks to the sqlite3 documentation for Python, I’m right at home as someone who’s used psycopg2 in the past (and of course Python is ideal for the scale of data I’m working with here, just a couple thousand HTML files).
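As a taste of that familiarity, sqlite3 follows the same DB-API connect/cursor/execute flow as psycopg2, with `?` placeholders instead of `%s` (the demo table and values below are made up for illustration, not part of my actual schema):

```python
import sqlite3

# connect() takes a file path (or ":memory:") instead of a DSN/connection string
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, title TEXT)")
# sqlite3 uses '?' placeholders where psycopg2 uses '%s'
cur.execute("INSERT INTO demo (id, title) VALUES (?, ?)", (43628, "ML Developer"))
conn.commit()

title = cur.execute("SELECT title FROM demo WHERE id = ?", (43628,)).fetchone()[0]
```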

But before I can start inputting all the data into a SQLite database, I need to figure out how to extract the key information (e.g. job title, location, description, company etc.) first.

Extracting data from HTML

If it were just a single page, I could use something like Microsoft Edge’s Smart Copy or the Table Capture extension and call it a day, but extracting data from almost 2k pages is a whole different ballgame.

The HTML code for each job posting page doesn’t have the best formatting…

Job ID 43628 UofT Engineering Career Centre...Job Postings (9_18_2023 12_52_04 AM).html
<div class=orbisModuleHeader>
<div class=row-fluid>
<div class=span2>
<div style=text-align:center>
<h1>
Job ID
: 43628
</h1>
</div>
</div>
<div class=span7>
<h1>
ML Developer - Software Automation &amp; Developer Infrastructure
</h1>
<h5>
 
Cerebras Systems - Computer Science
</h5>
</div>
<div class=span3>
 
<ul class=pager>
 
<li>
<a href=javascript:void(0)><i class=icon-chevron-left></i>Back to Overview</a>
</li>
...
The page for the Cerebras position above has 3506 empty lines with only 809 lines dedicated to actual code, and the HTML code on non-empty lines isn't exactly the most pleasing to read.

…but thanks to everything being stored in tables I can just use Python’s trusty BeautifulSoup4 library on my locally saved HTML pages and get the text in every table data cell in less than 50 lines of code:

parse.py
import argparse
from bs4 import BeautifulSoup
 
def parse_html_file(filepath, verbose=False):
    with open(filepath, 'r', encoding='utf-8') as file:
        html_content = file.read()
 
    soup = BeautifulSoup(html_content, 'lxml')
 
    data = {}
    rows = soup.find_all('tr')  ## find all table rows
 
    for row in rows:
        tds = row.find_all('td')  ## find all table data cells
 
        if len(tds) >= 2:
            label_td = tds[0]
            label_text = '\n'.join(label_td.stripped_strings)
 
            value_td = tds[1]
            value_text = '\n'.join(value_td.stripped_strings)
 
            data[label_text] = value_text
 
    if verbose:
        for key, value in data.items():
            print(f"{key}: {value}")
    else:
        print("Parsing completed.")
 
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Parse HTML for 2 column table data.")
    parser.add_argument("-f", "--filepath", required=True, help="Path to the HTML file to be parsed.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Print parsed data.")
 
    args = parser.parse_args()
    parse_html_file(args.filepath, args.verbose)

Trying out the quickly written parse.py script on one of the saved HTML pages shows that it’s able to get the values for all of the table fields with no issues:

python parse.py --verbose --filepath "Job ID 43628....html"
Output of above command
Position Type:: Professional Experience Year Co-op (PEY Co-op: 12-16 months)
Job Title:: ML Developer - Software Automation & Developer Infrastructure
Job Location:: Toronto
Job Location Type:: Flexible
If working on site, can you provide a copy of your COVID-19 safety protocols?:: No
Number of Positions:: 1
Salary:: $42.00 hourly for 40.0 hours per week
Start Date:: 05/06/2024
End Date:: 04/25/2025
Job Function:: Information Technology (IT)
Job Description:: Cerebras Systems has pioneered a groundbreaking chip and system that revolutionizes deep learning applications. Our system empowers ML researchers to achieve unprecedented speeds in training and inference workloads, propelling AI innovation to new horizons.
The Condor Galaxy 1 (CG-1), unveiled in a recent announcement, stands as a testament to Cerebras' commitment to pushing the boundaries of AI computing. With a staggering 4 ExaFLOP processing power, 54 million cores, and 64-node architecture, the CG-1 is the first of nine powerful supercomputers to be built and operated through an exclusive partnership between Cerebras and G42. This strategic collaboration aims to redefine the possibilities of AI by creating a network of interconnected supercomputers that will collectively deliver a mind-boggling 36 ExaFLOPS of AI compute power upon completion in 2024.
Cerebras is building a team of exceptional people to work together on big problems. Join us!.
About The Role
As a Machine Developer - Software Automation & Developer Infrastructure Engineer, you will use your knowledge of testing and testability to influence better software design, promote proper engineering practice, bug prevention strategies, testability, scalability, and other advanced quality concepts. The position will play a huge role in the quality of Cerebras software. We are looking for engineers that have a broad set of technical skills and who are ready to tackle the biggest at-scale problems in HW-based deep learning accelerators.
Responsibilities
Write scripts to automate testing and create tools to allow easy development of software regression tests
Help identify weak spots and potential customer pain points and drive the software organization towards customer focused quality metrics
Implement creative ways to break software and identify potential problems
Contribute to developing requirements specifications with a focus on developing verification tests
Job Requirements:: Requirements
Enrolled within University of Toronto's PEY program with a degree in Computer Science, Computer Engineering, or any other related discipline
Experience in developing automated tests for compute/machine learning or networking systems within a large-scale enterprise environment
Ability to take responsibility for monitoring product development and usage at all levels with an end goal toward improving product quality
Strong knowledge of software system design, C++ and Python
Preferred
Strong software testing experience with a proven track record in scaling highly technical teams
Knowledge of UNIX/Linux and Windows environments
Knowledge of neural network architecture and ML/AI deep learning principles
Prior experience in designing and developing test automation for HW systems involving ASICs or FPGAs
Prior experience working with live hardware systems and debug tools operating in a real time environment such as networking devices or live computing systems
Preferred Disciplines:: Computer Engineering
Computer Science
Engineering Science (Electrical and Computer)
Engineering Science (Infrastructure)
Engineering Science (Machine Intelligence)
Engineering Science (Robotics)
All Co-op programs:: No
Targeted Co-op Programs:: Targeted Programs
Professional Experience Year Co-op (12 - 16 months)
Application Deadline:: Nov 1, 2023 11:59 PM
Application Receipt Procedure:: Online via system
If by Website, go to:: https://www.cerebras.net/careers/?gh_jid=5321500003
Additional Application Information:: Please apply with
both resume & transcript.
Lacking transcript will disqualify you from being considered.
Note that applications will be considered on a rolling basis. Apply as early as possible.
Note to PEY Co-op applicants: In addition to your application by email/website, please ensure that you select the “I intend to apply for this position” tab on the portal.  This will give us a record of your submitted application in the event that you will be invited for interviews.
U of T Job Coordinator:: Yasmine Abdelhady
Organization:: Cerebras Systems
Division:: Computer Science
Website:: https://cerebras.net/
Length of Workterm:: FLEXIBLE PEY Co-op: 12-16 months (range)
Output of the parse.py script for the Cerebras job posting.

Finetuning data extraction

There are a few nuances to the data extraction that mean this simple script needs just a bit more extending before it can properly parse the entire dataset.

I’m getting the text corresponding to any inline links, but the links themselves aren’t included since they’re within HTML <a> tags, so I need to add handling for those as well. Another nuance is the fact that the formatting for HTML pages has changed3 (several times, actually) over the course of the last two semesters.

Since the location of the data has always remained in tables, that’s largely a non-issue. While the job title and company name included in the header above any table on job posting pages are currently missed by the script, that data is also present in the tables below (and extracted by the script properly), so that, too, is a non-issue.
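As for the inline links, a rough sketch of the handling I have in mind is to append each link's href in parentheses after its visible text (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# a made-up value cell containing an inline link
html = '<td>Apply at <a href="https://example.com/careers">our careers page</a></td>'
value_td = BeautifulSoup(html, "html.parser").find("td")

value_text = "\n".join(value_td.stripped_strings)
# append each link's URL in parentheses after its visible text
for link in value_td.find_all("a"):
    value_text = value_text.replace(link.get_text(), f"{link.get_text()} ({link.get('href')})")
```

After this, `value_text` reads `Apply at\nour careers page (https://example.com/careers)`, so application URLs survive the conversion to plain text.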

With one exception: job IDs aren’t extracted.

Luckily, I had the foresight to configure SingleFile to include the job ID automatically as part of the filename for each HTML page back when I started the archival process, so I can add some logic to parse that as well. Unluckily, however, there were a few periods of time where that configuration was lost4, so I’m going to have to do some file contents parsing regardless.

A quick manual inspection of the different HTML files shows that there are only two places where job IDs can be located (one for the old design, and the other for the new design), so I can just add another function that checks both locations in every file, and I’m now able to extract all the job IDs as well:

parse.py
import argparse
from bs4 import BeautifulSoup
import re
 
def extract_job_id_from_html(soup):
    ## try to find job ID in a <h1> tag with the specific class
    header_tag = soup.find('h1', class_='h3 dashboard-header__profile-information-name mobile--small-font color--font--white margin--b--s')
    if header_tag:
        header_text = header_tag.get_text(strip=True)
        match = re.match(r'^(\d+)', header_text)
        if match:
            return match.group(1)
 
    ## if not found, try to find an <h1> tag containing the words "Job ID"
    job_id_tag = soup.find('h1', string=re.compile(r'Job ID', re.IGNORECASE))
    if job_id_tag:
        job_id_text = job_id_tag.get_text(strip=True)
        match = re.search(r'Job ID\s*:\s*(\d+)', job_id_text, re.IGNORECASE)
        if match:
            return match.group(1)
 
    return None
 
def parse_html_file(filepath, verbose=False):
    with open(filepath, 'r', encoding='utf-8') as file:
        html_content = file.read()
 
    soup = BeautifulSoup(html_content, 'lxml')
 
    data = {}
    job_id = extract_job_id_from_html(soup)
    if job_id:
        data['Job ID'] = job_id
 
    rows = soup.find_all('tr')  ## find all table rows
 
    for row in rows:
        tds = row.find_all('td')  ## find all table data cells
 
        if len(tds) >= 2:
            label_td = tds[0]
            label_text = '\n'.join(label_td.stripped_strings)
 
            value_td = tds[1]
            value_text = '\n'.join(value_td.stripped_strings)
 
            data[label_text] = value_text
 
    return data
 
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Parse HTML for 2 column table data.")
    parser.add_argument("-f", "--filepath", required=True, help="Path to the HTML file to be parsed.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Print parsed data.")
 
    args = parser.parse_args()
    data = parse_html_file(args.filepath, args.verbose)
 
    if args.verbose:
        for key, value in data.items():
            print(f"{key}: {value}")
    else:
        print("Parsing completed.")

Now with a method of easily extracting all the relevant data from each HTML page, all that’s left is to automate running the parser across all files saved within timestamped subdirectories on my local computer and pipe the data into a SQLite DB.

Storing data in SQLite

Why SQLite?

SQLite sits right in the sweet spot between raw data formats (like CSV and JSON), which are a good fit for simple data (e.g. temperature readings, word lists) but aren’t as great for larger, more complex datasets (especially when it comes to data analysis), and larger RDBMSs (relational database management systems), which are better designed for scalability but are really just overkill for the little pet project I have here at hand.

And the fact that SQLite is a single file on disk means that sharing the extracted and processed data is a lot easier too, which is important for me because I want future PEY Co-op students to learn what they can from the dataset and better set their expectations for the program (and I hope the data helps prospective UofT students make their admission decisions as well).

Most importantly, SQLite is serverless (unlike PostgreSQL or MySQL), which saves me a lot of headache and setup for this relatively small-scale project (especially since sqlite3 is built into Python’s standard library!).

Pipelining data from ~2k HTML files to a single SQLite DB

Using SQLite with Python via sqlite3 is simple enough. All I need to do is add some additional code for extracting the job posting date from the parent folder for each job posting’s HTML page and for iterating across every subfolder for PEY job postings on my local computer, as well as draft up a schema for the SQLite DB and write some code for piping data from Python dictionaries to the DB file.

Database Schema

Thankfully, all job postings share largely the same fields, so (for the time being) the schema just ends up being an amalgamation of all the relevant table fields.

One small snag is that Application Receipt Procedure and Application Method are often used interchangeably (Application Receipt Procedure largely before the redesign, and Application Method after), and when one field is used the other is usually left blank. The exception is the handful of positions that use both, which means I can’t combine them into a single field without losing data (e.g. sometimes Application Method for a posting states to use the built-in applicant tracking system while Application Receipt Procedure points to some company link, which is likely just attributable to job posters sticking with the default options when creating a new post). It’s largely a non-issue, and after some further exploration of the specific overlapping data I should be able to easily combine both into a single field.
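As a sketch of what that eventual combining could look like, SQLite's COALESCE can pick whichever field is populated, with a CASE to keep both values for the rows where the two fields overlap (the mini-table, values, and applicationHow column name here are all made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE JobPosting (
    id INTEGER,
    applicationMethod TEXT,
    applicationReceiptProcedure TEXT
)""")
cur.executemany("INSERT INTO JobPosting VALUES (?, ?, ?)", [
    (1, "Online via system", None),         # new-style field only
    (2, None, "Online via system"),         # old-style field only
    (3, "Online via system", "By website"), # both set: keep both
])

# COALESCE returns the first non-NULL argument; the CASE branch concatenates
# the two fields when both are set so no data is lost
rows = cur.execute("""
    SELECT id,
           CASE WHEN applicationMethod IS NOT NULL
                 AND applicationReceiptProcedure IS NOT NULL
                THEN applicationMethod || ' / ' || applicationReceiptProcedure
                ELSE COALESCE(applicationMethod, applicationReceiptProcedure)
           END AS applicationHow
    FROM JobPosting
    ORDER BY id
""").fetchall()
```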

I could potentially extend the schema with functional dependencies for repetitive data like company names, but it’s really only company (and maybe a few other fields) where values are guaranteed to repeat (Job Location Type, Salary, etc. can vary wildly and mix text with numbers, so there’s not much we can do to unify them). Normalizing wouldn’t be the best representation for a general-purpose database anyway, since lossy conversions can always be made later for analysis.

Regardless, I can update the previous Python script used for parsing to handle the new SQLite schema easily enough:

parse_to_db.py
...
def store_data_in_db(data, db_cursor):
    columns = ', '.join([f'"{key}"' for key in data.keys()])
    placeholders = ', '.join(['?' for _ in data.values()])
    sql = f'INSERT INTO "JobPosting" ({columns}) VALUES ({placeholders})'
    try:
        db_cursor.execute(sql, tuple(data.values()))
    except sqlite3.IntegrityError:
        logging.info("Integrity Error: Skipping row")
        pass
 
def create_db_schema(db_cursor):
    db_cursor.execute('''
    CREATE TABLE IF NOT EXISTS JobPosting (
        id INTEGER,
        postingDate DATE,
        title TEXT,
        company TEXT,
        companyDivision TEXT,
        companyWebsite TEXT,
        location TEXT,
        locationType TEXT,
        numPositions INTEGER,
        salary TEXT,
        startDate TEXT,
        endDate TEXT,
        function TEXT,
        description TEXT,
        requirements TEXT,
        preferredDisciplines TEXT,
        applicationDeadline TEXT,
        applicationMethod TEXT,
        applicationReceiptProcedure TEXT,
        applicationDetails TEXT,
        PRIMARY KEY(id, postingDate)
    )
    ''')
 
if __name__ == "__main__":
    ...
github.com/sadmanca/uoft-pey-coop-job-postings/blob/main/parse_to_db.py
Parsing to .db

Finally, with some additional code for properly parsing through all subfolders in my local directory and setting the postingDate value based on the folder name5 for each file, I can transform the entire dataset of >1.8k job postings into a single .db file under 10MB in size in ~3 min (which might’ve taken even less time if it weren’t running on an old spinning hard drive).

parse_to_db.py
import argparse
from bs4 import BeautifulSoup
import re
import os
import sqlite3
from tqdm import tqdm
import logging
 
def extract_job_id_from_html(soup):
    ## Try to find job ID in a <h1> tag with the specific class
    header_tag = soup.find('h1', class_='h3 dashboard-header__profile-information-name mobile--small-font color--font--white margin--b--s')
    if header_tag:
        header_text = header_tag.get_text(strip=True)
        match = re.match(r'^(\d+)', header_text)
        if match:
            return match.group(1)
 
    ## If not found, try to find an <h1> tag containing the words "Job ID"
    job_id_tag = soup.find('h1', string=re.compile(r'Job ID', re.IGNORECASE))
    if job_id_tag:
        job_id_text = job_id_tag.get_text(strip=True)
        match = re.search(r'Job ID\s*:\s*(\d+)', job_id_text, re.IGNORECASE)
        if match:
            return match.group(1)
 
    return None
 
def parse_html_file(filepath, job_posting_date, verbose=False):
    with open(filepath, 'r', encoding='utf-8') as file:
        html_content = file.read()
 
    soup = BeautifulSoup(html_content, 'lxml')
 
    ## Extract the year, month, and day from the job_posting_date string
    posting_date = job_posting_date.split('_')[0]
    data = {'postingDate': posting_date}
    job_id = extract_job_id_from_html(soup)
    if job_id:
        data['id'] = job_id
 
    rows = soup.find_all('tr')  ## find all table rows
 
    for row in rows:
        tds = row.find_all('td')  ## find all table data cells
 
        if len(tds) >= 2:
            label_td = tds[0]
            label_text = '\n'.join(label_td.stripped_strings).replace(':', '')
 
            value_td = tds[1]
            value_text = '\n'.join(value_td.stripped_strings)
 
            links = value_td.find_all('a')
            for link in links:
                url = link.get('href')
                link_text = link.get_text()
                value_text = value_text.replace(link_text, f'{link_text} ({url})')
 
            ## Map label_text to corresponding database column
            column_mapping = {
                ## 'Job ID': 'id',
                ## 'Job Posting Date': 'postingDate',
                'Job Title': 'title',
                'Organization': 'company',
                'Division': 'companyDivision',
                'Website': 'companyWebsite',
                'Job Location': 'location',
                'Job Location Type': 'locationType',
                'Number of Positions': 'numPositions',
                'Salary': 'salary',
                'Start Date': 'startDate',
                'End Date': 'endDate',
                'Job Function': 'function',
                'Job Description': 'description',
                'Job Requirements': 'requirements',
                'Preferred Disciplines': 'preferredDisciplines',
                'Application Deadline': 'applicationDeadline',
                'Application Method': 'applicationMethod',
                'Application Receipt Procedure': 'applicationReceiptProcedure',
                'If by Website, go to': 'applicationReceiptProcedure',
                'Additional Application Information': 'applicationDetails',
            }
 
            ## Check if label_text matches any of the predefined columns
            if label_text in column_mapping:
                db_column = column_mapping[label_text]
                ## If key already exists, append the value to it
                if db_column in data:
                    data[db_column] += f'\n{value_text}'
                else:
                    data[db_column] = value_text
 
    return data
 
def store_data_in_db(data, db_cursor):
    columns = ', '.join([f'"{key}"' for key in data.keys()])
    placeholders = ', '.join(['?' for _ in data.values()])
    sql = f'INSERT INTO "JobPosting" ({columns}) VALUES ({placeholders})'
    try:
        db_cursor.execute(sql, tuple(data.values()))
    except sqlite3.IntegrityError:
        logging.info("Integrity Error: Skipping row")
        pass
 
def create_db_schema(db_cursor):
    db_cursor.execute('''
    CREATE TABLE IF NOT EXISTS JobPosting (
        id INTEGER,
        postingDate DATE,
        title TEXT,
        company TEXT,
        companyDivision TEXT,
        companyWebsite TEXT,
        location TEXT,
        locationType TEXT,
        numPositions INTEGER,
        salary TEXT,
        startDate TEXT,
        endDate TEXT,
        function TEXT,
        description TEXT,
        requirements TEXT,
        preferredDisciplines TEXT,
        applicationDeadline TEXT,
        applicationMethod TEXT,
        applicationReceiptProcedure TEXT,
        applicationDetails TEXT,
        PRIMARY KEY(id, postingDate)
    )
    ''')
 
if __name__ == "__main__":
    logging.basicConfig(filename='run.log', level=logging.INFO, format='%(asctime)s %(message)s')
 
    parser = argparse.ArgumentParser(description="Parse HTML files in a folder and store data in SQLite DB.")
    parser.add_argument("-d", "--directory", default=os.getcwd(), help="Path to the directory containing HTML files. Default is the current directory.")
    parser.add_argument("--db", default=os.path.join(os.getcwd(), "job_postings.db"), help="SQLite database file to store the parsed data. Default is 'job_postings.db' in the directory specified by -d.")
    parser.add_argument("-v", "--verbose", action="store_true", help="Log parsed data.")
 
    args = parser.parse_args()
 
    conn = sqlite3.connect(args.db)
    cursor = conn.cursor()
    create_db_schema(cursor)
 
    ## Get the list of files
    files = [os.path.join(dirpath, file) for dirpath, _, files in os.walk(args.directory) for file in files if file.endswith('.html') or file.endswith('.htm')]
 
    ## Create a progress bar
    with tqdm(total=len(files)) as pbar:
        for subdir, _, files in os.walk(args.directory):
            job_posting_date = os.path.basename(subdir)
            for file in files:
                if file.endswith('.html') or file.endswith('.htm'):
                    filepath = os.path.join(subdir, file)
                    logging.info(filepath)
                    data = parse_html_file(filepath, job_posting_date, args.verbose)
                    store_data_in_db(data, cursor)
                    ## Update the progress bar
                    pbar.update(1)
 
    conn.commit()
    conn.close()
    logging.info("Parsing and storing completed.")
github.com/sadmanca/uoft-pey-coop-job-postings/blob/main/parse_to_db.py

Viewing the data

And so we have it: a single database file storing every single job posted on the UofT PEY Co-op job board for 2T5s6 (from 2023 to 2024). You can take a look at some of the results from querying the data below7, or download the SQLite .db file from the latest release at github.com/sadmanca/uoft-pey-coop-job-postings.
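If you grab the .db file, you can also run the sample queries below straight from Python's built-in sqlite3 module. Here's a minimal sketch (it assumes the database file is saved as job_postings.db in the working directory):

```python
import sqlite3

def top_companies(db_path: str, limit: int = 5) -> list[tuple[str, int]]:
    """Return the companies with the most postings, as (company, count) pairs."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT company, COUNT(*) AS num_postings
            FROM JobPosting
            GROUP BY company
            ORDER BY num_postings DESC
            LIMIT ?
            """,
            (limit,),
        ).fetchall()
    finally:
        conn.close()

# e.g. top_companies("job_postings.db") returns Sanofi Canada, Celestica,
# Hydro One, ... per the first sample query's results below.
```

The same pattern works for any of the queries in this post; swap in the JobPostings view instead of the JobPosting table if you want tidier SELECT * output.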

Sample Queries

Which companies posted the most jobs?

SELECT company, COUNT(*) as num_postings
FROM JobPosting
GROUP BY company
ORDER BY num_postings DESC;
SQLite query for: Which companies posted the most jobs?
CompanyNum Postings
Sanofi Canada92
Celestica Inc.74
Hydro One62
Huawei Technologies Canada Co.,Ltd.47
Pulsenics Inc35
Airbus31
RAISE (Ian Martin Group) on behalf of Enbridge30
Geotab30
The Independent Electricity System Operator (IESO)26
SOTI Inc.25
Ericsson23
STEMCELL Technologies22
Isowater Corporation22
Toronto Hydro20
Veoneer Canada Inc. (Formerly Autoliv)18
AMD (Advanced Micro Devices, Inc.)18
The Six Semiconductor17
Cenovus17
Qualcomm Canada Inc.15
Prep Doctors14
Kepler Communications13
Evertz Microsystems Ltd13
Agnico Eagle Mines Ltd.13
Xero12
Untether AI12
National Research Council Canada12
Bombardier Aerospace12
AlphaWave SEMI12
Vale Canada11
Questrade11
Paradigm Electronics Inc.11
Lincoln Electric Company of Canada11
IBI Group11
FGF Brands11
Environment and Climate Change Canada11
City of Toronto11
Atlantic Packaging Products Ltd.11
Acciona Infrastructure Canada Inc.11
ecobee10
SAFRAN Landing Systems (Messier-Dowty Inc.)10
IKO Industries10
Ceridian10
Synopsys9
NETINT Technologies Inc.9
MHIRJ Aviation Group9
Litens Automotive Group9
Kinectrics Inc.9
Collins Aerospace9
Cerebras Systems9
Canadian Natural Resources Limited9
Brigham & Women’s Hospital, Harvard Medical School9
York Region8
Toronto Transit Commission (TTC)8
Tenstorrent Inc8
Honda of Canada Mfg8
Stackpole International7
EllisDon7
Aecon Group Inc.7
eCAMION6
Thermo Fisher Scientific6
Royal Canadian Mounted Police (RCMP)6
Powersmiths International Corp.6
Ontario Clean Water Agency (OCWA)6
Northbridge Financial6
NVIDIA6
NOVA Chemicals6
Mondelez International6
Ministry of Transportation6
Mattamy Homes Ltd.6
Marvell Technology Group (US)6
Martinrea International Inc6
Ledcor6
Kingston Midstream6
Global Affairs Canada6
Genesys Canada Laboratories Inc. (Genesys)6
ExxonMobil Canada6
Amazon6
Welbilt5
WSP Canada Inc.5
Suncor Energy5
Smith + Andersen5
Purpose Building Inc.5
Ontario Teachers Pension Plan (OTPP)5
Mold-Masters Limited5
Mission Control Space Services5
Ministry of the Environment, Conservation and Parks5
Microchip Technology5
Jaguar Land Rover Canada5
Husky Injection Molding Systems Ltd.5
Colortech Inc.5
BBA5
Astera Labs5
Arxtron Technologies5
Air Canada5
eShipper4
e-Zinc Inc4
Zynga4
WOLF Advanced Technology4
VPC Group Inc.4
University of Toronto - Facilities & Services4
The Travel Corporation4
Tesla Motors4
Synaptive Medical Inc.4
Solvay Inc4
Scotiabank4
Purolator Inc.4
Petro Canada Lubricants (Hollyfrontier)4
Peel Plastic Products Limited4
PICCO Group of Companies4
Ministry of Children, Youth, and Social Services4
KAPP Infrastructure Inc.4
Intersect Power4
Intel Canada4
Grounded Engineering4
GHD4
CGI Group Inc.4
Bronte Construction4
Bayer Healthcare4
Alectra Utilities4
Zebra Technologies3
Viavi Solutions3
Umicore Autocat Canada Corp3
Uken Games3
Tencent Technology Company3
TYLin3
Seismic Software, Inc.3
Region of Peel3
Region of Durham3
Pulsenics3
Perpetua3
Nfinite Nanotechnology3
Nanoleaf3
Movellus3
Mother Parker’s Tea and Coffee3
Ministry of Health and Long-Term Care3
Magna International3
MDA3
Lumentum Inc.3
LEA Consulting Ltd.3
Kernal Biologics, Inc.3
InteraXon Inc3
Halton Region3
Green Infrastructure Partners Inc3
GGY a Moodys Analytics Company3
FedEx Ground3
FedEx Express Canada3
Eastern Power Limited3
Dynasty Power3
Deighton Associates Ltd.3
Defence Research and Development Canada3
Defence R&D Canada (DRDC) - Ottawa3
Circle Cardiovascular Imaging Inc.3
Ciena3
Cadence Design Systems Inc3
CES Transformers3
C. F. Crozier & Associates3
Boston Scientific3
Boeing3
Arctic Wolf Networks3
Aercoustics Engineering Limited3
AYOS Designs3
AGS Automotive Systems3
Zaber Technologies2
Xpan Inc.2
WalterFedy2
Volta Energy2
Veeva Systems2
Varicent2
Vale2
Urbantech Consulting2
University of Toronto - ITS Enterprise Applications and Solutions Integration2
University of Toronto2
Turnkey Modular Systems Inc2
Thornhill Medical2
Thales Canada2
Terrestrial Energy2
TC Energy (formerly TransCanada)2
Synergy Partners2
Stryker2
Structura Biotechnology Inc.2
StratumAI2
StoneX Group Inc.2
Square2
Spring Air Systems Inc.2
SparkLease Inc2
Snowflake2
Smarter Alloys2
SmartTrade Technologies2
SickKids2
Shell Canada Limited2
Safuture Inc.2
Royal Bank of Canada (RBC)2
Read Jones Christoffersen (RJC Engineers)2
Rapyuta Robotics Co., Ltd. (SWITZERLAND)2
Rakuten Kobo Inc.2
RBC Capital Markets2
Quantum Lifecycle Partners LP2
Quanta Technology2
Qualitrol Corp2
Overbond2
Nuclear Waste Management Organization2
Next Hydrogen Corporation (NHC)2
Mokal Industries2
ModiFace2
Modern Niagara Group Inc.2
Magna2
Lobo Consulting Services Inc.2
Liburdi Turbine Services2
Lanxess2
Lake Harbour Co Ltd2
Kijiji2
KGHM International LTD.2
Jamieson Wellness Inc.2
Intel Corporation2
Inoki2
Indigenous Services Canada2
Indie Semiconductor2
Hungerford Consulting Ltd.2
Health Canada2
Hatch Ltd.2
Harmonic Fund Services2
Hanon Systems EFP Canada Ltd. (Formerly Magna)2
HBNG Holborn Group2
Groq Inc.2
Griffith Foods Ltd.2
Grascan Construction Ltd.2
G&W Canada Corporation2
Faire2
Evoco Ltd2
Empire Life2
Electrovaya Corp.2
EHV Power2
Dragados Canada2
Deep Genomics2
Continuous Colour Coat Limited2
Coltene SciCan2
City of Mississauga2
Chatsimple2
ChargeLab Inc2
CentML Inc2
Canadian Tire Corporation2
Canadian Nuclear Safety Commission2
Canadensys Aerospace2
Cadillac Fairview2
CIMCO Refrigeration2
CERT Systems Inc.2
British Columbia Investment Management Corp2
BlueRock Theraputics2
Bloom AI2
Berg Chilling Systems2
Bank of Montreal (BMO)2
Bank of America Merrill Lynch2
BMO Financial Group2
Astro Shoring2
Arup2
Arctic Canadian Diamond Company Ltd.2
Aqusense2
Alta Planning + Design Canada, Inc.2
ACS Powersports2
iManage1
ePIC Blockchain Technologies1
Zomp Inc.1
ZS Associates1
YScope Inc1
Wolfe Heavy Equipment1
Walmart Canada1
Wakefield Canada - Castrol1
Volvo Cars1
Vivid Machines1
Vexos1
Veolia Water Technologies & Solutions1
Upstream Works Software1
Upper Canada Consultant1
University of Toronto Scarborough Campus1
UT Elevator (NA) Inc.1
UHN- (Princess Margaret, Toronto General, Toronto Western, Toronto Rehab)1
Tridel1
Trexo Robotics1
Trend Micro Canada Technologies Inc.1
Tractel Ltd1
Toyota1
Town of Oakville1
Thomson Reuters1
Third Rails Ltd1
The Poirier Group1
The Napoleon Group of Companies1
The Miller Group1
The Master Group1
Tetra Trust1
Tetra Tech1
Tersa Earth Innovations1
Tenaris Shawcor1
Teck Resources Ltd.1
Technical University of Munich (TUM)1
TD Securities1
Synopsys Inc1
Sun Life Financial1
Stelco1
Stantec Consulting Ltd.1
Sowingo1
SnapWrite1
Slate Technologies Canada Limited1
Simpplr1
Signal 11
Siemens Canada Limited1
Semtech1
Select Equity Group, L.P.1
Salytics1
Safety Power Inc1
SRK Consulting (Canada) Inc.1
SNC-Lavalin1
SLR Consulting (Canada) Ltd.1
SGS Canada Inc1
S&C Electric Canada Ltd.1
RxFood1
Rockwool1
RocMar Engineering Inc.1
Ripple Therapeutics1
Reinsurance Management Associates1
Regional Municipality of York (York Region)1
QMC Metering Solutions1
Public Services and Procurement Canada1
Providence Therapeutics1
Procter & Gamble (P&G)1
ProAutomated1
Princess Margaret Cancer Centre, University Health Network1
Pratt & Whitney Canada Inc.1
Powertech Labs Inc.1
Power Advisory1
Posterity Group1
Pomerleau, Inc1
PointClickCare1
Pinchin Ltd.1
PepsiCo1
Pelicargo1
PartnerRe Life Reinsurance Company of Canada1
Parsons Inc.1
Parametric Research Labs Inc.1
Pani Energy1
Ovintiv1
Ontario Transit Group1
Ontario Public Service1
Ontario Ministry of Transportation1
OPTrust1
ONIT Energy Ltd1
Nutrien1
Nuclear Promise X Inc.1
Nojumi Solutions Inc.1
New School Foods Inc.1
NAPCO, Royal Pipe & Fittings1
Modern Niagara1
Micrometric Jig Boring & Jig Grinding Ltd.1
Metro Inc.1
Metric Contracting Service Corporation1
Mesosil1
McCain Foods1
Mars Canada Inc.1
Manulife Financial1
Manulife1
Mancor Industries1
Lumerate1
Loblaw Companies Limited1
Lactalis Canada1
Konrad Group1
Kinross Gold Corporation1
Kiewit1

Where are most jobs located?

SELECT location, COUNT(*) as num_postings
FROM JobPosting
GROUP BY location
ORDER BY num_postings DESC;
SQLite query for: Where are most jobs located?
LocationNum Postings
Toronto295
Toronto, ON168
Mississauga, ON73
Mississauga55
Markham, Ontario52
Oakville, ON38
Ottawa, ON35
Mirabel, Quebec31
Markham27
Vancouver, BC25
North York, ON20
Toronto, ON, CA18
Owen Sound18
Ottawa18
Oakville18
7455 Birchmount Rd, Markham, ON L3R 5C2, Canada18
Markham, ON17
Main Office (Markham, ON)17
Connaught Campus, 1755 Steeles Avenue West, Toronto, ON, Canada, M2R 3T417
Brampton17
Toronto, Ontario15
Mississauga, Ontario15
Vaughan14
Brampton, ON13
Burlington12
Etobicoke11
1755 Steeles Avenue West, Toronto, ON, Canada, M2R 3T411
Markham, Ontario, Canada10
Calgary10
Scarborough9
Whitby8
Remote8
Etobicoke, Ontario8
Concord, Ontario8
Canada8
Rankin Inlet, Nunavut7
Newmarket7
7
Woodbridge6
Remote / Toronto6
North York6
Newmarket, Ontario6
National Capital Region, Ottawa6
Estevan, SK6
Concord, ON6
Calgary, Alberta6
Calgary, AB - Head Office or Fort McMurray, AB - Horizon and Albian Mine Site6
Alliston, Ontario6
Ajax, ON6
1755 Steeles Avenue West, Toronto, ON6
Ottawa, Ontario5
Ontario5
Lloydminster, SK5
Joffre, AB5
Cambridge, Massachusetts, USA5
Calgary, AB5
CA-AB-Calgary5
240 Richmond Street West Toronto, ON, Canada, M5V 1V65
Waterloo4
Toronto, Canada4
Newmarket, ON4
Mississauga, ON, CA4
Markham, Canada4
Collingwood4
Bolton, ON4
AI CENTRE of EXCELLENCE (Digital Data Hub), 240 Richmond Street West Toronto, ON, Canada, M5V 1V64
49 Rutherford Rd, Brampton4
3966 JinDu Rd., Shanghai4
2430 Royal Windsor Drive, Mississauga, Ontario4
196 Spadina Ave, Toronto ON4
1890 Alstep Dr, Mississauga, ON L5S 1W14
175 Mostar Street, Suite 200, Stouffville, Ontario4
175 Galaxy Blvd., Toronto, ON4
1 Commerce Valley Dr E, Thornhill, ON L3T 7X64
Winnipeg, MB3
Whitby/Scarborough3
Vaughan, ON3
Vancouver3
Toronto, Ontario, CA3
Toronto, ON, Canada3
Toronto or Thornhill3
Sudbury3
St. George (downtown Toronto)3
Sarnia, Ontario3
Mississauga, 2100 Meadowvale Blvd. (Hwy 401/Mississauga Rd)3
Fargo, North Dakota3
Downtown Toronto3
Chatham, ON3
Cambridge, MA (USA)3
Burlington, On3
Burlington, ON3
Brampton, ON (87 Orenda Road)3
Beijing, China3
Barrie, ON3
Baker Lake or Rankin Inlet, Nunavut3
Alberta3
Aercoustics Office - 1004 Middlegate Road, Mississauga3
90 Bales Drive East, East Gwillimbury3
71 Rexdale Boulevard3
500 Commissioners Street3
33 Kern road3
2400 Royal Windsor Drive, Mississauga, Ontario3
150 Toro Road3
Yellowknife, NT2
Woodbridge, Ontario2
Waterloo, ON2
Vaughan, Ontario2
Various in Canada2
Vancouver, British Columbia2
Toronto, Hybrid2
Toronto or Ottawa2
Toronto Office: 317 Adelaide Street West, Toronto2
Toronto (HQ) - Hybrid2
Toronto (Gardiner Expressway at Exhibition Station)2
Toronto & Vancouver2
Tiverton, Ontario2
Throughout Toronto2
TOKYO, JAPAN2
Stouffville2
SickKids Research Institute - 686 Bay St. Toronto, ON M5G 0A42
See Notes2
Santa Clara, California, USA2
Santa Clara, CA (USA)2
ST. George Campus2
Richmond Hill2
Remote in Canada2
Remote (HQ located in Montreal)2
Rankin Inlet or Baker Lake, Nunavut2
Ottawa/Gatineau2
Orillia2
Ontario, Canada2
One of 71 Rexdale Boulevard OR 715 Milner Avenue2
One of 500 Commissioners St, 71 Rexdale Blvd, 715 Milner Ave2
On-site2
Oakville, Ontario2
North York, Ontario2
North York, ON or Chatham, ON2
Newmarket, ON, CA2
Multiple locations2
Mont-Saint-Hilaire, QC2
Metro Hall, 55 John Street, Toronto, ON2
Markham, ON, CA2
Kitchener2
Hybrid Work Environment - Virtual and/or East Gwillimbury, Ontario2
Greater Toronto Area2
Georgetown, ON, Canada2
Georgetown, ON2
GTA, London/Guelph, Ottawa2
GTA2
Fort Saskatchewan, AB2
Fort McMurray, Alberta2
Etobicoke, ON2
Downtown Toronto or Downtown Ottawa2
DRDC Ottawa, Shirley’s Bay Campus, 3701 Carling Avenue, Ottawa, ON K1A 0Z42
Collingwood ON2
Cambridge, ON2
Cambridge2
Calgary and Various Locations2
Burnaby, BC2
Boston, MA, USA2
Boston, MA2
Bolton, Ontario2
Bolton2
Boisbriand/Mississauga2
Barrie, ON, CA2
Barrie2
Alliston2
Ajax, Ontario2
Ajax2
95 Arrow Road, M9M 2L42
901 Simcoe Street South Oshawa2
843 Eastern Ave & Union Station2
8027 Dixie Rd. Brampton2
7455 Birchmount Road, Markham, ON2
715 Milner Avenue2
71 Rexdale Blvd2
6688 Kitimat Rd., Mississauga, ON, L5N 1P82
60 Wingold Ave, Toronto, Ontario M6B 1P52
501 Alliance Ave, Suite 406, Toronto, ON2
400 Highway 6 North, DUNDAS, ON L9H 7K42
40 Bertrand Ave, Scarborough, ON M1L 2P6, Canada.2
393 University Avenue2
3745 Barnes Lake Road - Ashcroft, BC2
3484 Semenyk Ct, Mississauga2
300 - 3200 Dufferin St2
3 Paisley Ln, Uxbridge, ON L9P 0G52
277 Gladstone Ave. Toronto, Ontario, M6J 3L92
250 University Avenue Suite 400, Toronto2
250 Ferrand Drive2
2440 Winston Park Dr, Oakville, ON L6H 7V22
2425 Matheson Blvd E #200, Mississauga, ON L4W 5K4, Canada2
24 Buckingham St, Etobicoke M8Y 2W22
20 carlson court2
1755 Steeles Avenue West, Toronto, ON2
1138 Bathurst Street (Bathurst & Dupont)2
111 Progress Ave and 333 progress Ave, Scarborough2
1 York Street Suite 10102
toronto, ontario1
Woodbridge, ON1
West Hill, ON, CA, M1E 2K31
Waterloo/Remote1
Waterloo, Ontario1
Wakefield Canada1
Virtual1
Victoria, BC1
Various location in Canada1
Various in Ontario1
Various Construction Field Offices within the Metropolitan Toronto area1
Town of Oakville Town Hall: 1225 Trafalgar Road, Oakville; and Work From Home1
Toronto/Thornhill1
Toronto/Barrie, ON1
Toronto, Ontario OR Montreal, Quebec1
Toronto, Ontario (Hybrid)1
Toronto, ON or Calgary, AB1
Toronto, ON (various locations)1
Toronto, CA1
Toronto and Kitchener1
Toronto OR Gatineau1
Toronto ON1
Toronto (Hybrid)1
Toronto (HQ) - Hybrid - In office once a week1
Tiverton1
Thunder Bay, ON1
Thornhill, ON1
TORONTO1
TD Centre, 66 Wellington St W, Toronto1
Surrey, BC1
Suite 440, 20 Toronto Street, Toronto, ON M5C 2B81
Sudbury, Ontario1
Sudbury, ON, CAN1
Sudbury, ON Canada1
Strathroy, ON1
St. Clair River / Toronto1
St. Catharines, ON1
Scarborough, Ontario1
Richmond hill1
Rexdale, ON, CA1
Regina1
Rainbow Lake, AB1
Princess Margaret Cancer Centre1
Pickering, ON1
Peterborough, Ontario1
Ottawa, Toronto, North Bay1
Ottawa, On1
Ottawa (preferred) or remote1
Orillia, ON, CA1
Oakville, ON or Toronto, ON1
OPS office locations in Ontario (Toronto, Oshawa, Guelph, St. Catherines, North Bay, Kingston, Sudbury, London, Ottawa, Newmarket, Hamilton, Thunder Bay, Sault Ste Marie, and Peterborough) and/or remotely following a hybrid model1
North York, Toronto OR Calgary, Alberta1
North York, ON or Toronto, ON1
North Bay1
Nepean1
Nanticoke, ON, CA1
NEWMARKET, ON1
Munich, Germany (Preferred) or Toronto, Canada1
Multiple1
Mountain View, CA1
Mooretown, ON1
Montreal, Toronto1
Montreal, Quebec1
Mississuaga, ON1
Mississauga,ON1
Mississauga, Ontarip1
Mississauga, On1
Mississauga, Canada1
Markhan1
Markham, ON Canada1
Markham, Mississauga, OR Ottawa, ON1
Markham or Oakville1
Markham or Mississauga, ON1
Manhattan, NY1
MaRS Centre, 101 College Street, Toronto1
London, UK1
London1
Lloydminster, AB; Lloydminster, SK1
Laval, Quebec1
Lakefield, ON1
Lake Charles, Louisiana1
Kingston, ON1
Kingston (294 King Street East) and/or Toronto (4900 Yonge Street)1
Kincardine, ON, CA1
Kincardine1
Impala Canada1
Hyrbid, Toronto1
Hybrid, Toronto1
Hamilton, Ontario1
Hamilton and/or Vaughan offices1
Guelph Office1
Greater Sudbury, ON1
GTA, or outskirts of Ontario1
GTA, London/Guelph,1
For McMurray1
Flexible1
Etobicoke, Ontario/ Remote1
Edmonton, AB, CA1
Edmonton or Calgary, AB1
Dresden, ON1
Downtown Toronto, Richmond Hill, Vancouver / Virtual1
Downtown Toronto (remote)1
Downsview, 123 Garratt Blvd, North York, ON1
Dorval, Quebec1
Delta, Canada1
Copper Cliff, ON1
Connaught Campus,1755 Steeles Avenue West, Toronto, ON, Canada, M2R 3T41
Concord1

How many job postings are there for mechatronics positions?

SELECT COUNT(*) AS num_mech_postings
FROM JobPosting
WHERE title                LIKE '%mechatronic%'
   OR company              LIKE '%mechatronic%'
   OR companyDivision      LIKE '%mechatronic%'
   OR function             LIKE '%mechatronic%'
   OR description          LIKE '%mechatronic%'
   OR requirements         LIKE '%mechatronic%'
   OR preferredDisciplines LIKE '%mechatronic%';
SQLite query for: How many job postings are there for mechatronics positions?
┌─────────────────────┐
│  num_mech_postings  │
│─────────────────────│
│  87                 │
└─────────────────────┘

What does a job for a mechatronics position look like?

SELECT *
FROM JobPosting
WHERE title                LIKE '%mechatronic%'
   OR company              LIKE '%mechatronic%'
   OR companyDivision      LIKE '%mechatronic%'
   OR function             LIKE '%mechatronic%'
   OR description          LIKE '%mechatronic%'
   OR requirements         LIKE '%mechatronic%'
   OR preferredDisciplines LIKE '%mechatronic%'
LIMIT 1;
SQLite query for: What does a job for a mechatronics position look like?
idpostingDatetitlecompanycompanyDivisioncompanyWebsitelocationlocationTypenumPositionssalarystartDateendDatefunction
438732023-09-18Operational Technology InternHonda of Canada MfgRecruitmenthttps://www.hondacanadamfg.ca/Alliston, OntarioOn-Site45$26.17 hourly for 40.0 hours per week05/06/202408/25/2025Engineering

Next Steps

Edit (2024-11-11): part 2 of this post series was intended to be about cleaning the data. Unfortunately, I’ve been so preoccupied with other things (including a better dashboard for 2024-2025 co-op job postings) that I haven’t had the time to do any of that 😅, so part 2 ended up being about the dashboard I built to display all this job info on my website. As it is, I’ll leave the data cleaning for another day, but I hope you found the initial insights interesting!


You’ll notice that the data still needs some cleaning8, which is fine; it should be fairly simple to do by manually parsing through the dataset and aggregating similar values.
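As a sketch of what that aggregation might look like, here's a tiny alias map applied over whitespace- and case-normalized location strings (the aliases are just a few examples drawn from the location table above; a real cleaning pass would need many more entries):

```python
import re

# Example aliases observed in the scraped data; extend after inspecting the
# full set of distinct location values.
ALIASES = {
    "toronto": "Toronto, ON",
    "toronto, on": "Toronto, ON",
    "toronto, on, ca": "Toronto, ON",
    "toronto, ontario": "Toronto, ON",
    "mississauga": "Mississauga, ON",
    "mississauga, on": "Mississauga, ON",
}

def normalize_location(raw: str) -> str:
    """Collapse trivially different spellings of the same location."""
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return ALIASES.get(key, raw.strip())
```

Anything not in the map passes through unchanged, so the cleanup can be done incrementally without losing the original values.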

But after some basic data cleaning comes the fun part: analyzing the data to generate some insights on posted jobs, including…

  • How many jobs are posted for chemical/mechanical/mineral/… engineers?
  • During what periods are the bulk of jobs posted?
  • When do specific companies post most/all their jobs?
  • How many jobs are reposted9, and which ones?
  • …and more!
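The repost question is answerable even before any cleaning: since the table's primary key is (id, postingDate), a reposted job is simply an id that appears under more than one posting date. A sketch:

```python
import sqlite3

def reposted_jobs(db_path: str) -> list[tuple[int, int]]:
    """Return (id, times_posted) for jobs posted under more than one date."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            """
            SELECT id, COUNT(DISTINCT postingDate) AS times_posted
            FROM JobPosting
            GROUP BY id
            HAVING COUNT(DISTINCT postingDate) > 1
            ORDER BY times_posted DESC
            """
        ).fetchall()
    finally:
        conn.close()
```

Joining the resulting ids back against JobPosting would then show which titles and companies tended to repost.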

Footnotes

  1. Chrome and virtually all other browsers have a slightly more complicated setup for saving pages which makes organizing files for saved pages slightly less elegant compared to dealing with just a single file via SingleFile: when you press Ctrl + S on a page, it doesn’t just save that page’s HTML file but also a folder containing all of the media from the page (which, given that none of the job postings contain images, is just one more thing to have to delete).

  2. Despite a large portion of the saved HTML files consisting of useless JavaScript, CSS, and markup that has nothing to do with the data that counts, it’s just not worth the time investment for me to parse through every file and strip out the clutter, since all the saved job postings for a year only take up a couple hundred MB.

  3. At times purposefully by myself (e.g. adding the Job ID, and later the Job Title, to the filenames of saved HTML files for easier file browsing and duplicate checking), and at times by the university (i.e. there was a big redesign of the job board in the latter half of the fall semester, after which the PEY job board used the same frontend design as the job boards on CLNx; prior to that it looked a bit different, despite functioning the same).

  4. At one point, all of my extensions were somehow uninstalled (probably due to a Chrome update gone wrong), and I had forgotten to reconfigure SingleFile to include job IDs in the saved filenames for many weeks until I got some free time and was able to pore over my notes (and reconfigure everything back to the way it was).

  5. e.g. 📁2023-09-17_20-14-10; easily achievable thanks to github.com/sadmanca/ahk-scripts/blob/master/keys.ahk

  6. UofT Engineering students and alumni often include their graduation year after their name to denote their class. If you start first year in 2021, you will be part of the class of 2025, denoted as “2T5” (pronounced “two-tee-five”).

  7. I’ve built a SQL view called JobPostings into the database (the table with the actual data is called JobPosting, without the “s”) that excludes some of the columns with very long values (e.g. description, requirements, preferredDisciplines, applicationDetails) to make table formatting look a bit better when running SELECT * queries. To query the excluded columns, use the JobPosting table instead.

  8. e.g. Toronto and Toronto, ON are present as two distinct locations, when really they should just be one.

  9. Something the Engineering Career Centre (ECC) people who run the PEY Co-op job board don’t tell you about ahead of time: employers can “repost” a job with the same job id (usually several weeks after the original), which makes it show up under New Posting Since Last Login even if you’ve already viewed it (and even though it’s technically still the same single job posting). It’s not commonplace, but it did occur for a few dozen jobs (I imagine for those where the position was left unfilled after some initial batch of students applied, and it was reposted to reinvigorate interest).