Graduate Job Classification Case Study
Clavié et al., 2023 provide a case study on prompt engineering applied to a medium-scale text classification use case in a production system. Using the task of classifying whether a job is a true "entry-level job", suitable for a recent graduate, or not, they evaluated a series of prompt engineering techniques and report their results using GPT-3.5 (gpt-3.5-turbo).
The work shows that LLMs outperform all other models tested, including an extremely strong baseline in DeBERTa-V3. gpt-3.5-turbo also noticeably outperforms older GPT-3 variants on all key metrics, but requires additional output parsing, as its ability to stick to a template appears to be worse than that of the other variants.
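To make the setup concrete, here is a minimal sketch of a baseline zero-shot classification call with gpt-3.5-turbo, assuming the OpenAI Python client (v1+) and an API key in the environment; the prompt wording is illustrative and not the exact prompt used in the paper.

```python
# Minimal baseline sketch: zero-shot classification of a job posting.
# Assumes the openai v1+ Python client and OPENAI_API_KEY in the environment.
# The prompt wording is illustrative, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()

job_posting = "..."  # full text of the job posting to classify

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": (
                "Is the following job posting a true entry-level job, "
                "suitable for a recent graduate with no professional "
                "experience? Answer Yes or No.\n\n" + job_posting
            ),
        }
    ],
)

print(response.choices[0].message.content)
```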
The key findings of their prompt engineering approach are:
For tasks such as this one, where no expert knowledge is required, Few-shot CoT prompting performed worse than Zero-shot prompting in all experiments.
The impact of the prompt on eliciting the correct reasoning is massive. Simply asking the model to classify a given job results in an F1 score of 65.6, whereas the post-prompt engineering model achieves an F1 score of 91.7.
Attempting to force the model to stick to a template lowers performance in all cases (this behaviour disappears in early testing with GPT-4, which postdates the paper).
Many small modifications have an outsized impact on performance.
The tables below show the full list of modifications tested.
Properly giving instructions and repeating the key points appears to be the biggest performance driver.
Something as simple as giving the model a (human) name and referring to it as such increased the F1 score by 0.6 points (a sketch combining several of these modifications follows this list).
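As referenced above, the snippet below sketches how several of the reported modifications might be combined: the role given as a system message and the task as a user message, a human name for the model, repetition of the key instruction, and a request to reason step by step before answering. The wording and the name are assumptions for illustration, not the authors' exact prompts.

```python
# Sketch of a prompt combining several reported modifications:
# role as a system message + task as a user message, a (placeholder) name,
# repeated key instruction, and step-by-step reasoning before the answer.
# Assumes the openai v1+ Python client; wording is illustrative only.
from openai import OpenAI

client = OpenAI()

job_posting = "..."  # job posting text

system_msg = (
    "You are Alex, an AI assistant helping a recruitment team. "  # name is a placeholder
    "Your job is to decide whether a job posting is a true entry-level job, "
    "suitable for a recent graduate with no professional experience."
)

user_msg = (
    "Read the job posting below and decide whether it is a true entry-level job.\n"
    "Remember: a true entry-level job requires no prior professional experience.\n"
    "Reason step by step, then give your final answer as 'Answer: Yes' or 'Answer: No'.\n\n"
    "Job posting:\n" + job_posting
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
)

print(response.choices[0].message.content)
```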
[Table: Prompt Modifications Tested]
[Table: Performance Impact of All Prompt Modifications]
Template stickiness refers to how frequently the model answers in the desired format.
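Because template stickiness falls short of 100%, some answers have to be recovered from free-form text, which is the additional output parsing mentioned above. The helper below is a minimal parsing sketch under that assumption; the regex and fallback logic are illustrative, not the paper's code.

```python
# Minimal output-parsing sketch: recover a final Yes/No label even when the
# model wraps it in extra prose. The regex and fallback are assumptions.
import re

def parse_label(completion: str) -> str | None:
    """Return 'yes' or 'no' if a final answer can be recovered, else None."""
    # Prefer an explicit "Answer: Yes/No" if the template was followed.
    match = re.search(r"answer\s*:\s*(yes|no)", completion, re.IGNORECASE)
    if match:
        return match.group(1).lower()
    # Fall back to the last standalone yes/no token in the completion.
    tokens = re.findall(r"\b(yes|no)\b", completion, re.IGNORECASE)
    return tokens[-1].lower() if tokens else None

print(parse_label("The role requires no experience. Answer: Yes"))          # -> "yes"
print(parse_label("This posting asks for 5 years of experience, so no."))   # -> "no"
```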