GPT-4
Last updated
Last updated
In this section, we cover the latest prompt engineering techniques for GPT-4, including tips, applications, limitations, and additional reading materials.
More recently, OpenAI released GPT-4, a large multimodal model that accept image and text inputs and emit text outputs. It achieves human-level performance on various professional and academic benchmarks.
Detailed results on a series of exams below:
Detailed results on academic benchmarks below:
GPT-4 achieves a score that places it around the top 10% of test takers on a simulated bar exam. It also achieves impressive results on a variety of difficult benchmarks like MMLU and HellaSwag.
OpenAI claims that GPT-4 was improved with lessons from their adversarial testing program as well as ChatGPT, leading to better results on factuality, steerability, and better alignment.
GPT-4 APIs currently only supports text inputs but there is plan for image input capability in the future. OpenAI claims that in comparison with GPT-3.5 (which powers ChatGPT), GPT-4 can be more reliable, creative, and handle more nuanced instructions for more complex tasks. GPT-4 improves performance across languages.
While the image input capability is still not publicly available, GPT-4 can be augmented with techniques like few-shot and chain-of-thought prompting to improve performance on these image related tasks.
From the blog, we can see a good example where the model accepts visual inputs and a text instruction.
The instruction is as follows:
Note the "Provide a step-by-step reasoning before providing your answer" prompt which steers the model to go into an step-by-step explanation mode.
The image input:
This is GPT-4 output:
This is an impressive result as the model follows the correct instruction even when there is other available information on the image. This open a range of capabilities to explore charts and other visual inputs and being more selective with the analyses.
One area for experimentation is the ability to steer the model to provide answers in a certain tone and style via the system
messages. This can accelerate personalization and getting accurate and more precise results for specific use cases.
For example, let's say we want to build an AI assistant that generate data for us to experiment with. We can use the system
messages to steer the model to generate data in a certain style.
In the example below, we are interested to generated data samples formatted in JSON format.
ASSISTANT Response:
And here is a snapshot from the OpenAI Playground:
To achieve this with previous GPT-3 models, you needed to be very detailed in the instructions. The difference with GPT-4 is that you have instructed the style once via the system
message and this will persists for any follow up interaction. If we now try to override the behavior, here is what you get.
ASSISTANT Response:
This is very useful to get consistent results and behavior.
According to the blog release, GPT-4 is not perfect and there are still some limitations. It can hallucinate and makes reasoning errors. The recommendation is to avoid high-stakes use.
On the TruthfulQA benchmark, RLHF post-training enables GPT-4 to be significantly more accurate than GPT-3.5. Below are the results reported in the blog post.
Checkout this failure example below:
The answer should be Elvis Presley
. This highlights how brittle these models can be for some use cases. It will be interesting to combine GPT-4 with other external knowledge sources to improve the accuracy of cases like this or even improve results by using some of the prompt engineering techniques we have learned here like in-context learning or chain-of-thought prompting.
Let's give it a shot. We have added additional instructions in the prompt and added "Think step-by-step". This is the result:
Keep in mind that I haven't tested this approach sufficiently to know how reliable it is or how well it generalizes. That's something the reader can experiment with further.
Another option, is to create a system
message that steers the model to provide a step-by-step answer and output "I don't know the answer" if it can't find the answer. I also changed the temperature to 0.5 to make the model more confident in its answer to 0. Again, please keep in mind that this needs to be tested further to see how well it generalizes. We provide this example to show you how you can potentially improve results by combining different techniques and features.
Keep in mind that the data cutoff point of GPT-4 is September 2021 so it lacks knowledge of events that occurred after that.
See more results in their main blog post (opens in a new tab) and technical report (opens in a new tab).
We will summarize many applications of GPT-4 in the coming weeks. In the meantime, you can checkout a list of applications in this Twitter thread (opens in a new tab).
Coming soon!
Instruction Tuning with GPT-4 (opens in a new tab) (April 2023)
Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text (March 2023)
GPT-4 Technical Report (opens in a new tab) (March 2023)
Last updated on April 16, 2023