Prompt Engineering Guide 3. Advanced Prompting

JH_KIM 2023. 5. 4. 23:15

This post is my own loose translation of https://github.com/dair-ai/Prompt-Engineering-Guide, written for study purposes.


Advanced Prompting

As we saw earlier, it should be clear by now that improving prompts is essential.

Before moving on to more advanced concepts, let's cover a few ideas in more detail.


Zero-Shot Prompting

Today's LLMs are trained on large amounts of data and tuned to follow instructions, so they can perform tasks zero-shot.

Prompt:
Classify the text into neutral, negative, or positive. 

Text: I think the vacation is okay.
Sentiment:

Output:
Neutral

As the example above shows, the model handles the task well even though we gave it no examples. This is the zero-shot capability at work.

If zero-shot does not work well, it is a good idea to provide demonstrations or examples in the prompt.
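As a minimal sketch, the zero-shot prompt above can be assembled from a template. The function name `build_zero_shot_prompt` and its parameters are hypothetical and not part of the guide; sending the prompt to a model is left out.

```python
from typing import List


def build_zero_shot_prompt(text: str, labels: List[str]) -> str:
    """Build a classification prompt that contains an instruction and the
    input text, but no demonstrations (zero-shot)."""
    if len(labels) < 2:
        raise ValueError("need at least two labels")
    # "neutral, negative, or positive"
    label_list = ", ".join(labels[:-1]) + ", or " + labels[-1]
    return (
        f"Classify the text into {label_list}.\n\n"
        f"Text: {text}\n"
        "Sentiment:"
    )


zero_shot_prompt = build_zero_shot_prompt(
    "I think the vacation is okay.", ["neutral", "negative", "positive"]
)
print(zero_shot_prompt)
```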

Few-Shot Prompting

LLMs already show strong zero-shot performance, but they can fall short on more complex tasks. To improve on this, few-shot prompting supplies examples in the prompt to enable in-context learning and steer the model toward better performance.

Prompt:
A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses
the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses
the word farduddle is:

Output:
When we won the game, we all started to farduddle in celebration

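With just one example (1-shot), the model somehow manages to perform the task.

Here are a few tips from the findings of "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?":

- The label space and the distribution of the input text specified by the demonstrations both matter (regardless of whether the labels are correct for the individual inputs).

- The format also plays an important role in performance. Even random labels are better than no labels at all.

- Drawing the random labels from the true distribution of labels also helps.

Let's try it with random labels.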

Prompt:
This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //

Output:
Negative

We still get the correct answer. Keeping the format consistent helped as well. In further experiments, though, newer GPT models seem to be getting more robust even to random formats.

Prompt:
Positive This is awesome! 
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --

Output:
Negative

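As shown here, even when the format is inconsistent, the model still predicts the correct label.

A more thorough analysis is needed to confirm whether this also holds for more complex and varied tasks.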
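The consistent "text // label" format used in the first random-label prompt can be generated programmatically. This is only a sketch: `build_few_shot_prompt` and the `sep` parameter are names introduced here for illustration, and the labels are deliberately shuffled as in the example above.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    demos: List[Tuple[str, str]], query: str, sep: str = " // "
) -> str:
    """Render each demonstration as 'text // label' and leave the label of
    the final query empty so the model completes it."""
    lines = [f"{text}{sep}{label}" for text, label in demos]
    # Strip the trailing space so the last line ends with '//'.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


# Labels are intentionally random (not matched to the text).
demos = [
    ("This is awesome!", "Negative"),
    ("This is bad!", "Positive"),
    ("Wow that movie was rad!", "Positive"),
]
few_shot_prompt = build_few_shot_prompt(demos, "What a horrible show!")
print(few_shot_prompt)
```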

Limitations of Few-shot Prompting

Few-shot prompting works well for many tasks, but it is still not perfect on more complex reasoning tasks.

Recall this earlier example:

Prompt:
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 

A: 

Output:
Yes, the odd numbers in this group add up to 107, which is an even number.

This was an example that did not work well.

Computing it by hand, the odd numbers sum to 15+5+13+7+1 = 41, so the sum itself is wrong, and even if 107 were the correct sum, 107 is an odd number.
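The arithmetic can be checked with a few lines of Python. This is just a verification of the numbers in the prompt, not something the guide itself provides.

```python
numbers = [15, 32, 5, 13, 82, 7, 1]

# Keep only the odd numbers, then add them up.
odd_numbers = [n for n in numbers if n % 2 == 1]
odd_sum = sum(odd_numbers)

# The claim in the prompt is "the odd numbers add up to an even number".
claim_is_true = odd_sum % 2 == 0

print(odd_numbers, odd_sum, claim_is_true)  # [15, 5, 13, 7, 1] 41 False
```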

The GPT-4 model, however, answers like this:

To find the sum of the odd numbers in this group, we first need to identify which numbers are odd. The odd numbers in this group are 15, 5, 13, 7, and 1. 

Now, we add these numbers together: 

15 + 5 + 13 + 7 + 1 = 41

The sum of the odd numbers in this group is 41, which is an odd number, not an even number.

It works very well. However, this is only a simple example; on more complex problems, even GPT-4's reasoning turns out to be fragile.

For now, let's go back to the earlier Davinci model and improve the prompt.

Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.

The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: The answer is True.

The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A: 

Output:
The answer is True.

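It still doesn't work properly! Even GPT-4 gives the same answer on this example. It seems that a plain few-shot prompt is not enough for this type of reasoning task. Looking more closely, the task we introduced involves several reasoning steps, so breaking the problem down and demonstrating those steps to the model should help. Recently, "chain-of-thought (CoT) prompting" has become popular for handling more complex arithmetic, commonsense, and symbolic reasoning tasks.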

Chain-of-Thought Prompting

CoT enables complex reasoning. Combining it with few-shot prompting gives even better results.

Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: Adding all the odd numbers (17, 19) gives 36. The answer is True.

The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: Adding all the odd numbers (11, 13) gives 24. The answer is True.

The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: Adding all the odd numbers (17, 9, 13) gives 39. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:

Output:
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

Wow, that is really impressive. We get a good result even when we provide fewer examples.

Prompt:
The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: Adding all the odd numbers (9, 15, 1) gives 25. The answer is False.

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:

Output:
Adding all the odd numbers (15, 5, 13, 7, 1) gives 41. The answer is False.

Cool, right?
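The CoT demonstrations above all follow the same pattern: list the odd numbers, state their sum, then give the verdict. As a rough sketch, such demonstrations can be generated so that the intermediate step is always correct. The helper names `cot_demo` and `build_cot_prompt` are made up for this illustration.

```python
from typing import List

QUESTION = "The odd numbers in this group add up to an even number: "


def cot_demo(numbers: List[int]) -> str:
    """Return one question/answer pair whose answer spells out the reasoning."""
    odds = [n for n in numbers if n % 2 == 1]
    total = sum(odds)
    question = QUESTION + ", ".join(map(str, numbers)) + "."
    answer = (
        f"A: Adding all the odd numbers ({', '.join(map(str, odds))}) "
        f"gives {total}. The answer is {total % 2 == 0}."
    )
    return question + "\n" + answer


def build_cot_prompt(demo_groups: List[List[int]], query: List[int]) -> str:
    """Join the worked demonstrations and end with the unanswered query."""
    blocks = [cot_demo(group) for group in demo_groups]
    blocks.append(QUESTION + ", ".join(map(str, query)) + ".\nA:")
    return "\n\n".join(blocks)


cot_prompt = build_cot_prompt([[4, 8, 9, 15, 12, 2, 1]], [15, 32, 5, 13, 82, 7, 1])
print(cot_prompt)
```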

Zero-Shot CoT

A more recent idea is zero-shot CoT, which simply appends the phrase "Let's think step by step" to the original prompt.

Prompt:
I went to the market and bought 10 apples. 
I gave 2 apples to the neighbor and 2 to the repairman. 
I then went and bought 5 more apples and ate 1. 
How many apples did I remain with?

Output:
11 apples

That is the wrong answer! Now let's add our special prompt.

Prompt:
I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

Let's think step by step.

Output:
First, you started with 10 apples.
You gave away 2 apples to the neighbor and 2 to the repairman, so you had 6 apples left.
Then you bought 5 more apples, so now you had 11 apples.
Finally, you ate 1 apple, so you would remain with 10 apples.

Wow, impressive. This method is especially useful when you don't have many examples to put in the prompt.
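In code, zero-shot CoT amounts to appending the trigger phrase to the question. The sketch below uses a hypothetical helper `add_zero_shot_cot` and also replays the apple arithmetic so the expected answer can be checked independently of any model.

```python
COT_TRIGGER = "Let's think step by step."


def add_zero_shot_cot(question: str, trigger: str = COT_TRIGGER) -> str:
    """Append the zero-shot CoT trigger phrase after the question."""
    return f"{question.strip()}\n\n{trigger}"


question = (
    "I went to the market and bought 10 apples. "
    "I gave 2 apples to the neighbor and 2 to the repairman. "
    "I then went and bought 5 more apples and ate 1. "
    "How many apples did I remain with?"
)
zero_shot_cot_prompt = add_zero_shot_cot(question)

# Replaying the steps from the model's answer by hand:
apples = 10
apples -= 2 + 2   # gave 2 to the neighbor and 2 to the repairman -> 6
apples += 5       # bought 5 more -> 11
apples -= 1       # ate 1 -> 10
print(apples)  # 10
```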

Self-Consistency

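One of the more advanced techniques in prompt engineering is self-consistency.
Proposed in a research paper, self-consistency aims "to replace the naive greedy decoding used in chain-of-thought prompting". The idea is to sample multiple, diverse reasoning paths through few-shot CoT and then pick the most consistent answer. This helps considerably in improving performance on reasoning tasks.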

Prompt:
When I was 6 my sister was half my age. Now
I’m 70 how old is my sister?

Output:
35

Wrong again. Let's try to improve it.
We will use few-shot examples this time.

Prompt:
Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.

Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.

Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.

Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.

Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.

Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.

Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent 5

Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:

Output 1:
When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.

Output 2:
When the narrator was 6, his sister was half his age, which is 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.

Output 3:
When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70/2 = 35. The answer is 35.

Computing the final answer from these outputs takes a few more steps, but put simply, the majority of the outputs say 67, so that becomes the final answer.
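The selection step can be sketched as a majority vote over the final answers of the sampled outputs. This is a simplified illustration and not the paper's full procedure: the regular expression assumes every completion ends with "The answer is N.", as in the few-shot examples, and the function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Pull the number that follows 'The answer is' out of one completion."""
    match = re.search(r"The answer is\s+(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None


def majority_answer(completions: List[str]) -> Optional[str]:
    """Return the most frequent extracted answer, or None if none was found."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


outputs = [
    "When I was 6 my sister was half my age, so she was 3. Now I am 70, "
    "so she is 70 - 3 = 67. The answer is 67.",
    "When the narrator was 6, his sister was half his age, which is 3. Now "
    "that the narrator is 70, his sister would be 70 - 3 = 67 years old. "
    "The answer is 67.",
    "When I was 6 my sister was half my age, so she was 3. Now I am 70, "
    "so she is 70/2 = 35. The answer is 35.",
]
print(majority_answer(outputs))  # 67
```

In practice the outputs would come from sampling the same prompt several times with a non-zero temperature, which is what produces the different reasoning paths.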

Generated Knowledge Prompting

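LLMs continue to improve, and one popular technique is to incorporate knowledge or information that helps the model make more accurate predictions.
A related idea is to use the model itself to generate that knowledge before it makes a prediction.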

Prompt:
Part of golf is trying to get a higher point total than others. Yes or No?

Output:
Yes.

That looks like a careless answer. So let's try generating some knowledge first.

Prompt:
Input: Greece is larger than mexico.
Knowledge: Greece is approximately 131,957 sq km, while Mexico is approximately 1,964,375 sq km, making Mexico 1,389% larger than Greece.

Input: Glasses always fog up.
Knowledge: Condensation occurs on eyeglass lenses when water vapor from your sweat, breath, and ambient humidity lands on a cold surface, cools, and then changes into tiny drops of liquid, forming a film that you see as fog.
Your lenses will be relatively cool compared to your breath, especially when the outside air is cold.

Input: A fish is capable of thinking.
Knowledge: Fish are more intelligent than they appear. 
In many areas, such as memory, their cognitive powers match or exceed those of ’higher’ vertebrates including non-human primates. 
Fish’s long-term memories help them keep track of complex social relationships.

Input: A common effect of smoking lots of cigarettes in one’s lifetime is a higher than normal chance of getting lung cancer.
Knowledge: Those who consistently averaged less than one cigarette per day over their lifetime had nine times the risk of dying from lung cancer than never smokers. 
Among people who smoked between one and 10 cigarettes per day, the risk of dying from lung cancer was nearly 12 times higher than that of never smokers.

Input: A rock is the same size as a pebble.
Knowledge: A pebble is a clast of rock with a particle size of 4 to 64 millimetres based on the Udden-Wentworth scale of sedimentology. 
Pebbles are generally considered larger than granules (2 to 4 millimetres diameter) and smaller than cobbles (64 to 256 millimetres diameter).

Input: Part of golf is trying to get a higher point total than others.
Knowledge:

Knowledge 1:
The objective of golf is to play a set of holes in the least number of strokes. 
A round of golf typically consists of 18 holes. 
Each hole is played once in the round on a standard golf course. 
Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.

Knowledge 2:
Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. 
The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. 
The player with the lowest score wins the game.

Now let's reformat the question together with the generated knowledge and try again.

Prompt:
Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: The objective of golf is to play a set of holes in the least number of strokes. A round of golf typically consists of 18 holes. Each hole is played once in the round on a standard golf course. Each stroke is counted as one point, and the total number of strokes is used to determine the winner of the game.

Explain and Answer: 

Answer 1 (confidence very high):
No, the objective of golf is not to get a higher point total than others. Rather, the objective is to play a set of holes in the least number of strokes. The total number of strokes is used to determine the winner of the game, not the total number of points.

Prompt:
Question: Part of golf is trying to get a higher point total than others. Yes or No?

Knowledge: Golf is a precision club-and-ball sport in which competing players (or golfers) use many types of clubs to hit balls into a series of holes on a course using the fewest number of strokes. The goal is to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

Explain and Answer:

Answer 2 (confidence is a lot lower):
Yes, part of golf is trying to get a higher point total than others. Each player tries to complete the course with the lowest score, which is calculated by adding up the total number of strokes taken on each hole. The player with the lowest score wins the game.

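The first answer comes with very high confidence, while the second does not. This was simplified for demonstration purposes; there are more details to consider before arriving at a final answer, so please refer to the paper for specifics.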
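The two stages shown above (generate knowledge, then answer with it) can be wired together roughly as follows. This is a sketch under assumptions: `generate` stands for any function that sends a prompt to a model and returns its completion, `fake_generate` is a canned stub so the code runs without a model, and how the final answer is chosen among the candidates is deliberately left out.

```python
from typing import Callable, List, Tuple


def build_knowledge_prompt(demos: List[Tuple[str, str]], statement: str) -> str:
    """Few-shot prompt that asks the model to write knowledge about a statement."""
    blocks = [f"Input: {inp}\nKnowledge: {knowledge}" for inp, knowledge in demos]
    blocks.append(f"Input: {statement}\nKnowledge:")
    return "\n\n".join(blocks)


def build_answer_prompt(question: str, knowledge: str) -> str:
    """Second-stage prompt that places the generated knowledge before the answer."""
    return f"Question: {question}\n\nKnowledge: {knowledge}\n\nExplain and Answer:"


def answer_with_knowledge(
    question: str,
    statement: str,
    demos: List[Tuple[str, str]],
    generate: Callable[[str], str],  # hypothetical model call
    n_samples: int = 2,
) -> List[Tuple[str, str]]:
    """Generate n_samples knowledge statements and one answer per statement."""
    knowledge_prompt = build_knowledge_prompt(demos, statement)
    results = []
    for _ in range(n_samples):
        knowledge = generate(knowledge_prompt).strip()
        answer = generate(build_answer_prompt(question, knowledge)).strip()
        results.append((knowledge, answer))
    return results


def fake_generate(prompt: str) -> str:
    """Stub standing in for a real model; returns canned placeholder text."""
    if prompt.endswith("Knowledge:"):
        return "The objective of golf is to play a set of holes in the least number of strokes."
    return "No."


demos = [(
    "Greece is larger than mexico.",
    "Greece is approximately 131,957 sq km, while Mexico is approximately "
    "1,964,375 sq km, making Mexico 1,389% larger than Greece.",
)]
statement = "Part of golf is trying to get a higher point total than others."
results = answer_with_knowledge(statement + " Yes or No?", statement, demos, fake_generate)
print(results)
```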

Automatic Prompt Engineer (APE)

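The APE paper proposes a framework for automatic instruction generation and selection.
In the first step, an LLM is given output demonstrations (examples) and generates instruction candidates for the task.
These candidates guide the search procedure: the instructions are executed with a target model, and the most suitable instruction is selected based on the computed evaluation scores.
As an example, APE finds "Let's work this out in a step by step way to be sure we have the right answer.", an instruction that works better than the hand-written "Let's think step by step".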
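The selection half of this procedure can be sketched as a scoring loop. Everything here is an assumption made for illustration: `execute` stands for running the target model with an instruction on one input, the score is plain accuracy on a tiny evaluation set, and `toy_execute` is a stand-in that merely rewards one phrase so the code runs without a model. The candidate generation step is not shown.

```python
from typing import Callable, List, Tuple


def score_instruction(
    instruction: str,
    eval_set: List[Tuple[str, str]],
    execute: Callable[[str, str], str],  # hypothetical target-model call
) -> float:
    """Fraction of evaluation questions answered correctly with this instruction."""
    correct = sum(1 for q, expected in eval_set if execute(instruction, q) == expected)
    return correct / len(eval_set)


def select_best_instruction(
    candidates: List[str],
    eval_set: List[Tuple[str, str]],
    execute: Callable[[str, str], str],
) -> Tuple[str, float]:
    """Score every candidate and return the best one with its score."""
    scored = [(score_instruction(c, eval_set, execute), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


eval_set = [("2 + 3", "5"), ("10 - 4", "6")]
answers = dict(eval_set)


def toy_execute(instruction: str, question: str) -> str:
    """Toy stand-in for a model: answers correctly only for 'step by step' instructions."""
    return answers[question] if "step by step" in instruction.lower() else "?"


candidates = ["Answer the question.", "Let's think step by step."]
best, best_score = select_best_instruction(candidates, eval_set, toy_execute)
print(best, best_score)  # Let's think step by step. 1.0
```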
