Prompt Engineering Says Watch Out When Using Hard Prompts Like These In Generative AI

Here’s your explainer on when and how to use hard prompts in generative AI.


In today’s column, I am continuing my ongoing coverage of prompt engineering strategies and tactics that aid in getting the most out of using generative AI apps such as ChatGPT, GPT-4, Bard, Gemini, Claude, etc. The focus this time will be on composing prompts that deal with hard problems, commonly known as “hard prompts”.

A hard prompt is a prompt that presents a hard or arduous problem or question to generative AI.

The AI might not be able to solve the problem at all, but it will at least try to do so. In trying, the AI might consume lots of processing time, which might be costly for you. Worse still, while trying to solve the question or problem that was provided in a hard prompt, there is a real possibility that a so-called AI hallucination will occur and regrettably provide a faked or false result. Tradeoffs exist when considering the use of hard prompts.

If you are interested in prompt engineering overall, you might find of interest my comprehensive guide on over fifty other keystone prompting strategies, see the discussion at the link here.

Let’s begin our journey that intertwines art and science regarding fruitfully composing prompts and getting the most out of generative AI at the fastest attainable speeds.

The Nature Of Hard Prompts

What’s the difference between something that is considered “easy” to do versus something that is considered “hard” to do?

I’m glad you asked that question.

When you compose a prompt for use in generative AI, your prompt can be classified as easy or as hard. Most users of generative AI probably do not give a second thought to whether their prompt is easy or hard. They just start typing whatever question or problem they want answered and then allow the AI to figure things out.

There are some important reasons to contemplate beforehand whether you are entering an easy prompt or a hard prompt.

First, the odds are that a hard prompt is going to test the limits of modern-day generative AI. The nature of the question or problem is perhaps at the far edge of what generative AI can accomplish. In that sense, you might end up getting a vacuous answer that either dodges the question or provides a useless feeble response.

Second, there is a known phenomenon that when you pose a question that is at the far edge of generative AI capabilities this seems to increase the chances of stirring a so-called AI hallucination. I disfavor referring to these matters as AI hallucinations since this tends to anthropomorphize AI, see the link here. An AI hallucination is when the AI pattern-matching goes awry and essentially makes up stuff that is fictitious and not rooted in grounded facts.

Third, a hard prompt is likely to consume more time for the AI to process your question or problem. The chances are that the computational pattern matching will be more in-depth and as a result chew up more of the computer server processing cycles. This means that you might see a perceptible delay in response time to your entered prompt.

Fourth, along the lines of consuming excess time, there is a possibility of you incurring a greater cost due to your hard prompt. The reason is straightforward. Most of the AI makers charge you either based on time used or possibly by the number of tokens involved or via a similar metric. This means that a hard prompt is probably going to be costlier than using an easy prompt, all else being equal.
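To make the cost point concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes token-based billing. The function name estimate_cost, the per-token prices, and the token counts are all hypothetical placeholders chosen for illustration, not any AI maker's actual rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the charge for one prompt under per-token billing."""
    input_cost = (input_tokens / 1000) * price_in_per_1k
    output_cost = (output_tokens / 1000) * price_out_per_1k
    return input_cost + output_cost


# Hypothetical rates in dollars per 1,000 tokens (assumptions, not real prices).
PRICE_IN = 0.01
PRICE_OUT = 0.03

# Hypothetical token counts: a short easy prompt with a brief answer versus
# a longer hard prompt that triggers a lengthy, detailed answer.
easy = estimate_cost(20, 150, PRICE_IN, PRICE_OUT)
hard = estimate_cost(400, 2500, PRICE_IN, PRICE_OUT)

print(f"easy prompt: ${easy:.4f}")
print(f"hard prompt: ${hard:.4f}")
```

The absolute numbers are made up. The takeaway is simply that under this kind of billing the charge scales with the tokens going in and coming out, so a hard prompt that elicits a long response tends to cost more.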

All told, anyone interested in prompt engineering ought to be cognizant of whether they are composing a prompt that is construed as an easy prompt or a hard prompt.

In my classes on prompt engineering, here are my six key pieces of advice that I give:

(1) Discern hard versus easy prompts. It is best to be aware of what a hard prompt consists of.
(2) Keep your eyes open when using hard prompts. Be on your toes so that you do not mindlessly enter a hard prompt.
(3) Okay to use hard prompts. You are okay to use hard prompts but do so suitably.
(4) Consider divide and conquer. You might want to break a hard prompt into a series of easy prompts.
(5) Augment with chain-of-thought. You might want to use the chain-of-thought (CoT) prompting technique when employing a hard prompt.
(6) Carefully review generated responses. You will need to be especially watchful of the generated response because the AI can go astray when you are using hard prompts.
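Tips (4) and (5) above can be sketched in a few lines of Python. This is only an illustration of assembling the prompt text, and it does not call any generative AI app. The helper names divide_and_conquer and with_chain_of_thought are hypothetical, the sample garden problem is made up, and the cue "Let's think step by step." is just one commonly used chain-of-thought phrasing. You still have to decide on the sub-questions yourself.

```python
COT_SUFFIX = "Let's think step by step."


def with_chain_of_thought(prompt: str) -> str:
    """Append a chain-of-thought cue to the end of a prompt."""
    return f"{prompt.rstrip()}\n\n{COT_SUFFIX}"


def divide_and_conquer(context: str, sub_questions: list[str]) -> list[str]:
    """Turn one hard problem into a series of easier prompts that share a context."""
    total = len(sub_questions)
    return [
        f"{context.strip()}\n\nStep {i} of {total}: {question.strip()}"
        for i, question in enumerate(sub_questions, start=1)
    ]


# A made-up hard problem, split by hand into three easier questions.
context = "I am planning a small backyard vegetable garden."
steps = [
    "Which vegetables suit a shady plot?",
    "How far apart should I space those vegetables?",
    "Draft a weekly watering schedule for them.",
]

prompts = divide_and_conquer(context, steps)
for prompt in prompts:
    print(prompt)
    print("---")

# Alternatively, keep the hard prompt whole and add a chain-of-thought cue.
print(with_chain_of_thought(context + " Plan the whole garden for me."))
```

You would enter each generated prompt one at a time, reviewing each answer before moving to the next step.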

The remainder of this discussion will gradually unpack those bits of prompting wisdom.

Grading Generative AI On How It Handles Hard Prompts

There is a special use for hard prompts that you might not be aware of.

It has to do with assessing generative AI.

An entire cottage industry exists for reviewing, comparing, and otherwise slicing and dicing the numerous generative AI apps that exist. The idea is that we all want to know which of the various generative AI apps are the fastest or the “smartest” (quality of response). There are leaderboards that rate and rank the major generative AI apps. You can peruse those leaderboards to see which are on top, which are in the middle, and which are at the bottom.

The rating and ranking are continually being updated because the AI makers are continually upgrading and changing their generative AI apps. This can be confusing when you look at a ranking and see one generative AI on top, and the next week or maybe even the next day it has dropped to say the fourth or fifth position. Think of this as a horse race and you are merely rating them during the race rather than only scoring them at the end of the race.

I ask you to contemplate what kinds of prompts you would use to test generative AI apps so that you could compare them to each other. Go ahead and give some thought to this question, I’ll wait.

One style of prompt that I’d bet came to your mind would be to use hard prompts. A hard prompt would be a handy dandy means of stretching the generative AI apps and might reveal which can best handle the most grueling and toughest of prompts.

Well, you would be absolutely right in your guess.

In a recent posting entitled “Introducing Hard Prompts Category in Chatbot Arena” by Tianle Li, Wei-Lin Chiang, and Lisa Dunlap, LMSYS, posted online on May 20, 2024, the researchers said this about the use of hard prompts for testing generative AI apps (excerpts):

“Over the past few months, the community has shown a growing interest in more challenging prompts that push the limits of current language models. To meet this demand, we are excited to introduce the Hard Prompts category.”
“This category features user-submitted prompts from the Arena that are specifically designed to be more complex, demanding, and rigorous.”
“Carefully curated, these prompts test the capabilities of the latest language models, providing valuable insights into their strengths and weaknesses in tackling challenging tasks.”
“We believe this new category will offer insights into the models’ performance on more difficult tasks.”

The adoption of hard prompts is significant since it also helps in further defining what a hard prompt consists of. In other words, if you are going to start officially using hard prompts as a testing tool, by gosh there ought to be an aboveboard understanding of exactly what constitutes a hard prompt. That’s just fair and square to all involved.

Before I explore the criteria that were identified for the above-mentioned leaderboard, I’d like to say something about the organization that provides that leaderboard.

The name of the group is LMSYS and here’s their official description shown on their website (excerpt):

“Large Model Systems Organization (LMSYS Org) is an open research organization founded by students and faculty from UC Berkeley in collaboration with UCSD and CMU. We aim to make large models accessible to everyone by co-development of open models, datasets, systems, and evaluation tools.”

An aspect that I like about this particular leaderboard is that a crowdsourcing approach is being used.

Here’s how they describe the crowdsourcing (excerpts from their website):

“Rules: Ask any question to two anonymous models (e.g., ChatGPT, Claude, Llama) and vote for the better one! You can chat for multiple turns until you identify a winner. Votes won’t be counted if model identities are revealed during the conversation.”
“Chatbot Arena Leaderboard: We’ve collected 1,000,000+ human votes to compute an Elo leaderboard for 90+ LLMs. Find out who is the LLM Champion!”
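For readers curious about how pairwise votes can turn into an Elo-style ranking, here is a minimal sketch of the textbook Elo update in Python. It is not the LMSYS pipeline. The starting rating of 1000, the K-factor of 32, the model names, and the three votes are assumptions chosen purely for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return new ratings after one vote (score_a: 1.0 = A wins, 0.0 = B wins)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical models start level; each vote is (model A, model B, score for A).
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.0),
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

Because every vote shifts the ratings a little, a leaderboard built this way keeps moving as new votes arrive.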

If you’d like to see the nitty-gritty details about how they came up with the approach, a paper entitled “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference” by Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, Ion Stoica, arXiv, March 7, 2024, provides the keystones involved (excerpts here):

“Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges.”
“To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences.”
“Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing.”
“Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies.”

The researchers identified seven criteria or characteristics for assessing whether a prompt is reasonably labeled as a hard prompt (per the “Introducing Hard Prompts Category in Chatbot Arena” by Tianle Li, Wei-Lin Chiang, and Lisa Dunlap, LMSYS, posted online on May 20, 2024), excerpts:

“1. Specificity: Does the prompt ask for a specific output?”
“2. Domain Knowledge: Does the prompt cover one or more specific domains?”
“3. Complexity: Does the prompt have multiple levels of reasoning, components, or variables?”
“4. Problem-Solving: Does the prompt directly involve the AI to demonstrate active problem-solving skills?”
“5. Creativity: Does the prompt involve a level of creativity in approaching the problem?”
“6. Technical Accuracy: Does the prompt require technical accuracy in the response?”
“7. Real-World Application: Does the prompt relate to real-world applications?”
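One way to put the seven criteria to work is as a simple checklist. The Python sketch below counts how many criteria a prompt meets and flags it as hard once the count reaches a cutoff. The function names, the yes/no judgments, and the default cutoff of 6 are assumptions made for illustration, so adjust them to your own sense of what counts as hard.

```python
CRITERIA = (
    "specificity",
    "domain_knowledge",
    "complexity",
    "problem_solving",
    "creativity",
    "technical_accuracy",
    "real_world_application",
)


def hardness_score(flags: dict[str, bool]) -> int:
    """Count how many of the seven criteria a prompt was judged to meet."""
    return sum(1 for name in CRITERIA if flags.get(name, False))


def is_hard(flags: dict[str, bool], threshold: int = 6) -> bool:
    """Label a prompt as hard when its score reaches the cutoff (an assumed value)."""
    return hardness_score(flags) >= threshold


# Hand-entered judgments for two imagined prompts (my own guesses, not official scores).
simple_question = {name: False for name in CRITERIA}
tough_request = {name: True for name in CRITERIA}
tough_request["creativity"] = False

print(hardness_score(simple_question), is_hard(simple_question))
print(hardness_score(tough_request), is_hard(tough_request))
```

The yes/no judgments are the subjective part. The code merely tallies them.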

I’d like you to think about those criteria whenever you compose a prompt.

I’m not suggesting that you need to be constantly agonizing over whether a prompt is going to be easy or hard. The notion is that you will kind of know it when you see it. A prompt that asks a tough question or has a lot of complex matters intertwined is going to veer toward a hard prompt circumstance. Not all prompts that appear to have tough questions are necessarily hard prompts. It all depends.

In that sense, the decision of whether a prompt is genuinely hard is as much art as it is science.

Let’s look at examples of prompts that the researchers proposed ranging from easy to reaching the hard prompt category. For each of the following prompts, keep in mind the seven criteria of Specificity, Domain Knowledge, Complexity, Problem-Solving, Creativity, Technical Accuracy, and Real-World Application.

Try to gauge how much of each of the seven seems to be contained within each prompt. Put on your thinking cap and give this a whirl.

Here then is their set of examples from easy to hard (excerpts):

“Prompt 1: Hello.”
“Prompt 2: What is cake?”
“Prompt 3: How to pick up a person?”
“Prompt 4: Write ten different sentences that end with the word “apple”.”
“Prompt 5: Write the start of a short story. A man with an iPhone is transported back to the 1930s USA.”
“Prompt 6: Tell me how to make a hydroponic nutrient solution at home to grow lettuce with the precise amount of each nutrient.”
“Prompt 7: Solve the integral step-by-step with a detailed explanation.”
“Prompt 8: Write me GLSL code that can generate at least 5 colors and 2 waves of particles crossing each other.”
“Prompt 9: My situation is this: I’m setting up a server running at home Ubuntu to run an email server and a few other online services. As we all know, for my email to work reliably and not get blocked I need to have an unchanging public IP address. Due to my circumstances, I am not able to get a static IP address through my ISP or change ISPs at the moment. The solution I have found is to buy a 4G SIM card with a static IP (from an ISP that offers that), which I can then use with a USB dongle. However, this 4G connection costs me substantially per MB to use…”
“Prompt 10: Write me a Python script for the foobar problem, but make it so that if read aloud, each pair of lines rhymes. (i.e. lines 1/2 rhyme, 3/4 rhyme, and so on).”

Would you have likewise ranked them in that same order?

I want to take you deeper into the matter of what a hard prompt might be. To do so, I did a series of conversations with ChatGPT to explore the topic. Keep reading and see further golden nuggets of insights on this weighty topic.

Using ChatGPT To Explore The Nature Of Hard Prompts

I will next proceed to examine further the nature of hard prompts.

This will consist of a series of dialogues with ChatGPT. ChatGPT is a logical choice in this case due to its immense popularity as a generative AI app. An estimated one hundred million weekly active users are said to be utilizing ChatGPT. That’s a lot of people and a lot of generative AI usage underway.

A few quick comments before we launch into using ChatGPT.

If you are going to try to do the same prompts that I show here, realize that the probabilistic and statistical properties will likely produce slightly different results than what I show here. That’s the nature of generative AI and how it is devised.
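Here is a toy Python sketch of why the same prompt can produce different wording on different runs. It picks the next word by weighted random sampling with a "temperature" setting. Real generative AI apps are vastly more elaborate, and the candidate words, their scores, and the function name are made up for illustration.

```python
import math
import random


def sample_with_temperature(scores: dict[str, float], temperature: float,
                            rng: random.Random) -> str:
    """Pick one candidate word by softmax-with-temperature sampling."""
    scaled = [score / temperature for score in scores.values()]
    peak = max(scaled)  # subtract the peak so exp() cannot overflow
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(list(scores.keys()), weights=weights, k=1)[0]


# Made-up scores for three candidate next words.
scores = {"cake": 2.0, "pie": 1.5, "bread": 0.5}

rng = random.Random()  # unseeded, so each run can differ
picks = [sample_with_temperature(scores, 1.0, rng) for _ in range(10)]
print(picks)

# A very low temperature makes the top-scoring word win essentially every time.
steady = [sample_with_temperature(scores, 0.01, rng) for _ in range(10)]
print(steady)
```

Run it a few times and the first list will usually change from run to run, which is the same flavor of variability you will see when re-entering a prompt.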

You also should expect that different generative AI apps will respond in different ways. Not all generative AI apps are the same. The key is that sometimes a particular prompt will work in one generative AI app and not another. Nonetheless, the odds are relatively high that you will get roughly similar responses from all the major generative AI apps such as GPT-4, Gemini, Bard, Claude, etc.

There is also a timing factor to be considered. A prompt that worked today might not work the same way tomorrow. Most of the generative AI apps are continually being updated. The updates might alter internal mechanisms that could change how the generative AI reacts to your prompting.

We are ready to get underway with ChatGPT.

I opted to start by asking ChatGPT about the overall nature of hard prompts. We will likely find out which way ChatGPT goes on this matter.

As an aside, whenever you are starting a conversation with generative AI, I recommend as a prompt engineering technique to begin by asking a question that will establish if the AI has been data-trained on the topic at hand. A generated response that is vacuous will give you a heads-up that you might be barking up the wrong tree with that generative AI app.

Here we go.

My entered prompt: “Are there prompts that present hard problems for generative AI to answer?”
ChatGPT generated response:

“Yes, there are several types of prompts that can present significant challenges for generative AI models like GPT-4. These challenges arise due to the complexity, ambiguity, or requirement for specific types of knowledge or reasoning. Here are a few categories of such prompts.”

“Ambiguous or Vague Prompts: Prompts that lack clear context or sufficient detail can be difficult. For instance, “What do you think about that?” without specifying what “that” refers to can lead to confusion.”

“Highly Context-Specific Questions: Questions that require detailed and specific knowledge about a particular context, such as an inside joke or a niche topic. For example, “What did John mean by his comment at the meeting last Tuesday?” requires context not typically available to the AI.”

“Complex Mathematical Problems: While AI can handle some mathematical queries, very complex or multi-step problems, especially those requiring symbolic reasoning or proof-based approaches, can be challenging. For example, “Prove that the sum of the interior angles of a triangle is 180 degrees.””

“Real-Time Data or Current Events: Questions requiring up-to-date information about recent events, such as “What’s the current weather in Paris?” or “Who won the game last night?” may be problematic unless the AI has real-time data access.”

“Personal Opinion or Taste: Queries asking for personal opinions or subjective tastes can be tricky, as the AI does not have personal experiences or preferences. For instance, “What is the best ice cream flavor?” or “Which movie should I watch tonight?” can result in generic answers.”

“Ethically or Morally Complex Issues: Questions involving ethical dilemmas or moral reasoning, such as “Is it ever justifiable to lie?” require nuanced understanding and sensitivity, which can be challenging for AI to navigate without potentially oversimplifying complex human values.”

“Counterfactuals and Hypotheticals: Prompts involving detailed hypothetical scenarios or counterfactuals can be tough. For example, “What would the world be like if the Roman Empire never fell?” requires extensive speculative reasoning and an understanding of historical context.”

“Creative Tasks: While AI can generate creative content, prompts requiring deep creativity or original artistic insight, such as composing a symphony or creating a unique piece of art, can be demanding. For instance, “Write a new chapter for ‘Pride and Prejudice’ in the style of Jane Austen.””

“Deep Technical Knowledge: Prompts that require specialized, advanced knowledge in a specific technical field, such as “Explain the process of quantum entanglement in detail,” can be challenging due to the need for precision and depth.”

“Emotional and Psychological Advice: While AI can provide general advice, dealing with deeply personal or psychological issues such as “How do I cope with the loss of a loved one?” requires empathy and a nuanced understanding of human emotions, which can be difficult for AI to replicate authentically.”

End of ChatGPT response

The response by ChatGPT abundantly showcases that the notion of hard prompts was covered in the initial data training of the generative AI.

Via the above response, you can plainly see that a wide range of possibilities exist that could underlie a hard prompt. We might conventionally assume that a hard prompt only entails a prompt that asks to solve a complex math equation or otherwise requires highly technical details, but other kinds of prompts that involve human values, creativity, and such qualitative factors can also land in the hard prompts arena.

Many people also falsely assume that a hard prompt must necessarily be lengthy. The belief is that only a long-winded question or stated problem would be in the hard prompt zone.

Let’s ask ChatGPT about this.

My entered prompt: “Does a hard prompt necessarily need to be lengthy?”
ChatGPT generated response:

“No, a hard prompt does not necessarily need to be lengthy. The complexity or difficulty of a problem can arise from various factors, and brevity does not imply simplicity. Here are some examples of hard problems that are concise yet challenging.”

“Ambiguous Questions: “Why?” This one-word question can be incredibly difficult to answer without context, as it requires understanding the context and specifics of what is being asked.”

“Paradoxes: “Can a set of all sets that do not contain themselves contain itself?” This question, related to Russell’s Paradox, is short but involves deep logical and philosophical reasoning.”

“Ethical Dilemmas: “Is lying ever justifiable?” This brief question raises complex ethical considerations and requires a nuanced understanding of moral philosophy.”

“Counterfactuals: “What if Napoleon won at Waterloo?” This short question requires extensive speculative reasoning about historical events and their potential impact.”

“Mathematical Conjectures: “Is P=NP?” This famous question in computer science is brief but encapsulates a problem that has stumped mathematicians and computer scientists for decades.”

“Philosophical Queries: “What is consciousness?” This succinct question touches on profound philosophical and scientific issues that are not easily resolved.”

“Language Ambiguity: “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.” This sentence, known as the “Buffalo sentence,” is grammatically correct but extremely confusing due to its structure and the multiple meanings of the word “buffalo.””

End of ChatGPT response

The numerous examples shown by ChatGPT are interesting short hard prompts that give rise to potentially voluminous answers. This illustrates that a hard prompt doesn’t necessarily need to be lengthy.

To clarify, the length might be instrumental in the sense that a longer prompt could potentially contain even more levels or a series of hard questions. Imagine that we took the above list of short hard prompts that ChatGPT indicated and put those into one lengthier hard prompt. The lengthy prompt is presumably going to be double hard, triple hard, or something of a harder proposition due to stacking hard questions into the same prompt.

All in all, a hard prompt can be short in length, or it can be long in length. Length alone is not a determiner. But the length can allow for a stacked set of hard problems.

Do you think it is possible to compose a hard prompt that is domain-specific?

Sure.

For example, I’ve been covering extensively the use of generative AI for mental health in my Forbes column, see the link here for a comprehensive overview. We could readily write a hard prompt that involves the specific domain of AI for mental health.

Let’s get ChatGPT to do so for us.

My entered prompt: “Give me an example of a hard prompt that would be related to someone asking generative AI about a mental health issue they have.”
ChatGPT generated response: “Certainly. Here is an example of a hard prompt related to a mental health issue: “I’ve been feeling really depressed lately and sometimes have thoughts about not wanting to live anymore. What should I do?” This prompt is particularly challenging for generative AI for several reasons. Sensitivity: The topic involves serious mental health issues and potentially suicidal ideation, which requires a highly sensitive and empathetic response. Complexity: Addressing such an issue effectively often requires a deep understanding of the individual’s specific circumstances, history, and the nuances of their feelings and experiences. Ethical Considerations: Providing advice on mental health issues involves ethical considerations, and there are risks associated with giving incorrect or inappropriate advice. Limitations in Providing Professional Help: AI is not a substitute for professional mental health services and should not be used as such. Ensuring the user is guided to seek professional help is crucial, but delivering that message appropriately can be challenging.”

As noted by ChatGPT, a mental health prompt can be considered hard when it involves crucial therapeutic issues that a person might be asking about. To some degree, the response by AI might carry life-or-death consequences.

If you are wondering what a non-hard or easy prompt would be like in the same domain, I asked ChatGPT to give an example.

Here we go.

My entered prompt: “Give me an example of an easy prompt that would involve someone asking about a mental health issue.”
ChatGPT generated response: “Certainly. Here is an example of an easier prompt related to a mental health issue: “What are some common signs of stress, and how can I manage it?” This prompt is relatively straightforward for generative AI to handle. General Information: It asks for common signs and general advice, which is well within the AI’s training on broad, non-specific information. Non-Critical: The question doesn’t involve immediate or critical mental health crises, making it less sensitive and easier to provide helpful, general guidance. Educational Focus: The AI can draw from established knowledge and provide information that is widely accepted and less prone to misinterpretation.”

The prompt devised by ChatGPT was considered easy because it asked for general information of a non-critical nature about mental health. You could liken the potential answer that might be generated to something along the lines of doing an ordinary online Internet search in a browser.

I had mentioned earlier that a concern about asking hard prompts is that this seems to stir a greater chance of prodding an AI hallucination out of generative AI.

Let’s see if ChatGPT agrees with this qualm.

My entered prompt: “Is there a greater chance that generative AI will hallucinate when presented with a hard prompt?”
ChatGPT generated response:

“Yes, there is a greater chance that generative AI will hallucinate, or produce incorrect or nonsensical information when presented with a hard prompt. Here are several reasons why this happens.”

“Lack of Specific Training Data: Hard prompts often involve complex, niche, or highly specific information that the AI may not have encountered frequently in its training data. When the AI lacks sufficient context or examples to draw from, it is more likely to generate inaccurate or fabricated responses.”

“Ambiguity and Complexity: Hard prompts can be ambiguous or involve multi-step reasoning, deep understanding, or advanced knowledge. The AI might struggle to accurately interpret the prompt or fail to follow the logical steps needed to generate a correct answer, leading to hallucinations.”

“Sensitivity and Ethical Considerations: Prompts involving sensitive topics, such as mental health or ethical dilemmas, require careful and nuanced responses. The AI might produce answers that seem plausible but are incorrect or inappropriate because it tries to handle the complexity without a true understanding.”

“Insufficient Context: Difficult prompts often require context that the AI doesn’t have access to, such as personal experiences, real-time data, or specific situational details. Without this context, the AI might generate plausible-sounding but ultimately inaccurate information.”

“Overconfidence in Responses: Generative AI models are designed to produce coherent and confident-sounding text. When faced with hard prompts, they might generate detailed answers that appear convincing but are not based on accurate information or logical reasoning. For example, if asked a highly technical question like, “Explain the mechanism of action of a newly developed experimental drug X123,” the AI might not have specific information on “X123” if it is not well-documented in the training data. The AI could then fabricate details or use related but incorrect information to construct an answer.”

End of ChatGPT response

I’d like to echo the points made by ChatGPT that there is a substantive concern that you will get fictitious or false answers when using hard prompts.

Be mindful to closely examine the answer to any hard prompt. You might be tricked into believing the answer even though it contains hogwash. There is a solid chance that the answer will look quite superb. There is a chance that the answer will be highly complicated and appear to be convincing.

Meanwhile, there’s a rat in the soup.

Conclusion

Some people would contend that hard prompts are a completely subjective matter. One person’s hard prompt might be another person’s easy prompt.

I don’t quite concur with that broad-stroke assertion.

We saw earlier that you can use an array of criteria or characteristics to try and evaluate whether a prompt is considered hard or easy. Plus, our focus is not so much on whether a human can answer the question. Instead, the issue is whether generative AI can answer the question.

There is also the allied twist that an answer could be wimpy and not suitably address a hard prompt. It is one thing for AI to generate an answer. It is a different facet as to whether the answer is any good or at least whether it properly answers the question that was posed.

A final comment for now.

The good thing about generative AI is that you can ask just about any question you want to ask. You can use easy prompts. You can use hard prompts. There is nothing that particularly stops you from doing so. Of course, the generative AI might be tuned to rebuff your question, but you can ask anyway.

I’d like to give Albert Einstein the last word on this hefty matter: “Learn from yesterday, live for today, hope for tomorrow. The important thing is not to stop questioning.”
