One of the most common use cases for LLMs is chat. However, an LLM is stateless and has no memory in which to store the conversation.
Given this, we need to send the LLM the entire chat history with every request so that it understands the context and responds properly.
This can be challenging from a cost perspective, because we keep resending information that the LLM has already processed. Most LLM providers expose their models through APIs and charge for input and output tokens. This means that with each follow-up question in a chat, you pay again for the questions and answers that came before it.
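To make the cost growth concrete, here is a minimal sketch. It assumes that every question and every answer has the same fixed size in tokens; the numbers are illustrative and not measured from any model.

```python
def cumulative_input_tokens(turns, question_tokens, answer_tokens):
    """Total input tokens billed over `turns` when the full history is resent each turn."""
    total = 0
    history = 0  # tokens already in the conversation before the current turn
    for _ in range(turns):
        total += history + question_tokens            # history plus the new question
        history += question_tokens + answer_tokens    # the new pair joins the history
    return total

# Illustrative sizes only: 20-token questions, 200-token answers.
for turns in (1, 3, 7):
    print(turns, cumulative_input_tokens(turns, question_tokens=20, answer_tokens=200))
```

Under these assumptions the billed input grows roughly with the square of the number of turns, which is why long chats get expensive.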
I will use the Amazon Bedrock service and a Claude 3 LLM for this demo.
I also assume that you have already set AWS credentials on your workstation or that you are running this code from your notebook in Amazon SageMaker Studio.
The same information that you see here appears in the notebook that you can access in this Git repository.
There is also an option to use the implementation of the same idea by LangChain.
Current Architecture
Usually, the process looks like this: the user sends a question to the chat. The chat history is stored in a low-latency database (such as DynamoDB), and each time the whole history plus the new question is sent to the LLM. The response is added to the database and also sent to the user.
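The flow described above can be sketched as follows. This is only an illustration: the dictionary `chat_store` is a hypothetical in-memory stand-in for the database, and `fake_llm` is a stub in place of a real model call.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the low-latency store (e.g. DynamoDB).
chat_store: Dict[str, List[dict]] = {}

def handle_question(session_id: str, question: str,
                    call_llm: Callable[[List[dict]], str]) -> str:
    """Append the question, send the whole history, store and return the answer."""
    history = chat_store.setdefault(session_id, [])
    history.append({"role": "user", "content": question})
    answer = call_llm(history)  # the whole history is sent every time
    history.append({"role": "assistant", "content": answer})
    return answer

def fake_llm(messages: List[dict]) -> str:
    # Stub: a real implementation would call the model here.
    return f"answer based on {len(messages)} messages"

print(handle_question("s1", "first question", fake_llm))
print(handle_question("s1", "second question", fake_llm))
```

The second call sends three messages instead of one, which is the overhead the rest of this post tries to reduce.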
Approach to reduce the number of tokens
If my chat history is only two or three questions, I probably should not do anything. The token overhead is not that large.
But if my conversation contains multiple questions, I propose the following enhancement:
After three or four question/answer pairs, stop storing the full conversation. Summarize the conversation up to this point and store only the summary. All subsequent requests are made in the context of the summary.
The architecture remains the same, but now we send the LLM the summarized conversation and the user's question. In response, we ask the model to answer the user's question and to provide a new summary that includes the latest question and answer.
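The switch from full history to summary can be sketched as below. This is a simplified illustration with stubbed model calls; `ChatState`, `ask`, and the threshold `SUMMARIZE_AFTER_PAIRS` are names I made up for the sketch, and the three callables stand in for the real LLM requests.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

SUMMARIZE_AFTER_PAIRS = 3  # assumed threshold ("three or four pairs")

@dataclass
class ChatState:
    messages: List[dict] = field(default_factory=list)  # full history, used before the switch
    summary: str = ""                                   # running summary, used after the switch

def ask(state: ChatState, question: str,
        full_history_llm: Callable[[List[dict]], str],
        summarize_llm: Callable[[List[dict]], str],
        summary_llm: Callable[[str, str], Tuple[str, str]]) -> str:
    # Switch to summary mode once enough question/answer pairs have accumulated.
    if not state.summary and len(state.messages) // 2 >= SUMMARIZE_AFTER_PAIRS:
        state.summary = summarize_llm(state.messages)
        state.messages = []  # the raw history is no longer stored
    if state.summary:
        # The model returns the answer together with an updated summary.
        answer, state.summary = summary_llm(state.summary, question)
        return answer
    state.messages.append({"role": "user", "content": question})
    answer = full_history_llm(state.messages)
    state.messages.append({"role": "assistant", "content": answer})
    return answer

# Stubs in place of real model calls.
full = lambda msgs: f"full:{len(msgs)}"
summarize = lambda msgs: f"summary of {len(msgs)} messages"
with_summary = lambda summary, q: (f"sum:{q}", summary + " + " + q)

state = ChatState()
for q in ["q1", "q2", "q3", "q4"]:
    print(ask(state, q, full, summarize, with_summary))
```

The first three questions go through the full-history path; the fourth triggers the summary and every later request carries only the summary plus the new question.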
import boto3
import json
import logging
bedrock_runtime = boto3.client(service_name='bedrock-runtime')
messages=[]
input_tokens=0
output_tokens=0
This function simulates the standard approach: the entire conversation history is sent to the LLM on every call. We use the "input_tokens" and "output_tokens" variables to count the number of tokens.
def generate_message(bedrock_runtime, model_id, system_prompt, input_msg, max_tokens):
    global input_tokens
    global output_tokens
    user_message =  {"role": "user", "content": input_msg}
    messages.append(user_message)
    
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": messages
        }  
    )  
    
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
    assistant_res = response_body['content'][0]['text']
    input_tokens += response_body['usage']['input_tokens']
    output_tokens += response_body['usage']['output_tokens']
    messages.append({"role": "assistant", "content": assistant_res})
    return assistant_res
We will use the Claude 3 Sonnet model.
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'
system_prompt = """
You are an advanced AI assistant named Claude 3, designed to engage in open-ended conversation on a wide range of topics. Your goal is to be a helpful, friendly, and knowledgeable conversational partner. You should demonstrate strong language and communication abilities, and draw upon a broad base of knowledge to discuss topics like science, history, current events, arts and culture, and more.
Respond to the human in a natural, human-like way, using appropriate tone, sentiment, and personality. Be curious and ask follow-up questions to better understand the human's interests and needs. Provide detailed and informative responses, while also being concise and to-the-point.
Be helpful, thoughtful, and intellectually engaging. Aim to be a pleasant and valuable conversational companion. If you don't know the answer to something, say so, but offer to research it or provide related information. Avoid giving opinions on sensitive topics like politics or controversial issues unless directly asked.
Overall, your goal is to have an enjoyable, productive, and enlightening conversation with the human. Draw upon your extensive knowledge and capabilities to be an exceptional AI chatbot. Try to keep answers as short as possible without compromising on the content.
"""
max_tokens = 1000
input_msg="Can you help me choose the movie based on my preferences? "
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
input_msg="I like classic movies about historic events. For example, Spartacus, which features Kirk Douglas."
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
input_msg="I prefer movies that won some prizes, like the Oscars or the Cannes Festival."
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
At this point, I decided that, starting with the next question, I would use the new approach that includes summarization.
I created two new variables to store the tokens used so far: "store_for_later_input_tokens" and "store_for_later_output_tokens."
store_for_later_input_tokens=input_tokens
store_for_later_output_tokens=output_tokens
server_side_summary=""
concatenated_content = " ".join(element['content'] for element in messages)
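The join above drops the speaker roles. A possible variant that keeps them is sketched below; whether role labels actually improve the summary is my assumption and was not tested in this notebook, and `format_transcript` is a hypothetical helper name.

```python
def format_transcript(messages):
    """Join chat messages into one string, prefixing each message with its role."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

# Small illustrative history in the same shape as the `messages` list.
sample = [
    {"role": "user", "content": "Suggest a movie."},
    {"role": "assistant", "content": "How about Ben-Hur?"},
]
print(format_transcript(sample))
```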
I will use the "generate_summary" function to summarize the chat up to this point. I don't have a database in my notebook, so I store the summary in the server_side_summary variable.
summary_input_token=0
summary_output_token=0
def generate_summary(bedrock_runtime, model_id, input_msg, max_tokens):
    global summary_input_token
    global summary_output_token
    system_prompt="""
    This is the chat between an AI chatbot and a human. 
    AI chatbots help humans choose movies to watch. The AI bot proposes different movies, 
    and the human refines his search and demands from the movie. 
    User will provide you with the chat text. Your goal is to summarize the chat and reduce the conversation  length by at least 40% while preserving the content.
    """
  
    
    user_message =  {"role": "user", "content": f"the chat is: {input_msg}"}
    
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": [user_message],
            "temperature": 0.2,
        }  
    )  
    
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
   
    summary_input_token = response_body['usage']['input_tokens']
    summary_output_token = response_body['usage']['output_tokens']
    assistant_res = response_body['content'][0]['text']
    return assistant_res
I used Claude 3 Haiku for summarization because I observed that the summaries it creates are more concrete.
server_side_summary = generate_summary(bedrock_runtime, "anthropic.claude-3-haiku-20240307-v1:0", concatenated_content, max_tokens)
print(server_side_summary)
Summarize and answer the question
The "generate_summary_and_return_answer" function answers the user's question and summarizes the conversation. The return format is JSON for simplicity and easier parsing. The "sum_answ_input_token" and "sum_answ_output_token" variables accumulate the input and output tokens.
sum_answ_input_token=0
sum_answ_output_token=0
def generate_summary_and_return_answer(bedrock_runtime, model_id, input_msg, max_tokens):
    global sum_answ_input_token
    global sum_answ_output_token
    system_prompt=server_side_summary
    user_prompt=""". The user will now ask you for an additional recommendation. Address and answer the user's remark (or question). 
    After this, summarize the conversation. The summarization should preserve the context of all requests made by the user. Respond with a JSON object in the following format.
    {"answer": "<answer for the user's remark (or question)>", "summarization": "<summary of conversation including the last question and answer you provided>"}
    Make sure to produce only valid JSON according to JSON syntax rules.
    """
    
    user_message =  {"role": "user", "content": f"This is the user's remark (or question): {input_msg}"}
  
    
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt+user_prompt,
            "messages": [user_message]
        }  
    )  
    
    response = bedrock_runtime.invoke_model(body=body, modelId=model_id)
    response_body = json.loads(response.get('body').read())
    assistant_res = response_body['content'][0]['text']
    sum_answ_input_token += response_body['usage']['input_tokens']
    sum_answ_output_token += response_body['usage']['output_tokens']
   
    corrected_json_str  = assistant_res.replace('\n', '\\n')
   
    return corrected_json_str
response=generate_summary_and_return_answer(bedrock_runtime, model_id, "Can you narrow your list to movies with a duration of at least two hours?", max_tokens)
server_side_summary= json.loads(response)['summarization']
print(json.loads(response)['answer'])
response=generate_summary_and_return_answer(bedrock_runtime, model_id, "I also want only the movies with a happy end only", max_tokens)
server_side_summary= json.loads(response)['summarization']
print(json.loads(response)['answer'])
response=generate_summary_and_return_answer(bedrock_runtime, model_id, "Limit only to movies produced during last 20 years", max_tokens)
server_side_summary= json.loads(response)['summarization']
print(json.loads(response)['answer'])
response=generate_summary_and_return_answer(bedrock_runtime, model_id, "I actually like The Imitation Game. Tell me the plot", max_tokens)
server_side_summary= json.loads(response)['summarization']
print(json.loads(response)['answer'])
total_input_with_summary=store_for_later_input_tokens+summary_input_token+sum_answ_input_token
total_output_with_summary=store_for_later_output_tokens+summary_output_token+sum_answ_output_token
print(f"Total input tokens with summarization: {total_input_with_summary}")
print(f"Total output tokens with summarization: {total_output_with_summary}")
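The calls above parse each response with json.loads twice and will raise an exception if the model wraps the JSON in extra text. A more defensive parser could look like the following sketch. The helper name `parse_answer_and_summary` and the fallback behaviour (return the raw text and keep the previous summary) are my assumptions, not part of the original notebook.

```python
import json

def parse_answer_and_summary(raw_text, previous_summary):
    """Return (answer, summary); fall back to the raw text and the old summary on bad JSON."""
    # Keep only the span from the first "{" to the last "}" in case the model adds extra text.
    start, end = raw_text.find("{"), raw_text.rfind("}")
    try:
        # strict=False lets the decoder accept raw newlines inside string values.
        data = json.loads(raw_text[start:end + 1], strict=False)
        return data["answer"], data["summarization"]
    except (ValueError, KeyError, TypeError):
        return raw_text, previous_summary

# Illustrative model output with a preamble and a raw newline inside a string.
raw = 'Here you go:\n{"answer": "Line one\nLine two", "summarization": "User wants long movies."}'
answer, summary = parse_answer_and_summary(raw, previous_summary="old summary")
print(answer)
print(summary)
```

Keeping the previous summary on failure means one malformed response does not wipe the conversation context.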
Without Summarization
input_msg="Can you narrow your list to movies with a duration of at least two hours?"
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
input_msg="I also want only the movies with a happy end only"
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
input_msg="Limit only to movies produced during last 20 years"
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
input_msg="I actually like The Imitation Game. Tell me the plot"
response = generate_message (bedrock_runtime, model_id, system_prompt, input_msg, max_tokens)
print(response)
# The baseline never calls the summarizer, so only the chat tokens are counted here
total_input_no_summary=input_tokens
total_output_no_summary=output_tokens
print(f"Total input tokens without summarization: {total_input_no_summary}")
print(f"Total output tokens without summarization: {total_output_no_summary}")
import pandas as pd
import matplotlib.pyplot as plt
# Creating the DataFrame
data = {
    "No Summarization": [total_input_no_summary, total_output_no_summary],
    "With Summarization": [total_input_with_summary, total_output_with_summary]
}
df = pd.DataFrame(data, index=["Input", "Output"])
# Plotting the DataFrame as a bar chart
ax = df.plot(kind="bar", figsize=(10, 6), rot=0)
# Adding labels and title
plt.title("Input and Output Tokens with and without summarization")
plt.ylabel("Number of tokens")
# Display the plot
plt.show()
Let's also display the summary text to make sure that the context is preserved.
server_side_summary
It is important to understand that this is not academic research; it is mostly based on an intuitive approach. Different use cases may give different results. But for this specific case study, the results are remarkable.
Note that the number of output tokens is almost the same. This suggests that, despite the different input formats, the LLM returns almost the same answers. The remaining difference may be due to the non-zero default LLM temperature.
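To turn the token counts into a single comparison, something like the sketch below could be used. The helper names are hypothetical, and the token counts and per-1,000-token prices in the example are placeholders only; substitute the totals from your own run and the current prices for your model and region.

```python
def percent_saved(baseline_tokens, optimized_tokens):
    """Percentage reduction of optimized_tokens relative to baseline_tokens."""
    if baseline_tokens == 0:
        return 0.0
    return 100.0 * (baseline_tokens - optimized_tokens) / baseline_tokens

def estimated_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost estimate given token counts and prices per 1,000 tokens."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)

# Placeholder numbers, not results from this notebook.
print(percent_saved(baseline_tokens=10000, optimized_tokens=4000))
print(estimated_cost(10000, 2000, input_price_per_1k=0.003, output_price_per_1k=0.015))
```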
 
 