diff --git a/docs/README-USECASES.md b/docs/README-USECASES.md
index ff848ae0..cb5d5361 100644
--- a/docs/README-USECASES.md
+++ b/docs/README-USECASES.md
@@ -60,7 +60,7 @@ Depending on the context window supported by the LLM, you can either send a larg
 
 ### Summarization
 
-Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [Link to example here]
+Here is a GPTScript that sends a large document in batches to the LLM and produces a summary of the entire document. [hamlet-summarizer](../examples/hamlet-summarizer)
 
 Here is a GPTScript that reads the content of a large SQL database and produces a summary of the entire database. [Link to example here]
 
diff --git a/examples/hamlet-summarizer/.gitignore b/examples/hamlet-summarizer/.gitignore
new file mode 100644
index 00000000..f7275bbb
--- /dev/null
+++ b/examples/hamlet-summarizer/.gitignore
@@ -0,0 +1 @@
+venv/
diff --git a/examples/hamlet-summarizer/Hamlet.pdf b/examples/hamlet-summarizer/Hamlet.pdf
new file mode 100644
index 00000000..e6150634
Binary files /dev/null and b/examples/hamlet-summarizer/Hamlet.pdf differ
diff --git a/examples/hamlet-summarizer/README.md b/examples/hamlet-summarizer/README.md
new file mode 100644
index 00000000..ee172f16
--- /dev/null
+++ b/examples/hamlet-summarizer/README.md
@@ -0,0 +1,40 @@
+# Hamlet Summarizer
+
+This is an example tool that summarizes the contents of a large document in chunks.
+
+The example document we are using is the Shakespeare play Hamlet. It is about 51,000 tokens
+(according to OpenAI's tokenizer for GPT-4), so it can fit within the context window of models such as GPT-4 Turbo,
+but it serves as an example of how larger documents can be split up and summarized.
+This example splits it into chunks of 10,000 tokens.
+
+The Hamlet PDF is from https://nosweatshakespeare.com/hamlet-play/pdf/.
+
+## Design
+
+The script consists of three tools: a top-level tool that orchestrates everything, a summarizer that
+summarizes one chunk of text at a time, and a Python script that ingests the PDF, splits it into
+chunks, and returns a specific chunk based on an index.
+
+The summarizer tool reads the entire summary written up to the current chunk, summarizes the current
+chunk, and appends its summary to the end. For models with very small context windows, or for extremely
+large documents, this approach may still exceed the context window; in that case, another tool could be
+added that gives the summarizer only the previous few chunk summaries instead of all of them.
+
+## Run the Example
+
+```bash
+# Create a Python venv
+python3 -m venv venv
+
+# Source it
+source venv/bin/activate
+
+# Install the packages
+pip install -r requirements.txt
+
+# Set your OpenAI key
+export OPENAI_API_KEY=your-api-key
+
+# Run the example
+gptscript --cache=false hamlet-summarizer.gpt
+```
diff --git a/examples/hamlet-summarizer/hamlet-summarizer.gpt b/examples/hamlet-summarizer/hamlet-summarizer.gpt
new file mode 100644
index 00000000..4aa4a619
--- /dev/null
+++ b/examples/hamlet-summarizer/hamlet-summarizer.gpt
@@ -0,0 +1,35 @@
+tools: hamlet-summarizer, sys.read, sys.write
+
+First, create the file "summary.txt" if it does not already exist.
+
+You are a program that is tasked with fetching partial summaries of a play called Hamlet.
+
+Call the hamlet-summarizer tool to get each part of the summary. Begin with index 0. Do not proceed
+until the tool has responded to you.
+
+Once you get "No more content" from the hamlet-summarizer, stop calling it.
+Then, print the contents of the summary.txt file.
+
+---
+name: hamlet-summarizer
+tools: hamlet-retriever, sys.read, sys.append
+description: Summarizes a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
+args: index: (unsigned int) the index of the portion to summarize, beginning at 0
+
+You are a theater expert, and you're tasked with summarizing part of Hamlet.
+Get the part of Hamlet at index $index.
+Read the existing summary of Hamlet up to this point in summary.txt.
+
+Summarize the part at index $index. Include as many details as possible. Do not leave out any important plot points.
+Do not introduce the summary with "In this part of Hamlet", "In this segment", or any similar language.
+If a new character is introduced, be sure to explain who they are.
+Add two newlines to the end of your summary and append it to summary.txt.
+
+If you got "No more content", just say "No more content". Otherwise, say "Continue".
+
+---
+name: hamlet-retriever
+description: Returns a part of the text of Hamlet. Returns "No more content" if the index is greater than the number of parts.
+args: index: (unsigned int) the index of the part to return, beginning at 0
+
+#!python3 main.py "$index"
diff --git a/examples/hamlet-summarizer/main.py b/examples/hamlet-summarizer/main.py
new file mode 100644
index 00000000..0a0d1c77
--- /dev/null
+++ b/examples/hamlet-summarizer/main.py
@@ -0,0 +1,28 @@
+import tiktoken
+import sys
+from llama_index.readers.file import PyMuPDFReader
+from llama_index.core.node_parser import TokenTextSplitter
+
+# The index of the chunk to print, passed in by the hamlet-retriever tool.
+index = int(sys.argv[1])
+docs = PyMuPDFReader().load("Hamlet.pdf")
+
+# Combine the text of every page into a single string.
+combined = ""
+for doc in docs:
+    combined += doc.text
+
+# Split the combined text into chunks of roughly 10,000 GPT-4 tokens.
+splitter = TokenTextSplitter(
+    chunk_size=10000,
+    chunk_overlap=10,
+    tokenizer=tiktoken.encoding_for_model("gpt-4").encode)
+
+pieces = splitter.split_text(combined)
+
+# Signal to the caller that the end of the document has been reached.
+if index >= len(pieces):
+    print("No more content")
+    sys.exit(0)
+
+print(pieces[index])
diff --git a/examples/hamlet-summarizer/requirements.txt b/examples/hamlet-summarizer/requirements.txt
new file mode 100644
index 00000000..500efd39
--- /dev/null
+++ b/examples/hamlet-summarizer/requirements.txt
@@ -0,0 +1,3 @@
+tiktoken==0.6.0
+llama-index-core==0.10.14
+llama-index-readers-file==0.1.6
\ No newline at end of file