End-to-end data science project using Streamlit, Upstash, and OpenAI to build better knowledge navigation and comprehension using network analysis
This article will guide you through an end-to-end data science project using several state-of-the-art tools in the AI space. The tool is called Mind Mapper because it lets you create conceptual maps by injecting information into a knowledge base and retrieving it in a smart way.
The motivation was to go beyond the “simple” RAG framework, where a user queries a vector database and its response is then fed to an LLM like GPT-4 for an enriched answer.
Mind Mapper leverages RAG to create intermediate result representations that are useful for a form of knowledge intelligence, which in turn helps us better understand the output of RAG over long and unstructured documents.
Simply put, I want to use RAG as a foundational step to build diverse responses, not just textual ones. A mind map is one such response.
Here are some of the tool’s features:
- Manages text in basically all forms: copy-pasted, uploaded as text, or originating from an audio source (video is being considered too if the project is well received)
- Uses an in-project SQLite database for data persistence
- Leverages the state-of-the-art Upstash vector database to store vectors efficiently
- Chunks from the vector database are then used to create a knowledge graph of the information
- A final LLM is called to comment on the knowledge graph and extract insights
We’ll use Streamlit as the library for frontend rendering of our logic. All of the code will be written in Python.
If you want to take a look at the app you’ll be building, check it out here
I’ve uploaded a series of text documents copy-pasted from Wikipedia about prominent people in the AI world like Sam Altman, Andrej Karpathy, and more. We’ll query this knowledge base to demonstrate how the project works.
A mind map looks like this, when using a prompt like
“Who is Andrej Karpathy?”
Feel free to navigate the linked application, provide your OpenAI API key and Upstash REST URL + token, and prompt the existing knowledge base for some demo insights.
The deployed Streamlit app has the inputs section disabled to avoid exposing the database publicly. If you build the app from the ground up or clone it from GitHub, you’ll have the database available under the main branch of the project.
If this introduction sparked your interest, then join me and let’s dive deeper into the explanations and code!
Here’s the GitHub of the project if you want to follow along.
The software works following this algorithm:
1. the user uploads or pastes text into the software and saves the data into a database. The user can also upload an audio track, which gets transcribed thanks to OpenAI’s Whisper model
2. when the data is saved, it’s split into textual chunks and these chunks are then embedded using OpenAI’s ada-002 model
3. vectors are saved into the Upstash vector database, with metadata attached
4. when the user asks a question to the assistant, the query is embedded using the same model and that vector is used to retrieve the top n most similar chunks using the dot product similarity metric
5. these similar chunks of text, which are related to the input query, are fed into an AI agent responsible for extracting entities and relationships from all the chunks
6. these entities and relationships make up a Python dictionary which is then used to build the mind map
7. another agent reads the content of the same dictionary and creates a comment to describe the mind map and highlight relevant information
END.
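To make step 4 concrete, here is a minimal sketch of top-n retrieval by dot product in plain Python. It is only an illustration: in the app this ranking happens inside Upstash, and the names top_n_by_dot_product and toy_vectors, as well as the three-dimensional vectors, are made up for the example.

```python
def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_n_by_dot_product(query_vec, chunk_vecs, n=2):
    """Return the ids of the n chunks whose vectors score highest against the query."""
    scored = [(dot(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest score first
    return [chunk_id for _, chunk_id in scored[:n]]


# Hypothetical toy embeddings (real ada-002 vectors have 1536 dimensions)
toy_vectors = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.8, 0.1],
    "chunk-c": [0.7, 0.3, 0.0],
}
query = [1.0, 0.0, 0.0]

top_chunks = top_n_by_dot_product(query, toy_vectors, n=2)
print(top_chunks)
```

The retrieved chunk texts are what gets handed to the extraction agent in step 5.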
Let’s briefly go through the project dependencies to get a better understanding of the blocks that make up the logic.
Poetry
I use Poetry for basically all of my projects. It’s a convenient and simple Python environment and package manager. You can download Poetry from this link.
If you cloned the repository, all you have to do is run poetry install
inside the project’s folder in your terminal. Poetry will install the dependencies and take care of everything.
Upstash Vector Database
Upstash was actually a recent discovery and I felt I had to try it out with a real project. While Upstash has been releasing state-of-the-art products for some time, it was missing a vector database. Less than a month ago, the company launched its vector database, which is fully on the cloud and free for experimentation and even more. I found myself enjoying its API, and the online service had no noticeable lag.
OpenAI
As mentioned, this project leverages Whisper for audio file transcription and GPT-4 to power the agents that extract and comment on the mind map. We could also use open-source models if we wanted to.
If you haven’t already, you can set up an OpenAI API key at this link here.
NetworkX
NetworkX powers the mind map component in the software. It takes care of creating nodes for entities and edges among them. With Plotly, the interactive visualization library, you can really visualize complex networks. You can read more about the library at this link.
Streamlit
There are a bunch of core libraries like Pandas and NumPy, but I won’t list them here. Streamlit, on the other hand, needs to be mentioned because it makes the frontend possible. It is a real boon for data scientists who have little knowledge of frontend frameworks and JavaScript.
Now that we have a better idea of the main components of our software, let’s start building it from scratch. Sit tight because it’s going to be quite a long read.
This is how the whole project looks:
Clearly the logic is contained in the src folder. It contains the bulk of the logic, while there is a dedicated folder for the llm parts. We’ll go step by step and build all of the scripts. We’ll start with the one dedicated to the data structure, i.e. schema.py.
Let’s start by defining the information schema. It’s usually the first thing I do when working with data. We’ll use SQLModel and Pydantic to define an Information object that will store the information and allow table creation in SQLite.
# schema.py
from sqlmodel import SQLModel, Field
from typing import Optional
import datetime
from enum import Enum


class FileType(Enum):
    AUDIO = "audio"
    TEXT = "text"
    VIDEO = "video"


class Information(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    filename: str = Field()
    title: Optional[str] = Field(default="NA", unique=False)
    hash_id: str = Field(unique=True)
    # default_factory computes the timestamp when each item is created,
    # instead of once when the module is imported
    created_at: float = Field(
        default_factory=lambda: datetime.datetime.now().timestamp()
    )
    file_type: FileType
    text: str = Field(default="")
    embedded: bool = Field(default=False)

    __table_args__ = {"extend_existing": True}
Each text we’ll enter in the database will be an Information. It’ll have
- an ID, which will act as a primary key and thus be auto-incremented
- a filename that indicates the name of the uploaded file in string format
- a title that the user can optionally specify in string format
- hash_id: created by hashing the text with MD5. We’ll use the hash ID to perform database operations like read, delete and update.
- created_at is automatically generated by using the current time as a default value, indicating when the item was saved in the database
- file_type indicates whether the input data was textual, audio or video (not implemented, but could be)
- text contains the source data used for the entire logic
- embedded is a boolean value that helps us point to the items that have been embedded and are thus present in the cloud vector database
Note: the piece of code __table_args__ = {"extend_existing": True} is necessary to be able to access and manipulate data in the database from Streamlit.
Now that we’ve got the data schema down, let’s write our first utility function: the logger. It’s an incredibly useful thing to have, and thanks to the Rich library we’ll also enjoy having some cool colors in the terminal.
# logger.py
import logging
from rich.logging import RichHandler
from typing import Optional


def get_console_logger(name: Optional[str] = "default") -> logging.Logger:
    logger = logging.getLogger(name)
    # Attach the handler only once, so repeated calls don't duplicate log lines
    if not logger.handlers:
        logger.setLevel(logging.DEBUG)
        console_handler = RichHandler()
        console_handler.setLevel(logging.DEBUG)
        formatter = logging.Formatter(
            "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
        )
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)
    return logger
We’ll simply import it in all of our core scripts.
Since we’re at it, let’s also write our utils.py script with some helper functions.
# utils.py
import wave
import contextlib
from pydub import AudioSegment
import hashlib
import datetime
from src import logger

logger = logger.get_console_logger("utils")


def compute_cost_of_audio_track(audio_track_file_path: str):
    file_extension = audio_track_file_path.split(".")[-1].lower()
    duration_seconds = 0
    if file_extension == "wav":
        with contextlib.closing(wave.open(audio_track_file_path, "rb")) as f:
            frames = f.getnframes()
            rate = f.getframerate()
            duration_seconds = frames / float(rate)
    elif file_extension == "mp3":
        audio = AudioSegment.from_mp3(audio_track_file_path)
        duration_seconds = len(audio) / 1000.0  # pydub returns duration in milliseconds
    else:
        logger.error(f"Unsupported file format: {file_extension}")
        return
    audio_duration_in_minutes = duration_seconds / 60
    cost = round(audio_duration_in_minutes, 2) * 0.006  # default price of the Whisper model, per minute
    logger.info(f"Cost to convert {audio_track_file_path} is ${cost:.2f}")
    return cost


def hash_text(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()


def convert_timestamp_to_datetime(timestamp: float) -> str:
    # created_at is stored as a float; float() also accepts strings such as "1700000000.5"
    return datetime.datetime.fromtimestamp(float(timestamp)).strftime("%Y-%m-%d %H:%M:%S")
We won’t end up using the compute_cost_of_audio_track function in this version of the tool, but I’ve included it nonetheless in case you want to use it.
hash_text is going to be used a lot to create the hash IDs to insert in the database, while convert_timestamp_to_datetime is useful to make sense of the default timestamp placed in the database upon item creation.
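As a quick illustration of these two helpers, the sketch below re-declares hash_text so it runs on its own, and shows that identical text always yields the same hash ID while any edit yields a different one. The sample sentences and the fixed date are made up for the example.

```python
import datetime
import hashlib


def hash_text(text: str) -> str:
    """Same logic as in utils.py: MD5 hex digest of the text."""
    return hashlib.md5(text.encode()).hexdigest()


first = hash_text("Andrej Karpathy is a computer scientist.")
second = hash_text("Andrej Karpathy is a computer scientist.")
edited = hash_text("Andrej Karpathy is a computer scientist!")

print(first == second)  # identical text, identical hash ID
print(first == edited)  # any edit produces a different hash ID

# A stored created_at float converted back into a readable local datetime
created_at = datetime.datetime(2024, 1, 15, 10, 30, 0).timestamp()
readable = datetime.datetime.fromtimestamp(created_at).strftime("%Y-%m-%d %H:%M:%S")
print(readable)
```

This is why the hash ID works as a natural deduplication key: uploading the same text twice maps to the same row.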
Now let’s take a look at the database setup. We’ll set up the usual CRUD interface:
# db.py
from sqlalchemy import delete
from sqlmodel import SQLModel, create_engine, Session, select
from src.schema import Information
from src.logger import get_console_logger

sqlite_file_name = "database.db"
sqlite_url = f"sqlite:///{sqlite_file_name}"
engine = create_engine(sqlite_url, echo=False)
logger = get_console_logger("db")
SQLModel.metadata.create_all(engine)


def read_one(hash_id: str):
    with Session(engine) as session:
        statement = select(Information).where(Information.hash_id == hash_id)
        information = session.exec(statement).first()
        return information


def add_one(data: dict):
    with Session(engine) as session:
        if session.exec(
            select(Information).where(Information.hash_id == data.get("hash_id"))
        ).first():
            logger.warning(f"Item with hash_id {data.get('hash_id')} already exists")
            return None  # or raise an exception, or handle as needed
        information = Information(**data)
        session.add(information)
        session.commit()
        session.refresh(information)
        logger.info(f"Item with hash_id {data.get('hash_id')} added to the database")
        return information


def update_one(hash_id: str, data: dict):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == hash_id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {hash_id} found for update")
            return None  # or raise an exception, or handle as needed
        for key, value in data.items():
            setattr(information, key, value)
        session.commit()
        session.refresh(information)  # reload attributes expired by the commit
        logger.info(f"Item with hash_id {hash_id} updated in the database")
        return information


def delete_one(hash_id: str):
    with Session(engine) as session:
        # Check if the item with the given hash_id exists
        information = session.exec(
            select(Information).where(Information.hash_id == hash_id)
        ).first()
        if not information:
            logger.warning(f"No item with hash_id {hash_id} found for deletion")
            return None  # or raise an exception, or handle as needed
        session.delete(information)
        session.commit()
        logger.info(f"Item with hash_id {hash_id} deleted from the database")
        return True  # lets callers tell a deletion apart from "not found"


def add_many(data: list):
    # add_one opens and commits its own session, so no outer session is needed
    for info in data:
        result = add_one(info)
        if result is None:
            logger.warning(
                f"Item with hash_id {info.get('hash_id')} could not be added"
            )


def delete_many(hash_ids: list):
    # delete_one opens and commits its own session, so no outer session is needed
    for hash_id in hash_ids:
        result = delete_one(hash_id)
        if result is None:
            logger.warning(f"Item with hash_id {hash_id} could not be deleted")


def read_all(query: dict = None):
    with Session(engine) as session:
        statement = select(Information)
        if query:
            statement = statement.where(
                *[getattr(Information, key) == value for key, value in query.items()]
            )
        information = session.exec(statement).all()
        return information


def delete_all():
    with Session(engine) as session:
        # Bulk delete through a DELETE statement
        session.exec(delete(Information))
        session.commit()
        logger.info("All items deleted from the database")
With this script, we’ll be able to create the database and easily read, create, delete and update items one at a time or in bulk.
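The core idea behind add_one, refusing to insert an item whose hash_id is already present, can be tried out without any dependency. The sketch below swaps SQLModel for the standard library’s sqlite3 module and an in-memory database, so it illustrates the logic rather than the project’s actual code. The simplified table layout is an assumption.

```python
import sqlite3

# In-memory database with a simplified, hypothetical version of the Information table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE information (id INTEGER PRIMARY KEY, hash_id TEXT UNIQUE, text TEXT)"
)


def add_one(conn, data):
    """Insert the item unless its hash_id already exists; return None on duplicates."""
    exists = conn.execute(
        "SELECT 1 FROM information WHERE hash_id = ?", (data["hash_id"],)
    ).fetchone()
    if exists:
        return None
    conn.execute(
        "INSERT INTO information (hash_id, text) VALUES (?, ?)",
        (data["hash_id"], data["text"]),
    )
    conn.commit()
    return data


first_insert = add_one(conn, {"hash_id": "abc", "text": "hello"})
second_insert = add_one(conn, {"hash_id": "abc", "text": "hello"})  # duplicate
row_count = conn.execute("SELECT COUNT(*) FROM information").fetchone()[0]
print(first_insert, second_insert, row_count)
```

The UNIQUE constraint on hash_id is a second line of defense: even without the explicit check, the database would reject the duplicate row.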
Now that we have our information structure and an interface to the database, we’ll move on to the management of audio files.
This was a completely optional step, but I wanted to spice things up. Our code will allow users to upload any .mp3 or .wav files and transcribe their contents through OpenAI’s Whisper model. The persona I had in mind was a university student who collects notes via voice recording.
Keep in mind Whisper is a paid model. At the time of writing this article, the price was $0.006 / minute. You can learn more at this link.
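To get a feel for what that price means in practice, here is a small back-of-the-envelope sketch. The function name estimate_transcription_cost and the 45-minute lecture are made up for the example, and the rate is the one quoted above, which may have changed since.

```python
WHISPER_PRICE_PER_MINUTE = 0.006  # USD, the price quoted at the time of writing


def estimate_transcription_cost(duration_seconds: float) -> float:
    """Estimated cost in USD of transcribing an audio track of the given length."""
    minutes = duration_seconds / 60
    return round(minutes * WHISPER_PRICE_PER_MINUTE, 4)


# A hypothetical 45-minute university lecture
lecture_cost = estimate_transcription_cost(45 * 60)
print(f"${lecture_cost:.2f}")
```

At this rate, a full hour of recorded notes stays well under half a dollar.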
Let’s create whisper.py and a single function called create_transcript.
from src.logger import get_console_logger

logger = get_console_logger("whisper")


def create_transcript(openai_client, file_path: str) -> str:
    logger.info(f"Creating transcript for {file_path}")
    # The context manager closes the audio file once the upload is done
    with open(file_path, "rb") as audio_file:
        transcript = openai_client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    logger.info(f"Transcript created for {file_path}")
    return transcript.text
This function is very simple: it’s just a thin wrapper around OpenAI’s audio module.
The attentive eye will notice that openai_client is an argument to the function. That’s not a mistake, and we’ll see why in just a second.
Now we can handle text in all (of the supported) forms, which are basic text and audio. It’s time to vectorize these texts and push them to our Upstash vector database.
We’ll be using a few more tools here to properly embed our documents for vector search and RAG.
- Tiktoken: the well-known library by OpenAI that allows for simple and efficient tokenization based on the LLM (in our case, GPT-3.5)
- LangChain: I love this library, and find it very versatile despite what a portion of the community says about it. In this project, I borrow the RecursiveCharacterTextSplitter object from it
Again, if you cloned the repo, Poetry will install the required dependencies automatically. If not, just run the command poetry add langchain tiktoken.
Of course, we’ll also need to install Upstash Vector: the command is poetry add upstash-vector. Once installed, go to the page https://console.upstash.com/ to set up your cloud environment.
Make sure you choose 1536 as the vector dimensionality to match the size of the embeddings produced by the OpenAI ADA model.
As I mentioned before, Upstash is a paid tool, but it does have a very generous free tier that I used extensively for this project.
Free: The free plan is suitable for small projects. It has a limit of 10,000 queries and 10,000 updates daily.
This is great to get started building projects like these. Scalability, in addition, is not an issue since you can easily tune your requirements.
Once done, get hold of your REST URL and token.
Now we’re ready to write our script.
# vector_db.py
from src.logger import get_console_logger
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from upstash_vector import Vector
from tqdm import tqdm
import random

logger = get_console_logger("vector_db")
MODEL = "text-embedding-ada-002"
ENCODER = tiktoken.encoding_for_model("gpt-3.5-turbo")


def token_len(text):
    """Calculate the token length of a given text.

    Args:
        text (str): The text to calculate the token length for.
    Returns:
        int: The number of tokens in the text.
    """
    return len(ENCODER.encode(text))


def get_embeddings(openai_client, chunks, model=MODEL):
    """Get embeddings for a list of text chunks using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating embeddings.
        chunks (list of str): The text chunks to embed.
        model (str): The model identifier to use for embedding.
    Returns:
        list of list of float: A list of embeddings, each corresponding to a chunk.
    """
    chunks = [c.replace("\n", " ") for c in chunks]
    res = openai_client.embeddings.create(input=chunks, model=model).data
    return [r.embedding for r in res]


def get_embedding(openai_client, text, model=MODEL):
    """Get embedding for a single text using the specified model.

    Args:
        openai_client: The OpenAI client instance to use for generating the embedding.
        text (str): The text to embed.
        model (str): The model identifier to use for embedding.
    Returns:
        list of float: The embedding of the given text.
    """
    return get_embeddings(openai_client, [text], model)[0]


def query_vector_db(index, openai_client, question, top_n=1):
    """Query the vector database for vectors similar to the given question.

    Args:
        index: The vector database index to query.
        openai_client: The OpenAI client instance to use for generating the question embedding.
        question (str): The question to query the vector database with.
        top_n (int, optional): The number of top similar vectors to return. Defaults to 1.
    Returns:
        str: A string containing the concatenated texts of the top similar vectors.
    """
    logger.info("Creating vector for question...")
    question_embedding = get_embedding(openai_client, question)
    logger.info("Querying vector database...")
    res = index.query(vector=question_embedding, top_k=top_n, include_metadata=True)
    context = "\n-".join([r.metadata["text"] for r in res])
    logger.info(f"Context returned. Length: {len(context)} characters.")
    return context


def create_chunks(text, chunk_size=150, chunk_overlap=20):
    """Create text chunks based on specified size and overlap.

    Args:
        text (str): The text to split into chunks.
        chunk_size (int, optional): The desired size of each chunk, in tokens. Defaults to 150.
        chunk_overlap (int, optional): The number of overlapping tokens between chunks. Defaults to 20.
    Returns:
        list of str: A list of text chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=token_len,
        separators=["\n\n", "\n", " ", ""],
    )
    return text_splitter.split_text(text)


def add_chunks_to_vector_db(index, openai_client, chunks, metadata):
    """Embed text chunks and add them to the vector database.

    Args:
        index: The vector database index to add chunks to.
        openai_client: The OpenAI client instance used to embed each chunk
            (required by get_embedding, so it is passed in explicitly here).
        chunks (list of str): The text chunks to embed and add.
        metadata (dict): The metadata to associate with each chunk.
    Returns:
        None
    """
    for chunk in chunks:
        random_id = random.randint(0, 1000000)  # workaround while waiting for metadata search to be implemented
        # Copy the metadata so each vector carries its own chunk text
        chunk_metadata = {**metadata, "text": chunk}
        vec = Vector(
            id=f"chunk-{random_id}",
            vector=get_embedding(openai_client, chunk),
            metadata=chunk_metadata,
        )
        index.upsert(vectors=[vec])
        logger.info(f"Added chunk to vector db: {chunk}")


def fetch_by_source_hash_id(index, source_hash_id: str, max_results=10000):
    """Fetch vector IDs from the database by source hash ID.

    Args:
        index: The vector database index to search.
        source_hash_id (str): The source hash ID to filter the vectors by.
        max_results (int, optional): The maximum number of results to scan. Defaults to 10000.
    Returns:
        list of str: A list of vector IDs that match the source hash ID.
    """
    ids = []
    for i in tqdm(range(0, max_results, 1000)):
        search = index.range(
            cursor=str(i), limit=1000, include_vectors=False, include_metadata=True
        ).vectors
        for result in search:
            if result.metadata["source_hash_id"] == source_hash_id:
                ids.append(result.id)
    return ids


def fetch_all(index):
    """Fetch all vectors from the database.

    Args:
        index: The vector database index to fetch vectors from.
    Returns:
        list: A list of vectors from the database (first 1000 at most).
    """
    return index.range(
        cursor="0", limit=1000, include_vectors=False, include_metadata=True
    ).vectors
There’s more going on in this script so let me dive deeper for a moment.
get_embedding and get_embeddings are used to encode one or multiple texts. They are simply placed here for better control.
query_vector_db allows us to query Upstash for items similar to our query vector. In this function, we embed the query and perform the lookup through the index’s .query method. The index, together with OpenAI’s client, is passed in as an argument later in the Streamlit app. The returned object is a string called context, which is a concatenation of the top N items most similar to the input query.
Continuing, we leverage LangChain’s RecursiveCharacterTextSplitter to efficiently create textual chunks from the documents.
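To show what chunk size and chunk overlap mean, here is a deliberately simplified sliding-window sketch in plain Python. It is not LangChain’s algorithm (RecursiveCharacterTextSplitter also tries to split on separators such as paragraphs and lines first), and it uses whitespace-separated words as stand-ins for tokens. The function name chunk_tokens is made up for the example.

```python
def chunk_tokens(tokens, chunk_size=5, chunk_overlap=2):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Words stand in for tokens in this toy example
words = "one two three four five six seven eight".split()
chunks = chunk_tokens(words, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, which helps retrieval.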
Now a bit of CRUD interface for the vector DB too: adding and fetching data (updating and deleting are easily done too, and we’ll do that in the frontend).
Note: at the time of writing this article, Upstash doesn’t yet support search on metadata. This means that since we’re using hash_id to identify our documents, these aren’t directly queryable. I’ve added a simple workaround in the code to browse through a batch of documents (max_results, 10,000 by default in the code above) and look up the hash ID manually. I’ve read online they’ll be implementing this functionality soon.
We’ll start coding our LLM behaviors by working on the prompts first.
There are going to be two agents. The first one is responsible for extracting network data from the text, while the second is responsible for analyzing that network data.
The prompt for the first agent is the following:
You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n
TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of
Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so be always mindful of the syntax and the data types to return a JSON object.\n
INPUT TEXT:\n
The analyzer agent uses this prompt instead:
You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ..., "origin": _source_or_target_}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n
You are to extract insights from the mind map data and provide a summary of the relationships.\n
Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n
NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the top most relevant information.
MIND MAP DATA:\n
{mind_map_data}
These two prompts will be imported in the Pythonic way: that is, as scripts.
Let’s create a script in the llm folder called prompts.py and create a dictionary of intents where we place the prompts as values.
# llm.prompts.py
PROMPTS = {
    "mind_map_of_one": """You are an expert in creating network graphs from textual data.
You are also a note-taking expert and you are able to create mind maps from text.
You are tasked with creating a mind map from a given text data by extracting the concepts and relationships from the text.\n
The relationships should be among objects, people, or places mentioned in the text.\n
TYPES should only be one of the following:
- is a
- is related to
- is part of
- is similar to
- is different from
- is a type of
Your output should be a JSON containing the following:
{ "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] } \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
NEVER change this output format. ENGLISH is the output language. NEVER change the output language.
Your response will be used as a Python dictionary, so be always mindful of the syntax and the data types to return a JSON object.\n
INPUT TEXT:\n
""",
    "inspector_of_mind_map": """
You are a senior business intelligence analyst, who is able to extract valuable insights from data.
You are tasked with extracting information from a given mind map data.\n
The mind map data is a JSON containing the following:
{{ "relationships": [{{"source": ..., "target": ..., "type": ...}}, {{...}}] }} \n
- source: The source node\n
- target: The target node\n
- type: The type of the relationship between the source and target nodes\n
- origin: The origin node from which the relationship originates\n
You are to extract insights from the mind map data and provide a summary of the relationships.\n
Your output should be a brief comment on the mind map data, highlighting relevant insights and relationships using centrality and other graph analysis techniques.\n
NEVER change this output format. ENGLISH is the output language. NEVER change the output language.\n
Keep your output very brief. Just a comment to highlight the top most relevant information.
MIND MAP DATA:\n
{mind_map_data}
""",
}
This way we can easily import and use the prompts simply by pointing at the agent’s intent (mind_map_of_one, inspector_of_mind_map). We’ll import the prompts in the llm.py script.
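One detail worth spelling out: the inspector prompt is filled in with str.format, so the literal braces of its JSON schema have to be doubled, while the extraction prompt is used as-is and keeps single braces. A minimal sketch with shortened, made-up templates shows the difference:

```python
# Doubled braces survive .format() as literal braces; {mind_map_data} is the placeholder
template = 'Schema: {{"relationships": [...]}}\nMIND MAP DATA:\n{mind_map_data}'
rendered = template.format(mind_map_data='{"relationships": []}')
print(rendered)

# With single braces, .format() tries to treat the JSON schema as a placeholder and fails
unescaped = 'Schema: {"relationships": [...]}\n{mind_map_data}'
try:
    unescaped.format(mind_map_data="{}")
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True
print(format_failed)
```

If you edit the prompts, keep this in mind: any prompt that goes through .format needs its JSON examples escaped this way.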
# llm.llm.py
from src.logger import get_console_logger
from src.llm.prompts import PROMPTS

logger = get_console_logger("llm")
MIND_MAP_EXTRACTION_MODEL = "gpt-4-turbo-preview"
MIND_MAP_INSPECTION_MODEL = "gpt-4"


def extract_mind_map_data(openai_client: object, text: str) -> str:
    logger.info("Extracting mind map data from text...")
    response = openai_client.chat.completions.create(
        model=MIND_MAP_EXTRACTION_MODEL,
        response_format={"type": "json_object"},
        temperature=0,
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_one"]},
            {"role": "user", "content": f"{text}"},
        ],
    )
    return response.choices[0].message.content


def extract_mind_map_data_of_two(
    openai_client: object, source_text: str, target_text: str
) -> str:
    # NOTE: the "mind_map_of_many" prompt is not part of the prompts.py shown above;
    # it has to be added to PROMPTS before this function can run
    logger.info("Extracting mind map data from two texts...")
    user_prompt = PROMPTS["mind_map_of_many"].format(
        source_text=source_text, target_text=target_text
    )
    response = openai_client.chat.completions.create(
        model=MIND_MAP_INSPECTION_MODEL,
        response_format={"type": "json_object"},  # this is very important!
        messages=[
            {"role": "system", "content": PROMPTS["mind_map_of_many"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


def extract_information_from_mind_map_data(openai_client: object, data: dict) -> str:
    logger.info("Extracting information from mind map data...")
    user_prompt = PROMPTS["inspector_of_mind_map"].format(mind_map_data=data)
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PROMPTS["inspector_of_mind_map"]},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
All of the heavy work is completed by the 2 easy features that merely join an GPT agent to the suitable immediate. Word response_format={“sort"=”json_object"}
within the first operate. This ensures that GPT-4 builds a JSON illustration of the textual content’s community information. With out this line, your entire software turns into extremely unstable.
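Since the extraction function returns the model’s answer as a JSON string, it has to be parsed before it can be used as a Python dictionary. The sketch below is one possible way to do that defensively. The function name parse_relationships and the hard-coded raw_response are made up for the example and are not part of the project code.

```python
import json

REQUIRED_KEYS = {"source", "target", "type"}


def parse_relationships(raw: str) -> list:
    """Parse the model's JSON string and keep only well-formed relationships."""
    payload = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    relationships = payload.get("relationships", [])
    return [
        rel
        for rel in relationships
        if isinstance(rel, dict) and REQUIRED_KEYS.issubset(rel)
    ]


# Hypothetical model output: one valid relationship and one incomplete entry
raw_response = (
    '{"relationships": ['
    '{"source": "Andrej Karpathy", "target": "OpenAI", "type": "worked at"}, '
    '{"source": "incomplete"}]}'
)
relationships = parse_relationships(raw_response)
print(relationships)
```

Filtering out incomplete entries keeps a single malformed relationship from breaking the graph-building step.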
Let’s put the logic to the test. When passed the prompt “Who is Andrej Karpathy?” the first agent creates this network representation:
{
    "relationships": [
        {"source": "Andrej Karpathy", "target": "Slovak-Canadian", "type": "is a"},
        {"source": "Andrej Karpathy", "target": "computer scientist", "type": "is a"},
        {"source": "Andrej Karpathy", "target": "director of artificial intelligence and Autopilot Vision at Tesla", "type": "served as"},
        {"source": "Andrej Karpathy", "target": "OpenAI", "type": "worked at"},
        {"source": "Andrej Karpathy", "target": "deep learning", "type": "specialized in"},
        {"source": "Andrej Karpathy", "target": "computer vision", "type": "specialized in"},
        {"source": "Andrej Karpathy", "target": "Bratislava, Czechoslovakia", "type": "was born in"},
        {"source": "Andrej Karpathy", "target": "Toronto", "type": "moved to"},
        {"source": "Andrej Karpathy", "target": "University of Toronto", "type": "completed degrees at"},
        {"source": "Andrej Karpathy", "target": "University of British Columbia", "type": "completed master's degree at"},
        {"source": "Andrej Karpathy", "target": "OpenAI", "type": "is a founding member of"},
        {"source": "Andrej Karpathy", "target": "Tesla", "type": "became director of artificial intelligence at"},
        {"source": "Andrej Karpathy", "target": "Elon Musk", "type": "reported to"},
        {"source": "Andrej Karpathy", "target": "MIT Technology Review's Innovators Under 35 for 2020", "type": "was named one of"},
        {"source": "Andrej Karpathy", "target": "YouTube videos on how to create artificial neural networks", "type": "makes"},
        {"source": "Andrej Karpathy", "target": "Stanford University", "type": "received a PhD from"},
        {"source": "Fei-Fei Li", "target": "Stanford University", "type": "is part of"},
        {"source": "Andrej Karpathy", "target": "natural language processing", "type": "focused on"},
        {"source": "Andrej Karpathy", "target": "CS 231n: Convolutional Neural Networks for Visual Recognition", "type": "authored and was the primary instructor of"},
        {"source": "CS 231n: Convolutional Neural Networks for Visual Recognition", "target": "Stanford", "type": "is part of"}
    ]
}
This data comes from unstructured Wikipedia text uploaded to the application for testing purposes. The representation looks just fine! Feel free to edit the prompts to extract even more information.
All that remains now is to use this Python dictionary of relationships to create our interactive mind map with NetworkX and Plotly.
There’s going to be only one function, but it is going to be pretty intense if you’ve never worked with NetworkX before. It’s not the easiest framework to work with, but the outputs you can get from becoming proficient at it are valuable.
What we’ll do is initialize a graph object with G = nx.DiGraph(), which creates a new directed graph. The function iterates over a list of relationships provided in the data dictionary. For each relationship, it adds an edge to the graph G from the source node to the target node, with an attribute type that describes the relationship.
for relationship in data["relationships"]:
    G.add_edge(
        relationship["source"], relationship["target"], type=relationship["type"]
    )
Once done, the graph’s layout is computed using the spring layout algorithm, which positions the nodes in a way that tries to minimize the overlap between edges and keep the edges’ lengths uniform. The seed parameter ensures that the layout is reproducible.
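To see the reproducibility claim in practice, here is a small sketch that builds a graph from three of the Karpathy relationships above and computes the layout twice with the same seed. It assumes networkx (with its numpy dependency) is installed; the variable names are illustrative only.

```python
import networkx as nx

# Three relationships taken from the example output above
relationships = [
    {"source": "Andrej Karpathy", "target": "OpenAI", "type": "worked at"},
    {"source": "Andrej Karpathy", "target": "Stanford University", "type": "received a PhD from"},
    {"source": "Fei-Fei Li", "target": "Stanford University", "type": "is part of"},
]

G = nx.DiGraph()
for rel in relationships:
    # The relationship label is stored as the edge attribute "type"
    G.add_edge(rel["source"], rel["target"], type=rel["type"])

# Compute the spring layout twice with the same seed
layout_a = nx.spring_layout(G, seed=42)
layout_b = nx.spring_layout(G, seed=42)

# Each layout maps a node name to an (x, y) coordinate array
same_positions = all((layout_a[n] == layout_b[n]).all() for n in G.nodes())
print(same_positions)
print(G["Andrej Karpathy"]["OpenAI"]["type"])
```

Without a fixed seed the starting positions are random, so the same graph could be drawn differently on every rerun of the app.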
Finally, Plotly’s Graph Objects (go) module takes care of creating scatterplots for each data point, representing a node on the chart.
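The edges are drawn as lines between two node positions, and a trailing None marks the end of each segment so that separate edges are not joined together. The sketch below builds those coordinate lists in plain Python, with hand-picked positions standing in for the spring layout. The helper edge_coordinates is hypothetical, and collecting all edges into one pair of lists is a variation on the script, which creates one trace per edge.

```python
def edge_coordinates(edges, positions):
    """Return x and y lists where every edge is followed by a None gap."""
    xs, ys = [], []
    for source, target in edges:
        x0, y0 = positions[source]
        x1, y1 = positions[target]
        # None tells a line plot to lift the pen before the next segment
        xs += [x0, x1, None]
        ys += [y0, y1, None]
    return xs, ys


# Hand-picked positions standing in for the output of a layout algorithm
positions = {"A": (0.0, 0.0), "B": (1.0, 0.5), "C": (0.5, 1.0)}
edges = [("A", "B"), ("A", "C")]

xs, ys = edge_coordinates(edges, positions)
print(xs)
print(ys)
```

The resulting lists could be passed as x and y to a single go.Scatter with mode="lines".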
Here’s how the mind_map.py script looks.
# mind_map.py
import networkx as nx
from graphviz import Digraph
import plotly.express as px
import plotly.graph_objects as go


def create_plotly_mind_map(data: dict) -> go.Figure:
    """
    data is a dictionary containing the following
    { "relationships": [{"source": ..., "target": ..., "type": ...}, {...}] }
    source: The source node
    target: The target node
    type: The type of the relationship between the source and target nodes
    """
    ### START - NETWORKX LOGIC ###
    # Create a directed graph
    G = nx.DiGraph()
    # Add edges to the graph
    for relationship in data["relationships"]:
        G.add_edge(
            relationship["source"], relationship["target"], type=relationship["type"]
        )
    # Create a layout for our nodes
    layout = nx.spring_layout(G, seed=42)
    # Build one line trace per relationship
    traces = []
    for relationship in data["relationships"]:
        x0, y0 = layout[relationship["source"]]
        x1, y1 = layout[relationship["target"]]
        edge_trace = go.Scatter(
            x=[x0, x1, None],
            y=[y0, y1, None],
            line=dict(width=0.5, color="#888"),  # Set a single color for all edges
            hoverinfo="none",
            mode="lines",
        )
        traces.append(edge_trace)
    # Collect the node coordinates
    node_x = []
    node_y = []
    for node in G.nodes():
        x, y = layout[node]
        node_x.append(x)
        node_y.append(y)
    ### END - NETWORKX LOGIC ###
    node_trace = go.Scatter(
        x=node_x,
        y=node_y,
        mode="markers+text",
        # Add the node names as labels and hover text
        text=[node for node in G.nodes()],
        hoverinfo="text",
        marker=dict(
            showscale=False,
            colorscale="Greys",  # Change colorscale to grayscale
            reversescale=True,
            size=20,
            color="#505050",  # Set node color to gray
            line_width=2,
        ),
    )
    # Add node and edge labels
    edge_annotations = []
    for edge in G.edges(data=True):
        x0, y0 = layout[edge[0]]
        x1, y1 = layout[edge[1]]
        edge_annotations.append(
            dict(
                x=(x0 + x1) / 2,
                y=(y0 + y1) / 2,
                xref="x",
                yref="y",
                text=edge[2]["type"],
                showarrow=False,
                font=dict(size=10),
            )
        )
    # Node annotations are built here but only the edge annotations are
    # passed to the layout below; node names come from the node trace text
    node_annotations = []
    for node in G.nodes():
        x, y = layout[node]
        node_annotations.append(
            dict(
                x=x,
                y=y,
                xref="x",
                yref="y",
                text=node,
                showarrow=False,
                font=dict(size=12),
            )
        )
    # Style text and borders for a dark background. This is done before the
    # figure is created because go.Figure copies the traces and annotations
    # it is given, so later changes to these objects would not reach the figure.
    node_trace.textfont = dict(color="white")
    node_trace.marker.line.color = "white"
    for annotation in edge_annotations:
        annotation["font"]["color"] = "white"  # Set edge annotation text color to white
    for annotation in node_annotations:
        annotation["font"]["color"] = "white"  # Set node annotation text color to white
    for trace in traces:
        if "line" in trace:
            trace["line"]["color"] = "#888"  # Set a single color for all edges
    # Create the figure
    fig = go.Figure(
        data=traces + [node_trace],
        layout=go.Layout(
            showlegend=False,
            hovermode="closest",
            margin=dict(b=20, l=5, r=5, t=40),
            annotations=edge_annotations,
            xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
            yaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
        ),
    )
    # Configure the legend and set the plot background to dark
    fig.update_layout(
        paper_bgcolor="rgba(0,0,0,1)",  # Set the background color to black
        plot_bgcolor="rgba(0,0,0,1)",  # Set the plot area background color to black
        legend=dict(
            title="Origins",
            traceorder="normal",
            font=dict(size=12, color="white"),  # Set legend text color to white
        ),
    )
    return fig
Feel free to simply copy-paste this function into your logic and change it as you please.
And this is how the mind map looks for the prompt “Who is Sam Altman?”
Great work! We’re done with the backend logic! Our last step is to implement the Streamlit app.