Increase productiveness with video conferencing transcripts and summaries with the Amazon Chime SDK Assembly Summarizer resolution

Contents

Resolution overview Stipulations Entry the assembly with Amazon Chime SDK Speaker diarization with Amazon Transcribe Use Anthropic Claude and Amazon Bedrock to boost the answer Immediate engineering for speaker title identification Immediate engineering for cleansing the transcript Transcript Processing Speaker Identification Homophone Substitute Filler Phrase Elimination Immediate engineering for summarization Immediate engineering for assembly invitation parsing Clear up Conclusion In regards to the authors

Companies in the present day closely depend on video conferencing platforms for efficient communication, collaboration, and decision-making. Nonetheless, regardless of the comfort these platforms supply, there are persistent challenges in seamlessly integrating them into present workflows. One of many main ache factors is the dearth of complete instruments to automate the method of becoming a member of conferences, recording discussions, and extracting actionable insights from them. This hole ends in inefficiencies, missed alternatives, and restricted productiveness, hindering the seamless circulation of data and decision-making processes inside organizations.

To handle this problem, we’ve developed the Amazon Chime SDK Assembly Summarizer utility deployed with the Amazon Cloud Growth Package (AWS CDK). This utility makes use of an Amazon Chime SDK SIP media utility, Amazon Transcribe, and Amazon Bedrock to seamlessly be a part of conferences, report assembly audio, and course of recordings for transcription and summarization. By integrating these providers programmatically via the AWS CDK, we purpose to streamline the assembly workflow, empower customers with actionable insights, and drive higher decision-making outcomes. Our resolution presently integrates with standard platforms resembling Amazon Chime, Zoom, Cisco Webex, Microsoft Groups, and Google Meet.

Along with deploying the answer, we’ll additionally educate you the intricacies of immediate engineering on this publish. We information you thru addressing parsing and knowledge extraction challenges, together with speaker diarization, name scheduling, summarization, and transcript cleansing. By means of detailed directions and structured approaches tailor-made to every use case, we illustrate the effectiveness of Amazon Bedrock, powered by Anthropic Claude fashions.

Resolution overview

The next infrastructure diagram supplies an summary of the AWS providers which can be used to create this assembly summarization bot. The core providers used on this resolution are:

An Amazon Chime SDK SIP Media Utility is used to dial into the assembly and report assembly audio
Amazon Transcribe is used to carry out speech-to-text processing of the recorded audio, together with speaker diarization
Anthropic Claude fashions in Amazon Bedrock are used to determine names, enhance the standard of the transcript, and supply an in depth abstract of the assembly

For an in depth rationalization of the answer, consult with the Amazon Chime Assembly Summarizer documentation.

Stipulations

Earlier than diving into the undertaking setup, be sure you have the next necessities in place:

Yarn – Yarn have to be put in in your machine.
AWS account – You’ll want an lively AWS account.
Allow Claude Anthropic fashions – These fashions ought to be enabled in your AWS account. For extra info, see Mannequin entry.
Allow Amazon Titan – Amazon Titan ought to be activated in your AWS account. For extra info, see Amazon Titan Fashions.

Check with our GitHub repository for a step-by-step information on deploying this resolution.

Entry the assembly with Amazon Chime SDK

To seize the audio of a gathering, an Amazon Chime SDK SIP media utility will dial into the assembly utilizing the assembly supplier’s dial-in quantity. The Amazon Chime SDK SIP media utility (SMA) is a programable telephony service that can make a telephone name over the general public switched phone community (PSTN) and seize the audio. SMA makes use of a request/response mannequin with an AWS Lambda operate to course of actions. On this demo, an outbound name is made utilizing the CreateSipMediaApplicationCall API. This can trigger the Lambda operate to be invoked with a NEW_OUTBOUND_CALL occasion.

As a result of most dial-in mechanisms for conferences require a PIN or different identification to be made, the SMA will use the SendDigits motion to ship these twin tone multi-frequency (DTMF) digits to the assembly supplier. When the appliance has joined the assembly, it should introduce itself utilizing the Communicate motion after which report the audio utilizing the RecordAudio motion. This audio might be saved in MP3 format and saved to an Amazon Easy Storage Service (Amazon S3) bucket.

Speaker diarization with Amazon Transcribe

As a result of the SMA is joined to the assembly as a participant, the audio might be a single channel of all of the contributors. To course of this audio file, Amazon Transcribe might be used with the ShowSpeakerLabels setting:

const response = await transcribeClient.ship(
    new StartTranscriptionJobCommand({
        TranscriptionJobName: jobName,
        IdentifyLanguage: true,
        MediaFormat: 'wav',
        Media: {
            MediaFileUri: audioSource,
        },
        Settings: {
            ShowSpeakerLabels: true,
            MaxSpeakerLabels: 10,
        },
        OutputBucketName: `${BUCKET}`,
        OutputKey: `${PREFIX_TRANSCRIBE_S3}/`,
    }),
);

With speaker diarization, Amazon Transcribe will distinguish totally different audio system within the transcription output. The JSON file that’s produced will embody the transcripts and objects together with speaker labels grouped by speaker with begin and finish timestamps. With this info offered by Amazon Transcribe, a turn-by-turn transcription might be generated by parsing via the JSON. The consequence might be a extra readable transcription. See the next instance:

spk_0: Hey Courtroom , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Nicely , uh hey , thanks for uh becoming a member of me in the present day on this name . I’m excited to speak to you about uh architecting on Aws .
spk_1: Superior . Yeah , thanks for inviting me . So ,
spk_0: uh are you able to inform me slightly bit about uh the servers you’re presently utilizing on premises ?
spk_1: Yeah . So for our servers , we presently have 4 net servers working with 4 gigabyte of RA M and two CP US and we’re presently working Linux as our working system .
spk_0: Okay . And what are you utilizing in your database ?
spk_1: Oh , yeah , for a database , we presently have a 200 gigabyte database working my as to will and I can’t bear in mind the model . However um the factor about our database is typically it lags . So we’re actually in search of a sooner possibility for
spk_0: that . So , um once you’re , you’re noticing lags with reads or with rights to the database
spk_1: with , with reads .
spk_0: Yeah . Okay . Have you ever heard of uh learn replicas ?
spk_1: I’ve not .
spk_0: Okay . That might be a possible uh resolution to your drawback . Um In case you don’t thoughts , I’ll go forward and simply uh ship you some info on that later for you and your workforce to evaluation .
spk_1: Oh , yeah , thanks , Adam . Yeah . Something that would assist . Yeah , be actually useful .
spk_0: Okay , final query earlier than I allow you to go . Um what are you doing uh to enhance the safety of your on premises ? Uh knowledge ?
spk_1: Yeah , so , so primary , we’ve been experiencing some esto injection assaults . We do have a Palo Alto firewall , however we’re not capable of totally obtain layer server safety . So we do want a greater possibility for that .
spk_0: Have you ever ex have you ever skilled uh my sequel assaults previously or sequel injections ?
spk_1: Sure.
spk_0: Okay , nice . Uh All proper . After which are you guys certain by any compliance requirements like PC I DS S or um you realize GDR ? Uh what’s one other one ? Any of these C simply uh
spk_1: we’re certain by destiny , reasonable complaints . So .
spk_0: Okay . Nicely , you need to transition to fed ramp excessive at any time .
spk_1: Uh Not within the close to future . No .
spk_0: Okay . All proper , Courtroom . Nicely , thanks for that uh context . I might be reaching out to you once more right here quickly with a comply with up electronic mail and uh yeah , I’m trying ahead to speaking to you once more uh subsequent week .
spk_1: Okay . Sounds good . Thanks , Adam in your assist .
spk_0: All proper . Okay , bye .
spk_1: All proper . Take care .

Right here, audio system have been recognized based mostly on the order they spoke. Subsequent, we present you methods to additional improve this transcription by figuring out audio system utilizing their names, quite than spk_0, spk_1, and so forth.

Use Anthropic Claude and Amazon Bedrock to boost the answer

This utility makes use of a big language mannequin (LLM) to finish the next duties:

Speaker title identification
Transcript cleansing
Name summarization
Assembly invitation parsing

Immediate engineering for speaker title identification

The primary job is to boost the transcription by assigning names to speaker labels. These names are extracted from the transcript itself when an individual introduces themselves after which are returned as output in JSON format by the LLM.

Particular directions are offered for circumstances the place just one speaker is recognized to offer consistency within the response construction. By following these directions, the LLM will course of the assembly transcripts and precisely extract the names of the audio system with out further phrases or spacing within the response. If no names are recognized by the LLM, we prompted the mannequin to return an Unknown tag.

On this demo, the prompts are designed utilizing Anthropic Claude Sonnet because the LLM. Chances are you’ll have to tune the prompts to switch the answer to make use of one other accessible mannequin on Amazon Bedrock.

Human: You’re a assembly transcript names extractor. Go over the transcript and extract the names from it. Use the next directions within the <directions></directions> xml tags
<transcript> ${transcript} </transcript>
<directions>
– Extract the names like this instance – spk_0: “name1”, spk_1: “name2”.
– Solely extract the names like the instance above and don’t add some other phrases to your response
– Your response ought to solely have an inventory of “audio system” and their related title separated by a “:” surrounded by {}
– if there is just one speaker recognized then encompass your reply with {}
– the format ought to appear to be this {“spk_0” : “Title”, “spk_1: “Name2”, and so forth.}, no pointless spacing ought to be added
</directions>

Assistant: Ought to I add anything in my reply?

Human: Solely return a JSON formatted response with the Title and the speaker label related to it. Don’t add some other phrases to your reply. Do NOT EVER add any introductory sentences in your reply. Solely give the names of the audio system actively talking within the assembly. Solely give the names of the audio system actively talking within the assembly within the format proven above.

Assistant:

After the audio system are recognized and returned in JSON format, we are able to substitute the generic speaker labels with the title attributed speaker labels within the transcript. The consequence might be a extra enhanced transcription:

Adam: Hey Courtroom , how’s it going ?
Courtroom: Hey Adam , it’s going good . How are you
Adam: doing ? Nicely , uh hey , thanks for uh becoming a member of me in the present day on this name . I’m excited to speak to you about uh architecting on Aws .
Courtroom: Superior . Yeah , thanks for inviting me . So ,
Adam: uh are you able to inform me slightly bit about uh the servers you’re presently utilizing on premises ?
Courtroom: Yeah . So for our servers , we presently have 4 net servers working with 4 gigabyte of RA M and two CP US and we’re presently working Linux as our working system .

… transcript continues….

However what if a speaker can’t be recognized as a result of they by no means launched themselves. In that case, we need to let the LLM go away them as unknown, quite than attempt to drive or a hallucinate a label.

We are able to add the next to the directions:

If no title is discovered for a speaker, use UNKNOWN_X the place X is the speaker label quantity

The next transcript has three audio system, however solely two recognized audio system. The LLM should label a speaker as UNKNOWN quite than forcing a reputation or different response on the speaker.

spk_0: Yeah .
spk_1: Uh Thanks for becoming a member of us Courtroom . I’m your account govt right here at Aws . Uh Becoming a member of us on the decision is Adam , one in every of our options architect at Aws . Adam want to introduce your self .
spk_2: Hey , all people . Excessive Courtroom . Good to fulfill you . Uh As your devoted options architect right here at Aws . Uh my job is to assist uh you at each step of your cloud journey . Uh My , my position includes serving to you perceive architecting on Aws , together with greatest practices for the cloud . Uh So with that , I’d find it irresistible if you happen to might simply take a minute to introduce your self and possibly inform me slightly about what you’re presently working on premises .
spk_0: Yeah , nice . It’s uh nice to fulfill you , Adam . Um My title is Courtroom . I’m the V P of engineering right here at any firm . And uh yeah , actually excited to listen to what you are able to do for us .
spk_1: Thanks , work . Nicely , we , we talked slightly little bit of final time about your objectives for migrating to Aws . I invited Adam to , to hitch us in the present day to get a greater understanding of your present infrastructure and different technical necessities .
spk_2: Yeah . So co might you inform me slightly bit about what you’re presently working on premises ?
spk_1: Positive . Yeah , we’re , uh ,
spk_0: we’re working a 3 tier net app on premise .

After we give Claude Sonnet the choice to not drive a reputation , we see outcomes like this:

{“spk_0”: “Courtroom”, “spk_1”: “UNKNOWN_1”, “spk_2”: “Adam”}

Immediate engineering for cleansing the transcript

Now that the transcript has been diarized with speaker attributed names, we are able to use Amazon Bedrock to scrub the transcript. Cleansing duties embody eliminating distracting filler phrases, figuring out and rectifying misattributed homophones, and addressing any diarization errors stemming from subpar audio high quality. For steering on undertaking these duties utilizing Anthropic Claude Sonnet fashions, see the offered immediate:

Human: You’re a transcript editor, please comply with the <directions> tags.
<transcript> ${transcript} </transcript>
<directions>
– The <transcript> comprises a speaker diarized transcript
– Go over the transcript and take away all filler phrases. For instance “um, uh, er, nicely, like, you realize, okay, so, truly, mainly, truthfully, anyway, actually, proper, I imply.”
– Repair any errors in transcription that could be attributable to homophones based mostly on the context of the sentence. For instance, “one as an alternative of gained” or “excessive as an alternative of hello”
– As well as, please repair the transcript in circumstances the place diarization is badly carried out. For instance, in some circumstances you will notice that sentences are cut up between two audio system. On this case infer who the precise speaker is and attribute it to them.

– Please evaluation the next instance of this,

Enter Instance
Courtroom: Adam you’re saying the improper factor. What
Adam: um do you imply, Courtroom?

Output:
Courtroom: Adam you’re saying the improper factor.
Adam: What do you imply, Courtroom?

– In your response, return all the cleaned transcript, together with all the filler phrase elimination and the improved diarization. Solely return the transcript, don’t embody any main or trailing sentences. You aren’t summarizing. You might be cleansing the transcript. Don’t embody any xml tags <>
</directions>
Assistant:

Transcript Processing

After the preliminary transcript is handed into the LLM, it returns a refined transcript, free from errors. The next are excerpts from the transcript:

Speaker Identification

Enter:

spk_0: Hey Courtroom , how’s it going ?
spk_1: Hey Adam, it’s going good . How are you
spk_0: doing ? Nicely , uh hey , thanks for uh becoming a member of me in the present day on this name . I’m excited to speak to you about uh architecting on Aws .

Output:

Adam: Hey Courtroom, how’s it going?
Courtroom: Hey Adam, it’s going good. How are you?
Adam: Thanks for becoming a member of me in the present day. I’m excited to speak to you about architecting on AWS

Homophone Substitute

Enter:

Adam: Have you ever ex have you ever skilled uh my sequel assaults previously or sequel injections ?
Courtroom: Sure .

Output:

Adam: Have you ever skilled SQL injections previously?
Courtroom: Sure.

Filler Phrase Elimination

Enter:

Adam: Okay , nice . Uh All proper . After which are you guys certain by any compliance requirements like PC I DS S or um you realize GDR ? Uh what’s one other one ? Any of these C simply uh
Courtroom: we’re certain by destiny , reasonable complaints . So .
Adam: Okay . Nicely , you need to transition to fed ramp excessive at any time

Output:

Adam: Okay, nice. After which are you guys certain by any compliance requirements like PCI DSS or GDPR? What’s one other one? Any of these? CJIS?
Courtroom: We’re certain by FedRAMP reasonable compliance.
Adam: Okay. Will you need to transition to FedRAMP excessive at any time?

Immediate engineering for summarization

Now that the transcript has been created by Amazon Transcribe, diarized, and enhanced with Amazon Bedrock, the transcript might be summarized with Amazon Bedrock and an LLM. A easy model would possibly appear to be the next:

Human:
You’re a transcript summarizing bot. You’ll go over the transcript under and supply a abstract of the transcript.
Transcript: ${transcript}

Assistant:

Though it will work, it may be improved with further directions. You need to use XML tags to offer construction and readability to the duty necessities:

Human: You’re a transcript summarizing bot. You’ll go over the transcript under and supply a abstract of the content material inside the <directions> tags.
<transcript> ${transcript} </transcript>

<directions>
– Go over the dialog that was had within the transcript.
– Create a abstract based mostly on what occurred within the assembly.
– Spotlight particular motion objects that got here up within the assembly, together with follow-up duties for every particular person.
</directions>

Assistant:

As a result of conferences typically contain motion objects and date/time info, directions are added to explicitly request the LLM to incorporate this info. As a result of the LLM is aware of the speaker names, every particular person is assigned particular motion objects for them if any are discovered. To stop hallucinations, an instruction is included that permits the LLM to fail gracefully.

Human: You’re a transcript summarizing bot. You’ll go over the transcript under and supply a abstract of the content material inside the <directions> tags.

<transcript> ${transcript} </transcript>

<directions>
– Go over the dialog that was had within the transcript.
– Create a abstract based mostly on what occurred within the assembly.
– Spotlight particular motion objects that got here up within the assembly, together with follow-up duties for every particular person.
– If related, deal with what particular AWS providers have been talked about in the course of the dialog.
– If there’s enough context, infer the speaker’s position and point out it within the abstract. For example, “Bob, the client/designer/gross sales rep/…”
– Embrace necessary date/time, and point out urgency if vital (e.g., deadline/ETAs for outcomes and subsequent steps)
</directions>

Assistant: Ought to I add anything in my reply?

Human: If there’s not sufficient context to generate a correct abstract, then simply return a string that claims “Assembly not lengthy sufficient to generate a transcript.

Assistant:

Alternatively, we invite you to discover Amazon Transcribe Name Analytics generative name summarization for an out-of-the-box resolution that integrates instantly with Amazon Transcribe.

Immediate engineering for assembly invitation parsing

Included on this demo is a React-based UI that can begin the method of the SMA becoming a member of the assembly. As a result of this demo helps a number of assembly sorts, the invitation have to be parsed. Reasonably than parse this with a fancy common expression (regex), will probably be processed with an LLM. This immediate will first determine the assembly sort: Amazon Chime, Zoom, Google Meet, Microsoft Groups, or Cisco Webex. Based mostly on the assembly sort, the LLM will extract the assembly ID and dial-in info. Merely copy/paste the assembly invitation to the UI, and the invitation might be processed by the LLM to find out methods to name the assembly supplier. This may be executed for a gathering that’s presently occurring or scheduled for a future assembly. See the next instance:

Human: You might be an info extracting bot. Go over the assembly invitation under and decide what the assembly id and assembly sort are <directions></directions> xml tags

<meeting_invitation>
${meetingInvitation}
</meeting_invitation>

<directions>
1. Determine Assembly Sort:
Decide if the assembly invitation is for Chime, Zoom, Google, Microsoft Groups, or Webex conferences.

2. Chime, Zoom, and Webex
– Discover the meetingID
– Take away all areas from the assembly ID (e.g., #### ## #### -> ##########).

3. If Google – Directions Extract Assembly ID and Dial in
– For Google solely, the assembly invitation will name a meetingID a ‘pin’, so deal with it as a meetingID
– Take away all areas from the assembly ID (e.g., #### ## #### -> ##########).
– Extract Google and Microsoft Dial-In Quantity (if relevant):
– If the assembly is a Google assembly, extract the distinctive dial-in quantity.
– Find the dial-in quantity following the textual content “to hitch by telephone dial.”
– Format the extracted Google dial-in quantity as (+1 ###-###-####), eradicating dashes and areas. For instance +1 111-111-1111 would turn out to be +11111111111)

4. If Microsoft Groups – Directions if assembly sort is Microsoft Groups.
– Take note of these directions fastidiously
– The meetingId we need to retailer within the generated response is the ‘Cellphone Convention ID’ : ### ### ###
– within the assembly invitation, there are two IDs a ‘Assembly ID’ (### ### ### ##) and a ‘Cellphone Convention ID’ (### ### ###), ignore the ‘Assembly ID’ use the ‘Cellphone Convention ID’
– The meetingId we need to retailer within the generated response is the ‘Cellphone Convention ID’ : ### ### ###
– The meetingID that we wish is referenced because the ‘Cellphone Convention ID’ retailer that one because the assembly ID.
– Discover the telephone quantity, extract it and retailer it because the dialIn quantity (format (+1 ###-###-####), eradicating dashes and areas. For instance +1 111-111-1111 would turn out to be +11111111111)

5. meetingType guidelines
– The one legitimate responses for meetingType are ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Groups’

6. Generate Response:
– Create a response object with the next format:
{
meetingId: “assembly id goes right here with areas eliminated”,
meetingType: “assembly sort goes right here (choices: ‘Chime’, ‘Webex’, ‘Zoom’, ‘Google’, ‘Groups’)”,
dialIn: “Insert Google/Microsoft Groups Dial-In quantity with no dashes or areas, or N/A if not a Google/Microsoft Groups Assembly”
}

Assembly ID Codecs:
Zoom: ### #### ####
Webex: #### ### ####
Chime: #### ## ####
Google: ### ### ####
Groups: ### ### ###

Be sure that this system doesn’t create faux telephone numbers and solely consists of the Microsoft or Google dial-in quantity if the assembly sort is Google or Groups.

</directions>

Assistant: Ought to I add anything in my reply?

Human: Solely return a JSON formatted response with the meetingid and assembly sort related to it. Don’t add some other phrases to your reply. Don’t add any introductory sentences in your reply.

Assistant:

With this info extracted from the invitation, a name is positioned to the assembly supplier in order that the SMA can be a part of the assembly as a participant.

Clear up

In case you deployed this pattern resolution, clear up your assets by destroying the AWS CDK utility from the AWS Command Line Interface (AWS CLI). This may be executed utilizing the next command:

Conclusion

On this publish, we confirmed methods to improve Amazon Transcribe with an LLM utilizing Amazon Bedrock by extracting info that may in any other case be troublesome for a regex to extract. We additionally used this technique to extract info from a gathering invitation despatched from an unknown supply. Lastly, we confirmed methods to use an LLM to offer a summarization of the assembly utilizing detailed directions to supply motion objects and embody date/time info within the response.

We invite you to deploy this demo into your personal account. We’d love to listen to from you. Tell us what you suppose within the points discussion board of the Amazon Chime SDK Assembly Summarizer GitHub repository. Alternatively, we invite you to discover different strategies for assembly transcription and summarization, resembling Amazon Stay Assembly Assistant, which makes use of a browser extension to gather name audio.

In regards to the authors

Adam Neumiller is a Options Architect for AWS. He’s centered on serving to public sector prospects drive cloud adoption via using infrastructure as code. Outdoors of labor, he enjoys spending time together with his household and exploring the good outside.

Courtroom Schuett is a Principal Specialist SA – GenAI centered on third occasion fashions and the way they can be utilized to assist remedy buyer issues. When he’s not coding, he spends his time exploring parks, touring together with his household, and listening to music.

Christopher Lott is a Principal Options Architect within the AWS AI Language Providers workforce. He has 20 years of enterprise software program growth expertise. Chris lives in Sacramento, California, and enjoys gardening, cooking, aerospace/basic aviation, and touring the world.

Hold Su is a Senior Utilized Scientist at AWS AI. He has been main AWS Transcribe Contact Lens Science workforce. His curiosity lies in call-center analytics, LLM-based abstractive summarization, and basic conversational AI.

Jason Cai is an Utilized Scientist at AWS AI. He has made contributions to AWS Bedrock, Contact Lens, Lex and Transcribe. His pursuits embody LLM brokers, dialogue summarization, LLM prediction refinement, and data graph.

Edgar Costa Filho is a Senior Cloud Infrastructure Architect with a deal with Foundations and Containers, together with experience in integrating Amazon EKS with open-source tooling like Crossplane, Terraform, and GitOps. In his position, Edgar is devoted to aiding prospects in attaining their enterprise goals by implementing greatest practices in cloud infrastructure design and administration.

Increase productiveness with video conferencing transcripts and summaries with the Amazon Chime SDK Assembly Summarizer resolution

Resolution overview

Stipulations

Entry the assembly with Amazon Chime SDK

Speaker diarization with Amazon Transcribe

Use Anthropic Claude and Amazon Bedrock to boost the answer

Immediate engineering for speaker title identification