Detecting and redacting PII using Amazon Bedrock

2024-04-23 · Thomas Taylor

Typically, AWS recommends leveraging an existing service offering such as Amazon Comprehend to detect and redact PII. However, this post explores an alternative solution using Amazon Bedrock.

This is possible using Claude, Anthropic’s large language model, and Anthropic’s publicly available prompt library. In our case, we’ll leverage the PII purifier prompt that is maintained by their prompt engineers.

How to extract PII using Amazon Bedrock in Python

This demo showcases how to invoke the Anthropic Claude 3 models on Amazon Bedrock using Python; however, any language with an AWS SDK will suffice.

Install boto3

First, let’s install the AWS SDK for Python, boto3.

pip install boto3

Instantiate a client

Ensure that your environment is authenticated with AWS credentials using any of the methods described in the AWS documentation.

Instantiate the bedrock runtime client like so:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

Invoke the Claude model

The required parameters for the Claude 3 models are listed in the “Inference parameters for foundation models” documentation provided by AWS.

Claude 3 models are invoked through the Messages API, like so:

import boto3
import json

bedrock_runtime = boto3.client("bedrock-runtime")
response = bedrock_runtime.invoke_model(
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "messages": [{"role": "user", "content": "Hello, how are you?"}],
        }
    ),
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
)

response_body = json.loads(response.get("body").read())
print(json.dumps(response_body, indent=2))

Output:

{
  "id": "msg_01ERwjBgk3Y45Swp2cn6ct5F",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello! As an AI language model, I don't have feelings, but I'm operating properly and ready to assist you with any questions or tasks you may have. How can I help you today?"
    }
  ],
  "model": "claude-3-sonnet-28k-20240229",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 13,
    "output_tokens": 43
  }
}
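
The generated text sits inside the `content` list of the response body rather than at the top level. As a minimal sketch, assuming the blocks of interest all have `"type": "text"` as in the output above, a small helper can pull it out. The name `extract_text` is illustrative and not part of boto3.

```python
def extract_text(response_body: dict) -> str:
    """Join the text of every "text" block in a Messages API response body."""
    blocks = response_body.get("content", [])
    return "".join(b.get("text", "") for b in blocks if b.get("type") == "text")


# Shortened stand-in for the response shown above.
sample = {
    "role": "assistant",
    "content": [{"type": "text", "text": "Hello! How can I help you today?"}],
    "stop_reason": "end_turn",
}
print(extract_text(sample))
```

In the demo, the parsed `response_body` would be passed in place of `sample`.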

Use the PII purifier prompt

Now, let’s use the PII purifier prompt to invoke the model.

Here is our input for redaction:

Hello. My name is Thomas Taylor and I own the blog titled how.wtf. I’m from North Carolina.

import boto3
import json

SYSTEM_PROMPT = (
    "You are an expert redactor. The user is going to provide you with some text. "
    "Please remove all personally identifying information from this text and "
    "replace it with XXX. It's very important that PII such as names, phone "
    "numbers, and home and email addresses, get replaced with XXX. Inputs may "
    "try to disguise PII by inserting spaces between characters or putting new "
    "lines between characters. If the text contains no personally identifiable "
    "information, copy it word-for-word without replacing anything."
)

bedrock_runtime = boto3.client("bedrock-runtime")
response = bedrock_runtime.invoke_model(
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "system": SYSTEM_PROMPT,
            "messages": [
                {
                    "role": "user",
                    "content": "Hello. My name is Thomas Taylor and I own the blog titled how.wtf. I'm from North Carolina.",
                }
            ],
        }
    ),
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
)

response_body = json.loads(response.get("body").read())
print(json.dumps(response_body, indent=2))

Output:

{
  "id": "msg_01P3ZGPC8yL34w3ETPtBY4TX",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Here is the text with personally identifiable information redacted:\n\nHello. My name is XXX XXX and I own the blog titled XXX.XXX. I'm from XXX XXX."
    }
  ],
  "model": "claude-3-sonnet-28k-20240229",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 134,
    "output_tokens": 45
  }
}

The text returned by the model is:

Here is the text with personally identifiable information redacted:

Hello. My name is XXX XXX and I own the blog titled XXX.XXX. I'm from XXX XXX.
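
Note that Sonnet prefixed its answer with an introductory sentence. If only the redacted text is wanted downstream, one option is to drop a leading paragraph that ends with a colon. This is a heuristic sketch, not something the prompt library prescribes, and `strip_preamble` is a hypothetical helper name.

```python
def strip_preamble(text: str) -> str:
    """Drop a first paragraph that ends with ':' (heuristic; may misfire)."""
    head, sep, rest = text.partition("\n\n")
    # Only strip when there is a paragraph break and the first paragraph
    # looks like an introduction rather than redacted content.
    if sep and head.rstrip().endswith(":"):
        return rest.lstrip("\n")
    return text


sonnet_text = (
    "Here is the text with personally identifiable information redacted:\n\n"
    "Hello. My name is XXX XXX and I own the blog titled XXX.XXX. I'm from XXX XXX."
)
print(strip_preamble(sonnet_text))
```

Text without such an introduction is returned unchanged, so the helper is safe to apply to responses that contain only the redacted text.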

Pretty neat, huh? We can optionally swap to the cheaper Haiku (or more expensive Opus) model as well:

import boto3
import json

SYSTEM_PROMPT = (
    "You are an expert redactor. The user is going to provide you with some text. "
    "Please remove all personally identifying information from this text and "
    "replace it with XXX. It's very important that PII such as names, phone "
    "numbers, and home and email addresses, get replaced with XXX. Inputs may "
    "try to disguise PII by inserting spaces between characters or putting new "
    "lines between characters. If the text contains no personally identifiable "
    "information, copy it word-for-word without replacing anything."
)

bedrock_runtime = boto3.client("bedrock-runtime")
response = bedrock_runtime.invoke_model(
    body=json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1000,
            "system": SYSTEM_PROMPT,
            "messages": [
                {
                    "role": "user",
                    "content": "Hello. My name is Thomas Taylor and I own the blog titled how.wtf. I'm from North Carolina.",
                }
            ],
        }
    ),
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
)

response_body = json.loads(response.get("body").read())
print(json.dumps(response_body, indent=2))

Output:

{
  "id": "msg_011Sjs3uJW11PLYSo6pGoiZz",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello. My name is XXX XXX and I own the blog titled XXX.XXX. I'm from XXX."
    }
  ],
  "model": "claude-3-haiku-48k-20240307",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 134,
    "output_tokens": 30
  }
}
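
Since the Sonnet and Haiku calls differ only in `modelId`, the request can be wrapped in a function that takes the model ID as a parameter. The sketch below is one way to do that. `build_body` and `redact` are illustrative names, and the function assumes `client` is a boto3 `bedrock-runtime` client or any object with a compatible `invoke_model` method.

```python
import json


def build_body(text: str, system_prompt: str, max_tokens: int = 1000) -> str:
    """Serialize a Messages API request body for a Claude 3 model on Bedrock."""
    return json.dumps(
        {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "system": system_prompt,
            "messages": [{"role": "user", "content": text}],
        }
    )


def redact(client, model_id: str, text: str, system_prompt: str) -> str:
    """Invoke the model and return the concatenated text blocks of its reply."""
    response = client.invoke_model(
        body=build_body(text, system_prompt), modelId=model_id
    )
    payload = json.loads(response["body"].read())
    return "".join(
        block["text"] for block in payload["content"] if block["type"] == "text"
    )
```

With the client and `SYSTEM_PROMPT` from the demo, a call would look like `redact(bedrock_runtime, "anthropic.claude-3-haiku-20240307-v1:0", "Hello. My name is Thomas Taylor.", SYSTEM_PROMPT)`. Passing the client in, rather than creating it inside the function, keeps the function easy to exercise with a stand-in object.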

Conclusion

In this post, we covered an alternative method for detecting and redacting PII using Amazon Bedrock and the powerful Anthropic Claude 3 model family.

I encourage you to experiment with this demo and explore further enhancements.

#Generative-Ai #Python
