How do you moderate prompt content?

You should be cautious whenever your Generative AI application passes user-supplied prompt input directly to a model.

AI companies provide various tools to help you moderate that content.


OpenAI

1) OpenAI provides a free moderation API that generates scores for an input prompt.

2) A score is reported for each offensive-content category.

3) Sample flagged prompt, followed by the API's response:

How to hack an iphone?

{
    "id": "modr-......744e",
    "model": "omni-moderation-latest-intents",
    "results": [
        {
            "flagged": true,
            "categories": {
                "harassment": false,
                "harassment/threatening": false,
                "sexual": false,
                "hate": false,
                "hate/threatening": false,
                "illicit": true,
                "illicit/violent": false,
                "self-harm/intent": false,
                "self-harm/instructions": false,
                "self-harm": false,
                "sexual/minors": false,
                "violence": false,
                "violence/graphic": false
            },
            "category_scores": {
                "harassment": 0.000031999824407395835,
                "harassment/threatening": 0.000017952796934677738,
                "sexual": 0.00003150386940813393,
                "hate": 8.888084683809127e-6,
                "hate/threatening": 4.832563818725537e-6,
                "illicit": 0.9746453529214525,
                "illicit/violent": 0.0001585430675821233,
                "self-harm/intent": 0.0002257049339720922,
                "self-harm/instructions": 0.00021393274262539142,
                "self-harm": 0.0004664994696390103,
                "sexual/minors": 4.832563818725537e-6,
                "violence": 0.0005411170227538132,
                "violence/graphic": 9.028039015031105e-6
            },
            "category_applied_input_types": {
                "harassment": [
                    "text"
                ],
                "harassment/threatening": [
                    "text"
                ],
                "sexual": [
                    "text"
                ],
                "hate": [
                    "text"
                ],
                "hate/threatening": [
                    "text"
                ],
                "illicit": [
                    "text"
                ],
                "illicit/violent": [
                    "text"
                ],
                "self-harm/intent": [
                    "text"
                ],
                "self-harm/instructions": [
                    "text"
                ],
                "self-harm": [
                    "text"
                ],
                "sexual/minors": [
                    "text"
                ],
                "violence": [
                    "text"
                ],
                "violence/graphic": [
                    "text"
                ]
            }
        }
    ]
}
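The flow above can be sketched in Python with only the standard library. The `OPENAI_API_KEY` environment variable, the 0.5 score threshold, and the `should_block` helper are assumptions for illustration; the `/v1/moderations` endpoint and the response shape come from the sample above.

```python
import json
import os
import urllib.request

MODERATION_URL = "https://api.openai.com/v1/moderations"

def moderate(prompt: str) -> dict:
    """Send a prompt to OpenAI's moderation endpoint and return the parsed JSON."""
    payload = json.dumps({
        "model": "omni-moderation-latest",
        "input": prompt,
    }).encode("utf-8")
    request = urllib.request.Request(
        MODERATION_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def should_block(result: dict, score_threshold: float = 0.5) -> bool:
    """Block when the API flags the prompt or any category score crosses the threshold."""
    if result["flagged"]:
        return True
    return any(score >= score_threshold
               for score in result["category_scores"].values())
```

Applied to the sample response above, `should_block` would reject the prompt: it is already flagged, and the "illicit" score (about 0.97) is well over the threshold.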

Gemini

1) Google provides safety settings where you can set a threshold for each harm category.

2) You can include an array like the one below in your request.

"safetySettings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
]

3) Available settings:

Harm categories:

HARM_CATEGORY_HARASSMENT

HARM_CATEGORY_HATE_SPEECH

HARM_CATEGORY_SEXUALLY_EXPLICIT

HARM_CATEGORY_DANGEROUS_CONTENT

HARM_CATEGORY_CIVIC_INTEGRITY

Thresholds:

BLOCK_NONE

BLOCK_ONLY_HIGH

BLOCK_MEDIUM_AND_ABOVE

BLOCK_LOW_AND_ABOVE
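A request carrying these settings can be sketched as below. The `GEMINI_API_KEY` environment variable, the `gemini-1.5-flash` model name, and the endpoint version are assumptions for illustration; the safetySettings array is the one from point 2.

```python
import json
import os
import urllib.request

# Model name and API version are assumptions for illustration.
URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
       "gemini-1.5-flash:generateContent")

def build_request(prompt: str) -> dict:
    """Attach per-category safety thresholds to a generateContent payload."""
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "safetySettings": [
            {"category": "HARM_CATEGORY_HARASSMENT",
             "threshold": "BLOCK_ONLY_HIGH"},
            {"category": "HARM_CATEGORY_HATE_SPEECH",
             "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
             "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
             "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        ],
    }

def generate(prompt: str) -> dict:
    """POST the payload to the Gemini REST endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{URL}?key={os.environ['GEMINI_API_KEY']}",
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```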

4) In its response, Gemini reports a safety rating for each harm category.

"safetyRatings": [
                {
                    "category": "HARM_CATEGORY_HATE_SPEECH",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_HARASSMENT",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    "probability": "NEGLIGIBLE"
                }
            ],
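These ratings can be checked before showing the model's output. A minimal sketch follows; the LOW/MEDIUM/HIGH levels are assumed alongside the NEGLIGIBLE value shown above, and the `needs_review` helper and its default level are illustrative choices.

```python
# Probability levels assumed to run from least to most likely.
PROBABILITY_RANK = {"NEGLIGIBLE": 0, "LOW": 1, "MEDIUM": 2, "HIGH": 3}

def max_risk(safety_ratings: list) -> tuple:
    """Return the (category, probability) pair with the highest reported probability."""
    worst = max(safety_ratings, key=lambda r: PROBABILITY_RANK[r["probability"]])
    return worst["category"], worst["probability"]

def needs_review(safety_ratings: list, level: str = "MEDIUM") -> bool:
    """Flag a response when any category meets or exceeds the given probability level."""
    floor = PROBABILITY_RANK[level]
    return any(PROBABILITY_RANK[r["probability"]] >= floor
               for r in safety_ratings)
```

Against the sample safetyRatings above, every category is NEGLIGIBLE, so `needs_review` would return False and the response could be passed through.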
