You should be cautious when your Generative AI application passes user-supplied prompt input directly to the model.

AI companies provide various ways to help you moderate this content.


OpenAI

1) OpenAI provides a free Moderation API that scores the input prompt; a usage sketch follows the sample response below.

2) A score and a boolean flag are returned for each harm category.

3) Sample response for a flagged prompt:

How to hack an iPhone?

{
    "id": "modr-......744e",
    "model": "omni-moderation-latest-intents",
    "results": [
        {
            "flagged": true,
            "categories": {
                "harassment": false,
                "harassment/threatening": false,
                "sexual": false,
                "hate": false,
                "hate/threatening": false,
                "illicit": true,
                "illicit/violent": false,
                "self-harm/intent": false,
                "self-harm/instructions": false,
                "self-harm": false,
                "sexual/minors": false,
                "violence": false,
                "violence/graphic": false
            },
            "category_scores": {
                "harassment": 0.000031999824407395835,
                "harassment/threatening": 0.000017952796934677738,
                "sexual": 0.00003150386940813393,
                "hate": 8.888084683809127e-6,
                "hate/threatening": 4.832563818725537e-6,
                "illicit": 0.9746453529214525,
                "illicit/violent": 0.0001585430675821233,
                "self-harm/intent": 0.0002257049339720922,
                "self-harm/instructions": 0.00021393274262539142,
                "self-harm": 0.0004664994696390103,
                "sexual/minors": 4.832563818725537e-6,
                "violence": 0.0005411170227538132,
                "violence/graphic": 9.028039015031105e-6
            },
            "category_applied_input_types": {
                "harassment": [
                    "text"
                ],
                "harassment/threatening": [
                    "text"
                ],
                "sexual": [
                    "text"
                ],
                "hate": [
                    "text"
                ],
                "hate/threatening": [
                    "text"
                ],
                "illicit": [
                    "text"
                ],
                "illicit/violent": [
                    "text"
                ],
                "self-harm/intent": [
                    "text"
                ],
                "self-harm/instructions": [
                    "text"
                ],
                "self-harm": [
                    "text"
                ],
                "sexual/minors": [
                    "text"
                ],
                "violence": [
                    "text"
                ],
                "violence/graphic": [
                    "text"
                ]
            }
        }
    ]
}
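
For example, a minimal Python sketch of this flow, assuming the requests library, an OPENAI_API_KEY environment variable, and the documented omni-moderation-latest model alias, could look like this:

import os
import requests

# A minimal sketch: call the moderation endpoint with the user's prompt.
resp = requests.post(
    "https://api.openai.com/v1/moderations",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "omni-moderation-latest", "input": "How to hack an iPhone?"},
    timeout=30,
)
result = resp.json()["results"][0]

# Reject the prompt (or route it for human review) when any category is flagged.
if result["flagged"]:
    flagged = [name for name, hit in result["categories"].items() if hit]
    print("Rejecting prompt, flagged categories:", flagged)

You can also compare the individual category_scores against thresholds of your own instead of relying solely on the overall flagged value.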

Gemini

1) Google's Gemini API provides safety settings that let you set a blocking threshold for each harm category.

2) You can include an array like the one below in your request; a full request sketch appears at the end of this section.

"safetySettings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
]

3) Available harm categories:

HARM_CATEGORY_HARASSMENT

HARM_CATEGORY_HATE_SPEECH

HARM_CATEGORY_SEXUALLY_EXPLICIT

HARM_CATEGORY_DANGEROUS_CONTENT

HARM_CATEGORY_CIVIC_INTEGRITY

Available thresholds:

BLOCK_NONE

BLOCK_ONLY_HIGH

BLOCK_MEDIUM_AND_ABOVE

BLOCK_LOW_AND_ABOVE

4) In the response, Gemini returns a safety rating for each harm category, as in the excerpt below.

"safetyRatings": [
                {
                    "category": "HARM_CATEGORY_HATE_SPEECH",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_HARASSMENT",
                    "probability": "NEGLIGIBLE"
                },
                {
                    "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
                    "probability": "NEGLIGIBLE"
                }
            ],
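
Putting the two pieces together, here is a minimal Python sketch of a generateContent request that applies safetySettings and then inspects the returned safetyRatings. The gemini-1.5-flash model name, the GEMINI_API_KEY environment variable, and the requests library are assumptions; substitute whatever your application uses.

import os
import requests

url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    "gemini-1.5-flash:generateContent"
)
payload = {
    "contents": [{"parts": [{"text": "Summarise the plot of Hamlet."}]}],
    "safetySettings": [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
    ],
}

resp = requests.post(
    url,
    params={"key": os.environ["GEMINI_API_KEY"]},
    json=payload,
    timeout=30,
)
body = resp.json()

# Each candidate carries safetyRatings; treat anything above LOW probability as worth reviewing.
candidate = body["candidates"][0]
risky = [
    rating["category"]
    for rating in candidate.get("safetyRatings", [])
    if rating["probability"] not in ("NEGLIGIBLE", "LOW")
]
if risky:
    print("Review before displaying:", risky)

When a threshold is exceeded, Gemini may block the prompt or the candidate entirely; in that case the response carries a promptFeedback.blockReason or a finishReason of SAFETY instead of the usual content.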