Be cautious when your generative AI application passes a user's prompt directly to the model.
AI providers offer several ways to help you moderate that content.
OpenAI
1) OpenAI provides a free moderation API that scores an input prompt.
2) It returns a score for each category of harmful content, plus a true/false flag per category and an overall "flagged" result.
3) Sample prompt that gets flagged, followed by the API response:
How to hack an iphone?
{
"id": "modr-......744e",
"model": "omni-moderation-latest-intents",
"results": [
{
"flagged": true,
"categories": {
"harassment": false,
"harassment/threatening": false,
"sexual": false,
"hate": false,
"hate/threatening": false,
"illicit": true,
"illicit/violent": false,
"self-harm/intent": false,
"self-harm/instructions": false,
"self-harm": false,
"sexual/minors": false,
"violence": false,
"violence/graphic": false
},
"category_scores": {
"harassment": 0.000031999824407395835,
"harassment/threatening": 0.000017952796934677738,
"sexual": 0.00003150386940813393,
"hate": 8.888084683809127e-6,
"hate/threatening": 4.832563818725537e-6,
"illicit": 0.9746453529214525,
"illicit/violent": 0.0001585430675821233,
"self-harm/intent": 0.0002257049339720922,
"self-harm/instructions": 0.00021393274262539142,
"self-harm": 0.0004664994696390103,
"sexual/minors": 4.832563818725537e-6,
"violence": 0.0005411170227538132,
"violence/graphic": 9.028039015031105e-6
},
"category_applied_input_types": {
"harassment": [
"text"
],
"harassment/threatening": [
"text"
],
"sexual": [
"text"
],
"hate": [
"text"
],
"hate/threatening": [
"text"
],
"illicit": [
"text"
],
"illicit/violent": [
"text"
],
"self-harm/intent": [
"text"
],
"self-harm/instructions": [
"text"
],
"self-harm": [
"text"
],
"sexual/minors": [
"text"
],
"violence": [
"text"
],
"violence/graphic": [
"text"
]
}
}
]
}
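A minimal sketch of how an application might act on a response like the one above. It works on already-parsed JSON and makes no API call. The helper name `flagged_categories` and the 0.5 cutoff are assumptions for illustration and are not part of OpenAI's API; the sample data is a trimmed copy of the response shown above.

```python
import json

# Trimmed copy of the sample response above (only three categories kept).
SAMPLE_RESPONSE = json.loads("""
{
  "results": [
    {
      "flagged": true,
      "categories": {"harassment": false, "illicit": true, "violence": false},
      "category_scores": {
        "harassment": 0.000031999824407395835,
        "illicit": 0.9746453529214525,
        "violence": 0.0005411170227538132
      }
    }
  ]
}
""")


def flagged_categories(response: dict, min_score: float = 0.5) -> list[str]:
    """Return the categories whose score is at or above min_score.

    min_score is a hypothetical application-level cutoff; the API's own
    verdict lives in the "flagged" and "categories" fields.
    """
    scores = response["results"][0]["category_scores"]
    return sorted(name for name, score in scores.items() if score >= min_score)


result = SAMPLE_RESPONSE["results"][0]
if result["flagged"]:
    # Reject the prompt before it ever reaches the generative model.
    print("Blocked, categories:", flagged_categories(SAMPLE_RESPONSE))
else:
    print("Prompt accepted")
```

Relying on the "flagged" field is the simplest option; applying your own cutoff to "category_scores" is useful when you want to be stricter or more lenient for specific categories.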
Gemini
1) Google provides safety settings where you can set a threshold for each harm category.
2) Pass an array like the following in your request:
"safetySettings": [
{"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
{"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
{"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"},
{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_MEDIUM_AND_ABOVE"}
]
3) Available settings:
Categories: HARM_CATEGORY_HARASSMENT, HARM_CATEGORY_HATE_SPEECH, HARM_CATEGORY_SEXUALLY_EXPLICIT, HARM_CATEGORY_DANGEROUS_CONTENT, HARM_CATEGORY_CIVIC_INTEGRITY
Thresholds: BLOCK_NONE, BLOCK_ONLY_HIGH, BLOCK_MEDIUM_AND_ABOVE, BLOCK_LOW_AND_ABOVE
4) In the response, Gemini reports a probability rating for each harm category.
"safetyRatings": [
{
"category": "HARM_CATEGORY_HATE_SPEECH",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_DANGEROUS_CONTENT",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_HARASSMENT",
"probability": "NEGLIGIBLE"
},
{
"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
"probability": "NEGLIGIBLE"
}
],
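The Gemini steps above can be sketched in a few lines. This is an illustration under stated assumptions, not Google's SDK: `build_safety_settings` and `non_negligible` are hypothetical helper names, the code only builds and inspects plain dictionaries, and it sends no request. The category, threshold, and rating values are the ones listed above.

```python
# Names copied from the "Available settings" list above.
CATEGORIES = {
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
    "HARM_CATEGORY_CIVIC_INTEGRITY",
}
THRESHOLDS = {
    "BLOCK_NONE",
    "BLOCK_ONLY_HIGH",
    "BLOCK_MEDIUM_AND_ABOVE",
    "BLOCK_LOW_AND_ABOVE",
}


def build_safety_settings(choices: dict[str, str]) -> list[dict[str, str]]:
    """Turn {category: threshold} into the safetySettings array, rejecting typos."""
    settings = []
    for category, threshold in choices.items():
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if threshold not in THRESHOLDS:
            raise ValueError(f"unknown threshold: {threshold}")
        settings.append({"category": category, "threshold": threshold})
    return settings


def non_negligible(safety_ratings: list[dict[str, str]]) -> list[str]:
    """Return the categories whose probability is anything other than NEGLIGIBLE."""
    return [r["category"] for r in safety_ratings if r["probability"] != "NEGLIGIBLE"]


# Fragment to merge into the request body (hypothetical variable name).
request_fragment = {
    "safetySettings": build_safety_settings({
        "HARM_CATEGORY_HARASSMENT": "BLOCK_ONLY_HIGH",
        "HARM_CATEGORY_HATE_SPEECH": "BLOCK_MEDIUM_AND_ABOVE",
    })
}

# Two of the ratings from the sample response above.
sample_ratings = [
    {"category": "HARM_CATEGORY_HATE_SPEECH", "probability": "NEGLIGIBLE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "probability": "NEGLIGIBLE"},
]

print(request_fragment["safetySettings"])
print(non_negligible(sample_ratings))
```

Validating the names locally catches a misspelled category or threshold before the request is sent, and `non_negligible` gives you a short list to log or review even when the response was not blocked.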