[spam-prevention] What is the best way to programmatically detect porn images?

Akismet does an amazing job at detecting spam comments. But comments are not the only form of spam these days. What if I wanted something like akismet to automatically detect porn images on a social networking site which allows users to upload their pics, avatars, etc?

There are already a few image based search engines as well as face recognition stuff available so I am assuming it wouldn't be rocket science and it could be done. However, I have no clue regarding how that stuff works and how I should go about it if I want to develop it from scratch.

How should I get started?

Is there any open source project for this going on?

This question is related to spam-prevention

The answer is


Seems to me like the main obstacle is defining a "porn image". If you can define it easily, you could probably write something that would work. But even humans can't agree on what is porn. How will the application know? User moderation is probably your best bet.


I've heard about tools which were using very simple, but quite effective algorithm. The algorithm calculated relative amount of pixels with color value near to some predefined "skin" colours. If that amount is higher than some predefined value then image is considered to be of erotic/pornographic content. Of course that algorithm will give false positive results for close-up face photos and many other things.
Since you are writing about social networking there will be lots of "normal" photos with high amount of skin colour on it, so you shouldn't use this algorithm to deny all pictures with positive result. But you can use it provide some help for moderators, for example flag these pictures with higher priority, so if moderator want to check some new pictures for pornographic content he can start from these pictures.


BOOM! Here is the whitepaper containing the algorithm.

Does anyone know where to get the source code for a java (or any language) implementation?

That would rock.

One algorithm called WISE has a 98% accuracy rate but a 14% false positive rate. So what you do is you let the users flag the 2% false negatives, ideally with automatic removal if a certain number of users flag it, and have moderators view the 14% false positives.


The BrightCloud web service API is perfect for this. It's a REST API for doing website lookups just like this. It contains a very large and very accurate web filtering DB and one of the categories, Adult, has over 10M porn sites identified!


Two options I can think of (though neither of them is programatically detecting porn):

  1. Block all uploaded images until one of your administrators has looked at them. There's no reason why this should take a long time: you could write some software that shows 10 images a second, almost as a movie - even at this speed, it's easy for a human being to spot a potentially pornographic image. Then you rewind in this software and have a closer look.
  2. Add the usual "flag this image as inappropriate" option.

This is actually reasonably easy. You can programatically detect skin tones - and porn images tend to have a lot of skin. This will create false positives but if this is a problem you can pass images so detected through actual moderation. This not only greatly reduces the the work for moderators but also gives you lots of free porn. It's win-win.

#!python    
import os, glob
from PIL import Image

def get_skin_ratio(im):
    im = im.crop((int(im.size[0]*0.2), int(im.size[1]*0.2), im.size[0]-int(im.size[0]*0.2), im.size[1]-int(im.size[1]*0.2)))
    skin = sum([count for count, rgb in im.getcolors(im.size[0]*im.size[1]) if rgb[0]>60 and rgb[1]<(rgb[0]*0.85) and rgb[2]<(rgb[0]*0.7) and rgb[1]>(rgb[0]*0.4) and rgb[2]>(rgb[0]*0.2)])
    return float(skin)/float(im.size[0]*im.size[1])

for image_dir in ('porn','clean'):
    for image_file in glob.glob(os.path.join(image_dir,"*.jpg")):
        skin_percent = get_skin_ratio(Image.open(image_file)) * 100
        if skin_percent>30:
            print "PORN {0} has {1:.0f}% skin".format(image_file, skin_percent)
        else:
            print "CLEAN {0} has {1:.0f}% skin".format(image_file, skin_percent)

This code measures skin tones in the center of the image. I've tested on 20 relatively tame "porn" images and 20 completely innocent images. It flags 100% of the "porn" and 4 out of the 20 of the clean images. That's a pretty high false positive rate but the script aims to be fairly cautious and could be further tuned. It works on light, dark and Asian skin tones.

It's main weaknesses with false positives are brown objects like sand and wood and of course it doesn't know the difference between "naughty" and "nice" flesh (like face shots).

Weakness with false negatives would be images without much exposed flesh (like leather bondage), painted or tattooed skin, B&W images, etc.

source code and sample images


Detecting porn images is still a definite AI task which is very much theoretical yet.

Harvest collective power and human intelligence by adding a button/link "Report spam/abuse". Or employ several moderators to do this job.

P.S. Really surprised how many people ask questions assuming software and algorithms are all-mighty without even thinking whether what they want could be done. Are they representatives of that new breed of programmers who have no understanding of hardware, low-level programming and all that "magic behind"?

P.S. #2. I also remember that periodically it happens that some situation when people themselves cannot decide whether a picture is porn or art is taken to the court. Even after the court rules, chances are half of the people will consider the decision wrong. The last stupid situation of the kind was quite recently when a Wikipedia page got banned in UK because of a CD cover image that features some nakedness.


This is a nudity detector. I haven't tried it. It's the only OSS one I could find.

https://code.google.com/p/nudetech


If you're really have time and money:

One way of doing it is by 1) Writing an image detection algorithm to find whether an object is human or not. This can be done by bitmasking an image to retrieve it's "contours" and see if the contours fits a human contour.

2) Data mine a lot of porn images and use data mining techniques such as the C4 algorithms or Particle Swarm Optimization to learn to detect pattern that matches porn images.

This will require that you identify how a naked man/woman contours of a human body must look like in digitized format (this can be achieved in the same way OCR image recognition algorithms works).

Hope you have fun! :-)


I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description ["hard-core pornography"]; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that.

United States Supreme Court Justice Potter Stewart, 1964


There is software that detects the probability for porn, but this is not an exact science, as computers can't recognize what is actually on pictures (pictures are only a big set of values on a grid with no meaning). You can just teach the computer what is porn and what not by giving examples. This has the disadvantage that it will only recognize these or similar images.

Given the repetitive nature of porn you have a good chance if you train the system with few false positives. For example if you train the system with nude people it may flag pictures of a beach with "almost" naked people as porn too.

A similar software is the facebook software that recently came out. It's just specialized on faces. The main principle is the same.

Technically you would implement some kind of feature detector that utilizes a bayes filtering. The feature detector may look for features like percentage of flesh colored pixels if it's a simple detector or just computes the similarity of the current image with a set of saved porn images.

This is of course not limited to porn, it's actually more a corner case. I think more common are systems that try to find other things in images ;-)


The answer is really easy: It's pretty safe to say that it won't be possible in the next two decades. Before that we will probably get good translation tools. The last time I checked, the AI guys were struggling to identify the same car on two photographs shot from a slightly altered angle. Take a look on how long it took them to get good enough OCR or speech recognition together. Those are recognition problems which can benefit greatly from dictionaries and are still far from having completely reliable solutions despite of the multi-million man months thrown at them.

That being said you could simply add an "offensive?" link next to user generated contend and have a mod cross check the incoming complaints.

edit:

I forgot something: IF you are going to implement some kind of filter, you will need a reliable one. If your solution would be 50% right, 2000 out of 4000 users with decent images will get blocked. Expect an outrage.


Add an offensive link and store the md5 (or other hash) of the offending image so that it can automatically tagged in the future.

How cool would it be if somebody had a large public database of image md5 along with descriptive tags running as a webservice? Alot of porn isn't original work (in that the person who has it now, didn't probably make it) and the popular images tend to float around different places, so this could really make a difference.



Nude.js based on the whitepaper by Rigan Ap-apid from De La Salle University.


There is no way you could do this 100% (i would say maybe 1-5% would be plausible) with nowdays knowledge. You would get much better result (than those 1-5%) just checking the image-names for sex-related-words :).

@SO Troll: So true.


A graduate student from National Cheng Kung University in Taiwan did a research on this subject in 2004. He was able to achieve success rate of 89.79% in detecting nude pictures downloaded from the Internet. Here is the link to his thesis: The Study on Naked People Image Detection Based on Skin Color
It's in Chinese therefore you may need a translator in case you can't read it.


It is not rocket science. Not anymore. It is very similar to face recognition. I think that the easiest way to deal with it is to use machine learning. And since we are dealing with images, I can point towards neuronal networks, because these seem to be preferred for images. You will need training data. And you can find tons of training data on the internet but you have to crop the images to the specific part that you want the algorithm to detect. Of course you will have to break the problem into different body parts that you want to detect and create training data for each, and this is where things become amusing.

Like someone above said, it cannot be done 100% percent. There will be cases where such algorithms fail. The actual precision will be determined by your training data, the structure of your neuronal networks and how you will choose to cluster the training data (penises, vaginas, breasts, etc, and combinations of such). In any case I am very confident that this can be achieved with high accuracy for explicit porn imagery.


You can find many whitepapers on the net dealing with this subject.


Look at file name and any attributes. There's not nearly enough information to detect even 20% of naughty images, but a simple keyword blacklist would at least detect images with descriptive labels or metadata. 20 minutes of coding for a 20% success rate isn't a bad deal, especially as a prescreen that can at least catch some simple ones before you pass the rest to a moderator for judging.

The other useful trick is the opposite of course, maintain a whitelist of image sources to allow without moderation or checking. If most of your images come from known safe uploaders or sources, you can just accept them bindly.


This one looks promising. Basically they detect skin (with calibration by recognizing faces) and determine "skin paths" (i.e. measuring the proportion of skin pixels vs. face skin pixels / skin pixels). This has decent performance. http://www.prip.tuwien.ac.at/people/julian/skin-detection


I would rather allow users report on bad images. Image recognition development can take too much efforts and time and won't be as much as accurate as human eyes. It's much cheaper to outsource that moderation job.

Take a look at: Amazon Mechanical Turk

"The Amazon Mechanical Turk (MTurk) is one of the suite of Amazon Web Services, a crowdsourcing marketplace that enables computer programs to co-ordinate the use of human intelligence to perform tasks which computers are unable to do."


short answer: use a moderator ;)

Long answer: I dont think there's a project for this cause what is porn? Only legs, full nudity, midgets etc. Its subjective.


I've seen a web filtering application which does porn image filtering, sorry I can't remember the name. It was pretty prone to false positives however most of the time it was working.

I think main trick is detecting "too much skin on the picture :)


Examples related to spam-prevention

What is the best way to programmatically detect porn images?