To verify the distribution of spam blogs for W = 100, the blogs in the extracted co-occurrence clusters were sorted in descending order of spam score and divided into 10 groups.
However, there are many spam blogs that reap benefits from advertising without this type of unique content.
Detecting spam blogs is a difficult task for filtering methods based on machine-learning mechanisms, e.g.
In this paper, one important criterion to evaluate spam blogs is how much value a blogger adds to his or her blog.
To meet this requirement and to extract huge co-occurrence clusters between spam blogs and spam words from the blogosphere, the Shared Interest algorithm (SI) has been developed.
Spam blogs and spam words can be extracted as large co-occurrence clusters because spam blogs share many common words: they copy and paste content from other blogs and post to multiple blogs and sites.
As a preliminary experiment, this paper ranked the extracted clusters by spam score, divided them into 10 groups, sampled 20 blogs from each group, and checked each sample for spam blogs.
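The ranking-and-sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_by_score_group`, the `(blog_id, spam_score)` tuple layout, the near-equal group sizes, and the fixed random seed are all assumptions made for the example.

```python
import random


def sample_by_score_group(scored_blogs, n_groups=10, sample_size=20, seed=0):
    """Sort blogs by spam score (descending), split them into n_groups
    groups of near-equal size, and draw up to sample_size blogs per group.

    scored_blogs: iterable of (blog_id, spam_score) pairs (hypothetical layout).
    Returns a list of n_groups lists of sampled pairs.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility (assumption)
    ranked = sorted(scored_blogs, key=lambda pair: pair[1], reverse=True)

    # Spread any remainder over the first groups so sizes differ by at most 1.
    base, extra = divmod(len(ranked), n_groups)

    samples = []
    start = 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)
        group = ranked[start:start + size]
        start += size
        # A group smaller than sample_size is taken whole.
        k = min(sample_size, len(group))
        samples.append(rng.sample(group, k))
    return samples
```

Each sampled group would then be inspected by hand, so that the share of spam blogs per group can be compared across the score ranking.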
Spam blogs and words are mutually detected with spam seeds.
Step 2: The spam rate $\mathit{swrate}(w_j)$ of each word $w_j \in \mathit{UWord}$ is calculated as $|\mathit{sblog}(w_j)| / |\mathit{ablog}(w_j)|$, where $\mathit{sblog}(w_j)$ and $\mathit{ablog}(w_j)$ denote the set of spam blogs that use word $w_j$ and the set of all blogs that use it, respectively.
Step 4: The rate of detected spam blogs is calculated as $|\mathit{SBlog}| / |\mathit{ABlog}|$.
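The two ratios in Steps 2 and 4 can be illustrated with a short sketch. It assumes a hypothetical `word_to_blogs` mapping from each word to the set of blog IDs that use it, and a set `spam_blogs` of blogs currently labeled as spam; the function names and the zero-rate convention for empty sets are likewise assumptions, not part of the paper.

```python
def word_spam_rate(word, word_to_blogs, spam_blogs):
    """Step 2 ratio: |sblog(w)| / |ablog(w)| for a single word w."""
    ablog = word_to_blogs.get(word, set())  # all blogs that use the word
    if not ablog:
        return 0.0  # convention for words no blog uses (assumption)
    sblog = ablog & spam_blogs  # spam blogs that use the word
    return len(sblog) / len(ablog)


def detected_spam_rate(detected_spam_blogs, all_blogs):
    """Step 4 ratio: |SBlog| / |ABlog|."""
    if not all_blogs:
        return 0.0  # convention for an empty data set (assumption)
    return len(detected_spam_blogs) / len(all_blogs)
```

For example, a word used by four blogs, three of which are spam, has a spam rate of 0.75.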
Because parameter S is the estimated rate of spam blogs based on an observation of a data set and the estimated number of spam blogs in step 5, i.e.
* S: rate of spam blogs in all blogs (the rate is defined by the results of sampling from all blogs)