CSE158 Assignment 2 Report Uploaded by Brian Nguyen
Report for school project.
Brian Lu Nguyen A12940672
[email protected]
1. The Exhentai.org Dataset
The dataset being studied is 100,000 of the most recent doujinshi uploaded to the website exhentai.org as of June 10th, 2017. Doujinshi are Japanese self-published works, often taking the form of fanfiction comics, although original works do exist. Exhentai.org is a website dedicated to archiving doujinshi, and it requires an account to access, as the vast majority of these works are 18+. This dataset includes only the original Japanese language doujinshi manga, and does not contain alternate language translations. To be more frank, this is a dataset of Japanese pornographic manga (comics). While exhentai.org does host doujin translations in a multitude of languages (English, Chinese, Russian, Spanish, etc.), I decided to thin this dataset to just the Japanese entries so that mixing languages would not skew the results.
on every page. Once the urls were com then ran a program that extracted the metadata for each of the galleries usin exhentai’s dedicated API. Each galler associated ID and gallery token, whic used in a JSON request that retrieves gallery’s metadata. This process is inc time intensive (taking several hours) despite following the site’s guideline limiting “25 entries per request, 4-5 sequential requests… before having t for ~5 seconds” (https://ehwiki.org/wiki/API), I was banned several times and needed to u multiple proxies to get around the ba all of the metadata eventually compil series of JSON files, I could then begin analysis.
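The quoted guideline (25 entries per request, a pause after every few sequential requests) can be sketched as a small batching helper. This is only an illustration under stated assumptions: `batches`, `fetch_all`, and the `fetch` callback are hypothetical names, the burst size of 4 and the 5-second pause are taken from the quoted guideline, and no real endpoint or request format is shown.

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple

Gallery = Tuple[int, str]  # (gallery ID, gallery token), as described in the text


def batches(items: Sequence[Gallery], size: int = 25) -> Iterator[List[Gallery]]:
    """Yield consecutive chunks of at most `size` galleries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_all(
    galleries: Sequence[Gallery],
    fetch: Callable[[List[Gallery]], list],  # hypothetical: performs one metadata request
    burst: int = 4,                          # sequential requests before pausing
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list:
    """Call `fetch` once per batch, pausing after every `burst` requests."""
    results = []
    for i, batch in enumerate(batches(galleries), start=1):
        results.extend(fetch(batch))
        if i % burst == 0:
            sleep(pause_seconds)
    return results
```

Passing `sleep` as a parameter keeps the helper testable without actually waiting.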
Table 1: Basic Statistics of the Dataset
Exhentai.org Dataset Statistics
Figure 1a: Example Metadata Information
female: ahegao tankoubon female: futanari male: males only female: double penetration penetration female: nakadashi male: sole male
An oddity visible in Table 2 is the dispari between male: yaoi (male gay content) an female: yuri (lesbian content). The latter doesn’t register among the top 20, instea 28th at 6.2% on the list – about 7% less fr than yaoi. This suggests that yaoi as a gen much less niche than yuri, and has wider This makes sense in context, considering amount of female consumers of doujinsh gravitate to yaoi content. This exists to su extent that the term “fujoshi” , a self-depr term for female fans of “boys love”, is com vocabulary in Western anime and manga communities.
Figure 1b: Equivalent on exhentai.org
Table 2: Top 20 Most Popular Tags Tag Descriptor
Frequency
female: big breasts
30%
female: lolicon
25%
group
21%
Another interesting note from this table i presence of “tankoubon” at 9% frequency the dataset. Tankoubon are paperback vo that often act as an omnibus for multiple artists to have their works published in. W outside the scope of my intended model, could make a network community of arti collaborating within such anthological w
Table 3: Top 20 Most Popular Series
mean that the art for the doujins are clea that the scanning techniques used are representing doujins better, Older series older doujins are prone to poor quality s the original pages, or simply poor quality begin with.
Tag Descriptor    Frequency
Touhou Project    7%
Kantai Collection    5%
Idolm@ster    3%
Mahou Shoujo Lyrical Nanoha    1%
Neon Genesis Evangelion    1%
Love Live    1%
Sailor Moon    .7%
Granblue Fantasy    .7%
Free    .7%
To Love-Ru    .6%
Pokemon    .6%
Puella Magi Madoka Magica    .57%
K-On    .55%
Shingeki no Kyojin    .54%
Touken Ranbu    .5%
Ore no Imouto ga Konna ni Kawaii Wake ga Nai    .49%
Street Fighter    .47%
Fate/stay Night    .47%
Sword Art Online    .46%
Kuroko no Basuke    .45%

There is an interesting thing to note here with regards to popularity. Each of these series came out at a different time, so while some have had

Table 4: Top 20 Most Popular Artists
Tag Descriptor    Frequency
Itaba Hiroshi
Erect Sawaru
Saigado
Nozarishi Satoru
Marui Maru
Manabe Jouji
Equal
Zen9
Takasugi Kou
Nagiyama
Inochi Wazuka
Crimson
Natsuka Q-ya
Nekogen
Ueda Yuu
Uchi-Uchi Keyaki
Kawamori Misaki
Ken
Yanagawa Rio
Koutarou
2. Predictive Task
With sexuality and personal taste differing from person to person, I’d like to find out if there are general factors that make a doujinshi highly rated. These could range from qualitative features such as the image size of each scanned page (e.g. HD vs. standard definition porn), to content-specific features like a gallery’s tags (i.e. the fetishes the work plays upon), to highly subjective features like the popularity of the series the doujinshi is based on. Essentially, my predictive task is to predict the rating of a doujinshi gallery based on its metadata.
1. Filesize-per-image of the gall the images in the image gallery fo doujinshi are of low resolution, it unlikely to be rated highly. Conve the images in the image gallery ar resolution (or possibly in color), t score might be higher. This value calculated using the ‘filecount’ an ‘filesize’ properties of the gallery metadata, functioning on the assu that all of the pages are roughly consistent in size within galleries divided by filecount should produ value).
This task can be performed using an SVM classifier as was applied to predicting a beer’s ABV in homework 1 and improved upon in homework 2. Apart from simply testing my model’s performance against the test set, I will be evaluating my model based on its test accuracy compared with other models, each varying in complexity. For instance, if my model is performing worse than a naïve classifier, it clearly needs improvement.
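The comparison against a naïve classifier described above can be sketched as follows. This is a minimal illustration, assuming the naïve classifier always predicts the most common training label; the function names are hypothetical and the labels in the usage note are made up.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(train_labels: Sequence[bool], n: int) -> List[bool]:
    """Naive classifier (assumed form): always predict the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n
```

For example, with training labels `[True, True, False]` and test labels `[True, False, True, True]`, the baseline predicts `True` everywhere and scores 0.75; a model is only worth keeping if it beats that figure on the same test set.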
2. Number of popular tags occu the doujin – tag frequencies acro entire dataset may indicate gener what people have a preference fo can be calculated similarly to the common unigrams used in Home The tags are included in the meta each gallery file, and can be comp with a list of “top tags” (sorted frequency of appearance in douji galleries).
I’m applying an SVM classifier here because I’m only really interested in what makes a doujinshi “good”, and so would only need to predict whether the doujinshi’s score lies above a certain threshold. In this case, I’ll be attempting to predict whether a doujin lies
This feature also naturally includ popularity of the artist. If the arti prolific, they’re probably doing so right. The can be found with artist tags of the metadata, and can be a by stripping the artist from the re
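The two features described above (filesize-per-image and the number of popular tags in a gallery) can be sketched as below. This is an illustration, not the report's actual code: the `filesize` and `filecount` keys come from the text, while the `tags` key, the function names, and the leading constant term in the vector are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def filesize_per_image(gallery: Dict) -> float:
    """Average size per page, assuming pages are roughly uniform in size."""
    # Values are converted defensively in case the metadata stores them as strings.
    return float(gallery["filesize"]) / int(gallery["filecount"])


def top_tags(galleries: Sequence[Dict], n: int = 100) -> List[str]:
    """The n most frequent tags across all galleries."""
    counts = Counter(tag for g in galleries for tag in g["tags"])  # "tags" key is assumed
    return [tag for tag, _ in counts.most_common(n)]


def popular_tag_count(gallery: Dict, popular: Sequence[str]) -> int:
    """How many of the gallery's tags appear in the popular list."""
    popular_set = set(popular)
    return sum(1 for tag in gallery["tags"] if tag in popular_set)


def feature_vector(gallery: Dict, popular: Sequence[str]) -> List[float]:
    """[constant term, filesize-per-image, number of popular tags]."""
    return [1.0, filesize_per_image(gallery), float(popular_tag_count(gallery, popular))]
```

The popular-tag list should be computed from the training set only, so that the validation and test sets do not influence the features.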
1. Popularity of the series – if the source material is popular, that might be an indication of quality (e.g. Game of Thrones porn is probably better than Marvel: Inhumans porn).

And this feature for the other:

1. Popularity of the tags – just going off of all of the tags combined, if the tags in the gallery are among the most frequent tags, they should be favored/desired. This does not consider the filesize-per-image and is used to gauge whether the filesize feature is actually helping.

3. Model
The model I chose was a classifier that runs logistic regression on the features listed above to predict whether a doujin has an above-average rating. I’m using this model because it was effective and relatively simple to implement. The data from the exhentai.org dataset suits this model because of its similarity to the beer review dataset we worked with, and the statistics found during my exploration seemed to indicate that the elements I’m using to predict ratings are effective (or at least differ enough in frequency to make an impact). I constructed the feature vector using the
every time a popular artist or tag was encountered.
My training-validation-test set sp assembled by first randomizing the d along with their scores, then cutting t into a 70-15-15 ratio. 70000 doujins used in the training set and 15000 do were used in the validation and test s randomizing the doujin list biases the because the doujins are ordered from least recent. Splitting them chronolog like this results in the classifier traini recent content and testing on old con model would overfit to the training se perform poorly on the test set.
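The shuffled 70-15-15 split can be sketched as follows. This is a minimal illustration: the function name and the fixed seed are assumptions added for reproducibility, and integer arithmetic is used so the split sizes do not depend on floating-point rounding.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    examples: Sequence,
    seed: int = 0,           # assumed: fixed seed so the split is reproducible
    train_pct: int = 70,
    valid_pct: int = 15,
) -> Tuple[List, List, List]:
    """Shuffle a copy of the data, then cut it into train/validation/test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

With 100,000 galleries this yields 70,000 training, 15,000 validation, and 15,000 test examples, and because the list is shuffled first, each part mixes recent and older uploads.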
Optimizing the model took some tinkering with the amount of “top tags/artists/series/etc.” I was consid Using only the top 10 or top 50 did no provide enough information to the cl so I ended up calculating the feature using the top 100 of each of the tag c I use the hyperparameter lambda = 1 there being a dropoff in effectiveness this point.
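To make the role of the regularization hyperparameter concrete, here is a small sketch of L2-regularized logistic regression trained by plain gradient descent. It is not the report's implementation: the learning rate, epoch count, and default `lam` are arbitrary assumptions, and for simplicity the constant term is regularized along with the other weights.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 1.0,            # regularization strength (assumed value)
    learning_rate: float = 0.1,  # assumed
    epochs: int = 500,           # assumed
) -> List[float]:
    """Minimize mean log loss + (lam / (2n)) * ||theta||^2 by gradient descent."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(epochs):
        grad = [lam * t / n for t in theta]  # gradient of the L2 penalty
        for x, label in zip(X, y):
            error = sigmoid(sum(t * v for t, v in zip(theta, x))) - label
            for j in range(d):
                grad[j] += error * x[j] / n  # gradient of the mean log loss
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


def predict(theta: Sequence[float], x: Sequence[float]) -> bool:
    """True if the predicted probability of an above-average rating is at least 0.5."""
    return sigmoid(sum(t * v for t, v in zip(theta, x))) >= 0.5
```

In practice one would train with several values of `lam` and several sizes of the "top" lists, then keep the combination with the best validation accuracy. Features on very different scales (such as raw file sizes in bytes) should be rescaled first so that gradient descent converges.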
My two other models, based on th popular series and most popular tags perform nearly as well. The series-on performed worse than the content (ta model, implying that a series’ popula
tagged accurately and that a high resolution image doesn’t contain flaws in image clarity. Another complication is that the model does not account for censoring in the gallery: regardless of image quality, a censored doujinshi is less preferable than an uncensored one.

The second model (series-only) is pretty naïve and falls flat. The popularity (frequency) of a series just means that more doujins exist for it, and these doujins can vary wildly in quality, especially for older series.

The third model (tag-only) served as a test for the first model and carries similar pros and cons. The lack of filesize-per-image means that the classifier simply has less to go on, and the accuracy suffered as a result. The numerical results and the conclusions drawn from them are included in section 5 (Results).

4. Pornographic (Academic) Literature
Academic data-driven research on porn isn’t a subject most scholars are willing to cover, and research on obscure Japanese pornographic comics even less so. In a more general sense, however, data science/analytics literature pertaining to porn does exist.
also factor into their analysis, delving what people look for, both sexually an literally, in their pornography.
For a more academic approach, Sexualitics is a dedicated collaboratio between scholars that “tries to contri human sexuality understanding throu data approach”. They release dataset papers to help promote more discuss what is otherwise a bit of a taboo sub of their studies from 2014 (http://sexualitics.org/wpcontent/uploads/2014/08/mazieres dies_2014.pdf ) come fairly close to th analysis this assignment focuses on, n frequency and exploration of tags of a site’s data (in their case, xHamster). T study focused on network connectivi categorizing tags into ethnic groups b which regions searched for which ter conceit was that different types of pe prefer different types of porn, and tha difference could be seen at a cultural Similar to the community-building al taught in this course, the Sexualitics s was able to cordon off certain fetishe could be used to describe different et regional porn preferences. This type research could improve the accuracy targeted marketing on various porn s possibly increase revenues for sites
anime or manga that the doujinshi is based on. For example, the series “Touhou Project”, identifiable among the doujins as those containing the tag “parody:touhou project”, contains a substantial amount of lesbian sex, and could be grouped in a community of other such series whose doujins produce similarly tagged content. Being porn, these doujinshi share a similarity to videos hosted on xHamster (which were used in the Sexualitics paper), and can thus be used in a similar context for analysis and study.

5. Results
Predictive Task: Given the metadata of the gallery, predict whether the gallery rates above or below the average rating (~4.2/5.0).
Model 1 Performance (Filesize-per-image and Top 100 Popular Tags)
Set    Accuracy
Training    0.737
Validation    0.738
Test    0.739
Model 2 Performance (Top 100 Series) Set Training Validation
Accuracy 0.684
My proposed model (Model 1) seems come out on top over the two naïve class it certainly seems to be doing something The significance of these results at the ve means that my model can predict whethe doujin rates above or below the average w 74% accuracy.
Model 1’s performance over Model 3 that my filesize-per-image feature indeed the predictions more accurate, specificall around 4%. The performance of Model 3 impressive though, as tags alone can pred score with 71% accuracy. If anything tha the effectiveness of tags on people’s enjo and rating of the doujinshi. Model 2 perfo the worst, possibly due to the reasons me in section 3 previously.
As a whole, Model 1 performed well b its features matched the best with why so would rate a doujinshi highly. People’s po preferences can be highly specific, and if doujin’s tags “hit the spot”, so to speak, th doujin is likely to do well. On a general le people want higher definition media, and higher the resolution of the images, the b doujin can be enjoyed. Adding on specific for artists or series diluted the accuracy o predictor during testing.
These results suggest that (possibly unsurprisingly) what matters most in a