CSE158 Assignment 2 Report




CSE158 Assignment 2 Report Uploaded by Brian Nguyen






Report for school project.












Brian Lu Nguyen A12940672 [email protected]

1. The Exhentai.org Dataset

The dataset being studied is 100,000 of the most recent doujinshi uploaded to the website exhentai.org as of June 10th, 2017. Doujinshi are Japanese self-published works, often taking the form of fanfiction comics, although original works do exist. Exhentai.org is a website dedicated to archiving doujinshi, and it requires an account to access, as the vast majority of these works are 18+. This dataset includes only the original Japanese-language doujinshi manga and does not contain alternate-language translations. To be more frank, this is a dataset of Japanese pornographic manga (comics). While exhentai.org does host doujin translations in a multitude of languages (English, Chinese, Russian, Spanish, etc.), I decided to thin this dataset to just the Japanese entries to remove any weird results that may come from involving all languages.

on every page. Once the urls were com then ran a program that extracted the metadata for each of the galleries usin exhentai’s dedicated API. Each galler associated ID and gallery token, whic used in a JSON request that retrieves gallery’s metadata. This process is inc time intensive (taking several hours) despite following the site’s guideline limiting “25 entries per request, 4-5 sequential requests… before having t for ~5 seconds” (https://ehwiki.org/wiki/API), I was banned several times and needed to u multiple proxies to get around the ba all of the metadata eventually compil series of JSON files, I could then begin analysis.
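The request guideline quoted above (25 entries per request, a pause after every few sequential requests) can be sketched as a batching plan. This is only an illustration and performs no network I/O: the constants are read off the quoted guideline, while the helper names and the JSON field names (`method`, `gidlist`) are assumptions made for the example, not something this report confirms.

```python
import json
from typing import Iterator, List, Tuple

Gallery = Tuple[int, str]  # (gallery ID, gallery token), as described in the text

BATCH_SIZE = 25          # "25 entries per request" from the quoted guideline
REQUESTS_PER_BURST = 4   # lower end of "4-5 sequential requests"
PAUSE_SECONDS = 5.0      # "~5 seconds" pause between bursts


def batches(galleries: List[Gallery], size: int = BATCH_SIZE) -> Iterator[List[Gallery]]:
    """Yield consecutive slices of at most `size` galleries."""
    for start in range(0, len(galleries), size):
        yield galleries[start:start + size]


def build_payload(batch: List[Gallery]) -> str:
    """Serialize one batch as a JSON request body (field names are assumed)."""
    return json.dumps({"method": "gdata", "gidlist": [list(g) for g in batch]})


def plan_requests(galleries: List[Gallery]) -> List[Tuple[str, float]]:
    """Return (payload, pause_after) pairs; a pause follows every full burst."""
    plan = []
    for i, batch in enumerate(batches(galleries), start=1):
        pause = PAUSE_SECONDS if i % REQUESTS_PER_BURST == 0 else 0.0
        plan.append((build_payload(batch), pause))
    return plan
```

A caller would send each payload in turn and sleep for the returned pause before continuing, which keeps the pacing logic separate from whatever HTTP client is used.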



Table 1: Basic Statistics of the Dataset

Exhentai.org Dataset Statistics






Figure 1a: Example Metadata Information




female: ahegao tankoubon female: futanari male: males only female: double penetration penetration female: nakadashi male: sole male

An oddity visible in Table 2 is the dispari between male: yaoi (male gay content) an female: yuri (lesbian content). The latter doesn’t register among the top 20, instea 28th at 6.2% on the list – about 7% less fr than yaoi. This suggests that yaoi as a gen much less niche than yuri, and has wider This makes sense in context, considering amount of female consumers of doujinsh gravitate to yaoi content. This exists to su extent that the term “fujoshi” , a self-depr term for female fans of “boys love”, is com vocabulary in Western anime and manga communities.

Figure 1b: Equivalent on exhentai.org

Table 2: Top 20 Most Popular Tags Tag Descriptor

Frequency

female: big breasts

30%

female: lolicon

25%

group

21%

Another interesting note from this table i presence of “tankoubon” at 9% frequency the dataset. Tankoubon are paperback vo that often act as an omnibus for multiple artists to have their works published in. W outside the scope of my intended model, could make a network community of arti collaborating within such anthological w






Table 3: Top 20 Most Popular Series

Tag Descriptor                                  Frequency
Touhou Project                                  7%
Kantai Collection                               5%
Idolm@ster                                      3%
Mahou Shoujo Lyrical Nanoha                     1%
Neon Genesis Evangelion                         1%
Love Live                                       1%
Sailor Moon                                     0.7%
Granblue Fantasy                                0.7%
Free                                            0.7%
To Love-Ru                                      0.6%
Pokemon                                         0.6%
Puella Magi Madoka Magica                       0.57%
K-On                                            0.55%
Shingeki no Kyojin                              0.54%
Touken Ranbu                                    0.5%
Ore no Imouto ga Konna ni Kawaii Wake ga Nai    0.49%
Street Fighter                                  0.47%
Fate/stay Night                                 0.47%
Sword Art Online                                0.46%
Kuroko no Basuke                                0.45%

There is an interesting thing to note here with regards to popularity. Each of these series came out at a different time, so while some have had

mean that the art for the doujins are clea that the scanning techniques used are representing doujins better, Older series older doujins are prone to poor quality s the original pages, or simply poor quality begin with.

Table 4: Top 20 Most Popular Artists

Tag Descriptor                                  Frequency
Inochi Wazuka
Crimson
Natsuka Q-ya
Nekogen
Ueda Yuu
Uchi-Uchi Keyaki
Kawamori Misaki
Ken
Yanagawa Rio
Koutarou
Itaba Hiroshi
Erect Sawaru
Saigado
Nozarishi Satoru
Marui Maru
Manabe Jouji
Equal
Zen9
Takasugi Kou
Nagiyama








2. Predictive Task

With sexuality and personal taste differing from person to person, I’d like to find out if there are general factors that make a doujinshi highly rated. These could range from qualitative features such as the image size of each scanned page (e.g. HD vs. standard definition porn), to content-specific features like a gallery’s tags (i.e. the fetishes the work plays upon), to highly subjective features like the popularity of the series the doujinshi is based on. Essentially, my predictive task is to predict the rating of a doujinshi gallery based on its metadata.




1. Filesize-per-image of the gall the images in the image gallery fo doujinshi are of low resolution, it unlikely to be rated highly. Conve the images in the image gallery ar resolution (or possibly in color), t score might be higher. This value calculated using the ‘filecount’ an ‘filesize’ properties of the gallery metadata, functioning on the assu that all of the pages are roughly consistent in size within galleries divided by filecount should produ value).

This task can be performed using an SVM classifier as was applied to predicting a beer’s ABV in homework 1 and improved upon in homework 2. Apart from simply testing my model’s performance against the test set, I will be evaluating my model based on its test accuracy compared with other models, each varying in complexity. For instance, if my model is performing worse than a naïve classifier, it clearly needs improvement.
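The comparison against a naïve classifier mentioned above can be sketched as a majority-class baseline: it always predicts whichever label is more common in the training set. The function names and the toy ratings are hypothetical, and the threshold is left as a parameter because it is chosen elsewhere in this report.

```python
from typing import List


def label_above_threshold(ratings: List[float], threshold: float) -> List[bool]:
    """Turn numeric ratings into binary labels: True if strictly above the threshold."""
    return [r > threshold for r in ratings]


def majority_baseline_accuracy(train_labels: List[bool], test_labels: List[bool]) -> float:
    """Accuracy of always predicting the most common training label."""
    majority = sum(train_labels) * 2 >= len(train_labels)
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)


# Toy ratings, invented for illustration only.
train_ratings = [4.5, 4.6, 3.9, 4.4, 4.8]
test_ratings = [4.7, 4.0, 4.5, 3.5]
threshold = sum(train_ratings) / len(train_ratings)

baseline = majority_baseline_accuracy(
    label_above_threshold(train_ratings, threshold),
    label_above_threshold(test_ratings, threshold),
)
```

Any model worth keeping should beat `baseline` on the same test set; if it does not, the features are adding nothing over the label distribution.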

2. Number of popular tags occu the doujin – tag frequencies acro entire dataset may indicate gener what people have a preference fo can be calculated similarly to the common unigrams used in Home The tags are included in the meta each gallery file, and can be comp with a list of “top tags” (sorted frequency of appearance in douji galleries).

I’m applying an SVM classifier here because I’m only really interested in what makes a doujinshi “good”, and so would only need to predict whether the doujinshi’s score lies above a certain threshold. In this case, I’ll be attempting to predict whether a doujin lies

This feature also naturally includ popularity of the artist. If the arti prolific, they’re probably doing so right. The artist can be found with tags of the metadata, and can be a by stripping the artist from the re
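The two features described above can be sketched as follows, assuming each gallery's metadata has been loaded into a dictionary. The `filesize` and `filecount` keys are the properties named in the text; the `tags` key, the helper names, and the leading constant term are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Set


def top_tags(galleries: List[Dict], k: int) -> Set[str]:
    """Return the k tags that appear in the most galleries."""
    counts = Counter(tag for g in galleries for tag in set(g["tags"]))
    return {tag for tag, _ in counts.most_common(k)}


def features(gallery: Dict, popular: Set[str]) -> List[float]:
    """Build [bias, filesize-per-image, number of popular tags] for one gallery."""
    # Guard against a zero page count so the division is always defined.
    pages = max(int(gallery["filecount"]), 1)
    size_per_image = float(gallery["filesize"]) / pages
    n_popular = sum(1 for tag in gallery["tags"] if tag in popular)
    return [1.0, size_per_image, float(n_popular)]
```

Counting each tag once per gallery (via `set`) keeps a gallery that repeats a tag from inflating that tag's frequency.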






1. Popularity of the series – if the source material is popular, that might be an indication of quality (i.e. Game of Thrones porn is probably better than Marvel: Inhumans porn).

And this feature for the other:

1. Popularity of the tags – just going off of all of the tags combined, if the tags in the gallery are part of the most frequent tags, they should be favored/desired. This does not consider the filesize-per-image and is used to gauge whether the filesize feature is actually helping.

3. Model

The model I chose was a classifier that runs logistic regression on the features listed above to predict whether a doujin has an above-average rating. I’m using this model because it was effective and relatively simple to implement. The data from the exhentai.org dataset makes this model effective, due to its similarity to the beer review dataset we worked with, and the statistics found during my exploration seemed to indicate that the elements I’m using to predict ratings are effective (or at least have significant enough differences in frequency so as to make an impact). I constructed the feature vector using the




every time a popular artist or tag was encountered.

My training-validation-test set sp assembled by first randomizing the d along with their scores, then cutting t into a 70-15-15 ratio. 70000 doujins used in the training set and 15000 do were used in the validation and test s randomizing the doujin list biases the because the doujins are ordered from least recent. Splitting them chronolog like this results in the classifier traini recent content and testing on old con model would overfit to the training se perform poorly on the test set.
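A shuffled 70-15-15 split like the one described above can be sketched as below. The function name, the fixed seed, and the use of integer percentages are choices made for this example rather than details taken from the report.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_split(
    items: Sequence,
    percents: Tuple[int, int, int] = (70, 15, 15),
    seed: int = 0,
) -> Tuple[List, List, List]:
    """Shuffle a copy of `items`, then cut it into train/validation/test parts."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * percents[0] // 100
    n_valid = n * percents[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Each item would be a (features, score) pair so that galleries stay attached to their ratings through the shuffle. With 100,000 galleries this yields 70,000, 15,000 and 15,000 items respectively.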

Optimizing the model took some tinkering with the amount of “top tags/artists/series/etc.” I was consid Using only the top 10 or top 50 did no provide enough information to the cl so I ended up calculating the feature using the top 100 of each of the tag c I use the hyperparameter lambda = 1 there being a dropoff in effectiveness this point.
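For reference, a regularized logistic regression classifier of the kind described in this section can be sketched in plain Python. This is not the code used for the reported results: the batch gradient descent optimizer, the learning rate, and the epoch count are assumptions, and the regularization strength `lam` is left as a required argument rather than fixed to a particular value.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X: List[List[float]], y: List[int], lam: float,
                   lr: float = 0.5, epochs: int = 2000) -> List[float]:
    """Fit weights by batch gradient descent on log loss plus (lam/2)*||theta||^2."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(epochs):
        grad = [lam * t for t in theta]  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            err = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for j in range(d):
                grad[j] += err * xi[j]
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta


def predict(theta: List[float], x: List[float]) -> int:
    """Predict 1 (above threshold) when the estimated probability is at least 0.5."""
    return int(sigmoid(sum(t * v for t, v in zip(theta, x))) >= 0.5)


def accuracy(theta: List[float], X: List[List[float]], y: List[int]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(theta, xi) == yi for xi, yi in zip(X, y)) / len(y)
```

Tuning would then amount to training once per candidate `lam` (and per "top N" cutoff) and keeping whichever setting scores best on the validation set. In practice the features should be scaled to comparable ranges first, since raw file sizes are far larger than tag counts.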

My two other models, based on th popular series and most popular tags perform nearly as well. The series-on performed worse than the content (ta model, implying that a series’ popula






tagged accurately and that a high resolution image doesn’t contain flaws in image clarity. Another complication is the presence or lack of consideration for censoring in the gallery, as regardless of image quality a censored doujinshi is not as preferable as an uncensored one.

The second model (series-only) is pretty naïve and falls flat. The popularity (frequency) of a series just means that more doujins exist for it, and these doujins can vary wildly in quality, especially for older series.

The third model (tag-only) served as a test for the first model and carries similar pros and cons. The lack of filesize-per-image means that the classifier simply has less to go on, and the accuracy suffered as a result. The numerical results and the conclusions drawn from them are included in section 5 (Results).

4. Pornographic (Academic) Literature

Academic data-driven research on porn isn’t a subject most scholars are willing to cover, and research on obscure Japanese pornographic comics even less so. In a more general sense, however, data science/analytics literature pertaining to porn does exist.




also factor into their analysis, delving what people look for, both sexually an literally, in their pornography.

For a more academic approach, Sexualitics is a dedicated collaboratio between scholars that “tries to contri human sexuality understanding throu data approach”. They release dataset papers to help promote more discuss what is otherwise a bit of a taboo sub of their studies from 2014 (http://sexualitics.org/wpcontent/uploads/2014/08/mazieres dies_2014.pdf ) come fairly close to th analysis this assignment focuses on, n frequency and exploration of tags of a site’s data (in their case, xHamster). T study focused on network connectivi categorizing tags into ethnic groups b which regions searched for which ter conceit was that different types of pe prefer different types of porn, and tha difference could be seen at a cultural Similar to the community-building al taught in this course, the Sexualitics s was able to cordon off certain fetishe could be used to describe different et regional porn preferences. This type research could improve the accuracy targeted marketing on various porn s possibly increase revenues for sites






anime or manga that the doujinshi is based on. For example, the series “Touhou Project”, identifiable among the doujins as those containing the tag “parody:touhou project”, contains a substantial amount of lesbian sex, and could be grouped in a community of other such series whose doujins produce similarly tagged content. Being porn, these doujinshi share a similarity to videos hosted on xHamster (which were used in the Sexualitics paper), and can thus be used in a similar context for analysis and study.

5. Results

Predictive Task: Given the metadata of the gallery, predict whether the gallery rates above or below the average rating (~4.2/5.0).

Model 1 Performance (Filesize-per-image and Top 100 Popular Tags)

Set           Accuracy
Training      0.737
Validation    0.738
Test          0.739

Model 2 Performance (Top 100 Series) Set Training Validation

Accuracy 0.684




My proposed model (Model 1) seems come out on top over the two naïve class it certainly seems to be doing something The significance of these results at the ve means that my model can predict whethe doujin rates above or below the average w 74% accuracy.

Model 1’s performance over Model 3 that my filesize-per-image feature indeed the predictions more accurate, specificall around 4%. The performance of Model 3 impressive though, as tags alone can pred score with 71% accuracy. If anything tha the effectiveness of tags on people ’s enjo and rating of the doujinshi. Model 2 perfo the worst, possibly due to the reasons me in section 3 previously.

As a whole, Model 1 performed well b its features matched the best with why so would rate a doujinshi highly. People’s po preferences can be highly specific, and if doujin’s tags “hit the spot”, so to speak, th doujin is likely to do well. On a general le people want higher definition media, and higher the resolution of the images, the b doujin can be enjoyed. Adding on specific for artists or series diluted the accuracy o predictor during testing.

These results suggest that (possibly unsurprisingly) what matters most in a




