Chapter I
Exercise 1.1 Draw the inverted index that would be built for the following document collection:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise forecast

forecast → Doc 1, Doc 4
home → Doc 1, Doc 2, Doc 3, Doc 4
in → Doc 2, Doc 3
increase → Doc 3
july → Doc 2, Doc 3, Doc 4
new → Doc 1, Doc 4
rise → Doc 2, Doc 4
sale → Doc 1, Doc 2, Doc 3, Doc 4
top → Doc 1
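The index for Exercise 1.1 can be checked with a short script. This is only a sketch: the `normalize` helper is a hypothetical stand-in for real normalization (it lowercases and strips one trailing "s", which happens to be enough for this collection), and the names `DOCS` and `build_inverted_index` are illustrative.

```python
from collections import defaultdict

# The four documents from Exercise 1.1.
DOCS = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise forecast",
}


def normalize(token):
    # Hypothetical normalizer: lowercase and strip one trailing "s".
    # Sufficient for this collection, but not a real stemmer.
    token = token.lower()
    return token[:-1] if token.endswith("s") else token


def build_inverted_index(docs):
    # Map each normalized term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by docID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


index = build_inverted_index(DOCS)
for term, postings in index.items():
    print(term, "->", postings)
```

Sorting both the terms and the postings mirrors how the index is normally drawn: a sorted dictionary on the left, docID-ordered postings on the right.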
Exercise 1.2 Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Question:
a) Draw the term-document incidence matrix for this document collection.
b) Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Answer:
a) Term-document incidence matrix

Term            Doc 1   Doc 2   Doc 3   Doc 4
approach        0       0       1       0
breakthrough    1       0       0       0
drug            1       1       0       0
hope            0       0       0       1
new             0       1       1       1
patient         0       0       0       1
schizophrenia   1       1       1       1
treatment       0       0       1       0
b) Inverted index

approach → 3
breakthrough → 1
drug → 1, 2
hope → 4
new → 2, 3, 4
patient → 4
schizophrenia → 1, 2, 3, 4
treatment → 3
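The incidence matrix from part a) can also be produced programmatically. The sketch below assumes that "for" and "of" are treated as stop words (they do not appear in the answer) and reuses a hypothetical `normalize` helper that only strips a trailing "s"; neither assumption comes from the exercise itself.

```python
# The four documents from Exercise 1.2.
DOCS = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}

# Assumption: these two words are dropped, as in the answer's matrix.
STOP_WORDS = {"for", "of"}


def normalize(token):
    # Hypothetical normalizer: lowercase and strip one trailing "s".
    token = token.lower()
    return token[:-1] if token.endswith("s") else token


def incidence_matrix(docs, stop_words=frozenset()):
    doc_ids = sorted(docs)
    # Set of normalized, non-stop-word terms per document.
    terms_in = {
        d: {normalize(t) for t in docs[d].split()} - set(stop_words)
        for d in doc_ids
    }
    vocabulary = sorted(set().union(*terms_in.values()))
    # One row per term, one 0/1 column per document.
    return {t: [int(t in terms_in[d]) for d in doc_ids] for t in vocabulary}


matrix = incidence_matrix(DOCS, STOP_WORDS)
for term, row in matrix.items():
    print(f"{term:<14}", *row)
```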
Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a) schizophrenia AND drug
b) for AND NOT (drug OR approach)
Answer:
a) Doc 1, Doc 2
b) Doc 4
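Both queries can be verified with plain set operations. This is a minimal sketch, not how a retrieval system would store postings; the `postings` helper is hypothetical and simply scans the raw text of each document.

```python
# The four documents from Exercise 1.2.
DOCS = {
    1: "breakthrough drug for schizophrenia",
    2: "new schizophrenia drug",
    3: "new approach for treatment of schizophrenia",
    4: "new hopes for schizophrenia patients",
}


def postings(term):
    # Set of docIDs whose text contains the exact token `term`.
    return {d for d, text in DOCS.items() if term in text.split()}


# a) schizophrenia AND drug: set intersection.
result_a = postings("schizophrenia") & postings("drug")

# b) for AND NOT (drug OR approach): union, then set difference.
result_b = postings("for") - (postings("drug") | postings("approach"))

print(sorted(result_a))  # [1, 2]
print(sorted(result_b))  # [4]
```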
Exercise 1.4 For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
a) Brutus AND NOT Caesar
b) Brutus OR NOT Caesar
Answer:
a) Yes. A single pass over both lists still takes O(x + y): every posting of Brutus that also appears in the Caesar list is skipped and all others are kept. In other words, the result is never longer than the postings list of Brutus.
b) No. We cannot evaluate Brutus OR NOT Caesar in O(x + y), because the length of NOT Caesar does not depend on x or y alone: it contains every document that is absent from the Caesar list. The best we can achieve is O(x + N − y), where N is the number of documents in the collection and N − y is the number of documents not in the Caesar list. In the worst case this is O(N).
Exercise 1.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:
a) (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time?
Exercise 1.6 We can use distributive laws for AND and OR to rewrite queries.
a) Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws.
b) Would the resulting query be more or less efficiently evaluated than the original form of this query?
c) Is this result true in general or does it depend on the words and the contents of the document collection?
Answer:
a) By De Morgan's law, ¬(Antony ∨ Cleopatra) ≡ ¬Antony ∧ ¬Cleopatra. Distributing the conjunction over Brutus ∨ Caesar then gives the disjunctive normal form:
(Brutus ∨ Caesar) ∧ ¬(Antony ∨ Cleopatra) ≡ (Brutus ∧ ¬Antony ∧ ¬Cleopatra) ∨ (Caesar ∧ ¬Antony ∧ ¬Cleopatra)
b) It would be evaluated less efficiently, because each negation has to be produced before it can be combined, and in the disjunctive normal form the negated terms appear in both conjuncts.
c) It is true in general.
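An equivalence between two Boolean formulas over four terms can be checked exhaustively, since there are only 16 truth assignments. The sketch below compares the query from Exercise 1.5 with one candidate disjunctive normal form obtained via De Morgan's law and distribution; the function names are illustrative.

```python
from itertools import product


def original(brutus, caesar, antony, cleopatra):
    # (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
    return (brutus or caesar) and not (antony or cleopatra)


def dnf(brutus, caesar, antony, cleopatra):
    # (Brutus AND NOT Antony AND NOT Cleopatra)
    #   OR (Caesar AND NOT Antony AND NOT Cleopatra)
    return (brutus and not antony and not cleopatra) or (
        caesar and not antony and not cleopatra
    )


# Compare the two formulas on all 2**4 truth assignments.
equivalent = all(
    original(*values) == dnf(*values)
    for values in product([False, True], repeat=4)
)
print(equivalent)  # True
```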
Exercise 1.7 Recommend a query processing order for
a) (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:

Term            Postings size
eyes            213312
kaleidoscope    87009
marmalade       107913
skies           271658
tangerine       46653
trees           316812

Answer: Estimate the size of each disjunction by the sum of its postings list sizes:
(tangerine OR trees) = 46653 + 316812 = 363465
(marmalade OR skies) = 107913 + 271658 = 379571
(kaleidoscope OR eyes) = 87009 + 213312 = 300321
Processing the conjuncts in increasing order of estimated size gives:
(kaleidoscope OR eyes) AND (tangerine OR trees) AND (marmalade OR skies)
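The size estimates for Exercise 1.7 are simple sums, so they are easy to recompute. The sketch below uses the conservative estimate that an OR is at most as long as the sum of its operands, then sorts the conjuncts by that estimate; `SIZES`, `QUERY` and `estimate` are illustrative names.

```python
# Postings list sizes from the table in Exercise 1.7.
SIZES = {
    "eyes": 213312,
    "kaleidoscope": 87009,
    "marmalade": 107913,
    "skies": 271658,
    "tangerine": 46653,
    "trees": 316812,
}

# The query as a conjunction of two-term disjunctions.
QUERY = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]


def estimate(disjunction):
    # Upper bound on the size of an OR: the sum of its postings sizes.
    return sum(SIZES[term] for term in disjunction)


# Process the smallest estimated intermediate result first.
order = sorted(QUERY, key=estimate)
for disjunction in order:
    print(" OR ".join(disjunction), estimate(disjunction))
```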
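Exercise 1.8 If the query is:
b) friends AND romans AND (NOT countrymen)
How could we use the frequency of countrymen in evaluating the best evaluation order? In particular, propose a way of handling negation in determining the order of query processing.
Answer: The result of this query is every document in friends AND romans that does not appear in the postings list of countrymen. Its length is therefore smaller than or equal to the length of friends AND romans. For ordering purposes, the size of NOT countrymen can be taken as N minus the frequency of countrymen, where N is the number of documents in the collection; since this is usually large, the negated term is best applied last, as an AND NOT filter on the intermediate result.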
Exercise 1.9 For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't.
Answer: Processing a conjunctive query in order of size is not guaranteed to be optimal. By associativity, (a AND b) AND c is equivalent to a AND (b AND c), so every order produces exactly the same result, but the cost of each order depends on the sizes of the intermediate results, and these are not determined by the list lengths alone. For example, if a and b are the two shortest lists and overlap completely while a and c share no documents, intersecting a with c first leaves an empty intermediate result and no further work, whereas the size order intersects a with b first and must still merge the result against c.
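A small counterexample makes the point for Exercise 1.9 concrete. The sketch below counts loop iterations of a standard merge-style intersection as a rough cost measure (an assumption; other cost models are possible) and compares two processing orders on three hand-picked lists.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists; return (result, loop iterations)."""
    i = j = steps = 0
    result = []
    while i < len(p1) and j < len(p2):
        steps += 1
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result, steps


def cost(order):
    """Intersect the lists left to right; return (result, total iterations)."""
    result, total = order[0], 0
    for plist in order[1:]:
        result, steps = intersect(result, plist)
        total += steps
    return result, total


# Hand-picked lists: a and b overlap fully, a and c are disjoint.
a = [10, 20]
b = [10, 20, 30]
c = [1, 2, 3, 4]

print(cost([a, b, c]))  # size order: ([], 6)
print(cost([a, c, b]))  # alternative order: ([], 4)
```

Both orders return the same empty result, but the order that follows list size does more work here because its first intersection does not shrink the intermediate result.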
Exercise 1.10 Write out a postings merge algorithm, in the style of Figure 1.6 (page 11), for an x OR y query.
Answer:
    px = x
    py = y
    z = empty list
    while px != null and py != null:
        if px.docId == py.docId:
            append px.docId to z
            px = px.next
            py = py.next
        else if px.docId < py.docId:
            append px.docId to z
            px = px.next
        else:
            append py.docId to z
            py = py.next
    while px != null:
        append px.docId to z
        px = px.next
    while py != null:
        append py.docId to z
        py = py.next
    return z
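The same union merge can be written as runnable code. This is a sketch that assumes postings are plain Python lists of docIDs sorted in increasing order, rather than the linked lists used in the pseudocode; the function name `union` is illustrative.

```python
def union(p1, p2):
    """Merge two sorted postings lists for an `x OR y` query."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: emit it once.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # At most one of the lists still has postings left; copy them over.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


print(union([1, 2, 4], [2, 3, 5, 8]))  # [1, 2, 3, 4, 5, 8]
```

Each posting is visited once, so the merge runs in O(x + y) for lists of length x and y.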
Exercise 1.11 How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
Answer: Keep every posting of x that does not appear in y. Naive evaluation is expensive because it first builds NOT y, which contains almost every document in the collection, and only then intersects it with x. A single merge over the two lists avoids this and runs in O(x + y). The algorithm may look like this:
    px = x
    py = y
    z = empty list
    while px != null:
        if py == null or px.docId < py.docId:
            append px.docId to z
            px = px.next
        else if px.docId == py.docId:
            px = px.next
            py = py.next
        else:
            py = py.next
    return z
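A runnable version of the AND NOT merge is sketched below, again assuming postings are sorted Python lists of docIDs. Note the separate branch for equal docIDs: advancing only the second pointer in that case would let a matched document slip into the output on the next iteration.

```python
def and_not(p1, p2):
    """Merge two sorted postings lists for an `x AND NOT y` query."""
    i = j = 0
    answer = []
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            # p1[i] cannot occur in the rest of p2, so keep it.
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            # Document is in both lists: exclude it.
            i += 1
            j += 1
        else:
            # p2 is behind; catch up.
            j += 1
    return answer


print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))  # [1, 4, 11]
```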
Exercise 1.12 Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.
Answer: (professor! Teacher!
Exercise 1.13 Try using Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar.
Exercise 2.1 Are the following statements true or false?
a) In a Boolean retrieval system, stemming never lowers precision. (False)
b) In a Boolean retrieval system, stemming never lowers recall. (True)
c) Stemming increases the size of the vocabulary. (False)
d) Stemming should be invoked at indexing time but not while processing a query. (False)
Exercise 2.2 Suggest what normalized form should be used for these words (including the word itself as a possibility):
a) 'Cos, cos, because
b) Shi'ite, shiite
c) cont'd, count down, counting down
d) Hawai'i, hawaii
e) O'Rourke, o rourke, rourke
Exercise 2.3 The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn't be conflated? Give your reasoning:
a) abandon / abandonment. This conflation is good, because abandonment is clearly the noun form of the verb abandon.
b) absorbency / absorbent. This conflation looks fine.
c) marketing / markets. This conflation does not look appropriate, because markets refers to places while marketing refers to a job function.
d) university / universe. This conflation is clearly inappropriate, because universe has a general meaning while university refers to a place.
e) volume / volumes. This conflation looks really good.
Exercise 2.4 For the Porter stemmer rule group shown in (2.1):
a) What is the purpose of including an identity rule such as SS → SS?
It protects words that end in ss. The rule with the longest matching suffix is the one applied, so SS → SS fires before the shorter rule that strips a final S, and a word such as boss keeps its ending.
b) Applying just this rule group, what will the following words be stemmed to?
circus canaries boss
circu canari boss
c) What rule should be added to correctly stem pony?
IES → Y
d) The stemming for ponies and pony might seem strange. Does it have a deleterious effect on retrieval? Why or why not?
No, it does not have a strongly deleterious effect on retrieval. Both ponies and pony are stored in the index as poni, and the query term is stemmed in the same way at query time, so it always matches poni.
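The behaviour of the rule group can be tried out directly. The sketch below assumes that the group in (2.1) consists of the four rules SSES → SS, IES → I, SS → SS and S → (empty), and that the rule with the longest matching suffix wins; the names `RULES` and `apply_rule_group` are illustrative and not part of any stemming library.

```python
# Assumed contents of rule group (2.1): (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_rule_group(word, rules=RULES):
    # Collect every rule whose suffix matches the end of the word.
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word
    # Apply only the rule with the longest matching suffix.
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for word in ["circus", "canaries", "boss", "ponies", "pony"]:
    print(word, "->", apply_rule_group(word))

# Without the identity rule, the final "s" of "boss" would be stripped.
without_identity = [rule for rule in RULES if rule[0] != "ss"]
print(apply_rule_group("boss", without_identity))  # bos
```

Dropping the identity rule in the last call shows its purpose: without it, the S rule is the longest match for boss and removes the final letter.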