Introduction To Information Retrieval Answer

Chapter I

Exercise 1.1 Draw the inverted index that would be built for the following document collection:
Doc 1 new home sales top forecasts
Doc 2 home sales rise in july
Doc 3 increase in home sales in july
Doc 4 july new home sales rise forecast

Answer (terms normalized to singular form):
forecast    Doc 1, Doc 4
home        Doc 1, Doc 2, Doc 3, Doc 4
in          Doc 2, Doc 3
increase    Doc 3
july        Doc 2, Doc 3, Doc 4
new         Doc 1, Doc 4
rise        Doc 2, Doc 4
sale        Doc 1, Doc 2, Doc 3, Doc 4
top         Doc 1
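
The index above can be reproduced with a few lines of code. The following is a minimal sketch, not the textbook's implementation: the `normalize` helper is a hypothetical, deliberately crude normalizer (lowercase, strip one trailing "s") that happens to be enough for this small collection.

```python
from collections import defaultdict

# The four documents from Exercise 1.1.
DOCS = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise forecast",
}

def normalize(token):
    """Hypothetical, crude normalizer: lowercase and strip one trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") else token

def build_inverted_index(docs):
    """Map each normalized term to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            postings[normalize(token)].add(doc_id)
    # Sort the dictionary by term and each postings list by doc ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(DOCS)
for term, ids in index.items():
    print(f"{term:10s} {ids}")
```

A real system would replace `normalize` with a proper tokenizer and stemmer; the dictionary-plus-sorted-postings shape is the part that carries over.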

Exercise 1.2 Consider these documents:
Doc 1 breakthrough drug for schizophrenia
Doc 2 new schizophrenia drug
Doc 3 new approach for treatment of schizophrenia
Doc 4 new hopes for schizophrenia patients
Question:
a) Draw the term-document incidence matrix for this document collection.
b) Draw the inverted index representation for this collection, as in Figure 1.3 (page 7).
Answer:
a) Term-document incidence matrix

Term/Document    Doc 1   Doc 2   Doc 3   Doc 4
approach           0       0       1       0
breakthrough       1       0       0       0
drug               1       1       0       0
hope               0       0       0       1
new                0       1       1       1
patient            0       0       0       1
schizophrenia      1       1       1       1
treatment          0       0       1       0

The function words for and of are left out of the matrix, and hopes and patients are listed in their singular form.

b) Inverted index

approach         3
breakthrough     1
drug             1, 2
hope             4
new              2, 3, 4
patient          4
schizophrenia    1, 2, 3, 4
treatment        3

Exercise 1.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries:
a) schizophrenia AND drug
b) for AND NOT (drug OR approach)
Answer:
a) Doc 1, Doc 2
b) Doc 4
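
Both results can be checked mechanically by treating each postings list as a set of doc IDs. This is only an illustrative sketch: the `POSTINGS` table is transcribed by hand from the four documents of Exercise 1.2 (unnormalized, and including "for" and "of"), and the helper `p` is a hypothetical name.

```python
# Doc IDs per term for the Exercise 1.2 collection, transcribed by hand.
POSTINGS = {
    "approach": {3},
    "breakthrough": {1},
    "drug": {1, 2},
    "for": {1, 3, 4},
    "hopes": {4},
    "new": {2, 3, 4},
    "of": {3},
    "patients": {4},
    "schizophrenia": {1, 2, 3, 4},
    "treatment": {3},
}

def p(term):
    """Return the postings set for a term (empty if the term is unknown)."""
    return POSTINGS.get(term, set())

# a) schizophrenia AND drug: set intersection.
query_a = p("schizophrenia") & p("drug")

# b) for AND NOT (drug OR approach): set difference against the union.
query_b = p("for") - (p("drug") | p("approach"))

print("a)", sorted(query_a))
print("b)", sorted(query_b))
```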

Exercise 1.4 For the queries below, can we still run through the intersection in time O(x + y), where x and y are the lengths of the postings lists for Brutus and Caesar? If not, what can we achieve?
a) Brutus AND NOT Caesar
b) Brutus OR NOT Caesar
Answer:
a) Yes, we can still run in O(x + y): walk both lists together and keep every posting of Brutus that is not matched by a posting of Caesar. In other words, the length of the resulting postings list is less than or equal to the length of the postings list of Brutus.
b) No, we cannot achieve O(x + y) for x OR NOT y, since the length of NOT y is not bounded by x + y: it covers every document that does not contain Caesar. The best we can achieve is O(N), where N is the number of documents in the collection.

Exercise 1.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider:
a) (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in linear time?

Exercise 1.6 We can use distributive laws for AND and OR to rewrite queries.
a) Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws.
b) Would the resulting query be more or less efficiently evaluated than the original form of this query?
c) Is this result true in general or does it depend on the words and the contents of the document collection?
Answer:
a) (Brutus ∨ Caesar) ∧ ¬(Antony ∨ Cleopatra) ≡ (Brutus ∨ Caesar) ∧ (¬Antony ∧ ¬Cleopatra) ≡ (Brutus ∧ ¬Antony ∧ ¬Cleopatra) ∨ (Caesar ∧ ¬Antony ∧ ¬Cleopatra)
b) It would be evaluated less efficiently, because the negations have to be applied once per conjunct, so the postings lists of Antony and Cleopatra are each processed twice instead of once.
c) It is true in general.
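
A rewrite into disjunctive normal form is easy to get wrong, so it is worth checking against the original formula on every truth assignment. The sketch below does this by brute force for the form (Brutus ∧ ¬Antony ∧ ¬Cleopatra) ∨ (Caesar ∧ ¬Antony ∧ ¬Cleopatra); the function names are hypothetical and the four Boolean arguments stand for "the document contains this term".

```python
from itertools import product

def original(brutus, caesar, antony, cleopatra):
    """(Brutus OR Caesar) AND NOT (Antony OR Cleopatra)."""
    return (brutus or caesar) and not (antony or cleopatra)

def dnf(brutus, caesar, antony, cleopatra):
    """(Brutus AND NOT Antony AND NOT Cleopatra) OR (Caesar AND NOT Antony AND NOT Cleopatra)."""
    return (brutus and not antony and not cleopatra) or (
        caesar and not antony and not cleopatra
    )

# Compare the two formulas on all 2^4 = 16 truth assignments.
equivalent = all(
    original(*values) == dnf(*values)
    for values in product([False, True], repeat=4)
)
print("equivalent:", equivalent)
```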

Exercise 1.7 Recommend a query processing order for
a) (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)
given the following postings list sizes:

Term            Postings size
eyes            213312
kaleidoscope    87009
marmalade       107913
skies           271658
tangerine       46653
trees           316812

*nswer: (tangerine#  (trees# B 3A34A> (marmalade#  (sies# B 3)@>)1 (aleidoscope#  (eyes# B 1-?3321 .nitial 0uery is already omptimum% (tangerine  trees# *D (marmalade  sies# *D (aleidoscope  eyes#

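Exercise 1.8 If the query is:
b) friends AND romans AND (NOT countrymen)
How could we use the frequency of countrymen in evaluating the best evaluation order? In particular, propose a way of handling negation in determining the order of query processing.
Answer: The result of this query is every document in friends AND romans that does not match any posting of countrymen, so its length is smaller than or equal to the length of friends AND romans. For ordering purposes, the size of (NOT countrymen) can be estimated as N minus the frequency of countrymen, where N is the number of documents in the collection. Since this is usually very large, intersect friends and romans first and apply NOT countrymen last, as a filter on the intermediate result.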

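Exercise 1.9 For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it isn't.
Answer: Processing a conjunctive query in order of size is not guaranteed to be optimal. Because AND is associative and commutative, (a AND b) AND c is equivalent to a AND (b AND c), so every order returns exactly the same result, but not at the same cost: the cost depends on the sizes of the intermediate results, which the list sizes alone do not determine. For example, let a, b, and c have postings lists of sizes 100, 105, and 110, where a AND b has 100 postings and a AND c has none. Processing in order of size (a, b, c) costs up to (100 + 105) + (100 + 110) = 415 steps, while the order (a, c, b) costs only 100 + 110 = 210 steps, because the empty intermediate result ends the second merge immediately.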
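
A small experiment makes the point of Exercise 1.9 concrete. The sketch below uses a simplified, assumed cost model: a linear merge of two non-empty lists is charged x + y steps, and a merge with an empty list is charged nothing. The lists and the function names are hypothetical.

```python
def intersect_with_cost(a, b):
    """Return (a AND b, cost). Cost model (an assumption): x + y for a linear
    merge of two non-empty sorted lists, 0 if either list is empty."""
    if not a or not b:
        return [], 0
    common = sorted(set(a) & set(b))
    return common, len(a) + len(b)

def total_cost(lists):
    """Intersect the lists left to right and add up the merge costs."""
    result, total = lists[0], 0
    for nxt in lists[1:]:
        result, cost = intersect_with_cost(result, nxt)
        total += cost
    return total

a = list(range(100))        # 100 postings
b = list(range(105))        # 105 postings, contains all of a
c = list(range(200, 310))   # 110 postings, disjoint from a

by_size = total_cost([a, b, c])       # increasing order of list size
alternative = total_cost([a, c, b])   # empties the intermediate result early
print(by_size, alternative)
```

Ordering by size remains a good heuristic in practice; the example only shows that it carries no guarantee.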

+ea!in 1.1, 8rite out a postings merge algorithm& in the style o f 'igure 1%A (page 11#& for an x  y 0uery% *nswer: I N V E R T E D _ I N D E X* p x=x I N V E R T E D _ I N D E X* p y=y I N V E R T E D _ I N D E X* z=a l l o c a t e ( ) w h i l e( * p x ) . n e x t! =n u l l| |( * p y ) . n e x t! =n u l l : i f( * p x ) . d o c I d! =n u l l& &( * p y ) . d o c I d! =n u l l : i f( * p x ) . d o c I d= =( * p y ) . d o c I d : ( * z ) . d o c I d=( * p x ) . d o c I d p x=( * p x ) . n e x t p y=( * p y ) . n e x t e l s e : i fp x . d o c I d

z=( * z ) . n e x t & z=n u l l
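
For readers who prefer something executable, here is one possible rendering of the x OR y merge. It is a sketch that assumes each postings list is a Python list of doc IDs sorted in increasing order; the function name `union` is hypothetical.

```python
def union(p1, p2):
    """Merge two sorted postings lists into their sorted union (x OR y)."""
    answer = []
    i = j = 0
    # Walk both lists together while neither is exhausted.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID present in both lists: emit it once, advance both.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer

print(union([1, 2, 4, 11], [2, 31, 54]))
```

Each posting is visited once, so the merge takes O(x + y) time.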

Exercise 1.11 How should the Boolean query x AND NOT y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently.
Answer: Find the postings of x that are not matched by any posting of y. Naive evaluation first builds the postings list of NOT y, which contains almost every document in the collection, and only then intersects it with x. The algorithm may look like this:
AND-NOT(x, y)
  answer ← empty list
  while x ≠ NIL and y ≠ NIL
  do if docID(x) = docID(y)
       then x ← next(x)
            y ← next(y)
     else if docID(x) < docID(y)
       then ADD(answer, docID(x))
            x ← next(x)
     else y ← next(y)
  while x ≠ NIL
  do ADD(answer, docID(x))
     x ← next(x)
  return answer
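
The x AND NOT y merge can also be written as runnable code. This is a minimal sketch that assumes both postings lists are Python lists of doc IDs in increasing order; the function name `and_not` is hypothetical.

```python
def and_not(p1, p2):
    """Return the postings of p1 that do not occur in p2 (x AND NOT y)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Doc ID is in both lists, so it is excluded: skip it in both.
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # p2 has already moved past p1[i], so p1[i] is not in p2.
            answer.append(p1[i])
            i += 1
        else:
            j += 1
    # p2 is exhausted: everything left in p1 survives.
    answer.extend(p1[i:])
    return answer

print(and_not([1, 2, 4, 11, 31], [2, 31, 54]))
```

The list for NOT y is never materialized, so the running time stays O(x + y) instead of growing with the size of the collection.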

Exercise 1.12 Write a query using Westlaw syntax which would find any of the words professor, teacher, or lecturer in the same sentence as a form of the verb explain.
Answer: (professorE TeacherE

Exercise 1.13 Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar.

Exercise 2.1 Are the following statements true or false?
a) In a Boolean retrieval system, stemming never lowers precision. (False)
b) In a Boolean retrieval system, stemming never lowers recall. (True)
c) Stemming increases the size of the vocabulary. (False)
d) Stemming should be invoked at indexing time but not while processing a query. (False)

Exercise 2.2 Suggest what normalized form should be used for these words (including the word itself as a possibility):
a) 'Cos, cos, because
b) Shi'ite, shiite
c) cont'd, contd, continued
d) Hawai'i, hawaii
e) O'Rourke, o rourke, rourke

Exercise 2.3 The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs would you argue shouldn't be conflated? Give your reasoning:
a) Abandon / abandonment. This conflation is good because abandonment is clearly the noun form of the verb abandon.
b) Absorbency / absorbent. This conflation looks fine.
c) Marketing / markets. This conflation does not look appropriate, because markets refers to places while marketing refers to a job function.
d) University / universe. This conflation is clearly not appropriate, because universe has a general meaning while university refers to a place.
e) Volume / volumes. This conflation looks really good.

Exercise 2.4 For the Porter stemmer rule group shown in (2.1):
a) What is the purpose of including an identity rule such as SS → SS?
It protects words that end in ss. The rule with the longest matching suffix is applied, so SS → SS takes precedence over the rule that strips a final S and keeps words such as boss unchanged.
b) Applying just this rule group, what will the following words be stemmed to?
circus canaries boss
circu canari boss
c) What rule should be added to correctly stem pony?
Y → I
d) The stemming for ponies and pony might seem strange. Does it have a deleterious effect on retrieval? Why or why not?
No, it does not have a highly deleterious effect on retrieval, because poni, the stem of both ponies and pony, is what is stored in the index, and it will always match poni, the stemmed form of the query term.
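
The effect of the rule group is easy to try out in code. The sketch below assumes that rule group (2.1) is the first Porter rule group, SSES → SS, IES → I, SS → SS, S → (empty), and that the rule with the longest matching suffix wins; `RULES` and `apply_rule_group` are hypothetical names, and this is not a full Porter stemmer.

```python
# Assumed contents of rule group (2.1): (suffix, replacement) pairs.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def apply_rule_group(word, rules=RULES):
    """Apply the rule with the longest matching suffix; if no rule
    matches, return the word unchanged."""
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl

for word in ["circus", "canaries", "boss", "ponies", "pony"]:
    print(word, "->", apply_rule_group(word))
```

Removing the ("ss", "ss") entry from `RULES` shows what the identity rule is for: boss would then lose its final s.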
