TREX – SAP NetWeaver’s Search and Classification Engine
SAP NetWeaver Product Management July 2008
Agenda
Introduction to TREX in SAP NetWeaver
TREX Functions and Features
TREX Architecture and Details
TREX Platform, Sizing Guidelines, …
Summary
Search Engine for SAP
TREX is the one search technology in SAP solutions
TREX is deployed in over a dozen SAP products
TREX searches and analyzes unstructured documents as well as structured business data
TREX in knowledge management provides search access to an extensible number of document repositories
TREX will provide the backend technology for Enterprise Search
Current Use of TREX in SAP Solutions / Components
In SAP NetWeaver
SAP NetWeaver Enterprise Search
SAP Enterprise Portal
Knowledge Management (KM) platform
SAP Business Intelligence (attachments; data aggregation in future)
SAP KW (Training + Documentation solution)
SAP Records Management
Further SAP solutions / components
mySAP HR: Expert Finder, e-Recruiting, Learning Solution
mySAP PLM: DMS (Document Management System)
IS Automotive: Vehicle Finder
mySAP CRM: Internet Sales (Catalog Engine), IC Web Client, Segment Builder
...more
TREX is a service provider for SAP solutions
[Figure: layered diagram. UI layer: solution-specific UIs and the SAP Enterprise Portal UI (iViews). Application layer: integrated content in SAP EP, IC Web Client, mySAP PLM DMS, KM with SAP NetWeaver. Engine layer: a TREX engine shared by several solutions and solution-specific TREX engines, each with its own indexes. TREX offers n services; a solution uses n - x of them.]
TREX – Search Services Offered
Search in
Unstructured data (documents)
Structured data (business objects)
Full text
Attributes
Different search modes
Exact
Linguistic: stemming, etc.
Fuzzy: error-tolerant search
Wildcards and truncations ( * or ? )
Phrase search for complex expressions
Boolean operators (AND, OR, NOT…)
Highlighting / HTML conversion
Content snippets (abstracts)
Federated search
…
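The difference between the exact, wildcard and fuzzy search modes listed above can be illustrated with a small sketch. It uses only the Python standard library and does not reflect how TREX implements these modes; the term list, the function names and the similarity cutoff are assumptions chosen for illustration.

```python
import difflib
import fnmatch

# Hypothetical index terms; "serach" is a deliberate typo.
TERMS = ["search", "serach", "engine", "index"]

def exact(query, terms):
    """Exact mode: only identical terms match."""
    return [t for t in terms if t == query]

def wildcard(pattern, terms):
    """Wildcard mode: * matches any run of characters, ? exactly one."""
    return [t for t in terms if fnmatch.fnmatchcase(t, pattern)]

def fuzzy(query, terms, cutoff=0.8):
    """Fuzzy mode: accept terms whose similarity ratio reaches the cutoff.

    The cutoff of 0.8 is an assumed value, not a TREX default.
    """
    return [t for t in terms
            if difflib.SequenceMatcher(None, query, t).ratio() >= cutoff]

print(exact("search", TERMS))      # ['search']
print(wildcard("s?arch", TERMS))   # ['search']
print(fuzzy("search", TERMS))      # also tolerates the typo 'serach'
```

Linguistic search (stemming) is left out on purpose, because it needs language-specific resources that a short sketch cannot reproduce.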
TREX Text Mining Engine – Services Offered
Document Feature extraction
Find characteristic keywords (noun phrases)
Find similar documents
Based on document features
Document classification
Assign a document to predefined categories
Term search
Find better search terms; discover interesting relationships
Document clustering
Discover sets of related documents
Term clustering
Discover sets of related terms in the current corpus
TREX Attribute Engine
Search for all types of attribute
String, integer, floating point, date, and so on
Sort query results by any attribute
For example, sort documents by date or by author
Support range search
For example, find sales orders from the last two months
New:
Multihost enabled
Attribute search enabled not only for case-insensitive ASCII but also for case-insensitive Unicode
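As a rough illustration of attribute range search and sorting, the following sketch filters and orders plain Python records. It is not the TREX attribute engine or its API; the `SalesOrder` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class SalesOrder:
    """Hypothetical business object with typed attributes."""
    order_id: int
    customer: str
    created: date

def range_search(orders, start, end):
    """Range search: keep orders created between start and end (inclusive)."""
    return [o for o in orders if start <= o.created <= end]

def sort_by(orders, attribute):
    """Sort results by any attribute name, for example 'created' or 'customer'."""
    return sorted(orders, key=lambda o: getattr(o, attribute))

orders = [
    SalesOrder(1, "Meyer", date(2008, 3, 1)),
    SalesOrder(2, "Adams", date(2008, 6, 15)),
    SalesOrder(3, "Schmidt", date(2008, 7, 1)),
]
# "Sales orders from the last two months", with the window given explicitly.
recent = range_search(orders, date(2008, 5, 1), date(2008, 7, 1))
print([o.order_id for o in sort_by(recent, "customer")])
```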
Additional Services – Example: Interactive Search
1. User enters search keyword Berlin
2. Search results: hits grouped by attributes (overlapping) and listed by attribute value ranges (disjunct), for example Name, Street (A-D ... M-P ...) and Age (0-18 ...), each with its hit count
3. User clicks on group M-P in Street (M-P < Street < Berlin)
4. Search results: groups ordered by relevance for this search. All hits are in M-P streets in Berlin, for example Pariserstr (3 hits)
5. User clicks street and sees hits
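The grouping step of the interactive search can be sketched as follows: hits are counted per disjoint value range of one attribute, and a click on a range narrows the hit list. This is a minimal illustration, not TREX code. The slide shows only the ranges A-D and M-P, so the other range boundaries, the sample hits and the function names are assumptions.

```python
from collections import Counter

# Assumed disjoint letter ranges; only A-D and M-P appear on the slide.
RANGES = [("A", "D"), ("E", "H"), ("I", "L"), ("M", "P"), ("Q", "T"), ("U", "Z")]

def range_label(value):
    """Map an attribute value to the letter range of its first character."""
    first = value[:1].upper()
    for low, high in RANGES:
        if low <= first <= high:
            return f"{low}-{high}"
    return "other"

def group_hits(hits, attribute):
    """Count hits per value range of the given attribute."""
    return Counter(range_label(hit[attribute]) for hit in hits)

def drill_down(hits, attribute, label):
    """Keep only the hits that fall into the clicked range."""
    return [hit for hit in hits if range_label(hit[attribute]) == label]

# Hypothetical hits for the keyword "Berlin".
hits = [
    {"name": "Schulz", "street": "Pariserstr"},
    {"name": "Becker", "street": "Mohrenstr"},
    {"name": "Klein", "street": "Bergstr"},
]
print(group_hits(hits, "street"))
print(drill_down(hits, "street", "M-P"))
```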
SAP NetWeaver TREX
[Figure: architecture diagram. Access: users, ABAP client, Java client, browser UI with server pages. TREX servers: RFC server, web server, name server, preprocessor, queue server, index server. Engines: natural language interface engine, text search engine, text mining engine, join engine, attribute engine, all working on the indexes. Data sources, partly reached through a crawler: BO, DB, www, …]
TREX APIs …
…may only be used SAP-internally
Cannot directly be used by customers or partners
…can indirectly be accessed via other APIs
e.g. the KM IMS API or the ABAP search engine service
TREX – Supported Languages
Arabic
Italian
Chinese trad.
Japanese
Chinese simpl.
Korean
Czech
Polish
Danish
Portuguese
Dutch
Norwegian bok.
English
Norwegian nyn.
Finnish
Romanian
French
Russian
German
Spanish
Greek
Swedish
Hebrew
Thai
Hungarian
Turkish
…more
Indexable MIME Types
e.g. MS Word, HTML, XLS, QuattroPro, PDF, Lotus Manuscript, MS Rich Text
Format | Versions
ANSI Text (7 & 8 bit) | All versions
ASCII Text (7 & 8 bit versions available) | All versions
Corel WordPerfect for Windows | Versions through 9.0
DEC WPS Plus (DX) | Versions through 4.0
DEC WPS Plus (WPL) | Versions through 4.1
DisplayWrite 2 & 3 (TXT) | All versions
DisplayWrite 4 & 5 | Versions through 2.0
Enable | Versions 3.0, 4.0 and 4.5
First Choice | Versions through 3.0
Framework | Version 3.0
HTML | Versions through 3.0 (some limitations)
IBM FFT | All versions
IBM Revisable Form Text | All versions
IBM Writing Assistant | Version 1.01
JustWrite | Versions through 3.0
Legacy | Versions through 1.1
Lotus AMI/AMI Professional | Versions through 3.1
Lotus Manuscript | Versions through 2.0
Lotus WordPro (Win16 and Win32 / Intel platforms) | SmartSuite 96, 97 and Millennium
Lotus WordPro (Non-Windows platforms - text only) | SmartSuite 97 and Millennium
MacWrite II | Version 1.1
MASS11 | Versions through 8.0
Microsoft Rich Text Format (RTF) | All versions
Microsoft Word for DOS | Versions through 6.0
Microsoft Word for Macintosh | Versions 4.0 through 98
Microsoft Word for Windows | Versions through 2000
Microsoft WordPad | All versions
Microsoft Works for DOS | Versions through 2.0
Microsoft Works for Macintosh | Versions through 2.0
Microsoft Works for Windows | Versions through 4.0
Microsoft Write | Versions through 3.0
MultiMate | Versions through 4.0
Navy DIF | All versions
Nota Bene | Version 3.0
Novell Perfect Works | Version 2.0
Novell WordPerfect for DOS | Versions through 6.1
Novell WordPerfect for Mac | Versions 1.02 through 3.0
Novell WordPerfect for Windows | Versions through 7.0
OfficeWriter | Version 4.0 to 6.0
PC-File Letter | Versions through 5.0
PC-File+ Letter | Versions through 3.0
PFS:Write | Versions A, B, and C
Professional Write for DOS | Versions through 2.1
Professional Write Plus | Version 1.0
Q&A for DOS | Version 2.0
Q&A Write for Windows | Version 3.0
Samna Word | Versions through Samna Word IV+
SmartWare II | Version 1.02
Sprint | Version 1.0
TotalWord | Version 1.2
Unicode Text | All versions
Note: The Lotus WordPro filter is for Win32 on the Intel platforms only, and is provided to SAP "as is", without any representations or warranties.
...approximately 200 text-containing file types
TREX 7.1 in next major release of SAP NetWeaver
Focus platform support on:
Linux for x86_64
Windows for x86_64
In Detail: http://service.sap.com/pam
Reasons for platform reduction
Optimize scalability and performance for fewer platforms more efficiently
Ensure a highly performance-optimized TREX for our customers
Ensure completeness of 64-bit coding for next release
No negative or limiting impact on other SAP solutions expected, because of TREX’s internal client/server architecture and planned appliance delivery
Reduce cost of development and support and focus on available expertise at TREX development
Previous TREX releases up to 7.0
Platform support of previous releases remains valid until their end of maintenance
Under the 5+2+1 maintenance model, TREX 7.0 for SAP NetWeaver 2004s is thus supported until 2014.
TREX releases in current use
Release | End of maintenance
TREX 5.0 | Out of maintenance
TREX 6.0 | End 2006
TREX 6.1 | 2013
TREX 7.0 | 2014
Current intention is to move 6.0 and 6.1 installations mostly to 7.0
Summary
The future TREX platform focus on Windows and Linux is a decision based on the development resources and expertise available in TREX development.
It will enable more focussed development to optimize TREX performance and supportability.
It does not express any general platform preference trend at SAP.
Configuration and Administration
TREX Search and Indexing
Landscape Configuration
Excursion: TREX Sizing
RFC Connection
Administration and Monitoring
TREX is Highly Scalable
TREX can be distributed on multiple hosts
TREX hosts can have dedicated roles (indexing, searching, backup, ...)
TREX processes can run multiple times within the same TREX instance on one host
TREX hosts can be added at any time
TREX Hosts – Master and Slave
[Diagram: a master host (RFC, WS, M NS, M QS, M IS, PP) and slave hosts (RFC, WS, S NS, S IS, PP). NS = name server, QS = queue server, IS = index server, PP = preprocessor, WS = web server; the prefixes M and S mark master and slave.]
Master Host
Responsible for indexing
Can also be used for searching, but not in the default configuration
Manages the original version of the index
Slave Hosts
Responsible for searching
Ensure performance during indexing times
Manage a copy of the master index
Index is created and updated using a replication procedure
TREX Hosts – Master, Slave and Backup
[Diagram: a backup host (RFC, WS, M NS, B QS, B IS, PP), a master host (RFC, WS, M NS, M QS, M IS, PP), a slave host (RFC, WS, S NS, S IS, PP) and a file server]
Backup Host
Replaces the master index server and queue server if they become unavailable
Inactive if the master server and queue server are available
Data has to be stored centrally
Use one backup server for the whole system or one backup server per master server
TREX Hosts – Scalability
Load distribution for searching and indexing
High availability for searching
Indexing larger data sets
TREX:
Is scalable
Provides load balancing
Provides an HA solution for search
[Diagram: a master host (RFC, WS, M NS, M QS, M IS, PP) and several slave hosts (RFC, WS, S NS, S IS, PP each)]
Landscape Example
An Approach to Sizing – Detailed Agenda
KPIs for TREX sizing
Quick information on BIA sizing
Sizing Methods and Tools
Structured Data
Unstructured Data
Example for document based TREX landscape
Different landscapes for different stages
Key Performance Indicators for Sizing TREX
CPU
Processing time: load during indexing and search
Expressed in SAPS
Disk
Storage of indexes and queues
Expressed in MB
Memory
Memory consumption during indexing and search
Expressed in MB
Network Load
Transferred amount of data
KB per server request
Parameters Influencing TREX Sizing
Amount of indexed data
Search load
Type of indexed data
Number of languages
Amount and frequency of delta indexing
High availability needs
TREX Processing Structured and Unstructured Data
Object based applications: mainly attributes
Document based applications: mainly text
[Diagram: BI with a TREX engine as BI Accelerator and its indexes; other applications with a solution-specific TREX engine and its indexes]
Possible Sizing Methods and Tools
Rule of thumb
“A typical CPU can process 4000 scenarios”
T-Shirt Sizing
Simple algorithms with many assumptions
Formulas
Simple or more complex
Offline Questionnaires
For structured questions
Quick Sizer
Based on users and throughput
Sizing TREX for Structured Data – Rule of Thumb
[Diagram: records, each carrying several attribute values (AttrValue 1.0, 1.1, 1.2, 1.4, 2, 3.0, 3.1, 3.2), are sent to the TREX engine]
#attributes (mixed set of integer, string, text) x #values x #objects < 100 million attributes per index server
Rule of thumb
100 million attributes: about 1000 SAPS (2 GB RAM)
200 million attributes: about 2000 SAPS (4 GB RAM)
Varies largely depending on the amount of string and text attributes and on multivalue attributes
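The rule of thumb above can be turned into a small estimator. This is a sketch under stated assumptions: it scales SAPS and RAM linearly between the two data points on the slide (100 million and 200 million attributes), which the slide does not state explicitly, and the function and constant names are hypothetical. The result is only a starting point, since the slide notes that real figures vary with the share of string, text and multivalue attributes.

```python
import math

# Values taken from the slide's rule of thumb.
ATTRIBUTES_PER_INDEX_SERVER = 100_000_000
SAPS_PER_100M = 1000
RAM_GB_PER_100M = 2

def size_structured(num_attributes, values_per_attribute, num_objects):
    """Estimate index servers, SAPS and RAM for structured data.

    Assumes linear scaling between the two data points given on the slide.
    """
    total = num_attributes * values_per_attribute * num_objects
    units = total / ATTRIBUTES_PER_INDEX_SERVER
    return {
        "total_attribute_values": total,
        "index_servers": max(1, math.ceil(units)),
        "saps": units * SAPS_PER_100M,
        "ram_gb": units * RAM_GB_PER_100M,
    }

# Example: 20 single-valued attributes on 10 million objects.
print(size_structured(20, 1, 10_000_000))
```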
Sizing TREX for Structured Data – Approach
1. Use the questionnaire to get an overview of your scenario
2. Use the given formula to get a rough idea of how many index servers you need
3. Do hands-on sizing by either
   using test data from the application, or
   generating test data on the TREX machine, if you know the datasets
   (generate test data and send it to TREX independently of the application)
4. Test indexing and search performance by monitoring CPU load and RAM consumption
5. Decide whether your datasets allow larger or smaller amounts of attribute sets per index server
6. Split the index or design the landscape with different indexes
[Diagram: TREX engine with several index servers (IS)]
Sizing TREX for Unstructured Data 1
Assumptions
80% mixture of predominantly office documents
20% PDF, HTML and ASCII
Data volume of indexed content: 100 GB
Leads to
Compression ratio of 1:40 from size of source data (documents) to index size in main memory
Searching: up to 18 000 per hour
Indexing: 24 hrs time consumption
2000 SAPS / 6 GB RAM
4000 SAPS / 20 GB RAM
Sizing TREX for Unstructured Data 2
Required Disk Space – rule of thumb
Item | HTML/text documents | Mixed set of documents
Index size + queue (permanent) | Document set size x 2 | Document set size x 0.5
Index snapshot size (permanent), in distributed scenarios without central storage | (Document set size x 2) x 0.7 | (Document set size x 0.5) x 0.7
Temporary disk space | Document set size x 1.5 | Document set size x 0.5
Sum without snapshot | Document set size x 3.5 | Document set size x 1
Sum with snapshot | Document set size x 4.9 | Document set size x 1.35
Sizing TREX for Unstructured Data 3
Required space – rule of thumb for a mixed set of documents
Example: 50 GB of office and HTML/text documents
Disk Space
Index size + queue (permanent): Document set size x 0.5 = 25 GB (50 GB x 0.5)
Index snapshot size (permanent): Document set size x 0.5 x 0.7 = 17.5 GB (50 GB x 0.5 x 0.7)
Temporary disk space: Document set size x 0.5 = 25 GB (50 GB x 0.5)
Total: 67.5 GB
Main Memory
Compression ratio 1:40 = 1.25 GB
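The 50 GB example can be reproduced with a short calculation. The sketch below uses the factors from the disk space rule of thumb (index and queue, snapshot at 0.7 of the index size, temporary space) and the 1:40 compression ratio for main memory. The function and dictionary names are hypothetical, and applying the 1:40 ratio to pure HTML/text document sets is an assumption, since the slides state it only for the mixed set.

```python
# Factors from the rule-of-thumb slides, relative to the document set size.
FACTORS = {
    "html_text": {"index_and_queue": 2.0, "temporary": 1.5},
    "mixed": {"index_and_queue": 0.5, "temporary": 0.5},
}
SNAPSHOT_FACTOR = 0.7      # snapshot size relative to index size + queue
COMPRESSION_RATIO = 40     # source data : index size in main memory

def size_unstructured(document_set_gb, profile="mixed", with_snapshot=True):
    """Estimate disk space and main memory (in GB) for a document set."""
    factors = FACTORS[profile]
    index_and_queue = document_set_gb * factors["index_and_queue"]
    snapshot = index_and_queue * SNAPSHOT_FACTOR if with_snapshot else 0.0
    temporary = document_set_gb * factors["temporary"]
    return {
        "index_and_queue_gb": index_and_queue,
        "snapshot_gb": snapshot,
        "temporary_gb": temporary,
        "total_disk_gb": index_and_queue + snapshot + temporary,
        "main_memory_gb": document_set_gb / COMPRESSION_RATIO,
    }

# The slide's example: 50 GB of office and HTML/text documents.
print(size_unstructured(50))
```

For the mixed set this gives 25 GB + 17.5 GB + 25 GB = 67.5 GB of disk space and 1.25 GB of main memory, matching the figures on the slide.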
An Example 1: Large Document Based TREX Landscape
2 Sets of Documents
Notes
Basic quantity: 1.6 million / 800 000 documents per language
Languages: 3 (E/G/J)
Growth: 2000 per day new or changed
Discussion threads
Basic quantity: 4.2 million
Languages: 40
Growth: 20 000 per day new or changed
Search requests: 150 000 per day and 20 000 per day
An Example 2: Component Information System
Notes: 1.6 million, about 30 GB
Discussion threads: 4.2 million, about 100 GB
[Diagram: the application system accesses a TREX engine with a notes index and a discussions index]
An Example 3: Master Hosts
[Diagram: two master hosts, mytrexmaster01 and mytrexmaster02, each with RFC servers, a name server (NS), preprocessors (PP), index servers (IS) in index mode and a queue server (QS); the discussions index is shown as one logical index]
An Example 4: Slave Hosts
[Diagram: two slave hosts, mytrexslave01 and mytrexslave02, each with RFC servers, a name server (NS), preprocessors (PP) and index servers (IS) in search mode; the index is shown as one logical index]
2 slave hosts supporting master host system
All in all 5 slave host systems (10 servers)
An Example 5: Discussion Threads Servers
Physical index sources
master hosts01+02
slave hosts01+02
slave hosts03+04
slave hosts05+06
slave hosts07+08
slave hosts09+10
SAPS: 24 000
RAM: 48 GB
Disk: 60 GB
Index is updated two times a day with about 10 000 new or changed documents per index run
Running on 12 blades
An Example 6: Notes
[Diagram: masterhost01 (RFC servers, NS, PP, IS in index mode, QS) and slavehost01 to slavehost05 (each with RFC servers, NS, PP, IS); separate indexes for German, English (E) and Japanese (J)]
Using a delta index
Merges every hour
Index size:
German: 7 GB
English: 8 GB
Japanese: 1 GB
An Example 7: Landscape Solution
About 35 000 users in different time zones
[Diagram: four application servers access the discussions index and the notes index]
An Example 8: Summary
Requirements
5.8 million objects
More than 20 000 new or changed documents per day
35 000 users
170 000 search requests per day
25 languages to be processed
Solution
CPU: 36 000 SAPS
RAM: 48 GB
Disk: 200 GB (80 GB index size, 120 GB temporary space for index update)
Index updates twice a day for discussion threads and every hour for notes
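The summary figures follow from the two document sets described in the earlier example slides. The sketch below simply adds them up as a cross-check. The data layout and names are hypothetical, and the assignment of the daily growth figures to notes and discussion threads is an assumption that does not affect the totals.

```python
# Figures copied from the example slides; the layout is hypothetical.
DOCUMENT_SETS = {
    "notes": {"documents": 1_600_000, "changes_per_day": 2_000},
    "discussion threads": {"documents": 4_200_000, "changes_per_day": 20_000},
}
SEARCH_REQUESTS_PER_DAY = [150_000, 20_000]   # one figure per document set
DISK_GB = {"index size": 80, "temporary space for index update": 120}

def summarize():
    """Add up the per-set figures into the totals shown on the summary slide."""
    return {
        "objects": sum(s["documents"] for s in DOCUMENT_SETS.values()),
        "changes_per_day": sum(s["changes_per_day"] for s in DOCUMENT_SETS.values()),
        "search_requests_per_day": sum(SEARCH_REQUESTS_PER_DAY),
        "disk_gb": sum(DISK_GB.values()),
    }

print(summarize())
```

The totals are 5.8 million objects, 22 000 changes per day (the slide's "more than 20 000"), 170 000 search requests per day and 200 GB of disk space.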
TREX in Different Stages – Two Examples
Initial Indexing
Stage I
High load during initial indexing stage
Multiple master index servers and preprocessors to speed up initial indexing
Stage II
Less indexing and preprocessing required
Use the master host (indexing) / slave host (searching) concept
Remove one host from landscape
Adding more applications or content
Stage I
Start with a small installation
One host for indexing and searching due to low update frequency and few search requests
Stage II
Add master and/or slave hosts
More search load than expected and/or backup server necessary