Routine Maintenance and Troubleshooting for DWDM
Unitrans D&T Group

Course Objectives
- Master the basic operations for routine maintenance.
- Know how to handle common faults.

Contents
- Basic Maintenance Knowledge
- Basic Operations for Routine Equipment Maintenance
- Basic Operations for Routine Network Management Maintenance
- Maintenance and Handling for Common Faults
- Fault Cases

Equipment Operating Environment
- Power: -48 V (-38.4 V to -57.6 V)
- Grounding
- Temperature and humidity

Operating condition | Temperature (°C) | Relative humidity (%)
Long-time operating conditions | 0 to 40 | 10 to 90
Short-time operating conditions | -5 to 45 | 5 to 95

Note: A short-time operating condition means that continuous operation lasts no more than 72 hours and that the total operating time in one year is no more than 15 days.
It is recommended to keep the room temperature around 20 °C and the room humidity around 60%.
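
The operating limits above can be turned into a simple room-condition check. The following is a minimal sketch, not part of any EMS: the function name `classify_environment` and the returned labels are assumptions made for illustration, and only the numeric limits come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    temp_c: tuple        # (min, max) temperature in degrees Celsius
    humidity_pct: tuple  # (min, max) relative humidity in percent


# Limits taken from the operating environment slide.
LONG_TIME = Limits(temp_c=(0, 40), humidity_pct=(10, 90))
SHORT_TIME = Limits(temp_c=(-5, 45), humidity_pct=(5, 95))
POWER_V = (-57.6, -38.4)  # allowed supply voltage range around -48 V


def classify_environment(temp_c, humidity_pct, supply_v):
    """Return a hypothetical label describing which limits the readings satisfy."""
    if not POWER_V[0] <= supply_v <= POWER_V[1]:
        return "power out of range"

    def within(lim):
        return (lim.temp_c[0] <= temp_c <= lim.temp_c[1]
                and lim.humidity_pct[0] <= humidity_pct <= lim.humidity_pct[1])

    if within(LONG_TIME):
        return "long-time"
    if within(SHORT_TIME):
        # Acceptable only for up to 72 continuous hours and 15 days per year.
        return "short-time only"
    return "out of range"


print(classify_environment(20, 60, -48.0))
print(classify_environment(43, 60, -48.0))
```

A reading classified as "short-time only" still needs the 72-hour and 15-day limits to be tracked separately; the sketch does not do that.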

Routine Maintenance and Precautions
- Laser
- Electrics
- Board Maintenance
- Fiber Maintenance at Site

Laser
Laser safety precautions
- Personnel safety
  - The laser on an optical interface board emits infrared light that can cause permanent damage. Never look directly into an optical interface, to protect your eyes from the laser.
  - The optical output power of DWDM equipment is usually very high, so protect your skin from laser burns.
- Equipment damage
  - Cover unused optical interfaces and fiber pigtail connectors with dustproof caps.
  - If a fiber pigtail is used for a hardware loopback, add an optical attenuator before the optical receiving port to control the input power.
  - The received optical power must stay within the allowed range.

Electrics
Anti-static precautions
- Before touching equipment or boards, put on an anti-static wrist strap and plug its other end into the anti-static jack of the equipment sub-frame.
- Wear anti-static gloves before touching other electrical components on boards, such as ICs.
Power maintenance and precautions
- Never install or remove powered equipment.
- Never install or remove the power cable of powered equipment.
- Guard against short circuits and apply insulation treatment during engineering work.
- Before connecting, confirm that the cables and cable labels match the actual installation.

Board Maintenance
Anti-static
- Store unused boards in anti-static bags.
- Before touching a board, put on an anti-static wrist strap that connects your body to the equipment protection ground.
Damp proofing
- Normally, a bag of desiccant is placed in the static-shielding bag to absorb moisture and keep the inside of the bag dry.
Board mechanical safety
- Avoid vibrating boards during transportation, because vibration can easily damage them.

Fiber Maintenance at Site
- After connecting a fiber to its port, leave some slack and coil it in the fiber coiling box.
  - Do not coil fibers too tightly.
  - Never press or squeeze fibers.
  - Keep the fiber bending diameter at 10 cm or more.
- Any fiber that has been pulled out must be cleaned.

Fiber Cleaning
- Before cleaning a fiber, make sure that it is disconnected from all active components and carries no light at all.
- Hold the fiber connector and clean the end surface of the ceramic pin with dust-free paper. Drag the fiber cleaning paper slowly and straight along the end surface in one direction. Repeat this two or three times, and then spray compressed gas onto the pin faces after they are dry.
- Check the end surface of the connector.

Routine Maintenance Operations
- Software Loopback
- Hardware Loopback
- Board Reset
- Optical Power Test

Software Loopback
[Figure: an OTU board between the user side and the wavelength division side, showing the user side near-end loopback (OAC) and the circuit far-end loopback (OCH).]
- The signal is sent directly to the corresponding output interface to form a loop (OCH side or OAC side).
- Software loopback can be used to check the running condition of the fiber circuit and the fiber connectors.

Hardware Loopback
A hardware loopback loops the signal of an interface by connecting the optical receiving interface and the optical transmitting interface with a fiber pigtail.
[Figure: hardware loopback on the user side and on the WDM side, with a fiber pigtail connecting an OUT port to an IN port. Attention: add an optical attenuator.]
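
How much attenuation a hardware loopback needs follows from simple dB arithmetic. The sketch below assumes the transmit power and the receiver overload point are known from the board specifications. The function name, the 3 dB safety margin, and the example power values are illustrative assumptions, not values from the equipment documentation.

```python
def required_attenuation_db(tx_power_dbm, rx_overload_dbm, margin_db=3.0):
    """Smallest attenuation (dB) that keeps the looped-back input power
    at least margin_db below the receiver overload point."""
    target_max_input_dbm = rx_overload_dbm - margin_db
    needed = tx_power_dbm - target_max_input_dbm
    return max(0.0, needed)


# Hypothetical example: +2 dBm transmitter looped into a receiver
# whose overload point is -9 dBm.
print(required_attenuation_db(2.0, -9.0))
```

The attenuated power must also stay above the receiver sensitivity, which this sketch does not check.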

Board Reset
- Perform a soft or hard reset of the board from the EMS.
- Unplug and re-plug the board to perform a hard reset.
- Press the reset button on the SNP board to perform a hard reset of the SNP board.

Optical Power Test
[Figure: an optical power meter connected to the testing port (OUT, MON, or IN) of an optical board.]
Attention: Observe the permitted input optical power of the tester.

Period and Items of Routine Equipment Maintenance

Period | Maintenance items
Daily | Check the power supply in the equipment room; check the temperature and humidity in the equipment room; check the cleanness of the equipment room; check the cabinet indicators; check the board indicators; check the audio alarm of the equipment
Monthly | Check the fan box and clean the dust filter; check the service telephones; test error codes
Quarterly | Test SNR at the MPI-R point
Yearly | Check the grounding cable and power cable

Indicators of Equipment Cabinet
- Use the cabinet indicators to judge whether an alarm has occurred on the equipment and how severe it is.
- Meanings of the indicators of the cabinet and the column-head cabinet:

Indicator | Meaning | ON | OFF
Red indicator | Critical alarm indicator | A critical alarm occurs on the equipment, usually with an audio alarm. | No critical alarm occurs on the equipment.
Yellow indicator | Minor alarm indicator | A minor alarm occurs on the equipment. | No minor alarm occurs on the equipment.
Green indicator | Power indicator | Equipment power supply is normal. | Equipment power supply is cut off.
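
The cabinet indicator meanings above amount to a small lookup. As a minimal sketch, the hypothetical helper below turns the three indicator states into the statements from the table. The function name and the message strings are assumptions for illustration only.

```python
def describe_cabinet(red_on, yellow_on, green_on):
    """Translate cabinet indicator states into the meanings listed in the table."""
    messages = []
    # Green indicator: power status.
    messages.append("power supply normal" if green_on else "power supply cut off")
    # Red indicator: critical alarm, usually with an audio alarm.
    if red_on:
        messages.append("critical alarm")
    # Yellow indicator: minor alarm.
    if yellow_on:
        messages.append("minor alarm")
    if not (red_on or yellow_on):
        messages.append("no critical or minor alarm")
    return messages


print(describe_cabinet(red_on=True, yellow_on=False, green_on=True))
```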

Board Indicator
- Use the board indicators to judge whether an alarm has occurred on the board.
- Meanings of the board indicators:

Working status | NOM (Green) | ALM (Red)
Waiting for configuration | Red and green indicators flash alternately. |
Running normally | Flashes slowly and regularly | OFF
Board alarm | Flashes slowly and regularly | ON
Board POST | Red and green indicators flash quickly three times. |
Board is entering download status | Red and green indicators flash quickly at the same time. |
Downloading | Red and green indicators flash slowly at the same time. |

Indicator of Main Control Board
- Meanings of the indicators of the main control board:

ALM (Red) | NOM (Green) | Running status
Flash | OFF | No basic database.
Flash | ON | The SNP board is installed in the equipment but not configured in the EMS.
OFF | Flash | The system runs normally.
OFF | ON | (1) During system initialization, the system is ready to recreate the database (the original SNP data is empty). (2) During operation, the system has failed.
Red and green indicators flash alternately. | | SNP reset.
Red and green indicators flash alternately. | | Agent program download. (The SNP indicators are normal while a board program is being downloaded.)

Checking Audio Alarm of Equipment
- Check whether the audio alarm function is set correctly and works normally.
- Operation
  - Check the ring trip switch on the left of the cabinet.
  - Normally, it should be set to the NORMAL position.
  - When the switch is set to the OFF position, the alarm sound is stopped.

Checking Fan Box and Cleaning Dust Filter
- Check whether the fans work normally and clean the dust filter regularly to guarantee good heat dissipation for the equipment.
- Check the fan box status:
  1. Check whether a Fan Failure alarm occurs on the EMS. If not, the fan box works normally.
  2. Check whether each small fan runs normally. If so, the green indicator flashes slowly and the red indicator is off.
- Clean the dust filter:
  1. Take the air filter out from the bottom of the subrack, clean it with water, and air-dry it before putting it back.
  2. Adjust the position of the air filter, then push it back along the slide guide of the lower subrack.

Testing SNR at MPI-R Point
- Check whether the SNR at the receiving side has degraded while the equipment is running.
- Operations:
  1. Connect the board at the MPI-R point to the OSA.
  2. Set the OSA to WDM test mode.
  3. Test the SNR of each channel and record the result.
  4. Compare the test results with the history records (or the standard), and analyze any exceptional data.
[Figure: the OSA connected to the MON optical interface of the board.]
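
The comparison against history records in the last step can be sketched as a per-channel difference. This is a minimal illustration only: the function name `degraded_channels`, the dictionary layout, and the 1 dB default threshold are assumptions, and the real acceptance limit should come from the project standard.

```python
def degraded_channels(history_db, current_db, max_drop_db=1.0):
    """Return {channel: drop_in_dB} for channels whose SNR fell by more
    than max_drop_db compared with the history record."""
    flagged = {}
    for channel, now in current_db.items():
        reference = history_db.get(channel)
        if reference is None:
            continue  # no history record for this channel
        drop = reference - now
        if drop > max_drop_db:
            flagged[channel] = round(drop, 2)
    return flagged


# Hypothetical readings in dB, keyed by channel number.
history = {1: 25.0, 2: 24.0, 3: 26.0}
current = {1: 24.6, 2: 21.5, 3: 26.1}
print(degraded_channels(history, current))
```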

Routine Maintenance Items of Network Management
- Check the status of NEs and boards.
- Check the current alarms.
- Check the 15-minute performance and 24-hour performance (such as received optical power, launched optical power, current, temperature, and error codes), then analyze and record the results.
- Query the log information.

Checking Status of NE and Board
- In T3, judge the current running status from the board colors.
  - Red: critical alarm.
  - Orange: major alarm.
  - Yellow: minor alarm.

Checking Current / History Alarm
- Query the current alarms of the whole network: click the alarm indicator in the main window.
- Query the current alarms of an NE: right-click the NE, then choose Current Alarm.
- Query the history alarms of an NE: right-click the NE, then choose History Alarm.

Checking Current / History Performance Event
- Right-click an NE, then choose Current Performance or History Performance.

Checking Log Records
- Check the operations of all users and print out the logs if necessary.

Routine Maintenance Items of Network Management System
- Monthly maintenance items
  - Query and synchronize the time of NEs.
  - Query board configuration information.
  - Back up the T3 database.

Checking Spectrum Analysis Unit
Select the OPM board in Device Manager, then select WDM Maintenance->OPM Spectra to test the signal spectrum with the OPM board.

Synchronize NE Time
- Keep the NE time and the EMS time synchronized so that the timestamps of the information reported by NEs are correct.
- In the Topology Management view, click an NE and choose Operation->Topology Operation->NE Time Management.

Backup and Restore T3 Database
- In the System Management view, expand Device Tree->Database, log in to the database, and then perform Backup & Restoration->Restore.
- The default backup directory is \server\bin\backup.

Preparation for Fault Location
- Technical preparation
  - Know DWDM theory well.
  - Understand the signal flow of the WDM system.
  - Be familiar with the WDM equipment and the basic operations of the EMS.
  - Be familiar with the basic operations of common meters.
- Project networking
  - Network topology
  - Wavelength assignment
  - Equipment running status
  - Project documentation
- Before handling a fault, the maintenance personnel should inform the network management center so that onsite data is collected, saved, and backed up.
- Record the details of each operation step during fault handling.
- All of this data and these records are a useful reference when handling similar faults in the future.

Basic Principles of Fault Location
[Figure: accurate fault location.]
- Fault reason
  - Check external causes first, and then consider causes within the transmission equipment itself.
- Alarm level
  - When analyzing alarms, analyze the higher-level alarms first, and then the lower-level alarms.

Alarm Levels
- Critical alarm
  - The alarm results in service interruption and must be handled immediately.
- Major alarm
  - The alarm affects the service and should be handled promptly.
- Minor alarm
  - The alarm may potentially affect the service and should be handled.
- Warning
  - A prompt caused by a misoperation on the EMS; it does not affect the service.
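
The rule of analyzing higher-level alarms before lower-level ones is a sort by severity. The sketch below is illustrative: the `Severity` enum and the tuple layout are assumptions, not an EMS export format, and the sample alarm names are taken from alarms mentioned elsewhere in this course.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower value means higher priority, matching the order of the alarm levels.
    CRITICAL = 1
    MAJOR = 2
    MINOR = 3
    WARNING = 4


def analysis_order(alarms):
    """Sort (severity, description) pairs so the most severe come first.
    sorted() is stable, so alarms of equal severity keep their original order."""
    return sorted(alarms, key=lambda alarm: alarm[0])


alarms = [
    (Severity.MINOR, "Over-high laser bias current"),
    (Severity.CRITICAL, "OA no light inputting"),
    (Severity.MAJOR, "B1 error code over threshold"),
    (Severity.CRITICAL, "LOF"),
]
for severity, description in analysis_order(alarms):
    print(severity.name, description)
```

The severities attached to the sample alarms are placeholders; the actual level of each alarm is defined by the EMS.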

Principles of Troubleshooting
Key points:
1. Fault direction: unidirectional or bidirectional fault.
2. Affected object: single-wave or multi-wave.
Fault types:
1. Optical power fault
2. ECC fault
3. Service transient interruption fault
4. Error code fault
5. Equipment interconnection fault
Common troubleshooting methods:
1. Alarm and performance analysis method
2. Instrument testing method
3. Substitution method
4. Loopback method
5. Experience method

Analysis and Handling: Optical Power Overthreshold Alarm
- Weak input light, no input light, or strong input light
  - Handling: find and exclude faults level by level.
- Weak output light or no output light
  - Possible causes: a board component is damaged, the laser is shut off, or the connector has failed.

Typical Alarm Handling
- OA weak input light alarm or OA no input light alarm
  1. System reason: the fiber has failed (the usual reason).
  2. Board reason: the 1510/1550 demultiplexer at the input end is damaged, or the optical interface or flange at the input end is damaged.
- OA weak output light alarm or OA no output light alarm
  1. System reason: the fault is upstream of the OA if the OA input changes at the same time.
  2. Board reason: the inner fiber of the board may be damaged if the optical power at the MON interface is normal.

Typical Alarm Handling
- The OA input is normal, but the OA has no output light.
  - Board reasons: the OA temperature is over the threshold, a cable is disconnected, or the inner fiber of the EDFA has failed.
  - Handling
    1. Check the MON optical power of the OA board (OBA/OLA: OUT power = MON power + 23 dB; OPA: OUT power = MON power + 16 dB). If the MON optical power is normal, the alarm may be caused by the inner fiber of the EDFA.
    2. Check through the EMS whether the OA driver current or backlight current is normal. If not, check the cable connection of the EDFA. If the connection is normal, the board may have failed; consider replacing it.
    3. APSD status: the OA downstream fiber or the reverse-direction upstream fiber is broken. The APSD function makes the OA stop pumping automatically and periodically send recovery pulses to test the fiber. In this case the board itself works normally.
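
The MON-to-OUT relations in step 1 can be applied directly when reading the monitor port. The following is a minimal sketch that assumes the offsets stated above (23 dB for OBA and OLA, 16 dB for OPA) hold exactly. The function names and the 1 dB tolerance are illustrative assumptions.

```python
# Offset in dB between the OUT port and the MON port, per OA board type.
MON_TO_OUT_OFFSET_DB = {"OBA": 23.0, "OLA": 23.0, "OPA": 16.0}


def expected_out_power_dbm(board_type, mon_power_dbm):
    """Estimate the OUT power from a MON port reading."""
    return mon_power_dbm + MON_TO_OUT_OFFSET_DB[board_type]


def out_power_matches(board_type, mon_power_dbm, measured_out_dbm, tolerance_db=1.0):
    """True if the measured OUT power agrees with the MON-based estimate."""
    expected = expected_out_power_dbm(board_type, mon_power_dbm)
    return abs(expected - measured_out_dbm) <= tolerance_db


# Hypothetical readings: MON reads -6 dBm on an OBA board.
print(expected_out_power_dbm("OBA", -6.0))
print(out_power_matches("OPA", -20.0, -10.0))
```

If the MON-based estimate is normal while the OUT port shows no light, the text above points to the inner fiber as the likely cause.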

Analysis and Handling: Error Code Overthreshold Alarm
- B1 error code over threshold alarm, B1 errored second (ES) count over threshold alarm, B1 severely errored second (SES) count over threshold alarm, B2 error code count over threshold alarm, B1 error code count over threshold alarm.
- Handling
  - Locate the fault by dividing the trunk into several segments.
  - Collect the optical power of the boards between the two TU/SRM/GEM boards to locate the fault.

Typical Alarm Handling
- Single-channel error code / loss of lock / loss of signal
  - Board fault location: a single-channel error code can be localized to a board and transmission section by using board performance data such as the B1 error code.
  - Handling: first test fiber parameters such as reflection. If these parameters are correct, the board performance may have degraded. Replace the board if necessary.
- Multi-channel error code / loss of lock / loss of signal
  - Generally, the main optical channel is faulty if all the channels have error codes. The SNR or the optical power at some point may be out of specification.

Analysis and Handling: LOF Alarm
- Loss Of Frame (LOF)
- Handling
  - Collect the upstream optical power to locate the fault.

Analysis and Handling: Alarm of SOSC Board
- Signal interruption alarm, error code alarm, optical power over threshold alarm
- Handling
  - First, verify whether the main optical channel is faulty.
  - Next, verify whether the supervisory channel itself is faulty.
  - Finally, locate the faulty site or board.

Typical Alarm Handling
- OOF alarm of SOSC board
  1. The output optical power (0 to -7 dBm) or the extinction ratio of the 2M module of the previous SOSC does not meet the standard. Replace the faulty optical module or the board.
  2. The optical power does not meet the standard because the fiber jumper connector is dirty or broken. Clean the optical fiber or replace the faulty fiber pigtail.
  3. The receiver sensitivity does not meet the standard, or the input optical power is too low. Replace the board or readjust the receiver sensitivity. If the fault is caused by input optical power that is too low, increase the upstream output optical power or replace the board with one that has a lower (more sensitive) receiver sensitivity value.

Analysis and Handling: J0 TIM Alarm
- J0 TIM alarm
- Handling (either of the following)
  - Modify the J0 value sent by the SDH equipment so that it is the same as that of the DWDM equipment.
  - Modify the J0 value sent by the DWDM equipment so that it is the same as that of the SDH equipment.

Analysis and Handling: Over-high Current Alarm
- Over-high refrigeration current alarm, over-high laser bias current alarm
- Handling
  - Change the threshold.
  - Check for board faults.
  - Check for external power faults.

Analysis and Handling: Board-Related Alarms
- Board not-in-position alarm
  - No board is installed in the slot: plug in a board that matches the configuration set on the EMS.
  - The board program is faulty, so the board software does not run: replace the board or the board program.
  - A backplane pin or the board socket is faulty, so the NCP cannot detect the board: carefully check whether any pin is bent or broken. If so, replace the backplane or the board.
- Board in-position alarm
  - A slot is not configured with a board in the EMS, but the EMS finds that a board is installed in that slot of the actual equipment.
  - Handling: in the EMS, configure the slot with a board of the same type as the one actually installed.
- Board mismatch alarm
  - A slot has been configured with a board in the EMS, but the board information detected by the EMS differs from that of the board actually installed in the slot.
  - Handling: modify the configuration, or replace the board with one that matches the configuration set on the EMS.

Performance Analysis and Handling
- An abnormal performance value results in the corresponding alarm.
- An abnormal performance value is handled in the same way as the corresponding alarm.
- Commonly, use the alarm menu and the performance menu together to locate and handle a fault faster.

Troubleshooting Flow
Troubleshooting flow when alarms appear at the equipment side:
1. Rack alarm: check the boards.
2. Board alarm: obtain the board fault information.
3. On the EMS, query the alarm type and current performance of the faulty board, and refer to the history performance.
4. Judge the alarm type of the relevant boards (upstream or downstream): if an OTU board has no input light, check the input light status of the other OTUs, the OPA, and the ODU at this site, as well as the output light status of the OTU at the previous site.
5. Fault analysis: refer to the power budget configuration model. If all OTUs have no input light, the main optical channel on the multiplexing layer must have failed; check the status of the OPA and ODU. If the other OTUs are normal, a single channel must have failed; check the output status of the previous OTU.
6. Fault handling: after confirming the fault and its location, replace the board, repair the circuit, or clean the optical interface.
7. Fault elimination.
8. Compare the current performance with the history performance of the related boards on the EMS: query the current performance, and check whether any other fault exists, whether the fault has fully recovered, and whether any adjustment is needed.
9. Record and archive.
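
The fault analysis step distinguishes a main optical channel failure from a single-channel failure by looking at which OTUs at the site still receive light. A minimal sketch of that decision follows. The function name, the input dictionary, and the returned labels are illustrative assumptions.

```python
def locate_no_input_fault(otu_has_input):
    """otu_has_input maps an OTU name to True if it receives input light.
    Returns (diagnosis, list of OTUs without input light)."""
    dark = [name for name, has_light in otu_has_input.items() if not has_light]
    if not dark:
        return "no input fault", []
    if len(dark) == len(otu_has_input):
        # Every OTU is dark: suspect the multiplexing layer (check OPA and ODU).
        return "main optical channel", dark
    # Only some OTUs are dark: check the output of the previous-site OTU.
    return "single channel", dark


# Hypothetical site with three OTUs, one of them without input light.
print(locate_no_input_fault({"OTU-1": True, "OTU-2": False, "OTU-3": True}))
```

With only one OTU at a site the two cases cannot be told apart this way, so the OPA and ODU status still has to be checked.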

Case 1
- During a site inspection, the third channel was found to be abnormal.
- Handling:
  - A query on the EMS showed LOF and LOS alarms on the third channel of the OTU10G.
  - At first, an incorrect loopback was suspected. However, after the fiber at the IN port of the OTU10G was pulled out, the input optical power shown on the EMS was still -9.32 dBm.
  - The history performance showed that the input optical power had stayed at -9.32 dBm since February of this year, even with no actual input.
  - On site, the FEC function of the board was disabled and the board was tested with a 10G meter. The board could not work normally and the signal channel was abnormal, which means the board was damaged. A new board was requested to replace the bad one.
- Reflections:
  - The receiving optical module of the board is clearly damaged. The receiving optical module of the third channel of the OTU10G at this site is an APD, and its overload point is -9 dBm.
  - Make sure that the received optical power is never high enough to burn out the board.

Case 2
- During a site inspection, the tenth channel could not pass the channel error code test because an error code occurred every 1 to 2 minutes. On the EMS, none of the boards on the channel reported error codes.
- Handling:
  - The error code test was repeated using the OTUR of another wavelength. No error codes occurred, so the OTUR 10G was considered faulty. With its FEC function disabled and a meter connected, the error codes were still present but not displayed on the EMS, which means the module of the board had failed. After the module was replaced, the error codes disappeared.
- Reflections:
  - If error codes occur during channel testing, first use the EMS to find the board where the error codes occur. If no board reports error codes, check whether the optical module of the OTUR is faulty. Finally, test the board and replace it after confirming the fault.

Case 3
- On March 29th, the maintenance personnel found that the 24-hour history performance was lost.
- Handling:
  - All of the 15-minute history performance records were lost, which may have been caused by an SNP reset.
  - The default performance reporting interval of the SNP is 2 hours, so the 15-minute performance records of those 2 hours are lost after an SNP reset. The 24-hour performance was also lost because the SNP was reset at the 24-hour performance reporting time.
- Reflections:
  - On the EMS, the SNP performance reporting interval can be set to 30 minutes, but this increases the ECC communication load, which is not good for NE communication.
  - An automatic reset can have many causes, but the primary one is that SNP packet reception and transmission are abnormal for a certain period. The detection circuit then resets the SNP, which restores communication, but the performance data for that period is lost.
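
The trade-off in Case 3 can be quantified with a small calculation. The sketch below assumes, as the case description implies, that 15-minute records not yet reported to the EMS are lost when the SNP resets. The function name is hypothetical, and the result is an upper bound for a reset just before the next report.

```python
def records_at_risk(report_interval_min, record_period_min=15):
    """Maximum number of performance records that can be lost if the SNP
    resets just before it reports to the EMS."""
    return report_interval_min // record_period_min


print(records_at_risk(120))  # default 2-hour reporting interval
print(records_at_risk(30))   # 30-minute reporting interval
```

A shorter interval puts fewer records at risk but, as noted above, adds load to the ECC channel.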

Case 4
- On March 29th, the maintenance personnel found that an MCU timeout message was reported when the performance of the OA and SRM boards was queried.
- Handling:
  - The fault still existed after an OA board was reset. Selected performance parameters could still be queried through the EMS. The fault was caused by abnormal communication between the SNP and the S port. Resetting the SNP board through the EMS solved the problem, and the performance events could then be queried normally.
- Reflections:
  - The OA board does not have many performance items, and its queries rarely time out, so the problem is mainly caused by the SNP.
  - Resetting the SNP board solves this problem and does not affect the service of the OTU and SRM boards.