NetEng 101: 2015

Last night I was asked by a customer to help diagnose their l2vpn xconnect. I've made several of these in the past and usually they just work -- with the exception of an MTU mismatch that is easily fixed. Well this one was troublesome. I swear I tried checking more commands than need be, even the "event-trace" debugging commands:

sh mpls l2transport vc 11000 event-trace
sh l2vpn internal event-trace error
sh l2vpn internal event-trace event
sh l2vpn internal event-trace major

Everything looked like our other xconnects -- a switchport mode trunk with no vlans allowed across, and a service instance that actually has the xconnect information. It may not be your typical routed port configuration, but it works for our purposes (and helps if your customer is sending tagged frames). After much debugging, Googling, and banging my head against the desk through numerous permutations of the configuration, I ran across this:

Router#sh mpls l2transport vc 11000 event-trace
Local interface: Gi0/3 up, line protocol up, Ethernet:1 down
Destination address: 10.0.110.1, VC ID: 11000, VC status: down
Last error: Local peer access circuit is down
Output interface: Te0/2, imposed label stack {16268 158}
Preferred path: not configured
Default path: active
Next hop: 10.0.1.185
Create time: 00:00:02, last status change time: 00:00:02
Last label FSM state change time: 00:00:02
Signaling protocol: LDP, peer 10.0.110.1:0 up
Targeted Hello: 10.0.10.16(LDP Id) -> 10.0.110.1, LDP is UP
Graceful restart: configured and enabled
Non stop routing: not configured and not enabled
Status TLV support (local/remote) : enabled/supported
LDP route watch : enabled
Label/status state machine : established, LrdRru
Last local dataplane status rcvd: No fault
Last BFD dataplane status rcvd: Not sent
Last BFD peer monitor status rcvd: No fault
Last local AC circuit status rcvd: DOWN AC(rx/tx faults)
Last local AC circuit status sent: No fault
Last local PW i/f circ status rcvd: No fault
Last local LDP TLV status sent: DOWN AC(rx/tx faults)
Last remote LDP TLV status rcvd: No fault
Last remote LDP ADJ status rcvd: No fault
MPLS VC labels: local 129, remote 158
Group ID: local 0, remote 0
MTU: local 1500, remote 1500
Remote interface description: to CUSTOMER-Z-END
Sequencing: receive disabled, send disabled
Control Word: On
Dataplane:
SSM segment/switch IDs: 6595/6594 (used), PWID: 2
VC statistics:
transit packet totals: receive 0, send 0
transit byte totals: receive 0, send 0
transit packet drops: receive 0, seq error 0, send 0
AToM VC event trace:
2015 Jul 23 18:37:38.176000: 1861168: AToM[10.0.110.1, 11000]: .... S:Act send notify(DOWN), remote up timer
2015 Jul 23 18:37:38.176000: 1861169: AToM[10.0.110.1, 11000]: ..... Send notify(DOWN)
2015 Jul 23 18:37:38.176000: 1861170: AToM[10.0.110.1, 11000]: ..... Local AC : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861171: AToM[10.0.110.1, 11000]: ..... Overall : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861172: AToM[10.0.110.1, 11000]: ..... Send LDP for status change from UP
2015 Jul 23 18:37:38.176000: 1861173: AToM[10.0.110.1, 11000]: ..... Start remote up timer
2015 Jul 23 18:37:38.176000: 1861174: AToM[10.0.110.1, 11000]: ..... NMS: VC oper state: DOWN
2015 Jul 23 18:37:38.176000: 1861175: AToM[10.0.110.1, 11000]: ..... SYSLOG: VC is DOWN, Loc AC Err
2015 Jul 23 18:37:38.176000: 1861176: AToM[10.0.110.1, 11000]: ..... PW MIB: VC state is: DOWN
2015 Jul 23 18:37:38.176000: 1861177: AToM[10.0.110.1, 11000]: ... Local ready
2015 Jul 23 18:37:38.176000: 1861178: AToM[10.0.110.1, 11000]: .... Local service is ready; send a label
2015 Jul 23 18:37:38.176000: 1861179: AToM[10.0.110.1, 11000]: .... Alloc local binding
2015 Jul 23 18:37:38.176000: 1861180: AToM[10.0.110.1, 11000]: ..... No need to update the local binding
2015 Jul 23 18:37:38.176000: 1861181: AToM[10.0.110.1, 11000]: .... Generate local event
2015 Jul 23 18:37:38.176000: 1861182: AToM[10.0.110.1, 11000]: .... Ready, label 129
2015 Jul 23 18:37:38.176000: 1861183: AToM[10.0.110.1, 11000]: .... Evt local ready, in established
2015 Jul 23 18:37:38.176000: 1861184: AToM[10.0.110.1, 11000]: ..... Local ready and established
2015 Jul 23 18:37:38.176000: 1861185: AToM[10.0.110.1, 11000]: .. Check if can activate dataplane
2015 Jul 23 18:37:38.176000: 1861186: AToM[10.0.110.1, 11000]: ... Keep the dataplane UP for AC DN
2015 Jul 23 18:37:38.176000: 1861187: AToM[10.0.110.1, 11000]: ... Dataplane already active

So what exactly does Last error: Local peer access circuit is down mean? I couldn't for the life of me find this on the web anywhere. Other people have had similar problems but no one has reported a solution. Some people said to try to clear the mpls label for the remote host, which didn't work. Oh well, you can only trust the internet so much!

As it got later in the evening I started comparing this switch's config to one that worked. Everything was the same, minus the VC and remote host addresses. So I started looking deeper -- maybe it was an IOS version? For those of you who have dealt with Cisco products, you know that occasionally bugs are introduced or re-introduced. It's just the nature of the beast. I hopped on Cisco's website and began poking around through their bug search tool for the software version I was using, and found over 48 some bugs. Some closed, some open, but definitely there, and related to my problem.

In my case, this was a bug due to a particular software version:

me360x-universalk9-mz.153-2.S2

The software that was working on our other device was:

me360x-universalk9-mz.152-4.S4.bin

So as a last ditch effort to fix this problem, I coordinated with the customer to downgrade the switch. This isn't something I would normally do, but this particular situation was driving me mad. After getting the approval, I went ahead and issued a

boot system flash:me360x-universalk9-mz.152-4.S4.bin

saved the config, and reloaded. Pacing back and forth around the desk a few times, the switch finally started pinging again. I quickly logged in and found that my xconnect problem had been solved. The customer responded within minutes saying everything was working.

So, moral of the story? If your l2vpn xconnect isn't working, do a bug search on Cisco's site. If everything should be working, and you've compared it to a known working copy, look for the only thing that may be different -- software version.

NetEng 101

Friday, July 24, 2015

L2VPN xconnect troubleshooting on an ME3600X

The Cisco Zone