Friday, July 24, 2015

L2VPN xconnect troubleshooting on an ME3600X

     Last night I was asked by a customer to help diagnose their l2vpn xconnect.  I've made several of these in the past and usually they just work -- with the exception of an MTU mismatch that is easily fixed.  Well this one was troublesome.  I swear I tried checking more commands than need be, even the "event-trace" debugging commands:

sh mpls l2transport vc 11000 event-trace
sh l2vpn internal event-trace error
sh l2vpn internal event-trace event
sh l2vpn internal event-trace major

     Everything looked like our other xconnects -- a switchport mode trunk with no vlans allowed across, and a service instance that actually has the xconnect information.  It may not be your typical routed port configuration, but it works for our purposes (and helps if your customer is sending tagged frames).  After much debugging, Googling, and banging my head against the desk through numerous permutations of the configuration, I ran across this:

Router#sh mpls l2transport vc 11000 event-trace
Local interface: Gi0/3 up, line protocol up, Ethernet:1 down
  Destination address: 10.0.110.1, VC ID: 11000, VC status: down
    Last error: Local peer access circuit is down
    Output interface: Te0/2, imposed label stack {16268 158}
    Preferred path: not configured
    Default path: active
    Next hop: 10.0.1.185
  Create time: 00:00:02, last status change time: 00:00:02
    Last label FSM state change time: 00:00:02
  Signaling protocol: LDP, peer 10.0.110.1:0 up
    Targeted Hello: 10.0.10.16(LDP Id) -> 10.0.110.1, LDP is UP
    Graceful restart: configured and enabled
    Non stop routing: not configured and not enabled
    Status TLV support (local/remote)   : enabled/supported
      LDP route watch                   : enabled
      Label/status state machine        : established, LrdRru
      Last local dataplane   status rcvd: No fault
      Last BFD dataplane     status rcvd: Not sent
      Last BFD peer monitor  status rcvd: No fault
      Last local AC  circuit status rcvd: DOWN AC(rx/tx faults)
      Last local AC  circuit status sent: No fault
      Last local PW i/f circ status rcvd: No fault
      Last local LDP TLV     status sent: DOWN AC(rx/tx faults)
      Last remote LDP TLV    status rcvd: No fault
      Last remote LDP ADJ    status rcvd: No fault
    MPLS VC labels: local 129, remote 158
    Group ID: local 0, remote 0
    MTU: local 1500, remote 1500
    Remote interface description: to CUSTOMER-Z-END
  Sequencing: receive disabled, send disabled
  Control Word: On
  Dataplane:
    SSM segment/switch IDs: 6595/6594 (used), PWID: 2
  VC statistics:
    transit packet totals: receive 0, send 0
    transit byte totals:   receive 0, send 0
    transit packet drops:  receive 0, seq error 0, send 0
AToM VC event trace:
2015 Jul 23 18:37:38.176000: 1861168: AToM[10.0.110.1, 11000]: .... S:Act send notify(DOWN), remote up timer
2015 Jul 23 18:37:38.176000: 1861169: AToM[10.0.110.1, 11000]: ..... Send notify(DOWN)
2015 Jul 23 18:37:38.176000: 1861170: AToM[10.0.110.1, 11000]: .....  Local AC  : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861171: AToM[10.0.110.1, 11000]: .....  Overall   : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861172: AToM[10.0.110.1, 11000]: ..... Send LDP for status change from UP
2015 Jul 23 18:37:38.176000: 1861173: AToM[10.0.110.1, 11000]: ..... Start remote up timer
2015 Jul 23 18:37:38.176000: 1861174: AToM[10.0.110.1, 11000]: ..... NMS: VC oper state:  DOWN
2015 Jul 23 18:37:38.176000: 1861175: AToM[10.0.110.1, 11000]: ..... SYSLOG: VC is DOWN, Loc AC Err
2015 Jul 23 18:37:38.176000: 1861176: AToM[10.0.110.1, 11000]: ..... PW MIB: VC state is: DOWN
2015 Jul 23 18:37:38.176000: 1861177: AToM[10.0.110.1, 11000]: ... Local ready
2015 Jul 23 18:37:38.176000: 1861178: AToM[10.0.110.1, 11000]: .... Local service is ready; send a label
2015 Jul 23 18:37:38.176000: 1861179: AToM[10.0.110.1, 11000]: .... Alloc local binding
2015 Jul 23 18:37:38.176000: 1861180: AToM[10.0.110.1, 11000]: ..... No need to update the local binding
2015 Jul 23 18:37:38.176000: 1861181: AToM[10.0.110.1, 11000]: .... Generate local event
2015 Jul 23 18:37:38.176000: 1861182: AToM[10.0.110.1, 11000]: .... Ready, label 129
2015 Jul 23 18:37:38.176000: 1861183: AToM[10.0.110.1, 11000]: .... Evt local ready, in established
2015 Jul 23 18:37:38.176000: 1861184: AToM[10.0.110.1, 11000]: ..... Local ready and established
2015 Jul 23 18:37:38.176000: 1861185: AToM[10.0.110.1, 11000]: .. Check if can activate dataplane
2015 Jul 23 18:37:38.176000: 1861186: AToM[10.0.110.1, 11000]: ...  Keep the dataplane UP for AC DN
2015 Jul 23 18:37:38.176000: 1861187: AToM[10.0.110.1, 11000]: ... Dataplane already active

     So what exactly does  Last error: Local peer access circuit is down mean? I couldn't for the life of me find this on the web anywhere.  Other people have had similar problems but no one has reported a solution.  Some people said to try to clear the mpls label for the remote host, which didn't work.  Oh well, you can only trust the internet so much!

     As it got later in the evening I started comparing this switch's config to one that worked.  Everything was the same, minus the VC and remote host addresses.  So I started looking deeper -- maybe it was an IOS version? For those of you who have dealt with Cisco products, you know that occasionally bugs are introduced or re-introduced.  It's just the nature of the beast.  I hopped on Cisco's website and began poking around through their bug search tool for the software version I was using, and found over 48 some bugs.  Some closed, some open, but definitely there, and related to my problem.

In my case, this was a bug due to a particular software version:

me360x-universalk9-mz.153-2.S2

The software that was working on our other device was:

me360x-universalk9-mz.152-4.S4.bin

     So as a last ditch effort to fix this problem, I coordinated with the customer to downgrade the switch.  This isn't something I would normally do, but this particular situation was driving me mad.  After getting the approval, I went ahead and issued a

boot system flash:me360x-universalk9-mz.152-4.S4.bin

saved the config, and reloaded.  Pacing back and forth around the desk a few times, the switch finally started pinging again.  I quickly logged in and found that my xconnect problem had been solved.  The customer responded within minutes saying everything was working.

     So, moral of the story? If your l2vpn xconnect isn't working, do a bug search on Cisco's site.  If everything should be working, and you've compared it to a known working copy, look for the only thing that may be different -- software version.