Monday, January 2, 2017

Network Engineer pro tips

I found this list pretty decent, and it's definitely full of things we should all be aware of as our networking careers grow.

1) ALWAYS TRIPLE CHECK EVERYTHING. Make SURE things work by verifying.
2) Never assume anything
3) Never be confident in yourself in anything. Question everything you do even if you know it
4) Listen to other people and ask for them to sanity check your work
5) Whatever you do, keep it stupid simple. You're not the only one using the network
6) Don't be afraid to re-evaluate everything you think you know every single day
7) You know less than you think you know
8) Don't trust documentation unless it's simple and straightforward
9) Never stop learning more
10) Learn operating systems, and linux usage/administration specifically
11) Be nice to people, even if you want to beat the ever living daylights out of them
12) Learn to use Excel, Notepad, and scripting. It will save your work life. Pick a language and stick to it
13) Bleeding edge technologies might sound interesting and cool, but they rarely pay the bills. Learn consistent, simple, straightforward engineering
14) Use the right tool for the right job. Not everything is a hammer, and not everything is a nail
15) All vendors are shit. Some are less shit than others. No vendor does everything well
16) Be flexible in the technologies you know and how to apply them irrespective of vendor
17) Learn how TCP, UDP, IP really work. I mean, REALLY work
18) Learn how to use Wireshark. Take a class if you need to
19) The network you work on is not yours
20) Your work is NOT necessarily representative of your skills. A shitty network doesn't mean you're a bad network engineer. A good network doesn't mean you're a good network engineer
21) Be demanding, but be fair
22) Admit if you messed up
23) Slow down

As taken from a comment in:
https://www.reddit.com/r/networking/comments/5ljq7j/what_are_your_networking_pro_tips/

Wednesday, November 2, 2016

QoS troubleshooting commands for 4500E/3650 networks

4500E tips from StackExchange

show platform hardware interface ten1/1/31 statistics
Shows input bytes by CoS.

show platform hardware qos interface foo X/Y
Shows queue lengths and flow counts.

show interface foo X/Y counter detail
Shows interface egress packets by queue, queue drops, and DBL drops.

show policy-map interface foo X/Y
Mostly useful if you are doing custom policy maps.

show interface foo X/Y capabilities
Shows the number and type of queues available on a particular interface.

show qos maps
Shows the DSCP-to-CoS mappings.
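As a quick sanity check on those mappings: on most Catalyst platforms the default DSCP-to-CoS map is simply the top three bits of the 6-bit DSCP value (CoS = DSCP >> 3). A minimal Python sketch of that relationship (the PHB names below are standard DiffServ values, not output from any particular switch):

```python
# Default DSCP-to-CoS mapping on most Catalyst platforms:
# CoS is the top three bits of the 6-bit DSCP value.
def dscp_to_cos(dscp: int) -> int:
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0-63")
    return dscp >> 3

# Common per-hop behaviors and their expected default CoS values
for name, dscp in [("BE", 0), ("AF21", 18), ("AF31", 26), ("EF", 46), ("CS6", 48)]:
    print(f"{name:5} DSCP {dscp:2} -> CoS {dscp_to_cos(dscp)}")
```

Handy when the "show qos maps" output looks suspicious: anything that doesn't follow this pattern was probably remapped by someone on purpose.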

3650 tips

TBD

Monday, October 17, 2016

Handy Palo Alto Networks SNMP traps

For those of you running Palo Alto Networks devices in your environment, here are a few snmp traps I've found useful.

panROUTINGRoutedBGPPeerLeftEstablishedTrap - sent when a BGP peer drops and leaves the Established state

panROUTINGRoutedBGPPeerEnterEstablishedTrap - sent when a BGP peer re-enters the Established state

Wednesday, October 12, 2016

Handy Regex search strings

This post will be updated occasionally to include strings I've found to be useful in day to day work.  A great way to test these is to simply go to regexr.com, paste in the body of text/strings you want to match against, and verify with your own regex strings whether they match or not.


  • Negative lookahead for syslog string alerting.  Used primarily on the SolarWinds syslog viewer, this should be placed in the message type field to exclude any syslog message containing the string "PLATFORM".  All other messages will match as normal.

    ^(?!.*PLATFORM).*$

  • IPv4 address matching

    [0-9]+(?:\.[0-9]+){3}
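Both strings can also be verified offline with Python's re module; here is a quick sketch (the sample syslog messages are made up for illustration):

```python
import re

# Negative lookahead: match only messages that do NOT contain "PLATFORM"
exclude_platform = re.compile(r"^(?!.*PLATFORM).*$")

# Loose IPv4 matcher: four dot-separated runs of digits
ipv4 = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

messages = [
    "%PLATFORM-4-ELEMENT_WARNING: Supervisor 1 temp high",   # should be filtered out
    "%LINK-3-UPDOWN: Interface Gi1/0/1, changed state to down",  # should match
]

for msg in messages:
    print(bool(exclude_platform.match(msg)), msg)

print(ipv4.findall("peer 10.0.110.1 via 10.0.1.185"))
```

Note the IPv4 pattern is deliberately loose: it will also happily match something like 999.300.1.2, which is usually fine for grepping logs but not for strict validation.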

Monday, July 11, 2016

EEM redirecting logs to an ftp server

Today I was asked to help with a request from our voice engineer.  On occasion she has to run debugs for specific call functions on our ISR 4K routers, which show up in the buffered log.  We tried sending those log entries to our syslog server, but since the transport protocol was udp/514 (the default), some of the messages were arriving out of order.  We tried a second syslog server and got the same result.

Another option was to use tcp instead, so we attempted to get that working.  Unfortunately the syslog server was not listening on that port, so we had to improvise.  We came up with the idea of exporting the full log file to an ftp server every so often, and using EEM to accomplish this.

Our initial script runs as a privilege 15 user once per day, and uses a while loop to iterate every 60 minutes (3600 seconds).  It simply dumps the entire log buffer into a text file whose name is a variable that increments once per hour.

event manager session cli username "neteng" privilege 15
event manager applet syslog_redirect
 event timer cron cron-entry "0 0 * * *" maxrun 86400
 action 0.00 cli command "term len 0"
 action 0.01 set i "0"
 action 0.02 while $i le 23
 action 0.03  cli command "show log | redirect ftp://10.10.10.10/logs/$i.txt"
 action 0.04  increment i 1
 action 0.05  wait 3600
 action 0.06 end

This worked pretty well, and we saw the script running once per hour as expected.  Unfortunately the log buffer is set to 70000000 bytes (due to the large volume of debug entries written for each call), so writing the files takes about 4-5 minutes.  The script actually generates the log files at 12:04am, 1:08am, and so on.  At first I was apprehensive about this and wanted the file to correspond to the variable integer, but then realized we had been looking at the timestamp of the file rather than the filename itself.

Overall, a very basic script that does what we wanted.  However, it sometimes gets stuck, and we have to ask the question: why the hell are we letting a process run for 24 hours? That eats up CPU and memory cycles.  Here is a better version:

event manager applet syslog_redirect
 event timer cron cron-entry "0 * * * *"
 action 0.0 cli command "show clock"
 action 0.1 string range "$_cli_result" 2 4
 action 0.2 set hour "$_string_result"
 action 0.3 string range "$_cli_result" 6 7
 action 0.4 set mins "$_string_result"
 action 0.5 string range "$_cli_result" 20 22
 action 0.6 set day "$_string_result"
 action 0.7 cli command "show log | redirect ftp://10.13.5.36/vgw/test/$day$hour.$mins.txt"
 action 0.8 cli command "clear log" pattern "confirm"
 action 0.9 cli command "yes"

It will even clear the log after running, making your next log dumps smaller.  Depending on your "show clock" output, you may need to adjust the numbers in the string range actions, as they are literal character positions within the output itself.
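To see how those positions line up, EEM's "string range" follows Tcl semantics, where both indices are inclusive. A small Python sketch of the same extraction (the sample clock string and offsets below are purely illustrative; in the applet, $_cli_result also contains the echoed command and prompt, which shifts the offsets):

```python
# EEM "string range" uses Tcl semantics: both indices are inclusive.
# Python's slice end is exclusive, so "string range $s A B" maps to s[A:B+1].
def string_range(s: str, first: int, last: int) -> str:
    return s[first:last + 1]

# Illustrative "show clock" format only; your router's output will differ,
# and $_cli_result includes the command echo, so test your own offsets.
clock = "12:04:08.123 UTC Mon Jul 11 2016"

print(string_range(clock, 0, 1))   # hour field: "12"
print(string_range(clock, 3, 4))   # minute field: "04"
```

The quickest way to find your real offsets is to run the applet once with just the "show clock" action and count characters in the captured $_cli_result.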

Friday, July 24, 2015

L2VPN xconnect troubleshooting on an ME3600X

     Last night I was asked by a customer to help diagnose their l2vpn xconnect.  I've made several of these in the past and usually they just work -- with the exception of an MTU mismatch that is easily fixed.  Well, this one was troublesome.  I swear I checked more commands than I needed to, even the "event-trace" debugging commands:

sh mpls l2transport vc 11000 event-trace
sh l2vpn internal event-trace error
sh l2vpn internal event-trace event
sh l2vpn internal event-trace major

     Everything looked like our other xconnects -- a switchport mode trunk with no vlans allowed across, and a service instance that actually has the xconnect information.  It may not be your typical routed port configuration, but it works for our purposes (and helps if your customer is sending tagged frames).  After much debugging, Googling, and banging my head against the desk through numerous permutations of the configuration, I ran across this:

Router#sh mpls l2transport vc 11000 event-trace
Local interface: Gi0/3 up, line protocol up, Ethernet:1 down
  Destination address: 10.0.110.1, VC ID: 11000, VC status: down
    Last error: Local peer access circuit is down
    Output interface: Te0/2, imposed label stack {16268 158}
    Preferred path: not configured
    Default path: active
    Next hop: 10.0.1.185
  Create time: 00:00:02, last status change time: 00:00:02
    Last label FSM state change time: 00:00:02
  Signaling protocol: LDP, peer 10.0.110.1:0 up
    Targeted Hello: 10.0.10.16(LDP Id) -> 10.0.110.1, LDP is UP
    Graceful restart: configured and enabled
    Non stop routing: not configured and not enabled
    Status TLV support (local/remote)   : enabled/supported
      LDP route watch                   : enabled
      Label/status state machine        : established, LrdRru
      Last local dataplane   status rcvd: No fault
      Last BFD dataplane     status rcvd: Not sent
      Last BFD peer monitor  status rcvd: No fault
      Last local AC  circuit status rcvd: DOWN AC(rx/tx faults)
      Last local AC  circuit status sent: No fault
      Last local PW i/f circ status rcvd: No fault
      Last local LDP TLV     status sent: DOWN AC(rx/tx faults)
      Last remote LDP TLV    status rcvd: No fault
      Last remote LDP ADJ    status rcvd: No fault
    MPLS VC labels: local 129, remote 158
    Group ID: local 0, remote 0
    MTU: local 1500, remote 1500
    Remote interface description: to CUSTOMER-Z-END
  Sequencing: receive disabled, send disabled
  Control Word: On
  Dataplane:
    SSM segment/switch IDs: 6595/6594 (used), PWID: 2
  VC statistics:
    transit packet totals: receive 0, send 0
    transit byte totals:   receive 0, send 0
    transit packet drops:  receive 0, seq error 0, send 0
AToM VC event trace:
2015 Jul 23 18:37:38.176000: 1861168: AToM[10.0.110.1, 11000]: .... S:Act send notify(DOWN), remote up timer
2015 Jul 23 18:37:38.176000: 1861169: AToM[10.0.110.1, 11000]: ..... Send notify(DOWN)
2015 Jul 23 18:37:38.176000: 1861170: AToM[10.0.110.1, 11000]: .....  Local AC  : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861171: AToM[10.0.110.1, 11000]: .....  Overall   : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861172: AToM[10.0.110.1, 11000]: ..... Send LDP for status change from UP
2015 Jul 23 18:37:38.176000: 1861173: AToM[10.0.110.1, 11000]: ..... Start remote up timer
2015 Jul 23 18:37:38.176000: 1861174: AToM[10.0.110.1, 11000]: ..... NMS: VC oper state:  DOWN
2015 Jul 23 18:37:38.176000: 1861175: AToM[10.0.110.1, 11000]: ..... SYSLOG: VC is DOWN, Loc AC Err
2015 Jul 23 18:37:38.176000: 1861176: AToM[10.0.110.1, 11000]: ..... PW MIB: VC state is: DOWN
2015 Jul 23 18:37:38.176000: 1861177: AToM[10.0.110.1, 11000]: ... Local ready
2015 Jul 23 18:37:38.176000: 1861178: AToM[10.0.110.1, 11000]: .... Local service is ready; send a label
2015 Jul 23 18:37:38.176000: 1861179: AToM[10.0.110.1, 11000]: .... Alloc local binding
2015 Jul 23 18:37:38.176000: 1861180: AToM[10.0.110.1, 11000]: ..... No need to update the local binding
2015 Jul 23 18:37:38.176000: 1861181: AToM[10.0.110.1, 11000]: .... Generate local event
2015 Jul 23 18:37:38.176000: 1861182: AToM[10.0.110.1, 11000]: .... Ready, label 129
2015 Jul 23 18:37:38.176000: 1861183: AToM[10.0.110.1, 11000]: .... Evt local ready, in established
2015 Jul 23 18:37:38.176000: 1861184: AToM[10.0.110.1, 11000]: ..... Local ready and established
2015 Jul 23 18:37:38.176000: 1861185: AToM[10.0.110.1, 11000]: .. Check if can activate dataplane
2015 Jul 23 18:37:38.176000: 1861186: AToM[10.0.110.1, 11000]: ...  Keep the dataplane UP for AC DN
2015 Jul 23 18:37:38.176000: 1861187: AToM[10.0.110.1, 11000]: ... Dataplane already active

     So what exactly does "Last error: Local peer access circuit is down" mean? I couldn't for the life of me find this on the web anywhere.  Other people have had similar problems, but no one had reported a solution.  Some said to try clearing the mpls label for the remote host, which didn't work.  Oh well, you can only trust the internet so much!

     As it got later in the evening I started comparing this switch's config to one that worked.  Everything was the same, minus the VC and remote host addresses.  So I started looking deeper -- maybe it was an IOS version? For those of you who have dealt with Cisco products, you know that occasionally bugs are introduced or re-introduced.  It's just the nature of the beast.  I hopped on Cisco's website and began poking around their bug search tool for the software version I was using, and found some 48 bugs.  Some closed, some open, but definitely there, and related to my problem.

In my case, this was a bug due to a particular software version:

me360x-universalk9-mz.153-2.S2

The software that was working on our other device was:

me360x-universalk9-mz.152-4.S4.bin

     So as a last ditch effort to fix this problem, I coordinated with the customer to downgrade the switch.  This isn't something I would normally do, but this particular situation was driving me mad.  After getting the approval, I went ahead and issued a

boot system flash:me360x-universalk9-mz.152-4.S4.bin

saved the config, and reloaded.  After pacing back and forth around the desk a few times, I saw the switch finally start pinging again.  I quickly logged in and found that my xconnect problem had been solved.  The customer responded within minutes saying everything was working.

     So, moral of the story? If your l2vpn xconnect isn't working, do a bug search on Cisco's site.  If everything should be working, and you've compared it to a known working copy, look for the only thing that may be different -- software version.