Monday, January 2, 2017

Network Engineer pro tips

I found this list pretty decent, and it's definitely full of things we should all be aware of as our networking careers grow.

1) ALWAYS TRIPLE CHECK EVERYTHING. Make SURE things work by verifying.
2) Never assume anything
3) Never be confident in yourself in anything. Question everything you do even if you know it
4) Listen to other people and ask for them to sanity check your work
5) Whatever you do, keep it stupid simple. You're not the only one using the network
6) Don't be afraid to re-evaluate everything you think you know every single day
7) You know less than you think you know
8) Don't trust documentation unless it's simple and straightforward
9) Never stop learning more
10) Learn operating systems, and linux usage/administration specifically
11) Be nice to people, even if you want to beat the ever living daylights out of them
12) Learn to use Excel, Notepad, and scripting. It will save your work life. Pick a language and stick to it
13) Bleeding edge technologies might sound interesting and cool, but they rarely pay the bills. Learn consistent, simple, straightforward engineering
14) Use the right tool for the right job. Not everything is a hammer, and not everything is a nail
15) All vendors are shit. Some are less shit than others. No vendor does everything well
16) Be flexible in the technologies you know and how to apply them irrespective of vendor
17) Learn how TCP, UDP, IP really work. I mean, REALLY work
18) Learn how to use Wireshark. Take a class if you need to
19) The network you work on is not yours
20) Your work is NOT necessarily representative of your skills. A shitty network doesn't mean you're a bad network engineer. A good network doesn't mean you're a good network engineer
21) Be demanding, but be fair
22) Admit if you messed up
23) Slow down

As taken from a comment in:
https://www.reddit.com/r/networking/comments/5ljq7j/what_are_your_networking_pro_tips/

Wednesday, November 2, 2016

QoS troubleshooting commands for 4500E/3650 networks

4500E tips from StackExchange

show platform hardware interface ten1/1/31 statistics
Shows input bytes by CoS.

show platform hardware qos interface foo X/Y
Shows queue lengths and flow counts.

show interface foo X/Y counter detail
Shows interface egress packets by queue, queue drops, and DBL drops.

show policy-map interface foo X/Y
Mostly useful if you are doing custom policy maps.

show interface foo X/Y capabilities
Shows the number and type of queues available on a particular interface.

show qos maps
Shows the DSCP-to-CoS mappings.
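As a quick sanity check on those mappings: on most Catalyst platforms the default DSCP-to-CoS map is simply the top three bits of the 6-bit DSCP value (CoS = DSCP >> 3). A minimal Python sketch of that relationship (the PHB names below are standard DiffServ values, not output from any particular switch):

```python
# Default DSCP-to-CoS mapping on most Catalyst platforms:
# CoS is the top three bits of the 6-bit DSCP value.
def dscp_to_cos(dscp: int) -> int:
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0-63")
    return dscp >> 3

# Common per-hop behaviors and their expected default CoS values
for name, dscp in [("BE", 0), ("AF21", 18), ("AF31", 26), ("EF", 46), ("CS6", 48)]:
    print(f"{name:5} DSCP {dscp:2} -> CoS {dscp_to_cos(dscp)}")
```

Handy when the "show qos maps" output looks suspicious: anything that doesn't follow this pattern was probably remapped by someone on purpose.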

3650 tips

TBD

Monday, October 17, 2016

Handy Palo Alto Networks SNMP traps

For those of you running Palo Alto Networks devices in your environment, here are a few snmp traps I've found useful.

panROUTINGRoutedBGPPeerLeftEstablishedTrap - sent when a BGP peer drops and leaves the Established state

panROUTINGRoutedBGPPeerEnterEstablishedTrap - sent when a BGP peer re-enters the Established state

Wednesday, October 12, 2016

Handy Regex search strings

This post will be updated occasionally to include strings I've found to be useful in day to day work.  A great way to test these is to simply go to regexr.com, paste in the body of text/strings you want to match against, and verify with your own regex strings whether they match or not.


  • Negative lookahead for syslog string alerting.  Used primarily on the SolarWinds syslog viewer, this should be placed in the message type field to exclude any syslog message containing the string "PLATFORM".  All other messages will match as normal.

    ^(?!.*PLATFORM).*$

  • IPv4 address matching

    [0-9]+(?:\.[0-9]+){3}
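Both strings can also be verified offline with Python's re module; here is a quick sketch (the sample syslog messages are made up for illustration):

```python
import re

# Negative lookahead: match only messages that do NOT contain "PLATFORM"
exclude_platform = re.compile(r"^(?!.*PLATFORM).*$")

# Loose IPv4 matcher: four dot-separated runs of digits
ipv4 = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

messages = [
    "%PLATFORM-4-ELEMENT_WARNING: Supervisor 1 temp high",   # should be filtered out
    "%LINK-3-UPDOWN: Interface Gi1/0/1, changed state to down",  # should match
]

for msg in messages:
    print(bool(exclude_platform.match(msg)), msg)

print(ipv4.findall("peer 10.0.110.1 via 10.0.1.185"))
```

Note the IPv4 pattern is deliberately loose: it will also happily match something like 999.300.1.2, which is usually fine for grepping logs but not for strict validation.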

Monday, July 11, 2016

EEM redirecting logs to an ftp server

Today I was asked to help with a request from our voice engineer.  On occasion she has to run debugs for specific call functions on our ISR 4K routers, which show up in the buffered log.  We tried sending those log entries to our syslog server, but since the transport protocol was udp/514 (the default), some of the messages were arriving out of order.  We tried a second syslog server and got the same result.

Another option was to use tcp instead, so we attempted to get that working.  Unfortunately the syslog server was not listening on that port, so we had to improvise.  We came up with the idea of exporting the full log file to an ftp server every so often, and using EEM to accomplish this.

Our initial script runs as a privilege 15 user once per day, and uses a while loop to iterate every 60 minutes (3600 seconds).  It simply dumps the entire log buffer into a text file whose name is a variable that increments once per hour.

event manager session cli username "neteng" privilege 15
event manager applet syslog_redirect
 event timer cron cron-entry "0 0 * * *" maxrun 86400
 action 0.00 cli command "term len 0"
 action 0.01 set i "0"
 action 0.02 while $i le 23
 action 0.03  cli command "show log | redirect ftp://10.10.10.10/logs/$i.txt"
 action 0.04  increment i 1
 action 0.05  wait 3600
 action 0.06 end

This worked pretty well, and we saw the script running once per hour as expected.  Unfortunately the log buffer is set to 70000000 bytes (due to the large volume of debug entries written for each call), so writing the files takes about 4-5 minutes.  The script actually generates the log files at 12:04am, 1:08am, and so on.  At first I was apprehensive about this and wanted the file to correspond to the variable integer, but then realized we had been looking at the timestamp of the file rather than the filename itself.

Overall, a very basic script that does what we wanted.  However, it sometimes gets stuck, and we have to ask the question: why the hell are we letting a process run for 24 hours? That eats up CPU and memory cycles.  Here is a better version:

event manager applet syslog_redirect
 event timer cron cron-entry "0 * * * *"
 action 0.0 cli command "show clock"
 action 0.1 string range "$_cli_result" 2 4
 action 0.2 set hour "$_string_result"
 action 0.3 string range "$_cli_result" 6 7
 action 0.4 set mins "$_string_result"
 action 0.5 string range "$_cli_result" 20 22
 action 0.6 set day "$_string_result"
 action 0.7 cli command "show log | redirect ftp://10.13.5.36/vgw/test/$day$hour.$mins.txt"
 action 0.8 cli command "clear log" pattern "confirm"
 action 0.9 cli command "yes"

It will even clear the log after running, making your next log dumps smaller.  Depending on your "show clock" output, you may need to adjust the numbers in the string range actions, as they are literal character positions within the output itself.
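To see how those positions line up, EEM's "string range" follows Tcl semantics, where both indices are inclusive. A small Python sketch of the same extraction (the sample clock string and offsets below are purely illustrative; in the applet, $_cli_result also contains the echoed command and prompt, which shifts the offsets):

```python
# EEM "string range" uses Tcl semantics: both indices are inclusive.
# Python's slice end is exclusive, so "string range $s A B" maps to s[A:B+1].
def string_range(s: str, first: int, last: int) -> str:
    return s[first:last + 1]

# Illustrative "show clock" format only; your router's output will differ,
# and $_cli_result includes the command echo, so test your own offsets.
clock = "12:04:08.123 UTC Mon Jul 11 2016"

print(string_range(clock, 0, 1))   # hour field: "12"
print(string_range(clock, 3, 4))   # minute field: "04"
```

The quickest way to find your real offsets is to run the applet once with just the "show clock" action and count characters in the captured $_cli_result.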

Friday, July 24, 2015

L2VPN xconnect troubleshooting on an ME3600X

     Last night I was asked by a customer to help diagnose their l2vpn xconnect.  I've made several of these in the past and usually they just work -- with the exception of an MTU mismatch that is easily fixed.  Well, this one was troublesome.  I swear I checked more commands than I needed to, even the "event-trace" debugging commands:

sh mpls l2transport vc 11000 event-trace
sh l2vpn internal event-trace error
sh l2vpn internal event-trace event
sh l2vpn internal event-trace major

     Everything looked like our other xconnects -- a switchport mode trunk with no vlans allowed across, and a service instance that actually has the xconnect information.  It may not be your typical routed port configuration, but it works for our purposes (and helps if your customer is sending tagged frames).  After much debugging, Googling, and banging my head against the desk through numerous permutations of the configuration, I ran across this:

Router#sh mpls l2transport vc 11000 event-trace
Local interface: Gi0/3 up, line protocol up, Ethernet:1 down
  Destination address: 10.0.110.1, VC ID: 11000, VC status: down
    Last error: Local peer access circuit is down
    Output interface: Te0/2, imposed label stack {16268 158}
    Preferred path: not configured
    Default path: active
    Next hop: 10.0.1.185
  Create time: 00:00:02, last status change time: 00:00:02
    Last label FSM state change time: 00:00:02
  Signaling protocol: LDP, peer 10.0.110.1:0 up
    Targeted Hello: 10.0.10.16(LDP Id) -> 10.0.110.1, LDP is UP
    Graceful restart: configured and enabled
    Non stop routing: not configured and not enabled
    Status TLV support (local/remote)   : enabled/supported
      LDP route watch                   : enabled
      Label/status state machine        : established, LrdRru
      Last local dataplane   status rcvd: No fault
      Last BFD dataplane     status rcvd: Not sent
      Last BFD peer monitor  status rcvd: No fault
      Last local AC  circuit status rcvd: DOWN AC(rx/tx faults)
      Last local AC  circuit status sent: No fault
      Last local PW i/f circ status rcvd: No fault
      Last local LDP TLV     status sent: DOWN AC(rx/tx faults)
      Last remote LDP TLV    status rcvd: No fault
      Last remote LDP ADJ    status rcvd: No fault
    MPLS VC labels: local 129, remote 158
    Group ID: local 0, remote 0
    MTU: local 1500, remote 1500
    Remote interface description: to CUSTOMER-Z-END
  Sequencing: receive disabled, send disabled
  Control Word: On
  Dataplane:
    SSM segment/switch IDs: 6595/6594 (used), PWID: 2
  VC statistics:
    transit packet totals: receive 0, send 0
    transit byte totals:   receive 0, send 0
    transit packet drops:  receive 0, seq error 0, send 0
AToM VC event trace:
2015 Jul 23 18:37:38.176000: 1861168: AToM[10.0.110.1, 11000]: .... S:Act send notify(DOWN), remote up timer
2015 Jul 23 18:37:38.176000: 1861169: AToM[10.0.110.1, 11000]: ..... Send notify(DOWN)
2015 Jul 23 18:37:38.176000: 1861170: AToM[10.0.110.1, 11000]: .....  Local AC  : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861171: AToM[10.0.110.1, 11000]: .....  Overall   : DOWN AC(rx/tx faults)
2015 Jul 23 18:37:38.176000: 1861172: AToM[10.0.110.1, 11000]: ..... Send LDP for status change from UP
2015 Jul 23 18:37:38.176000: 1861173: AToM[10.0.110.1, 11000]: ..... Start remote up timer
2015 Jul 23 18:37:38.176000: 1861174: AToM[10.0.110.1, 11000]: ..... NMS: VC oper state:  DOWN
2015 Jul 23 18:37:38.176000: 1861175: AToM[10.0.110.1, 11000]: ..... SYSLOG: VC is DOWN, Loc AC Err
2015 Jul 23 18:37:38.176000: 1861176: AToM[10.0.110.1, 11000]: ..... PW MIB: VC state is: DOWN
2015 Jul 23 18:37:38.176000: 1861177: AToM[10.0.110.1, 11000]: ... Local ready
2015 Jul 23 18:37:38.176000: 1861178: AToM[10.0.110.1, 11000]: .... Local service is ready; send a label
2015 Jul 23 18:37:38.176000: 1861179: AToM[10.0.110.1, 11000]: .... Alloc local binding
2015 Jul 23 18:37:38.176000: 1861180: AToM[10.0.110.1, 11000]: ..... No need to update the local binding
2015 Jul 23 18:37:38.176000: 1861181: AToM[10.0.110.1, 11000]: .... Generate local event
2015 Jul 23 18:37:38.176000: 1861182: AToM[10.0.110.1, 11000]: .... Ready, label 129
2015 Jul 23 18:37:38.176000: 1861183: AToM[10.0.110.1, 11000]: .... Evt local ready, in established
2015 Jul 23 18:37:38.176000: 1861184: AToM[10.0.110.1, 11000]: ..... Local ready and established
2015 Jul 23 18:37:38.176000: 1861185: AToM[10.0.110.1, 11000]: .. Check if can activate dataplane
2015 Jul 23 18:37:38.176000: 1861186: AToM[10.0.110.1, 11000]: ...  Keep the dataplane UP for AC DN
2015 Jul 23 18:37:38.176000: 1861187: AToM[10.0.110.1, 11000]: ... Dataplane already active

     So what exactly does "Last error: Local peer access circuit is down" mean? I couldn't for the life of me find this on the web anywhere.  Other people have had similar problems, but no one had reported a solution.  Some said to try clearing the mpls label for the remote host, which didn't work.  Oh well, you can only trust the internet so much!

     As it got later in the evening I started comparing this switch's config to one that worked.  Everything was the same, minus the VC and remote host addresses.  So I started looking deeper -- maybe it was an IOS version? For those of you who have dealt with Cisco products, you know that occasionally bugs are introduced or re-introduced.  It's just the nature of the beast.  I hopped on Cisco's website and began poking around their bug search tool for the software version I was using, and found some 48 bugs.  Some closed, some open, but definitely there, and related to my problem.

In my case, this was a bug due to a particular software version:

me360x-universalk9-mz.153-2.S2

The software that was working on our other device was:

me360x-universalk9-mz.152-4.S4.bin

     So as a last ditch effort to fix this problem, I coordinated with the customer to downgrade the switch.  This isn't something I would normally do, but this particular situation was driving me mad.  After getting the approval, I went ahead and issued a

boot system flash:me360x-universalk9-mz.152-4.S4.bin

saved the config, and reloaded.  After pacing back and forth around the desk a few times, I saw the switch finally start pinging again.  I quickly logged in and found that my xconnect problem had been solved.  The customer responded within minutes saying everything was working.

     So, moral of the story? If your l2vpn xconnect isn't working, do a bug search on Cisco's site.  If everything should be working, and you've compared it to a known working copy, look for the only thing that may be different -- software version.