
Automating IOS Upgrade: Python script throws error ("Socket exception:

Eseharrison88
Level 1

Hi,

 

I am new to Python and have written a script, based on my experience so far, to automate IOS upgrades. I have over 500 devices in my organization whose images need upgrading to the latest IOS. The code runs in two stages: a pre-upgrade stage and an upgrade + post-upgrade stage.

 

The pre-upgrade stage logs in to the devices one after the other to collect data such as hostname, current version, model, and available flash space.

 

The code worked well in a lab environment, but testing it on the live devices throws the error below:

 

"Socket exception: An existing connection was forcibly closed by the remote host
(10054)"

 

Logs on the Cisco router showed this: %SSH-4-SSH2_UNEXPECTED_MSG: Unexpected message type has arrived. Terminating the connection from x.x.x.x.

 

Code: (Truncated to improve readability)

import os
import subprocess, re, netmiko
from netmiko import ConnectHandler
from netmiko.ssh_exception import NetmikoTimeoutException
from netmiko.ssh_exception import SSHException
from netmiko.ssh_exception import AuthenticationException
from netmiko import SCPConn
from datetime import datetime
import time
import csv


#Cisco ISR Data
new_ios_ISR_size = "702197190"

#Cisco C800 Data
new_ios_c800_size = "97199776"

#Cisco C1900 Data
new_ios_c1900_size = "41735808"

###########################################################################

#Creating the CSV files for pre upgrade

#clearing the old data from the pre-upgrade CSV file and writing the headers
f = open("data/pre_upgrade.csv", "w+")
f.write("IP Address, Hostname, Uptime, Current_Version, Current_Image, Serial_Number, Device_Model, Device_Memory, Available_space, space")
f.write("\n")
f.close()


#clearing the old data from the logs file and writing the headers
f = open("data/logs.txt", "w+")
f.close()
now = datetime.now()
logs_time = now.strftime("%H:%M:%S")

#############################################################################################################################

def preupgrade():

    username = 'xxxxxxxxxx'
    password = 'xxxxxxxxx'
    
    with open('data/ip_address.csv', 'r') as file:
        reader = csv.reader(file)
        num = 0
        for row in reader:
            ip = row[0]
            num += 1
            print(ip, "Device " + str(num))
            
            device = {
                'device_type': 'cisco_ios', 
                'host': ip, 
                'username': username, 
                'password': password
                }

            now = datetime.now()
            logs_time = now.strftime("%H:%M:%S")
            print ("" + logs_time + ": " + ip + " Logging in to device")
            f = open("data/logs.txt", "a")
            f.write("" + logs_time + ": " + ip  + " Logging in to device " + "\n" )
            f.close()

            #handling exceptions errors
            try:
                net_connect = ConnectHandler(**device)

            except (NetmikoTimeoutException, AuthenticationException, SSHException, ValueError, TimeoutError, ConnectionError, OSError):
                now = datetime.now()
                logs_time = now.strftime("%H:%M:%S")
                print ("" + logs_time + ": " + ip  + " device login issue ")
                f = open("data/logs.txt", "a")
                f.write("" + logs_time + ": " + ip  + " device login issue " + "\n" + "\n" )
                f.close()
                continue
                
            #list where informations will be stored
            pre_upgrade_devices = []

            now = datetime.now()
            logs_time = now.strftime("%H:%M:%S")
            print ("" + logs_time + ": " + ip + " Log in Successful, Collecting device pre-upgrade report")
            f = open("data/logs.txt", "a")
            f.write("" + logs_time + ": " + ip  + " Log in Successful, Collecting device pre-upgrade report " + "\n" )
            f.close()

            # execute show version on router and save output to output object  
            sh_ver_output = net_connect.send_command('show version')

            #now = datetime.now()
            #logs_time = now.strftime("%H:%M:%S")
            #print ("" + logs_time + ": " + ip + " Checking the version ")

            #finding hostname in output using regular expressions
            regex_hostname = re.compile(r'(\S+)\suptime')
            hostname = regex_hostname.findall(sh_ver_output)
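
The hostname extraction at the end of the truncated script can be tried offline before pointing it at a live device. The sketch below reuses the same regular expression from the script; the sample "show version" text and the `parse_hostname` helper are hypothetical, made up for illustration.

```python
import re

# Hypothetical excerpt of "show version" output, for illustration only.
SAMPLE_SHOW_VERSION = """\
Cisco IOS XE Software, Version 16.09.04
ROM: IOS-XE ROMMON
branch-rtr-01 uptime is 3 weeks, 2 days, 4 hours, 11 minutes
System image file is "bootflash:isr4300-universalk9.16.09.04.SPA.bin"
"""

# Same pattern as in the script: the token immediately before "uptime".
HOSTNAME_RE = re.compile(r"(\S+)\suptime")


def parse_hostname(show_version_output):
    """Return the hostname preceding 'uptime', or None if it is not found."""
    match = HOSTNAME_RE.search(show_version_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(parse_hostname(SAMPLE_SHOW_VERSION))
```

Returning None instead of an empty list makes it easier to log "hostname not found" for a device rather than failing later on an index error.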

 

Notes:

  • It works sometimes and then throws the above error; I then have to wait until almost the next day, when it works again unaided.
  • The devices are a mix of ISR4331/K9 and C891F-K9 routers

@Seb Rupik @Alex Stevenson @dekwan 

1 Accepted Solution

Accepted Solutions

Claudia de Luna
Spotlight

Hi @Eseharrison88 

Not sure how helpful this will be or if it is what you want to hear but in the spirit of setting expectations and some lessons learned, I wanted to share.

 

I've undertaken projects just like yours for some of my clients including a global company with thousands of switches in each region with both Python/Netmiko and Ansible.

 

I've never seen a 100% success rate (in a production environment of any significant size).

 

I have seen ~90% success for companies who are very rigorous about standard configurations and limiting the number of models in their environment and being extremely fastidious about code versions.   

Even with all of that we still ran into issues:

  • AAA not set up to standard
  • source interface configuration issues
  • file transfer issues - my recommendation is that you use HTTP for the file transfer
  • timeouts <-- this was a big one. We were constantly having to adjust timeout values, typically by region but not always. Luckily Netmiko has a lot of options there, with timeouts for logging in as well as for getting responses and ultimately completing the file transfer.

Those are just off the top of my head.
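
As a rough sketch of the per-region timeout tuning mentioned above: `conn_timeout`, `auth_timeout` and `banner_timeout` are connection arguments that Netmiko's `ConnectHandler` accepts, but the region names, the values, and the `build_device` / `connect` helpers are assumptions for illustration, not values from this thread.

```python
# Hypothetical per-region timeout profiles (seconds); tune these from your own logs.
REGION_TIMEOUTS = {
    "default": {"conn_timeout": 10, "auth_timeout": 20, "banner_timeout": 20},
    "apac": {"conn_timeout": 30, "auth_timeout": 60, "banner_timeout": 60},
}


def build_device(ip, username, password, region="default"):
    """Return ConnectHandler keyword arguments with region-specific timeouts."""
    timeouts = REGION_TIMEOUTS.get(region, REGION_TIMEOUTS["default"])
    device = {
        "device_type": "cisco_ios",
        "host": ip,
        "username": username,
        "password": password,
    }
    device.update(timeouts)
    return device


def connect(device):
    """Open the SSH session; netmiko is imported here so build_device works without it."""
    from netmiko import ConnectHandler

    return ConnectHandler(**device)
```

A call such as `connect(build_device(ip, username, password, region="apac"))` would replace the plain `ConnectHandler(**device)` in the script. Slow commands may additionally need a per-command timeout; in Netmiko 4.x `send_command` takes a `read_timeout` argument for that.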


Set up your scripts to log sessions and be ready to log a lot! Logging the sessions I had issues with usually led to the culprit. Log a successful session too, so you have a "known working" log.

 

https://github.com/ktbyers/netmiko/blob/develop/COMMON_ISSUES.md#enable-netmiko-logging-of-all-reads-and-writes-of-the-communications-channel
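
A minimal sketch of the two kinds of logging this refers to, assuming one log file per device is wanted. The standard-library logger named "netmiko" and the `session_log` connection argument are Netmiko features; the helper names, file names and directory layout are hypothetical.

```python
import logging
import os


def enable_netmiko_debug_log(path="netmiko_debug.log"):
    """Send DEBUG output, including Netmiko's channel reads and writes, to a file."""
    logging.basicConfig(filename=path, level=logging.DEBUG)
    return logging.getLogger("netmiko")


def with_session_log(device, log_dir="session_logs"):
    """Return a copy of the device dict that records the session to its own file."""
    os.makedirs(log_dir, exist_ok=True)
    logged = dict(device)  # leave the caller's dict untouched
    logged["session_log"] = os.path.join(log_dir, f"{device['host']}.log")
    return logged
```

Calling `enable_netmiko_debug_log()` once at start-up and passing `with_session_log(device)` to `ConnectHandler` gives both a global debug trace and a per-device transcript, which makes it easier to compare a failing session against a known working one.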

 

Eventually you will get tired of how long this takes and look at multi-threading and doing more than one at a time.

We kept it to 5-10 at a time because we were using an HTTP server on my laptop. If you can get a "real" HTTPS server you should be able to scale better.
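
One way to run a handful of devices at a time is a thread pool from the standard library. This is only a sketch: `run_in_batches` and `fake_worker` are hypothetical names, and the worker stands in for whatever per-device function (login, collect, transfer) the script uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_batches(ips, worker, max_workers=5):
    """Run worker(ip) for each IP with at most max_workers concurrent sessions.

    Returns a dict mapping ip -> ("ok", result) or ("error", exception).
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, ip): ip for ip in ips}
        for future in as_completed(futures):
            ip = futures[future]
            try:
                results[ip] = ("ok", future.result())
            except Exception as exc:  # one failing device must not stop the rest
                results[ip] = ("error", exc)
    return results


def fake_worker(ip):
    """Stand-in for the real per-device function; fails for one address."""
    if ip.endswith(".3"):
        raise ConnectionError("simulated failure")
    return f"collected {ip}"


if __name__ == "__main__":
    outcome = run_in_batches(["192.0.2.1", "192.0.2.2", "192.0.2.3"], fake_worker)
    for ip, (status, detail) in sorted(outcome.items()):
        print(ip, status, detail)
```

Collecting results in the main thread and writing the CSV and log files from there avoids several threads appending to the same file at once.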

 

I know I've mentioned lots of issues, but don't be discouraged. Even at an 80% success rate, that's 400 devices you didn't have to upgrade manually. I'd call that a big WIN and certainly worth your effort! Eventually you will find the root cause on the remaining 100 and be able to either fix it or address it in some way, and these are often things that needed fixing anyway, so your network (and your next upgrade) is better off.

 

Expect to spend more time on the code that checks things than on the code that actually does the file transfer and reload.
We both started in exactly the same way: do I have enough space on the flash to transfer the new file?
If it were easy it would be boring, right?

Good luck & Happy Coding!


5 Replies

@Eseharrison88 if this works and then does not work, chances are it is not your code. See this thread: https://networkengineering.stackexchange.com/questions/45168/cisco-ssh-disconnect

 

Hope this helps.

Please mark this as helpful or solution accepted to help others
Connect with me https://bigevilbeard.github.io

Thanks @bigevilbeard  for your reply.

 

I've checked the thread, and it seems to be more of a bug on the device that could be fixed by upgrading the IOS. However, doing that manually defeats the purpose of the script itself.

Seb Rupik
VIP Alumni

Hi there,

I would agree with @bigevilbeard , if the contents of your script do not change from day to day, but the result of running against a device does, then point the finger of blame at the device. Do the production devices and lab devices run the same software versions?

 

Also, I couldn't help noticing at the top of your script the three variables relating to image file size. If you want to validate an image after it is copied to a device, it would be better to check its MD5 or SHA hash.
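
A hedged sketch of that hash check: compute the MD5 of the local image with `hashlib`, then compare it with the digest the router reports. The helper names are hypothetical, and the parser assumes the device output (for example from `verify /md5`) ends in `= <32 hex digits>`; check this against the actual output on your platforms.

```python
import hashlib
import re

# Assumed shape of the device output: "... = <32 hex digits>".
MD5_RE = re.compile(r"=\s*([0-9a-fA-F]{32})\b")


def local_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a local file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_device_md5(verify_output):
    """Extract the MD5 digest from the device's verify output, or None if absent."""
    match = MD5_RE.search(verify_output)
    return match.group(1).lower() if match else None


def image_matches(local_path, verify_output):
    """True only if the device-reported digest equals the local file's digest."""
    reported = parse_device_md5(verify_output)
    return reported is not None and reported == local_md5(local_path)
```

Treating a missing digest as a mismatch keeps a timed-out or truncated verify command from being counted as a successful transfer.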

 

cheers,

Seb.

Hi Seb,

 

The image-size variables are for the pre-upgrade stage, just to verify the devices have enough available space on flash. I'll be sure to include the MD5 checksum check in the post-upgrade script run. Thanks.

 

Both production and test devices are running the same image, and both return the same error at intervals.
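
The pre-upgrade flash check described above could look roughly like this. It is a sketch that assumes the `dir` output ends with a summary line containing "(N bytes free)"; the sample line and helper names are made up. The image size is compared as an integer, since the script stores the sizes as strings and comparing strings would not order them numerically.

```python
import re

# Assumed summary line format at the end of "dir flash:" output.
FREE_RE = re.compile(r"\((\d+) bytes free\)")

# Hypothetical summary line, for illustration only.
SAMPLE_DIR_OUTPUT = "3249049600 bytes total (1583165440 bytes free)"


def free_bytes(dir_output):
    """Return the free space in bytes reported by 'dir', or None if not found."""
    match = FREE_RE.search(dir_output)
    return int(match.group(1)) if match else None


def has_room(dir_output, image_size, margin=0):
    """True if the reported free space covers the image size plus an optional margin."""
    free = free_bytes(dir_output)
    return free is not None and free >= int(image_size) + margin


if __name__ == "__main__":
    # "702197190" is the ISR image size string from the script above.
    print(has_room(SAMPLE_DIR_OUTPUT, "702197190"))
```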
