ACM.69 When you go to bed and your code works and wake up to find it doesn’t anymore ~ what it’s like to troubleshoot CloudFormation
This is a continuation of my series on Automating Cybersecurity Metrics.
I’m going to write about things you can do to help deal with errors, in the context of an error that is currently driving me nuts as I start writing this post.
If I have to spend this much time on it, I figure it’s worth a blog post. Maybe it will help you troubleshoot CloudFormation faster. Or maybe I’m just documenting time spent on a single error message while trying to complete code. Either way, this is what to look out for and how I went about troubleshooting this particular error.
As it turns out, some of it was a waste of time because the error message is completely, 100% misleading. I wrote about the importance of error messages here. Good error messages are important for security and also a way to be nice to your customers, software users, and future developers of your code (including yourself when you can’t remember what the error means later).
The first thing is to understand where to find CloudFormation errors. When you deploy a template using CloudFormation and it fails, you can print out a list of events on the command line using the command AWS provides with the failure (which is very nice, by the way!)
That’s great, but sometimes I find it easier to view the events in the AWS Console. Navigate to your list of CloudFormation stacks (in the same region your CLI is configured to deploy them to, of course!). Click on your stack and then click on “Events” at the top of the list.
The information you need is not necessarily the first thing in red. You have to scroll down in the list to find the problem that caused the deployment failure.
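If you would rather stay on the command line, the scrolling can be approximated with a query. This is a minimal sketch, not the command from the failure message: the function name is made up, and it assumes the AWS CLI is installed and configured. It relies on describe-stack-events returning the newest events first, so the list is reversed to put the earliest failure, which is often the real cause, at the top.

```shell
# Hypothetical helper: show only the failed events for a stack, oldest first.
show_failed_events() {
  local stack_name="$1"
  # contains() keeps CREATE_FAILED, UPDATE_FAILED, DELETE_FAILED and similar;
  # reverse() flips the newest-first order that the API returns.
  aws cloudformation describe-stack-events \
    --stack-name "$stack_name" \
    --query "reverse(StackEvents[?contains(ResourceStatus, 'FAILED')].[Timestamp, LogicalResourceId, ResourceStatus, ResourceStatusReason])" \
    --output table
}

# Example call (the stack name is a placeholder):
# show_failed_events my-stack
```

The ResourceStatusReason column is where the actual error text shows up.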
One thing I always do is print out each CLI command my scripts are executing, with the parameters that were passed into the command. That way, when something fails, I have the exact command that the CLI was trying to run at the time of failure. I don’t have to re-run my entire stack. I can re-run just the command that failed to try to troubleshoot it.
That’s what I’m doing in this framework I’m writing for batch jobs, though the code actually has a framework for how an organization might start using AWS and structuring deployments. I don’t repeat the print command over and over. I created common functions for error handling to try to reduce the amount of code I have to maintain and troubleshoot.
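A minimal sketch of what such a common function could look like. This is an illustration, not the code from the repository, and the name run_logged is made up. It prints every argument in single quotes so that spaces inside a value are visible and the line can be pasted back into a terminal (an argument that itself contains a single quote would not paste cleanly).

```shell
# Hypothetical common function: print the exact command, then run it.
run_logged() {
  # Log to stderr so the log line never mixes with captured command output.
  printf 'Running:' >&2
  printf " '%s'" "$@" >&2
  printf '\n' >&2
  # Run the command exactly as received; its exit status is returned.
  "$@"
}

# Example call:
# run_logged aws cloudformation describe-stacks --stack-name my-stack
```

Because the command is passed as separate arguments instead of one long string, what gets printed and what gets executed cannot drift apart.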
Here’s what I get at the end of my script above the CloudFormation failure message using my common deployment functions:
I can copy and paste that code and execute it on the command line separately from all the rest of my complicated stack code to see what error I get.
Well, first of all I get this error:
An error occurred (ValidationError) when calling the CreateChangeSet operation: Stack:arn:aws:cloudformation:us-east-1:xxxxx:stack/Network-SGRules-RemoteAccessPublicVPC-Default-Rules/xxxxx is in ROLLBACK_COMPLETE state and can not be updated.
That essentially tells me that there is a stack in a bad state. I need to delete it before I can proceed. Sometimes this causes a real problem because you can’t delete a stack that has a whole bunch of other dependencies, yet it’s stuck in a bad state. This is a good reason to thoroughly test your deployments before you attempt to roll them out in production. It’s also a good idea to build stacks in small pieces, like I’ve done in my code above, so it’s easier to delete and re-deploy individual components. However, you can still get into this situation and get stuck. At that point you can try various options to force deployments.
Unfortunately you can get mixed results with those options as well. Apparently disable-rollback only exists for the create-stack method and not for deploy:
The problem with create-stack is that you have to test whether a stack exists before you deploy it, because if it does you have to update it instead. I had to write a lot of logic for that in prior code. Deploy makes it much easier to deploy stacks because it handles that logic for you. Unfortunately the rollback option doesn’t exist there, and I’d rather deal with failures manually than test every stack to see if I need to create or update it.
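For comparison, here is a minimal sketch of the kind of create-or-update check that deploy makes unnecessary. The function name and arguments are made up for illustration, and it assumes describe-stacks exits with a non-zero status when the stack does not exist.

```shell
# Hypothetical sketch: choose between create-stack and update-stack by hand.
create_or_update_stack() {
  local stack_name="$1" template="$2"
  if aws cloudformation describe-stacks --stack-name "$stack_name" >/dev/null 2>&1; then
    # The stack exists, so it has to be updated rather than created.
    aws cloudformation update-stack \
      --stack-name "$stack_name" \
      --template-body "file://$template"
  else
    # The stack does not exist yet; create-stack accepts --disable-rollback.
    aws cloudformation create-stack \
      --stack-name "$stack_name" \
      --template-body "file://$template" \
      --disable-rollback
  fi
}
```

Even this leaves out waiting for completion and the case where update-stack fails because there is nothing to update, which is part of why deploy is so much more convenient.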
So I delete the stack and try again.
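The delete step can be scripted too. A minimal sketch, assuming the AWS CLI is configured; the function name is hypothetical. It only deletes a stack whose status is exactly ROLLBACK_COMPLETE, then waits for the delete to finish before returning.

```shell
# Hypothetical helper: remove a stack left behind in ROLLBACK_COMPLETE.
delete_if_rollback_complete() {
  local stack_name="$1" status
  # If describe-stacks fails, assume there is no such stack and do nothing.
  status=$(aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query 'Stacks[0].StackStatus' \
    --output text 2>/dev/null) || return 0
  if [ "$status" = "ROLLBACK_COMPLETE" ]; then
    aws cloudformation delete-stack --stack-name "$stack_name"
    aws cloudformation wait stack-delete-complete --stack-name "$stack_name"
  fi
}
```

Matching the one status exactly is deliberate: a stack in any other state is left alone.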
This time it tells me the following:
That’s odd because my parameter does have a value. I don’t get that error when my full stack executes. If you are familiar with CloudFormation parameter overrides you might notice something odd in the way I’m passing in the parameters. Normally you pass them in this way:
So I test the code above and it works. Why was I including the brackets before? Because at some point CloudFormation produced an error message that told me to formulate my parameters in the above manner. I got that error when I was trying to figure out how to pass parameters with spaces into CloudFormation.
The first method was the only way I could get parameters that had values with spaces to not crash my scripts, but the parameters still didn’t work correctly and didn’t show up in the console with the full value including spaces.
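One way to keep values with spaces intact is to never build the command as a single string. This is a minimal sketch under that assumption, not the framework’s actual deploy function; the function name and the parameter names in the example are invented. Each Key=Value pair travels as its own quoted argument and is handed to --parameter-overrides through "$@".

```shell
# Hypothetical wrapper: stack name, template file, then one or more
# Key=Value pairs, each passed as a single (quoted) argument.
deploy_stack() {
  local stack_name="$1" template="$2"
  shift 2
  # "$@" expands to the remaining arguments unchanged, so a value such as
  # "allow remote access" stays in one piece. At least one pair is required.
  aws cloudformation deploy \
    --stack-name "$stack_name" \
    --template-file "$template" \
    --parameter-overrides "$@"
}

# Example call (all names are placeholders):
# deploy_stack my-stack template.yaml "Description=allow remote access" "Port=22"
```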
OK, so what’s going on here? Why is it working in my script but not on the command line? Other resources are deploying correctly. Maybe that’s not really what my command looks like. Back to my common stack deployment function that prints out this command.
I’m printing out exactly what the script executes:
Does my other code that worked before, and that uses that parameter structure, still work? Let’s delete the last stack before this one that got deployed by our script and see if running that command from the command line works.
Here’s the stack prior to the one I’m currently getting errors on.
A test without deleting it works:
Let’s delete it and see what happens.
Here’s the problem. Route tables have dependencies. They are slow to create, update, and delete. If you try to delete them prematurely they may get into a DELETE_FAILED state.
The other problem is that once your route table gets into a DELETE_FAILED state, you can’t update it:
There’s actually nothing wrong with my route table so I could just leave it like that. It’s just kind of ugly. The problem is: what if I need to update that route table? You might be able to add a new one and delete the existing one, but that’s just messy.
You can try to force deletion of your route table stack using this command:
aws cloudformation delete-stack --stack-name my-stack --retain-resources myresource1 myresource2
In that case, you will need to know all the resources dependent on your route table. Note that --retain-resources only applies to a stack that is already in the DELETE_FAILED state, and the resources you list are left in place rather than deleted.
You can achieve the same result by attempting to delete a stack twice in the AWS console, which will pop up the following dialog box.
In my case, I’m doing something to work around a CloudFormation issue with route tables that I’ll be publishing in a later post. That’s causing my problem here.
I can go ahead and delete my stack and choose to leave the underlying resource in place, and then try to delete it manually. Just remember you have to try to delete twice to get this option. You should clean up any resources you don’t need.
If I run the script again on the command line I get the same error. Clearly CloudFormation commands executed from my script are not parsing the values the same way as when I run the command manually, which is very strange.
I’m running the exact same command in a bash script that I’m running manually from the console, except that I’m using the $( [command] ) syntax to execute the command inside my script, after formulating it as a string. Basically I can emulate the functionality in my script by executing the command like this (from the same directory as my deploy script to get the correct relative path to my templates):
And that works. Don’t ask me why. I don’t know. I don’t care. I do know that I want to get this code out of bash as soon as possible.
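One plausible explanation, offered as a guess since the post does not pin it down: a command stored in a string and run with $( ) is not parsed the way a typed command is. The shell splits the string on whitespace but never re-reads the quotes inside it. The sketch below has nothing to do with AWS; count_args and run_args are made-up helpers that only show how many arguments arrive in each case.

```shell
# Report how many arguments were received.
count_args() { echo "$#"; }
# Run a command that was passed as separate arguments.
run_args() { "$@"; }

# Typed directly: the shell parses the quotes, so there is one argument.
typed=$(count_args Name="my value")

# Built as a string first: the quotes are ordinary characters and the
# string is split on the space, so there are two arguments.
cmd='count_args Name="my value"'
from_string=$($cmd)

# Passed as separate arguments: one argument again.
from_args=$(run_args count_args "Name=my value")

echo "typed=$typed from_string=$from_string from_args=$from_args"
# prints: typed=1 from_string=2 from_args=1
```

If that is what is happening, the command that gets printed and the command that gets executed can look identical on screen and still hand different arguments to the CLI.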
Now back to our failing stack. First I have to redeploy the stacks I deleted.
./deploy.sh
Successfully back to our failure. Ugh.
Now I can test as I did above, with the same result.
The error message is a bit misleading because it says the value of GroupId must be a string. This template works just fine for other stacks, so the GroupId is not the problem. The problem is the parameter I’m passing in. It’s an output from another stack that is used with an ImportValue function to get a security group ID output from another stack. I shouldn’t have any errors in my template because this already worked before. Or was I delusional when I looked at the prior results of my templates that all ran successfully last night?
The template takes an SGExportParam parameter. It uses that parameter to get the output. I don’t see any typos here.
OK, next, let’s see what I’m getting out of my function call based on the command I printed out.
That looks correct. I think that’s the right output name. Let’s make sure it got passed into our CloudFormation stack correctly by looking at the Parameters tab for the failed stack.
I see a value that looks correct. I think. Right?
Go back to the VPC stack where I’m outputting that value and check the outputs.
Am I blind or does that match? I mean, I realize I should probably get glasses, but I’m not seeing a difference.
Debug Output from CloudFormation (and other AWS tools)
What I’m going to do next is add debug output to my stack deployment command.
This is where I stopped to write the prior post with a warning about credentials in debug output and also in the AWS console.
Now let’s go ahead and run the command and see if we can find any useful information in our debug output. After weeding through the logs, it looks like they just contain a bunch of retries to check the status of CloudFormation until the stack finally reports the failure, so the CLI or Boto3 logs don’t really help us.
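Given the warning about credentials, it may be worth keeping debug output off the screen and in a file only you can read. A minimal sketch, assuming the command is an AWS CLI call: the function name and log path are made up, --debug is the CLI’s global debug option, and the debug log is written to stderr.

```shell
# Hypothetical helper: run a command with --debug appended and send the
# debug log (stderr) to a file that is created readable by the owner only.
run_with_debug_log() {
  local log_file="$1"
  shift
  # The subshell keeps the umask change from leaking into the caller.
  # A log file that already exists keeps its existing permissions.
  ( umask 077; "$@" --debug 2> "$log_file" )
}

# Example call (path and stack name are placeholders):
# run_with_debug_log /tmp/cfn-debug.log aws cloudformation describe-stacks --stack-name my-stack
```

Delete the log when you are done with it, for the same reason.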
Hmm. The problem has to do with not getting a Security Group ID from that output. I explained earlier how I created a function to get an output from a stack. Let’s test getting that output independently.
I’m going to navigate to my Functions directory and create a test script.
I navigate to my stack with the output values and copy and paste them into my script. Whenever I’m testing or coding I copy and paste wherever possible to avoid typos.
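For readers without the earlier post at hand, a stand-in for that kind of lookup might look like the sketch below. These are not the functions from the repository; the names are hypothetical. The first reads an output by its key from one stack, the second reads an export by name, which is the value Fn::ImportValue resolves.

```shell
# Hypothetical helper: print the value of one output of a stack.
get_stack_output() {
  local stack_name="$1" output_key="$2"
  aws cloudformation describe-stacks \
    --stack-name "$stack_name" \
    --query "Stacks[0].Outputs[?OutputKey=='$output_key'].OutputValue" \
    --output text
}

# Hypothetical helper: print the value of an export by its export name.
get_export_value() {
  local export_name="$1"
  aws cloudformation list-exports \
    --query "Exports[?Name=='$export_name'].Value" \
    --output text
}
```

If get_export_value prints nothing for the name you are importing, the name does not match any export in that account and region.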
And here is where I see a problem even before I attempt to run my code. I tested this a couple of times. When I copy and paste the output for the Default Security Group ID it appears to have a space in it. What’s causing this? Is this a red herring or not?
First let’s check the VPC template to see if I have an extra space inside some quotes somewhere. I can just look at the template for the stack in the AWS Console:
Maybe I’m just not seeing it, but I don’t see how the above causes those spaces, and I’m doing the same thing here that I’m doing in other templates. Let’s continue our test of the output parameters with the standalone function.
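When a pasted value might be hiding a space, it helps to make the whitespace visible instead of squinting at it. A minimal sketch with made-up function names: one prints the value between brackets and dumps it character by character, the other reports whether it contains any whitespace.

```shell
# Print the value between brackets, then dump every byte with od -c.
show_whitespace() {
  printf '[%s]\n' "$1"
  printf '%s' "$1" | od -c
}

# Succeed (exit 0) if the value contains a space, tab, or newline.
# A non-breaking space copied from a web page may not match here, but it
# stands out in the od -c dump as the two bytes 302 240.
has_whitespace() {
  case "$1" in
    *[[:space:]]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (the value is a placeholder):
# show_whitespace "sg-0123456789abcdef0 "
```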
Let’s think about this for a minute:
What happens with my parameters when I try to pass them into a command line function like this? How are my bash commands going to interpret these parameters?
If there are extra spaces it’s not going to recognize all the values I’m passing into my function, right?
Because it makes each value between spaces a separate parameter… so let’s see what happens if we do this:
unary operator expected
What’s on line 7?
The spaces are causing us grief.
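The failing line isn’t reproduced here, so this is a guess at the general shape of the problem, not the actual line 7. In bash, “unary operator expected” typically comes from a test that uses an unquoted variable which turned out to be empty, for example because a space shifted the values into different positional parameters. The function names and the sg-123 value below are invented.

```shell
# Unquoted: an empty $1 disappears and the test becomes [ = "sg-123" ],
# which bash reports as an error along the lines of
# "[: =: unary operator expected".
compare_unquoted() { [ $1 = "sg-123" ]; }

# Quoted: an empty $1 is still one (empty) argument, so the comparison
# simply evaluates to false without an error.
compare_quoted() { [ "$1" = "sg-123" ]; }

# Example calls:
# compare_unquoted ""    # error from the test command
# compare_quoted ""      # no error, just false
```

A value containing a space breaks the unquoted version in a different way, since the test then receives too many arguments. Quoting every expansion avoids both.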
I tried a number of other variations I won’t bore you with. It’s related to this same post linked above:
Let’s verify that there are really spaces in that output. I don’t see the spaces here when I query the outputs of this stack, and that would break tons of people’s code on AWS, so I find it hard to believe that would ever be the problem in the first place:
So perhaps our query will work after all. Just out of curiosity I went back to the console and tried my copy and paste method on other outputs and I didn’t get the extra spaces. Then I tried the original parameter again. I’m not getting the same result. OK, weird. Anyway, let’s see if we can get our parameter as an output.
Test it. This works:
Alright, back to our stack. Out of curiosity I’m going to run the stack one more time with no changes because this is so odd. Same error.
Now again, for sanity, I’m going to copy and paste and test the value from the failed stack parameters:
The next thing I did was alter my stack to manually hard code the name of the output with no spaces. I click on the template designer.
I hardcoded the name I know works in my function above into the part of the template that uses the ImportValue statement. I tried with and without a space. I uploaded and deployed it:
By the way, if you don’t want people manually editing templates in the AWS console you need to limit that access in your IAM policies.
Same result:
Now again, for sanity, I deployed my working stacks that use those same templates to check again that the problem is not with the template itself, but rather the export parameter not being correctly retrieved from CloudFormation using Fn::ImportValue.
At this point I realize my script that I’m sure was working last night is no longer working. I know I ran all the stacks and got no errors prior to adding the functionality in the next post, which didn’t alter this template. It simply uses it. What in the world is going on?
Alright, what’s the difference between this and a script that’s working? I’m passing in a variable and resolving the output to a security group. Here’s one of the alternate, working scripts.
Compare that to my failing script:
Do you see a difference? I looked at this code about 100 times before I saw it. Maybe you saw it sooner.
IpProtocol is misaligned.
The error message is 100% misleading and a big time-waster because it talks about the export value when that is not at all the problem, so I wasted hours on a stupid space.
This is where I wish the error messages would be a bit smarter at giving us the correct problem, instead of trying to force us to use an IDE in the cloud, if that would even help.
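A cheap check that might catch this kind of slip before deploying is to flag lines with an odd number of leading spaces. This is a rough sketch with a made-up function name, and it rests on a loud assumption: the template is YAML indented in steps of two spaces. A property that is misaligned by an even number of spaces would slip through.

```shell
# Hypothetical check: print "line-number:content" for every line whose
# indentation is an odd number of spaces.
find_odd_indent() {
  # (  )* consumes pairs of spaces; one extra space followed by a
  # non-space character means the indentation is odd.
  grep -nE '^(  )* [^ ]' "$1"
}

# Example call (the file name is a placeholder):
# find_odd_indent template.yaml
```

Running diff between a working template and the failing one is another quick way to make a single stray space stand out.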
Fix the problem:
This is why you pay me to write this blog, haha (it’s free): to find the dumb little errors like this and give you working code so you don’t feel the pain.
It’s also why I publish my bugs to this site: to help anyone who suffers this fate in other areas of programming, if it aligns with whatever I’m working on at the moment.
The oddest thing is that I know I ran the script and checked all the CloudFormation stacks last night and there were no errors. I don’t know. Maybe I was delusional. At any rate, I’m 99% sure this is going to fix the problem, and I’m finishing this post as it runs.
Yes.
All stacks are functional prior to my last addition that uses this template (without modification).
Let’s add in the last stack to use the template and test. Yes. It works.
By the way, I’ve been having this strange issue with CloudFormation parameters that I 100% know is not my fault. It happens when I don’t even change the code and goes away when I don’t change the code. I don’t know the source. It may be something on my local machine or in AWS. But just so you are aware, sometimes you have to pinpoint whether the problem is in your code or somewhere else:
If we could just get rid of these weird errors and get really specific error messages, all the programmers in the universe would be able to work much more efficiently! We’d probably prevent some security bugs too.
Teri Radichel
© 2nd Sight Lab 2022