It’s all too common to find that the most important or mission-critical business processes are plagued with persistent issues. Whether it’s your Customer Relationship Management (CRM) system, your payroll system, or your brand’s website platform, these systems run the processes that hold your business together.
When it comes to your website, these processes can turn into problems that get worse as your brand scales. They can be plagued by intermittent, recurring failures that can easily spin out of control. The lines between infrastructure and code get blurry, and fixing it can feel like playing whack-a-mole.
In this blog, I’ll examine one common issue that every developer encounters – unreliable imports. I’ll explain how to stabilize these imports and walk through some strategies that you or your team can use to troubleshoot them.
The Scenario – Nightly Catalog Import Problems
In this case, there is a large catalog import that runs nightly. It updates all of the products, pricing, and inventory for a website, and it must complete or the products on the site will be out-of-date. The job takes an XML file drop directly from the ERP software, which can easily weigh in at several gigabytes. It succeeds frequently enough to keep the site hobbling along, but it has an unacceptably high failure rate. So, what can you do when the import stops working?
Let’s Investigate Why Imports Aren’t Working
The first step to stabilizing this import is to investigate the problem(s) with the information you have. Review past failures, patterns of failure, common issues, and collect as much information as you can about the issue. This will help you quickly eliminate possible issues and let you identify larger patterns or problematic areas.
In this example, this job had a number of common issues, such as file upload failures and invalid XML, as well as some unexplained issues that required deeper investigation.
Perform a Code Audit on Your Imports
Now that you’ve identified the failures you uncovered while investigating the problem(s), you’ll need to dig deeper with a code audit of the job. Is the import failing when downloading the file? While processing it? Or while updating the site’s catalog?
Next, use what you know about these common issues to identify points of failure and look for opportunities to pick some of them off. Look for low-hanging fruit and refactor to make some improvements. Is there a more efficient way to download or read the file? Is it streaming or filling up the memory? Don’t forget to look at the git history to see what has already been tried before.
Refactor to Make Improvements in Your Imports
At this point, we’ve identified a few big issues. The file was regularly failing to upload or download over SFTP, and we also discovered that the job was buffering the entire file in memory, which we suspected was causing additional problems.
We were able to eliminate the SFTP issues by swapping it out for S3 on both ends, meaning the file upload and download would transfer over S3 instead. We also made some changes to stream the file to disk first and read from there to reduce our memory footprint.
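The streaming change can be sketched in a few lines. This is a minimal, hedged example (the function and chunk size are illustrative, not the actual job’s code): instead of calling `read()` on the whole multi-gigabyte file, copy it to disk in fixed-size chunks so memory usage stays flat.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MiB per chunk keeps memory usage bounded


def stream_to_disk(source, dest_path, chunk_size=CHUNK_SIZE):
    """Copy a file-like source to disk in fixed-size chunks.

    Memory stays bounded by chunk_size no matter how large the file
    is, unlike source.read(), which loads the entire file at once.
    Returns the number of bytes written.
    """
    bytes_written = 0
    with open(dest_path, "wb") as dest:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # empty bytes signals end of stream
                break
            dest.write(chunk)
            bytes_written += len(chunk)
    return bytes_written
```

If you are using S3 via boto3, its higher-level download helpers already stream to disk for you; the point is simply to avoid any step that materializes the whole feed in memory.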
Using Enhanced Logging to Examine the Import Problem(s)
If the import problem is not clear-cut, you may need to add enhanced logging to the process to tell where it’s breaking down. Even if you’re not getting an error message, you can still get a general idea of what happened if you know the last milestone the system completed successfully.
In this case, there were still some unknowns about some of the job failures, so we added logging around the major steps of the process – for example, when the file finished downloading, after every X% of the file had been processed, and at several other key stops along the import’s journey.
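Milestone logging like this is easy to add. Here’s a hedged sketch (the logger name and the per-item work are placeholders) that logs a line every N percent of the way through, so a crash mid-run still tells you roughly where processing stopped:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("catalog_import")


def process_items(items, progress_step=10):
    """Process catalog items, logging a milestone every `progress_step` percent.

    If the job dies partway through, the last logged percentage tells you
    approximately where in the file it was. Returns the number processed.
    """
    total = len(items)
    next_mark = progress_step
    processed = 0
    for item in items:
        # ... real per-item work (parse, validate, save) would go here ...
        processed += 1
        pct = processed * 100 // total
        if pct >= next_mark:
            log.info("processed %d%% (%d of %d items)", pct, processed, total)
            next_mark = pct + progress_step
    log.info("import finished: %d items", processed)
    return processed
```

The same idea applies at the coarser level too: one log line after the download completes, one after parsing, one after the catalog update begins.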
Deploy and Keep Monitoring the Import
You’ll probably need to take more than one bite of the apple, so deploy the first version of your updates to production and try out the improvements you made when refactoring. If you did it right, then the next time it fails, you’ll be able to zero in on the part that’s failing and will know even more about the problem.
Once we took stock of the job after the deployment, we found that the changes eliminated a major source of failures. This was great to see but we weren’t done yet. There were still some unexplained issues, and now we knew that they were happening when parsing the very large XML file.
You May Need to Refactor or Rearchitect as the Job Changes
Armed with your new knowledge, you may need to make some bigger changes and revise the job’s architecture. Patterns that worked fine at a smaller scale can become bottlenecks as the data grows.
This import was downloading the XML feed multiple times and processing the file each time. By changing the job to only download the file once, and stage the data from the file in a central table, we were able to eliminate the remaining issues and also chop off hours of run time.
Be sure to continue to revise and update the job to get it working at an acceptable level, but don’t forget to monitor the job and watch for other problems that may arise.
Take It to the Next Level
Think about ways to make troubleshooting the job more self-service. Are there common pitfalls that can’t be engineered around, like invalid XML? If so, then communicate processes for dealing with them, and find ways to provide early feedback on known issues. Send a report of validation issues to make sure that they’re getting resolved right away.
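An early-feedback check can be as simple as a pre-flight validation pass that collects problems into a report before the full import runs. This is a minimal sketch with assumed element names (`product`, `sku`, `price`) – a real feed would have its own schema, and you might use proper XSD validation instead:

```python
import xml.etree.ElementTree as ET


def validate_feed(xml_path):
    """Return a list of validation problems found in the feed.

    An empty list means the feed looks importable; otherwise the list
    can be sent as an early-warning report before the full import runs.
    Element names here are illustrative.
    """
    problems = []
    try:
        tree = ET.parse(xml_path)
    except ET.ParseError as exc:
        # A malformed file fails everything, so report it and stop early.
        return [f"feed is not well-formed XML: {exc}"]
    for i, product in enumerate(tree.getroot().iter("product"), start=1):
        if not product.findtext("sku"):
            problems.append(f"product #{i} is missing a <sku>")
        if product.findtext("price") is None:
            problems.append(f"product #{i} is missing a <price>")
    return problems
```

Emailing or posting this list to whoever owns the ERP export turns an overnight mystery failure into an actionable daytime fix.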
There’s a fuzzy line between infrastructure and code, and this type of problem often walks that tightrope. There may be an easy fix to the problem by just giving it more memory or a variety of other solutions. However, keep in mind that a fix like that will often only delay the inevitable, and you’ll need to tighten up the code sometime down the line.
Need a professional to take a look at these imports and infrastructure? We’re happy to help! CQL has talented and certified developers on staff to address your website’s backend and frontend needs. Contact us below!