September 6, 2024

Post-Incident Review on Connect SL’s Recent Performance Regressions

Ryan Reading

Executive Summary

On May 29, 2024, we released the Skydio Spring Release 2024 (34.1.106) X10 software update. While this update added many valuable capabilities, it also inadvertently introduced performance regressions that degraded wireless range and video quality and caused disconnects at short ranges.

On Aug 14, we released a point release (34.1.140) to address these regressions and bring X10 back to expected performance levels. Going forward, we are implementing corrective measures to prevent these kinds of regressions in future updates, including more comprehensive and automated test suites and increased analytics to better measure and monitor wireless performance experienced by customers.

What happened?

The Spring Release (34.1.106) contained many important product features and improvements - NightSense Beta, Spotlight attachment support, Free Look, improved ground detection on landing, and more. It also included changes to improve wireless upload speeds for Media Sync while the drone was on the ground. Unfortunately, these wireless changes also introduced bugs that affected wireless performance when flying over Skydio Connect SL. These bugs were not detected by our test suites, and as a result many customer operations were impacted after the upgrade.

During our investigation, we identified the following issues:

  • Degraded video quality and range on certain channels
    In the May release, we made a change to enable 80 MHz bandwidth when not flying in order to support faster upload speeds for Media Sync and flight log uploads. An unintended side effect of this change was poor radio performance on channels 120 and 153. As a result, whenever the Auto Channel algorithm selected channel 120 or 153 for a Connect SL flight, or a customer selected one of them manually, performance was dramatically reduced, both in video quality and in flight range.

    By running detailed side-by-side tests in the wireless test chamber, we were able to clearly show that video quality decreased faster as signal strength dropped on the May software.
Automated signal strength testing on ch153 shows faster drop off in video quality in the May release vs Aug release. Simply, video quality degraded faster than it should have as signal decreased.

Real-world analytics of performance on ch153 also confirm the regression and impact of the fix. In the August release, analytics confirm that customers experience 3x higher video quality and longer ranges than on the May release.

  • Degraded video and disconnects in RF congested environments when flying at short range
    Skydio uses Forward Error Correction (FEC) in order to be more resilient to dropped packets when flying in congested environments - the net effect is better video performance even when experiencing RF congestion.

    The May release contained a bug in video bitrate estimation when FEC was enabled, which caused more data to be sent through the wireless link than the link could handle. This artificially congested the wireless link, which ultimately caused the user to experience degraded video and disconnects - even at short ranges.

    On the August release (34.1.140), customers are now experiencing dramatically improved performance - the frequency of degraded video at short range (<100m) has decreased by 82% when compared to the May release (34.1.106).
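
To make the FEC issue above concrete, here is a minimal Python sketch of how parity overhead affects a video bitrate budget. It is an illustration only, not Skydio's implementation: the function names, the `fec_overhead` and `headroom` parameters, and the numbers are all hypothetical, and the sketch assumes one plausible failure mode (the estimator ignoring the extra parity bytes), which the description above does not confirm in detail.

```python
def video_bitrate_budget(link_capacity_kbps: float,
                         fec_overhead: float,
                         headroom: float = 0.8) -> float:
    """Largest video payload bitrate that still fits the link once FEC is added.

    fec_overhead is parity bytes per payload byte (hypothetical parameter):
    0.25 means one parity packet for every four data packets.
    headroom is the fraction of link capacity we allow ourselves to use.
    """
    if link_capacity_kbps < 0 or fec_overhead < 0 or not 0 < headroom <= 1:
        raise ValueError("invalid capacity, overhead, or headroom")
    usable_kbps = link_capacity_kbps * headroom
    # Every payload bit costs (1 + fec_overhead) bits on air.
    return usable_kbps / (1.0 + fec_overhead)


def on_air_bitrate(video_kbps: float, fec_overhead: float) -> float:
    """Total bitrate actually sent over the link: payload plus parity."""
    return video_kbps * (1.0 + fec_overhead)


if __name__ == "__main__":
    capacity = 10_000.0  # kbps, made-up figure for illustration
    overhead = 0.25      # made-up FEC ratio

    # An estimator that forgets FEC fills the whole link with video...
    naive_video = capacity
    print("naive on-air kbps:", on_air_bitrate(naive_video, overhead))

    # ...while a FEC-aware estimator leaves room for the parity data.
    safe_video = video_bitrate_budget(capacity, overhead, headroom=1.0)
    print("FEC-aware on-air kbps:", on_air_bitrate(safe_video, overhead))
```

In this toy setup the naive estimate puts more bits on air than the link capacity, which is the kind of self-inflicted congestion described above, while the FEC-aware budget stays within it.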

How did we respond?

As customers began encountering issues, they reported them through our Customer Support ticketing system. When it became clear that there were systematic performance problems, we set up an engineering strike team and began meeting every day to understand, diagnose, and solve the problem.

  • Understanding the issues
    On July 14, Skydio Customer Support escalated unresolved concerns about wireless performance to engineering, based on growing customer reports. Skydio engineering responded by setting up a “strike team”: a cross-functional team that meets daily to understand the issues and to update executives so they can assign resources and drive action.

    The strike team's first goal was to rapidly and clearly understand which configurations and usage patterns were affected. We analyzed analytics and flight logs from affected customers and grouped them by symptom to look for commonalities. We also kicked off an initiative to thoroughly document the recommended configurations for achieving the best wireless performance in different situations.
  • Working with customers to use recommended wireless configurations
    On July 31, we started rolling out the Skydio X10 Wireless Recommendations guide to customers, both to provide better guidance for achieving the best results and to diagnose customers' connectivity issues more systematically.
  • Diagnosing the root causes
    On Aug 1, we discovered evidence of the regression that was ultimately associated with ch153 being degraded.

    Through analyzing flight logs, we identified flights with elevated low-level error rates. Our engineering team then started working on systematic tests in the wireless chamber to reproduce these errors and discovered degraded performance on ch153. By conducting thorough testing across all channels in the chamber, we were able to positively identify and root cause several issues.
  • Notifying customers of issues
    On Aug 7, we issued an NTO about channel 153 degradation, advising customers of the issues that engineering had confirmed the previous week and of configurations that would mitigate them while we worked on a fix.
  • Solving the problem
    On Aug 14, we released a point release (34.1.140) to address the issues documented above, along with an updated NTO recommending an upgrade to the latest software.

What are we doing to prevent situations like this in the future?

After releasing the point release, we conducted a postmortem to systematically review the issues and identify process and technical improvements that will both prevent issues like this and improve our response to them in the future.

Starting immediately, we are implementing more comprehensive testing and performance monitoring that will help us both prevent and more rapidly detect issues such as these. My team is also continuing investigations on wireless performance to proactively identify any additional performance issues that could be negatively impacting customers as well as working aggressively to deliver new functionality recently laid out in the Connect Fusion product announcements.

  • More comprehensive testing
    We are implementing a suite of comprehensive, controlled, automated tests in our wireless test chambers. We will run performance tests across all available channels and under various conditions in an automated way. These tests will systematically measure wireless performance across every combination of channel and signal strength to ensure that a software update has not introduced any regressions.
X10 mounted on a robotic arm for automated testing in Skydio’s wireless test chamber (Skydio Wireless Testing Facility, San Mateo, CA, 2024).
  • Performance monitoring
    The team has added a comprehensive set of analytics for monitoring real-world wireless performance outside of the chamber. We will use these metrics to monitor the performance of the product during pre-release flight testing, dogfooding, and beta testing to ensure the product is meeting expected Key Performance Indicators (KPIs) in terms of video quality and frequency of dropped video and disconnects. We will also leverage these analytics to proactively monitor product performance in the field to more quickly identify and react to issues affecting customers - without requiring time-consuming flight log uploads.
  • Added Connectivity to Safety Council
    Skydio Customer Support, Product and Engineering maintain a regular Safety Council to ensure our products meet the rigorous environmental and operational requirements of our customers. We track and investigate any significant issue that could affect safety or reliability of our drones and customer operations. This month, we will add metrics and tracking to include wireless performance indicators that result in mission failure for our customers. We understand that when our customers are responding to critical scenarios, reliable performance of the entire system is paramount.
  • Continued investigations and new functionality
    We have already identified more improvements to make and are working hard to roll those into software updates for customers. Our next software release will address known issues improving the quality of thermal video feeds and media sync upload speeds. It also adds the ability to toggle between Connect SL and Connect 5G during flight. In software releases after that, we will roll out a more sophisticated dynamic channel-switching system that will choose the best channel automatically and change as necessary in flight, as well as Connect Fusion which will seamlessly leverage both point-to-point radio and cellular for optimal video quality and connectivity across all available communication channels.
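
The channel-by-signal-strength test sweep described under "More comprehensive testing" can be sketched in a few lines of Python. This is a hedged illustration rather than Skydio's test harness: the function names, the attenuation steps, the quality scores, and the 5% tolerance are all hypothetical, and the scores below are made-up values, not measurements.

```python
from itertools import product


def build_test_matrix(channels, attenuations_db):
    """Every (channel, attenuation) combination to run in the chamber."""
    return list(product(channels, attenuations_db))


def find_regressions(baseline, candidate, tolerance=0.05):
    """Compare per-case quality scores (higher is better).

    baseline and candidate map (channel, attenuation_db) -> score.
    A case is flagged when the candidate is more than `tolerance`
    (relative) below the baseline, or when it was never measured.
    """
    regressions = []
    for case, base_score in sorted(baseline.items()):
        new_score = candidate.get(case)
        if new_score is None or new_score < base_score * (1.0 - tolerance):
            regressions.append((case, base_score, new_score))
    return regressions


if __name__ == "__main__":
    # Channels 120 and 153 come from the text; 36 is just another example.
    matrix = build_test_matrix([36, 120, 153], [0, 20, 40])

    # Made-up scores: the baseline build scores 1.0 everywhere, while the
    # candidate build degrades on channel 153 as attenuation increases.
    baseline = {case: 1.0 for case in matrix}
    candidate = dict(baseline)
    candidate[(153, 20)] = 0.7
    candidate[(153, 40)] = 0.3

    for case, old, new in find_regressions(baseline, candidate):
        print(f"channel {case[0]} at {case[1]} dB: {old} -> {new}")
```

Sweeping the full matrix rather than a few representative channels is what lets a channel-specific problem, like the one on channels 120 and 153, show up before release.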

Closing Remarks

We sincerely apologize for the impact of these issues and are committed to both transparency and the continuous improvement necessary to deliver product performance and quality that our customers need. In the latest software (34.1.140), the issues documented in this post have all been resolved. We urge all customers to upgrade to this software for the best possible experience.

We failed to identify these issues in our regression tests and took too long to recognize and resolve them once customers started experiencing them. Issues with wireless performance can too easily be attributed to environmental factors (e.g., local interference) and dismissed. This was true early on in this case and made us slow to identify the regression. We also lacked the data to validate performance regressions and relied too much on customer reports. The improvements to testing and performance monitoring outlined above are intended to prevent this going forward.

For those of you who reported issues and were patient with us in uploading flight logs - Thank You. We know it’s extra work to get these logs to us, but the data gathered was critical in helping us diagnose the problems. We are making improvements to streamline the log upload experience and to minimize the need to upload flight logs to get support, so that it’s less work for you and we can support you better and faster.

Our success is tied to the success of our customers. We value your partnership and we are constantly looking for ways to provide better engineering support that can directly improve the success of your organization’s mission. Speaking on behalf of our entire team, we welcome and encourage your feedback. Please don’t hesitate to reach out.

Our most important commitment as a company is that you can rely on us to deliver when the stakes are high, and we are determined to earn your trust.
