Over the last few months there has been a lot of discussion in the community regarding the viability and scalability of Neutron. We set out to determine whether scale is actually a concern, especially when Neutron is backed by an SDN system such as the Nuage VSP. Some previous attempts to demonstrate scale with Neutron [1] were either abandoned or reduced in scope.
Neutron is composed of two components: the Neutron server, which implements the northbound API, and the reference implementation based on ML2 and OVS. When the Nuage VSP platform is deployed in conjunction with Neutron, the reference plugin is replaced by the Nuage plugin and control of the network data path is delegated to the Nuage VSD. In this implementation, the Neutron server operates as the API interface between Nova and the VSP platform. The Nuage VSP is responsible for all network functions, including DHCP, NAT, and routing, so there is no need for L3 agents, DHCP agents, or, for that matter, any centralized entity to manage the network.
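To make the division of labor concrete, here is a minimal sketch of what the plugin swap amounts to. The plugin path corresponds to the Nuage plugin merged upstream in Icehouse, but treat the exact file locations and the use of crudini as illustrative rather than as our deployment procedure:

    # Illustrative: point Neutron at the upstream Nuage plugin instead of ML2.
    crudini --set /etc/neutron/neutron.conf DEFAULT core_plugin \
        neutron.plugins.nuage.plugin.NuagePlugin
    # With this plugin, the server proxies API calls to the VSD over REST;
    # no neutron-l3-agent or neutron-dhcp-agent processes are needed.
    service neutron-server restart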
It was clear from the experiences reported in [1] that in most cases the performance issues identified with Neutron were in the reference implementation. Achieving scale, though, is not always just a matter of hardening code; very often it is fundamental to the architecture of the system. In this series of blog posts we will discuss the set of design and architectural decisions that allow the Nuage VSP system to scale and provide a robust SDN control plane for cloud deployments. Before we get into the technology, though, we will present some of our performance results and how we tuned OpenStack to scale performance.
We started with a basic OpenStack Icehouse install with the upstream Nuage plugin and set out to see how far we could scale the infrastructure and what control-plane performance we could achieve. For these tests, our focus was to identify the maximum instance activation rate achievable with a single Nova controller and a single Neutron server. Our setup included 40 compute nodes, and instances were activated using libvirt-lxc so that we could maximize the instance activation rate without depending on VM boot times. Remember that our goal was to push Neutron to its limits. For the Nuage platform we used a single VSD and a single VSC in the control plane.
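Switching Nova to LXC-backed instances is a one-line configuration change on each compute node; the snippet below is a sketch under the same illustrative assumptions as above:

    # Illustrative: run instances as LXC containers instead of full VMs,
    # so activation time is not dominated by guest boot time.
    crudini --set /etc/nova/nova.conf libvirt virt_type lxc
    service nova-compute restart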
We started issuing instance activation commands (batch activation with --num-instances=50), and we quickly discovered that the system was bottlenecked even though the CPUs were underutilized: we could not achieve an instance activation rate better than 1-2 instances/second. We started optimizing the default Nova/Neutron configurations: we increased the number of nova-conductor workers, nova-api workers, and Neutron API workers. Performance started improving, but we quickly realized that processes were timing out on Keystone API calls. Looking for an easy solution, we upgraded Keystone to Juno, which supports eventlet-based workers, and that bottleneck was immediately removed. At this point we had several Keystone workers, but things were still not going as fast as we wanted. The Neutron server was idle, and the Nuage VSD was idle as well. A little more investigation showed that the default MySQL settings needed tuning too: with all the additional workers, we had to raise the connection limits and configure MySQL to accept that many concurrent connections.
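For reference, the knobs we touched map onto the configuration options below. The worker counts and connection limits are illustrative placeholders, not the exact values from our runs; a common rule of thumb is to size worker counts to the number of cores on the controller:

    # nova: more conductor and API workers (counts are placeholders)
    crudini --set /etc/nova/nova.conf conductor workers 8
    crudini --set /etc/nova/nova.conf DEFAULT osapi_compute_workers 8

    # neutron: more API workers and a larger SQLAlchemy connection pool
    crudini --set /etc/neutron/neutron.conf DEFAULT api_workers 8
    crudini --set /etc/neutron/neutron.conf database max_pool_size 30

    # keystone (Juno): eventlet-based workers for public and admin endpoints
    crudini --set /etc/keystone/keystone.conf DEFAULT public_workers 8
    crudini --set /etc/keystone/keystone.conf DEFAULT admin_workers 8

    # mysql: raise the server-side connection limit to match the extra workers
    # (persist the same setting under [mysqld] in my.cnf)
    mysql -e "SET GLOBAL max_connections = 2048;"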
After all these modifications, we were finally able to run our test. We configured 5000 networks in Neutron and started activating VMs. We reached a rate of about 10 instances/second (600 instances/minute), more than double what was reported in [1]. Note that the tests in [1] abandoned Neutron, relied only on Nova network, and hit several problems above 170 instances/server; at this point we were running with close to 500 instances per server. Thus, with the Nuage plugin we were able to improve on the previously reported instance activation numbers while keeping all of Neutron's functionality in place. But even at this activation rate, both the Neutron server and the Nuage VSP were still pretty much idle. We were not happy with idle servers.
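For readers who want to reproduce a setup of this shape, seeding the networks is straightforward to script. The loop below is a hypothetical sketch; the names and the addressing scheme are ours for illustration, not taken from our actual test harness:

    # Hypothetical seeding loop: 5000 networks, each with one /24 subnet.
    for i in $(seq 1 5000); do
        neutron net-create "scale-net-$i"
        # decompose i into two octets so every subnet CIDR is distinct
        neutron subnet-create "scale-net-$i" "10.$((i / 250)).$((i % 250)).0/24"
    done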
What we noticed was that, since we had a single Nova controller, there was just one nova-scheduler process, and this was the bottleneck in these tests. Since our goal was not to optimize Nova, but rather to prove whether Neutron/Nuage can scale, we decided to try an alternative test. We modified our test scripts so that every instance was started with 5 interfaces (i.e., for every instance booted we would have to configure 5 Neutron ports and implement all the associated functions). This means that nova-scheduler had to do only 1/5th of the work that the Neutron server had to do. With this new test in hand, we started activating instances in batches of 50, where every instance in a batch had an interface in 5 different networks. The result: we activated 4K instances on 40 servers with 20K Neutron ports across 400 networks in about 9 minutes. At that point there were 100 instances per server and 500 Neutron ports per server.
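The boot command for this test looked roughly like the following; the flavor, image, and network IDs are placeholders:

    # Placeholder IDs: boot a batch of 50 instances, each wired into 5 networks.
    nova boot --flavor m1.tiny --image cirros-lxc \
        --nic net-id=$NET1 --nic net-id=$NET2 --nic net-id=$NET3 \
        --nic net-id=$NET4 --nic net-id=$NET5 \
        --num-instances 50 scale-batch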
In summary, we had a Neutron implementation that was activating roughly 35 ports/second end-to-end (including DHCP, security rules, etc.) without errors. For all practical purposes Nova was still the bottleneck, and the Neutron server was no more than 60% utilized.
Based on the above results, we can say with confidence that the network functions of a Neutron/Nuage implementation are by far not the bottleneck in an OpenStack deployment. All the bottlenecks in the above tests were removed by properly tuning Nova and Keystone, and even after all that tuning, nova-scheduler remained the main bottleneck. This is the opposite of what has been reported previously, where the Neutron reference implementation was the main obstacle to scale.
A detailed presentation of the above results, together with details on the parameter tuning, will be given during our talk at the Paris OpenStack Summit.
[1] How we scaled OpenStack to launch 168,000 cloud instances.