SOA performance measurement is important for two main reasons: requirements compliance and system planning. The first, requirements compliance, is the obvious one: you must prove that your system meets the overall performance requirements of whatever process you are running. The second, system planning, is necessary to get to a fully deployed SOA in the first place, since you must know how each component performs individually and collectively before you can plan the amount of hardware and the number of systems you need to realize your SOA. I will examine the ways in which individual components (or services) can be measured for performance by drawing a loose analogy to the way in which electronic systems are characterized. I will also discuss why each type of measurement matters and how to adjust your system design or implementation based on its results.

For individual components, performance can be measured in several ways, which can be thought of as analogs to the way electronic systems are characterized: the impulse response, the step response, and the frequency response. The impulse response characterizes how the system reacts to a single brief excitation on an input; in an SOA component this is analogous to the latency from a single input on a single operation. The step response of an electronic system is how it reacts when its input goes from zero to one unit in a short period of time; for an SOA system this is like going from no inputs to a heavy load of inputs. Here I will break the analogy and consider a step-wise function, where the input is varied from zero to some maximum value and then back down again, so that the decaying response under load is captured as well. Finally, the frequency response of an electronic system defines how the system reacts when a signal of varying frequency is applied at its inputs; for an SOA component this can be thought of as exciting all operations on a component simultaneously.

In an SOA component, the impulse response is the latency from a single input to the response to that input. Remember that a single service can have multiple operations and that the measurement is taken per operation, so for this measurement you assume the operations are completely decoupled; that is, no operation changes the state of the system in such a way as to change the semantics of any other operation. The latency of each operation is an important measure of how the component will act within an orchestration of other components, and it forms the basic metric for performance. Another important metric to get from the impulse response is the amount of resources consumed during that latency. The percentage of resources used, coupled with the latency number, gives a rough idea of the upper bound on the throughput of a component on dedicated hardware. As an example, if operation FOO() takes 400 msec and uses 20% of the CPU and no I/O, then a rough upper bound would be 5 (100%/20%) concurrent FOO() operations in each 400 msec interval. That works out to one FOO() completing every 80 msec (400 msec/5) in aggregate, or about 12.5 operations per second. Given these numbers, you may need to go back and optimize your operation to reduce the latency and increase the throughput, or even refactor it and parallelize it into multiple operations.

After deciding on refactoring or optimizing, it is tempting to stop analyzing performance at this point, but several factors are missing. The estimate assumes that the process will simply scale linearly. This implies that there is zero shared state or locking within the process, which for a well-designed service should be true. However, the realities of modern hardware mean that multiple processes will contend for access to low-level devices (memory pathways in particular). If you have parallelized your operation in any way, Amdahl's Law also takes effect: the serial portion of the work limits the speedup you can achieve. It is therefore necessary to consider another type of profiling of your component. (For some examples of how the managed environment itself can cause problems like this, see my last post on performance.)
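
The FOO() arithmetic above, and the way Amdahl's Law erodes it, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a profiling tool: the function names are hypothetical, the operation is assumed to run in the same process as the measurement (for a remote service you would sample CPU on the server instead), and the CPU figure is taken as a fraction of a single core.

```python
import time


def measure_impulse(operation, *args):
    """Time one isolated call; return (latency_s, cpu_fraction).

    Assumption: `operation` runs in this process, so process CPU time
    approximates the resources the call consumed.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    operation(*args)
    cpu_used = time.process_time() - cpu_start
    latency = time.perf_counter() - wall_start
    cpu_fraction = cpu_used / latency if latency > 0 else 0.0
    return latency, cpu_fraction


def throughput_upper_bound(latency_s, cpu_fraction):
    """Rough ceiling in operations/second, assuming perfectly linear scaling."""
    if not 0 < cpu_fraction <= 1:
        raise ValueError("cpu_fraction must be in (0, 1]")
    concurrent_ops = 1.0 / cpu_fraction   # 100% / 20% = 5 operations in flight
    return concurrent_ops / latency_s     # 5 / 0.4 s = 12.5 operations/second


def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's Law: speedup when only part of the work can run in parallel."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# The FOO() example: 400 msec latency at 20% CPU.
bound = throughput_upper_bound(0.4, 0.2)
print(f"linear-scaling bound: {bound:.1f} ops/s")

# If only 90% of the work parallelizes, five workers give well under 5x.
print(f"speedup with 5 workers: {amdahl_speedup(0.9, 5):.2f}x")
```

The gap between the linear bound and the Amdahl figure is the reason the impulse measurement alone is not enough for planning.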

The second performance measurement step is analogous to the step response of an electronic system, wherein you cause your input to jump from zero to one unit in a very short time. In the case of an SOA component, this means running the individual operations with a load high enough to push the system, stepping that load up and then back down. The step portion of the measurement is important for several reasons. First, the initial step allows you to measure how system startup time affects latency, while subsequent steps allow you to measure how the system responds to load in a steady state. Second, stepping the inputs down allows you to measure worst-case latency as well as aggregate throughput; since worst-case latency is most likely going to be a limiting factor on overall performance, it is critical that it be measured under load and not inferred from the impulse response. Finally, a backlog of inputs can affect your latency differently when the rate of inputs changes; this is a queuing theory effect that is beyond the scope of this discussion. This measurement therefore better accounts for any hidden dependencies that you are unaware of, and it gives you an idea of your worst-case latency for messages and the corresponding throughput under varying input loads and system resource demands. At this point you have a decent idea of what your worst-case latency is and, assuming that the latency is tolerable, of the throughput you can expect for each operation. These measurements again allow you to optimize or refactor your component. If the numbers are acceptable, you can start to plan how many nodes will need to be running to meet the overall throughput needs of your system. There is, however, one more important aspect of performance to consider: the aggregate performance of the component when all operations are active.
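
A step measurement of this kind might be driven by something like the following Python sketch. It is a minimal illustration, not a recommended harness: the function names and the default load levels are assumptions, `operation` stands in for whatever client call exercises your service, and each load level uses a fresh thread pool, so the load briefly drops to zero between steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(operation):
    """Invoke the operation once and return its latency in seconds."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def run_step(operation, concurrency, calls_per_worker):
    """Drive one load level; return (worst_latency_s, throughput_ops_per_s)."""
    total_calls = concurrency * calls_per_worker
    start = time.perf_counter()
    # At most `concurrency` calls are in flight at any moment.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, [operation] * total_calls))
    elapsed = time.perf_counter() - start
    return max(latencies), total_calls / elapsed


def step_profile(operation, levels=(1, 4, 16, 4, 1), calls_per_worker=20):
    """Step the load up and back down, recording results for each level."""
    results = []
    for concurrency in levels:
        worst, throughput = run_step(operation, concurrency, calls_per_worker)
        results.append({
            "concurrency": concurrency,
            "worst_latency_s": worst,
            "throughput_ops_s": throughput,
        })
    return results


# Stand-in operation: replace with a real call to the service under test.
for row in step_profile(lambda: time.sleep(0.002), calls_per_worker=5):
    print(row)
```

Comparing the rows for the same concurrency on the way up and on the way down shows whether the component recovers after the peak or carries a backlog with it.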

The case when you are exercising all inputs of a component is analogous to the frequency response of an electronic system. By activating all of the operations simultaneously, you will again uncover any hidden dependencies that affect latency and throughput under load. This third type of performance measurement is arguably the most critical, because it can reveal whether any aspects of your component design involve shared state or dependencies between operations. Bad results from this measurement are a cue to examine splitting the component into multiple components and choreographing them differently, or running them on separate nodes. Fundamentally, this step is simply running the step measurement simultaneously across all operations, but you must take care to vary the loads across the operations at different frequencies, so that you measure how the system responds when one operation is under heavy load while certain others are quiescent, and vice versa.
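
One simple way to vary the loads at different frequencies is to give each operation a square-wave load with its own period. The Python sketch below only builds the schedule; each tick's targets would then be handed to whatever load driver you use. The operation names, the tick granularity, and the `high` load level are all hypothetical.

```python
def square_wave_load(tick, period, high, low=0):
    """Target load for one operation: `high` for the first half of each period."""
    return high if (tick % period) < period / 2 else low


def mixed_load_schedule(periods, ticks, high=8):
    """Per-tick target load for every operation.

    periods: mapping of operation name -> period in ticks.
    Returns one dict per tick, mapping operation name -> target load.
    """
    return [
        {name: square_wave_load(t, period, high) for name, period in periods.items()}
        for t in range(ticks)
    ]


# Two hypothetical operations cycling at different rates.
for tick, targets in enumerate(mixed_load_schedule({"foo": 2, "bar": 4}, ticks=4)):
    print(tick, targets)
```

With periods of 2 and 4 ticks, four ticks cover every combination: both operations busy, each one busy while the other is quiescent, and both idle.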

Armed with the results of all three types of input testing, you should now be able to plan your resource requirements more reasonably and also optimize the design of your components and services. These measurements will tell you whether to parallelize a portion of an operation, or even normalize it into two or more operations, to achieve your necessary latency and throughput requirements. The measurements will also be useful when you have to document and prove the performance of your system to an outside party. This discussion has left out a few key topics, most notably what specific mechanism should be used to drive your system, how to do planning if you have not actually built the systems yet and only have assumed numbers from simulations, and how these performance numbers relate to the overall performance of an SOA choreography. Briefly, the driving mechanism for an SOA component should be fairly simple. Any test harness (JUnit, NUnit, httperf) can be configured to exercise a service interface by specifying one or more of the allowable domain elements for an operation. The only trick is to make sure that it is run in parallel or is internally threaded for running step load operations, and that it is running on sufficient hardware that the test harness itself is not the bottleneck (although that in itself is useful data).
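
One way to check whether the harness itself is the bottleneck is to calibrate it against a target that does no work. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, the harness is assumed to be a simple thread pool, and the 50% margin is an arbitrary threshold you would tune for your own setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def harness_ceiling(workers=8, calls=2000):
    """Calls/second the harness can issue when the target does no work."""
    def noop(_):
        return None

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(noop, range(calls)))
    return calls / (time.perf_counter() - start)


def harness_limited(measured_ops_s, ceiling_ops_s, margin=0.5):
    """True when the measured rate is too close to the harness's own ceiling."""
    return measured_ops_s >= margin * ceiling_ops_s


ceiling = harness_ceiling()
print(f"harness ceiling: {ceiling:.0f} calls/s")
# Example: a service measured at 12.5 ops/s against this ceiling.
print("harness is the bottleneck:", harness_limited(12.5, ceiling))
```

If the rate you measure against the real service approaches the no-op ceiling, the numbers describe the harness more than the component.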

The question of how component performance affects overall system performance will be addressed in the next installment.
