Thursday 19 September 2019

Achieving Reliability in MuleSoft 4


I have always found it difficult to talk about what reliability is and how to achieve it, sometimes, I have to admit, because I was not aware of certain problems or was not sure how to address them.
That is why I thought it was a good idea to collect the solutions offered by MuleSoft to the various reliability problems in one place, to give a single view of the different options.

In this article, I do not intend to give low-level technical details of each solution; the related MuleSoft documentation is the better place for that. However, I am happy to expand on any of these if required.

Overview

Reliability is defined in Wikipedia as "the quality of being trustworthy or of performing consistently well". 

Reliability is one of the most important non-functional requirements in IT. However, it is often left out of requirement discussions, because it is thought to be:
  • not important for the customer (before everything goes to hell)
  • not worth discussing: why should things go wrong?
  • too technical, and so difficult for business people to understand
  • or, sometimes, because the Integration Architect does not emphasize the importance of understanding it
The problem is that if the requirements are not agreed, when issues come it is easy to say that it is someone else's fault (and this someone is usually the Architect).

Reliability Approaches

Requirements 

Reliability means that no message data is lost during processing, whether because of errors or after a stop or crash of the server(s) processing the requests.

Various reliability patterns can be implemented to achieve reliability goals for synchronous and asynchronous flows.

Ways to Achieve a Reliable API

Reliability in Mule applications can be achieved using:
  • Reconnection strategies, see here
    • When a connection pool is used for a system such as a DB or SFTP server, MuleSoft opens these connections when the server starts and uses them as needed.
    • If for any reason (remote system down or connectivity problems) the pool is not properly populated, or a connection goes down at any time, by default MuleSoft keeps running and all the flows using these connections will keep failing.
    • With this feature it is possible to instruct Mule to try instead to reconnect and repopulate the pool at defined intervals.
    • I always configure a reconnection strategy, for instance with the (S)FTP or DB connectors
  • Until Successful scope, see here
    • It retries a sequence of Mule processors, up to a defined number of times, until they all succeed.
    • I find it very useful for HTTP Requests where the connection is unstable.
  • Redelivery policy, see here
    • The Redelivery policy in Mule4 is similar to the Until Successful scope, but it is always applied to the Source of the flow.
    • For the developer it is just a configuration. In the background, Mule saves the received message in the Mule cache and increments the number of times it has been resubmitted after an error occurs.
    • When using this policy, it is a best practice to implement an error handler for the REDELIVERY_EXHAUSTED error
    • The Mule4 Redelivery policy can be applied to any flow source, but the best practice is to use the external system's redelivery when supported (such as for JMSConsume operations).
  • RETRY_EXHAUSTED, see here
    • As a best practice, an error handler for this exception should always be implemented
    • In Mule4 it can commonly be thrown by any connector
  • Transactions see here
    • If a series of steps in a Mule flow must succeed or fail as one unit, a good practice in Mule is to use a transaction to demarcate that unit. A transaction can start at the Source of the flow or can demarcate any Try scope.
    • When only one system supporting transactions is involved, a single Mule local transaction can be used. If more than one system supporting transactions is involved, an XA transaction is recommended: it supports two-phase commit, making sure that either all the systems involved commit or all of them roll back.
    • The default approach in Mule4 is that in the happy path the (XA or Single) transaction gets COMMITTED at the end, while in case of errors, within an:
      • On-Error-Propagate handler the transaction gets ROLLED-BACK
      • On-Error-Continue handler the transaction gets COMMITTED
    • Message persistence might be required when the state has to be recovered after application downtime or a crash
      • Persistence can be implemented in Mule via VM, JMS, DB, Cache/Object Store, File
    • When using JMS/ActiveMQ, besides transactions, the message ACK approach can also be used
  • Reliability Pattern in async scenarios, see here
    • I find it useful in push scenarios triggered by changes to a source system (often used for system synchronisation). Some time ago I would have referred to this as Change Data Capture (CDC), while today these use cases are usually called webhooks.
    • The main options for implementing the communication between the reliable acquisition and processing flows are Persistent VM or JMS/Active MQ; however, some differences have to be noted:
    • Persistent VM (based on Amazon SQS standard Queue, see here and here):
      • "at least once delivery" is guaranteed, so there is a chance the same message could be processed more than once; therefore the flow has to call idempotent operations or use an idempotent filter
      • "message ordering" is not supported
      • Can only be used when the communication is between MuleSoft APIs
      • It is recommended to configure a Redelivery policy (see here https://docs.mulesoft.com/connectors/vm/vm-reference#listener) and implement the error handler for REDELIVERY_EXHAUSTED, so as not to lose track of lost messages
    • JMS/Active MQ
      • "exactly once delivery" and "message ordering" are supported
      • It is recommended to configure a FIFO Queue when message ordering is needed, with a Dead Letter Queue (see here https://docs.mulesoft.com/mq/mq-queues#dead-letter-queues), and to implement some monitoring on the Dead Letter Queue so as not to lose track of lost messages
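
To make the idempotent filter mentioned for Persistent VM a bit more concrete, here is a minimal Java sketch of the idea. It is only an illustration and not a Mule API: the class IdempotentFilter and its process method are names I made up, and I am assuming every message carries a unique ID. The set of seen IDs is kept in memory, so a real implementation would need a persistent store to survive a restart.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical helper, not part of any Mule API: remembers the IDs of
// messages already processed so that a duplicate delivery is skipped.
final class IdempotentFilter {
    // In-memory only (assumption for the sketch); use a persistent store
    // in practice so the IDs survive a restart or crash.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    /**
     * Runs the handler only the first time a message ID is seen.
     * Returns true if the handler ran, false if the message was a duplicate.
     */
    boolean process(String messageId, String payload, Consumer<String> handler) {
        if (!seen.add(messageId)) {
            return false; // duplicate delivery: skip it
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            // Processing failed: forget the ID so a later redelivery is retried.
            seen.remove(messageId);
            throw e;
        }
    }
}
```

With this in place, calling process twice with the same message ID runs the handler once and returns false the second time, which is the behaviour needed when the queue only guarantees at least once delivery.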

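The counting behaviour described above for the Redelivery policy can also be sketched in plain Java. This is not how Mule implements it internally, only my illustration of the idea: RedeliveryTracker, RedeliveryExhaustedException and the maxRedeliveryCount parameter are hypothetical names, and the failure counts are kept in memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical exception, standing in for the REDELIVERY_EXHAUSTED error.
final class RedeliveryExhaustedException extends RuntimeException {
    RedeliveryExhaustedException(String messageId, int failures) {
        super("Message " + messageId + " failed " + failures + " times");
    }
}

// Hypothetical sketch: counts failed deliveries per message ID and gives up
// once the number of failures exceeds the configured maximum.
final class RedeliveryTracker {
    private final int maxRedeliveryCount;
    private final Map<String, Integer> failures = new ConcurrentHashMap<>();

    RedeliveryTracker(int maxRedeliveryCount) {
        this.maxRedeliveryCount = maxRedeliveryCount;
    }

    void deliver(String messageId, String payload, Consumer<String> flow) {
        int failedSoFar = failures.getOrDefault(messageId, 0);
        if (failedSoFar > maxRedeliveryCount) {
            // Too many failures: stop processing and signal exhaustion,
            // so the caller can log the message or park it somewhere.
            failures.remove(messageId);
            throw new RedeliveryExhaustedException(messageId, failedSoFar);
        }
        try {
            flow.accept(payload);
            failures.remove(messageId); // success: clear the counter
        } catch (RuntimeException e) {
            failures.merge(messageId, 1, Integer::sum); // count this failure
            throw e;
        }
    }
}
```

With a maximum of 2 in this sketch, a message that always fails is processed three times (the first delivery plus two redeliveries), and the next delivery raises the exhausted exception instead of running the flow. That is the point where the error handler should record the message so it is not silently lost.
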
Conclusions

I have quite frequently found customers unwilling to talk about these topics; if they eventually trust the experts, they are happy to go for the recommended solution.

In fact, I usually have a default approach based on my experience in the industry, and I tend to recommend it (and sometimes explain it too) to the customer, making sure they fully understand what it means for them (and sign it off); most of the time they are happy with that.

In some cases they want to discuss the various options in more detail, bearing in mind that each approach has trade-offs between reliability goals, performance, and cost.