Monday, 7 July 2014

Planning – setting the cell retrieval threshold and interesting hack

If you have been working with planning then no doubt you have experienced the web application JVM crashing at some point which can be caused by a number of different factors, the usual likely candidates are large forms and maybe the most deadly of all is ad-hoc analysis.

When a large form or ad-hoc retrieval is run then there is a possibility that an out of memory error may occur in the JVM, the common errors to be generated in the logs are:

<Critical> <WorkManager> <BEA-002911> <WorkManager weblogic.kernel.Default failed to schedule a request due to java.lang.OutOfMemoryError: getNewTla

<Critical> <WorkManager> <BEA-002911> <WorkManager weblogic.kernel.System failed to schedule a request due to java.lang.OutOfMemoryError


You only need to monitor the web application once a single large ad-hoc query has been run to see the hit on resources.


It it possible to alleviate this issue with JVM tuning, optimal form design, improved access permissions, limit ad-hoc provisioning and educating users that planning ad-hoc should not be used as a download engine as it has not been designed to be used like that, if faster and larger queries are required then look towards the Essbase provider.

Even after implementing some of the above suggestions there is no guarantee that planning won’t be pushed too far and if it is a recurring issue then finding the root cause is definitely a number one priority.

A useful way of delving into the reason is to analyse the heap dump taken when the crash occurs, what is a heap dump you may well ask….

“A heap dump is a snapshot of the memory of a Java process at a certain point of time. There are different formats for persisting this data, and depending on the format it may contain different pieces of information, but in general the snapshot contains information about the java objects and classes in the heap at the moment the snapshot was triggered. Usually a full garbage collection is triggered before the heap dump is written so it contains information about the remaining objects.”

By default a heap dump is not created by default when an out of memory occurs so a few additional parameters can be added to the java options for the web application in question, they can be added in the registry editor for windows or setCustomParamsManagedServerName.sh for *nix systems.

The two parameters I usually use are:
-XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath


Once the parameters have been applied and the web application restarted then the next time an out of memory error occurs a heap dump should be generated.


Now you have the output then you require software to analyse the heap and a very useful piece of open source software is eclipse memory analyser.

The installation is quick and simple and once the heap dump has been opened the Memory Analyzer can inspect the heap dump for leak suspects, e.g. objects or set of objects which are suspiciously big.


Once the leak suspects report is selected a pie chart is generated displaying the size of the heap and any problem suspects.


Selecting the details will allow you to drill down into the suspect classes which were active when the out of memory occurred.


Without needing to understand the classes in great detail you are able to spot that it was potentially an ad-hoc grid that caused the out of memory issue, selecting the attributes tab provides usual information such as the application and cube name plus the number of rows/column in the ad-hoc grid.

Selecting the xmlAdhocGrid2 class provides further information such as it was run from Smart View, the user id which can be easily converted to a username in the planning tables and some of the options selected at retrieval time.


The attributes also show that all the suppression options were turned off at the time of retrieval which is also good indication why such as large hit on the JVM took place.


If the out of memory occurred due to a large form and not an ad-hoc analysis the form in question can also be found in the heap dump.


With this simple investigation it is possible to ascertain what was happening in planning and the user that instigated the crash.

This is all well and good but wouldn’t it be nice just to be able to limit the amount of cells that can be retrieved thus protecting a possible outage on the web application.

Prior to 11.1.2.2 the options available are pretty limited and the following message just doesn’t cut it as a safeguard.


Recently I noticed a patch which was listed for release 11.1.2.3 but reading the description it clearly states it is for 11.1.2.2, I have no idea what it is doing with the 11.1.2.3 patches and I suppose it was mistaken put there and the reason why it has not had many downloads.


There is one defect addressed with the patch:

18259065 - When retrieving data through Smart View Ad-Hoc Analysis, any number of rows or columns is retrieved, despite the preferences which were set.

After applying this patch, the below mentioned property need to be set in Administrator -> Application -> Properties screen ERROR_THRESHOLD_NUM_OF_CELLS -> Value for this property will be the maximum number of cells for which the error messages is shown to the user during Ad-hoc Grid operation.

Note 1: This fix applies to Planning data simple forms, Smart View Ad-hoc and Financial Reports against Planning application.

Note 2: Unless overridden, default value of this property -ERROR_THRESHOLD_NUM_OF_CELLS is 250000 cells.


Interesting yet another planning property that can be applied to an application which should stop a retrieval being run where the number of cells is higher than the default value of 250,000 or higher than a custom value set using the property.

Also if the threshold is exceeded the retrieval is not executed so no memory is allocated which means no additional memory being consumed in the planning web application.

The following calculation can be used to determine the total number of cells:

Total number of cells = row * cols

Number of cols = Max (static columns (dimensions on page/pov), data columns) + Number of dimensions on the rows (each dimension takes one column)

Number of rows = (static rows (one for page/pov + one for col headers) + (data rows)


Going back to the patch it indicates that it needs to be applied on top of 11.1.2.306 so before patching I tested on 306 and there were definitely no messages being generated with large retrievals.

After patching I ran a large Planning Smart View ad-hoc retrieval.


Nice, the default 250,000 threshold had been exceeded so the retrieval was stopped and an error message displayed.

So how about overriding the default value, I added a new application property ERROR_THRESHOLD_NUM_OF_CELLS with a low value of 10,000


A restart of the planning web application is required to apply the changes; I wish it was possible to apply properties settings without having to restart as it can be long winded task if you need to play around with different values.

Another Smart View retrieval and the new value was working.


I also tested from within planning and the same functionality is applied and a warning pop up is displayed when the threshold is exceeded in forms or in ad-hoc mode.


A nice feature if you are on 11.1.2.2 but what about the latest and greatest 11.1.2.3.500 surely that must be there as standard, unfortunately it does not seem to be available yet as no error messages were being generated with testing.

I could have left and it there but I wanted to delve a little deeper to see if the code was actually deployed in 11.1.2.3.500

After looking around I could see the code was there but it was not firing and I finally found the reason why.

As you may well be aware that some 11.1.2.3 functionality is only available in Oracle’s planning and budgeting cloud service (PBCS) and in this case it looks like at the moment the threshold option is only available if in the cloud world.

Now the following steps I definitely don’t recommend but I wanted to see if the threshold governor could be enabled.

Planning checks the Shared Services registry to see if cloud mode is active, running a registry report clearly shows that there is an option available which by default is set to false.


After a quick update to the registry and restart of services I wondered if anything would have happened.


Oh look my EPM instance has now sort of turned into the planning cloud version :)

I gave a quick test of opening a large planning form and running a Smart View planning ad-hoc retrieve.




So the functionality is definitely iavailable but for some reason not for on-premise deployments yet, maybe it is coming soon if it available now in 11.1.2.2

If you have never seen PBCS then there a few diagnostics options available which are not available in the on-premise version.


A health check can be run on a planning application which will run through various artifacts such as forms, rules, dimensions and give them a traffic light type status.


Any artifacts that have been given a red status such as passing an error threshold will provide suggestions on how to resolve the problem.


The health check also provides basic suggestions such as outline order optimisation and stats like plan type block size and block density.


The threshold properties in the cloud service are set by Oracle and cannot be changed but the properties are all still available.

In the following example I set the block size to a low value of 1000 Bytes using the ERROR_THRESHOLD_BLOCK_SIZE property.

Running the health check again now highlighted a red status for the block size.


Anyway I must apologise to Oracle for my hack and don’t worry I have set the cloud mode back to the original value.

I do have one final message though :)


Update: Threshold functionality is available in 11.1.2.3.501+

3 comments:

Anonymous said...

We got an error message for HFM retrieval in SmartView:

Form X exceeds the limit of 25000 cells to be displayed. Please change options to suppress some cells.


Do you know how to edit the limit for HFM?

Nandan Vankadari said...

Hi John,

Thank you for the very useful post as usual. Need one clarification. We are running on Hyperion 11.1.2.3.500.17 on Exalytics server. What is the process of moving from the current environment to Cloud to use new features of 11.1.2.3.500 mentioned in your blog?

Thanks,
Kumar

John Goodwin said...

The threshold functionality is available in 11.1.2.3.501