Skip to main content

opalis: multiple policy instances running in the operator console

here is an interesting and rather complex problem i experienced with opalis.  honestly, this thing has probably been going on since i built the environment.  however, i have only begun working with nested workflows which really helped flush it out.

 

summary

essentially, when a master workflow launches a sub workflow, the policy spins in the background and never does anything.  another indication of the problem is that the summary of policies indicate a high number of running instances.  lastly, the problem appears to be environmental since only one opalis instance has the problem. 

here is a view of the operator console with this problem.

image

 

troubleshooting

the first thing i prefer to do in scenarios like these is to try to recreate the problem.  that wasn't difficult on the server having the problem.  the operator console shows a high number of instances, sub-workflows hang indefinitely, etc.  when i tried it in a different environment, replicating the same master/sub nested test, it worked fine.  i decided it must be my policy and exported the test version into my broken opalis instance, ran it, and got the same result -- it hung.

i decided to look down another path into the running instances count to see if that was part of the problem.  i began by restarting the opalis service, hoping instances that were trapped somewhere would free themselves and disappear.  not a chance!  next, i started looking through the process list and found policymodule.exe which appears to be the running process that a policy resides in.  i figured with that many policy instances running, surely, i was way over the concurrent policy instance maximum.  unfortunately, i only found two of these processes which matched up to the two running policies as seen in the screenshot above. 

since that looked fine, i went back to sifting the logs (yeah, i mistakenly dismissed it the first time around).i found this statement in the logs (full snip below)[3] which had me scratching my head:

2010-12-07 15:12:38 [1640] 1 Opalis Event: Frequent DB errors

that information wasn't very telling but eventually a little further down in the logs, i found this error:

<Param>The EXECUTE permission was denied on the object 'sp_UnpublishPolicyRequest', database 'Opalis', schema 'dbo'.</Param>

apparently, i had the wrong set of permissions.  according to the administrator guide, the action server requires "part of database users group on the datastore computer".  going by this guidance, the action server service account was granted db_datareader and db_datawriter. 

this made perfect sense.  all of my other environments were ones i set up simply to test.  with this broken one, it was set up as a dev environment utilizing best practices such as using databases from people who know how to run them (not me).  adding to that, using the minimal level of rights required is followed.  by the log output, you can see i missed one of those such rights, however.  i did not grant the "execute" permissions that the account required.

this seems to speak to the root of the problem but in my research found that it did not correct the running instances count that appears in the operator console.  looking through the database views, i found dbo.policies_view (which details a lot more than what you see in the operator console, by the way).  i opened this view and found "runningpolicyinstances" column which contained the exact counts i saw in the operator console:

image

the design of the dbo.policies_view indicated very clearly how this column was being constructed.

(SELECT COUNT(*) AS Expr1 FROM dbo.POLICYINSTANCES WHERE (dbo.POLICIES.UniqueID = PolicyID) AND (TimeEnded IS NULL))

so basically, if the timeended value in the dbo.policyinstances table is null, it counts them up and displays it here.  sure enough, the table contained a high number of rows with null values.  i searched through all of the stored procedures to try to locate an entry that indicated inserting values into policyinstances but in almost every case, data was drawn from the table and rarely written to.  to add to this problem, log purging doesn't delete entries where the timeended value is null.  i suspect this is because it believes the policies haven't finished executing yet.  thus, these counts would have never gone away on their end.

 

resolution

:: database ::

the problem with the database permissions can be fixed in one of two ways:

  • grant the account db_owner rights
  • grant the account execute rights on all stored procedures [1]

once either one of these is done, the error in the opalisactionservice logs stop generating.  i would be cautious with this.  first, granting db_owner is probably granting rights beyond what the account actually needs.  second, if you choose to just grant rights to the stored procedures, it may not be the entire set of rights required.  my testing has been limited so far, and thus i may find in doing so that i run into other issues [2].

 

:: operator console ::

correcting the summary count will require some changes to the policyinstances table.  obviously going in and messing around with tables is not going to be supported by microsoft so proceed at your own risk.  my environment is a testing environment so it makes sense for me.  you may want to call microsoft support.

to begin, i shutdown all running policies so that any timeended date stamps would write in as necessary (allowing me to avoid unnecessarily changing data that didn't need modification).  afterwards, i ran the following sql query to set the timeended value to the current timestamp.

update dbo.policyinstances 
set [timeended] = getdate()
where timeended is null

this simple query adds the current datetime to the timeended field where the value is null.  as shown in the screenshot below, the instance summary now displays the correct count.

image

 

addendum

[1] if you're interested in the sql query to provide execute permissions for the action server service account, here it is.  many thanks to patrick for whipping this up!

declare @str_objname sysname,
@str_execsql nvarchar(256)
 
declare cur_spname cursor for
select name from sys.procedures
 
--select schema_name(schema_id), name from sys.procedures
 
open cur_spname
 
fetch cur_spname into @str_objname
 
while @@fetch_status = 0
begin
set @str_execsql = 'grant execute on '+@str_objname+' to [myServiceAccount];'
print @str_execsql
exec sp_executesql @statement = @str_execsql
fetch cur_spname into @str_objname
 
end
 
deallocate cur_spname

here is what the query ended up executing:

grant execute on sp_insertevent to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByEntries to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByDays to [myServiceAccount];
grant execute on sp_GetLogEntriesForDelete_FilterByEntriesAndDays to [myServiceAccount];
grant execute on sp_CustomLogCleanup to [myServiceAccount];
grant execute on sp_PublishPolicy to [myServiceAccount];
grant execute on sp_UnpublishPolicyRequest to [myServiceAccount];
grant execute on sp_UnpublishPolicy to [myServiceAccount];
grant execute on sp_DeleteTreeData to [myServiceAccount];
grant execute on sp_FindTreeInsertionPoint to [myServiceAccount];
grant execute on sp_InsertTreeData to [myServiceAccount];
grant execute on sp_MoveTreeBranch to [myServiceAccount];
grant execute on sp_StopAllRequestsForPolicy to [myServiceAccount];
grant execute on sp_StopAllRequests to [myServiceAccount];

[2] if you want, you can follow this forum post. looking for the complete list of database rights required by the action server service account: http://social.technet.microsoft.com/Forums/en-US/opalisv5v6/thread/faf9c8a0-a1b2-4442-8391-1ff738826f28

[3] here's the full log snippet:

2010-12-07 15:12:38 [1640] 1 Opalis Event: Frequent DB errors

2010-12-07 15:12:38 [1640] 1 Exception caught in void __cdecl OpalisEventDeliveryStrategyComposite::sendAndTraceExceptions(const class OpalisEventDeliveryStrategy &,const class OpalisEvent &,const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisEventDelivery\OpalisEventDeliveryStrategyComposite.cpp(83):
<Exception>
<Type>Opalis::Exception</Type>
<Location>
void __cdecl OpalisEventDeliveryStrategyComposite::sendAndTraceExceptions(const class OpalisEventDeliveryStrategy &,const class OpalisEvent &,const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisEventDelivery\OpalisEventDeliveryStrategyComposite.cpp(80)
</Location>
<MsgCode>Cannot deliver Opalis Event</MsgCode>
<Params>
<Param>EW</Param>
</Params>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __cdecl StorageCallExecutor::throwChained(const class Opalis::Exception &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisActionService2\StorageCallExecutor.cpp(47)
</Location>
<MsgCode>SCE: ActionServerStorage call failed</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
long __stdcall CODBDataStore::ReportEventW(struct tagVARIANT)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\ODBDataStore.cpp(6002)
</Location>
<MsgCode>_com_error</MsgCode>
<Params>
<Param>Unknown error 0x800A0CC1</Param>
<Param>Item cannot be found in the collection corresponding to the requested name or ordinal.</Param>
<Param>-2146825023</Param>
</Params>
</Exception></Prev>
</Exception></Prev>
</Exception>

2010-12-07 15:12:38 [1640] 1 Exception caught in void __thiscall StorageCallExecutor::onStorageException(class Opalis::Exception &,bool)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\OpalisActionService2\StorageCallExecutor.cpp(120):
<Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall PublishedPoliciesManager::removePolicyRequest(const class _bstr_t &,const __int64 &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\PublishedPoliciesManager.cpp(850)
</Location>
<MsgCode>Cannot unpublish policy</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall PublishedPoliciesManager::removePolicyRequest(const class _bstr_t &,const __int64 &)
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\PublishedPoliciesManager.cpp(838)
</Location>
<MsgCode>No more attempts</MsgCode>
<Prev><Exception>
<Type>Opalis::Exception</Type>
<Location>
void __thiscall DBAccessor::execute(const class std::basic_string<unsigned short,struct std::char_traits<unsigned short>,class std::allocator<unsigned short> > &) const
C:\AutomatedBuild\IS5.Platform\Branches\6.2_Sanitized\Platform\DBDataStore\DBAccessor.cpp(37)
</Location>
<MsgCode>_com_error</MsgCode>
<Params>
<Param>IDispatch error #3081</Param>
<Param>The EXECUTE permission was denied on the object 'sp_UnpublishPolicyRequest', database 'Opalis', schema 'dbo'.</Param>
<Param>-2147217911</Param>
</Params>
</Exception></Prev>
</Exception></Prev>
</Exception>

Comments

Popular posts from this blog

using preloadpkgonsite.exe to stage compressed copies to child site distribution points

UPDATE: john marcum sent me a kind email to let me know about a problem he ran into with preloadpkgonsite.exe in the new SCCM Toolkit V2 where under certain conditions, packages will not uncompress.  if you are using the v2 toolkit, PLEASE read this blog post before proceeding.   here’s a scenario that came up on the mssms@lists.myitforum.com mailing list. when confronted with a situation of large packages and wan links, it’s generally best to get the data to the other location without going over the wire. in this case, 75gb. :/ the “how” you get the files there is really not the most important thing to worry about. once they’re there and moved to the appropriate location, preloadpkgonsite.exe is required to install the compressed source files. once done, a status message goes back to the parent server which should stop the upstream server from copying the package source files over the wan to the child site. anyway, if it’s a relatively small amount of packages, you can

How to Identify Applications Using Your Domain Controller

Problem Everyone has been through it. We've all had to retire or replace a domain controller at some point in our checkered collective experiences. While AD provides very intelligent high availability, some applications are just plain dumb. They do not observe site awareness or participate in locating a domain controller. All they want is the name or IP of one domain controller which gets hardcoded in a configuration file somewhere, deeply embedded in some file folder or setting that you are never going to find. How do you look at a DC and decide which applications might be doing it? Packet trace? Logs? Shut it down and wait for screaming? It seems very tedious and nearly impossible. Potential Solution Obviously I wouldn't even bother posting this if I hadn't run across something interesting. :) I ran across something in draftcalled Domain Controller Isolation. Since it's in draft, I don't know that it's published yet. HOWEVER, the concept is based off

sccm: content hash fails to match

back in 2008, I wrote up a little thing about how distribution manager fails to send a package to a distribution point . even though a lot of what I wrote that for was the failure of packages to get delivered to child sites, the result was pretty much the same. when the client tries to run the advertisement with an old package, the result was a failure because of content mismatch. I went through an ordeal recently capturing these exact kinds of failures and corrected quite a number of problems with these packages. the resulting blog post is my effort to capture how these problems were resolved. if nothing else, it's a basic checklist of things you can use.   DETECTION status messages take a look at your status messages. this has to be the easiest way to determine where these problems exist. unfortunately, it requires that a client is already experiencing problems. there are client logs you can examine as well such as cas, but I wasn't even sure I was going to have enough m