JobPak Support

    Very Important, Do Not Discard

    JobRescue & POSIX Directories Under MPE/iX

    June 2011



    Dear JobPak Customer,

    This email contains important information about a problem that could affect your production environment. It does not require urgent action, but you should be aware of the potential problem.


    If you are not the technical contact for JobPak or JobRescue on your HP3000, please forward this email to the appropriate person and ensure that they receive it.

    First, please understand that JobRescue has been used on thousands of systems, for many years, in large and small environments without triggering the problem described below.

    During the last 12 months it has come to our attention that there may be a problematic interaction between the MPE/iX host operating system's POSIX directory file system and the distribution of JobRescue stored spoolfiles in those directories. To date, two customers have experienced one or more directory corruption problems that cannot be traced back to an operating system problem. This problem is directly related to the number of files stored under a single directory name. The amount of disk space used does not appear to be a factor.

    This email is intended to make you aware of the problem, and provide a solution should you decide to implement it. To understand the problem and its solution we'll provide some background into how JobRescue saves spoolfiles.

    Background

    When the 6.1a version of JobRescue was introduced in the 1990s, it took advantage of HP's newly introduced POSIX file system and function libraries. Previously, JobRescue stored saved spoolfiles (both $STDLISTs and reports) in a series of MPE groups in the NSD account. With the POSIX file system, JobRescue instead stores spoolfiles in ten POSIX named directories under the /NSD account directory.

    The decision was made to use these new features because the operating system enforced a limit of about 5000 files per MPE group. The multiple group implementation in previous versions of JobRescue was cumbersome for the software to navigate, and required unreasonable limitations on how many spoolfiles could be saved during a given span of time.

    We chose to store files in the POSIX file system for two reasons. First, HP told us that, as with UNIX file systems, there was no practical limitation on the number of files in any single directory. Second, since filenames could be much longer, each saved spoolfile could be named uniquely so as to eliminate conflicts for restoration purposes.

    Where JobRescue Stores Spoolfiles

    When JobRescue saves a spoolfile it puts a copy of it (in original spoolfile format) in one of ten directories (/NSD/files0 through /NSD/files9). Each file is uniquely named, being made up of the letter "A" or "R", the original spool ID, and a timestamp. The timestamp makes the filename unique. The directory where the file resides is chosen by the last digit of the original spool ID. So a $STDLIST with spool ID #O1234, stored on the system volume set, would be named:

    /NSD/files4/A.1234.1274807843

    If stored on a user volume set:

    /NSD/USERVOL/files4/A.1234.1274807843

    Saving a $STDLIST also requires saving a header file (used by the STATUS program for navigation information), like:

    /NSD/files4/A.1234.1274807843.hdr

    If spoolfile compression is turned on, then only one file is saved, with the spoolfile and the header file being combined into a single file, and that file having ".cmp" added to the end of it.
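
    The naming rules above can be sketched in a few lines of Python. This is only an illustration of the scheme as described in this email, not JobRescue's own code. The function name and parameters are hypothetical, and it assumes that every saved spoolfile has a header file, that the user volume set name appears as a path component (as USERVOL does in the example above), and that ".cmp" is appended to the base filename.

```python
def saved_spoolfile_paths(kind, spool_id, timestamp, volume_set=None, compressed=False):
    """Return the path(s) for one saved spoolfile under the scheme described above.

    kind       -- the letter "A" or "R" that starts the filename
    spool_id   -- numeric part of the original spool ID (1234 for #O1234)
    timestamp  -- the timestamp that makes the filename unique
    volume_set -- None for the system volume set, else the user volume set name (assumed)
    compressed -- True if spoolfile compression is turned on
    """
    if kind not in ("A", "R"):
        raise ValueError("kind must be 'A' or 'R'")
    # The directory is chosen by the last digit of the original spool ID.
    directory_digit = spool_id % 10
    root = "/NSD" if volume_set is None else f"/NSD/{volume_set}"
    base = f"{root}/files{directory_digit}/{kind}.{spool_id}.{timestamp}"
    if compressed:
        # Spoolfile and header are combined into a single ".cmp" file.
        return [base + ".cmp"]
    # Otherwise the spoolfile and its header are saved as two files.
    return [base, base + ".hdr"]
```

    Calling saved_spoolfile_paths("A", 1234, 1274807843) reproduces the two /NSD/files4 names shown above.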

    The Actual Problem

    Over the years, there have been multiple issues with corruption in the POSIX file system unrelated to JobRescue. HP developed quite a few patches that fixed and prevented corruption issues in MPE 6.x and 7.x. Until recently (the last 12 months or so), we were unaware of any continuing directory corruption problems. Now, two customers have reported directory corruption problems that may cause a system failure. One of those customers has provided us with an HP support document (dated Feb 2010) stating an estimated limit of 10,000 files per POSIX directory. We had been unaware of both the document and the limitation.

    The problem with the number of files surfaces during JobRescue's "merge" or "log" processing, where JobRescue determines which files should stay and which files should be aged off and deleted. Due to the nature of HP's implementation of the POSIX file system under MPE, corruption may result when files are deleted. The operating system's Transaction Manager logic requires that all file names alphabetically preceding the file to be deleted have their file information read into a memory list. This may become a large memory requirement, and when it exceeds its limits the Transaction Manager causes a system failure and directory corruption may result. The logic the Transaction Manager uses for POSIX files may also result in very slow file deletion performance and, depending on the software being used, backup performance issues.
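
    To see why the file count matters, here is a rough Python model of the behavior just described. It is a simplified sketch and not HP's Transaction Manager code. The function name is hypothetical, and ordinary string ordering stands in for whatever ordering MPE/iX actually uses for directory entries.

```python
import bisect


def entries_read_before_delete(sorted_names, target):
    """Count the directory entries that sort before `target`.

    Under the behavior described above, this is how many entries would have
    their file information read into the memory list when `target` is deleted.
    `sorted_names` must already be sorted.
    """
    index = bisect.bisect_left(sorted_names, target)
    if index == len(sorted_names) or sorted_names[index] != target:
        raise KeyError(target)
    # Every name at a lower index sorts before the target.
    return index
```

    In this model, deleting a file whose name sorts near the end of a 50,000-file directory reads almost 50,000 entries, while the same deletion in a 5,000-file directory reads at most 4,999.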

    Most customers with a large number of files in their POSIX directories have never experienced this problem. We have some customers with over 500,000 files stored in ten directories (over 50,000 files per directory) and they have had no problems for years.

    What's Your Current File Count?

    You can determine how many files are stored by JobRescue by looking at the EOF of the JOBDATA.PUB.NSD file. Just do a

    LISTF JOBDATA.PUB.NSD,2

    If you do not have file compression turned on, multiply the EOF number by two. Then divide the result by 10 for a rough estimate of the number of files in each POSIX directory.
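
    The same arithmetic as a small Python sketch, in case you want to check several machines. The function and parameter names are hypothetical. The EOF value is whatever LISTF reports for JOBDATA.PUB.NSD on your system.

```python
def estimate_files_per_directory(jobdata_eof, compression_on, directories=10):
    """Rough estimate of saved files in each /NSD/files# directory.

    jobdata_eof    -- EOF of JOBDATA.PUB.NSD as shown by LISTF
    compression_on -- True if spoolfile compression is turned on
    directories    -- number of /NSD/files# directories in use
    """
    # Without compression each saved spoolfile also has a header file,
    # so the file count is twice the EOF.
    total_files = jobdata_eof if compression_on else jobdata_eof * 2
    return total_files // directories
```

    For example, an EOF of 25000 without compression works out to roughly 5000 files per directory.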

    Read This

    If you don't read anything else in this email, be sure to read this paragraph. If you do encounter this problem and you have a system abort, you may or may not have associated POSIX directory corruption. The only way to tell whether corruption has occurred is to have your OS support provider read the dump that you took, or to use FSCHECK to identify the problem. However, once directory corruption occurs, any access to the file that caused the corruption will result in another system failure or hang. It does not matter if the /NSD/files# directories are on the system volume set or a user volume set.

    If directory corruption occurs,

    • Do not restart JobRescue
    • Use FSCHECK to identify the file and its directory (only if properly instructed in the use of FSCHECK, as improper use can make your system unbootable)
    • Do not access the offending directory in an attempt to delete the file, as that will cause another system failure
    • After the offending directory has been identified (from reading a dump or using FSCHECK) it should be renamed (using the POSIX mv command) and not deleted

    The individual /NSD/files# directory may then be rebuilt and the saved spoolfiles from that directory restored from a backup -- not copied from the old directory.

    Customer Reported System Aborts

    There isn't just one system failure ID that identifies the directory problem. The following failures have been reported:

    • SA773 - multiple semaphore lock preceded by messages indicating too many files to backup
    • SA2216 - Transaction Manager memory list exceeds maximum size
    • SA1851 - invalid address from MPE (running out of entries in a critical table)

    Prevention

    Prevention of this problem requires the installation of a one-program patch and modification of the JobPak start-up job control file. The patch has been developed for both JobRescue 6.1D and 6.1F. This patch creates 90 more directories so that the file count in any one directory would become one tenth of what it was previously. So instead of storing spoolfiles in 10 directories, JobRescue would then use 100. The directories are named /NSD/files0 through /NSD/files99. The patch is available from the Nobix website and takes about 10 minutes to install. Installation requires the Reflection terminal emulator and requires that JobPak be stopped momentarily.
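
    As a worked example of the one-tenth effect, the Python sketch below compares the 10 and 100 directory layouts against the estimated 10,000 file limit from the HP support document mentioned earlier. The names are hypothetical, an even spread across directories is assumed, and the 100 directory figure applies only once files saved before the patch have aged off.

```python
# Estimated per-directory limit from the HP support document (an estimate, not a hard rule).
ESTIMATED_LIMIT_PER_DIRECTORY = 10_000


def compare_layouts(total_files):
    """Return {directory_count: (files_per_directory, within_estimated_limit)}.

    Assumes saved files are spread evenly across the directories.
    """
    result = {}
    for directories in (10, 100):
        count = total_files // directories
        result[directories] = (count, count <= ESTIMATED_LIMIT_PER_DIRECTORY)
    return result
```

    With 500,000 saved files, this gives 50,000 files per directory in the 10 directory layout and 5,000 per directory in the 100 directory layout.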

    100 Directory Patch

    The patch for JobRescue may be downloaded from:

    http://www.nobix.com/download/JobPak/MoveTo100.zip

    This patch file is a Zip format file for your PC. Once un-zipped to a folder on your PC, you should read the enclosed README.TXT file for the patch installation instructions. Read through all of the instructions before attempting the installation. Before installation, you must determine the version of JobRescue that is currently installed on your machine -- just follow the instructions. If you have more than one machine with JobRescue, do not assume that all of your machines have the same version of JobRescue installed -- check each individually.

    The Reflection terminal emulator is required to be able to transfer the patch files to the HP3000 using "labels" transfer format. If you do not have access to Reflection and you wish to install this patch, please contact us so that we may supply you with an alternative method.

    Once the patch is installed, JobRescue will begin to save spoolfiles across 100 directories instead of 10. Your existing saved spoolfiles stay where they were originally put. Only new spoolfiles use the new directories. Eventually, as your old spoolfiles are aged off and deleted, all of the directories will normalize with about 1/100 of the total saved spoolfiles in each directory. This may take a period of time to accomplish depending on the storage and retention attributes you have set for your jobs and spoolfiles.

    We want to help you

    If you have any questions regarding this email, the patch, or whether or not you should implement it, please do not hesitate to contact us at support@nobix.com or 1-925-659-3500.

    Thanks for using JobRescue, ElectroPage, and JobQue. And thank you Randy and Terri for your help in the preparation of this email.



    Nobix, Inc.
    www.nobix.com
    1.925.659.3500

    Truly affordable Job Scheduling & Management, Notification & Alerting, and Environment Monitoring Solutions, for UNIX, Linux, Windows & MPE/iX

