Saturday, August 22, 2015

RAC 11GR2 Interview Questions


NODE EVICTION OVERVIEW

The Oracle Clusterware is designed to perform a node eviction by removing one or more nodes from the cluster if some critical problem is detected.  A critical problem could be a node not responding via a network heartbeat, a node not responding via a disk heartbeat, a hung or severely degraded machine, or a hung ocssd.bin process.  The purpose of this node eviction is to maintain the overall health of the cluster by removing bad members.  
Starting in 11.2.0.2 RAC (or if you are on Exadata), a node eviction may not actually reboot the machine.  This is called a rebootless restart.  In this case we restart most of the clusterware stack to see if that fixes the unhealthy node. 


1.0 - PROCESS ROLES FOR REBOOTS

OCSSD (aka CSS daemon) - This process is spawned by the cssdagent process. It runs in both vendor clusterware and non-vendor clusterware environments.  OCSSD's primary job is internode health monitoring and RDBMS instance endpoint discovery. The health monitoring includes a network heartbeat and a disk heartbeat (to the voting files).  OCSSD can also evict a node after escalation of a member kill from a client (such as a database LMON process). This is a multi-threaded process that runs at an elevated priority and runs as the Oracle user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin

CSSDAGENT - This process is spawned by OHASD and is responsible for spawning the OCSSD process, monitoring for node hangs (via oprocd functionality), monitoring the OCSSD process for hangs (via oclsomon functionality), and monitoring vendor clusterware (via vmon functionality).  This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent

CSSDMONITOR - This process also monitors for node hangs (via oprocd functionality), monitors the OCSSD process for hangs (via oclsomon functionality), and monitors vendor clusterware (via vmon functionality). This is a multi-threaded process that runs at an elevated priority and runs as the root user.
Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor
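On a healthy node you can confirm that these daemons are running, and see how they were spawned, with a simple process listing. A quick sketch (exact output and paths vary by platform and version):

$ ps -ef | grep -E 'ohasd|cssdagent|cssdmonitor|ocssd' | grep -v grep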


2.0 - DETERMINING WHICH PROCESS IS RESPONSIBLE FOR A REBOOT


Important files to review:
  • Clusterware alert log in <GRID_HOME>/log/<nodename>
  • The cssdagent log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdagent_root
  • The cssdmonitor log(s) in <GRID_HOME>/log/<nodename>/agent/ohasd/oracssdmonitor_root
  • The ocssd log(s) in <GRID_HOME>/log/<nodename>/cssd
  • The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
  • IPD/OS or OS Watcher data
  • 'opatch lsinventory -detail' output for the GRID home
  • OS messages files (locations by platform):
  • Linux: /var/log/messages
  • Sun: /var/adm/messages
  • HP-UX: /var/adm/syslog/syslog.log
  • IBM: /bin/errpt -a > messages.out
Please refer to the following document, which provides information on collecting most of the above files:
Document 1513912.1 - TFA Collector - Tool for Enhanced Diagnostic Gathering
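If TFA Collector (Document 1513912.1 above) is installed, a single diagcollect run gathers most of the files listed above from all nodes around the time of the problem. A minimal sketch, where the time window is illustrative and the option syntax can vary between TFA versions:

$ tfactl diagcollect -from "Aug/22/2015 08:00:00" -to "Aug/22/2015 12:00:00"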

11.2 Clusterware evictions should, in most cases, have some kind of meaningful error in the clusterware alert log.  This can be used to determine which process is responsible for the reboot.  Example message from a clusterware alert log:

[ohasd(11243)]CRS-8011:reboot advisory message from host: sta00129, component: cssagent, with timestamp: L-2009-05-05-10:03:25.340
[ohasd(11243)]CRS-8013:reboot advisory message text: Rebooting after limit 28500 exceeded; disk timeout 27630, network timeout 28500, last heartbeat from CSSD at epoch seconds 1241543005.340, 4294967295 milliseconds ago based on invariant clock value of 93235653

This particular eviction happened because the network timeout was exceeded.  CSSD exited and the cssdagent took action to evict. The cssdagent derives the information in the error message from local heartbeats received from CSSD.

If no message is in the evicted node's clusterware alert log, check the lastgasp logs on the local node and/or the clusterware alert logs of other nodes. 


3.0 - TROUBLESHOOTING OCSSD EVICTIONS


If you have encountered an OCSSD eviction review common causes in section 3.1 below. 


3.1 - COMMON CAUSES OF OCSSD EVICTIONS


  • Network failure or latency between nodes. By default it takes 30 consecutive missed network heartbeats (determined by the CSS misscount; you can check the configured value as shown after this list) to cause a node eviction. 
  • Problems writing to or reading from the CSS voting disk.  If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.
  • A member kill escalation.  For example, the database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism.  If this times out, it could escalate to a node kill. 
  • An unexpected failure or hang of the OCSSD process, which can be caused by any of the above issues or by something else.
  • An Oracle bug. 
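To see the heartbeat thresholds actually in effect on your cluster (the 11.2 defaults are 30 seconds for misscount and 200 seconds for disktimeout, but they can be changed), a quick check as the Grid owner or root:

$ crsctl get css misscount
$ crsctl get css disktimeout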

3.2 - FILES TO REVIEW AND GATHER FOR OCSSD EVICTIONS


All files from section 2.0 from all cluster nodes.  More data may be required.
Example of an eviction due to loss of voting disk:

CSS log:
2012-03-27 22:05:48.693: [ CSSD][1100548416](:CSSNM00018:)clssnmvDiskCheck: Aborting, 0 of 3 configured voting disks available, need 2
2012-03-27 22:05:48.693: [ CSSD][1100548416]###################################
2012-03-27 22:05:48.693: [ CSSD][1100548416]clssscExit: CSSD aborting from thread clssnmvDiskPingMonitorThread


OS messages:
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:All paths to Symm 000190104720 vol 0c71 are dead.
Mar 27 22:03:58 choldbr132p kernel: Error:Mpx:Symm 000190104720 vol 0c71 is dead.
Mar 27 22:03:58 choldbr132p kernel: Buffer I/O error on device sdbig, logical block 0
...

4.0 - TROUBLESHOOTING CSSDAGENT OR CSSDMONITOR EVICTIONS


If you have encountered a CSSDAGENT or CSSDMONITOR eviction review common causes in section 4.1 below.


4.1 - COMMON CAUSES OF CSSDAGENT OR CSSDMONITOR EVICTIONS


  • An OS scheduler problem.  For example, the OS is locked up in a driver or hardware, or there is an excessive amount of load on the machine (at or near 100% CPU utilization), preventing the scheduler from behaving reasonably.
  • One or more threads within the CSS daemon are hung.
  • An Oracle bug.


4.2 - FILES TO REVIEW AND GATHER FOR CSSDAGENT OR CSSDMONITOR EVICTIONS

All files from section 2.0 from all cluster nodes. More data may be required.

Importance of master node in a cluster:
- The master node has the lowest node-id in the cluster. Node-ids are assigned to nodes in the order in which they join the cluster. Hence, normally the node that joins the cluster first is the master node.
- The CRSd process on the master node is responsible for initiating the OCR backup as per the backup policy.
- The master node is also responsible for syncing the OCR cache across the nodes.
- The CRSd process on the master node reads from and writes to the OCR on disk.
- In case of a node eviction, the cluster is divided into two sub-clusters. The sub-cluster containing fewer nodes is evicted. If both sub-clusters have the same number of nodes, the sub-cluster containing the master node survives and the other sub-cluster is evicted.
- When the OCR master (crsd.bin process) stops or restarts for whatever reason, the crsd.bin on the surviving node with the lowest node number becomes the new OCR master.

The following methods can be used to find the OCR master:
1. Search crsd.l* on all nodes:
grep "OCR MASTER" $ORA_CRS_HOME/log/$HOST/crsd/crsd.l*

2. Query V$GES_RESOURCE to identify the master node.

How do you monitor block transfers over the interconnect between RAC nodes?

The v$cache_transfer and v$file_cache_transfer views can be used to examine RAC statistics. 

The types of blocks that use the cluster interconnect in a RAC environment are monitored with the v$cache_transfer series of views:

v$cache_transfer: This view shows the types and classes of blocks that Oracle transfers over the cluster interconnect on a per-object basis.

The forced_reads and forced_writes columns can be used to determine the types of objects the RAC instances are sharing. 
Values in the forced_writes column show how often a block of a given type is transferred out of a local buffer cache because the current version is requested by another instance. 
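A minimal example of using these columns, assuming the RAC views created by catclust.sql are available in your database (column availability can vary slightly by version):

SQL> select name, forced_reads, forced_writes
     from v$cache_transfer
     where forced_writes > 0
     order by forced_writes desc;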

Querying GV$CLUSTER_INTERCONNECTS view lists the interconnect used by all the participating instances of the RAC database. 

SQL> select INST_ID, IP_ADDRESS from GV$CLUSTER_INTERCONNECTS;
INST_ID IP_ADDRESS
---------- ----------------
1 192.168.261.1
2 192.168.261.2


New Features for Release 2 (11.2)

Oracle Automatic Storage Management and Oracle Clusterware Installation
With Oracle Grid Infrastructure 11g release 2 (11.2), Oracle Automatic Storage Management (Oracle ASM) and Oracle Clusterware are installed into a single home directory, which is referred to as the Grid Infrastructure home. Configuration assistants start after the installer interview process that configures Oracle ASM and Oracle Clusterware. The installation of the combined products is called Oracle Grid Infrastructure. However, Oracle Clusterware and Oracle Automatic Storage Management remain separate products.

Oracle Automatic Storage Management and Oracle Clusterware Files
With this release, Oracle Cluster Registry (OCR) and voting disks can be placed on Oracle Automatic Storage Management (Oracle ASM).
This feature enables Oracle ASM to provide a unified storage solution, storing all the data for the clusterware and the database, without the need for third-party volume managers or cluster filesystems. For new installations, OCR and voting disk files can be placed either on Oracle ASM or on a cluster file system or NFS system. Installing Oracle Clusterware files on raw or block devices is no longer supported, unless an existing system is being upgraded. 

Fix up Scripts and Grid Infrastructure Checks
With Oracle Clusterware 11g release 2 (11.2), Oracle Universal Installer (OUI) detects when minimum requirements for installation are not completed, and creates shell script programs, called fix up scripts, to resolve many incomplete system configuration requirements. If OUI detects an incomplete task that is marked "fixable", then you can easily fix the issue by generating the fix up script by clicking the Fix & Check Again button.

The fixup script is generated during installation. You are prompted to run the script as root in a separate terminal session. When you run the script, it raises kernel values to required minimums, if necessary, and completes other operating system configuration tasks. You also can have Cluster Verification Utility (CVU) generate fixup scripts before installation.

Grid Plug and Play
In the past, adding or removing servers in a cluster required extensive manual preparation. With this release, you can continue to configure server nodes manually or use Grid Plug and Play to configure them dynamically as nodes are added or removed from the cluster.

Grid Plug and Play reduces the costs of installing, configuring, and managing server nodes by starting a grid naming service within the cluster to allow each node to perform the following tasks dynamically:
■ Negotiating appropriate network identities for itself
■ Acquiring additional information it needs to operate from a configuration profile
■ Configuring or reconfiguring itself using profile data, making host names and addresses resolvable on the network

Because servers perform these tasks dynamically, the number of steps required to add or delete nodes is minimized.

Improved Input/Output Fencing Processes
Oracle Clusterware 11g release 2 (11.2) replaces the oprocd and hangcheck processes with the Cluster Synchronization Services daemon agent and monitor to provide more accurate recognition of hangs and to avoid false terminations.

Intelligent Platform Management Interface (IPMI) Integration
Intelligent Platform Management Interface (IPMI) is an industry standard management protocol that is included with many servers today. IPMI operates
independently of the operating system, and can operate even if the system is not powered on. Servers with IPMI contain a baseboard management controller (BMC) which is used to communicate to the server.
If IPMI is configured, then Oracle Clusterware uses IPMI when node fencing is required and the server is not responding.

SCAN for Simplified Client Access
With this release, the Single Client Access Name (SCAN) is the host name to provide for all clients connecting to the cluster. The SCAN is a domain name registered to at
least one and up to three IP addresses, either in the domain name service (DNS) or the Grid Naming Service (GNS).

The primary benefit of a Single Client Access Name (SCAN) is not having to update client connection information (such as TNSNAMES.ora) every time you add or remove nodes from an existing RAC cluster.

Clients use a simple EZconnect string and JDBC connections can use a JDBC thin URL to access the database, which is done independently of the physical hosts that the database instances are running on. Additionally, SCAN automatically provides both failover and load balancing of connects, where the new connection will be directed to the least busy instance in the cluster by default.

It should be noted here that because EZconnect is used with SCAN, the SQLNET.ora file should include EZconnect as one of the naming methods, for example:
NAMES.DIRECTORY_PATH=(tnsnames,ezconnect,ldap)
An EZconnect string would look like
sqlplus user/pass@mydb-scan:1521/myservice
A JDBC thin string would look like
jdbc:oracle:thin:@//mydb-scan:1521/myservice

It's highly recommended that the clients are Oracle 11g R2 clients, to allow them to fully take advantage of the failover with the SCAN settings.

The TNSNAMES.ora file would now reference the SCAN rather than the VIPs as was done in previous versions. A TNSNAMES entry would look like this:
MYDB =
(DESCRIPTION=
(ADDRESS=(PROTOCOL=TCP)(HOST=mydb-scan.ORACLE.COM)(PORT=1521))
(CONNECT_DATA=(SERVICE_NAME=myservice.ORACLE.COM)))

There are two methods available for defining the SCAN: use your corporate DNS, or use the Grid Naming Service (GNS).

Define Scan Using DNS

To use the DNS method for defining your SCAN, the network administrator must create a single name that resolves to three separate IP addresses using a round-robin algorithm. Regardless of how many systems are part of your cluster, Oracle recommends that 3 IP addresses are configured to allow for failover and load balancing.
It is important that the IP addresses are on the same subnet as the public network for the server. The other two requirements are that the name (not including the domain suffix) is 15 characters or less in length and that the name can be resolved without using the domain suffix. Also, the IP addresses should not be specifically assigned to any of the nodes in the cluster.

You can test the DNS setup by running an nslookup on the scan name two or more times. Each time, the IP addresses should be returned in a different order:
Syntax: nslookup mydatabase-scan

Define SCAN Using Grid Naming Service (GNS)

Using GNS assumes that a DHCP server is running on the public network with enough available addresses to assign the required IP addresses and the SCAN VIP. Only one static IP address is required to be configured and it should be in the DNS domain.

Database Parameters for SCAN

The database registers each instance with the SCAN listeners using the REMOTE_LISTENER parameter in the spfile. Oracle 11g R2 RAC databases will only register with the SCAN listeners. Upgraded databases, however, will continue to register with the local listener as well as the SCAN listeners via the REMOTE_LISTENER parameter. The LOCAL_LISTENER parameter would be set to the node VIP for upgraded systems.
The REMOTE_LISTENER parameter, rather than being set to an alias defined in a server-side TNSNAMES file (as in previous versions), is set simply to the SCAN and port. The alter command would be:
ALTER SYSTEM SET REMOTE_LISTENER='mydb-scan.oracle.com:1521';

POINTS TO BE NOTED FOR SCAN LISTENER


An Oracle Database 11g release 2 (11.2) database service automatically registers with the listeners specified in the LOCAL_LISTENER and REMOTE_LISTENER parameters. During registration, PMON sends information such as the service name, instance names, and workload information to the listeners. This feature is called service registration.

Services coordinate their sessions by registering their workload, or the amount of work they are currently handling, with the local listener and the SCAN listeners. Clients are redirected by the SCAN listener to a local listener on the least-loaded node that is running the instance for a particular service. This feature is called load balancing. The local listener either directs the client to a dispatcher process (if the database was configured for shared server), or directs the client to a dedicated server process.

When a listener starts after the Oracle instance starts, and the listener is available for service registration, registration does not occur until the next time the Oracle Database process monitor (PMON) starts its discovery routine. By default, the PMON discovery routine is started every 60 seconds. To override the 60-second delay, use the SQL statement ALTER SYSTEM REGISTER. This statement forces PMON to register the service immediately.
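If you change LOCAL_LISTENER or REMOTE_LISTENER, or do not want to wait for the next PMON discovery cycle, you can force registration immediately and then check the SCAN listener. A minimal sketch (LISTENER_SCAN1 is the default SCAN listener name in 11.2; your name may differ, check with srvctl config scan_listener):

SQL> ALTER SYSTEM REGISTER;

$ lsnrctl status LISTENER_SCAN1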

Local Listeners

- Starting with Oracle Database 11g release 2 (11.2), the local listener, or default listener, is located in the Grid home (Grid_home/network/admin) when you have Oracle Grid Infrastructure installed.
- With Oracle Clusterware 11g release 2 and later, the listener associations no longer require tnsnames.ora file entries. The listener associations are configured as follows (see the example after this list):
  - DBCA no longer sets the LOCAL_LISTENER parameter. The Oracle Clusterware agent that starts the database sets the LOCAL_LISTENER parameter dynamically, and it sets it to the actual value, not an alias. So listener_alias entries are no longer needed in the tnsnames.ora file.
  - The REMOTE_LISTENER parameter is configured by DBCA to reference the SCAN and SCAN port, without any need for a tnsnames.ora entry. Oracle Clusterware uses the Easy Connect naming method with scanname:scanport, so no listener associations for the REMOTE_LISTENER parameter are needed in the tnsnames.ora file.
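To verify what the Oracle Clusterware agent and DBCA have actually set on a running instance, a minimal sketch (connect to any instance as a DBA user):

SQL> show parameter local_listener
SQL> show parameter remote_listener

With the defaults described above, LOCAL_LISTENER resolves to the node VIP address and REMOTE_LISTENER contains scanname:scanport.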

Three SCAN addresses are configured for the cluster, and allocated to servers. When a client issues a connection request using SCAN, the three SCAN addresses are returned to the client. If the first address fails, then the connection request to the SCAN name fails over to the next address. Using multiple addresses allows a client to connect to an instance of the database even if the initial instance has failed.

The net service name does not need to know the physical address of the server on which the database, database instance, or listener runs. SCAN is resolved by DNS, which returns three IP addresses to the client. The client then tries each address in succession until a connection is made.

Understanding SCAN

SCAN is a fully qualified name (host name.domain name) that is configured to resolve to all the addresses allocated for the SCAN listeners. 

The default value for SCAN is cluster_name.GNS_sub_domain, or cluster_name-scan.domain_name if GNS is not used. For example, in a cluster that does not use GNS, if your cluster name is salesRAC and your domain is example.com, then the default SCAN address is salesRAC-scan.example.com:1521.

SCAN is configured in DNS to resolve to three IP addresses, and DNS should return the addresses using a round-robin algorithm. This means that when SCAN is resolved by DNS, the IP addresses are returned to the client in a different order each time.
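To see how the SCAN and the SCAN listeners are configured on an existing cluster, a quick sketch using standard srvctl and nslookup commands (the SCAN name mydb-scan is hypothetical):

$ srvctl config scan
$ srvctl config scan_listener
$ srvctl status scan_listener
$ nslookup mydb-scan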

Based on the environment, the following actions occur when you use SCAN to connect to an Oracle RAC database using a service name:

1. The PMON process of each instance registers the database services with the default listener on the local node and with each SCAN listener, which is specified by the REMOTE_LISTENER database parameter.
2. The listeners are dynamically updated on the amount of work being handled by the instances and dispatchers. The client issues a database connection request using a connect descriptor of the form:
   orausr/@scan_name:1521/sales.example.com
   Note: If you use the Easy Connect naming method, then ensure the sqlnet.ora file on the client contains EZCONNECT in the list of naming methods specified by the NAMES.DIRECTORY_PATH parameter.
3. The client uses DNS to resolve scan_name. After DNS returns the three addresses assigned to SCAN, the client sends a connect request to the first IP address. If the connect request fails, then the client attempts to connect using the next IP address.
4. When the connect request is successful, the client connects to a SCAN listener for the cluster that hosts the sales database. The SCAN listener compares the workload of the instances sales1 and sales2 and the workload of the nodes on which they are running. Because node2 is less loaded than node1, the SCAN listener selects node2 and sends the address for the listener on that node back to the client. The client connects to the local listener on node2. The local listener starts a dedicated server process for the connection to the database. The client connects directly to the dedicated server process on node2 and accesses the sales2 database instance.


SRVCTL Command Enhancements for Patching
With this release, you can use the server control utility SRVCTL to shut down all Oracle software running within an Oracle home, in preparation for patching. Oracle Grid Infrastructure patching is automated across all nodes, and patches can be applied in a multi-node, multi-patch fashion.

Typical Installation Option
To streamline cluster installations, especially for those customers who are new to clustering, Oracle introduces the Typical Installation path. Typical installation defaults as many options as possible to those recommended as best practices.

Voting Disk Backup Procedure Change
In prior releases, backing up the voting disks using a dd command was a required post-installation task. With Oracle Clusterware release 11.2 and later, backing up and restoring a voting disk using the dd command is not supported. Backing up voting disks manually is no longer required, because voting disks are backed up automatically in the OCR as part of any configuration change. Voting disk data is automatically restored to any added voting disks.

Oracle RAC Background Processes
The GCS and GES processes, together with the GRD, collaborate to enable Cache Fusion. The Oracle RAC processes and their identifiers are as follows:
 ACMS: Atomic Controlfile to Memory Service (ACMS)
In an Oracle RAC environment, the ACMS per-instance process is an agent that contributes to ensuring a distributed SGA memory update is either globally
committed on success or globally aborted if a failure occurs.
 GTX0-j: Global Transaction Process
The GTX0-j process provides transparent support for XA global transactions in an Oracle RAC environment. The database autotunes the number of these processes
based on the workload of XA global transactions.
 LMD: Global Enqueue Service Daemon (GES)
The LMD process manages incoming remote resource requests within each instance.
 LMON: Global Enqueue Service Monitor (GES)
The LMON process manages the GES; it maintains consistency of the GCS memory structures in case of process death. It is also responsible for cluster and lock reconfiguration (node joining or leaving), checks for instance deaths, and listens for local messaging. A detailed log file is created that tracks any reconfigurations that have happened.
 LMSn: Lock Manager Server process (GCS)
This is the Cache Fusion part and the most active process; it handles the consistent copies of blocks that are transferred between instances.
It receives requests from LMD to perform lock requests. It rolls back any uncommitted transactions.
There can be up to ten LMS processes running, and they can be started dynamically if demand requires it.

They manage lock manager service requests for GCS resources and send them to a service queue to be handled by the LMSn process.
LMSn also handles global deadlock detection and monitors for lock conversion timeouts.
As a performance gain you can increase this process's priority to make sure CPU starvation does not occur.
You can see the statistics of this daemon by looking at the view X$KJMSDP.
 LCK0: Instance Enqueue Process
The LCK0 process manages non-Cache Fusion resource requests such as library and row cache requests.
 RMSn: Oracle RAC Management Processes (RMSn)
The RMSn processes perform manageability tasks for Oracle RAC. Tasks accomplished by an RMSn process include creation of resources related to Oracle
RAC when new instances are added to the clusters.
RSMN: Remote Slave Monitor manages background slave process creation and communication on remote instances. These background slave processes perform tasks on behalf of a coordinating process running in another instance.

DIAG: Diagnostic Daemon           
This is a lightweight process; it uses the DIAG framework to monitor the health of the cluster. It captures information for later diagnosis in the event of failures. It will perform any necessary recovery if an operational hang is detected.
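To see which of these background processes are running for a given instance, a minimal sketch (the instance name ORCL1 is hypothetical; OS process names follow the ora_<name>_<SID> pattern):

$ ps -ef | grep -E 'ora_(lmon|lmd|lms|lck|diag|acms|gtx|rms)[0-9]*_ORCL1' | grep -v grep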

Top 5 issues for Instance Eviction (Doc ID 1374110.1)

srvctl status database -d HRPRD
Instance HRPRD1 is running on node hrprddb1
Instance HRPRD2 is running on node hrprddb2
============================================
$crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   8c377cc40f344f0dff8da3d9b0d7d610 (/psvoting01/hrprd_vote) []
 2. ONLINE   a371af8108beffcfbfb43e12ff17d2c4 (/psvoting02/hrprd_vote) []
 3. ONLINE   8b2ee6a3aabcefc6bf61114b6802cc48 (/psvoting03/hrprd_vote) []
Located 3 voting disk(s).

Cluster Verification Utility (CVU) command to verify OCR integrity of all of the nodes in your cluster database:
cluvfy comp ocr -n all -verbose

List the nodes in your cluster by running the following command on one node:
olsnodes

Listing Backup Files
$ocrconfig -showbackup

eldevdb2     2014/06/24 15:09:02     /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/backup00.ocr
eldevdb2     2014/06/24 11:09:00     /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/backup01.ocr
eldevdb2     2014/06/24 07:08:58     /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/backup02.ocr
eldevdb2     2014/06/23 03:08:44     /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/day.ocr
eldevdb2     2014/06/11 23:06:00     /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/week.ocr
hrdevdb2     2011/08/18 12:53:04     /oracrs/app/11.2.0/grid/cdata/hhcpsoft/backup_20110818_125304.ocr

ocrconfig - Configuration tool for Oracle Cluster/Local Registry.

Synopsis:
        ocrconfig [option]
        option:
                [-local] -export <filename>
                                                    - Export OCR/OLR contents to a file
                [-local] -import <filename>         - Import OCR/OLR contents from a file
                [-local] -upgrade [<user> [<group>]]                   - Upgrade OCR from previous version
                -downgrade [-version <version string>]             - Downgrade OCR to the specified version
                [-local] -backuploc <dirname>       - Configure OCR/OLR backup location
                [-local] -showbackup [auto|manual]  - Show OCR/OLR backup information
                [-local] -manualbackup              - Perform OCR/OLR backup
                [-local] -restore <filename>        - Restore OCR/OLR from physical backup
                -replace <current filename> -replacement <new filename>   - Replace a OCR device/file <filename1> with <filename2>
                -add <filename>                     - Add a new OCR device/file
                -delete <filename>                  - Remove a OCR device/file
                -overwrite                          - Overwrite OCR configuration on disk
                -repair -add <filename> | -delete <filename> | -replace <current filename> -replacement <new filename>
                                                    - Repair OCR configuration on the local node
                -help                               - Print out this help information

Run the following command to inspect the contents and verify the integrity of the backup file:
ocrdump -backupfile backup_file_name

Verify the integrity of OCR:
ocrcheck
Run ocrcheck and if the command returns a failure message, then both the primary OCR and the OCR mirror have failed.

The OCRCHECK utility displays the version of the OCR's block format, total space available and used space, OCRID, and the OCR locations that you have configured. OCRCHECK performs a block-by-block checksum operation for all of the blocks in all of the OCRs that you have configured. It also returns an individual status for each file as well as a result for the overall OCR integrity check.

OCRCHECK creates a log file in the directory CRS_home/log/hostname/client. To change the amount of logging, edit the file CRS_home/srvm/admin/ocrlog.ini.

$./ocrcheck -local
Status of Oracle Local Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       2584
         Available space (kbytes) :     259536
         ID                       :  814444380
         Device/File Name         : /oracrs/app/11.2.0/grid/cdata/hrprddb2.olr
                                    Device/File integrity check succeeded

         Local registry integrity check succeeded
         Logical corruption check bypassed due to non-privileged user

Run the following command to inspect the contents and verify the integrity of the backup file:
ocrdump -backupfile /oracrs/app/11.2.0.3/grid/cdata/hhcpsoft/backup01.ocr

The number of voting files you can store in a particular Oracle ASM disk group depends upon the redundancy of the disk group.
- External redundancy: A disk group with external redundancy can store only one voting disk
- Normal redundancy: A disk group with normal redundancy stores three voting disks
- High redundancy: A disk group with high redundancy stores five voting disks

To migrate voting disks to Oracle ASM, specify the Oracle ASM disk group name in the following command:
$ crsctl replace votedisk +asm_disk_group

Backing up Voting Disks

In Oracle Clusterware 11g release 2 (11.2), you no longer have to back up the voting disk. The voting disk data is automatically backed up in OCR as part of any configuration change and is automatically restored to any voting disk that is added. If all voting disks are corrupted, however, you can restore them as described in the next section.

Restoring Voting Disks

If all of the voting disks are corrupted, then you can restore them, as follows:

I. Restore OCR: This step is necessary only if OCR is also corrupted or otherwise unavailable,
such as if OCR is on Oracle ASM and the disk group is no longer available.

Restoring Oracle Cluster Registry

If a resource fails, then before attempting to restore OCR, restart the resource.
As a definitive verification that OCR failed, run ocrcheck and if the command returns a failure message,
then both the primary OCR and the OCR mirror have failed. Attempt to correct the problem using the
OCR restoration procedure for your platform.

Use the following procedure to restore OCR on Linux or UNIX systems:

1. List the nodes in your cluster by running the following command on one node:
$ olsnodes

2. Stop Oracle Clusterware by running the following command as root on all of the nodes:
# crsctl stop crs
If the preceding command returns any error due to OCR corruption, stop Oracle Clusterware by running the following command as root on all of the nodes:

# crsctl stop crs -f
3. If you are restoring OCR to a cluster file system or network file system, then run the following command as root to restore OCR with an OCR backup that you can identify in "Listing Backup Files":

# ocrconfig -restore file_name
After you complete this step, proceed to step 10.

4. Start the Oracle Clusterware stack on one node in exclusive mode by running the following command as root:

# crsctl start crs -excl -nocrs
The -nocrs option ensures that the crsd process and OCR do not start with the rest of the Oracle Clusterware stack.

Ignore any errors that display.

Check whether crsd is running. If it is, then stop it by running the following command as root:

# crsctl stop resource ora.crsd -init
Caution:
Do not use the -init flag with any other command.
5. If you want to restore OCR to an Oracle ASM disk group, then you must first create a disk group using SQL*Plus that has the same name as the disk group you want to restore and mount it on the local node.

If you cannot mount the disk group locally, then run the following SQL*Plus command:

SQL> drop diskgroup disk_group_name force including contents;
Optionally, if you want to restore OCR to a raw device, then you must run the ocrconfig -repair -replace command as root, assuming that you have all the necessary permissions on all nodes to do so and that OCR was not previously on Oracle ASM.

6. Restore OCR with an OCR backup that you can identify in "Listing Backup Files" by running the following command as root:

# ocrconfig -restore file_name
Notes:
Ensure that the OCR devices that you specify in the OCR configuration exist and that these OCR devices are valid.

If you configured OCR in an Oracle ASM disk group, then ensure that the Oracle ASM disk group exists and is mounted.

See Also:
Oracle Grid Infrastructure Installation Guide for information about creating OCRs

Oracle Automatic Storage Management Administrator's Guide for more information about Oracle ASM disk group management

7. Verify the integrity of OCR:
# ocrcheck

8. Stop Oracle Clusterware on the node where it is running in exclusive mode:
# crsctl stop crs -f

9. Run the ocrconfig -repair -replace command as root on all the nodes in the cluster where you did not run the ocrconfig -restore command. For example, if you ran the ocrconfig -restore command on node 1 of a four-node cluster, then you must run the ocrconfig -repair -replace command on nodes 2, 3, and 4.

10. Begin to start Oracle Clusterware by running the following command as root on all of the nodes:
# crsctl start crs

11. Verify OCR integrity of all of the cluster nodes that are configured as part of your cluster by running the following CVU command:
$ cluvfy comp ocr -n all -verbose


II. Run the following command as root from only one node to start the Oracle Clusterware stack in exclusive mode,
which does not require voting files to be present or usable:
# crsctl start crs -excl

III. Run the crsctl query css votedisk command to retrieve the list of voting files currently defined, similar to the following:
$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   7c54856e98474f61bf349401e7c9fb95 (/dev/sdb1) [DATA]

This list may be empty if all voting disks were corrupted, or may have entries that are marked as status 3 or OFF.
IV. Depending on where you store your voting files, do one of the following:

If the voting disks are stored in Oracle ASM, then run the following command to migrate the voting disks to the Oracle ASM disk group you specify:
crsctl replace votedisk +asm_disk_group

The Oracle ASM disk group to which you migrate the voting files must exist in Oracle ASM. You can use this
command whether the voting disks were stored in Oracle ASM or some other storage device.
If you did not store voting disks in Oracle ASM, then run the following command using the File Universal Identifier (FUID) obtained in the previous step:

$ crsctl delete css votedisk FUID
Add a voting disk, as follows:

$ crsctl add css votedisk path_to_voting_disk
V. Stop the Oracle Clusterware stack as root:

# crsctl stop crs
Note:
If the Oracle Clusterware stack is running in exclusive mode, then use the -f option to force the shutdown of the stack.
VI. Restart the Oracle Clusterware stack in normal mode as root:

# crsctl start crs

Voting Files stored in ASM - How many disks per disk group do I need?
If Voting Files are stored in ASM, the ASM disk group that hosts the Voting Files will place the appropriate number of Voting Files in accordance to the redundancy level. Once Voting Files are managed in ASM, a manual addition, deletion, or replacement of Voting Files will fail, since users are not allowed to manually manage Voting Files in ASM.

If the redundancy level of the disk group is set to "external", 1 Voting File is used.
If the redundancy level of the disk group is set to "normal", 3 Voting Files are used.
If the redundancy level of the disk group is set to "high", 5 Voting Files are used.

Note that Oracle Clusterware keeps track of which disks within the disk group hold the Voting Files; Oracle Clusterware does not rely on ASM to access the Voting Files.

In addition, note that there can be only one Voting File per failure group. In the above list of rules, it is assumed that each disk that is supposed to hold a Voting File resides in its own, dedicated failure group.

In other words, a disk group that is supposed to hold the above mentioned number of Voting Files needs to have the respective number of failure groups with at least one disk. (1 / 3 / 5 failure groups with at least one disk)

Consequently, a normal redundancy ASM disk group, which is supposed to hold Voting Files, requires 3 disks in separate failure groups, while a normal redundancy ASM disk group that is not used to store Voting Files requires only 2 disks in separate failure groups.

What happens if I lose my voting disk(s)?
If you lose half or more of your voting disks, then nodes get evicted from the cluster, or nodes kick themselves out of the cluster. This does not threaten database corruption. Alternatively you can use external redundancy, which means you are providing redundancy at the storage level using RAID.
For this reason, when using Oracle-managed redundancy for your voting disks, Oracle recommends that customers use 3 or more voting disks (starting with Oracle RAC 10g Release 2). Note: for best availability, the 3 voting files should be on physically separate disks. It is recommended to use an odd number, as 4 disks are not any more highly available than 3: with 3 disks a node must access more than half, i.e. at least 2, and with 4 disks it must access at least 3, so losing 2 disks brings the cluster down whether you have 3 or 4 voting disks.
Restoring corrupted voting disks is easy since there isn't any significant persistent data stored in the voting disk. See the Oracle Clusterware Admin and Deployment Guide for information on backup and restore of voting disks.

An odd number of voting disks is required for proper clusterware configuration. A node must be able to strictly access more than half of the voting disks at any time. So, in order to tolerate a failure of n voting disks, there must be at least 2n+1 configured. (n=1 means 3 voting disks).

Why should we have an odd number of voting disks?
The odd number of voting disks should be configured to provide a method to determine who in the cluster should survive.
A node must be able to access more than half of the voting disks at any time. For example, consider a two-node cluster with an even number of voting disks, say 2. Suppose Node1 can access only voting disk1 and Node2 can access only voting disk2. This means there is no common file where the clusterware can check the heartbeat of both nodes. If we have 3 voting disks and both nodes can access more than half, i.e. 2 voting disks, there will be at least one disk accessible by both nodes, and the clusterware can use that disk to check the heartbeat of both nodes. Hence, each node should be able to access more than half the number of voting disks. A node not able to do so will have to be evicted from the cluster by another node that has more than half the voting disks, to maintain the integrity of the cluster. After the cause of the failure has been corrected and access to the voting disks has been restored, you can instruct Oracle Clusterware to recover the failed node and restore it to the cluster.
   Loss of more than half your voting disks will cause the entire cluster to fail!!

HOW TO IDENTIFY THE MASTER NODE IN RAC?
- The master node has the lowest node-id in the cluster. Node-ids are assigned to nodes in the order in which they join the cluster, so normally the node that joins the cluster first is the master node.
- The CRSd process on the master node is responsible for initiating the OCR backup as per the backup policy.
- The master node is also responsible for syncing the OCR cache across the nodes.
- The CRSd process on the master node reads from and writes to the OCR on disk.
- In case of a node eviction, the cluster is divided into two sub-clusters. The sub-cluster containing fewer nodes is evicted. If both sub-clusters have the same number of nodes, the sub-cluster containing the master node survives and the other sub-cluster is evicted.

Oracle Clusterware master information can be found (example commands are shown below):

- by scanning the ocssd logs from the various nodes
- by scanning the crsd logs from the various nodes
- by identifying the node which takes the backup of the OCR
If the master node gets evicted or rebooted, another node becomes the master.
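A minimal sketch of these checks (paths assume the standard 11.2 layout under the Grid home; the exact log message text can vary by version):

grep -i "master node number" $ORA_CRS_HOME/log/$HOST/cssd/ocssd.l*
grep -i "OCR MASTER" $ORA_CRS_HOME/log/$HOST/crsd/crsd.l*
ocrconfig -showbackup

The node listed against the most recent automatic OCR backups in the ocrconfig -showbackup output is normally the current master.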
HOW TO FIND THE RESOURCE MASTER?
In RAC, every data block is mastered by an instance. Mastering a block simply means that the master instance keeps track of the state of the block until the next reconfiguration event.
Remastering can be triggered:
    - manually
    - by resource affinity
    - by an instance crash
- Method I gets info about the master node from v$gcspfmaster_info using data_object_id
- Method II gets info about the master node from v$dlm_ress and v$ges_enqueue using the resource name in hexadecimal format
- Method III gets info about the master node from x$kjbl joined with x$le using the resource name in hexadecimal format
– CURRENT SCENARIO -
- 3 node setup
- name of the database – orcl
— SETUP –
SYS@NODE1>create table scott.emp1 as select * from scott.emp;
– Get data_object_id for scott.emp1
SQL> col owner for a10
col data_object_id for 9999999
col object_name for a15
select owner, data_object_id, object_name from dba_objects where owner = 'SCOTT' and object_name = 'EMP1';
OWNER      DATA_OBJECT_ID OBJECT_NAME
———- ————– —————
SCOTT               74652 EMP1
For Method-II and Method-III, we need to find out file_id and block_id and hence GCS  resource name in hexadecimal format
 – Get File_id and range of block_ids of emp1 table
– It can be seen that emp1 lies in block 523 of file 4.
SQL>select dbms_rowid.rowid_relative_fno(rowid) FILE_NO,
min(dbms_rowid.rowid_block_number(rowid)) MIN_BLOCK_ID,
max(dbms_rowid.rowid_block_number(rowid))  MAX_BLOCK_ID
from scott.emp1
group by dbms_rowid.rowid_relative_fno(rowid);
   FILE_NO MIN_BLOCK_ID MAX_BLOCK_ID
———- ———— ————
         4          523          523
– Find the GCS resource name to be used in the query, using the block_id and data_object_id retrieved above.
   x$kjbl.kjblname = resource name in hexadecimal format ([id1],[id2],[type])
   x$kjbl.kjblname2 = resource name in decimal format
   The hex name will be used to query the resource master using v$dlm_ress, v$ges_enqueue, x$kjbl
   and x$le
SQL> col hexname for a25
col resource_name for a15
select b.kjblname hexname, b.kjblname2 resource_name
from x$le a, x$kjbl b
where a.le_kjbl=b.kjbllockp
and a.le_addr = ( select le_addr
from x$bh
where dbablk = 523
and obj    = 74652
and class  = 1
and state   <> 3);
HEXNAME                   RESOURCE_NAME
————————- —————
[0x20b][0x4],[BL]         523,4,BL
– Manually master the EMP table to node1 –
SYS@NODE1>oradebug lkdebug -m pkey <objectid>
SYS@NODE1>oradebug lkdebug -m pkey 74652
—- GET RESOURCE MASTER NAME ———-
Method – I gets info about master node from v$gcspfmaster_info using data_object_id
– ——-
– Note that current master is node1 (Node numbering starts from 0)
SYS@node1>col object_name for A10
        select o.object_name, m.CURRENT_MASTER
from   dba_objects o, v$gcspfmaster_info m
where o.data_object_id=74652
and m.data_object_id = 74652 ;
OBJECT_NAM CURRENT_MASTER
———- ————–
EMP1                    0
—- Method II gets info about master node from v$dlm_ress and v$ges_enqueue
     using resource name in hexadecimal format
– check that master node is node1 (node numbering starts with 0)
SYS@NODE1>col resource_name for a22
select a.resource_name, a.master_node
from v$dlm_ress a, v$ges_enqueue b
where upper(a.resource_name) = upper(b.resource_name1)
and a.resource_name like '%[0x20b][0x4],[BL]%';
RESOURCE_NAME          MASTER_NODE
———————- ———–
[0x20b][0x4],[BL]                0
Method – III gets info about master node from x$kjbl with x$le
             using resource name in hexadecimal format
–  This SQL joins   x$kjbl with x$le to retrieve resource master for a block
– Note that current master is node1(MASTER=0)
SYS@NODE1> select kj.kjblname, kj.kjblname2, kj.kjblmaster master
from (select kjblname, kjblname2, kjblowner, kjblmaster, kjbllockp
from x$kjbl
where kjblname = '[0x20b][0x4],[BL]'
) kj, x$le le
where le.le_kjbl = kj.kjbllockp
order by le.le_addr;
KJBLNAME                       KJBLNAME2                          MASTER
—————————— —————————— ———-
[0x20b][0x4],[BL]              523,4,BL                                     0

RAC Troubleshooting, RAC LOG FILES
Cluster Ready Services Daemon (crsd) Log Files: CRS_home/log/hostname/crsd

Oracle Cluster Registry (OCR) records
The OCR tools (OCRDUMP, OCRCHECK, OCRCONFIG) record log information in the following location:
CRS_home/log/hostname/client
To change the amount of logging, edit the file CRS_home/srvm/admin/ocrlog.ini.

The OCR server records log information in the following location:
CRS_home/log/hostname/crsd
To change the amount of logging, edit the file CRS_home/log/hostname/crsd/crsd.ini.

Oracle Process Monitor Daemon (OPROCD)
The following path is specific to Linux: /etc/oracle/hostname.oprocd.log
This path is dependent upon the installed Linux or UNIX platform.


Cluster Synchronization Services (CSS): CRS_home/log/hostname/cssd
Event Manager (EVM) information generated by evmd: CRS_home/log/hostname/evmd
Oracle RAC RACG
The Oracle RAC high availability trace files are located in the following two locations:
CRS_home/log/hostname/racg

$ORACLE_HOME/log/hostname/racg
Core files are in subdirectories of the log directory.
Each RACG executable has a subdirectory assigned exclusively for that executable.
The name of the RACG executable subdirectory is the same as the name of the executable.

Enable Threads for remaining 3 nodes  

Alter database enable public thread 2;
Alter database enable public thread 3;
Alter database enable public thread 4;

Shut down the database on the primary node vmohswort018.

Log in to the DB nodes below as the root user and perform the following steps:
vmohswort022
vmohswort026
vmohswort029

cd /oracrs/oracle/product/112/bin
./crsctl start crs
./crsctl enable crs

Log in to the vmohswort022 server as the orpwor1i OS user:
SQL>startup mount;
alter system set undo_tablespace='APPS_UNDOTS2' scope=spfile sid='PWOR1I2';
alter system set undo_tablespace='APPS_UNDOTS3' scope=spfile sid='PWOR1I3';
alter system set undo_tablespace='APPS_UNDOTS4' scope=spfile sid='PWOR1I4';
alter database open;
SQL> alter database disable thread 1;
SQL> alter database enable public thread 1;
SQL> shutdown immediate;
Start the DB and DB listener on all RAC nodes.

RCONFIG TO CONVERT RAC

1. As the oracle user, navigate to the directory $ORACLE_HOME/assistants/rconfig/sampleXMLs, and open the file ConvertToRAC.xml using a text editor, such as vi.
2. Review the ConvertToRAC.xml file, and modify the parameters as required for your system. The XML sample file contains comment lines that provide instructions for how to configure the file.

When you have completed making changes, save the file with the syntax filename.xml. Make a note of the name you select.
3. Navigate to the directory $ORACLE_HOME/bin, and use the following syntax to run the command rconfig: rconfig input.xml
Where input.xml is the name of the XML input file you configured in step 2.
For example, if you create an input XML file called convert.xml, then enter the following command
$./rconfig convert.xml

Note:
The Convert verify option in the ConvertToRAC.xml file has three options:
Convert verify="YES": rconfig performs checks to ensure that the prerequisites for single-instance to Oracle RAC conversion have been met before it starts conversion
Convert verify="NO": rconfig does not perform prerequisite checks, and starts conversion
Convert verify="ONLY": rconfig only performs prerequisite checks; it does not start conversion after completing the prerequisite checks

If the conversion fails, then use the following procedure to recover and reattempt the conversion:
- Attempt to delete the database using the DBCA delete database option.
- Restore the source database.
- Review the conversion log, and fix any problems it reports that may have caused the conversion failure. The rconfig log files are under the rconfig directory in $ORACLE_BASE/cfgtoollogs.
- Reattempt the conversion.



RESTORE VOTE DISK


root@hrdevdb2 /oracrs/app/11.2.0.3/grid/bin =>crsctl query css votedisk

Start CRS in exclusive mode
./crsctl start crs -excl

Query for voting disk
./crsctl query css votedisk

If you did not store voting disks in Oracle ASM, then run the following command using the File Universal Identifier (FUID) obtained in the previous step:
$ crsctl delete css votedisk FUID
./crsctl delete css votedisk a4a849393fb14f4fbf92cbef0d2d215a
./crsctl delete css votedisk a8906deabdd3ff29ff0df2f382271ce0

Add a voting disk, as follows:
./crsctl add css votedisk /appvotingocrtie/vote/hhcvote1
./crsctl add css votedisk /apptmpctrltrcenv/vote/hhcvote2
./crsctl query css votedisk

root@hrdevdb2 /oracrs/app/11.2.0.3/grid/network/admin =>/oracrs/app/11.2.0.3/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   70165ff16d50df98ff135ecded8bd82e (/appvotingocr/vote/hhcvote0) []
 2. ONLINE   69e63cb962f84f1cbf3055292d79fe86 (/appvotingocrtie/vote/hhcvote1) []
 3. ONLINE   8a844b35ed364f85bfb3ee3ab8262215 (/apptmpctrltrcenv/vote/hhcvote2) []

root@eldevdb2 /apptmpctrltrcenv/vote =>/oracrs/app/11.2.0.3/grid/bin/crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   70165ff16d50df98ff135ecded8bd82e (/appvotingocr/vote/hhcvote0) []
 2. ONLINE   69e63cb962f84f1cbf3055292d79fe86 (/appvotingocrtie/vote/hhcvote1) []
 3. ONLINE   8a844b35ed364f85bfb3ee3ab8262215 (/apptmpctrltrcenv/vote/hhcvote2) []
Located 3 voting disk(s).

root@eldevdb2 /apptmpctrltrcenv/vote =>/oracrs/app/11.2.0.3/grid/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       8916
         Available space (kbytes) :     253204
         ID                       :   38393080
         Device/File Name         : /appvotingocr/ocr/hhcocr
                                    Device/File integrity check succeeded
         Device/File Name         : /appvotingocrtie/ocr/hhcmirrocr
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

         Cluster registry integrity check succeeded
         Logical corruption check succeeded

root@hrdevdb2 /oracrs/app/11.2.0.3/grid/network/admin =>/oracrs/app/11.2.0.3/grid/bin/ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          3
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       8916
         Available space (kbytes) :     253204
         ID                       :   38393080
         Device/File Name         : /appvotingocr/ocr/hhcocr
                                    Device/File integrity check succeeded
         Device/File Name         : /appvotingocrtie/ocr/hhcmirrocr
                                    Device/File integrity check succeeded

                                    Device/File not configured

                                    Device/File not configured

                                    Device/File not configured

         Cluster registry integrity check succeeded

         Logical corruption check succeeded


Current Read
A current read is one where a session reads the current value of the data block from another instance's Data Buffer Cache. This current value contains the most up-to-date committed data. A current read happens when a second instance needs a data block that has not been changed. This is often thought of as a read/read situation. A current read will be seen as any wait event that starts with gc current.

Consistent Read  
A consistent read is needed when a particular block is being accessed/modified by transaction T1 and at the same time another transaction T2 tries to access/read the block. If T1 has not been committed, T2 needs a consistent read (consistent to the non-modified state of the database) copy of the block to move ahead. A CR copy is created using the UNDO data for that block.   A sample series of steps for a CR in a normal setup would be:  
1.     Process tries to read a data block
2.     Finds an active transaction in the block
3.     Then checks the UNDO segment to see if the transaction has been committed or not
4.     If the transaction has been committed, it creates the REDO records and reads the block
5.     If the transaction has not been committed, it creates a CR block for itself using the UNDO/ROLLBACK information.
6.     Creating a CR image in RAC is a bit different and can come with some I/O overhead. This is because the UNDO could be spread across instances, so to build a CR copy of the block the instance might have to visit UNDO segments on other instances and hence perform extra I/O. The cache-fusion wait events involved can be checked as shown below.
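A minimal sketch of checking how much time an instance has spent on these current and consistent-read block transfers, using the standard v$system_event view:

SQL> select event, total_waits, time_waited
     from v$system_event
     where event like 'gc current%' or event like 'gc cr%'
     order by time_waited desc;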

As you said, the voting disks and OCR reside in ASM disk groups, but as per the startup sequence OCSSD starts before ASM. How is that possible?

How does OCSSD start if the voting disks and OCR reside in ASM disk groups?
You might wonder how CSSD, which is required to start the clustered ASM instance, can be started if voting disks are stored in ASM? This sounds like a chicken-and-egg problem: without access to the voting disks there is no CSS, hence the node cannot join the cluster. But without being part of the cluster, CSSD cannot start the ASM instance. To solve this problem the ASM disk headers have new metadata in 11.2: you can use kfed to read the header of an ASM disk containing a voting disk. The kfdhdb.vfstart and kfdhdb.vfend fields tell CSS where to find the voting file. This does not require the ASM instance to be up. Once the voting disks are located, CSS can access them and joins the cluster.
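A minimal sketch of reading that header metadata with kfed (the device path /dev/oracleasm/disks/DATA1 is hypothetical; substitute one of your ASM disks and run as the Grid owner or root):

$GRID_HOME/bin/kfed read /dev/oracleasm/disks/DATA1 | grep -E 'vfstart|vfend'

Non-zero kfdhdb.vfstart and kfdhdb.vfend values indicate that the disk contains a voting file, which is how CSSD locates the voting files before the ASM instance is up.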

