

Parallel Server Architecture Review

The objectives of OPS are to provide high availability by supporting clustering technologies, and to provide scalability by increasing the total CPU and RAM available to a database. As you’ll see later in the chapter, because of the operating system-level locking introduced by clustering technologies, OPS delivers scalability only if you observe a design restriction called application partitioning.

Clustering technologies allow two or more machines, often called nodes, to be tightly connected so that they can share access to the same disks and, on some platforms, common memory. Although each node in the cluster shares resources with the other node(s), it retains the ability to operate independently. That is, each node has its own IP address and runs its own processes.

Clustering technologies exploit probability in the same way that many types of RAID do. Both clustering and RAID technologies provide hardware redundancy to reduce the probability of an outage. For example, consider a data file that is mirrored (RAID-1), so that two copies of the data file exist at all times. The probability that both copies will be lost at the same time due to disk crashes is much lower than the probability that a single unmirrored copy will be lost: if each disk independently has, say, a 1 percent chance of failing in a given window, the mirrored pair fails together with a probability of roughly 0.01 x 0.01 = 0.0001 (the figures are purely illustrative).

Although disk crashes are the most common form of hardware failure, other components of a server can fail as well. For example, I’ve seen many controllers and network cards fail. Memory can fail. The RISC chip can fail. All of these failures can cause a server crash, or at least make it unavailable to applications. By providing redundancy for all parts of a server (not just the disk, as RAID does), you lower the probability that an outage will occur. It’s less likely that both nodes in a two-node cluster will crash simultaneously than that a standalone server will crash.

Clustering technologies also provide automatic failover at the operating system level. That is, if one node in the cluster crashes, another node automatically takes over its network traffic, even assuming the failed node’s IP address in addition to its own. All network transactions in progress when the node dies will encounter errors, but all new work destined for the failed node will be handled by one of the surviving nodes. Any batch processes running on the failed node will die with it. We will discuss this later in the chapter.

As we’ve alluded, clusters can contain more than two nodes. I’ve seen applications running OPS hosted on three or four nodes. Due to cost considerations, clusters of three or more nodes are used primarily for business-critical applications. For example, one of my clients used a three-node cluster to service an OLTP application that approves credit card transactions. My client determined that at some points in the year (such as Christmas Eve), they can easily lose in excess of one million dollars an hour through loss of service. In cases like this, three- or four-node clusters are used to further decrease the probability of an outage.

Most clustering technologies provide operating system-level locking for disk space. In a cluster, it is possible that multiple nodes are writing to the same data file at the same time. In fact, using OPS, all instances can update the same tables and indexes, which in turn means that at an operating system level they are writing to the same data files. In a cluster, the operating system will ensure that no two nodes are attempting to write the same blocks on the same disk at any given time.

On most platforms, the service that manages operating system-level locking in a cluster is called the Distributed Lock Manager (DLM). Most DLMs operate on an operating system block level, which is not to be confused with an Oracle block. We will see later in the chapter that DLM behavior dictates much about how we design OPS databases and applications. DLM behavior can effectively dictate whether or not an OPS application scales.
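
One concrete example of this influence is the GC_FILES_TO_LOCKS initialization parameter (it also appears in the parameter list later in this section), which controls how many distributed locks cover which data files. The fragment below is a sketch only, following the syntax described in the Oracle 8.0 documentation; the file numbers and lock counts are invented for illustration:

     # init.ora fragment (illustrative values only).
     # Files 1 and 2 share a pool of 400 locks; files 3 through 5
     # receive 150 locks each. Fewer locks per file means less DLM
     # overhead but a higher chance of false contention between nodes.
     GC_FILES_TO_LOCKS = "1-2=400:3-5=150EACH"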

Because DLM implementations differ from platform to platform, Oracle8 Server introduces the Integrated Distributed Lock Manager (IDLM). The IDLM provides its own locking mechanism, so that OPS locking behavior is much more consistent across platforms.

Using OPS, a separate instance is started on each node in the cluster. Each instance has a separate name and SGA. Furthermore, each instance has the same capabilities as non-OPS instances.

Although instances in OPS share data and control files, they do not share redo log files. Each instance in OPS is assigned a redo log thread that has redo log groups and members associated with it. The CREATE/ALTER DATABASE commands are used to create and manage these multiple threads. There should be one thread for each instance of a database.
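
For example, statements like the following might be used to create and enable a second redo thread for a two-instance configuration. This is a sketch only; the group numbers, file names, and sizes are assumed for illustration:

     -- Create two log groups for redo thread 2, then enable the thread
     -- (file names and sizes are illustrative, not prescriptive).
     ALTER DATABASE ADD LOGFILE THREAD 2
         GROUP 4 ('/u02/oradata/ops/redo_t2_g4.log') SIZE 10M,
         GROUP 5 ('/u02/oradata/ops/redo_t2_g5.log') SIZE 10M;
     ALTER DATABASE ENABLE PUBLIC THREAD 2;

The instance on the second node then claims the thread by setting THREAD = 2 in its initialization parameter file; a thread enabled as PUBLIC can also be acquired automatically by an instance that does not request a specific thread.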

Although each instance has its own initialization parameter file, several parameters must be set identically across instances. The following initialization parameters must be identical between instances of the same database (one way to enforce this is sketched after the note below):


     CACHE_SIZE_THRESHOLD                 CONTROL_FILES
     CPU_COUNT                            DB_BLOCK_SIZE
     DB_FILES                             DB_NAME
     DML_LOCKS                            GC_FILES_TO_LOCKS
     GC_LCK_PROCS                         GC_ROLLBACK_LOCKS
     LOG_FILES                            MAX_COMMIT_PROPAGATION_DELAY
     PARALLEL_DEFAULT_MAX_SCANS           PARALLEL_DEFAULT_MAX_INSTANCES
     ROLLBACK_SEGMENTS                    ROW_LOCKING


Note:  
DML_LOCKS must be identical only if set to zero. Oracle also recommends (in its documentation for 8.0.4) that LM_LOCKS, LM_PROCS, and LM_RESS be set to the same values for all instances in the cluster.
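
One common way to keep these parameters in step, sketched below, is to put them in a shared file that every instance includes via the IFILE parameter. All names, paths, and values here are assumptions for illustration:

     # initops_common.ora -- parameters that must match on every instance
     DB_NAME       = OPS
     DB_BLOCK_SIZE = 8192
     CONTROL_FILES = (/u01/oradata/ops/control01.ctl,
                      /u02/oradata/ops/control02.ctl)

     # initOPS1.ora -- instance on node 1
     IFILE           = /u01/app/oracle/admin/ops/pfile/initops_common.ora
     THREAD          = 1
     INSTANCE_NUMBER = 1

     # initOPS2.ora -- instance on node 2
     IFILE           = /u01/app/oracle/admin/ops/pfile/initops_common.ora
     THREAD          = 2
     INSTANCE_NUMBER = 2

Per-instance parameters such as THREAD and INSTANCE_NUMBER stay in the instance-specific files, so the shared file never diverges between nodes.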

