Thursday, March 14, 2013

I/O Scheduler algorithms in Linux

 I just remembered that someone once asked me about the available I/O scheduler algorithms in Linux during a job interview. In the 14 years of Linux experience I have, I still have not come across a situation that required changing a system's I/O scheduler algorithm, and personally I don't believe it needs to be changed in any normal server environment. But maybe this is a good review by Red Hat.
 Really, what was that guy thinking?!
 http://www.redhat.com/magazine/008jun05/features/schedulers/

Choosing an I/O Scheduler for Red Hat® Enterprise Linux® 4 and the 2.6 Kernel

The Linux kernel, the core of the operating system, is responsible for controlling disk access by using kernel I/O scheduling. Red Hat Enterprise Linux 3, with a 2.4 kernel base, uses a single, robust, general-purpose I/O elevator. The 2.4 I/O scheduler offers a reasonable number of tuning options through the elvtune command, which controls the amount of time a request remains in an I/O queue before being serviced. While Red Hat Enterprise Linux 3 offers excellent performance for most workloads, it does not always provide the best I/O characteristics for the wide range of applications in use by Linux users these days. The I/O schedulers provided in Red Hat Enterprise Linux 4, embedded in the 2.6 kernel, have advanced the I/O capabilities of Linux significantly. With Red Hat Enterprise Linux 4, applications can now optimize the kernel I/O at boot time by selecting one of four different I/O schedulers to accommodate different I/O usage patterns:
  • Completely Fair Queuing—elevator=cfq (default)
  • Deadline—elevator=deadline
  • NOOP—elevator=noop
  • Anticipatory—elevator=as
Add one of the elevator options listed above to your kernel command in the GRUB boot loader configuration file (/boot/grub/grub.conf) or on the eLILO command line. Red Hat Enterprise Linux 4 has all four elevators built in; there is no need to rebuild your kernel.
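For example, to boot with the Deadline elevator, the kernel line in /boot/grub/grub.conf might look like the following sketch (the kernel version and root device shown are illustrative placeholders, not prescriptions):

    title Red Hat Enterprise Linux 4
        root (hd0,0)
        kernel /vmlinuz-2.6.9-5.EL ro root=/dev/VolGroup00/LogVol00 elevator=deadline
        initrd /initrd-2.6.9-5.EL.img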
The 2.6 kernel incorporates the best I/O algorithms that developers and researchers have shared with the open-source community as of mid-2004. These schedulers have been available in Fedora Core 3 and will continue to be used in Fedora Core 4. There have been several good characterization papers on evaluating Linux 2.6 I/O schedulers; a few are referenced at the end of this article. This article details our own study, based on running Oracle 10G under both OLTP and DSS workloads with EXT3 file systems.

Red Hat Enterprise Linux 4 I/O schedulers

Included in Red Hat Enterprise Linux 4 are four custom configured schedulers from which to choose. They each offer a different combination of optimizations.
The Completely Fair Queuing (CFQ) scheduler is the default algorithm in Red Hat Enterprise Linux 4. As the name implies, CFQ maintains a scalable per-process I/O queue and attempts to distribute the available I/O bandwidth equally among all I/O requests. CFQ is well suited for mid-to-large multi-processor systems and for systems which require balanced I/O performance over multiple LUNs and I/O controllers.
The Deadline elevator uses a deadline algorithm to minimize I/O latency for a given I/O request. The scheduler provides near real-time behavior and uses a round robin policy to attempt to be fair among multiple I/O requests and to avoid process starvation. Using five I/O queues, this scheduler will aggressively re-order requests to improve I/O performance.
The NOOP scheduler is a simple FIFO queue and uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O. It assumes performance of the I/O has been or will be optimized at the block device (memory-disk) or with an intelligent HBA or externally attached controller.
The Anticipatory elevator introduces a controlled delay before dispatching I/O in order to aggregate and/or re-order requests, improving locality and reducing disk seek operations. This algorithm is intended to optimize systems with small or slow disk subsystems. One artifact of using the AS scheduler can be higher I/O latency.
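Although the article focuses on selecting the elevator at boot time, kernels that expose the elevator through sysfs (later 2.6 kernels and newer) also let you inspect it, and often switch it, per block device at runtime. A minimal Python sketch, assuming the /sys/block/<dev>/queue/scheduler interface and a placeholder device named sda:

    # Inspect (and optionally switch) the I/O scheduler for one block device.
    # Assumes the kernel exposes /sys/block/<dev>/queue/scheduler; "sda" is a
    # placeholder device name, and writing to the file requires root.
    from pathlib import Path

    def schedulers(dev="sda"):
        # The file lists the available elevators with the active one in
        # brackets, e.g. "noop anticipatory deadline [cfq]".
        return Path(f"/sys/block/{dev}/queue/scheduler").read_text().strip()

    def set_scheduler(dev, name):
        # Writing an elevator name activates it for this device only.
        Path(f"/sys/block/{dev}/queue/scheduler").write_text(name)

    if __name__ == "__main__":
        print(schedulers("sda"))

Because this switches the elevator for a single device without a reboot, it makes A/B testing a workload against two schedulers straightforward.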

Choosing an I/O elevator

The definitions above may give you enough information to make a choice of I/O scheduler. The other extreme is to actually test and tune your workload on each I/O scheduler by simply rebooting your system and measuring your exact environment. We have done just that for Red Hat Enterprise Linux 3 and all four Red Hat Enterprise Linux 4 I/O schedulers using Oracle 10G I/O workloads.
Figure 1 shows the results of an Oracle 10G OLTP workload running on a 2-CPU/2-HT Xeon with 4 GB of memory across 8 LUNs on an LSI Logic MegaRAID controller. The OLTP load ran mostly 4k random I/O with a 50% read/write ratio. The DSS workload consisted of 100% sequential read queries using large 32k-256k byte transfer sizes.

Figure 1. Red Hat Enterprise Linux 4 I/O schedulers vs. Red Hat Enterprise Linux 3 for an Oracle 10G database OLTP/DSS workload (relative performance)

The CFQ scheduler was chosen as the default since it offers the highest performance for the widest range of applications and I/O system designs. We have seen CFQ excel in both throughput and latency on multi-processor systems with up to 16 CPUs and on systems with 2 to 64 LUNs, for both UltraSCSI and Fibre Channel disk farms. In addition, CFQ is easy to tune by adjusting the nr_requests parameter (/sys/block/<device>/queue/nr_requests) to match the capabilities of any given I/O subsystem.
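A hedged sketch of reading and raising that queue depth through sysfs, assuming the /sys/block/<dev>/queue/nr_requests path on a 2.6-based kernel (sda is a placeholder device, and writing requires root):

    # Read and deepen the request queue for one block device.
    from pathlib import Path

    q = Path("/sys/block/sda/queue/nr_requests")      # placeholder device
    print("current nr_requests:", q.read_text().strip())
    q.write_text("512")  # e.g. match a deep hardware queue; re-measure afterwards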
The Deadline scheduler excelled at reducing the latency of any given single I/O for real-time-like environments. A workload that depends on an even balance of transactions across multiple HBAs, drives, or file systems may not always do best with the Deadline scheduler. The Oracle 10G OLTP load using 10 simultaneous users spread over eight LUNs showed improvement using Deadline relative to Red Hat Enterprise Linux 3's I/O elevator, but was still 12.5% lower than CFQ.
The NOOP scheduler indeed freed up CPU cycles but performed 23% fewer transactions per minute when using the same number of clients driving the Oracle 10G database. The reduction in CPU cycles was proportional to the drop in performance, so this scheduler may work well for systems which drive their databases into CPU saturation. But CFQ and Deadline yield better throughput than the NOOP scheduler for the same client load.
The AS scheduler excels on small systems which have limited I/O configurations and have only one or two LUNs. By design, the AS scheduler is a nice choice for client and workstation machines where interactive response time is a higher priority than I/O latency.

Summary: Have it your way!

The short summary of our study is that there is no SINGLE answer to which I/O scheduler is best. The good news is that with Red Hat Enterprise Linux 4 an end user can customize the scheduler with a simple boot option. Our data suggests the default Red Hat Enterprise Linux 4 I/O scheduler, CFQ, provides the most scalable algorithm for the widest range of systems, configurations, and commercial database users. However, we have also measured other workloads in which the Deadline scheduler outperformed CFQ for large, sequential, read-mostly DSS queries. Other studies referenced in the section "References" explored using the AS scheduler to help interactive response times. In addition, NOOP has proven to free up CPU cycles and provide adequate I/O performance for systems with intelligent I/O controllers which provide their own I/O ordering capabilities.
In conclusion, we recommend baselining an application with the default CFQ scheduler. Use this article and its references to match your application to one of the studies. Then adjust the I/O scheduler via the simple boot-time option if seeking additional performance. Make only one change at a time, and use performance tools to validate the results.

 

/proc/cpuinfo, what cpu flags mean?

I always keep forgetting what the CPU flags in Linux mean and which CPU architecture I have.

http://www.gentoo-wiki.info/Gentoo:/proc/cpuinfo

Intel flags (this table is currently identical to /usr/include/asm/cpufeature.h; hopefully some hardware god will share his wisdom and expand it):

Flag         Description
fpu          Onboard (x87) Floating Point Unit
vme          Virtual Mode Extension
de           Debugging Extensions
pse          Page Size Extensions
tsc          Time Stamp Counter: support for the RDTSC instruction
msr          Model-Specific Registers
pae          Physical Address Extensions: ability to address up to 64 GB of RAM, though each process can still only address 4 GB at a time
mce          Machine Check Exception
cx8          CMPXCHG8B instruction
apic         Onboard Advanced Programmable Interrupt Controller
sep          SYSENTER/SYSEXIT instructions; SYSENTER is used for jumps into kernel memory during system calls, and SYSEXIT for jumps back to user code
mtrr         Memory Type Range Registers
pge          Page Global Enable
mca          Machine Check Architecture
cmov         CMOV instruction
pat          Page Attribute Table
pse36        36-bit Page Size Extensions: allows mapping 4 MB pages into the first 64 GB of RAM; used with PSE
pn           Processor Serial Number; only available on the Pentium III
clflush      CLFLUSH instruction
dtes         Debug Trace Store
acpi         ACPI via MSR
mmx          MultiMedia Extensions
fxsr         FXSAVE and FXRSTOR instructions
sse          Streaming SIMD Extensions: single instruction, multiple data; performs the same operation on multiple pieces of input in a single clock tick
sse2         Streaming SIMD Extensions 2: more of the same
selfsnoop    CPU self snoop
acc          Automatic Clock Control
ia64         IA-64 (Itanium) processor
ht           HyperThreading: exposes a second logical processor that shares the core's execution resources, letting two threads run somewhat faster than one
nx           No eXecute bit: prevents arbitrary code from running via buffer overflows
pni          Prescott New Instructions, a.k.a. SSE3
vmx          Intel "Vanderpool" hardware virtualization technology
svm          AMD "Pacifica" hardware virtualization technology
lm           "Long Mode": the chip supports the AMD64 (x86-64) instruction set
tm           "Thermal Monitor": thermal throttling via IDLE instructions, usually hardware controlled in response to CPU temperature
tm2          "Thermal Monitor 2": decreases speed by reducing the multiplier and vcore
est          "Enhanced SpeedStep"