Wednesday, November 18, 2015

Process and filter log files by date using AWK

Unix awk doesn't have a strptime() function, which makes life a little messy.
The following example calculates the epoch time instead. It's a little ugly too,
but it works for me as a quick way to get things done.

cat /var/log/yum.log | awk 'BEGIN { mnts="JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC"; ST=mktime("2015 11 18 11 32 00"); EN=mktime("2015 12 18 11 33 00");}{ MON=$1; DAY=$2 ; split($3,TM,":"); M=(index(mnts,toupper(MON))+2)/3; EPOCH=mktime("2015 "M" "DAY" "TM[1]" "TM[2]" "TM[3]); if ((ST < EPOCH)&&(EN > EPOCH)) { print $0; }}'
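The same filter is easier to read and maintain when it is spread over a few lines. A sketch of a more verbose version (it still assumes GNU awk for mktime() and hard-codes the year, just like the one-liner above):

awk 'BEGIN {
         months = "JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC"
         start  = mktime("2015 11 18 11 32 00")
         end    = mktime("2015 12 18 11 33 00")
     }
     {
         mon = (index(months, toupper($1)) + 2) / 3          # "Nov" -> 11
         split($3, t, ":")
         epoch = mktime("2015 " mon " " $2 " " t[1] " " t[2] " " t[3])
         if (epoch > start && epoch < end) print
     }' /var/log/yum.log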


Saturday, August 01, 2015

Regular expression

How to grep IPv4 addresses:

echo 255.255.255.250 | egrep '(1[0-9][0-9]|2[0-4][0-9]|25[0-5]|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])'

Can anyone do better? I would love to see how you do it.
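In the meantime, one possible refinement (just a sketch, and GNU grep specific): anchor each octet with word boundaries so that something like 999.1.2.3 doesn't partially match, and let -o pull the addresses out of surrounding text:

echo "gw=255.255.255.250 bad=999.1.2.3" | grep -oE '\b(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(\.(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b'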


Notes about Filesystems

Well, filesystems may sound like a very simple concept, especially in Unix environments, but there are some interesting limitations and tricks that a Unix admin must keep in mind when working with them.
I'm taking some notes about filesystems and very simple definitions, specifically for the common filesystems in Linux. I'll also include my references here. Some are interesting.



A bit of reading about Inodes and Unix filesystem features

Unix Inodes
 
Many Unix filesystems (Berkeley Fast Filesystem, Linux ext2fs, Sun ufs, ...) take an approach that combines several of these allocation ideas.







  • each file is indexed by an inode
  • inodes are special disk blocks set aside just for this purpose (see df -i to see how many of these exist on your favorite Unix filesystem)
  • they are created when the filesystem is created
  • the number of inodes limits the total number of files/directories that can be stored in the filesystem
  • the inode itself consists of
      • administrative information (permissions, timestamps, etc.)
      • a number of direct blocks (typically 12) that contain pointers to the first 12 blocks of the file
      • a single indirect pointer that points to a disk block which in turn is used as an index block, if the file is too big to be indexed entirely by the direct blocks
      • a double indirect pointer that points to a disk block which is a collection of pointers to disk blocks which are index blocks, used if the file is too big to be indexed by the direct and single indirect blocks
      • a triple indirect pointer that points to an index block of index blocks of index blocks...
  • interesting reading on your favorite FreeBSD system: /sys/ufs/ufs/dinode.h
  • small files need only the direct blocks, so there is little waste in space or extra disk reads in those cases
  • medium sized files may use indirect blocks
  • only large files make use of (and incur the overhead of) the double or triple indirect blocks, and that is reasonable since those files are large anyway
  • since the disk is now broken into two different types of blocks - inodes and data blocks, there must be some way to determine where the inodes are, and to keep track of free inodes and disk blocks. This is done by a superblock, located at a fixed position in the filesystem. The superblock is usually replicated on the disk to avoid catastrophic failure in case of corruption of the main superblock 
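    To get a feel for how far the direct/indirect scheme above reaches, a rough back-of-the-envelope calculation (a sketch only; it assumes 4 KiB blocks and 4-byte block pointers, i.e. 1024 pointers per index block, which is close to classic ext2 but not exact for every filesystem):

    block=4096; ptrs=$((block / 4))            # 1024 pointers per index block
    direct=$((12 * block))                     # 12 direct blocks    ->  48 KiB
    single=$((ptrs * block))                   # single indirect     ->   4 MiB
    double=$((ptrs * ptrs * block))            # double indirect     ->   4 GiB
    triple=$((ptrs * ptrs * ptrs * block))     # triple indirect     ->   4 TiB
    echo "max file size: $(( (direct + single + double + triple) / 1024 )) KiB"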

  • Disk Allocation Considerations
  • limitations on file size, total partition size
  • internal, external fragmentation
  • overhead to store and access index blocks
  • layout of files, inodes, directories, etc, as they affect performance - disk head movement, rotational latency - many unix filesystems keep clusters of inodes at a variety of locations throughout the file system, to allow inodes and the disk blocks they reference to be close together
  • may want to reorganize files occasionally to improve layout (see hw7 question) 
  • Free Space Management
    With any of these methods of allocation, we need some way to keep track of free disk blocks.
    Two main options:



  • bit vector - keep a vector, one bit per disk block
  • 0 means the corresponding block is free, 1 means it is in use
  • search for a free block requires search for the first 0 bit, can be efficient given hardware support
  • vector is too big to keep in main memory, so it must be on disk, which makes traversal slow
  • with a block size of 2^12 bytes (4 KB) and a disk size of 2^33 bytes (8 GB), we need 2^21 bits (256 KB) for the bit vector
  • easy to allocate contiguous space for files
  • free list - keep a linked list of free blocks
  • with linked allocation, can just use existing links to form a free list
  • with FAT, use FAT entries for unallocated blocks to store free list
  • no wasted space
  • can be difficult to allocate contiguous blocks
  • allocate from head of list, deallocated blocks added to tail, both O(1) operations
  • Performance Optimization
    Caching is an important optimization for disk accesses.
    A disk cache may be located:



  • main memory
  • disk controller
  • internal to disk drive



  • Safety and Recovery

    When a disk cache is used, there could be data in memory that has been "written" by programs but which has not yet been physically written to the disk. This can cause problems in the event of a system crash or power failure.
    If the system detects this situation, typically on bootup after such a failure, a consistency checker is run. In Unix, this is usually the fsck program, and in Windows, scandisk or some variant. This checks for and repairs, if possible, inconsistencies in the filesystem.

    Journaling Filesystems
    One way to avoid data loss when a filesystem is left in an inconsistent state is to move to a log-structured or journaling filesystem.



  • record updates to the filesystem as transactions
  • transactions are written immediately to a log, though the actual filesystem may not yet be updated
  • transactions in the log are asynchronously applied to the actual filesystem, at which time the transaction is removed from the log
  • if the system crashes, any pending transactions can be applied to the filesystem - main benefits are less chance of significant inconsistencies, and that those inconsistencies can be corrected from the unfinished transactions, avoiding the long consistency check
  • Examples:
  • ReiserFS, a linux journaling filesystem - I recommend reading this page
  • ext3fs, also for linux
  • jfs, IBM journaling filesystem, available for AIX, Linux
  • Related idea in FreeBSD's filesystem: Soft Updates
  • Journaling extensions to Macintosh HFS disks, called Elvis, supposedly coming in OS X 10.2.2
  • NTFS does some journaling, but some claim it is not "fully journaled"
  • the term "journaling" may also refer to systems that maintain the transaction log for a longer time, giving the ability to "undo" changes and retrieve a previous state of a filesystem
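    As a practical aside, you can usually check whether a given ext3/ext4 filesystem really has a journal by looking at its feature flags (a quick sketch; the device name is just an example):

    # tune2fs -l /dev/sda1 | grep -i features      # "has_journal" should be listed for ext3/ext4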


  • From: Ext4 filesystem layout

    Overview


        An ext4 file system is split into a series of block groups. To reduce performance difficulties due to fragmentation, the block allocator tries very hard to keep each file's blocks within the same group, thereby reducing seek times. The number of blocks in a block group can be calculated as 8 * block_size_in_bytes. With the default block size of 4KiB, each group will contain 32,768 blocks, for a length of 128MiB. ( It's good to group things. )
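        These numbers are easy to confirm on a live filesystem (a quick sketch; the device name is assumed):

        # dumpe2fs -h /dev/sda1 | grep -iE 'block size|blocks per group'
        # with 4 KiB blocks you should see 32768 blocks per group (8 * 4096)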

    Blocks

        ext4 allocates storage space in units of "blocks". A block is a group of sectors between 1KiB and 64KiB, and the number of sectors must be an integral power of 2. Blocks are in turn grouped into larger units called block groups. Block size is specified at mkfs time and is typically 4KiB. You may experience mounting problems if the block size is greater than the page size (e.g. 64KiB blocks on an i386, which only has 4KiB memory pages). By default a filesystem can contain 2^32 blocks; if the '64bit' feature is enabled, then a filesystem can have 2^64 blocks.
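        Both the block size and the 64bit feature are chosen at mkfs time, so as a sketch (assuming e2fsprogs and a scratch device, since this destroys whatever is on it):

        # mkfs.ext4 -b 4096 -O 64bit /dev/sdb1     # 4KiB blocks, 64-bit block numbers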


    Item (block size)     32-bit mode                          64-bit mode
                          1KiB    2KiB    4KiB    64KiB        1KiB    2KiB    4KiB    64KiB
    Blocks                2^32    2^32    2^32    2^32         2^64    2^64    2^64    2^64
    Inodes                2^32    2^32    2^32    2^32         2^32    2^32    2^32    2^32
    File System Size      4TiB    8TiB    16TiB   256PiB       16ZiB   32ZiB   64ZiB   1YiB

     ( Nice, so it's actually the system's architecture that dictates the maximum filesystem size: 2 to the power of 32 or 64 block numbers. You get some room to play with the block size, but you must stick to blocks no bigger than your memory page size. But can't I mount a filesystem created on a bigger machine on a smaller one? It seems we may have difficulties, so don't be sure unless you've tried it. )


    Layout

    The layout of a standard block group is approximately as follows (each of these fields is discussed in a separate section below):
    Group 0 Padding        1024 bytes
    ext4 Super Block       1 block
    Group Descriptors      many blocks
    Reserved GDT Blocks    many blocks
    Data Block Bitmap      1 block
    inode Bitmap           1 block
    inode Table            many blocks
    Data Blocks            many more blocks
    For the special case of block group 0, the first 1024 bytes are unused, to allow for the installation of x86 boot sectors and other oddities. The superblock will start at offset 1024 bytes, whichever block that happens to be (usually 0). However, if for some reason the block size = 1024, then block 0 is marked in use and the superblock goes in block 1. For all other block groups, there is no padding.
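    You can see this layout on a real filesystem: dumpe2fs (without -h) prints the per-group breakdown, including where the primary superblock, group descriptors, bitmaps and inode table of group 0 actually live (device name assumed):

    # dumpe2fs /dev/sda1 | less      # look for the "Group 0:" section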




    Still to continue.....






    Monday, April 06, 2015

    Finding Hardlinks

    Well, this may be something very primitive, but it may be useful to someone.

    How do you know how many hard links you have to a file? There is a system call for this: fstat for open file descriptors and stat for paths. The structure returned by these calls is as follows:

               struct stat {
                   dev_t     st_dev;     /* ID of device containing file */
                   ino_t     st_ino;     /* inode number */
                   mode_t    st_mode;    /* protection */
                   nlink_t   st_nlink;   /* number of hard links */
                   uid_t     st_uid;     /* user ID of owner */
                   gid_t     st_gid;     /* group ID of owner */
                   dev_t     st_rdev;    /* device ID (if special file) */
                   off_t     st_size;    /* total size, in bytes */
                   blksize_t st_blksize; /* blocksize for file system I/O */
                   blkcnt_t  st_blocks;  /* number of 512B blocks allocated */
                   time_t    st_atime;   /* time of last access */
                   time_t    st_mtime;   /* time of last modification */
                   time_t    st_ctime;   /* time of last status change */
               };

    And it is very interesting that one of the fields in this structure, st_nlink, is the number of hard links. Just refer to "man 2 fstat" for more details.

    The utility in shell to read these values is simply the stat command:

    [user@localhost ~]$ stat ./scr.sh
      File: ‘./scr.sh’
      Size: 72            Blocks: 8          IO Block: 4096   regular file
    Device: fd02h/64770d    Inode: 137155      Links: 2
    Access: (0755/-rwxr-xr-x)  Uid: ( 1000/    user)   Gid: ( 1000/    user)
    Context: unconfined_u:object_r:user_home_t:s0
    Access: 2015-04-06 19:32:44.410014623 +1000
    Modify: 2015-04-06 19:32:42.703024143 +1000
    Change: 2015-04-06 20:12:21.379767352 +1000
     Birth: -
    [user@localhost ~]$

    and ls -ial shows you the inode number. Two files with the same inode number are the same file, referring to the same inode on the filesystem.

    To find all the hard links on a mount point, you simply look for files with the same inode number, and find makes that simple. Since hard links are only possible within the same filesystem, we can use the -xdev option with the find command to stay on one mount point:

    find ./ -xdev -samefile ./scr.sh


    [user@localhost ~]$ find ./ -xdev -samefile ./scr.sh
    ./scr-1.sh
    ./scr.sh

    or use the inode number:
    [user@localhost ~]$ find ./ -xdev -inum 137155
    ./scr-1.sh
    ./scr.sh
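    And if you don't have a particular file in mind, you can list everything on the mount point that has more than one link (a sketch using GNU find's -links and -printf; sorting by inode groups the matching files together):

    [user@localhost ~]$ find . -xdev -type f -links +1 -printf '%n %i %p\n' | sort -k2 -n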



    That's all folks :)

    Wednesday, January 15, 2014

    My favorite tool: tcpdump

     The thing I like best about tcpdump is the power of the expressions you can benefit from:
    which port, host, interface, network or protocol to capture. tcpdump is so powerful and handy that even IBM could not resist delivering it on AIX systems, and through oem_setup_env on the VIO server. Cisco has not been able to make a better alternative and happily offers tcpdump in the Linux-based environment of Nexus NX-OS, promoting it as a feature of their devices.

    And it's all because of one powerful library:

       [root@localhost local]# ldd /usr/sbin/tcpdump
            linux-gate.so.1 =>  (0x00317000)
            libcrypto.so.10 => /usr/lib/libcrypto.so.10 (0x00b27000)
            libpcap.so.1 => /usr/lib/libpcap.so.1 (0x00dde000)
            libc.so.6 => /lib/libc.so.6 (0x004ba000)
            libdl.so.2 => /lib/libdl.so.2 (0x00414000)
            libz.so.1 => /lib/libz.so.1 (0x008c9000)
            /lib/ld-linux.so.2 (0x0074d000)
    [root@localhost local]#

    The libpcap project
    libpcap is a system-independent interface for user-level packet capture. libpcap provides a portable framework for low-level network monitoring. Applications include network statistics collection, security monitoring, network debugging, etc.

    Oh, just have a look at the projects benefiting from libpcap and you'll be surprised:

    http://en.wikipedia.org/wiki/Pcap

     Programs that use libpcap/WinPcap

    • tcpdump, a tool for capturing and dumping packets for further analysis, and WinDump, the Windows port of tcpdump.
    • ngrep, aka "network grep", isolate strings in packets, show packet data in human-friendly output.
    • Wireshark (formerly Ethereal), a graphical packet-capture and protocol-analysis tool.
    • Snort, a network-intrusion-detection system.
    • Nmap, a port-scanning and fingerprinting network utility
    • the Bro IDS and network-monitoring platform.
    • URL Snooper, locate the URLs of audio and video files in order to allow recording them.
    • Kismet, for 802.11 wireless LANs
    • L0phtCrack, a password auditing and recovery application.
    • iftop, a tool for displaying bandwidth usage (like top for network traffic)
    • EtherApe, a graphical tool for monitoring network traffic and bandwidth usage in real time.
    • Bit-Twist, a libpcap-based Ethernet packet generator and editor for BSD, Linux, and Windows.
    • Pirni, a network security tool for jailbroken iOS devices.
    • McAfee ePolicy Orchestrator, Rogue System Detection feature
    • XLink Kai Software that allows various LAN console games to be played online
    • Firesheep, an extension for the Firefox web browser, that intercepts unencrypted cookies from certain websites (such as Facebook and Twitter) as the cookies are transmitted over networks, exploiting session hijacking vulnerabilities.
    • Suricata, a network intrusion prevention and analysis platform.
    • WhatPulse, a statistical (input, network, uptime) measuring application.
    • Xplico, a network forensics analysis tool (NFAT).
     The only limit is your imagination!


    What are you doing on my blog?! ... uhhh ... looking for a nice cheat sheet (in Unix manual terminology it's called EXAMPLES) for tcpdump? Wondering how to use the freely available libpcap in your projects and sell it to people because you work for Oracle? Or did you just find this blog by coincidence and have never seen a computer in your life?

    Easy ..

    First, you need to be the root user to listen on an interface, or in other words, to put the interface into promiscuous mode.

    Jan 14 23:03:04 localhost kernel: device eth0 entered promiscuous mode
    Jan 14 23:30:04 localhost kernel: device eth0 left promiscuous mode

    I personally always use tcpdump with -n to stop it from resolving IPs to names; you can even use -nn to see raw port numbers too. I also have the habit of specifying exactly which interface to listen on with -i <interface name>, since I used to work on servers with at least 5-6 Ethernet ports and maybe tens of ppp interfaces. If your interface is a really busy one, always add -c <number> to limit the number of packets you want to capture. So up to here your command may look like this:

    # tcpdump -n -c 100 -i eth0

    Well, if you are running tcpdump on a remote system and you've logged in to that system via the interface you are listening on, you probably don't want your screen flooded by your own useless, encrypted SSH packets.

    # tcpdump -n -c 100 -i eth0 port not ssh

    Wow, going too fast? A little description: the first part is the same command as before. The second part says that I'm not interested in packets coming from or destined to port 22, and not is the negating keyword.
    So if I just want to see the web traffic, I would write:

    # tcpdump -n -c 100 -i eth0 port not ssh and port http

     Hmm, can we have more logical expressions? Sure. A little reminder from basic logic at school:

    A and ( B or C ) = ( A and B ) or ( A and C )
    A or ( B and C ) =  ( A or B ) and ( A or C )
    not ( A or C ) = not A and not C

    The magic:

    # tcpdump -n -i eth0 -c 100 port not ssh and not \( port nrpe or https \) and tcp

    This means that I want to see all TCP traffic that is not SSH, NRPE or HTTPS.
    Remember that you can look up the well-known port numbers in /etc/services, and you can always add your own entries to that file, although you can also use port numbers directly instead of names like ssh or http.

    Well, another interesting option is to see what's inside your packets. If you are sniffing web traffic, or passwords out of telnet sessions, you can simply use the following command:

    # tcpdump -n -i eth0 -c 100 -A -s 0 port 80

    The -A option prints the packets in ASCII so you can see their content, and -s 0 means that you want to capture as much of each packet as possible, no matter how big it is (0 means 65535 bytes).

    If you want to see the bytes in hex, just use -X instead of -A.
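    One more habit worth mentioning: if you want to analyse the capture later, or open it in Wireshark, write the raw packets to a file with -w and read them back with -r (the file name here is just an example):

    # tcpdump -n -i eth0 -c 1000 -s 0 -w /tmp/capture.pcap port 80
    # tcpdump -n -r /tmp/capture.pcap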

    I guess it's enough for now. You are always welcome to check the manual of tcpdump command.

    Using libpcap in your applications

     You can find a libpcap interface in every major language, and in some not-so-popular programming languages too. The following is from the Wikipedia page of the pcap project:

    Wrapper libraries for libpcap/WinPcap


    Go and read your package's documentation!

    Monday, November 25, 2013

    Maze algorithm in Python.

    I was so bored during the weekend that I thought maybe I should play around with Python a little. So I wrote the following code to generate a maze. It uses a backtracking algorithm to build a maze of any given size. I personally enjoyed writing it.


    And the output:

    ##############################
    #       ##                   #
    # ## ## ##### ############## #
    # ## ## ##### ############## #
    # ## ##       ##             #
    # ## ########### #############
    # ## ########### #############
    # ##    ##    ## ##          #
    ####### ## ##### ##### ##### #
    ####### ## ##### ##### ##### #
    #       ##    ##    ##    ## #
    # ######## ## ##### ## ## ####
    # ######## ## ##### ## ## ####
    #    ## ## ##    ## ## ##    #
    #### ## ## ## ##### ##### ## #
    #### ## ## ## ##### ##### ## #
    #    ## ## ##    ##    ## ## #
    # ##### ## ##### ##### ##### #
    # ##### ## ##### ##### ##### #
    #    ##    ##       ##       #
    # ## ##### ## ############## #
    # ## ##### ## ############## #
    # ##    ## ##          ##    #
    # ##### ## ######## ##### ####
    # ##### ## ######## ##### ####
    #    ## ## ##    ##       ## #
    ####### ## ##### ########### #
    ####### ## ##### ########### #
    #       ##                   #
    ##############################
    

    Friday, November 08, 2013

    Auto-compile Jasper subreports in OpenNMS

    I was trying to make a database report in OpenNMS to collect CPU utilization information from JRobin files. To use a subreport, and have it recompiled every time the report is run, I ended up presenting the jrxml files in the following way. May it be useful to someone.



    The parent report file:
    The subreport file:
    /opt/opennms/etc/report-templates/subreports/PeakCPU_subreport.jrxml


    Friday, June 28, 2013

    RHEL4 or an old Linux but you want to roll back a LVM snapshot

    LVM has supported snapshots for a long time, but it is only recently (2-3 years) that merge support was introduced with RHEL6, which means you can roll back to the snapshot point. Before that, I think snapshots were mostly a way to freeze a filesystem so it could be backed up.

    Well ... what about an old kernel which only knows how to snapshot but not to merge !?

    Easy: boot the system with a kernel which knows how to do this, for example by booting from a RHEL6 DVD in rescue mode. LVM is LVM anyway.

    Or, you can dd the snapshot device file to an equally sized disk to create an equivalent filesystem to the snapshot.

    OK ... to show you how, follow these steps:

    1. Create a snapshot:
    There should be enough free space in the VG for the snapshot. It would be nice if you could use another VG to host the snapshot LV, but that feature is not available. Add a new disk to the VG to make some room if you need to.


    #lvcreate -L <size> -s -n LogVol00-backup /dev/<VG>/LogVol00

     2. Do your changes
    Cool, now you have the snapshot. Go ahead and make your changes.

    3. Clean up
    If all went well, remove the snapshot LV and release the disk:

    #lvremove /dev/<VG>/LogVol00-backup

    3-1. If instead you want to roll back, and you are on RHEL6 or a newer version (or a recent kernel):

    #lvconvert --merge <VG>/LogVol00-backup

    If your LVs are mounted, nothing will happen until you reboot, or unmount, deactivate and reactivate the LV.

    #umount /mountpoint
    #lvchange -an LogVol00-backup
    #lvchange -ay LogVol00-backup
    #mount /mountpoint

     Obviously you can't do this if your LV is the root filesystem. Just reboot if you are stuck.
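    Whichever way you go, lvs is handy for keeping an eye on things: it lists each snapshot together with its origin LV and how full the copy-on-write space is, so you can tell when a snapshot is about to fill up (a quick sketch):

    #lvs
    #watch -n5 lvs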

    3-2. What if your OS does not support merge? Boot from a RHEL6 DVD, go into rescue mode, and feel free to do step 3-1 from there.

    Or, alternatively, add a disk or create an LV of the same size as the original LV and do the following:

    dd if=/dev/MV/snapshot_lv of=/dev/DV/DestinationLV

    It will give you an LV with the contents of the snapshot LV. Don't be scared of dd. You know what you are doing, don't you?






     

    Thursday, May 16, 2013

    OpenVPN and XOR obfuscation

    UPDATED: 13/09/2016

    I patched the current version, 2.3.10, and pushed it to my GitHub:
    https://github.com/shenavaa/openvpn


    UPDATED 15/07/2014:

    I managed to patch and compile the latest version, OpenVPN 2.3.4, for Windows. Please download the compiled Windows auto-installer binaries from here and the sources from here.

    ----


    I went somewhere for a while, and during my visit I had a chance to play around with OpenVPN. During a lazy afternoon I came up with a silly idea: adding a layer of XOR obfuscation on top of whatever OpenVPN already does. I even managed to compile the Windows client of OpenVPN and run it on Windows.

    The good thing about XOR obfuscation is that it adds no overhead to the packets, and it is fast and easy.

    The bigger an organization is, the harder it is for LI/security layers to detect the algorithm or the protocol of the packets on the network. I have seen AI engines that learn protocols, used to block unwanted and previously undetected packets! Their solution is sillier than what I just did. Trust me. ;)

    I've done it by adding one simple function, obviously, and a couple of hacks in other source files.

    ## in xor.h
    #ifndef _XOR_H
    #define _XOR_H

    void encbuffer(unsigned char* buf,int size ,unsigned char key);
    #endif /* _XOR_H */
    ## in xor.c
    #include "xor.h"
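    /* XOR every byte of buf with the one-byte key; running the same
     * function over the buffer a second time restores the original data. */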
    void encbuffer(unsigned char* buf,int size ,unsigned char key) {
        int i;
        for (i = 0; i < size ; i++) {
            *(buf + i) = *(buf + i) ^ key;
        }
    }


    So my OpenVPN configuration file simply turns into something like the following:

    ## On the server
    local X.X.X.X
    dev tap
    verb 4
    #mute 10
    port 36
    tun-mtu-extra 32
    tun-mtu 1500
    up-delay
    ifconfig 172.16.4.1 255.255.255.0
    ping 10
    comp-lzo yes
    fragment 1100
    xorkey 52


    ## On the client
    remote X.X.X.X
    dev tap
    verb 4
    #mute 10
    port 36
    tun-mtu-extra 32
    tun-mtu 1500
    up-delay
    ifconfig 172.16.4.2 255.255.255.0
    ping 10
    comp-lzo yes
    fragment 1100
    xorkey 52


    My sources are here for whoever is interested. It's OpenVPN 2.3.1. I've cleaned it up, and all you need to compile the source after unpacking is "./configure; make; make install".


    This is the beauty of open source software. Feel free to distribute the love.

    Monday, April 08, 2013

    Running jobs in different time zones and handling daylight saving (DST) with crontab

    Oh, this daylight saving thing. Why don't people set their whole environment's clocks to UTC?
    Believe me, we would all have a better life if that happened.

    Anyway, newer versions of cron have introduced a good way to run jobs in different time zones. Just look at the following example crontab file.


    #######################IMPORTANT#######################
    # Anything beyond this line will run in Australia/Victoria time zone
    #######################################################
    CRON_TZ=Australia/Victoria
    #######################################################

    00 08 * * * /home/healthcheck/linux/send-email.sh

    #######################IMPORTANT#######################
    # Anything beyond this line will run in America/New_York time zone
    #######################################################
    CRON_TZ=America/New_York  
    #######################################################

    00 08 * * * /home/healthcheck/linux/syslog-check.sh

    #######################IMPORTANT#######################
    # Anything beyond this line will run in UTC time zone
    #######################################################
    CRON_TZ=UTC
    #######################################################
    01 16 * * * /home/healthcheck/linux/runDST.sh
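    A quick way to sanity-check an entry is to ask date what "now" looks like in the zone you are about to use:

    [healthcheck@adm01 ~]$ TZ=Australia/Victoria date
    [healthcheck@adm01 ~]$ TZ=America/New_York date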


    And how do you check when this year's daylight saving transitions happen?

    [healthcheck@adm01 ~]$  zdump -v Australia/Victoria| grep `date +%Y`
    Australia/Victoria  Sat Apr  6 15:59:59 2013 UTC = Sun Apr  7 02:59:59 2013 EST isdst=1 gmtoff=39600
    Australia/Victoria  Sat Apr  6 16:00:00 2013 UTC = Sun Apr  7 02:00:00 2013 EST isdst=0 gmtoff=36000
    Australia/Victoria  Sat Oct  5 15:59:59 2013 UTC = Sun Oct  6 01:59:59 2013 EST isdst=0 gmtoff=36000
    Australia/Victoria  Sat Oct  5 16:00:00 2013 UTC = Sun Oct  6 03:00:00 2013 EST isdst=1 gmtoff=39600
    [healthcheck@adm01 ~]$


    Thursday, March 14, 2013

    I/O Scheduler algorithms in Linux

     I just remembered that someone once asked me about the available I/O scheduler algorithms in Linux during a job interview. In the 14 years of Linux experience I have, I still have not come across a situation where I needed to change a system's I/O scheduler, and personally I don't believe it needs changing in any normal server environment. But maybe this is a good review by Red Hat.
     Really, what was that guy thinking of!
     http://www.redhat.com/magazine/008jun05/features/schedulers/

    Choosing an I/O Scheduler for Red Hat® Enterprise Linux® 4 and the 2.6 Kernel

    The Linux kernel, the core of the operating system, is responsible for controlling disk access by using kernel I/O scheduling. Red Hat Enterprise Linux 3 with a 2.4 kernel base uses a single, robust, general purpose I/O elevator. The 2.4 I/O scheduler has a reasonable number of tuning options by controlling the amount of time a request remains in an I/O queue before being serviced using the elvtune command. While Red Hat Enterprise Linux 3 offers most workloads excellent performance, it does not always provide the best I/O characteristics for the wide range of applications in use by Linux users these days. The I/O schedulers provided in Red Hat Enterprise Linux 4, embedded in the 2.6 kernel, have advanced the I/O capabilities of Linux significantly. With Red Hat Enterprise Linux 4, applications can now optimize the kernel I/O at boot time, by selecting one of four different I/O schedulers to accommodate different I/O usage patterns:
    • Completely Fair Queuing—elevator=cfq (default)
    • Deadline—elevator=deadline
    • NOOP—elevator=noop
    • Anticipatory—elevator=as
    Add the elevator options from Table 1 to your kernel command in the GRUB boot loader configuration file (/boot/grub/grub.conf) or the eLILO command line. Red Hat Enterprise Linux 4 has all four elevators built-in; no need to rebuild your kernel.
    The 2.6 kernel incorporates the best I/O algorithms that developers and researchers have shared with the open-source community as of mid-2004. These schedulers have been available in Fedora Core 3 and will continue to be used in Fedora Core 4. There have been several good characterization papers on evaluating Linux 2.6 I/O schedulers. A few are referenced at the end of this article. This article details our own study based on running Oracle 10G in both OLTP and DSS workloads with EXT3 file systems.

    Red Hat Enterprise Linux 4 I/O schedulers

    Included in Red Hat Enterprise Linux 4 are four custom configured schedulers from which to choose. They each offer a different combination of optimizations.
    The Completely Fair Queuing (CFQ) scheduler is the default algorithm in Red Hat Enterprise Linux 4. As the name implies, CFQ maintains a scalable per-process I/O queue and attempts to distribute the available I/O bandwidth equally among all I/O requests. CFQ is well suited for mid-to-large multi-processor systems and for systems which require balanced I/O performance over multiple LUNs and I/O controllers.
    The Deadline elevator uses a deadline algorithm to minimize I/O latency for a given I/O request. The scheduler provides near real-time behavior and uses a round robin policy to attempt to be fair among multiple I/O requests and to avoid process starvation. Using five I/O queues, this scheduler will aggressively re-order requests to improve I/O performance.
    The NOOP scheduler is a simple FIFO queue and uses the minimal amount of CPU/instructions per I/O to accomplish the basic merging and sorting functionality to complete the I/O. It assumes performance of the I/O has been or will be optimized at the block device (memory-disk) or with an intelligent HBA or externally attached controller.
    The Anticipatory elevator introduces a controlled delay before dispatching the I/O to attempt to aggregate and/or re-order requests improving locality and reducing disk seek operations. This algorithm is intended to optimize systems with small or slow disk subsystems. One artifact of using the AS scheduler can be higher I/O latency.

    Choosing an I/O elevator

    The definitions above may give enough information to make a choice for your I/O scheduler. The other extreme is to actually test and tune your workload on each I/O scheduler by simply rebooting your system and measuring your exact environment. We have done just that for Red Hat Enterprise Linux 3 and all four Red Hat Enterprise Linux 4 I/O schedulers using an Oracle 10G I/O workloads.
    Figure 1 shows the results of running an Oracle 10G OLTP workload running on a 2-CPU/2-HT Xeon with 4 GB of memory across 8 LUNs on an LSI Logic MegaRAID controller. The OLTP load ran mostly 4k random I/O with a 50% read/write ratio. The DSS workload consists of 100% sequential read queries using large 32k-256k byte transfer sizes.

    Figure 1. Red Hat Enterprise Linux 4 IO schedulers vs. Red Hat Enterprise Linux 3 for database Oracle 10G oltp/dss (relative performance)

    The CFQ scheduler was chosen as the default since it offers the highest performance for the widest range of applications and I/O system designs. We have seen CFQ excel in both throughput and latency on multi-processor systems with up to 16-CPUs and for systems with 2 to 64 LUNs for both UltraSCSI and Fiber Channel disk farms. In addition, CFQ is easy to tune by adjusting the nr_requests parameter in /proc/sys/scsi subsystem to match the capabilities of any given I/O subsystem.
    The Deadline scheduler excelled at attempting to reduce the latency of any given single I/O for real-time like environments. A problem which depends on an even balance of transactions across multiple HBA, drives or multiple file systems may not always do best with the Deadline scheduler. The Oracle 10G OLTP load using 10 simultaneous users spread over eight LUNs showed improvement using Deadline relative to Red Hat Enterprise Linux 3's I/O elevator, but was still 12.5% lower than CFQ.
    The NOOP scheduler indeed freed up CPU cycles but performed 23% fewer transactions per minute when using the same number of clients driving the Oracle 10G database. The reduction in CPU cycles was proportional to the drop in performance, so perhaps this scheduler may work well for systems which drive their databases into CPU saturation. But CFQ or Deadline yield better throughput for the same client load than the NOOP scheduler.
    The AS scheduler excels on small systems which have limited I/O configurations and have only one or two LUNs. By design, the AS scheduler is a nice choice for client and workstation machines where interactive response time is a higher priority than I/O latency.

    Summary: Have it your way!

    The short summary of our study indicates that there is no SINGLE answer to which I/O scheduler is best. The good news is that with Red Hat Enterprise Linux 4 an end-user can customize their scheduler with a simple boot option. Our data suggests the default Red Hat Enterprise Linux 4 I/O scheduler, CFQ, provides the most scalable algorithm for the widest range of systems, configurations, and commercial database users. However, we have also measured other workloads whereby the Deadline scheduler out-performed CFQ for large sequential read-mostly DSS queries. Other studies referenced in the section "References" explored using the AS scheduler to help interactive response times. In addition, noop has proven to free up CPU cycles and provide adequate I/O performance for systems with intelligent I/O controller which provide their own I/O ordering capabilities.
    In conclusion, we recommend baselining an application with the default CFQ. Use this article and its references to match your application to one of the studies. Then adjust the I/O scheduler via the simple command line re-boot option if seeking additional performance. Make only one change at a time, and use performance tools to validate the results.
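    For what it's worth, on newer kernels (RHEL 5 and later) you don't even need a reboot to experiment: the active scheduler is exposed per block device through sysfs, so you can check it and switch it on the fly (a sketch; the device name is assumed):

    # cat /sys/block/sda/queue/scheduler
    noop anticipatory deadline [cfq]
    # echo deadline > /sys/block/sda/queue/scheduler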

     

    /proc/cpuinfo: what do the CPU flags mean?

    I always keep forgetting what the CPU flags in Linux mean and what CPU architecture I have.

    http://www.gentoo-wiki.info/Gentoo:/proc/cpuinfo
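    A quick way to see which of these flags a particular box actually has, and to test for one in a script:

    $ grep -m1 '^flags' /proc/cpuinfo
    $ grep -qw lm /proc/cpuinfo && echo "this CPU supports long mode (x86-64)"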

    Intel flags (This table is currently identical with /usr/include/asm/cpufeature.h. Hopefully some hardware god will share his wisdom and expand this table. )
    Flag        Description
    fpu         Onboard (x87) Floating Point Unit
    vme         Virtual Mode Extension
    de          Debugging Extensions
    pse         Page Size Extensions
    tsc         Time Stamp Counter: support for RDTSC and WRTSC instructions
    msr         Model-Specific Registers
    pae         Physical Address Extensions: ability to access 64GB of memory; only 4GB can be accessed at a time though
    mce         Machine Check Architecture
    cx8         CMPXCHG8 instruction
    apic        Onboard Advanced Programmable Interrupt Controller
    sep         Sysenter/Sysexit instructions; SYSENTER is used for jumps to kernel memory during system calls, and SYSEXIT is used for jumps back to the user code
    mtrr        Memory Type Range Registers
    pge         Page Global Enable
    mca         Machine Check Architecture
    cmov        CMOV instruction
    pat         Page Attribute Table
    pse36       36-bit Page Size Extensions: allows mapping 4 MB pages into the first 64GB of RAM; used with PSE
    pn          Processor Serial Number; only available on Pentium 3
    clflush     CLFLUSH instruction
    dtes        Debug Trace Store
    acpi        ACPI via MSR
    mmx         MultiMedia Extension
    fxsr        FXSAVE and FXSTOR instructions
    sse         Streaming SIMD Extensions. Single instruction, multiple data. Lets you do a bunch of the same operation on different pieces of input in a single clock tick.
    sse2        Streaming SIMD Extensions 2. More of the same.
    selfsnoop   CPU self snoop
    acc         Automatic Clock Control
    IA64        IA-64 processor (Itanium)
    ht          HyperThreading. Introduces an imaginary second processor that doesn't do much but lets you run threads in the same process a bit quicker.
    nx          No Execute bit. Prevents arbitrary code running via buffer overflows.
    pni         Prescott New Instructions, aka SSE3
    vmx         Intel Vanderpool hardware virtualization technology
    svm         AMD "Pacifica" hardware virtualization technology
    lm          "Long Mode", which means the chip supports the AMD64 instruction set
    tm          "Thermal Monitor": thermal throttling with IDLE instructions. Usually hardware controlled in response to CPU temperature.
    tm2         "Thermal Monitor 2": decrease speed by reducing multiplier and vcore.
    est         "Enhanced SpeedStep"