Multi Codec Jukebox

Succeeding with Cloud Adoption in the Enterprise

2016-08-02T04:57:00.001+02:00

This article discusses the "day-after" challenges of driving cloud adoption in the enterprise after the cloud has been put in place. Even though cloud technology is increasingly stable and commoditized, enterprises are struggling to transform the goodness of cloud technology into improved reliability, greater IT agility and decreased IT expenditure. Why does the cloud falter in many organisations when it has thrived in others?

Whether you choose to go with a private cloud (e.g. OpenStack) in the company's data center, the public cloud (e.g. AWS), or some mix of the two, there are plenty of good tools and documentation resources to stand up the cloud for your organisation. There are very talented professional services that can cut the cloud's time-to-deployment and provide high service level guarantees from day one, so that your organisation's IT workers can learn from the best and have ample time to train up on all things cloud. Several IT components - enterprise servers, networking, storage - are increasingly tailored to the cloud use case, with hardware vendors making sure there are drivers and documentation to interface with popular cloud offerings. It seems as if it's only getting easier to deploy a private cloud. Public cloud offerings have gone a step further: not only do they provide turnkey infrastructure solutions, they also provide several platform services so you don't have to build these support platforms yourself (think managed databases, CDNs, load balancing, DNS, monitoring, archival storage etc.). In short, the debate about cloud maturity and support is over. It's more mature, better supported and better documented than most DIY bare-metal infrastructure that IT organisations have been supporting until now.

Setting up is the Easy Part

Here is how the cloud story unfolds in a large enterprise: the CIO's strategic initiative boot-straps the cloud of choice and many dollars later a shiny new cloud is born in the IT organisation. IT leaders proclaim the end of the dark ages of waiting months before bare-metal servers could be lit up in DCs. The business unit applications consuming the cloud wait in anticipation of converting their capex IT expenditure into opex. Legions of IT engineers and developers are trained up in the specific cloud technology chosen by the organisation. They all hear the spiel about devops, scalability and micro-services and how to engineer applications to fit into the cloud. The organisation has arrived at the forefront of the IT landscape and there is something for everyone (it seems).

Six to twelve months later things could have gone two ways. Either there is genuine transformation and notable enterprise applications start migrating to the cloud, or the cloud is deemed a failure with many application owners finding reasons (legitimate or not) to delay, postpone or revisit the decision to move to the cloud. To address these issues the IT organisation is forced to retrofit and re-architect the cloud to make it more "enterprise friendly". The cloud becomes an albatross hanging around the IT organisation's neck - a few applications have moved to the cloud and therefore it cannot be shelved or re-engineered from scratch. Most applications continue in legacy non-cloud mode - straining the IT organisation's resources to maintain both legacy and cloud infrastructure. Application owners feel the push-back from the IT organisation when it comes time to invest more in non-cloud (bare-metal) infrastructure, as IT tries desperately to twist the arm of the application owners to "get on" the cloud instead. Even then, little details - such as the organisation's entrenched database technology not being supported in the cloud - effectively rule out the cloud as an option for most of the enterprise's applications. You can almost hear the deafening noise when cloud-high expectations (pun intended) come crashing down.

There are two potential pitfalls that are the most problematic. I call them the "Big Happy Family Syndrome" and "Everyone is Invited Syndrome".

Big Happy Family Syndrome

The cloud is incredibly flexible. In their eagerness to earn enterprise business, many third-party developers have been offering ways to make the cloud "enterprise friendly". These solutions can irreparably mutate the cloud and make it lose its unique value proposition. They give enterprises excuses for not moving toward cloud-capable application architectures by effectively saying "It's okay to do things the old way, you can still bask in the glory of declaring you are on the cloud without re-architecting your 15 year old dinosaur application that everyone is so comfortable with". A wolf in sheep's clothing.

Here is an example: Proprietary storage company "X" has a large footprint in an enterprise's data centre. The whole storage team is trained in this technology, and is loath to adopt the new cloud storage technology (for example, in the case of OpenStack, this may be Ceph) that has been heavily tested and automated by the cloud developers. Proprietary company "X" writes a storage driver to integrate its array with OpenStack, but the level of integration, testing and automation is not the same as with Ceph. Automation is uncomfortable - one has to deal with things like authentication, quota and capacity management and, most importantly, it means giving up control and seemingly making the storage team redundant (this is obtuse thinking; in my own experience the cloud creates many more opportunities for IT specialists willing to learn). Under the pretext of "enterprise grade storage to back the cloud", the whole storage piece is retro-fitted and still driven via manual tickets, meaning that any virtual machine that needs storage will need to give up on automated provisioning as it waits (for hours or even days) for the ticket to be actioned. The organisation in this example has just removed one of the pillars of the cloud's value proposition - spinning up infrastructure on demand, instantly.

Everyone is Invited Syndrome

It is every CIO's dream to move all applications to the cloud. There are valid reasons to want to achieve this, but forcing all enterprise IT applications to get on the cloud from day one is a mistake. Many IT managers still believe that the cloud is virtualisation - virtual machines on demand for running applications - and so if a suitable VM can be provisioned (in terms of CPU/memory/storage capacity) then any application can be moved to the cloud. In reality, moving to the cloud goes far beyond running an application in a VM. Unfortunately, that is all that can be achieved when applications are shoe-horned into the cloud under an unrealistic time-to-cloud constraint.

The reality is that many enterprise applications need to be re-architected and perhaps even rebuilt (with more cloud-friendly constructs) before they become cloud citizens. Yes, it is possible to run virtually any application in the cloud, but by forcing the IT organisation to support each unique application deployment on the cloud you are setting up for failure. IT teams are not nearly as scalable as cloud infrastructure - the technical debt of creating one-off VMs in the cloud will quickly overwhelm them and lead to missed configurations, broken VMs and catastrophic failures. Moreover, clouds are designed with the assumption that service availability is distributed across several hypervisors. Every time a legacy "VIP" application is installed on a VM with no distributed capabilities, it prevents the infrastructure team from rebooting that hypervisor when, for example, its OS needs to be patched. In general, introducing exceptions and a class system in the application population takes infrastructure agility out of the cloud value proposition.

Which Way then?

One could argue that if you tried to avoid the two syndromes I mentioned above the cloud experiment is not worth it in the first place. There is a grain of truth to this. If the application mix is almost entirely heterogeneous servers with no scope for re-architecting services, then you are better off leaving the infrastructure untouched or looking at dialling down to a virtualisation platform instead of a cloud platform (with very few cloud benefits). A rather pessimistic recommendation.

The happy reality is that most application owners "get it" about why they need to be on the cloud. They are genuinely interested in modernising their architecture but may not have the resources or time to do so in sync with a cloud roll-out. Many CotS (commercial off-the-shelf) applications are also being (slowly) re-written and re-architected to become more cloud friendly. Moreover, enterprise SaaS applications (for example, remotely-hosted Workday) are making inroads into enterprises and displacing monolithic legacy applications as time goes by. Hopefully some of the problematic application workloads will get outsourced to software services hosted off-premise.

Cloud adoption in the enterprise should be a natural, un-hurried organic process that gathers steam as it progresses, eventually taking on a life of its own. Don't mutate the cloud to fit the whims of each legacy enterprise application; that will destroy the whole value proposition of the cloud.

But all this will happen in the mid to long term. What of the need to prove the value of the shiny new cloud that was stood up in the last two quarters, idling away and depreciating right now?

Start small and grow organically. Don't be misled by capacity plans that assume that everything will move to the cloud in the next 12 months. Let the cloud grow organically - it's one of the key strengths of infrastructure as a service. Don't worry if initial adoption is slow or if the ramp up seems to be taking too long (the more heterogeneous the application mix, the more time this will take - the initial cloud automation process is slow). There comes an inflection point when application owners see their peers enjoying the benefits of automation, self healing and elastic capacity and there is a mass conversion to the cloud. Peer pressure is an incredible motivator for application owners to move to the cloud.

Focus on the low hanging fruit first: applications like the stateless web tier or the dev/test environment for a few development teams. Facilitate their move to the cloud and use them as the poster child to make other application owners drool over what the cloud could do for them if only they re-architected their application. It is usually easier to work with new applications than to migrate older ones, so in the beginning keep an eye out for new services being developed and deployed by the enterprise. For example, the organisation may be working on exposing an HTTP-based RESTful API to its customers and, given the need to scale with customer demand, this is a great citizen for the cloud.

Protect the cloud way. Make sure you do not dilute the cloud's value proposition by bending to every demand an application owner makes. Circle your wagons around the cloud architecture team and let them have the final say when it comes to change requests for key cloud functionality and components. It's perfectly fine to build an organisational shim around vanilla cloud installations (e.g. custom authentication or monitoring) but do not transplant legacy technology into the cloud - keep the shim slim. Keeping the cloud close to the original also helps roll out upgrades and patches released by the cloud developer community. Enterprises get the additional benefit of forcing legacy dinosaurs to rethink their applications for the future rather than desperately clinging to the past. After all, applications have life cycles too and you don't want to be burdened by the past forever. The greatest risk of all is getting stuck with ageing applications that few employees understand or that are supported by a single vendor: the antithesis of future-proofing enterprise IT.

Outlook

When organisations sign up for the cloud, they are signing up for much more than spinning up VMs to run their applications on. Think of the cloud as a framework that forces applications to modernise, to embrace scale-out and often open-source technology instead of being totally dependent on vertical scalability and vendor lock-in. It enables healthy IT hygiene like devops practices to automate dev-test-deployment, driving up the productivity of the development and IT organisation (for example, the average number of servers managed by each IT specialist). In fact, succeeding in the cloud is more about changing the culture and approach to how IT is consumed in the organisation and less about the particular technology used. Enterprises went digital decades ago. Now the age of many applications running in their data centres is showing. The cloud will help them modernise, but only if there is genuine buy-in by IT consumers to walk the walk into the future.

Dynamo, Swift Objectstore, Cassandra - Part 2: Openstack Swift Objectstore

2013-11-19T05:15:00.000+01:00

This is a 3 part series where I discuss the Swift Objectstore and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.

This post is part 2 of the series that looks at the Openstack Swift Objectstore. If you haven't read part 1 of the series - which describes the Amazon Dynamo data store - then you may want to first read that.

OVERVIEW

The Openstack Swift Objectstore is a distributed file storage system. Swift open-source software creates a content-addressable storage cluster on top of multiple (usually off-the-shelf x86) boxes full of storage disks. The content-addressable storage provides a uniform namespace for all objects that are stored in a Swift cluster. Each data object is represented via a URL of the form

http://swift-cluster-lb.address/account/container/object

By clustering multiple off-the-shelf boxes Swift can scale almost limitlessly, and by spreading data replicas across independent failure domains (different disks/servers/racks or even data centres) it achieves data durability. The account, container and object abstractions of Openstack Swift are analogous to volume, directory and file in conventional file systems. Clients can issue Create, Read, Update and Delete (CRUD) requests on a data object by passing different verbs in HTTP requests made against the URL.
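To make the verb-per-operation idea concrete, here is a minimal sketch that only builds (and never sends) requests against the URL form shown above. The helper name build_request and the operation labels are hypothetical; the verb mapping assumes Swift's usual convention that PUT creates or overwrites an object, GET reads it and DELETE removes it (updating metadata alone is normally a POST).

```python
from urllib.request import Request

# Host name taken from the URL form in the text; it is a placeholder, not a real endpoint.
BASE_URL = "http://swift-cluster-lb.address"

# Assumed mapping of CRUD operations to HTTP verbs.
VERBS = {"create": "PUT", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(operation, account, container, obj, body=None):
    """Build (but do not send) the HTTP request for one CRUD operation."""
    url = f"{BASE_URL}/{account}/{container}/{obj}"
    return Request(url, data=body, method=VERBS[operation])


request = build_request("create", "acct", "photos", "cat.jpg", body=b"raw bytes")
print(request.get_method(), request.full_url)
```

A real client would also pass an authentication token header and would go through the cluster's load balancer, both of which are left out here.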

The Swift objectstore uses the Dynamo consistent hashing idea to set up partitions on the multiple boxes. Usually many hundreds or a few thousand partitions, chosen from random ring hash ranges, are mapped to each storage node. Each partition is replicated multiple (N=3 by default) times. Swift guarantees that partition replicas are kept in as distinct failure domains as the hardware allows (e.g. different disks if the whole swift cluster is on one server, or different servers if the whole swift cluster is a rack of servers, or across racks for multi-rack clusters). This strategy ensures high probability of data availability when hardware fails because hardware faults are usually localized. For example, if a top-of-rack switch fails, then a multi-rack Openstack Swift cluster would still be able to serve any stored object because the replicas exist on different racks.
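The mapping from an object name to a partition, and from a partition to replicas in distinct failure domains, can be sketched roughly as follows. This is a simplified illustration and not Swift's ring builder: the partition count, the device list and the round-robin placement rule are all assumptions made for the example (the real ring also salts the hash and weighs devices by capacity).

```python
import hashlib
import struct

PART_POWER = 10  # assumed: 2**10 = 1024 partitions in the whole cluster


def partition_for(account, container, obj, part_power=PART_POWER):
    """Map an object path to a partition using the top bits of its MD5 hash."""
    path = f"/{account}/{container}/{obj}".encode()
    top32 = struct.unpack(">I", hashlib.md5(path).digest()[:4])[0]
    return top32 >> (32 - part_power)


def place_replicas(partition, devices, replicas=3):
    """Pick one device per distinct zone, starting at a partition-dependent offset.

    devices is a list of (device_name, zone) pairs; a zone stands in for a
    failure domain such as a disk, a server or a rack.
    """
    chosen, used_zones = [], set()
    for i in range(len(devices)):
        name, zone = devices[(partition + i) % len(devices)]
        if zone not in used_zones:
            chosen.append(name)
            used_zones.add(zone)
        if len(chosen) == replicas:
            break
    return chosen


devices = [("d0", "rack1"), ("d1", "rack1"), ("d2", "rack2"),
           ("d3", "rack2"), ("d4", "rack3"), ("d5", "rack3")]
part = partition_for("acct", "photos", "cat.jpg")
print(part, place_replicas(part, devices))
```

Because every replica lands in a different rack in this toy layout, losing one top-of-rack switch still leaves two reachable copies of each partition.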

Openstack Swift consists of multiple types of processes usually running (multiple copies) on different physical servers. Openstack Swift clients usually send CRUD requests to a HTTP(S) load balancer, which in turn distributes these requests among a pool of proxy processes. All proxy processes have complete information about how partitions are physically mapped to the boxes and OS partitions on disks within these boxes. So they can direct each incoming request (based on its URL and HTTP verb) to the appropriate storage processes for processing. In this model all data is passed through the proxy and load balancer(s) and no storage node is directly accessible to clients. Openstack Swift uses N=3, W=2 and R=1 by default (see the earlier post on Dynamo for the meaning of these variables). Therefore writes are acknowledged after the incoming object has been successfully written on two separate partitions. Reads are returned via the storage server that is fastest to respond to the proxy server with the requested data.

Like Dynamo, the Swift Objectstore provides eventual consistency. Swift adopts a proactive algorithm to check consistency between replicas of partitions. Storage nodes periodically compare their partitions (which are directories on the file system) and rsync any differences between the partition replicas. The period between these comparisons is user-configurable and set to 30 seconds by default. One of the notable characteristics of Openstack Swift is that the replication is at the file level, not the block level (as in traditional RAID systems). So the amount of time to "rebuild" a broken drive is proportional to the size of the data stored on the drive and not the total capacity of the drive.

Swift also keeps account and container information separately in SQLite databases for any metadata operations (such as listing objects in a container). These SQLite databases are also checked for consistency via periodic comparisons with their replicas and synchronized based on timestamps if there are differences between replicas.

MECHANICS

It's worthwhile understanding how the CRUD operations and background replication and scrub translate to low level disk operations on the storage nodes. Understanding this aspect of Swift opens the door to understanding performance in terms of ideal and less than ideal workloads, the strengths and limitations of Swift, and the impact of hardware choices on system performance. Fortunately, Openstack Swift code is very well documented, well factored, and professionally maintained, making it an excellent source for understanding its inner workings.

Create

A create request requires Swift to store an object (a binary blob of data) along with some metadata associated with the object. The proxy node determines the (N) partitions where the object is to be stored and forwards the request to those storage nodes. Each storage node first does a lookup within the target partition to confirm that an object with identical account/container/object hash does not already exist. If not, then a directory is created within the partition to store the object. The object is stored as a binary file with a .data extension. Metadata associated with the object is also stored within the inode or within .meta files in the same directory. For more details, refer to the Object server and diskfile source code files.

Read

A read request is forwarded by the proxy server to all N storage servers containing the partition in which the object is stored. Each of these storage nodes checks within the appropriate partition whether the directory containing the object exists. If it does, then the object's directory is checked for .ts files (a .ts or tombstone file would indicate that the object is deleted and that a 404 Not Found response should be returned to the client). The directory is also tested for .meta files in case additional metadata files associated with the object are available. Finally, the .data file containing the object and the corresponding metadata, read from the XFS inode metadata and any .meta files, are composed into the HTTP response sent back to the proxy node and the client. Recall that Openstack Swift returns the first successfully read object to the client from among the storage servers. For more details, refer to the Object server and diskfile source code files.
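The decision a storage node makes when it looks inside an object's directory can be sketched as below. This is a simplified illustration of the rules described above and not the actual diskfile code; the function name and the three result labels are hypothetical, and it assumes file names are of the form timestamp.extension.

```python
def resolve_object_dir(filenames):
    """Decide what a read should return for one object directory.

    Returns ("missing", None, []), ("deleted", None, []) or
    ("found", data_file, meta_files).
    """
    def stamp(name):
        # "1382112651.23154.data" -> 1382112651.23154
        return float(name.rsplit(".", 1)[0])

    # Only .data and .ts files decide whether the object currently exists.
    state_files = [f for f in filenames if f.endswith((".data", ".ts"))]
    if not state_files:
        return ("missing", None, [])

    newest = max(state_files, key=stamp)
    if newest.endswith(".ts"):
        # A tombstone newer than any data file means the object was deleted: 404.
        return ("deleted", None, [])

    # Metadata files written after the data file supplement its metadata.
    metas = sorted(f for f in filenames
                   if f.endswith(".meta") and stamp(f) > stamp(newest))
    return ("found", newest, metas)


print(resolve_object_dir(["1382112651.23154.data", "1382112700.00000.ts"]))
```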

Delete

Deletes are asynchronous in Openstack Swift. A .ts (tombstone) file is created in the object's folder to indicate that the object has been deleted. The container SQLite database is also updated. A subsequent asynchronous background process (called the auditor) deletes the object at a later time. For more details, refer to the Object server and diskfile source code files.

Object Replication

Replicating objects is necessary to uphold the eventual consistency guarantee of Openstack Swift. The object replicator process running in the background on each storage server compares its partitions with the other (N-1) replica partitions. For each partition, the replicator asks the other storage nodes holding that partition to send Merkle tree hashes of the objects stored in their partition directory. This data structure allows for quick identification of differing objects in the partition replicas. Subsequently rsync is used to synchronize replicas. For more details refer to the object replicator source code.
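The idea of exchanging hashes first and copying files only where they differ can be sketched as follows. This is a toy model, not the replicator's code: a partition is represented as a plain dictionary from a sub-directory name to the file names under it, and both helper names are hypothetical.

```python
import hashlib


def directory_hashes(partition):
    """Summarise each sub-directory of a partition by a hash of its file names.

    partition maps a sub-directory name to the list of file names stored in it.
    """
    return {
        subdir: hashlib.md5("".join(sorted(files)).encode()).hexdigest()
        for subdir, files in partition.items()
    }


def subdirs_to_sync(local, remote):
    """Return the sub-directories whose hashes differ and therefore need an rsync."""
    local_hashes = directory_hashes(local)
    remote_hashes = directory_hashes(remote)
    return sorted(s for s in local_hashes if local_hashes[s] != remote_hashes.get(s))


local = {"89f": ["1382112651.23154.data"], "a01": ["1382110000.00000.data"]}
remote = {"89f": [], "a01": ["1382110000.00000.data"]}
print(subdirs_to_sync(local, remote))
```

Only the differing sub-directory would be handed to rsync; the matching one costs a single hash comparison.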

Scrub

Objects are periodically scrubbed to check for bit rot. Swift implements a periodic disk scrub in the background by computing checksums of stored file objects and comparing them with stored checksums. This process identifies any data corruption (due to disk bit rot for example) and suitably addresses the errors by creating more copies from the other replicas of the data object. The metadata stored for each object contains the (MD5) hash of the object's data. The auditor process(es) running on each storage node cycle through all partition directories containing objects, compute the MD5 hash of each object and compare it to the stored checksum. A mismatch indicates corrupted object data; the object is quarantined and a subsequent replication run restores its data from the other replicas. For more details refer to the object auditor source code.
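The checksum comparison at the heart of the scrub can be sketched in a few lines. This is an in-memory illustration only: the audit function and its input format are hypothetical, whereas the real auditor walks files on disk and moves corrupted ones to a quarantine directory.

```python
import hashlib


def audit(objects):
    """Compare each object's data against its stored MD5 checksum.

    objects maps an object name to (data_bytes, stored_md5_hex).
    Returns (healthy_names, quarantined_names).
    """
    healthy, quarantined = [], []
    for name, (data, stored_md5) in sorted(objects.items()):
        actual_md5 = hashlib.md5(data).hexdigest()
        if actual_md5 == stored_md5:
            healthy.append(name)
        else:
            # Corrupted: set aside so that replication can restore a good copy.
            quarantined.append(name)
    return healthy, quarantined


original = b"hello swift"
checksum = hashlib.md5(original).hexdigest()
objects = {"intact": (original, checksum), "rotted": (b"hellp swift", checksum)}
print(audit(objects))
```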

Other Goodies

There are several other convenience features built into Openstack Swift, most of which are beyond the basic Dynamo design. For example there is the ability to specify object versions in create requests (which essentially results in different objects being created for different versions). Time bound objects, which are automatically deleted after a certain interval are also provided (as described in expirer.py). Extremely large objects (over 5 GB) are internally divided and stored as smaller objects. The Openstack Swift implementation is based on modular WSGI pipelines, which allows pipeline-style addition and removal of custom components while processing Swift requests and responses. For example, the Amazon S3 object interface can be enabled by installing an additional component into the processing pipeline in proxy nodes.

Swift also provides automatic disaster recovery features by giving operators the ability to asynchronously replicate objects across remote data centres. Read and write affinity features (as described in the proxy server source code file server.py) ensure that data is accessed from/written to the nearest data centers from clients if possible.

Corner Cases

There are a few corner cases where Openstack Swift may not yield great performance. These are interesting to discuss here.

Very Small files

One of the side-effects of using the hash-based rings to store data in Openstack Swift clusters on any of the 100s or 1000s of partitions on a storage node is that consecutive write operations on storage nodes are neither spatially nor temporally correlated. This means that one write operation will most likely be in a different partition (directory) than the previous write. This poses a challenge when Swift is used to store many small files because caching the XFS inodes and dentries becomes ineffective. To appreciate the issue here, consider this example of the directory layout on a storage node.

Example: An Openstack swift object storage directory with an object directory containing the object's .data file

/srv/1/node/sdb1/objects/717/89f/b359508efc4d39b0d22efde69fc7f89f/1382112651.23154.data

Breaking down the directory path above:

/srv/1/node/sdb1/objects: This is the object directory where all objects stored on the sdb1 device are kept.
/717: This is the partition.
/89f: This is the hash suffix of the object (the last three characters of its name_hash).
/b359508efc4d39b0d22efde69fc7f89f: This is the directory with name = name_hash of the object.
1382112651.23154.data: This is the actual data file containing the object.
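The same breakdown can be done programmatically. The sketch below splits the example path into its components; the function name and the dictionary keys are hypothetical, and it assumes exactly the layout shown in the example.

```python
from pathlib import PurePosixPath


def parse_object_path(path):
    """Split a storage-node object path into its layout components."""
    p = PurePosixPath(path)
    timestamp, extension = p.name.rsplit(".", 1)
    return {
        "objects_dir": str(p.parents[3]),  # per-device objects directory
        "partition": p.parents[2].name,    # ring partition number
        "hash_dir": p.parents[1].name,     # three-character directory
        "name_hash": p.parents[0].name,    # hash of the object's name
        "timestamp": timestamp,            # when this file was written
        "extension": extension,            # data, ts or meta
    }


info = parse_object_path(
    "/srv/1/node/sdb1/objects/717/89f/"
    "b359508efc4d39b0d22efde69fc7f89f/1382112651.23154.data"
)
print(info)
```

Note that in this example the three-character directory (89f) matches the last three characters of the name hash, and that each level is a separate directory whose inode and dentry must be looked up on every access.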

Each time an object is written to this storage node the inode and dentry cache needs to access a random entry down from the partition level. The only practical method to ensure fast inode metadata operations is to make sure that the memory can fit the whole inode and dentry cache. Also consider that storage nodes usually contain multiple large capacity disks, each containing an XFS filesystem and the associated inodes and dentries. All these caches should ideally fit in memory for fast metadata operations! Given that a storage node with 10s of TB of storage capacity can store 100s of millions of small objects (say of 10-100kB size), the memory requirement of each storage node becomes quite large if the inode and dentry caches need to be fully stored in memory. There is a good discussion of this issue here.
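A back-of-the-envelope calculation shows how quickly this adds up. The sketch below is plain arithmetic; the per-object cache footprint of 1,000 bytes is an assumed round figure chosen for illustration, not a measured value for XFS, and the function name is hypothetical.

```python
def cache_memory_bytes(capacity_bytes, object_bytes, cache_bytes_per_object):
    """Estimate the object count and the memory needed to cache all their metadata."""
    objects = capacity_bytes // object_bytes
    return objects, objects * cache_bytes_per_object


# Assumed inputs: 10 TB of stored data, 100 kB objects, 1,000 cached bytes per object.
objects, memory = cache_memory_bytes(10 * 10**12, 100 * 10**3, 1000)
print(f"{objects:,} objects need about {memory / 10**9:.0f} GB of cache")
```

Under these assumptions a node holding 10 TB of 100 kB objects carries 100 million objects and would want on the order of 100 GB of memory for metadata alone; smaller objects or more disks per node push the figure higher.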

It is to be noted that this issue of small files is not unique to Openstack Swift. For example, Ceph, another distributed file system that can be used as an objectstore, stores each individual object as a file on the underlying filesystem (which is usually XFS on production Ceph systems). Many small files stored as objects in Ceph may cause similar issues.

Very Large files

Reading or writing speeds for extremely large files (e.g. several GB) are limited by single spindle speeds in Openstack Swift because objects are not striped across disk spindles. These "elephants" may also slow down read and write operations for other objects being stored on the same partitions (since all the spindles across all the storage nodes that store the elephants will be busy at the same time serving the elephant request). However, the randomization in partition-to-object mapping makes such situations rare, especially if an adequate number of spindles and partitions are provisioned in the Openstack Swift deployment.

OUTLOOK

Hopefully this article has supplemented your knowledge about Openstack Swift and encouraged you to look at the (very accessible) Swift source code to find exact answers to any questions you may have about it. In addition, Openstack Swift is remarkably well documented. Together, the source code and documentation unambiguously answer almost any question about how Openstack Swift works.

In the next (and last) part of this series we'll look at Cassandra, another very popular data store based on the ideas of Dynamo.

Dynamo, Swift Objectstore, Cassandra - Part 1: Dynamo review

2013-10-30T04:16:00.000+01:00

This is a 3 part series where I discuss the Swift Objectstore and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.

PART 1: Amazon Dynamo review

Dynamo was described in a 2007 paper from Amazon. Dynamo is a distributed data store that Amazon developed in order to service the database and storage use cases of Amazon at the time. Yes, both database and storage needs because Dynamo can be adapted to store data in ways that are more like databases or ways that are more like distributed file systems. It achieves this by taking a pre-existing “persistence engine”, such as Berkeley DB or MySQL or a filesystem, and adding distributed system sauce to make these services scale horizontally across hundreds of cheap servers in an SLA-bound manner with respect to latencies at the 99.9th or higher percentile. This is absolutely remarkable because Dynamo transforms mature but non-distributed data stores (like Berkeley DB) to scale horizontally, almost limitlessly.

Dynamo provides redundancy by storing multiple data copies (N) in different physical servers, possibly located in independent fault domains. The fault domains requirement keeps the N copies in different Amazon data centers, making the failures of the individual copies effectively independent of each other and the probability of losing all N copies practically minuscule. Read and write load is distributed across multiple servers by randomizing which servers are responsible for which data. Then, given the reasonable assumption that the number of frequently accessed data objects far exceeds the number of physical servers, load is nicely balanced across the physical servers.

Dynamo achieves these properties by making two tradeoffs compared to more traditional data stores such as RDBMS databases. The first tradeoff is that reading data off Dynamo may not yield the most recent write update to the data. This is the tradeoff of consistency in the famous (C)onsistency, (A)vailability and (P)artition tolerance triangle of the CAP theorem in distributed computing. Applications using a Dynamo data store need to be intelligent enough to detect and deal with inconsistent copies of data that may yield a stale version of the data object.

The second tradeoff Dynamo makes is limiting the richness of the data schema by providing a simple key-value store. Therefore, unlike traditional RDBMS databases, the data store implements content-addressable storage via a simple key-value model. Applications can create a key-value pair corresponding to a data object via an insert or “put” operation and read the value corresponding to a key via a “get” operation. While putting an object, Dynamo provides version support by passing implicit metadata during the put operation on data objects, which serves as a means of handling conflict resolution between different versions of the same data object. Deletes are handled by inserting a tombstone corresponding to the key that needs to be deleted.

Consistency is eventual: this means that if an object is left unmodified for a (finite) amount of time then all N copies of the object will be identical. Moreover, it is guaranteed that the latest update of the object becomes available after a finite amount of time at all physical locations hosting the N copies, so a subsequent read off any physical location will yield the latest version of the data. The exact time required for the system to enter this state depends on the number of failures, network conditions, and user-defined parameters that control how aggressively the background replication of objects happens.

Dynamo uses Merkle trees to reconcile data quickly and scalably in the background. The key benefit of Merkle trees is that the data transferred for reconciliation is proportional to the number of differences and not to the total number of data objects being checked for consistency via reconciliation.
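A toy Merkle tree makes this property visible. The sketch below is a generic illustration rather than Dynamo's implementation: it assumes both replicas hold the same power-of-two number of key buckets, uses SHA-256 as the hash, and all function names are hypothetical.

```python
import hashlib


def digest(data):
    return hashlib.sha256(data).digest()


def build_tree(bucket_values):
    """Build a Merkle tree; returns a list of levels, root level first."""
    levels = [[digest(value) for value in bucket_values]]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([digest(below[i] + below[i + 1])
                       for i in range(0, len(below), 2)])
    return levels[::-1]


def differing_buckets(tree_a, tree_b, level=0, index=0):
    """Walk both trees from the root, descending only where hashes differ."""
    if tree_a[level][index] == tree_b[level][index]:
        return []  # identical subtree: nothing below needs to be exchanged
    if level == len(tree_a) - 1:
        return [index]  # reached a leaf that differs
    return (differing_buckets(tree_a, tree_b, level + 1, 2 * index)
            + differing_buckets(tree_a, tree_b, level + 1, 2 * index + 1))


replica_a = [b"v0", b"v1", b"v2", b"v3", b"v4", b"v5", b"v6", b"v7"]
replica_b = list(replica_a)
replica_b[5] = b"stale"
print(differing_buckets(build_tree(replica_a), build_tree(replica_b)))
```

When the replicas are identical, a single root comparison settles the matter; with one stale bucket, only the hashes along one root-to-leaf path are examined.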

The paper describes the interplay between the number of copies (N), the number of copies read (R) before returning a client’s read request and the number of copies successfully written (W) before acknowledging the client’s write request. When N < R + W then strong consistency is guaranteed (since the intersection of the servers where the W writes and the R subsequent reads are performed cannot be an empty set). This assumes that the read happens after the write is acknowledged. If instead two independent clients were to write and read a data object respectively, with the read request hitting the system before the write request of the other client is acknowledged (a classic asynchronous and independent data access pattern), then there is no guarantee. The authors suggest that N=3, R=2 and W=2 are used for several Amazon services.
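The overlap argument can be checked by brute force for small clusters. The sketch below enumerates every possible set of W written replicas and R read replicas and confirms that they always share a server exactly when R + W > N; the function name is hypothetical.

```python
from itertools import combinations


def always_overlap(n, r, w):
    """True if every choice of w written nodes and r read nodes shares a node."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )


print(always_overlap(3, 2, 2))  # N=3, R=2, W=2: reads always see the latest write
print(always_overlap(3, 1, 2))  # N=3, R=1, W=2: a read can miss the latest write
```

Lowering R to 1 trades that guarantee for faster reads, which is why such a configuration is only eventually consistent.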

The most interesting part of Dynamo is the partitioning that dictates how data is dispersed among backend resources (physical servers). Consistent hashing is used to divide up the key space (the 128 bit MD5 hash keyspace of the keys in the key-value insertions). The consistent hashing approach guarantees that when a physical server is added to or removed from a Dynamo cluster of M nodes then the total data moved is a 1/M fraction of the data stored in the cluster. Given the limited number of physical servers, each is further divided into 100s of virtual nodes which are randomly mapped to the consistent hash ring. This enables faster recovery from failure conditions (disk and node malfunctions) and distributes replication load during recovery across all servers in the Dynamo cluster.
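A small consistent hash ring with virtual nodes shows both properties at once. This is a minimal sketch and not Dynamo's code: the Ring class, the "server#i" naming of virtual nodes and the choice of 100 virtual nodes per server are assumptions made for illustration.

```python
import bisect
import hashlib


def token(text):
    """Position on the ring: the 128 bit MD5 hash as an integer."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self._tokens = []  # sorted list of (token, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each physical server owns many pseudo-random points on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._tokens, (token(f"{server}#{i}"), server))

    def owner(self, key):
        # The first virtual node at or after the key's token owns the key.
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]


ring = Ring(["s1", "s2", "s3", "s4"])
keys = [f"key-{i}" for i in range(2000)]
before = {k: ring.owner(k) for k in keys}
ring.add("s5")
moved = [k for k in keys if ring.owner(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all of them to s5")
```

Only keys claimed by the new server change owner, and because its virtual nodes are scattered around the ring, it takes a small slice from every existing server rather than a large slice from one.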

There are several other handy features - for example, hinted handoffs for maintaining the correct replica count even when the preferred physical server for which the data is destined is transiently down. Write speedups using ideas similar to commit logs, via the notion of an “in memory writer” with at least 1 persistent copy, are also described. There are also some SLA-related performance graphs which show the remarkable availability and bounded latency properties of Dynamo under production settings. I highly recommend reading the paper on your own if you have got so far in reading this write-up.

In the next part of this series I will dive into Openstack Swift, a Dynamo-inspired file object store. I will analyze design decisions particular to Swift, what is unique and different in each, and where (perhaps) Swift could still improve by standing on the shoulders of the grand-daddy of modern distributed systems - Dynamo.

And after that, we’ll repeat the above for another Dynamo-inspired data store - Cassandra. Stay tuned.

Understanding Storage IOPs

2013-09-23T04:27:00.000+02:00

Input output operations per second, commonly termed IOPs, is an important performance measure of a data storage system. The unit of IOPs is operations per second. Input operations, i.e. when data is written to the storage, are different from output operations, when data is read off the storage. Their corresponding IOP numbers are also different. Therefore it is important to know both the inputs per second and the outputs per second performance of a storage system. These numbers are not independent of each other - reading off the storage system while writing to it may affect the performance of both operations. Therefore IOPs are quoted as a weighted average of the inputs and outputs per second. For example, a storage system may be capable of performing 100,000 IOPs with 70% reads and 30% writes, with the operations being performed concurrently. The storage engineer needs to validate the concurrent input/output requirements of the application using the storage system to ascertain if they can be met by the storage.

But IOPs by themselves don’t paint the whole picture of storage performance. Here are the questions that fill in the rest:

What is the size of the read/written data block while measuring IOPs?
What is the end-to-end latency seen by the application in reading or writing the data block?
With regards to writes, are the reported IOP numbers for synchronous writes or asynchronous writes?
With regards to reads, what is the role of cache?

Large data blocks take longer to write to/read from storage. So IOP numbers of 4kB blocks will be very different than IOPs for 1MB blocks. The most relevant block size for an IOP measure should correspond to the sizes of blocks being written to the storage system by the application. Knowing the IO profile of the application is key to choosing an appropriate storage system for an application.
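A quick way to see why block size matters is to convert an IOPs figure into a data rate. The numbers in this Python sketch are hypothetical and only illustrate the arithmetic.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Data rate implied by an IOPs figure at a given block size (1 MB = 1024 kB)."""
    return iops * block_size_kb / 1024


# Hypothetical figures, for illustration only.
print(throughput_mb_per_s(100_000, 4))    # many small 4 kB operations
print(throughput_mb_per_s(1_000, 1024))   # few large 1 MB operations
```

A system quoted at 100,000 IOPs with 4 kB blocks moves less data per second than one doing only 1,000 IOPs with 1 MB blocks, so an IOPs number without its block size says little.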

Latency is a key IOP qualifier. Storage system latency is the time from when the application issues an IO request to the storage system to when the request is completed - the read data is delivered to the application or the storage system acknowledges that the data block has been written. The important question with respect to writes is when the storage system acknowledges that the data block has been persisted on non-volatile storage. For some applications writes may be asynchronous - they are acknowledged before they have been persisted on non-volatile storage. Since asynchronous IOP and latency numbers look better (higher IOPs, lower latency), storage systems’ promotional material often quotes asynchronous write IOPs.
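Latency and IOPs are tied together by how many requests the application keeps in flight. The Python sketch below applies Little's law under a steady-state assumption; the queue depths and the 5 ms latency are hypothetical.

```python
def max_iops(outstanding_ios, latency_ms):
    """Little's law in steady state: throughput = requests in flight / latency."""
    return outstanding_ios * 1000 / latency_ms


# Hypothetical: the same 5 ms device seen by two different applications.
print(max_iops(1, 5))    # one synchronous request at a time
print(max_iops(32, 5))   # 32 requests kept in flight
```

An application that issues one synchronous request at a time can never see more than 1/latency operations per second, whatever the headline IOPs figure says.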

Some systems have battery-backed non-volatile RAM to allow the acknowledgement to be sent to the application as soon as the data block is written to RAM - usually orders of magnitude faster than data storage media like SSD or disk. The question then becomes, what is the size of this non-volatile RAM that can hold data blocks before the data need to be persisted on the slower storage media. While some applications have bursty write profiles which play nicely with this (limited) non-volatile RAM, applications that require sustained write performance may not benefit much from such methods.
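The burst-versus-sustained distinction comes down to simple arithmetic. In this Python sketch the buffer size and data rates are hypothetical, and the model assumes constant ingest and drain rates.

```python
def seconds_until_full(buffer_mb, ingest_mb_s, drain_mb_s):
    """How long a write burst can be absorbed before the NVRAM buffer fills up."""
    if ingest_mb_s <= drain_mb_s:
        return float("inf")  # the backing media keeps up; the buffer never fills
    return buffer_mb / (ingest_mb_s - drain_mb_s)


# Hypothetical: 8 GB of NVRAM, writes arriving at 1000 MB/s, media draining at 200 MB/s.
print(seconds_until_full(8192, 1000, 200))
print(seconds_until_full(8192, 150, 200))
```

Once the buffer is full, write latency falls back to that of the slower media, which is why sustained write workloads gain little from it.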

Similarly, RAM can be used to cache data for reads - read latency decreases when cache hits serve data blocks from RAM instead of reading them off slower storage media. The size of the RAM cache as well as the application’s read patterns - is some data read more often than other data? - are important considerations while working with caches.
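The effect of the cache hit ratio on average read latency can be sketched as a weighted average. The latencies below are hypothetical placeholders, not measurements.

```python
def effective_read_latency_us(hit_ratio, ram_us, media_us):
    """Average read latency when a fraction hit_ratio of reads is served from RAM."""
    return hit_ratio * ram_us + (1 - hit_ratio) * media_us


# Hypothetical: 1 microsecond from RAM, 100 microseconds from the storage media.
for hit_ratio in (0.0, 0.5, 0.9, 0.99):
    print(hit_ratio, effective_read_latency_us(hit_ratio, 1, 100))
```

The average is dominated by misses until the hit ratio gets very high, so the application's read pattern matters as much as the cache size.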

The key to having an intelligent conversation about IOPs is to know your application and to seek definitive answers about latency, data block sizes, synchronous/asynchronous assumptions and caches.

Beautiful Mergesort

2012-07-07T23:10:00.000+02:00

Effective data structures and programming is about putting simple blocks of logic together until the whole is larger than the sum of the parts. A great example is Mergesort. Mergesort is a beautiful algorithm that sorts in N log N time (N being the number of elements being sorted). This matches the asymptotic lower bound for comparison-based sorting.

Mergesort brings together the concepts of recursion, divide-and-conquer, arrays, and pointers (when implemented in C). So here is to John von Neumann and his fantastic invention of Mergesort.

#include <stdio.h>
/*
Simple Mergesort implementation by Sachin Agarwal
sachinkagarwal@gmail.com


 */
void PrintList(int list[], int stIndex, int endIndex)
/*
  Helper function - Print the elements of array list, between stIndex and endIndex
*/
{
  for(; stIndex <= endIndex;stIndex++)
    printf("%d ",list[stIndex]);
}

void Merge (int dataArray[], int stIndexL, int endIndexL, int stIndexR, int endIndexR) 
/*
  Merge two lists into a tempArray in sorted order; this tempArray is copied back to dataArray. 
*/
{
  int tempArray[endIndexL-stIndexL+1+endIndexR-stIndexR+1]; //temp storage
  int stL = stIndexL;
  int resultCtr = 0;
  while (stIndexL <= endIndexL) {
    if (stIndexR>endIndexR) {
      tempArray[resultCtr++]=dataArray[stIndexL++];
      continue;
    }
    if (dataArray[stIndexL] <= dataArray[stIndexR]) {
      tempArray[resultCtr++] = dataArray[stIndexL++];
    }
    else {
      tempArray[resultCtr++] = dataArray[stIndexR++];
    }
  }
  while (stIndexR <= endIndexR) {
    tempArray[resultCtr++] = dataArray[stIndexR++];
  }
  int ii = stL;
  for (; ii<=endIndexR;ii++)
    dataArray[ii] = tempArray[ii-stL];
}

void Mergesort(int dataArray[], int stIndex,int endIndex) 
/*
  Mergesort routine, divide into two halves, and then merge.
*/
{
  int midIndex = stIndex + (endIndex-stIndex)/2;
  if(stIndex<endIndex) {
    Mergesort(dataArray,stIndex,midIndex);
    Mergesort(dataArray,midIndex+1,endIndex);
    Merge(dataArray,stIndex,midIndex,midIndex+1,endIndex);
  }    
}

int main() {
  int dataArray [] = {5,3,7,8,1,6,9,2,0};
  int stIndex = 0;
  int endIndex = sizeof(dataArray)/sizeof(int) - 1;
  printf("\nUNSORTED DATA ARRAY: ");
  PrintList(dataArray,stIndex,endIndex);

  Mergesort(dataArray,stIndex,endIndex);
  printf("\nFINAL RESULT: ");
  PrintList(dataArray,stIndex,endIndex);
}
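
The N log N behaviour can also be checked empirically. The following is a separate Python sketch (not a port of the C program above) that counts element comparisons; the `counter` argument is just an illustrative way to collect the count.

```python
import math
import random


def mergesort(items, counter):
    """Return a sorted copy of items; counter[0] accumulates element comparisons."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = mergesort(items[:mid], counter)
    right = mergesort(items[mid:], counter)
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        counter[0] += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One side is exhausted; the rest of the other side is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


data = list(range(1000))
random.Random(0).shuffle(data)
counter = [0]
result = mergesort(data, counter)
bound = len(data) * math.ceil(math.log2(len(data)))
print("sorted:", result == sorted(data), "comparisons:", counter[0], "N*ceil(log2 N):", bound)
```

The comparison count stays below N times the (rounded up) base-2 logarithm of N, in line with the N log N claim.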

Kernighan & Ritchie Nostalgia - circular right shifting in C

2012-05-23T00:42:00.001+02:00

I picked up my old Kernighan and Ritchie copy last weekend and came across the C puzzle to do circular right shifts on unsigned integer bits.

So for example if
x = unsigned int 1 = binary 00000000000000000000000000000001

The function

unsigned int rightrot (x, 29)

will right shift the bits 29 times, making sure that the bits wrap around.

Here is the code I came up with. Can you think of clever ways to improve it?

#include <stdio.h>

unsigned int rightrot(unsigned int data, unsigned int rotCount)
{
  unsigned int ctr = 0;
  for(ctr=0;ctr<rotCount;ctr++) {
    unsigned int lsb = data & 1;
    data >>= 1;
    if (lsb)
      data |= (lsb << (sizeof(unsigned int)*8-1));
  }
  return (data);
}

int main() {
  unsigned int d = 1;
  unsigned int shift = 31; 
  /* %u, not %d: the values are unsigned and may have the top bit set */
  printf("%u right shifted and rotated %u times is %u\n",d,shift,rightrot(d,shift));
  return 0;
}
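
One possible improvement is to drop the loop and rotate in a single step with two shifts and an OR. The sketch below shows the idea in Python, assuming a 32-bit word (the `WORD_BITS` and `MASK` names are made up here); Python integers are unbounded, so the mask stands in for the fixed width of a C unsigned int.

```python
WORD_BITS = 32                 # assumption: unsigned int is 32 bits wide
MASK = (1 << WORD_BITS) - 1    # keeps results within 32 bits


def rightrot(data, rot_count):
    """Rotate the low 32 bits of data right by rot_count positions, without a loop."""
    rot_count %= WORD_BITS
    return ((data >> rot_count) | (data << (WORD_BITS - rot_count))) & MASK


print(rightrot(1, 29))               # → 8
print(hex(rightrot(0x12345678, 4)))  # → 0x81234567
```

In C the same expression needs care when the rotation count is 0, because shifting a 32-bit value by 32 bits is undefined behaviour there.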

How I fixed my flat LCD TV

2012-05-15T22:38:00.002+02:00

I have a Samsung 40 inch LCD TV. A little over two years old and with all the works - full HD, USB, multiple HDMI inputs, etc.

Then 3 weeks ago my wife heard 2 pops. Both my satellite receiver (a separate box) and my Samsung LCD had been fried, thanks to a high voltage spike in the antenna cable. My power surge protector strip was powerless (no pun intended) against this, because that high voltage signal traveled into my satellite receiver through the antenna cable, and from there on through the HDMI cable into my TV.

No indicator power/standby light, non-responsive TV. Oh no. Checkmate.

The satellite receiver was easy to fix - it was immediately replaced by the store where I had bought it (still under warranty). That is why, sometimes, paying a few bucks extra instead of buying things from the Internet pays off later.

Anyway, the elephant in the room was the broken TV. Since it was out of warranty, I didn't even bother calling Samsung - they'd take 100+ euros to just look at it, and their service center is too far anyway from where I live. The local TV repair shop put the bill at a minimum of 280 Euros - that's almost half the cost of the TV itself. Out of the question. What else could I do?

Down but not out, I started looking into fixing it on my own. Now that is risky. Why?

I have no training in fixing TVs (although I am a computer and electrical engineer)
A TV's insides can be very dangerous (You have been warned. Don't mess with it unless you really know what you are doing)
Where and how to start?
How would I figure out what is broken?

On the other hand, what had I got to lose anyway? It was a paperweight if I couldn't do something about it.

So I got the Phillips #2 screwdriver and took the plunge. Almost a dozen screws later, the plastic back was off and I peered into the insides of my beloved LCD TV. The first thought that came to me was - this looks a lot like the insides of a modern desktop computer. Not at all like those old CRT-display based TVs full of myriad electronic components and circuits. And I do know how to build and fix desktops. Why should this be so different?

Inside the LCD TV

Back side of the LCD TV, after removing the controller board. The top part (side of the TV) is the power board. The connectors from the powerboard to the controller board, from the controller board to the LCD screen, and from the controller board to the speakers/front panel are also visible.

There are 3 main components - the screen itself, the power board (it's got lots of big inductors and capacitors on it - you can't miss it), and the controller board. In addition, there are small pilot circuits that run the remote control, front panel lights, speakers etc. In my case, I immediately noticed the problem - a burned out and blackened HDMI port on the controller board. So that is where the disastrous pop sound came from.

Close-up on the controller board. This is the CPU of the TV. Notice the black soot due to the burn out next to the HDMI port in the center of the picture. A careful visual inspection of the power board and the controller board can often yield the source of the problem.

I quickly checked if DC voltages were being delivered from the cable coming out of the power board in order to be confident of my thesis that the controller board was the most likely culprit. I couldn't see any easy way of fixing the controller board. The parts are minuscule and this is a double-sided printed circuit board - parts on both sides. Too intricate for human hands to manipulate. Besides, how would I ever figure out which of those chips had been fried?

The only option was to replace the whole board. Thankfully, other people seem to be breaking their LCD TVs in creative ways that leave their controller boards intact to be sold as parts - for example, Nintendo Wii controllers are notorious for breaking LCD screens. And sure enough, there are Internet vendors who have made a flourishing business out of selling parts pulled from other broken TVs. I found a vendor in the UK who stocked the controller board of my TV.

Now the trick is getting the same exact part as the one you are replacing. For this you need the part number. Note that the same controller/power boards are used across multiple Samsung TVs so searching for parts based exactly on the TV model number is sub-optimal. In case of my controller board I found the part number printed on the board itself (from my TV the controller board was BN41-01167C-MP1). Pasting this part number in Google yielded several sellers who offer it. I chose FlatTVParts.co.uk for the favorable customer reviews and testimonials. They ship worldwide. I was not disappointed by their service. The part is provided with the guarantee that it is in working condition, and I believe there is a small return period as well (I hope I don't have to avail of this!). Total cost, including international shipping, was 60 Euros.

So, I screwed in this controller board and plugged in the connectors (these connectors are just like computer connectors). I made sure that I put back the back plastic panel before testing (caution - high voltages).

Then I turned it on and waited. The standby light lazily turned a beautiful red. I pressed the power button. Tuned my satellite receiver. My 8-month-old daughter let out a squeal when the Kika channel came up. I was home.

Devstack - Treading Lightly into the Openstack World

2012-04-01T18:52:00.000+02:00

Devstack is a simple way to test drive Openstack before committing too much time to setting up a production-ready Openstack installation. It allows you to set up a toy Openstack installation in as little as a single virtual machine running on your PC. Then you can start and stop virtual machines (nova-compute), assign storage (Nova swift), play with Openstack networking or run your own images (Glance), and monitor your personal cloud via the nice graphical interface using Openstack Horizon in a web browser.

Devstack works by executing a bash script which sets up all the software dependencies and Openstack software to run Openstack. Its primary use case is for Openstack development - if you are interested in tweaking Openstack's software then Devstack flattens the steep learning curve of setting up the test environment. However, it is also an excellent "try before you dive in" option. Compared to the free Amazon AWS trial, you will gain first hand experience of the entire IaaS cloud ecosystem - from setting it up to provisioning cloud resources for your applications.

Here are the steps to run Devstack:

Install VMware or another hypervisor of your choice (e.g. Virtualbox) on your computer; you probably need a powerful computer with lots of RAM because what we are going to do is run a virtual machine on this computer and then run the Openstack software inside this VM; which in turn will spawn virtual machines inside the outer virtual machine! I recommend at least a dual core processor with 4GB of RAM.
Download the Ubuntu 11.10 ISO from here. You probably don't need the desktop version; the server version should be fine. Of course this is assuming that you will run the Openstack Horizon client in a web browser of the host operating system (otherwise you need to run the web browser in the VM, which means you need X). So make sure you have networking connectivity between the host and the guest Ubuntu 11.10 virtual machine.
Use the downloaded ISO to create the Ubuntu 11.10 server virtual machine using the hypervisor you selected. Make sure to assign it adequate resources. A good start would be 2.5 GB of RAM, 40 GB of disk space and two cores.
Next follow the straightforward instructions from the Devstack webpage.
The main Devstack script (stack.sh) spits out the url of the Openstack Horizon web server at the end of the run. If your virtual machine is (network) accessible from your host OS then you can now point your browser to the url, login (credentials: admin,password or demo,password) to play with your very own Openstack cloud!

Ok done! Now what? Well, spawn a VM or two using the Horizon GUI. If you are more enterprising go ahead and try the euca2ools command line tools (the same command line tools used to speak with AWS also work here because Openstack supports the EC2 API). Try to spawn a whole bunch of VMs until your cloud gives up (in this case, the VM running the cloud is giving up!). If you are interested in learning how Openstack is set up, then reading the stack.sh Devstack script is a great introduction to Openstack's internals. You probably know that Openstack is open source, and it's written in accessible Python code. So you can get right into Openstack development and contribute to the project if that is your kind of thing.

A word of caution - the cloud "starts clean" every time its host system (in our case the Ubuntu 11.10 VM) is rebooted. This means that any old VMs, images, or other configuration are cleared (that is the logical thing to do for a developer relaunching the cloud every time she changes the code base). If this is a problem for you then you can simply suspend the virtual machine instead of shutting it down. But again, this is another reason to not use Devstack for a production system.

Stingy LTE Data-plans will ruin the new iPad's Party

2012-03-08T22:59:00.002+01:00

The new iPad has 2048 x 1536 pixels screen resolution. The best way to understand this is that it has a higher resolution than your normal 1080p full HD TV. Let us consider how much data is needed to drive this high resolution screen. Let's talk video first. The Verizon FIOS HD channels typically come in at 15Mbps. That is almost 2MB/second. Now let us switch to using the new iPad as a navigation device in the car. The iPad 3 resolution is more than 5 times that of an iPhone 4 (960 x 640 pixels). This means that Google maps will have to load about 5 times as much map tile data on the new iPad to fill up the screen. From what I read on the Internet, users report that the iPhone 4 burns through about 30MB of data per hour when Google maps is used for navigation. So we are talking about 30x5 = 150MB/hour of data usage on the new iPad when it is used as a navigation device. Another example: in order for jpeg images to completely fill up the new iPad's screen without any software interpolation, they are going to need to be over 3 megapixels each. That is about 1MB of data transfer per JPEG image!
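The pixel and bit-rate arithmetic above is easy to reproduce. This short Python sketch uses only the figures quoted in the post, plus the standard 1920 x 1080 size of a 1080p frame; the variable names are illustrative.

```python
ipad = 2048 * 1536       # new iPad pixels
iphone4 = 960 * 640      # iPhone 4 pixels
full_hd = 1920 * 1080    # pixels in a 1080p frame


def mbps_to_mb_per_s(mbps):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8


print(f"new iPad: {ipad / 1e6:.2f} megapixels")
print(f"versus 1080p: {ipad / full_hd:.2f}x the pixels")
print(f"versus iPhone 4: {ipad / iphone4:.2f}x the pixels")
print(f"15 Mbps video: {mbps_to_mb_per_s(15)} MB/s")
```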

Ok, but why do you need full screen apps and content, you may ask. Can't you compromise on the size and save bandwidth? No, because the new iPad's screen size is only 9.7 inches across and chances are that we use the new iPad while holding it in our hands. If this is the case, then interpolation (up-sampling to fill up the pixels with guessed values) will be really ugly at best (say for jpeg pictures), and unusable at worst (say, maps with minute features marked on them). So while it's acceptable to see a standard definition channel on an HD TV from a distance, it will be unacceptable seeing it from close range on the new iPad in full screen mode.

Now you might say, that's alright because we have LTE to tame the iPad 3 bandwidth sucking machine. Up to 100Mbps if you've heard all the marketing around LTE.

But do we have the data plans? To quote from Techcrunch, here are the LTE plans on offer in the US today.

"
AT&T is offering three plans: 250MB for $14.99, 3GB for $30, and 5GB for $50. On the 250MB plan, you’ll be charged an additional $15 for each 250MB allotment you go over. On the two bigger plans, it’s a $10 overage fee per each additional 1GB of data.

Verizon is offering four different plans: 1GB for $20, 2GB for $30, 5GB for $50, and 10GB for $80. Their overage fees are a little more straightforward — it’s an extra $10 for each 1GB over.
"

Let's plug some of these numbers into the applications we alluded to earlier in this post. Let's say that you have opted to spend $30 per month on your new iPad's LTE connection. How far will your $30 take you?

	AT&T's $30 3GB LTE plan	Verizon's $30 2GB LTE plan
HD-quality video	@ 2MB/sec for HD video you get only 1500 seconds (25min) of HD video from your MONTHLY data plan.	@ 2MB/sec for HD video you get only 1000 seconds (17min) of HD video from your MONTHLY data plan.
Google maps	@150MB/hour you get only 20 hours of Google maps from your MONTHLY data plan	@150MB/hour you get only 13 hours of Google maps from your MONTHLY data plan
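
The table entries follow from dividing each plan's allowance by the data rates estimated earlier. Here is that calculation as a Python sketch, assuming 1GB = 1000MB (which is what the round numbers in the table imply); the function names are made up.

```python
def minutes_of_video(plan_gb, mb_per_s=2):
    """Minutes of HD video a monthly allowance buys at the given data rate."""
    return plan_gb * 1000 / mb_per_s / 60


def hours_of_maps(plan_gb, mb_per_hour=150):
    """Hours of map navigation a monthly allowance buys at the given data rate."""
    return plan_gb * 1000 / mb_per_hour


for carrier, plan_gb in [("AT&T", 3), ("Verizon", 2)]:
    print(f"{carrier} {plan_gb}GB: {minutes_of_video(plan_gb):.0f} min of HD video, "
          f"{hours_of_maps(plan_gb):.0f} hours of maps")
```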

Simply put, these data-plans just won't work for the new iPad. What is really surprising is that LTE, all through its long standardization process in the last decade, always promised lower cost/bit and higher spectral efficiency (slated to be 2x-5x, meaning LTE can pack up to 5 times as many bits in the same frequency spectrum compared to 3G). Lots of new technology was put into the standard to support very high speed wireless mobile broadband services (read more about this on the LTE Wikipedia page).

Where are all those technology savings going? Granted that telcos are making big capital expenditures rolling out the LTE infrastructure, but consumers are paying LTE patent royalties (10s of dollars) too, every time they buy LTE-enabled devices such as the new iPad.

Why are the data plans so anemic?

Can we as consumers really hope to enjoy the benefits of high-resolution mobile technology and LTE when Telcos are cutting out all the joy even before the first new iPad is shipped?

UPDATE: Related WSJ article: Video Speed Trap Lurks in New iPad

UPDATE: Another angle from CNN, why images look so bad on the new iPad: Why do magazines look so bad on the new iPad?

Google+ Circles: Humanity's Social Router

2011-07-16T20:02:00.004+02:00

I have been trying to wrap my thoughts around the importance of Google+ circles. The circles idea is to let Google+ users organize their Google+ contacts into different circles. The idea itself is not new; Facebook has let its users organize contacts into different bins for a long time now. The FB avatar of the idea hasn't really been a killer feature; in fact, PC Magazine published a Google+ circles obituary based on the idea's failure in FB.

Most other features of Google+ are powerful and well planned - like video calling (hang-outs), seamless integration with other Google products (Gmail and YouTube), and a good cross-platform HTML mobile app. Still, Google is touting circles as the key Google+ feature. Why does Google think circles is so important?

Let's look at how Google+ circles affects the user's social networking experience. By gently forcing the user to select which circle a new contact should belong to, Google+ amortizes the job of categorizing contacts. On the other hand, new contacts usually end up in one big "friends" bin in FB. The categorization (or binning) has to be performed later (and this is a tedious task - at least I haven't bothered to do it until now).

With Google+, I've ended up with my contacts being in one of these circles:

Fig. 1: Each circle is a post-box to send messages to a specific contact category.

So now I have a bunch of post-boxes, one corresponding to every circle, where I can post information (pictures/status updates/etc.) and they will get routed to that sub-set of contacts which comprise the circle. This gives me the ability to target information to relevant parts of my social network. I look at this as a social graph routing mechanism. Circles are routing rules that users put in place so that their social message streams are routed appropriately in the social graph.
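The routing analogy can be made concrete with a toy Python sketch. The circle names, contacts and the `route` function are all invented for illustration and say nothing about how Google+ is actually built.

```python
# Hypothetical circles: each one is a named set of contacts.
circles = {
    "family": {"mom", "dad"},
    "colleagues": {"alice", "bob"},
    "friends": {"bob", "carol"},
}


def route(post, target_circles, circles):
    """Deliver post to every contact in at least one targeted circle, and to no one else."""
    recipients = set()
    for name in target_circles:
        recipients |= circles[name]
    return {contact: post for contact in sorted(recipients)}


print(route("Photos from the office holiday party", ["colleagues", "friends"], circles))
```

A contact who sits in several targeted circles still receives the post once, and contacts outside the targeted circles never see it.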

Google+ is constructing humanity's social router via circles which will be programmed via routing rules defined through the elegant circles abstraction. Yes, the same thing can be done with FB, but FB never really tried to make this the center-piece of its product. By gently forcing users to separate relationships via circles, Google+ might just manage to make users feel more confident about selectively routing their social lives with different groups of contacts, rather than blasting messages to everyone they (do or do not) know in their huge FB friend lists. The result should be a more information-rich Google+ social network. With more information comes better advertisement targeting possibilities.

Users are more concerned about privacy with respect to their contacts (my family should not see what happened in the office holiday party) rather than Google knowing every intimate detail of their lives. A functional social router will implement this wish without coming in the way of Google obtaining user information. No I don't think Google will be more discerning than FB when it comes to monetizing the private information of users, but hey, who cares about user privacy anyway?

App Engine's Price Shock and the New Web App Equilibrium

2011-05-22T11:57:00.006+02:00

App Engine's Price Shock

The Google App Engine blog announced that Google's App Engine is moving out of beta this year. According to the blog this means that App Engine will have to become financially attractive to Google, hence the corresponding upward revision in pricing. The details are intricate and affect customers (app developers) differently based on what services they use; the takeaway is that there is a reduction in the free-tier plan, a fixed monthly subscription fee, and a switch from billing based on CPU cycles to billing based on CPU-instance uptime. Now this has made several app developers unhappy because they feel slighted by Google changing the pricing structure, so much, so late. Google is trying to assuage developer anger, but I am guessing that Google App Engine stands to lose many apps and developers given these higher pricing structures.

Anyway, going back to the 3 highlights of the pricing change, let's see how it improves App Engine's business plan.

Reduction of the free-tier - Google claims that the free tier was eating into App Engine resources because there are many non-paying apps that run just fine using these free resources. Google is out to trim the free tier, leaving just enough to still win over developers who want to experiment with the App Engine, but no more. If an app is seriously deployed, it needs to pay for App Engine real estate. Many apps running on the App Engine are back-ends to iOS and Facebook apps. Why should Google be bank-rolling apps that benefit those platforms? App Engine's early adopter phase is over and freebies can end.
Fixed $9 monthly subscription fee - The App Engine has been a fantastic success with more than 100K deployed apps. Even if 10% of them can be converted into monthly subscriptions then the App Engine will have an evergreen revenue stream (many developers have complained that they would end up paying much more than $9 per month given the new pricing model).
CPU-instance time billing - Google wants customers to pay for availability (keeping their web apps running 24 x 7) rather than the amount of computation used by the apps. Since web apps usually remain online throughout, it makes business sense for Google to charge for this 24x7 CPU-instance time. CPU-cycles usage can be little and far-between given the eccentric product life-cycle, usage variability, and popularity of apps. It makes perfect sense for Google to start charging for availability rather than CPU cycles.
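To see why the switch from CPU cycles to instance uptime matters, compare the billable hours under each model. The workload in this Python sketch is entirely hypothetical, and no actual App Engine prices are used.

```python
def billable_hours_by_cpu(requests_per_day, cpu_seconds_per_request):
    """Hours billed per day when only consumed CPU time is charged."""
    return requests_per_day * cpu_seconds_per_request / 3600


def billable_hours_by_uptime(instances, hours_online=24):
    """Hours billed per day when every running instance is charged for its uptime."""
    return instances * hours_online


# Hypothetical small app: 50,000 requests a day at 0.05 CPU seconds each, one instance.
cpu_hours = billable_hours_by_cpu(50_000, 0.05)
uptime_hours = billable_hours_by_uptime(1)
print(f"CPU billing: {cpu_hours:.2f} h/day, uptime billing: {uptime_hours} h/day")
```

For a lightly loaded app like this one, uptime billing charges for far more hours than CPU billing ever did, even at an unchanged hourly rate.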

Of course these facts are not lost on other cloud providers such as Amazon's AWS service who have always charged per CPU-instance and offer lean free tiers as well.

The Equilibrium Shift

I don't expect developers to shun the App Engine (or other cloud services for that matter) given this and future price increases. It's much more expensive, both in terms of capex and opex, for small and mid-sized app shops to achieve a cloud provider's level of service availability and convenience on their own. But these price changes will affect the technology and architecture decisions of app developers. Here is how:

Simplification of server-side logic, and richer client side logic - Clever web app design can off-load CPU cycles to web-browsers, more so with the advent of sophisticated client-side Javascript libraries and HTML5. I expect web app developers to aggressively move more processing into the user browser. This has the added benefit of more responsive web applications, for example HTML5's local data store can be used to store user data and shave-off network latency.
Multi-threaded server-side architecture - Charging for instances would push app developers toward utilizing instances more efficiently, for example, by adopting multi-threaded programming approaches. There is already talk about App Engine developers shunning Python (inherently weak in multi-threaded functionality) and moving to Java.
Space-CPU time trade-off - Storage space continues to be relatively cheap in the cloud. I expect developers to store a lot more application state rather than having to compute it again at a later time. For example, instead of using traditional RDBMS databases (e.g. MySQL), app developers may start looking at simple NoSQL alternatives like CouchDB, which are instead optimized to store multiple indexes and views based on common read patterns.
QoS as a service - App developers will also move toward higher workload thresholds in their load-balancing algorithms, meaning that users may see web apps slowing down as new CPU instances are more sparingly fired up during times of greater demand. I expect app developers to start charging users for differentiated QoS in apps. After all, there is no free lunch, and lunch just got a whole lot more expensive.
Availability as a service - Does it make sense for an app to keep its back-end running (CPU instances online) in the dead of night because 0.1% of its users are insomniacs? I think that some web apps may start experimenting with breaking the unwritten rule of 24x7 web uptime, or at least charging more for the privilege of using these services outside waking hours. Of course, different time-zones complicate this idea.
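The space-for-CPU trade-off from the list above can be sketched in a few lines of Python. The `MaterializedView` class is hypothetical, and a plain dictionary stands in for whatever datastore the app would actually use.

```python
class MaterializedView:
    """Compute a value once, then serve it from (cheap) storage on later reads."""

    def __init__(self, compute):
        self.compute = compute
        self.store = {}          # stand-in for a datastore table
        self.compute_calls = 0   # how often CPU time was actually spent

    def get(self, key):
        if key not in self.store:
            self.compute_calls += 1
            self.store[key] = self.compute(key)
        return self.store[key]


view = MaterializedView(lambda n: sum(i * i for i in range(n)))
for _ in range(1000):
    view.get(10_000)     # only the first read pays the CPU cost
print("reads: 1000, computations:", view.compute_calls)
```

The price is extra storage and the need to invalidate stored results when the underlying data changes, which this sketch does not handle.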

In the end, money matters. It's going to be interesting to see the market affecting cloud-deployed web app architecture decisions of the future.

H.264 vs. WebM

2011-01-25T06:39:00.001+01:00

And so it begins. The battle between the H.264 and WebM video codecs.

Google's On2 acquisition and the subsequent open-sourcing of the VP8 video codec has created a formidable competitor for H.264. Formidable not because WebM is technically superior to H.264 but because now there is a free alternative to the proprietary and licensed H.264. WebM is free, underwritten by Google, and a proven web-video delivery veteran - after all, Adobe Flash has used On2's codecs for web video delivery over the years.

There are several things going for H.264. First, it is entrenched in several video delivery formats and standards. For example, Bluray uses H.264 to encode video. Millions of Bluray players will become obsolete if WebM is used instead of H.264. My two cents are that this won't really happen; instead, newer players will incorporate the possibility of decoding WebM video also. Even as I write this I am aware of several hardware manufacturers who are incorporating WebM video decoders into their ASIC hardware. But I am not assuming that things like the Bluray standard will be changed; on the contrary, there are other emerging media delivery and storage standards that have been frozen with H.264 being selected as the codec of choice. Standards take years to change or deploy and it's very unlikely that they can suddenly adopt WebM instead of H.264.

In the mid-term WebM will defeat H.264 where there is an (easily replaceable) software decoder and soft-media. By soft-media I mean video that is not burnt onto read-only media like Bluray disks but instead exists, say, in the form of a web-downloadable video on a server's hard-disk. The economic compulsion of having to pay the H.264 licensing body per video download and per decoder shipped, compared to the free (as in air) WebM alternative, shall edge out the former. I suspect web-video delivery platforms like YouTube will lead the charge because (1) the number of videos being downloaded is huge and (2) their average revenue per video is minuscule, and each WebM download instead of an H.264 download saves a few cents in licensing fees.

A wild card for the time-frame question will be the pace of innovation in H.264 vs. that in WebM. If open-sourcing WebM has the desired effect of creating a better and more innovative codec in the future then WebM could gain on H.264 faster. But I am sure that the H.264 camp won't be sitting on their hands all this while! Video codecs use advanced algorithms, and developing them needs big R&D investments. Will backers of WebM bring that kind of investment to the table in the interest of improving WebM when there is no direct revenue coming back to them?

Another thing that is going for WebM is the push toward virtualization in consumer electronics (away from the conventional ASIC approach) in the coming years. This means that future hardware (such as future Bluray players) may be capable of running multiple upgradable decoders rather than being tied to a specific ASIC implementing a specific decoding algorithm for a specific codec. That may just break the hardware dominance of H.264 over WebM. As a consumer I would prefer to hedge my bets and buy a virtualization-capable decoder rather than being tied into one video codec via an ASIC decoder.

Parallelizing & Multiprocessing Commands Using Python

2010-11-09T11:51:00.000+01:00

My computer has multiple processor cores. That means I could speed up scripts by running some of their tasks in parallel. I have written up a simple Python script that uses the multiprocessing library to take a list of jobs (each is a Unix command string) and then executes them on a specified number of independent processes. These processes are created only once and act as a pool of "workers" which undertake a job, submit the result of the computation, and then undertake another job (if available in the job queue). The script ends when there are no more jobs in the job queue.
This approach is useful when (1) you have a multi-processor/multicore CPU, (2) your tasks are CPU intensive, and (3) you are reasonably sure that the jobs are not internally parallelized to take advantage of multiple CPUs. In my case, I had two directories full of numerically-named image (.ppm) files whose PSNRs had to be compared using the pnmpsnr utility. Computing PSNR is a computationally intensive task. Running the comparisons serially (single process) was significantly slower than adopting a multiprocess approach.
The code below should get you started on parallelizing your computationally intensive script. You can download the script from here.

#! /usr/bin/env python
# Sachin Agarwal, Google, Twitter: sachinkagarwal, Web: http://sites.google.com/site/sachinkagarwal/ 
# November 2010
# Using Python to execute a bunch of job strings on multiple processors and print out the results of the jobs in the order they were listed in the job list (e.g. serially).
# Partly adapted from http://jeetworks.org/node/81


#These are needed by the multiprocessing scheduler
from multiprocessing import Queue
import multiprocessing
import commands
import sys

#These are specific to my jobs requirement
import os
import re
 
def RunCommand (fullCmd):
    try:
        return commands.getoutput(fullCmd)
    except:
        return "Error executing command %s" %(fullCmd)

        
class Worker(multiprocessing.Process):
 
    def __init__(self,
            work_queue,
            result_queue,
          ):
        # base class initialization
        multiprocessing.Process.__init__(self)
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.kill_received = False
 
    def run(self):
        while (not (self.kill_received)) and (self.work_queue.empty()==False):
            try:
                job = self.work_queue.get_nowait()
            except:
                break

            (jobid,runCmd) = job
            rtnVal = (jobid,RunCommand(runCmd))
            self.result_queue.put(rtnVal)

            
def execute(jobs, num_processes=2):
    # load up work queue
    work_queue = multiprocessing.Queue()
    for job in jobs:
        work_queue.put(job)
 
    # create a queue to pass to workers to store the results
    result_queue = multiprocessing.Queue()
 
    # spawn workers
    worker = []
    for i in range(num_processes):
        worker.append(Worker(work_queue, result_queue))
        worker[i].start()
    
    # collect the results from the queue
    results = []
    while len(results) < len(jobs): #Beware - if a job hangs, then the whole program will hang
        result = result_queue.get()
        results.append(result)
    results.sort() # The tuples in result are sorted according to the first element - the jobid
    return (results) 

 
#MAIN 
if __name__ == "__main__":
    
    import time #Code to measure time
    starttime = time.time() #Code to measure time
    
   
    jobs = [] #List of jobs strings to execute
    jobid = 0#Ordering of results in the results list returned

    #Code to generate my job strings. Generate your own, or load joblist into the jobs[] list from a text file
    lagFactor = 5
    ppmDir1 = sys.argv[1]
    ppmDir2 = sys.argv[2]
    decNumRe = re.compile(u"((\d+)\.(\d+))")
    ctr = 0
    for root, dirs, files in os.walk(ppmDir1):
        numFiles = len(files)
        files.sort()
        fNameLen = len(files[0].split('.')[0])
        for fName in files:
            for offset in range(0,lagFactor):
                fNumber = int(fName.split('.')[0])
                targetFile =  '%0*d' % (fNameLen, max(fNumber-offset,1))
                targetPath = ppmDir2+'/'+targetFile+'.ppm'
                origPath = ppmDir1+'/'+fName
                fullCmd = "pnmpsnr "+origPath+' '+targetPath  #Linux command to execute
                jobs.append((jobid,fullCmd)) # Append to joblist
                jobid = jobid+1

    # run
    numProcesses = 2
    if len(sys.argv) == 3:
        numProcesses = int(sys.argv[1])
    results = execute(jobs,numProcesses) #job list and number of worker processes
    
    #Code to print out results as needed by me. Change this to suit your own need
    # dump results
    ctr = 0
    for r in results:
        (jobid, cmdop) = r  
        if jobid % lagFactor == 0:
            print
            print jobid/lagFactor,
        print '\t',
         
        try:
            print cmdop.split()[10],
        except:
            print "Err",
        ctr = ctr+1
    print

    print "Time taken = %f" %(time.time()-starttime) #Code to measure time
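
For readers on Python 3, where the commands module no longer exists, here is a minimal sketch of the same idea. It is not the script above: it swaps the hand-rolled Worker class for concurrent.futures.ThreadPoolExecutor and subprocess.run. Threads are enough in this case because each job is an external command that the operating system runs in its own process. The names run_command and execute, the 60-second timeout, and the echo demo jobs are illustrative assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(job, timeout=60):
    """Run one (jobid, command string) pair; return (jobid, output)."""
    jobid, cmd = job
    try:
        # shell=True mirrors the original script, which handed whole
        # command strings to the shell. The timeout is an assumed value.
        done = subprocess.run(cmd, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return jobid, done.stdout.strip()
    except subprocess.TimeoutExpired:
        return jobid, "Error: timed out executing %s" % cmd


def execute(jobs, num_workers=2):
    """Run jobs on num_workers threads; results keep job-list order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() yields results in the order the jobs were submitted,
        # so no sorting by jobid is needed afterwards.
        return list(pool.map(run_command, jobs))


if __name__ == "__main__":
    demo_jobs = [(i, "echo job %d" % i) for i in range(6)]
    for jobid, output in execute(demo_jobs, num_workers=3):
        print(jobid, output)
```

The per-job timeout is meant to address the "if a job hangs" caveat in the script: a command that runs too long is reported as an error instead of being waited on indefinitely, although commands that spawn their own child processes may need extra care.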

My Grocery Store is a Mobile Operator

2010-10-31T10:40:00.002+01:00

My grocery store sells generic versions of bottled water, soap, breakfast cereal, butter, milk and mobile voice/Internet service. Now that's quite remarkable considering Rewe, the German grocery store chain I am alluding to, doesn't really have a history in the German telecommunications market. What they do have are 15445 stores across Europe that can stock up prepaid SIM cards branded "ja! Mobil" (the name comes from their generic in-store brand). Their physical presence and the mind space ja! occupies drive their business model. If shoppers can drink ja! branded generic cola then they could as well use ja!-branded mobile voice/Internet service.

The innovation here is the marketing possibility offered by Rewe grocery stores (instead of any technical innovation). Rewe has partnered with T-Mobile in Germany to implement its ja! branded "mobile operator". T-Mobile provides a white-label technical platform and Rewe simply brands it "ja! mobile". T-Mobile wins because it gets to sell its service at a discount to lower-paying market segments without putting off the premium T-Mobile customers, Rewe makes a neat profit by leveraging the ja! brand, and the customer wins by getting a discounted service from the best mobile operator of Germany, minus the T-Mobile brand.

I was looking at ja! mobile pricing. There are various flavors of pre-paid and flat-rate plans, although the focus seems to be on pre-paid plans that require no long-term contract and can be dispensed at Rewe's check-out counters. Depending on typical usage, a customer can pick a discounted subset of the services offered - SMS, MMS, in-network calling, fixed-line calls, data etc. Interestingly, customer support is not free. It's a little like the contemporary airline business where everything from customer service to carry-on baggage can become a chargeable add-on rather than part of the product. Customers need to be mindful of what their money is buying them before assuming that things like customer service or technical support are part of the product.

Brick-and-mortar stores also sell iTunes gift cards and Facebook credit nowadays. Dell and Amazon partner with Best Buy to sell computers and Kindle e-books respectively. There are interesting business opportunities for anyone who can funnel real customers and subscribers (read: money) into the virtual/communications world. Very real profits await those brick-and-mortar outfits who can build bridges between technology companies and customers, even if they are just plain-Jane grocery stores!

Fancy Vertical Handover: A victim of REST?

2010-10-13T10:00:00.000+02:00

There has been a ton of research, standardization work, and development around Vertical Handover - the ability to change the underlying network access without disturbing the overlying communication protocol (TCP or application) sessions. The simplest example is when a user moves from a Wifi zone (e.g. office) to a 3G zone (outdoors). A seamless handover hides the underlying rewiring of the access and lets the user continue using the device as if nothing changed. Vertical handovers have quickly graduated from laboratory quirk to mainstream occurrence, with Wifi-enabled smart-phones switching between access technologies multiple times daily.

But the vertical handover on my smart-phone doesn't really preserve the underlying TCP session and yet works pretty well. Why? Because most of the apps on my phone talk to their servers using HTTP-based request-response protocols: REST-style APIs, XML-RPC, or SOAP. That means they are, in theory, stateless: no request depends on the TCP connection that carried the previous one. In the simplest case, a TCP connection is created and torn down for every message exchange between the service server and the client. Sometimes TCP connections linger on to improve efficiency (carrying multiple request-response messages between the client and service server), but a discontinuity in the TCP connection is not catastrophic. I simply see my smart-phone negotiate a new connection with the new access (3G or Wifi) and then my app keeps working as if nothing has changed.
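
Why a broken connection is harmless for a stateless exchange can be sketched in a few lines of Python. This is a toy simulation rather than real networking code: the Link class, its send method, and stateless_request are hypothetical stand-ins, and a real client would open a TCP connection and speak HTTP where the comment indicates.

```python
class Link:
    """Hypothetical stand-in for one network access (e.g. Wifi or 3G)."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up

    def send(self, request):
        # A real client would open a fresh TCP connection here.
        if not self.up:
            raise ConnectionError("link %s is down" % self.name)
        return "response to %s via %s" % (request, self.name)


def stateless_request(request, links):
    """Try each available access in turn; every attempt starts from scratch.

    The request carries everything the server needs, so nothing has to be
    salvaged from the connection that failed.
    """
    for link in links:
        try:
            return link.send(request)
        except ConnectionError:
            continue  # connection lost: simply retry on the next access
    raise ConnectionError("no access network available")


if __name__ == "__main__":
    wifi = Link("Wifi", up=False)  # the user just walked out of the office
    threeg = Link("3G")
    print(stateless_request("GET /inbox", [wifi, threeg]))
```

The retry loop never needs to know what happened on the old connection, which is the whole point: the handover only has to provide a working access, not a preserved session.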

All that talk about preserving TCP connections across access technologies was much ado about nothing!

Mobile Video Calling: Can Tango Tango?

2010-10-06T18:53:00.001+02:00

Tango is a newly launched mobile-to-mobile video calling application for iPhone and Android devices. Tango enables smart-phone owners to see each other in addition to speaking with each other during a Voice Over IP (voip) conversation. Many smart-phones come with front facing cameras, ostensibly for video calling, and Tango enables people to use these cameras during a voip call. Think of mobile video calling when you want to see your expat pet doing silly tricks on video (or for beach and boardroom voyeurism).

But, as Walter Mossberg's review of Tango in the WSJ reports, the quality of Tango's video calls leaves a lot to be desired. I came across a video on Gizmodo's website showing Tango in action. The verdict is that Tango's performance is way below expectation. In fact, Tango's video frame-rate seemed to be approximately 1 frame per second in the Gizmodo video (hardly the "high quality video mobile calling service" that the company's press release claims).

Make no mistake, achieving even 1 frame-per-second video+voice is no small feat. Tango's engineers have packed a real-time video+voice encoder/decoder into a smart-phone and have managed to transmit/receive two parallel audio/video streams over Wifi (they also claim high quality video calls over 3G but let's not give Tango all the benefit of doubt :-) ). On top of this, achieving this for both the Android and iPhone platforms and for dozens of smart-phone models is admirable.

Frankly, I am not surprised by Tango's dismal video frame rate - resource bottlenecks such as smart-phone hardware, software/OS, network bandwidth and latency have to be overcome before an acceptable double digit frame-rate is achieved. But what surprised me was the poor voice quality: the Tango call sounded a lot like those cheap international calling cards I used to make international calls from the US many years ago. Terrible sound quality. I wonder why Tango engineers didn't trade away more video quality (or even cut out video entirely when resources were scarce) and spend the resources on improving voice? Voice over IP for mobile phones is a solved problem - Skype and umpteen mobile SIP voip clients got audio to work well even on older smart-phones. Why couldn't Tango?

Tango is an over-the-top application, meaning that it runs over the best-effort (ordinary) Internet. I mention this here because the alternative, 3G telecom-operator-supported video calling, uses a dedicated network channel to assure call quality. But a Tango call will be carried over the same pipes as plain web traffic, making the video/voice call quality dependent on what else is being transmitted during the call. Telecom-supported 3G video calling is also much more energy (battery) efficient than Tango. Why? Because in order to remain signed in to Tango to receive calls, the smart-phone has to periodically send "I-am-alive" messages to the Tango server. This means that a TCP or UDP socket is always active (or repeatedly created and torn down), effectively disabling the smart-phone's built-in power-saving sleep function. Of course, telecom-supported 3G video calling costs money, but it is technically superior to Tango or any other over-the-top mobile video calling system.

But this is not about Telecom vs. Internet applications. This is about the use-case. Video calling was touted as one of the big use-cases for 3G Telecom networks (and 4G too?). 3G standards support video calling and so there is hardware acceleration, network resource reservation, optimized audio/video codecs, and cross-phone/OS support for video calling on every modern smart-phone. But apart from the cost of making 3G video calls, is there something else that relegated video calling to its sad never-used status in phones? Yes there is. Video calling has simply not been accepted as a viable form of mass communication in our society, and remains, to date, a quirky add-on. When was the last time you placed a video call?

When Internet telephony (voip) arrived it quickly replaced circuit-switched calling. With mobile video calling, even if Tango can eventually fix its technical/engineering limitations, there is nothing to replace! Sadly, the mobile video calling use-case was still-born from the beginning.

Android Device Chatter with the Google Mother Ship

2010-09-01T09:16:00.205+02:00

Parts of this post were moved into a formal research study (click link below).

https://sites.google.com/site/sachinkagarwal/home/publications-talks/gis-2011-infocom-2011

Microsoft Kin: RIP == Social Networks:RIP (?)

2010-07-03T10:34:00.009+02:00

Microsoft Kin 04/2010-06/2010

Microsoft is phasing out its social-network/cloud storage-heavy Kin smart phone just 2 months after launch. This embarrassing report from CNN claims that the Microsoft+Verizon Kin sold fewer than 10000 units in the two months. RIP Kin.

I never got around to using the Kin, but apparently the market didn't see the justification for the expensive data plan (>=$29 p.m.) tagged on to the Kin by Verizon. The market was supposed to be teens looking to stay connected via social networks, but they did not bite into the insanely high data-plan tariff. Social networking, it seems, is not worth that much to them. How much is it worth anyway?

Let's not belittle the effort Microsoft put into this device - as a product the Kin was fully functional and seemed to do the things you would expect from this sort of device - Internet social networking, cloud storage and syncing of users' data, a built-in Zune player, sleek design, etc. And at under $100 (with a data plan) it had a low entry barrier too. It seems like all the pieces were there but the Kin machine never got off the ground.

I don't know if the lack of a credible app store spelt the end for the Kin. What I do know is that social networking apps completely failed to drive sales. Next time someone uses social networking as the use-case for a device or service that is supposed to make money, say - Kin!

Untitled Poem

2010-04-30T18:23:00.004+02:00

Among many other things, my father taught me how to read and write English. Everything I've ever written starts with what he taught me. Now as he lies dying of cancer, I wrote this for him. Say a prayer for him.

All of my thoughts
Like river drops
Together making up me
Like a river that flows
Until it throws
Fresh into the salty sea

All rivers meet that end
No matter what they pretend
Or how many bends they make
And so it will be
With every drop inside me
No matter what path I take

So you may ask
The point of the task
To meander toward the salty end
But don't we all know
Drops become vapor and snow
From which new rivers descend

Plug and play internal HDDs, literally!

2010-01-08T10:05:00.001+01:00

I just saw this contraption on a colleague's desk. As you can see, a 3.5" HDD is literally plugged into the dock as if it were some super-sized memory card. Well, that's exactly what it is. The dock also has ports for USB keys, SD cards, and probably a few other formats.

Interesting to see the form factor difference between the SD slot and the 3.5" HDD slot. Flash memory capacity is quickly catching up with HDD capacity (the latter's lead has shrunk to only about 10x). HDDs are an endangered species!

India's Broadband Future

2009-12-14T10:04:00.015+01:00

Ajit Balakrishnan, CEO, Rediff gave a keynote at IIT Delhi earlier today. His talk suggested that Indian telecommunication operators and the government should not be concentrating on delivering niche multi-Mbps broadband services but should instead concentrate on delivering reasonably good service (100s of kbps) to a larger population. Ajit flashed a slide which showed that 86% of 3G users use their smartphones to access their email, a relatively low bandwidth application, but only 6% use 3G to download and watch videos. Ajit's point was to recognize the importance of broadband as an "always on" connection rather than a high-bandwidth connection in India.

There is an analogue in India's history to this choice that Indian telecommunication operators and the government have to make. The government of India created top notch higher education institutes - IITs, RECs, and IIMs - in the 1950s (after Indian independence). It spends tens of thousands of dollars per year on each student enrolled in these institutes, arguably at the expense of thousands of primary education schools in backward areas of the country. The thinking at the time of creation of these institutes was that this creme de la creme would catalyze the growth of industry and technology in the country. Similarly, it may be theorized that by providing high-speed Internet connectivity, early adopters will drive applications and create demand in the general population to upgrade their connectivity.

Countries like China or South Korea concentrated on their primary education institutions rather than creating world-class higher education institutes. It is safe to say that both these countries are significantly ahead of India, measured via any human development index. But does this analogy suggest that India should concentrate on democratization of (relatively low speed) broadband rather than creating small pockets of high speed broadband?

I think that market forces will decide the balance between broadband services in India. The ARPU on low-speed broadband may not exceed $5, but this will be compensated via large volumes. I also believe that low-speed broadband will be served via wireless in India. With mobile phones outnumbering fixed line connections by a 12:1 ratio in the country, there is limited scope for technologies like DSL to be widely deployed. Fortunately, 3G, LTE and Wimax are nicely poised to fill in for the lack of fixed line infrastructure in India. As for the niche multi-Mbps broadband, I expect FTTX to be deployed in highly urbanized areas where western ARPUs (10s of dollars) are possible.

Thermal imaging cameras at Bangalore airport!

2009-12-06T08:38:00.006+01:00

Photo: Thermal imaging for Swine Flu screening at Bangalore International Airport

Arriving on an international flight at the Bangalore International airport, I was surprised to see two thermal imaging cameras. Each camera was looking at arriving passengers and visually marking those who had an elevated body temperature, in order to discern people who may be suffering from Swine Flu. These cameras are sensitive to IR heat radiation in the body temperature range. The cameras work by mapping temperature readings into a colormap that visually depicts body temperature. The video images produced by the cameras looked eerily similar to the IR images that the alien saw in the Predator movie series!
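
The colormap step can be sketched roughly as follows. This is only an illustration of the idea, not how the airport's cameras are actually implemented: the 30 to 40 degree Celsius display range, the 38 degree threshold, and both function names are assumptions made for the example.

```python
def temperature_to_rgb(temp_c, low=30.0, high=40.0):
    """Linearly map a temperature onto a blue-to-red colour.

    Readings are clamped to the assumed display range [low, high];
    cold pixels come out blue and hot pixels come out red.
    """
    clamped = min(max(temp_c, low), high)
    fraction = (clamped - low) / (high - low)
    red = int(round(255 * fraction))
    return (red, 0, 255 - red)


def flag_elevated(readings, threshold_c=38.0):
    """Return the indices of readings at or above the assumed threshold."""
    return [i for i, temp in enumerate(readings) if temp >= threshold_c]


if __name__ == "__main__":
    passengers = [36.5, 38.2, 37.0, 39.1]  # made-up readings in Celsius
    for temp in passengers:
        print(temp, temperature_to_rgb(temp))
    print("flagged:", flag_elevated(passengers))
```

A real screening system would also have to account for the difference between skin temperature and core body temperature, which this sketch ignores.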

As compared to conventional body temperature measurements via thermometers, this real-time technique makes it possible for a medical officer to screen many more people. I wonder why these systems are not installed in other world airports.

Multiprocessing vs. Network I/O

2009-11-24T04:38:00.006+01:00

I've been reading up on Python's (v2.6 and above) multiprocessing module. While multiprocessing has been around for a long time, simplified libraries like this multiprocessing module may spur even casual programmers to consider parallelism in their programs. My feeling is that if issues like inter-process communication, synchronization among processes, and deadlock avoidance are dealt with painlessly, then many non-professional programmers would feel confident enough to load up CPUs with programs with multiple processes to speed things up. Moreover, given that multiple CPU cores are becoming the norm rather than the exception on commodity hardware, there is a real incentive to eventually switch to multiprocessing.

What will this switch in program design mean for network data I/O? Will average users end up opening and using more network connections? Web browser tabs are a good example of multiple threads or processes. When modern browsers fire up they often connect to several websites saved from the previous session. I conjecture that multiple tabs fill up the network's queue faster than was possible with single core CPUs. Although network I/O is much slower than CPU bandwidth (the data rate at which CPUs process, say, HTML), there is a point beyond which a single core CPU becomes the bottleneck (e.g. firing up a dozen browser tabs). But multiple cores remove this limitation and drive network I/O to its physical (or traffic-shaped) limits. I plan to measure this interplay between multiprocessing and network I/O. Watch this space!

Free airport Wifi as a marketing tool

2009-11-12T15:05:00.013+01:00

Google is offering free Wifi in 47 US airports during the holiday season. The idea is to flash a few web pages marketing Google's software and services to users in return for free Wifi service. According to this CNN article, Google is not the only company to do so - apparently Lexus and Ebay have also implemented similar ideas, or intend to do so in the near future.

Free service is probably going to bring a torrent of airport Wifi users online - probably many more than the current number of (paying) users. Given that Wifi Internet channel space is a shared resource, it will be interesting to see how airport Wifi scales with the up-tick in usage. I just hope that the service doesn't deteriorate so much that the sponsoring companies' well-meaning message is lost to disgruntled users. And I do hope that engineers running these Wifi access points have done the network provisioning math beforehand.

Now the economics. The sponsoring company (Google) is probably going to pay a lot less than the retail price of airport Wifi connectivity. Why? Because the sheer volume of users will be much higher than when users have to pay individually. I think that the payment will include a fixed component depending on the number of access points participating in the service, and a variable component depending on the number of users accessing the service.

Let's assume that an average airport has about 20 accessible Wifi access points. Each access point can support (with any reasonable quality of service) about 10 concurrent users. If the airport is busy for, say, 12 hours in a day, and we further assume an average utilization of 50% of the total capacity of the access points, then we have (per day)

10 * 20 * 12 * 0.5 = 1200 hours of usage per day per airport.

I would assume that the sponsoring company (Google) would pay about $5000 per day as a fixed cost and then about $1 per hour of usage. This brings the daily total cost per airport for the sponsoring company to $5000+$1200 = $6200.

So for 47 airports and 50 holiday season days, we are looking at a bill of about

6200 * 47 * 50 = $14.57m
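
The cost estimate is easy to replay with different guesses. The sketch below only repeats the arithmetic above; every default value is one of the assumptions already stated in this post, and the function name sponsorship_cost is made up for the example.

```python
def sponsorship_cost(access_points=20, users_per_ap=10, busy_hours=12,
                     utilization=0.5, fixed_per_day=5000, per_usage_hour=1,
                     airports=47, days=50):
    """Return (usage_hours, daily_cost, total_cost) for the estimate.

    usage_hours and daily_cost are per airport per day; total_cost covers
    all airports for the whole season. All defaults are guesses.
    """
    usage_hours = users_per_ap * access_points * busy_hours * utilization
    daily_cost = fixed_per_day + per_usage_hour * usage_hours
    total_cost = daily_cost * airports * days
    return usage_hours, daily_cost, total_cost


if __name__ == "__main__":
    hours, daily, total = sponsorship_cost()
    print("usage-hours per airport per day: %.0f" % hours)
    print("daily cost per airport: $%.0f" % daily)
    print("season total: $%.2fm" % (total / 1e6))
```

With the default guesses this reproduces the 1200 hours, $6200 and $14.57m figures; changing, say, utilization or the fixed fee shows how sensitive the bill is to each assumption.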

That's not a bad deal for a big company like Google, considering the number of eyeballs they will capture. Let's say a user uses the free Wifi for 30 minutes on average. At the 1200 usage-hours per day estimated above, we are looking at about 1200/(1/2) = 2400 users per airport, per day. That works out to about 5.6m user sessions in the 47 airports over the 50-day holiday period. Even if we assume that most people make round trips and therefore use the Wifi connection 2 times, Google can still reach about 2.8 million unique users! Not too bad for the $15 million spent.

And I haven't even started counting the goodwill ROI bonus for playing Santa during holiday season! Nifty nifty marketing.

Call for action! Powering down PCs

2009-11-06T13:51:00.010+01:00

I've been playing with the idea of building a PC application that measures a computer's idle time. The idea is to gently convince users to suspend or power-down their PCs when they are not being utilised. I strongly believe that if PCs are optimally powered down, then many users could cut energy consumption (and hence also save on energy bills). Powering down battery-powered laptops will also increase the longevity of batteries and thereby decrease toxic battery waste in landfills.

As an example of where the possible savings may be, above is a pie chart showing my own PC usage over the past few working days. As you can see, there is ample scope to power down/suspend PCs when they are idle.
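
A first cut at the measurement itself might look like the sketch below. It assumes the application can log a timestamp (in seconds) for every keyboard or mouse event; capturing those events is platform-specific and left out. The function name idle_seconds and the 5-minute threshold are made up for illustration.

```python
def idle_seconds(event_times, idle_after=300):
    """Total idle time, given sorted input-event timestamps in seconds.

    A gap between two consecutive events counts as idle only for the part
    that exceeds idle_after seconds (an assumed threshold), since a short
    pause does not justify suspending the machine.
    """
    idle = 0
    for previous, current in zip(event_times, event_times[1:]):
        gap = current - previous
        if gap > idle_after:
            idle += gap - idle_after
    return idle


if __name__ == "__main__":
    # Made-up log: three events, an hour-long break, then two more events.
    events = [0, 60, 120, 3720, 3780]
    idle = idle_seconds(events)
    span = events[-1] - events[0]
    print("idle for %d of %d seconds" % (idle, span))
```

In this made-up log the hour-long break contributes 3300 idle seconds out of a 3780-second session, which is exactly the kind of stretch where a nudge to suspend the PC would pay off.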

If you want to contribute time to this project (coding/web page/translation into other languages/spreading the word), feel free to contact me. If not, then do suspend your PC every time you are away for more than a few minutes :-).