Tuesday, November 19, 2013

Dynamo, Swift Objectstore, Cassandra - Part 2: Openstack Swift Objectstore


This is a 3 part series where I discuss the Swift Objectstore and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.

This post is part 2 of the series that looks at the Openstack Swift Objectstore. If you haven't read part 1 of the series - which describes the Amazon Dynamo data store - then you may want to first read that.


OVERVIEW


The Openstack Swift Objectstore is a distributed file storage system. The open-source Swift software creates a content-addressable storage cluster on top of multiple (usually off-the-shelf x86) boxes full of storage disks. This content-addressable storage provides a uniform namespace for all objects stored in a Swift cluster. Each data object is represented by a URL of the form

http://swift-cluster-lb.address/account/container/object

By clustering multiple off-the-shelf boxes and replicating data across them, Swift can achieve almost limitless scalability and high data durability by spreading data replicas across independent failure domains (different disks, servers, racks, or even data centers). The account, container and object abstractions of Openstack Swift are analogous to volume, directory and file in conventional file systems. Clients can issue Create, Read, Update and Delete (CRUD) requests on a data object by passing different verbs in HTTP requests made against the URL.

The Swift objectstore uses Dynamo's consistent hashing idea to set up partitions across the boxes. Usually many hundreds to a few thousand partitions, each corresponding to a range of the ring's hash space, are mapped to each storage node. Each partition is replicated multiple times (N=3 by default). Swift guarantees that partition replicas are kept in failure domains as distinct as the hardware allows (e.g. different disks if the whole Swift cluster is on one server, different servers if the cluster is a single rack, or different racks for multi-rack clusters). This strategy ensures a high probability of data availability when hardware fails, because hardware faults are usually localized. For example, if a top-of-rack switch fails, a multi-rack Openstack Swift cluster can still serve any stored object because replicas exist on other racks.

Openstack Swift consists of multiple types of processes, usually running (in multiple copies) on different physical servers. Openstack Swift clients usually send CRUD requests to an HTTP(S) load balancer, which in turn distributes these requests among a pool of proxy processes. All proxy processes have complete information about how partitions are physically mapped to the boxes and to the OS partitions on disks within these boxes, so they can direct each incoming request (based on its URL and HTTP verb) to the appropriate storage processes. In this model all data passes through the load balancer(s) and proxies, and no storage node is directly accessible to clients. Openstack Swift uses N=3, W=2 and R=1 by default (see the earlier post on Dynamo for the meaning of these variables). Therefore writes are acknowledged after the incoming object has been successfully written on two separate partitions, and reads are served by the storage server that is fastest to respond to the proxy server with the requested data.

Like Dynamo, the Swift Objectstore provides eventual consistency. Swift adopts a proactive algorithm to check consistency between replicas of partitions. Storage nodes periodically compare their partitions (which are directories on the file system) and rsync any differences between the partition replicas. The period between these comparisons is user-configurable and set to 30 seconds by default. One notable characteristic of Openstack Swift is that replication happens at the file level, not the block level (as in traditional RAID systems). So the time to "rebuild" a broken drive is proportional to the amount of data stored on the drive, not the total capacity of the drive.

Swift also keeps account and container information separately in SQLite databases for metadata operations (such as listing the objects in a container). These SQLite databases are likewise checked for consistency via periodic comparisons with their replicas and synchronized based on timestamps if there are differences between replicas.

MECHANICS


It's worthwhile understanding how the CRUD operations and the background replication and scrub processes translate to low-level disk operations on the storage nodes. Understanding this aspect of Swift opens the door to understanding performance under ideal and less-than-ideal workloads, the strengths and limitations of Swift, and the impact of hardware choices on system performance. Fortunately the Openstack Swift code is very well documented, well factored, and professionally maintained, making it an excellent resource for understanding its inner workings.

Create


A create request requires Swift to store an object (a binary blob of data) along with some metadata associated with the object. The proxy node determines the (N) partitions where the object is to be stored and forwards the request to those storage nodes. Each storage node first does a lookup within the target partition to confirm that an object with identical account/container/object hash does not already exist. If not, then a directory is created within the partition to store the object. The object is stored as a binary file with a .data extension. Metadata associated with the object is also stored within the inode or within .meta files in the same directory. For more details, refer to the Object server and diskfile source code files.


Read


A read request is forwarded by the proxy server to all N storage servers containing the partition in which the object is stored. Each of these storage nodes checks within the appropriate partition whether the directory containing the object exists. If it does, the object's directory is checked for .ts files (a .ts or tombstone file indicates that the object is deleted and that a 404 Not Found response should be returned to the client). The directory is also checked for .meta files in case additional metadata associated with the object is available. Finally, the .data file containing the object, together with the corresponding metadata read from the XFS inode metadata and any .meta files, is composed into the HTTP response sent back to the proxy node and the client. Recall that Openstack Swift returns the first successfully read object to the client from among the storage servers. For more details, refer to the Object server and diskfile source code files.


Delete


Deletes are asynchronous in Openstack Swift. A .ts (tombstone) file is created in the object's folder to indicate that the object has been deleted, and the container SQLite database is updated. A background process then removes the deleted object's data at a later time. For more details, refer to the Object server and diskfile source code files.


Object Replication


Replicating objects is necessary to deliver Openstack Swift's eventual consistency guarantee. The object replicator on each storage server compares its partitions with the other (N-1) replica partitions in the background. For each partition, the replicator process asks the other storage nodes holding that partition for Merkle tree hashes of the objects in their partition directory. This data structure allows quick identification of objects that differ between the partition replicas. rsync is then used to synchronize the replicas. For more details refer to the object replicator source code.


Scrub


Objects are periodically scrubbed to check for bit rot. Swift implements a periodic disk scrub in the background by computing checksums of stored file objects and comparing them with stored checksums. This process identifies any data corruption (due to disk bit rot, for example) and addresses the errors by creating more copies from the other replicas of the data object. The metadata stored for each object contains the (MD5) hash of the object's data. The auditor process(es) running on each storage node cycle through all partition directories containing objects, compute the MD5 hash of each object and compare it to the stored checksum. A mismatch indicates corrupted object data; the corrupted copy is quarantined, and a subsequent replication run restores the object's data from the other replicas. For more details refer to the object auditor source code.


Other Goodies


There are several other convenience features built into Openstack Swift, most of which go beyond the basic Dynamo design. For example, there is the ability to specify object versions in create requests (which essentially results in different objects being created for different versions). Time-bound objects, which are automatically deleted after a certain interval, are also provided (as described in expirer.py). Extremely large objects (over 5 GB) are internally divided and stored as smaller objects. The Openstack Swift implementation is based on modular WSGI pipelines, which allow pipeline-style addition and removal of custom components while processing Swift requests and responses. For example, an Amazon S3 object interface can be enabled by installing an additional component into the processing pipeline of the proxy nodes.

Swift also provides automatic disaster recovery features by giving operators the ability to asynchronously replicate objects across remote data centers. Read and write affinity features (as described in the proxy server source code file server.py) ensure that, where possible, data is read from and written to the data centers nearest to clients.

Corner Cases


There are a few corner cases where Openstack Swift may not yield great performance. These are worth discussing here.


Very Small files


One side-effect of using hash-based rings to place data on any of the hundreds or thousands of partitions on a storage node is that consecutive write operations on storage nodes are neither spatially nor temporally correlated: one write operation will most likely land in a different partition (directory) than the previous write. This poses a challenge when Swift is used to store many small files, because caching the XFS inodes and dentries becomes ineffective. To appreciate the issue, consider this example of the directory layout on a storage node

Example: An Openstack swift object storage directory with an object directory containing the object's .data file

/srv/1/node/sdb1/objects/717/89f/b359508efc4d39b0d22efde69fc7f89f/1382112651.23154.data

Breaking down the directory paths below:


  • /srv/1/node/sdb1/objects: This is the object directory where all objects stored on the sdb1 device are kept.
  • /717: This is the partition.
  • /89f: This is the hash suffix of the object (the last three characters of the name hash).
  • /b359508efc4d39b0d22efde69fc7f89f: This is the directory whose name is the name hash of the object.
  • 1382112651.23154.data: This is the actual data file containing the object.

Each time an object is written to this storage node, the inode and dentry caches need to access a random entry below the partition level. The only practical way to ensure fast inode metadata operations is to make sure that memory can fit the whole inode and dentry cache. Also consider that storage nodes usually contain multiple large-capacity disks, each containing an XFS filesystem with its associated inodes and dentries. All these caches should ideally fit in memory for fast metadata operations! Given that a storage node with tens of TB of storage capacity can store hundreds of millions of small objects (say of 10-100 kB size), the memory requirement of each storage node becomes quite large if the inode and dentry caches are to be held fully in memory. There is a good discussion of this issue here.

Note that this issue with small files is not unique to Openstack Swift. For example Ceph, another distributed file system that can be used as an objectstore, stores each individual object as a file on the underlying filesystem (usually XFS in production Ceph systems). Many small files stored as objects in Ceph may cause similar issues.


Very Large files


Read and write speeds for extremely large files (e.g. several GB) are limited by single-spindle speeds in Openstack Swift because objects are not striped across disk spindles. These "elephants" may also slow down read and write operations for other objects stored on the same partitions, since all the spindles across all the storage nodes that store the elephants will be busy at the same time serving the elephant request. However, the randomization in the partition-to-object mapping makes such situations rare, especially if an adequate number of spindles and partitions is provisioned in the Openstack Swift deployment.


OUTLOOK


Hopefully this article has supplemented your knowledge of Openstack Swift and encouraged you to look at the (very accessible) Swift source code to find exact answers to any questions you may have about it. In addition, Openstack Swift is remarkably well documented. Together, the source code and documentation unambiguously answer almost any question about how Openstack Swift works.

In the next (and last) part of this series we'll look at Cassandra, another very popular data store based on the ideas of Dynamo.

Wednesday, October 30, 2013

Dynamo, Swift Objectstore, Cassandra - Part 1: Dynamo review

This is a 3 part series where I discuss the Swift Objectstore and Cassandra in the context of the original idea that inspired them both: Amazon's Dynamo.


PART 1: Amazon Dynamo review

Dynamo was described in a 2007 paper from Amazon. Dynamo is a distributed data store that Amazon developed to service its database and storage use cases at the time. Yes, both database and storage needs, because Dynamo can be adapted to store data in ways that are more like databases or more like distributed file systems. It achieves this by taking a pre-existing "persistence engine", such as Berkeley DB, MySQL, or a filesystem, and adding distributed-system sauce to make these services scale horizontally across hundreds of cheap servers in an SLA-bound manner with respect to latencies at the 99.9th or higher percentile. This is absolutely remarkable because Dynamo transforms mature but non-distributed data stores (like Berkeley DB) to scale horizontally, almost limitlessly.

Dynamo provides redundancy by storing multiple copies (N) of the data on different physical servers, possibly located in independent fault domains. The fault-domain requirement keeps the N copies in different Amazon data centers, effectively making the probabilities of losing the N copies independent of each other and practically minuscule. Read and write load is distributed across multiple servers by randomizing which servers are responsible for which data. Then, given the reasonable assumption that the number of frequently accessed data objects far exceeds the number of physical servers, load is nicely balanced across the physical servers.

Dynamo achieves these properties by making two tradeoffs compared to more traditional data stores such as RDBMS databases. The first tradeoff is that reading data off Dynamo may not yield the most recent write to that data. This is the consistency corner of the famous (C)onsistency, (A)vailability and (P)artition tolerance triangle of the CAP theorem in distributed computing. Applications using a Dynamo data store need to be intelligent enough to detect and deal with inconsistent copies that may yield a stale version of a data object.

The second tradeoff Dynamo makes is limiting the richness of the data schema to a simple key-value store. Therefore, unlike traditional RDBMS databases, the data store implements content-addressable storage via a simple key-value model. Applications can create a key-value pair (a data object insert, or "put", operation) and read the value corresponding to a key (a "get" operation). When putting an object, Dynamo provides versioning support via implicit metadata passed during the put operation, which serves as a means of conflict resolution between different versions of the same data object. Deletes are handled by inserting a tombstone corresponding to the key that needs to be deleted.

Consistency is eventual: if an object is left unmodified for a (finite) amount of time, then all N copies of the object will become identical. Moreover, it is guaranteed that the latest update of the object becomes available after a finite amount of time at all physical locations hosting the N copies, so a subsequent read from any of those locations will yield the latest version of the data. The exact time required for the system to enter this state depends on the number of failures, network conditions, and user-defined parameters that control how aggressively the background replication of objects happens.

Dynamo uses Merkle trees to reconcile data quickly and scalably in the background. The key benefit of Merkle trees is that the data transferred for reconciliation is proportional to the number of differences and not to the total number of data objects being checked for consistency via reconciliation.

The paper describes the interplay between the number of copies (N), the number of copies read (R) before returning a client's read request, and the number of copies successfully written (W) before acknowledging the client's write request. When N < R + W, strong consistency is guaranteed (since the set of servers where the W writes and the R subsequent reads are performed cannot have an empty intersection). This assumes that the read happens after the write is acknowledged. If instead two independent clients were to write and read a data object respectively, with the read request hitting the system before the other client's write request is acknowledged (a classic asynchronous and independent data access pattern), then there is no such guarantee. The authors note that N=3, R=2 and W=2 are used for several Amazon services.

The most interesting part of Dynamo is the partitioning scheme that dictates how data is dispersed among backend resources (physical servers). Consistent hashing is used to divide up the keyspace (the 128-bit MD5 hash space of the keys in key-value insertions). The consistent hashing approach guarantees that when a physical server is added to or removed from a Dynamo cluster of M nodes, the total data moved is roughly a 1/M fraction of the data stored in the cluster. Since the number of physical servers is limited, each is further divided into hundreds of virtual nodes that are randomly mapped onto the consistent hash ring. This enables faster recovery from failure conditions (disk and node malfunctions) and distributes the replication load during recovery across all servers in the Dynamo cluster.

There are several other handy features, for example hinted handoffs for maintaining the correct replica count even when the preferred physical server for which the data is destined is transiently down. Write speedups using ideas similar to commit logs, via the notion of an "in-memory writer" with at least one persistent copy, are also described. There are also some SLA-related performance graphs which show the remarkable availability and bounded-latency properties of Dynamo in production settings. I highly recommend reading the paper on your own if you have gotten this far in reading this write-up.

In the next part of this series I will dive into Openstack Swift, a Dynamo-inspired file objectstore. I will analyze the design decisions particular to Swift, what is unique and different in each system, and where (perhaps) Swift could still improve by standing on the shoulders of the granddaddy of modern distributed systems, Dynamo.

And after that, we’ll repeat the above for another Dynamo-inspired data store - Cassandra. Stay tuned.

Monday, September 23, 2013

Understanding Storage IOPs

Input/output operations per second, commonly termed IOPs, are an important performance measure of a data storage system. The unit of IOPs is operations per second. Input operations (when data is written to the storage) are different from output operations (when data is read off the storage), and their corresponding IOP numbers differ too. Therefore it is important to know both the inputs-per-second and the outputs-per-second performance of a storage system. These numbers are not independent of each other: reading off the storage system while writing to it may affect the performance of both operations. Therefore IOPs are often quoted as a weighted average of the inputs and outputs per second. For example, a storage system may be capable of performing 100,000 IOPs with 70% reads and 30% writes, with the operations being performed concurrently. The storage engineer needs to validate the concurrent input/output requirements of the application using the storage system to ascertain whether they can be met.

But IOPs by themselves don’t paint the whole picture of storage performance. Here are the questions that lead to the complete picture:

  1. What is the size of the read/written data block while measuring IOPs?
  2. What is the end-to-end latency seen by the application in reading or writing the data block?
  3. With regards to writes, are the reported IOP numbers for synchronous writes or asynchronous writes?
  4. With regards to reads, what is the role of cache?

Large data blocks take longer to write to or read from storage, so IOP numbers for 4 kB blocks will be very different from IOP numbers for 1 MB blocks. The most relevant block size for an IOP measure corresponds to the sizes of the blocks the application actually writes to the storage system. Knowing the IO profile of the application is key to choosing an appropriate storage system for it.

Latency is a key IOP qualifier. Storage system latency is the time from when the application issues an IO request to when the request is completed: the read data is delivered to the application, or the storage system acknowledges that the data block has been written. The important question with respect to writes is when the storage system acknowledges that the data block has been persisted on non-volatile storage. For some applications writes may be asynchronous: they are acknowledged before they have been persisted on non-volatile storage. Since asynchronous IOP and latency numbers look better (higher IOPs, lower latency), promotional storage-system material often quotes asynchronous write IOPs.

Some systems have battery-backed non-volatile RAM that allows the acknowledgement to be sent to the application as soon as the data block is written to RAM, which is usually orders of magnitude faster than storage media like SSD or disk. The question then becomes: what is the size of this non-volatile RAM that holds data blocks before they need to be persisted on the slower storage media? While some applications have bursty write profiles that play nicely with this (limited) non-volatile RAM, applications that require sustained write performance may not benefit much from such methods.

Similarly, RAM can be used to cache data for reads: read latency decreases when cache hits serve data blocks from RAM instead of reading them off slower storage media. The size of the RAM cache, as well as the application’s read patterns (is some data read more often than other data?), are important considerations when working with caches.

The key to having an intelligent conversation about IOPs is to know your application and to seek definitive answers about latency, data block sizes, synchronous/asynchronous assumptions and caches.

Saturday, July 7, 2012

Beautiful Mergesort

Effective data structures and programming are about putting simple blocks of logic together until the whole is larger than the sum of the parts. A great example is Mergesort, a beautiful algorithm that sorts in O(N log N) time (N being the number of elements), which is the asymptotic lower bound for comparison-based sorting.


Mergesort brings together the concepts of recursion, divide-and-conquer, arrays, and pointers (when implemented in C). So here's to John von Neumann and his fantastic invention of Mergesort.


#include <stdio.h>
/*
  Simple Mergesort implementation by Sachin Agarwal
  sachinkagarwal@gmail.com
*/
void PrintList(int list[], int stIndex, int endIndex)
/*
  Helper function - Print the elements of array list, between stIndex and endIndex
*/
{
  for(; stIndex <= endIndex;stIndex++)
    printf("%d ",list[stIndex]);
}

void Merge (int dataArray[], int stIndexL, int endIndexL, int stIndexR, int endIndexR) 
/*
  Merge two lists into a tempArray in sorted order; this tempArray is copied back to dataArray. 
*/
{
  int tempArray[endIndexL-stIndexL+1+endIndexR-stIndexR+1]; //temp storage
  int stL = stIndexL;
  int resultCtr = 0;
  while (stIndexL <= endIndexL) {
    if (stIndexR>endIndexR) {
      tempArray[resultCtr++]=dataArray[stIndexL++];
      continue;
    }
    if (dataArray[stIndexL] <= dataArray[stIndexR]) {
      tempArray[resultCtr++] = dataArray[stIndexL++];
    }
    else {
      tempArray[resultCtr++] = dataArray[stIndexR++];
    }
  }
  while (stIndexR <= endIndexR) {
    tempArray[resultCtr++] = dataArray[stIndexR++];
  }
  int ii = stL;
  for (; ii<=endIndexR;ii++)
    dataArray[ii] = tempArray[ii-stL];
}

void Mergesort(int dataArray[], int stIndex,int endIndex) 
/*
  Mergesort routine, divide into two halves, and then merge.
*/
{
  int midIndex = stIndex + (endIndex-stIndex)/2;
  if(stIndex<endIndex) {
    Mergesort(dataArray,stIndex,midIndex);
    Mergesort(dataArray,midIndex+1,endIndex);
    Merge(dataArray,stIndex,midIndex,midIndex+1,endIndex);
  }    
}

int main() {
  int dataArray [] = {5,3,7,8,1,6,9,2,0};
  int stIndex = 0;
  int endIndex = sizeof(dataArray)/sizeof(int) - 1;
  printf("\nUNSORTED DATA ARRAY: ");
  PrintList(dataArray,stIndex,endIndex);

  Mergesort(dataArray,stIndex,endIndex);
  printf("\nFINAL RESULT: ");
  PrintList(dataArray,stIndex,endIndex);
}

Wednesday, May 23, 2012

Kernighan & Ritchie Nostalgia - circular right shifting in C

I picked up my old Kernighan and Ritchie copy last weekend and came across the C puzzle of doing circular right shifts on unsigned integer bits.

So for example if
x = unsigned int 1 = binary 00000000000000000000000000000001

The function

unsigned int rightrot(unsigned int x, unsigned int n)

called as rightrot(x, 29) will right-shift the bits 29 times, making sure that the bits wrap around.

Here is the code I came up with. Can you think of clever ways to improve the code?


#include <stdio.h>

/* Circularly right-rotate `data` by `rotCount` bit positions. */
unsigned int rightrot(unsigned int data, unsigned int rotCount)
{
  unsigned int ctr;
  rotCount %= sizeof(unsigned int) * 8;  /* rotating by the word width is a no-op */
  for (ctr = 0; ctr < rotCount; ctr++) {
    unsigned int lsb = data & 1;         /* save the bit that falls off the right */
    data >>= 1;
    data |= lsb << (sizeof(unsigned int) * 8 - 1);  /* wrap it around to the top */
  }
  return data;
}

int main() {
  unsigned int d = 1;
  unsigned int shift = 31;
  printf("%u right shifted and rotated %u times is %u\n", d, shift, rightrot(d, shift));
  return 0;
}

Tuesday, May 15, 2012

How I fixed my flat LCD TV

I have a Samsung 40 inch LCD TV. A little over two years old and with all the works - full HD, USB, multiple HDMI inputs, etc.

Then 3 weeks ago my wife heard 2 pops. Both my satellite receiver (a separate box) and my Samsung LCD had been fried, thanks to a high voltage spike in the antenna cable. My power surge protector strip was powerless (no pun intended) against this, because that high voltage signal traveled into my satellite receiver through the antenna cable, and from there on through the HDMI cable into my TV.

No indicator power/standby light, a non-responsive TV. Oh no. Checkmate.

The satellite receiver was easy to fix - it was immediately replaced by the store where I had bought it (still under warranty). That is why, sometimes, paying a few bucks extra instead of buying things from the Internet pays off later.

Anyway, the elephant in the room was the broken TV. Since it was out of warranty, I didn't even bother calling Samsung - they'd take 100+ euros just to look at it, and their service center is too far from where I live anyway. The local TV repair shop put the bill at a minimum of 280 euros - that's almost half the cost of the TV itself. Out of the question. What else could I do?

Down but not out, I started looking into fixing it on my own. Now that is risky. Why?

  1. I have no training in fixing TVs (although I am a computer and electrical engineer)
  2. TVs insides can be very dangerous (You have been warned. Don't mess with it unless you really know what you are doing)
  3. Where and how to start?
  4. How would I figure out what is broken?
On the other hand, what had I got to lose anyway? It was a paperweight if I couldn't do something about it.

So I got the Phillips #2 screwdriver and took the plunge. Almost a dozen screws later, the plastic back was off and I peered into the insides of my beloved LCD TV. The first thought that came to me was: this looks a lot like the insides of a modern desktop computer. Not at all like those old CRT-display TVs full of myriad electronic components and circuits. And I do know how to build and fix desktops. Why should this be so different?

Inside the LCD TV


Back side of the LCD TV, after removing the controller board. The top part (side of the TV) is the power board. The connectors from the power board to the controller board, from the controller board to the LCD screen, and from the controller board to the speakers/front panel are also visible.

There are 3 main components: the screen itself, the power board (it's got lots of big inductors and capacitors on it - you can't miss it), and the controller board. In addition, there are small pilot circuits that run the remote control, front panel lights, speakers etc. In my case, I immediately noticed the problem: a burned-out and blackened HDMI port on the controller board. So that is where the disastrous pop sound came from.


Close-up of the controller board. This is the CPU of the TV. Notice the black soot from the burn-out next to the HDMI port in the center of the picture. A careful visual inspection of the power board and the controller board can often reveal the source of the problem.

I quickly checked that DC voltages were being delivered by the cable coming out of the power board, to be confident in my thesis that the controller board was the most likely culprit. I couldn't see any easy way of fixing the controller board. The parts are minuscule, and this is a double-sided printed circuit board with parts on both sides. Too intricate for human hands to manipulate. Besides, how would I ever figure out which of those chips had been fried?

The only option was to replace the whole board. Thankfully, other people seem to break their LCD TVs in creative ways that leave their controller boards intact to be sold as parts - for example, Nintendo Wii controllers are notorious for breaking LCD screens. And sure enough, there are Internet vendors who have made a flourishing business out of selling parts pulled from broken TVs. I found a vendor in the UK who stocked the controller board for my TV.

Now the trick is getting the exact same part as the one you are replacing. For this you need the part number. Note that the same controller/power boards are used across multiple Samsung TVs, so searching for parts based solely on the TV model number is sub-optimal. In the case of my controller board, I found the part number printed on the board itself (for my TV the controller board was BN41-01167C-MP1). Pasting this part number into Google yielded several sellers who offer it. I chose FlatTVParts.co.uk for the favorable customer reviews and testimonials. They ship worldwide, and I was not disappointed by their service. The part is provided with the guarantee that it is in working condition, and I believe there is a small return period as well (I hope I don't have to avail of it!). Total cost, including international shipping, was 60 euros.

So I screwed in the new controller board and plugged in the connectors (they are just like computer connectors). I made sure to put the back plastic panel back on before testing (caution - high voltages).


Then I turned it on and waited. The standby light lazily turned a beautiful red. I pressed the power button and tuned my satellite receiver. My 8-month-old daughter let out a squeal when the Kika channel came up. I was home.


Sunday, April 1, 2012

Devstack - Treading Lightly into the Openstack World

Devstack is a simple way to test drive Openstack before committing too much time to setting up a production-ready Openstack installation. It allows you to set up a toy Openstack installation in as little as a single virtual machine running on your PC. Then you can start and stop virtual machines (Nova), store objects (Swift), play with Openstack networking, manage your own images (Glance), and monitor your personal cloud via the Horizon graphical interface in a web browser.

Devstack works by executing a bash script that installs all the software dependencies and then sets up and runs the Openstack services. Its primary use case is Openstack development - if you are interested in tweaking Openstack's software, Devstack flattens the steep learning curve of setting up a test environment. However, it is also an excellent "try before you dive in" option. Compared to the free Amazon AWS trial, you gain first-hand experience of the entire IaaS cloud ecosystem - from setting it up to provisioning cloud resources for your applications.

Here are the steps to run Devstack:

  • Install VMware or another hypervisor of your choice (e.g. Virtualbox) on your computer. You will probably need a powerful computer with lots of RAM, because we are going to run a virtual machine on this computer and then run the Openstack software inside this VM - which in turn will spawn virtual machines inside the outer virtual machine! I recommend at least a dual-core processor with 4GB of RAM.
  • Download the Ubuntu 11.10 ISO from here. You probably don't need the desktop version; the server version should be fine. Of course, this assumes that you will run the Openstack Horizon client in a web browser on the host operating system (otherwise you need to run the web browser in the VM, which means you need X). So make sure you have network connectivity between the host and the guest Ubuntu 11.10 virtual machine.
  • Use the downloaded ISO to create the Ubuntu 11.10 server virtual machine using the hypervisor you selected. Make sure to assign it adequate resources: a good start would be 2.5 GB of RAM, 40 GB of disk space and two cores.
  • Next, follow the straightforward instructions on the Devstack webpage.
  • The main Devstack script (stack.sh) prints the URL of the Openstack Horizon web server at the end of its run. If your virtual machine is network-accessible from your host OS, you can now point your browser at that URL and log in (credentials: admin/password or demo/password) to play with your very own Openstack cloud!
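For reference, the Devstack run itself boils down to a few commands inside the VM. This is a sketch, not the official procedure: the repository location and the localrc settings below are assumptions that you should check against the Devstack webpage.

```shell
# Inside the Ubuntu 11.10 VM
sudo apt-get install -y git
git clone https://github.com/openstack-dev/devstack.git
cd devstack

# Optional: pre-seed passwords so stack.sh does not prompt for them
# (illustrative localrc - adapt to your setup)
cat > localrc <<'EOF'
ADMIN_PASSWORD=password
MYSQL_PASSWORD=password
RABBIT_PASSWORD=password
SERVICE_PASSWORD=password
EOF

./stack.sh   # prints the Horizon URL when it finishes
```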

Ok, done! Now what? Well, spawn a VM or two using the Horizon GUI. If you are more enterprising, go ahead and try the euca2ools command line tools (the same command line tools used to speak to AWS also work here, because Openstack supports the EC2 API). Try to spawn a whole bunch of VMs until your cloud gives up (in this case, it is the VM running the cloud that gives up!). If you are interested in learning how Openstack is set up, then reading the stack.sh Devstack script is a great introduction to Openstack's internals. Openstack is open source and written in accessible Python, so you can get right into Openstack development and contribute to the project if that is your kind of thing.

A word of caution: the cloud "starts clean" every time its host system (in our case the Ubuntu 11.10 VM) is rebooted. This means that any old VMs, images, or other configuration are cleared (the logical thing to do for a developer relaunching the cloud every time she changes the code base). If this is a problem for you, you can simply suspend the virtual machine instead of shutting it down. But again, this is another reason not to use Devstack for a production system.

Thursday, March 8, 2012

Stingy LTE Data-plans will ruin the new iPad's Party

The new iPad has a 2048 x 1536 pixel screen resolution. The best way to understand this is that it is higher than the resolution of your normal 1080p full-HD TV. Let us contemplate how much data is needed to drive this high-resolution screen. Let's talk video first. The Verizon FIOS HD channels typically come in at 15Mbps - that is almost 2MB/second. Now let us switch to using the new iPad as a navigation device in the car. The new iPad's resolution is more than 5 times that of the iPhone 4 (960 x 640 pixels), which means Google Maps will have to load about 5 times as much map tile data to fill up the screen. From what I read on the Internet, users report that the iPhone 4 burns through about 30MB of bandwidth per hour when Google Maps is used for navigation. So we are talking about 30x5 = 150MB/hour of data usage on the new iPad when it is used as a navigation device. Another example: for JPEG images to completely fill the new iPad's screen without any software interpolation, they will need to be over 3 megapixels each. That is about 1MB of data transfer per JPEG image!
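These back-of-the-envelope rates can be checked in a few lines of Python. The 15 Mbps HD figure and the 30 MB/hour iPhone 4 maps figure are the post's assumptions, not measurements:

```python
# Back-of-the-envelope data rates for the new iPad's 2048 x 1536 screen
MBIT = 1000 * 1000   # bits, decimal units for network rates
MB = 1000 * 1000     # bytes, decimal units to match the arithmetic above

# HD video: a 15 Mbps FIOS HD channel, converted to MB/second
hd_mb_per_sec = 15 * MBIT / 8.0 / MB         # 1.875, "almost 2 MB/second"

# Maps: the new iPad has ~5x the pixels of an iPhone 4
pixel_ratio = (2048.0 * 1536) / (960 * 640)  # 5.12
maps_mb_per_hour = 30 * 5                    # 150 MB/hour of map tiles

print(hd_mb_per_sec, pixel_ratio, maps_mb_per_hour)
```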

Ok, but why do you need full-screen apps and content, you may ask. Can't you compromise on size and save bandwidth? No, because the new iPad's screen is only 9.7 inches across, and chances are we use it while holding it in our hands. In that case, interpolation (up-sampling to fill in the pixels with guessed values) will be really ugly at best (say, for JPEG pictures) and unusable at worst (say, maps with minute features marked on them). So while it's acceptable to watch a standard-definition channel on an HD TV from a distance, it will be unacceptable to view it from close range on the new iPad in full-screen mode.

Now you might say that's all right, because we have LTE to tame the new iPad's bandwidth-sucking machine. Up to 100Mbps, if you've heard all the marketing around LTE.

But do we have the data plans? To quote from Techcrunch, here are the LTE plans on offer in the US today.

"
AT&T is offering three plans: 250MB for $14.99, 3GB for $30, and 5GB for $50. On the 250MB plan, you’ll be charged an additional $15 for each 250MB allotment you go over. On the two bigger plans, it’s a $10 overage fee per each additional 1GB of data.


Verizon is offering four different plans: 1GB for $20, 2GB for $30, 5GB for $50, and 10GB for $80. Their overage fees are a little more straightforward — it’s an extra $10 for each 1GB over.
"

Let's plug some of these numbers into the applications we alluded to earlier in this post. Say you have opted to spend $30 per month on your new iPad's LTE connection. How far will your $30 take you?




HD-quality video (@ 2MB/sec):
  • AT&T's $30 3GB LTE plan: only 1500 seconds (25 min) of HD video from your MONTHLY data plan.
  • Verizon's $30 2GB LTE plan: only 1000 seconds (17 min) of HD video from your MONTHLY data plan.

Google Maps (@ 150MB/hour):
  • AT&T's $30 3GB LTE plan: only 20 hours of Google Maps from your MONTHLY data plan.
  • Verizon's $30 2GB LTE plan: only about 13 hours of Google Maps from your MONTHLY data plan.

Simply put, these data plans just won't work for the new iPad. What is really surprising is that LTE, all through its long standardization process over the last decade, always promised lower cost per bit and higher spectral efficiency (slated to be 2x-5x, meaning LTE can pack up to 5 times as many bits into the same frequency spectrum compared to 3G). Lots of new technology was put into the standard to support very high speed mobile broadband services (read more about this on the LTE Wikipedia page).

Where are all those technology savings going? Granted, telcos are making big capital expenditures rolling out the LTE infrastructure, but consumers are paying LTE patent royalties (tens of dollars) too, every time they buy LTE-enabled devices such as the new iPad.

Why are the data plans so anemic? 

Can we as consumers really hope to enjoy the benefits of high-resolution mobile technology and LTE when Telcos are cutting out all the joy even before the first new iPad is shipped?


UPDATE: Related WSJ article: Video Speed Trap Lurks in New iPad 

UPDATE: Another angle from CNN, why images look so bad on the new iPad: Why do magazines look so bad on the new iPad?

Saturday, July 16, 2011

Google+ Circles: Humanity's Social Router

I have been trying to wrap my head around the importance of Google+ circles. The idea is to let Google+ users organize their contacts into different circles. The idea itself is not new; Facebook has let its users organize contacts into different bins for a long time now. Facebook's version of the idea hasn't really been a killer feature; in fact, PC Magazine published a Google+ circles obituary based on the idea's failure on Facebook.

Most other features of Google+ are powerful and well planned - like video calling (hangouts), seamless integration with other Google products (Gmail and YouTube), and a good cross-platform HTML mobile app. Still, Google is touting circles as the key Google+ feature. Why does Google think circles are so important?

Let's look at how Google+ circles affect the user's social networking experience. By gently forcing the user to select which circle a new contact should belong to, Google+ amortizes the job of categorizing contacts. On Facebook, by contrast, new contacts usually end up in one big "friends" bin, and the categorization (or binning) has to be performed later - a tedious task, and at least I haven't bothered to do it until now.

With Google+, I've ended up with my contacts being in one of these circles:

Fig. 1: Each circle is a post-box to send messages to a specific contact category.

So now I have a bunch of post-boxes, one corresponding to each circle, where I can post information (pictures, status updates, etc.) and have it routed to the subset of contacts that comprise the circle. This gives me the ability to target information at the relevant parts of my social network. I look at this as a social-graph routing mechanism: circles are routing rules that users put in place so that their social message streams are routed appropriately in the social graph.
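The routing analogy can be made concrete with a toy sketch. The names and structure below are my own illustration, not Google's actual data model:

```python
# Toy model: circles as routing rules over a contact graph
circles = {
    "family": {"mom", "dad"},
    "colleagues": {"alice", "bob"},
    "friends": {"alice", "carol"},
}

def route(post, to_circles, circles=circles):
    """Deliver a post to the union of the selected circles."""
    recipients = set()
    for name in to_circles:
        recipients |= circles.get(name, set())
    return {contact: post for contact in recipients}

# A status update targeted at two circles reaches each contact once,
# even a contact (like alice) who appears in both circles.
delivered = route("holiday party pics", ["colleagues", "friends"])
print(sorted(delivered))
```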

Via circles, Google+ is constructing humanity's social router, programmed with routing rules defined through the elegant circles abstraction. Yes, the same thing can be done with Facebook, but Facebook never really tried to make this the centerpiece of its product. By gently forcing users to separate relationships via circles, Google+ might just manage to make users feel confident about selectively routing their social lives to different groups of contacts, rather than blasting messages to everyone they (do or do not) know in their huge Facebook friend lists. The result should be a more information-rich Google+ social network - and with more information come better advertisement-targeting possibilities.

Users are more concerned about privacy with respect to their contacts (my family should not see what happened at the office holiday party) than about Google knowing every intimate detail of their lives. A functional social router will grant this wish without getting in the way of Google obtaining user information. No, I don't think Google will be more discerning than Facebook when it comes to monetizing the private information of users but, hey, who cares about user privacy anyway?

Sunday, May 22, 2011

App Engine's Price Shock and the New Web App Equilibrium

App Engine's Price Shock

The Google App Engine blog announced that App Engine is moving out of beta this year. According to the blog, this means that App Engine must become financially attractive to Google - hence the corresponding upward revision in pricing. The details are intricate and affect customers (app developers) differently based on which services they use; the takeaway is that there is a reduction in the free tier, a fixed monthly subscription fee, and a switch from billing based on CPU cycles to billing based on CPU-instance uptime. This has made several app developers unhappy, because they feel slighted by Google changing the pricing structure so much, so late. Google is trying to assuage developer anger, but I am guessing that App Engine stands to lose many apps and developers given the higher pricing structure.

Anyway, going back to the three highlights of the pricing change, let's see how each improves App Engine's business plan.
  1. Reduction of the free tier - Google claims that the free tier was eating into App Engine resources, because many non-paying apps run just fine using the free resources. Google is out to trim the free tier, leaving just enough to still win over developers who want to experiment with App Engine, but no more. If an app is seriously deployed, it needs to pay for App Engine real estate. Many apps running on App Engine are back-ends to iOS and Facebook apps - why should Google bank-roll apps that benefit those platforms? App Engine's early-adopter phase is over, and the freebies can end.
  2. Fixed $9 monthly subscription fee - App Engine has been a fantastic success, with more than 100K deployed apps. If even 10% of them can be converted into monthly subscriptions, App Engine will have an evergreen revenue stream (many developers have complained that they would end up paying much more than $9 per month under the new pricing model).
  3. CPU-instance time billing - Google wants customers to pay for availability (keeping their web apps running 24x7) rather than for the amount of computation the apps use. Since web apps usually remain online around the clock, it makes business sense for Google to charge for this 24x7 CPU-instance time. CPU-cycle usage can be little and far between, given the eccentric product life-cycles, usage variability, and popularity of apps. It makes perfect sense for Google to start charging for availability rather than CPU cycles.
Of course, these facts are not lost on other cloud providers such as Amazon AWS, which has always charged per CPU-instance and offers a lean free tier as well.
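A toy comparison shows why the billing switch matters for a low-traffic app. All rates and utilization figures below are illustrative assumptions, not Google's actual prices:

```python
# Sketch: CPU-cycle billing vs. CPU-instance-uptime billing
# (rates are made-up for illustration)

def cycle_billing(cpu_hours_used, rate_per_cpu_hour=0.10):
    # Old model: pay only for CPU time actually consumed
    return cpu_hours_used * rate_per_cpu_hour

def instance_billing(instances, hours_online, rate_per_instance_hour=0.08):
    # New model: pay for instance uptime, busy or idle
    return instances * hours_online * rate_per_instance_hour

# A low-traffic app: one instance online 24x7, but busy only 5% of the time
hours_in_month = 24 * 30
busy_hours = 0.05 * hours_in_month

print(round(cycle_billing(busy_hours), 2))            # 3.6
print(round(instance_billing(1, hours_in_month), 2))  # 57.6
```

The idle app pays roughly an order of magnitude more under uptime billing, which is exactly the availability-not-computation incentive described above.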

The Equilibrium Shift

I don't expect developers to shun App Engine (or other cloud services, for that matter) given this and future price increases. It is much more expensive, in terms of both capex and opex, for small and mid-sized app shops to achieve a cloud provider's level of service availability and convenience. But these price changes will affect the technology and architecture decisions of app developers. Here is how:
  1. Simplification of server-side logic, and richer client-side logic - Clever web app design can off-load CPU cycles to web browsers, all the more so with the advent of sophisticated client-side Javascript libraries and HTML5. I expect web app developers to aggressively move more processing into the user's browser. This has the added benefit of more responsive web applications; for example, HTML5's local data store can be used to store user data and shave off network latency.
  2. Multi-threaded server-side architecture - Charging for instances will push app developers toward utilizing instances more efficiently, for example by adopting multi-threaded programming approaches. There is already talk of App Engine developers shunning Python (inherently weak in multi-threading) and moving to Java.
  3. Space-CPU time trade-off - Storage space continues to be relatively cheap in the cloud. I expect developers to store a lot more application state rather than computing it again later. For example, instead of using traditional RDBMS databases (e.g. MySQL), app developers may start looking at simple NoSQL alternatives like CouchDB, which are optimized to store multiple indexes and views based on common read patterns.
  4. QoS as a service - App developers will also move toward higher workload thresholds in their load-balancing algorithms, meaning that users may see web apps slow down as new CPU instances are fired up more sparingly during times of greater demand. I expect app developers to start charging users for differentiated QoS in apps. After all, there is no free lunch, and lunch just got a whole lot more expensive.
  5. Availability as a service - Does it make sense for an app to keep its back-end running (CPU instances online) in the dead of night because 0.1% of its users are insomniacs? I think some web apps may start experimenting with breaking the unwritten rule of 24x7 web uptime, or at least charging more for the privilege of using these services outside waking hours. Of course, different time zones complicate this idea.
In the end, money matters. It is going to be interesting to watch the market shape the architecture decisions of cloud-deployed web apps in the future.

Tuesday, January 25, 2011

H.264 vs. WebM

And so it begins: the battle between the H.264 and WebM video codecs.

Google's On2 acquisition and the subsequent open-sourcing of the VP8 video codec have created a formidable competitor for H.264 - formidable not because WebM is technically superior to H.264, but because there is now a free alternative to the proprietary, licensed H.264. WebM is free, underwritten by Google, and a proven web-video delivery veteran - after all, Adobe Flash has used On2's codecs for web video delivery over the years.

There are several things going for H.264. First, it is entrenched in several video delivery formats and standards. For example, Bluray uses H.264 to encode video, and millions of Bluray players would become obsolete if WebM were used instead of H.264. My two cents are that this won't really happen; instead, newer players will incorporate the ability to decode WebM video as well. Even as I write this, I am aware of several hardware manufacturers who are incorporating WebM video decoders into their ASIC hardware. But I am not assuming that things like the Bluray standard will be changed; on the contrary, other emerging media delivery and storage standards have been frozen with H.264 selected as the codec of choice. Standards take years to change or deploy, and it is very unlikely that they can suddenly adopt WebM instead of H.264.

In the mid-term, WebM will defeat H.264 wherever there is an (easily replaceable) software decoder and soft media. By soft media I mean video that is not burnt onto read-only media like Bluray disks but instead exists, say, in the form of a web-downloadable video on a server's hard disk. The economic compulsion of having to pay the H.264 licensing body per video download and per decoder shipped, compared to the free (as in air) WebM alternative, shall edge out the former. I suspect web-video delivery platforms like YouTube will lead the charge because (1) the number of videos being downloaded is huge, and (2) the average revenue per video is minuscule, so each WebM download in place of an H.264 download saves a few cents in licensing fees.
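The scale argument is easy to quantify. The per-download fee below is an illustrative assumption standing in for "a few cents"; real H.264 licensing terms included caps and free tiers:

```python
def annual_codec_savings(downloads_per_day, fee_per_download=0.02):
    # Hypothetical per-download royalty avoided by serving WebM
    # instead of H.264, summed over a year
    return downloads_per_day * 365 * fee_per_download

# Even two cents per video is serious money at web-video volumes:
# a million downloads a day comes to millions of dollars a year
print(round(annual_codec_savings(1000000)))  # 7300000
```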

A wild card in the time-frame question will be the pace of innovation in H.264 versus WebM. If open-sourcing WebM has the desired effect of creating a better and more innovative codec in the future, then WebM could gain on H.264 faster. But I am sure that the H.264 camp won't be sitting on their hands all this while! Video codecs use advanced algorithms, and developing such concepts needs big R&D investments. Will the backers of WebM bring that kind of investment to the table in the interest of improving WebM when there is no direct revenue coming back to them?

Another thing going for WebM is the push toward virtualization in consumer electronics (away from the conventional ASIC approach) in the coming years. This means that future hardware (such as future Bluray players) may be capable of running multiple upgradable decoders rather than being tied to a specific ASIC implementing a specific decoding algorithm for a specific codec. That may just break H.264's hardware dominance over WebM. As a consumer, I would prefer to hedge my bets and buy a virtualization-capable decoder rather than be tied into one video codec via an ASIC decoder.

Tuesday, November 9, 2010

Parallelizing & Multiprocessing Commands Using Python

My computer has multiple processor cores, which means I could speed up scripts by running some of their tasks in parallel. I have written a simple Python script that uses the multiprocessing library to take a list of jobs (each a Unix command string) and execute them on a specified number of independent processes. These processes are created only once and act as a pool of "workers" which take a job, submit the result of the computation, and then take another job (if one is available in the job queue). The script ends when there are no more jobs in the job queue.
This approach is useful when (1) you have a multi-processor/multi-core CPU, (2) your tasks are CPU intensive, and (3) you are reasonably sure that the jobs are not internally parallelized to take advantage of multiple CPUs. In my case, I had two directories full of numerically-named image (.ppm) files whose PSNRs had to be compared using the pnmpsnr utility. Computing PSNR is a computationally intensive task, and running the comparisons serially (in a single process) was significantly slower than the multi-process approach.
The code below should get you started on parallelizing your computationally intensive script. You can download the script from here.

#! /usr/bin/env python
# Sachin Agarwal, Google, Twitter: sachinkagarwal, Web: http://sites.google.com/site/sachinkagarwal/ 
# November 2010
# Using Python to execute a bunch of job strings on multiple processors and print out the results of the jobs in the order they were listed in the job list (e.g. serially).
# Partly adapted from http://jeetworks.org/node/81


#These are needed by the multiprocessing scheduler
from multiprocessing import Queue
import multiprocessing
import commands
import sys

#These are specific to my jobs requirement
import os
import re
 
def RunCommand (fullCmd):
    try:
        return commands.getoutput(fullCmd)
    except:
        return "Error executing command %s" %(fullCmd)

        
class Worker(multiprocessing.Process):
 
    def __init__(self,
            work_queue,
            result_queue,
          ):
        # base class initialization
        multiprocessing.Process.__init__(self)
        self.work_queue = work_queue
        self.result_queue = result_queue
        self.kill_received = False
 
    def run(self):
        while (not (self.kill_received)) and (self.work_queue.empty()==False):
            try:
                job = self.work_queue.get_nowait()
            except:
                break

            (jobid,runCmd) = job
            rtnVal = (jobid,RunCommand(runCmd))
            self.result_queue.put(rtnVal)

            
def execute(jobs, num_processes=2):
    # load up work queue
    work_queue = multiprocessing.Queue()
    for job in jobs:
        work_queue.put(job)
 
    # create a queue to pass to workers to store the results
    result_queue = multiprocessing.Queue()
 
    # spawn workers
    worker = []
    for i in range(num_processes):
        worker.append(Worker(work_queue, result_queue))
        worker[i].start()
    
    # collect the results from the queue
    results = []
    while len(results) < len(jobs): #Beware - if a job hangs, then the whole program will hang
        result = result_queue.get()
        results.append(result)
    results.sort() # The tuples in result are sorted according to the first element - the jobid
    return (results) 

 
#MAIN 
if __name__ == "__main__":
    
    import time #Code to measure time
    starttime = time.time() #Code to measure time
    
   
    jobs = [] #List of jobs strings to execute
    jobid = 0 #Ordering of results in the results list returned

    #Code to generate my job strings. Generate your own, or load joblist into the jobs[] list from a text file
    lagFactor = 5
    ppmDir1 = sys.argv[1]
    ppmDir2 = sys.argv[2]
    decNumRe = re.compile(u"((\d+)\.(\d+))")
    ctr = 0
    for root, dirs, files in os.walk(ppmDir1):
        numFiles = len(files)
        files.sort()
        fNameLen = len(files[0].split('.')[0])
        for fName in files:
            for offset in range(0,lagFactor):
                fNumber = int(fName.split('.')[0])
                targetFile =  '%0*d' % (fNameLen, max(fNumber-offset,1))
                targetPath = ppmDir2+'/'+targetFile+'.ppm'
                origPath = ppmDir1+'/'+fName
                fullCmd = "pnmpsnr "+origPath+' '+targetPath  #Linux command to execute
                jobs.append((jobid,fullCmd)) # Append to joblist
                jobid = jobid+1

    # run
    numProcesses = 2
    if len(sys.argv) == 4: #Optional third argument: number of worker processes
        numProcesses = int(sys.argv[3])
    results = execute(jobs,numProcesses) #job list and number of worker processes
    
    #Code to print out results as needed by me. Change this to suit your own need
    # dump results
    ctr = 0
    for r in results:
        (jobid, cmdop) = r  
        if jobid % lagFactor == 0:
            print
            print jobid/lagFactor,
        print '\t',
         
        try:
            print cmdop.split()[10],
        except:
            print "Err",
        ctr = ctr+1
    print

    print "Time taken = %f" %(time.time()-starttime) #Code to measure time
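As an aside, Python's standard library packages this worker-pool pattern directly as multiprocessing.Pool. A minimal Python 3 sketch of the same job-farm follows (using subprocess, since the commands module used above is Python-2-only; function names here are my own, not from the script above):

```python
#! /usr/bin/env python3
# Worker-pool sketch with multiprocessing.Pool and subprocess
import multiprocessing
import subprocess

def run_command(cmd):
    # Run one shell command; return (exit status, stripped stdout)
    result = subprocess.run(cmd, shell=True,
                            capture_output=True, text=True)
    return result.returncode, result.stdout.strip()

def execute(jobs, num_processes=2):
    # Pool.map farms the command strings out to worker processes
    # and returns the results in the original job order
    with multiprocessing.Pool(processes=num_processes) as pool:
        return pool.map(run_command, jobs)

if __name__ == "__main__":
    jobs = ["echo job-%d" % i for i in range(4)]
    for status, output in execute(jobs):
        print(status, output)
```

Pool handles the queueing, worker lifecycle, and result ordering that the script above implements by hand, at the cost of less control over per-worker state.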