Thursday, January 15, 2015

MongoDB


Just a few years back, all data was usually stored in a relational database like Oracle or MySQL. Application developers had some understanding of the database, its tables, and how to use them. Relational databases support ACID (Atomicity, Consistency, Isolation, Durability) and were the right choice for many applications. The transaction and join capabilities, however, came at a cost in performance and scalability, which couldn't be solved easily without hardware upgrades and admin intervention. NoSQL databases provide an alternative with replication and scalability features built in, and are a much better fit for high-traffic, performance-oriented systems. This article describes the features of MongoDB, with examples, and its limitations.


Comparison with Relational DB



A relational DB is normalized, and unused columns are left as nulls. Table relations are not easily recognizable just by looking at the data; they usually require an understanding of the table design and the mapping information.

In contrast, MongoDB has no defined schema. The actual data represents the structure, and optional values are simply left out. The flexible design is easy to understand, but data gets duplicated when dealing with relations; that is the tradeoff.


MongoDB can scale horizontally like web servers. It is de-normalized, focused on performance rather than joins, and best suited for caching, event logging, and high-volume traffic. It scales for both reading and writing using replica sets and sharding.

MongoDB is not ideal for transactional applications, ad-hoc business intelligence, or problems that require SQL.


Features


MongoDB stores data as documents (BSON), not rows. Rows are flat, whereas documents have an embedded structure that can contain other documents and arrays. Arrays can also contain arrays.
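
For instance, a minimal sketch of such a document (the collection and field names here are hypothetical):

db.books.insert({
    title: "MongoDB Notes",                         // flat field, as in a row
    authors: ["alice", "bob"],                      // array field
    publisher: {name: "Acme", city: "New York"},    // embedded document
    editions: [[2011, "1st"], [2014, "2nd"]]        // arrays inside an array
})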

References can be used in MongoDB as in relational databases; however, it is up to the user to maintain the reference, as Mongo does not enforce relational integrity.
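
A small sketch of maintaining such a reference by hand (collection and field names are made up for illustration):

var author = db.authors.findOne({name: "eliot"})
db.posts.insert({title: "intro", author_id: author._id})
// the application has to resolve the reference itself; Mongo will not
var post = db.posts.findOne({title: "intro"})
db.authors.findOne({_id: post.author_id})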

MapReduce functions are available and are written in JavaScript.
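
As a rough example, summing scores per name with JavaScript map and reduce functions (the scores collection and its fields are assumptions here):

db.scores.mapReduce(
    function() { emit(this.name, this.score) },           // map: emit key/value pairs
    function(key, values) { return Array.sum(values) },   // reduce: combine values per key
    {out: "score_totals"}                                 // write results to a collection
)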


Advantages

  • Replication and HA capabilities are built in. This is a considerable advantage when moving to production, as the data can scale horizontally
  • MapReduce and the Aggregation Framework are built in
  • GridFS - lets you use Mongo as a file system. Effective when the application needs to store files. Working with Java, a document can be accessed via java.io.File; a file is broken into chunks. When sharding, however, the chunks of a file cannot be stored in different shards

BSON


The actual data (documents) is stored as BSON (Binary JSON). The driver takes care of transforming JSON to and from BSON.


ObjectID


ObjectID is the unique id for each document. It can be generated without locking or sequencing and is unique every time, irrespective of the number of machines used. Generation uses the time, machine id, process id, and a sequence counter. An ObjectID is created automatically if one is not provided during insert.
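
A quick shell illustration (the ObjectId value shown is purely illustrative):

db.scores.insert({student: "joe"})         // no _id supplied
db.scores.findOne({student: "joe"})._id    // returns something like ObjectId("54b7f1a2...")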

Index

Mongo uses 3 kinds of indexes:


  1. normal, using strings and numbers
  2. text-based indexing
  3. geospatial indexes based on location, very useful for location-based searches. The insert has to capture the location as 2d or 2dsphere data, and the search (sketched below) can be based on
    1. $near, around a point
    2. a min and max range within 2d space
    3. a particular area (circle, box, or arbitrary polygon) using $within
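
A minimal sketch of these location searches, assuming a places collection with a loc field (all names and coordinates here are made up):

db.places.ensureIndex({loc: "2d"})                       // 2d index captures the location
db.places.insert({name: "cafe", loc: [-73.99, 40.70]})
db.places.find({loc: {$near: [-74.00, 40.71]}})                            // around a point
db.places.find({loc: {$within: {$box: [[-74.2, 40.5], [-73.7, 40.9]]}}})   // within a 2d box
db.places.find({loc: {$within: {$center: [[-74.00, 40.71], 0.05]}}})       // within a circle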

For a query, the first suitable index found during the scan of candidates is used; the other candidates are disregarded once that index is found.

An index can be created using ensureIndex. A query can also use hints (usually not a good idea, because the query analyzer knows best).

Examples
db.collection.find({username:"foo", city:"New York"}).hint({username:1})
e.g.
db.posts.ensureIndex({"comments.author":1})
db.posts.find({"comments.author":"eliot"})

A sort on a large amount of data without an index will result in an error.
Full table scans are bad, so use indexes.

Sparse index
introduced in Mongo 2.0 to reduce the storage of null fields
downside - it can't answer "not in index" queries
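
A short sketch (the users collection and nickname field are assumptions):

db.users.ensureIndex({nickname: 1}, {sparse: true})   // documents without nickname are left out of the index
db.users.find({nickname: null})                       // may miss documents if the sparse index is used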

Arrays



Multiple data fields, e.g. the authors of a book, can be represented as an array.

An index on an array indexes each field (each array item) and also the array as a whole. This makes retrieval of data much easier.

Searching is powerful, as matching items can easily be found within a single collection.
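
A brief sketch of indexing and searching an array field (collection and data are hypothetical):

db.books.ensureIndex({authors: 1})   // each array item is indexed
db.books.find({authors: "alice"})    // matches any document whose authors array contains "alice"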

$Operators

$in  

$or
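
For example (the scores collection is assumed, as in the examples later in this article):

db.scores.find({name: {$in: ["exam", "quiz"]}})               // field matches any listed value
db.scores.find({$or: [{score: {$gt: 90}}, {name: "essay"}]})  // either condition matches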

$set (for update) - for an array, the item that matches can be individually updated using the positional operator, e.g. {$set:{"author.$.company":"abc corp"}}
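
A small sketch of that positional update (the document shape is assumed):

db.books.update({"author.name": "alice"},
                {$set: {"author.$.company": "abc corp"}})   // updates only the matched array item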


Examples

Examples of some of the basic commands used in MongoDB:
show dbs
show collections
  system.indexes - a special collection that holds index definitions
db.scores.findOne()
db.scores.distinct("name")
db.scores.count()
db.scores.count({"name":"exam"})

Fields returned can be included or excluded, but not both (the _id field is the exception).
db.scores.find({"score":{$gt:95}}, {"score":1, _id:0})
A sort can be done by specifying the field name and order (1 for ascending, -1 for descending).

student ascending
 db.scores.find({"score":{$gt:95}}, {"score":1, "student":1, _id:0}).sort({"student":1})
student ascending and score descending
 db.scores.find({"score":{$gt:95}}, {"score":1, "student":1, _id:0}).sort({"student":1, "score":-1})

unset example
update the grade field on all documents - the first param {} matches everything
 db.scores.update({}, {$unset:{"grade":1}}, false, true)
 
push / pull (to push items onto an array or pull items from an array)
 add an array item
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$push:{"tags":"classic and important"}})

$addToSet adds the item only if it is not already present
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$addToSet:{"tags":"classic and important"}})

 delete an array item
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$pull:{"tags":"classic and important"}})

$pullAll and $pushAll handle multiple items

$inc - increments a field by a given value.
Mongo applies the increment on the server, taking away the atomicity handling that would otherwise have to be done in the web server.
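
For example (field names assumed):

db.scores.update({student: "joe"}, {$inc: {score: 5}})   // the server applies the increment atomically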

Deployment


RAM - frequently used data is kept in RAM for faster retrieval; disk access is about 100 times slower than RAM.
Deployment trick - fetch the required data forcibly with a query so that it is already in RAM for users.
Production should be 64-bit, as 32-bit supports only a 2 GB limit.

Mongo pre-allocates data files, so the file system matters - posix_fallocate() on Linux works fine, just marking the start and end on the memory map.
Data read into memory has padding, to allow changes to the data to be written back to disk.
toString on any Mongo data yields the JSON representation, which is very useful for debugging.


Drivers and working with applications


Native driver - how to
uses DBObject - a Map interface with key-value pairs - the key is a String and the value is an Object of Mongo data

ODM - Object Document Mapping
Morphia - like JPA, annotation-driven, and written for Mongo, so it's customized for Mongo
Spring Data Document - less tightly coupled, as it is used across NoSQL databases; recommended for Spring veterans

Mongo has support for Hadoop and Hadoop components - Pig, Hive

Capped Collection
designed for replication - no _id index, no deletes, documents are maintained in insertion order, and updates are allowed only if the document won't grow above its padding
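
A minimal sketch of creating one (the name and sizes are arbitrary):

db.createCollection("eventlog", {capped: true, size: 1048576, max: 1000})
db.eventlog.insert({msg: "started"})   // oldest documents get overwritten once the cap is reached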

Tailable Cursors
  like tail -f in Unix; efficient; an "await" cursor can keep pulling data as more documents arrive for the particular query
findAndModify - can be used in specialized scenarios where the changed document is required back from Mongo (see the sketch below)
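
A hypothetical sketch of findAndModify claiming a job and getting the changed document back (collection and fields are made up):

db.jobs.findAndModify({
    query: {status: "new"},
    update: {$set: {status: "taken"}},
    new: true                            // return the modified document, not the original
})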

Sharding and Replication

Scaling is not just about adding servers; it involves the following components


  • Operations/sec go up
  • Storage needs go up - capacity and IOPS
  • Complexity goes up

Optimization and tuning are based on


  • Schema and Index Design
  • O/S Tuning
  • Hardware

Vertical scaling - expensive, and may be impossible when dealing with certain cloud-based systems

Scaling can be done by


  • Schema & Index Design
  • Sharding
  • Replication

Schema


  • Use embedded vs. linking; facts to consider
    • round trips to the DB
    • disk seek time
    • size of data to read/write
  • Use partial vs. full document writes ($set does a partial write; see the sketch after this list)
  • Use partial vs. full document reads
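
A quick illustration of partial vs. full writes (collection and fields assumed):

db.users.update({_id: 1}, {$set: {city: "NY"}})        // partial write: only the city field is sent and changed
db.users.update({_id: 1}, {name: "joe", city: "NY"})   // full write: the entire document is replaced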

Index


  • Mongo can use only one index at a time for a query
  • Index common queries, but do not over-index - (A) and (A,B) are equivalent, since the compound index also covers queries on A alone
  • Prefer right-balanced indexes, so only the most recently used part of the index has to stay in RAM

Sharding
  • Automatic partitioning and management
  • Range based
  • Convert to sharded system with no downtime
  • Fully consistent


e.g. adding a shard and sharding a collection based on age (sharding must first be enabled on the database)

db.runCommand({addshard:"shard1"});
db.runCommand({enablesharding:"mydb"});
db.runCommand({shardcollection:"mydb.blogs", key:{age:1}})
ranges are stored as chunks
chunks are created automatically

shards can be added as needed using
db.runCommand({addshard:"shard2"});

a shard can also be removed
sharding causes no downtime, and balancing happens automatically as data is written

An insert must include the shard key
An update must include the shard key
A query with the shard key is routed to the right node
 - without the shard key it is scattered to all shards and gathered
An indexed query
  - with the shard key is returned in order
  - without the shard key is a distributed sort and merge
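
Illustrating with the mydb.blogs collection sharded on age above (the author field is hypothetical):

db.blogs.find({age: 25})         // contains the shard key: routed to one shard
db.blogs.find({author: "joe"})   // no shard key: scattered to all shards and gathered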

Replication

A replica set is not master/slave; with master/slave, a new master has to be manually switched in if the master goes down.

A replica set works as
  • a cluster of N servers
  • any (one) can be primary
  • automatic failover and recovery
  • all writes go to the primary
  • reads can go to the primary (default) or a secondary
  • there have to be at least 2 or more members so that a primary can be elected
  • if the primary fails, negotiation happens to elect the new primary
A node can be
 normal { priority:1 } or passive { priority:0 } - cannot become primary
 an arbiter - can vote but cannot hold any data
 hidden { hidden:true }
 tagged (after version 2.0), e.g. tags { "dc":"ny", "rack":"123" } - see the sketch below
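
A sketch of a replica set configuration exercising these options (hostnames and tags are made up):

rs.initiate({
    _id: "rs0",
    members: [
        {_id: 0, host: "db1:27017", priority: 1},                       // normal, can become primary
        {_id: 1, host: "db2:27017", priority: 0, hidden: true},         // passive and hidden
        {_id: 2, host: "db3:27017", arbiterOnly: true},                 // votes but holds no data
        {_id: 3, host: "db4:27017", tags: {"dc": "ny", "rack": "123"}}
    ]
})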

slaveOk()
  - the driver sends read requests to a secondary
  - writes always go to the primary
Java -
   DB.slaveOk() or DBCollection.slaveOk()

Reads from the primary are consistent
Reads from a secondary are eventually consistent
