Thursday, January 15, 2015

MongoDB


Just a few years back, all data was usually stored in a relational database like Oracle or MySQL. Application developers had some understanding of the database, its tables, and how to use them. Relational databases support ACID (Atomicity, Consistency, Isolation, Durability) and were the right choice for many applications. The transaction and join capabilities, however, came at a cost in performance and scalability, which couldn't be solved easily without hardware upgrades and admin intervention. NoSQL databases provide an alternative with replication and scalability features built in, and are a much better fit for high-traffic, performance-oriented systems. This article describes the features of MongoDB, with examples, and its limitations.


Comparison with Relational DB



A relational DB is normalized, and unused columns are left as nulls. Table relations are not easily recognizable just by looking at the data; they usually require an understanding of the table design and the mapping information.

In contrast, MongoDB has no defined schema. The actual data represents the structure, and optional values are simply left out. The flexible design is easy to understand, but data gets duplicated when dealing with relations; that is the tradeoff.


MongoDB can scale horizontally like web servers. It is de-normalized, focused on performance rather than joins, and best suited for caching, event logging, and high-volume traffic. It scales for both reading and writing using replica sets and sharding.

MongoDB is not ideal for transactional applications, ad-hoc business intelligence, or problems that require SQL.


Features


MongoDB stores data as documents (BSON), not rows. Rows are flat, whereas documents have an embedded structure that can contain other documents and arrays. Arrays can also contain arrays.
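
For instance, a minimal sketch of such a document (the collection and field names here are hypothetical):

db.books.insert({
    title: "MongoDB Notes",                         // flat field, as in a row
    authors: ["alice", "bob"],                      // array field
    publisher: {name: "Acme", city: "New York"},    // embedded document
    editions: [[2011, "1st"], [2014, "2nd"]]        // arrays inside an array
})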

References can be used in MongoDB as in relational databases; however, it is up to the user to maintain the reference, as Mongo does not enforce relational integrity.
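
A small sketch of maintaining such a reference by hand (collection and field names are made up for illustration):

var author = db.authors.findOne({name: "eliot"})
db.posts.insert({title: "intro", author_id: author._id})
// the application has to resolve the reference itself; Mongo will not
var post = db.posts.findOne({title: "intro"})
db.authors.findOne({_id: post.author_id})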

MapReduce functions are available and are written in JavaScript.
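
As a rough example, summing scores per name with JavaScript map and reduce functions (the scores collection and its fields are assumptions here):

db.scores.mapReduce(
    function() { emit(this.name, this.score) },           // map: emit key/value pairs
    function(key, values) { return Array.sum(values) },   // reduce: combine values per key
    {out: "score_totals"}                                 // write results to a collection
)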


Advantages

  • Replication and HA capabilities are built in. This is a considerable advantage when moving to production, as the data can scale horizontally
  • MapReduce and the Aggregation Framework are built in
  • GridFS - lets you use Mongo as a file system. Effective when the application needs to store files. Working with Java, a document can be accessed via java.io.File; a file is broken into chunks. When sharding, however, the chunks of a file cannot be stored in different shards

BSON


The actual data (documents) is stored as BSON (Binary JSON). The driver takes care of transforming JSON to and from BSON.


ObjectID


ObjectID is the unique id for each document. It can be generated without locking or sequencing and is unique every time, irrespective of the number of machines used. Generation uses the time, machine id, process id, and a sequence counter. An ObjectID is created automatically if one is not provided during insert.
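
A quick shell illustration (the ObjectId value shown is purely illustrative):

db.scores.insert({student: "joe"})         // no _id supplied
db.scores.findOne({student: "joe"})._id    // returns something like ObjectId("54b7f1a2...")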

Index

Mongo uses 3 kinds of indexes:


  1. normal, using strings and numbers
  2. text-based indexing
  3. geospatial indexes based on location, very useful for location-based searches. The insert has to capture the location as 2d or 2dsphere data, and the search (sketched below) can be based on
    1. $near, around a point
    2. a min and max range within 2d space
    3. a particular area (circle, box, or arbitrary polygon) using $within
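
A minimal sketch of these location searches, assuming a places collection with a loc field (all names and coordinates here are made up):

db.places.ensureIndex({loc: "2d"})                       // 2d index captures the location
db.places.insert({name: "cafe", loc: [-73.99, 40.70]})
db.places.find({loc: {$near: [-74.00, 40.71]}})                            // around a point
db.places.find({loc: {$within: {$box: [[-74.2, 40.5], [-73.7, 40.9]]}}})   // within a 2d box
db.places.find({loc: {$within: {$center: [[-74.00, 40.71], 0.05]}}})       // within a circle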

For a query, the first suitable index found during the scan of candidates is used; the other candidates are disregarded once that index is found.

An index can be created using ensureIndex. A query can also use hints (usually not a good idea, because the query analyzer knows best).

Examples
db.collection.find({username:"foo", city:"New York"}).hint({username:1})
e.g.
db.posts.ensureIndex({"comments.author":1})
db.posts.find({"comments.author":"eliot"})

A sort on a large amount of data without an index will result in an error.
Full table scans are bad, so use indexes.

Sparse index
introduced in Mongo 2.0 to reduce the storage of null fields
downside - it can't answer "not in index" queries
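
A short sketch (the users collection and nickname field are assumptions):

db.users.ensureIndex({nickname: 1}, {sparse: true})   // documents without nickname are left out of the index
db.users.find({nickname: null})                       // may miss documents if the sparse index is used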

Arrays



Multiple data fields, e.g. the authors of a book, can be represented as an array.

An index on an array indexes each field (each array item) and also the array as a whole. This makes retrieval of data much easier.

Searching is powerful, as matching items can easily be found within a single collection.
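
A brief sketch of indexing and searching an array field (collection and data are hypothetical):

db.books.ensureIndex({authors: 1})   // each array item is indexed
db.books.find({authors: "alice"})    // matches any document whose authors array contains "alice"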

$Operators

$in  

$or
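
For example (the scores collection is assumed, as in the examples later in this article):

db.scores.find({name: {$in: ["exam", "quiz"]}})               // field matches any listed value
db.scores.find({$or: [{score: {$gt: 90}}, {name: "essay"}]})  // either condition matches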

$set (for update) - for an array, the item that matches can be individually updated using the positional operator, e.g. {$set:{"author.$.company":"abc corp"}}
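
A small sketch of that positional update (the document shape is assumed):

db.books.update({"author.name": "alice"},
                {$set: {"author.$.company": "abc corp"}})   // updates only the matched array item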


Examples

Examples of some of the basic commands used in MongoDB:
show dbs
show collections
  system.indexes - a special collection that holds index definitions
db.scores.findOne()
db.scores.distinct("name")
db.scores.count()
db.scores.count({"name":"exam"})

Fields returned can be included or excluded, but not both (the _id field is the exception).
db.scores.find({"score":{$gt:95}}, {"score":1, _id:0})
A sort can be done by specifying the field name and order (1 for ascending, -1 for descending).

student ascending
 db.scores.find({"score":{$gt:95}}, {"score":1, "student":1, _id:0}).sort({"student":1})
student ascending and score descending
 db.scores.find({"score":{$gt:95}}, {"score":1, "student":1, _id:0}).sort({"student":1, "score":-1})

unset example
update the grade field on all documents - the first param {} matches everything
 db.scores.update({}, {$unset:{"grade":1}}, false, true)
 
push / pull (to push items onto an array or pull items from an array)
 add an array item
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$push:{"tags":"classic and important"}})

$addToSet adds the item only if it is not already present
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$addToSet:{"tags":"classic and important"}})

 delete an array item
db.books.update({"_id": ObjectId("<24-char hex string>")},
                            {$pull:{"tags":"classic and important"}})

$pullAll and $pushAll handle multiple items

$inc - increments a field by a given value.
Mongo applies the increment on the server, taking away the atomicity handling that would otherwise have to be done in the web server.
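
For example (field names assumed):

db.scores.update({student: "joe"}, {$inc: {score: 5}})   // the server applies the increment atomically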

Deployment


RAM - frequently used data is kept in RAM for faster retrieval; disk access is about 100 times slower than RAM.
Deployment trick - fetch the required data forcibly with a query so that it is already in RAM for users.
Production should be 64-bit, as 32-bit supports only a 2 GB limit.

Mongo pre-allocates data files, so the file system matters - posix_fallocate() on Linux works fine, just marking the start and end on the memory map.
Data read into memory has padding, to allow changes to the data to be written back to disk.
toString on any Mongo data yields the JSON representation, which is very useful for debugging.


Drivers and working with applications


Native driver - how to
uses DBObject - a Map interface with key-value pairs - the key is a String and the value is an Object of Mongo data

ODM - Object Document Mapping
Morphia - like JPA, annotation-driven, and written for Mongo, so it's customized for Mongo
Spring Data Document - less tightly coupled, as it is used across NoSQL databases; recommended for Spring veterans

Mongo has support for Hadoop and Hadoop components - Pig, Hive

Capped Collection
designed for replication - no _id index, no deletes, documents are maintained in insertion order, and updates are allowed only if the document won't grow above its padding
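
A minimal sketch of creating one (the name and sizes are arbitrary):

db.createCollection("eventlog", {capped: true, size: 1048576, max: 1000})
db.eventlog.insert({msg: "started"})   // oldest documents get overwritten once the cap is reached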

Tailable Cursors
  like tail -f in Unix; efficient; an "await" cursor can keep pulling data as more documents arrive for the particular query
findAndModify - can be used in specialized scenarios where the changed document is required back from Mongo (see the sketch below)
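
A hypothetical sketch of findAndModify claiming a job and getting the changed document back (collection and fields are made up):

db.jobs.findAndModify({
    query: {status: "new"},
    update: {$set: {status: "taken"}},
    new: true                            // return the modified document, not the original
})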

Sharding and Replication

Scaling is not just about adding servers; it involves the following components


  • Operations/sec go up
  • Storage needs go up - capacity and IOPS
  • Complexity goes up

Optimization and tuning are based on


  • Schema and Index Design
  • O/S Tuning
  • Hardware

Vertical scaling - expensive, and may be impossible when dealing with certain cloud-based systems

Scaling can be done by


  • Schema & Index Design
  • Sharding
  • Replication

Schema


  • Use embedded vs. linking; facts to consider
    • round trips to the DB
    • disk seek time
    • size of data to read/write
  • Use partial vs. full document writes ($set does a partial write; see the sketch after this list)
  • Use partial vs. full document reads
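
A quick illustration of partial vs. full writes (collection and fields assumed):

db.users.update({_id: 1}, {$set: {city: "NY"}})        // partial write: only the city field is sent and changed
db.users.update({_id: 1}, {name: "joe", city: "NY"})   // full write: the entire document is replaced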

Index


  • Mongo can use only one index at a time for a query
  • Index common queries, but do not over-index - (A) and (A,B) are equivalent, since the compound index also covers queries on A alone
  • Prefer right-balanced indexes, so only the most recently used part of the index has to stay in RAM

Sharding
  • Automatic partitioning and management
  • Range based
  • Convert to sharded system with no downtime
  • Fully consistent


e.g. adding a shard and sharding a collection based on age (sharding must first be enabled on the database)

db.runCommand({addshard:"shard1"});
db.runCommand({enablesharding:"mydb"});
db.runCommand({shardcollection:"mydb.blogs", key:{age:1}})
ranges are stored as chunks
chunks are created automatically

shards can be added as needed using
db.runCommand({addshard:"shard2"});

a shard can also be removed
sharding causes no downtime, and balancing happens automatically as data is written

An insert must include the shard key
An update must include the shard key
A query with the shard key is routed to the right node
 - without the shard key it is scattered to all shards and gathered
An indexed query
  - with the shard key is returned in order
  - without the shard key is a distributed sort and merge
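
Illustrating with the mydb.blogs collection sharded on age above (the author field is hypothetical):

db.blogs.find({age: 25})         // contains the shard key: routed to one shard
db.blogs.find({author: "joe"})   // no shard key: scattered to all shards and gathered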

Replication

A replica set is not master/slave; with master/slave, a new master has to be manually switched in if the master goes down.

A replica set works as
  • a cluster of N servers
  • any (one) can be primary
  • automatic failover and recovery
  • all writes go to the primary
  • reads can go to the primary (default) or a secondary
  • there have to be at least 2 or more members so that a primary can be elected
  • if the primary fails, negotiation happens to elect the new primary
A node can be
 normal { priority:1 } or passive { priority:0 } - cannot become primary
 an arbiter - can vote but cannot hold any data
 hidden { hidden:true }
 tagged (after version 2.0), e.g. tags { "dc":"ny", "rack":"123" } - see the sketch below
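
A sketch of a replica set configuration exercising these options (hostnames and tags are made up):

rs.initiate({
    _id: "rs0",
    members: [
        {_id: 0, host: "db1:27017", priority: 1},                       // normal, can become primary
        {_id: 1, host: "db2:27017", priority: 0, hidden: true},         // passive and hidden
        {_id: 2, host: "db3:27017", arbiterOnly: true},                 // votes but holds no data
        {_id: 3, host: "db4:27017", tags: {"dc": "ny", "rack": "123"}}
    ]
})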

slaveOk()
  - the driver sends read requests to a secondary
  - writes always go to the primary
Java -
   DB.slaveOk() or DBCollection.slaveOk()

Reads from the primary are consistent
Reads from a secondary are eventually consistent
