Just a few years ago, most data was stored in relational databases such as Oracle or MySQL. Application developers had some understanding of the database, its tables, and how to use them. Relational databases support ACID (Atomicity, Consistency, Isolation, Durability) and were the right choice for some applications. Their transaction and join capabilities, however, came at a cost in performance and scalability that could not easily be solved without hardware upgrades and administrator intervention. NoSQL databases provide an alternative with replication and scalability features built in, and can be a much better fit for high-traffic, performance-oriented systems. This article describes the features of MongoDB with examples, along with its limitations.
MapReduce functions are available and are written in JavaScript.
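As a rough illustration of what such functions look like, the sketch below defines a map and a reduce function in plain JavaScript and runs them with a tiny local runner. The runner (`runMapReduce`), the sample documents, and the function names are hypothetical stand-ins for illustration, not the server's implementation. On the server, the map function likewise reads the document from `this` and calls an `emit` function that Mongo provides.

```javascript
// Hypothetical sample documents; in MongoDB these would live in a collection.
const posts = [
  { author: "eliot", tags: ["mongodb", "nosql"] },
  { author: "foo", tags: ["mongodb"] },
  { author: "eliot", tags: ["nosql"] },
];

// Stand-in for the emit() that the server provides; assigned by the runner below.
let emit;

// Map: runs once per document (the document is `this`) and emits (key, value) pairs.
function mapTags() {
  this.tags.forEach(function (tag) {
    emit(tag, 1);
  });
}

// Reduce: combines all values emitted for one key into a single value.
function reduceCounts(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Local runner (an assumption for illustration): groups emitted values by key,
// then reduces each group.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc));
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

const tagCounts = runMapReduce(posts, mapTags, reduceCounts);
console.log(tagCounts); // { mongodb: 2, nosql: 2 }
```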
ObjectId - a unique id for each document. It can be generated without locking or sequencing and is unique every time, irrespective of the number of machines used. It is generated from the time, machine id, process id, and a sequence number. An ObjectId is created automatically if one is not provided during insert.
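The idea can be sketched in plain JavaScript, assuming the classic 12-byte layout (4-byte timestamp, 3-byte machine id, 2-byte process id, 3-byte counter). The function names and the machine and process ids below are made up for illustration; a real driver derives them itself and starts the counter at a random value.

```javascript
// Per-process counter; a simplification (real drivers start from a random value).
let counter = 0;

// Render the lowest `bytes` bytes of a non-negative integer as zero-padded hex.
function toHex(value, bytes) {
  const modulo = Math.pow(256, bytes);
  return (value % modulo).toString(16).padStart(bytes * 2, "0");
}

// Build a 24-character hex id from time, machine id, process id, and counter.
// No lock or central sequence is needed: each process only touches its own counter.
function makeObjectIdHex(machineId, processId, now = Date.now()) {
  const seconds = Math.floor(now / 1000);
  counter = (counter + 1) % 0x1000000; // wrap at 3 bytes
  return (
    toHex(seconds, 4) + toHex(machineId, 3) + toHex(processId, 2) + toHex(counter, 3)
  );
}

// The creation time can be read back from the first 4 bytes.
function timestampOf(idHex) {
  return new Date(parseInt(idHex.slice(0, 8), 16) * 1000);
}

const id1 = makeObjectIdHex(0xabcdef, 1234);
const id2 = makeObjectIdHex(0xabcdef, 1234);
console.log(id1, id2, timestampOf(id1));
```

Because the timestamp leads the id, ids created later sort after ids created earlier, to one-second resolution.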
RAM - frequently used data is kept in RAM for faster retrieval. Disk access can be 100 times or more slower than RAM.
Native driver - How to
Comparison with Relational DB
A relational DB is normalized, and unused columns are left as nulls. Table relations are not easily recognizable just by looking at the data; they usually require an understanding of the table design and the mapping information.
MongoDB, by comparison, has no defined schema. The actual data represents the structure, and optional values are simply left out. The flexible design is easy to understand, but data is duplicated when dealing with relations; this is the tradeoff.
MongoDB can scale horizontally like web servers. Its de-normalized model favors performance over joins and is best suited for caching, event logging, and high-volume traffic. Reads and writes are scaled using replica sets and sharding.
MongoDB is not ideal for transactional applications, ad-hoc business intelligence, or problems that require SQL.
MongoDB stores data as documents (BSON), not rows. Rows are flat, whereas documents have an embedded structure that can contain other documents and arrays. Arrays can also contain arrays.
References can be achieved in MongoDB as in relational databases; however, it's up to the user to maintain the reference, because Mongo doesn't enforce referential integrity.
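To make the difference concrete, here is a small sketch using plain JavaScript objects in place of collections. The field names (`authorId`, `comments`, `votes`) and the helper `resolveAuthor` are hypothetical. It shows a document with embedded documents and nested arrays, plus a manual reference that the application has to resolve and keep valid itself.

```javascript
// A stand-in for an "authors" collection.
const authors = [{ _id: 1, name: "eliot" }];

// One document: embedded comments (documents inside an array) and arrays in arrays.
const post = {
  _id: 100,
  title: "Hello Mongo",
  authorId: 1, // manual reference to authors._id; nothing enforces it
  comments: [
    { author: "foo", text: "nice", votes: [[1, "up"], [2, "down"]] },
  ],
};

// The application, not the database, follows the reference.
function resolveAuthor(postDoc, authorDocs) {
  return authorDocs.find((a) => a._id === postDoc.authorId) || null;
}

console.log(resolveAuthor(post, authors)); // { _id: 1, name: 'eliot' }

// If the author is removed, the post still points at it: a dangling reference.
console.log(resolveAuthor(post, [])); // null
```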
- Replication and HA capabilities are built in. This is a considerable advantage when moving to production, as the data can scale horizontally.
- MapReduce and the aggregation framework are built in.
- GridFS - provides a way to use Mongo as a file system, which is effective when the application needs to store files. When working with Java, a file can be accessed through java.io.File. Files are broken into chunks. When sharding, however, the chunks of a file cannot be stored on different shards.
Actual data (documents) is stored as BSON (Binary JSON). The driver takes care of converting between JSON and BSON.
Mongo uses 3 kinds of indexes
- normal indexes on strings and numbers
- text-based indexes
- geospatial indexes based on location, very useful for location-based searches. Inserted documents have to capture the location in a form that a 2d or 2dsphere index supports, and a search can be based on
- $near: around a point
- a min and max range within 2d space
- $within: inside a particular area (circle, box, or arbitrary polygon)
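The meaning of these searches can be sketched in plain JavaScript on a flat 2d plane. This is only an illustration of the semantics under that assumption (no spherical geometry and no index); the sample data and the helper names `near`, `withinBox`, and `withinCircle` are hypothetical and are not MongoDB functions.

```javascript
// Hypothetical documents with a [x, y] location, as a 2d index would expect.
const places = [
  { name: "A", loc: [0, 0] },
  { name: "B", loc: [3, 4] },
  { name: "C", loc: [10, 10] },
];

// Straight-line distance between two points on the plane.
function distance([x1, y1], [x2, y2]) {
  return Math.hypot(x1 - x2, y1 - y2);
}

// Like $near: all documents, closest to the point first.
function near(docs, point) {
  return [...docs].sort(
    (a, b) => distance(a.loc, point) - distance(b.loc, point)
  );
}

// Like $within with a box given as [[minX, minY], [maxX, maxY]].
function withinBox(docs, [[minX, minY], [maxX, maxY]]) {
  return docs.filter(
    ({ loc: [x, y] }) => x >= minX && x <= maxX && y >= minY && y <= maxY
  );
}

// Like $within with a circle given as a center and a radius.
function withinCircle(docs, center, radius) {
  return docs.filter((d) => distance(d.loc, center) <= radius);
}

console.log(near(places, [4, 4]).map((p) => p.name)); // [ 'B', 'A', 'C' ]
console.log(withinBox(places, [[0, 0], [5, 5]]).map((p) => p.name)); // [ 'A', 'B' ]
console.log(withinCircle(places, [0, 0], 6).map((p) => p.name)); // [ 'A', 'B' ]
```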
When several indexes could serve a query, the first index whose scan completes is used for the query, and the other attempts are abandoned once that index is found.
An index can be created using ensureIndex. A query can also use hints (usually not a good idea, because the query analyzer knows best).
db.collection.find({username: 'foo', city: 'New York'}).hint({username: 1})
db.posts.ensureIndex({"comments.author": 1})
db.posts.find({"comments.author": "eliot"})
Sorting a large amount of data without an index will result in an error.
Full collection scans are bad, so use indexes.
Sparse index
introduced in Mongo 2.0 to reduce the storage spent on null (missing) fields
downside: it can't answer queries for documents that are not in the index, i.e. those missing the field.
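A toy model in plain JavaScript may help show the tradeoff. The index here is just an array of entries, and `buildIndex` and `missingIds` are hypothetical helpers, not MongoDB APIs. A regular index records a null entry for documents that lack the field, while a sparse index skips them, so the sparse index is smaller but knows nothing about those documents.

```javascript
// Hypothetical collection: only some users have a twitter field.
const users = [
  { _id: 1, name: "a", twitter: "@a" },
  { _id: 2, name: "b" },
  { _id: 3, name: "c", twitter: "@c" },
];

// Build a toy index: one { key, _id } entry per indexed document.
function buildIndex(docs, field, { sparse = false } = {}) {
  const entries = [];
  for (const doc of docs) {
    const present = Object.prototype.hasOwnProperty.call(doc, field);
    if (!present && sparse) continue; // sparse: no entry for a missing field
    entries.push({ key: present ? doc[field] : null, _id: doc._id });
  }
  return entries;
}

// Ask an index which documents lack the field.
function missingIds(index) {
  return index.filter((e) => e.key === null).map((e) => e._id);
}

const regular = buildIndex(users, "twitter");
const sparse = buildIndex(users, "twitter", { sparse: true });

console.log(regular.length, sparse.length); // 3 2
console.log(missingIds(regular)); // [ 2 ]
console.log(missingIds(sparse)); // [] (the sparse index cannot see document 2)
```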
Multi-valued fields, e.g. the authors of a book, can be represented as an array.
An index on an array covers each item in the array as well as the array as a whole, which makes retrieval of data much easier.
Search can be powerful, as the matching items can easily be found in one collection.
$set (for update) - for an array, the item that matches can be updated individually, e.g. {$set: {"author.$.company": "abc corp"}}
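The effect of the positional `$` can be mimicked in plain JavaScript. The helper `setOnFirstMatch` and the sample book are hypothetical; the sketch only illustrates that `$` stands for the first array item matched by the query part of the update, and that only this item is changed.

```javascript
// Hypothetical book with an array of author sub-documents.
const book = {
  title: "Some Book",
  author: [
    { name: "x", company: "old corp" },
    { name: "y", company: "old corp" },
  ],
};

// Apply `changes` to the first array item that satisfies `matches`.
// Returns true if an item was updated, false if nothing matched.
function setOnFirstMatch(doc, arrayField, matches, changes) {
  const index = doc[arrayField].findIndex(matches);
  if (index === -1) return false;
  Object.assign(doc[arrayField][index], changes);
  return true;
}

// Roughly what {"author.name": "y"} with {$set: {"author.$.company": "abc corp"}} does.
setOnFirstMatch(book, "author", (a) => a.name === "y", { company: "abc corp" });

console.log(book.author);
```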
Examples of some of the basic commands used in MongoDB
show dbs
show collections
system.indexes is a special collection.
db.scores.findOne()
db.scores.distinct("name")
db.scores.count()
db.scores.count({"name": "exam"})
Returned fields can be included or excluded, but not both (_id is the exception and can be excluded alongside included fields).
db.scores.find({"score": {$gt: 95}}, {"score": 1, _id: 0})
Sorting is done by specifying a field name and an order (1 for ascending, -1 for descending).
Student ascending:
db.scores.find({"score": {$gt: 95}}, {"score": 1, "student": 1, _id: 0}).sort({"student": 1})
Student ascending and score descending:
db.scores.find({"score": {$gt: 95}}, {"score": 1, "student": 1, _id: 0}).sort({"student": 1, "score": -1})
$unset example, removing grade from all documents; the first parameter {} matches all documents:
db.scores.update({}, {$unset: {"grade": 1}}, false, true)
$push / $pull (push items onto an array or pull items from an array)
Add an array item:
db.books.update({"_id": ObjectId("24 character hex text")}, {$push: {"tags": "classic and important"}})
$addToSet adds the item only if it is not already present:
db.books.update({"_id": ObjectId("24 character hex text")}, {$addToSet: {"tags": "classic and important"}})
Delete an array item:
db.books.update({"_id": ObjectId("24 character hex text")}, {$pull: {"tags": "classic and important"}})
$pullAll and $pushAll work on multiple items.
$inc - increments a field by a given value. Mongo applies the increment on the server, which takes away the atomicity issue of doing a read-modify-write from the web server.
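The behavior of the update operators above can be modeled with a few lines of plain JavaScript. This is a simplified sketch working on an in-memory object: `applyUpdate` and the `operators` table are hypothetical, only top-level fields are handled, and array items are compared as primitive values.

```javascript
// Toy implementations of a few update operators on a plain object.
const operators = {
  $push: (doc, field, value) => {
    (doc[field] = doc[field] || []).push(value);
  },
  $addToSet: (doc, field, value) => {
    const arr = (doc[field] = doc[field] || []);
    if (!arr.includes(value)) arr.push(value); // added only once
  },
  $pull: (doc, field, value) => {
    doc[field] = (doc[field] || []).filter((item) => item !== value);
  },
  $inc: (doc, field, amount) => {
    doc[field] = (doc[field] || 0) + amount;
  },
  $unset: (doc, field) => {
    delete doc[field];
  },
};

// Apply an update document such as {$inc: {views: 1}} to one document.
function applyUpdate(doc, update) {
  for (const [op, fields] of Object.entries(update)) {
    for (const [field, arg] of Object.entries(fields)) {
      operators[op](doc, field, arg);
    }
  }
  return doc;
}

const sample = { tags: ["classic"], views: 0, grade: "A" };
applyUpdate(sample, { $push: { tags: "classic" } }); // duplicate allowed
applyUpdate(sample, { $addToSet: { tags: "classic" } }); // no change
applyUpdate(sample, { $addToSet: { tags: "important" } });
applyUpdate(sample, { $inc: { views: 5 }, $unset: { grade: 1 } });
console.log(sample); // { tags: [ 'classic', 'classic', 'important' ], views: 5 }
```

In this model the caller states only the change (`$inc` by 5) rather than the new value. On the server this is what lets concurrent increments avoid overwriting each other.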
Deployment trick - force the required data to be fetched with a query so that it is already in RAM for the users.
Production should be 64-bit, as 32-bit builds support only about 2 GB of data.
Mongo pre-allocates data files, so the file system matters. posix_fallocate() on Linux works well, since it only marks the start and end for the memory map.
Documents are stored with padding so that changes to the data can be written to disk in place.
toString on any Mongo data object returns its JSON representation, which is very useful for debugging.
Drivers and working with applications
The Java driver uses DBObject, a Map-like interface of key-value pairs: the key is a String and the value is an Object holding Mongo data.
ODM - Object Document Mapping
Morphia - like JPA, annotation driven, and written for Mongo, so it's customized for Mongo
Spring-Data-Document - less tightly coupled to Mongo, as it's used for all NoSQL databases; recommended for Spring veterans
Mongo has support for Hadoop and Hadoop components such as Pig and Hive.
Capped Collection
designed for replication; no _id index, no deletes, maintained in insertion order, and updates are allowed only if the document won't grow beyond its padding
Tailable Cursors
like tail -f in Unix; efficient. An "await" cursor can wait until more documents matching the particular query arrive.
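The two ideas fit together, and a toy model in plain JavaScript can show how. The class `CappedCollection` and its methods are hypothetical, and the cap here is a document count for simplicity, whereas a real capped collection is bounded by size in bytes. Documents stay in insertion order, the oldest one is dropped once the cap is reached, and a reader remembers its position and asks only for what arrived after it, much like tail -f.

```javascript
// Toy capped collection: fixed capacity, insertion order, oldest dropped first.
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.entries = []; // kept in insertion order
    this.nextSeq = 0; // position counter used by tailing readers
  }

  insert(doc) {
    this.entries.push({ seq: this.nextSeq++, doc });
    if (this.entries.length > this.maxDocs) this.entries.shift();
  }

  // Tailing read: everything inserted after the reader's last seen position.
  readAfter(lastSeq) {
    return this.entries.filter((e) => e.seq > lastSeq);
  }
}

const log = new CappedCollection(3);
[1, 2, 3, 4].forEach((n) => log.insert({ n }));

// First read: document 1 has already been pushed out by the cap.
let batch = log.readAfter(-1);
console.log(batch.map((e) => e.doc.n)); // [ 2, 3, 4 ]
let lastSeen = batch[batch.length - 1].seq;

// Later read: only the newly arrived document is returned.
log.insert({ n: 5 });
batch = log.readAfter(lastSeen);
console.log(batch.map((e) => e.doc.n)); // [ 5 ]
```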
findAndModify - can be used in specialized scenarios where the changed document is required back from Mongo
Sharding and Replication
Scaling is not just about adding servers; it has the following components
- Operations/Sec
- Storage needs go up - capacity and IOPS
- Complexity goes up
Optimization and tuning are based on
- Schema and Index Design
- O/S Tuning
- Hardware
Vertical scaling - expensive, and may be impossible with certain cloud-based systems
Scaling can be done by
- Schema & Index Design
- Sharding
- Replication
- Use embedding vs linking; factors to consider
- Round trips to the DB
- Disk seek time
- Size of data to read/write
- Use partial vs full document writes ($set uses a partial write)
- Use partial vs full document reads
- Mongo can use only one index at a time for a query
- Index common queries, but do not over-index. An index on (A,B) also serves queries on (A), so a separate index on (A) is redundant
- Right-balanced index
- Automatic partitioning and management
- Range based
- Convert to sharded system with no downtime
- Fully consistent
e.g. adding a shard and sharding a collection by age
db.runCommand({ addshard: "shard1" });
db.runCommand({ shardCollection: "mydb.blogs", key: { age: 1 } });
Ranges are stored as chunks.
Chunks are created automatically.
As many shards as required can be added using
db.runCommand({ addshard: "shard2" });
A shard can also be removed.
Sharding involves no downtime, and data is balanced automatically as it is written.
An insert must include the shard key.
An update must include the shard key.
A query with the shard key is routed to the right node
- without the shard key, it is scattered to all shards and the results are gathered
Indexed query
- with the shard key, it is routed to the shards in order
- without the shard key, it is a distributed sort and merge
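The routing rule can be sketched in plain JavaScript, reusing the age shard key from the example. The chunk boundaries, shard assignments, and the helper `shardsFor` are made up for illustration, and only equality on the shard key is handled. A query that carries the key is sent to the one shard owning the matching range; a query without it has to go to every shard.

```javascript
// Hypothetical chunk table: each chunk covers the range [min, max) of the shard key.
const chunks = [
  { min: -Infinity, max: 30, shard: "shard1" },
  { min: 30, max: 60, shard: "shard2" },
  { min: 60, max: Infinity, shard: "shard1" },
];

// Decide which shards must be contacted for a simple equality query.
function shardsFor(query) {
  if (!("age" in query)) {
    // No shard key: scatter to every shard, then gather the results.
    return [...new Set(chunks.map((c) => c.shard))];
  }
  // Shard key present: exactly one chunk owns this value.
  const owner = chunks.find((c) => query.age >= c.min && query.age < c.max);
  return [owner.shard];
}

console.log(shardsFor({ age: 45 })); // [ 'shard2' ]
console.log(shardsFor({ age: 75 })); // [ 'shard1' ]
console.log(shardsFor({ name: "eliot" })); // [ 'shard1', 'shard2' ]
```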
A replica set is not master/slave: in master/slave, a new master has to be switched in manually if the master goes down.
A replica set works as
- a cluster of N servers
- Any one of them can be primary
- Automatic failover and recovery
- All writes go to the primary
- Reads can go to the primary (default) or a secondary
- There have to be at least 2 members in the replica set so that a primary can be elected
- If the primary fails, the members negotiate to elect a new primary
A node can be
Normal { priority:1 } or Passive { priority:0 } - a passive node cannot become primary
An Arbiter - can vote but cannot hold any data
Hidden { hidden:true }
Tagged (after version 2.0), e.g. tags { "dc":"ny", "rack":"123" }
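How these node types affect failover can be shown with a heavily simplified sketch in plain JavaScript. The member list and the function `electPrimary` are hypothetical, and the rule used here is an assumption for illustration: a majority of members must be reachable, and the reachable data-bearing node with the highest non-zero priority wins. A real election also weighs factors such as how up to date each member is.

```javascript
// Hypothetical replica set after the old primary "a" went down.
const members = [
  { host: "a", priority: 1, up: false },
  { host: "b", priority: 1, up: true },
  { host: "c", priority: 0, up: true }, // passive: never primary
  { host: "d", arbiter: true, up: true }, // votes, holds no data
];

// Pick a new primary, or return null if no election can succeed.
function electPrimary(nodes) {
  const reachable = nodes.filter((n) => n.up);
  // Every reachable member (arbiters included) can vote; a majority is required.
  if (reachable.length * 2 <= nodes.length) return null;
  const candidates = reachable
    .filter((n) => !n.arbiter && n.priority > 0)
    .sort((x, y) => y.priority - x.priority);
  return candidates.length > 0 ? candidates[0].host : null;
}

console.log(electPrimary(members)); // b
```

Here the arbiter "d" cannot become primary, but its vote helps the surviving members reach a majority.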
- The driver can send read requests to a secondary
- Writes always go to the primary
Java -
DB.slaveOk() or DBCollection.slaveOk()
Reads from the primary are consistent.
Reads from a secondary are eventually consistent.
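The difference between the two kinds of read can be seen in a toy simulation in plain JavaScript. The class `ReplicaSetSim` and its methods are hypothetical, and replication is triggered by hand here, whereas real secondaries apply the primary's operations continuously. A write lands on the primary first, so a read from the secondary may return stale data until replication catches up.

```javascript
// Toy replica set: one primary, one secondary, and a queue of pending operations.
class ReplicaSetSim {
  constructor() {
    this.primary = new Map();
    this.secondary = new Map();
    this.pending = []; // writes not yet applied on the secondary
  }

  // All writes go to the primary.
  write(key, value) {
    this.primary.set(key, value);
    this.pending.push([key, value]);
  }

  // The secondary catches up by applying the pending writes in order.
  replicate() {
    for (const [key, value] of this.pending) this.secondary.set(key, value);
    this.pending = [];
  }

  // Reads go to the primary by default, or to the secondary when slaveOk is set.
  read(key, { slaveOk = false } = {}) {
    return (slaveOk ? this.secondary : this.primary).get(key);
  }
}

const rs = new ReplicaSetSim();
rs.write("x", 1);
console.log(rs.read("x")); // 1 (primary: consistent)
console.log(rs.read("x", { slaveOk: true })); // undefined (secondary: not there yet)
rs.replicate();
console.log(rs.read("x", { slaveOk: true })); // 1 (eventually consistent)
```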
