
News

Posted almost 11 years ago by Kim Sung Kyu
PostgreSQL shows excellent functionality and performance. Given its high quality, it may seem strange that PostgreSQL is not more popular; nevertheless, PostgreSQL continues to make progress. This article will discuss this database.

Why You Should Know about PostgreSQL

PostgreSQL is an RDBMS that is popular mainly in North America and Japan. It is not used much in Korea yet, but as it is an excellent RDBMS in terms of both functionality and performance, it is worth learning what kind of database PostgreSQL is. PostgreSQL (pronounced [Post-Gres-Q-L]) is an object-relational database management system (ORDBMS), and an open-source DBMS that provides enterprise-level DBMS functionality along with many features you can otherwise find only in advanced commercial DBMSs. PostgreSQL is also known as the open-source DBMS that Oracle users can adapt to most easily, as many of its features resemble Oracle's.

History

PostgreSQL had many ancestors, and of them, Ingres (INteractive Graphics REtrieval System) can be called its progenitor. Ingres was a project launched by Michael Stonebraker (Picture 1), a great master of the database field who is still hard at work today.

Picture 1: Michael Stonebraker started the Ingres project.

The Ingres project was launched at the University of California, Berkeley, in 1977. After Ingres, Michael Stonebraker started another project called Postgres (Post-Ingres). With the release of Postgres version 3 in 1991, its user base grew quite large, but as the burden of supporting users became too high, the project was terminated in 1993. (Postgres is known to have had a huge influence on the current Informix product even after the end of the project: Illustra, a commercial version of Postgres, was taken over by Informix in 1997, and Informix in turn by IBM in 2001.)

Figure 1: Product History.

Despite the project having ended, Postgres users and students continued its development and finally created Postgres95, which achieved 40% better performance than Postgres by supporting SQL and improving its structure. When Postgres95 became an open-source system in 1996, it was given its current name, PostgreSQL, to reflect the fact that it succeeded Postgres and supports SQL (Postgres supported a language called QUEL instead of SQL). In 1997, PostgreSQL was released with its first version number set to 6.0. Since then, PostgreSQL has been actively developed to this day by an open-source community, and the latest release is 9.2, as of May 2013. In addition, thanks to its open license (like the BSD or MIT license, the PostgreSQL license allows commercial use and modification, while clarifying that the original developers are not liable for any problems arising from its use), there have been more than 20 forks, some of which have influenced PostgreSQL itself and some of which have disappeared.

PostgreSQL's logo is an elephant named 'Slonik' (Russian for a little elephant). The true reason an elephant was chosen for the logo is not known, but it is said that just after the project became open source, one of its users was inspired by Agatha Christie's novel "Elephants Can Remember" and suggested it. Since then, the elephant logo has been visible at every official PostgreSQL event. As elephants are thought of as large, strong, reliable, and blessed with a good memory, Hadoop and Evernote also use an elephant as their official logo.
Functionalities and Limitations

PostgreSQL supports transactions and ACID, the basic capabilities of a relational DBMS. Beyond basic reliability and stability, it also offers many progressive or extended features, some originating in academic research. Even a general list of PostgreSQL features is long:

- Nested transactions (savepoints)
- Point-in-time recovery
- Online/hot backups, parallel restore
- Rules system (query rewrite system)
- B-tree, R-tree, hash and GiST indexes
- Multi-Version Concurrency Control (MVCC)
- Tablespaces
- Procedural languages
- Information schema
- I18N, L10N
- Database and column-level collation
- Array, XML and UUID types
- Auto-increment (sequences)
- Asynchronous replication
- LIMIT/OFFSET
- Full-text search
- SSL, IPv6
- Key/value storage
- Table inheritance

In addition to these, it offers a variety of enterprise-level DBMS features, old and new. In general, PostgreSQL has the following limits:

Table 1: Basic Limits of PostgreSQL.
- Max. database size: unlimited
- Max. table size: 32 TB
- Max. row size: 1.6 TB
- Max. field size: 1 GB
- Max. rows per table: unlimited
- Max. columns per table: 250-1600
- Max. indexes per table: unlimited

Roadmap

As of May 2013, the latest release is 9.2. Figure 2 provides brief information on the progress of PostgreSQL by year.

Figure 2: Progress of PostgreSQL by Year.

The main functionality of each version is as follows:

Table 2: Main Functionalities by Version.
- 0.01 (1995): Postgres95 release
- 1.0 (1995): copyright change, open source
- 6.0-6.5 (1997-1999): renamed to PostgreSQL; indexes, VIEWs and RULEs; sequences, triggers; genetic query optimizer; constraints, subselects; MVCC; JDBC interface
- 7.0-7.4 (2000-2010): foreign keys; SQL92 JOIN syntax; write-ahead log; information schema; internationalization
- 8.0-8.4 (2005-2012): native support for MS Windows; savepoints, point-in-time recovery; two-phase commit; tablespaces, partitioning; full-text search; common table expressions (CTE); SQL/XML, ENUM and UUID types; window functions; per-database collation; replication, warm standby
- 9.0 (2010-09): streaming replication, hot standby; support for 64-bit MS Windows; per-column conditional triggers
- 9.1 (2011-09): functionality differentiation; synchronous replication; per-column collations; unlogged tables; K-nearest-neighbor indexing; serializable isolation level; writeable CTEs (WITH); SQL/MED external data; SE-Linux integration
- 9.2 (2012-09): performance optimization, linear scalability to 64 cores; reduced CPU power consumption; cascading streaming replication; JSON and range types; improved lock management; space-partitioned GiST indexes; index-only scans (covering indexes)

The next PostgreSQL release under development is PostgreSQL 9.3, due in the third quarter of 2013. It features many improvements, including enhanced management functionality, parallel query, MERGE/UPSERT, multi-master replication, materialized views, and enhanced multi-language support.

Internal Structure

The following shows the process structure:

Figure 3: Process Structure.

When the client requests a connection to the server through the interface library (1) (a variety of interfaces including libpq, JDBC and ODBC), the Postmaster process relays the connection to a backend server process (2). The client then executes queries over the allocated server connection (Figure 3).
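To illustrate the client side of this flow, here is a minimal sketch using the standard PostgreSQL JDBC driver. The connection URL, database name and credentials are hypothetical placeholders, and the driver JAR is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PgClientExample {
    public static void main(String[] args) throws Exception {
        // The postmaster accepts this connection and hands it to a backend process.
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "postgres", "password");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT version()")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));  // e.g. "PostgreSQL 9.2.x ..."
            }
        }
        conn.close();
    }
}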
The following shows the process of query execution in the server:

Figure 4: Query Execution Procedure.

When the server receives a query request from the client, it creates a parse tree through syntax analysis (1), starts a new transaction through semantic checking (2), and creates a query tree. Next, the query tree is rewritten according to the rules defined in the server (3), and the most optimized plan tree is selected from the many available execution plans (4). The server executes this plan (5) and sends the result of the requested query to the client.

While the server executes a query, the system catalog in the database is used frequently. In the system catalog, users can directly define function and data types, as well as index access methods and rules. In PostgreSQL, therefore, the system catalog serves as an important extension point for adding new functionality.

A data file consists of multiple pages, and a single page has a scalable slotted page structure (Figures 5 and 6).

Figure 5: Data Page Structure.
Figure 6: Index Page Structure.

Development Process

The development process model of PostgreSQL can be summed up in one sentence: 'a community-based open-source project led by a few.' Like the Linux, Apache and Eclipse projects, the PostgreSQL project is composed of a few administrators, a variety of developers, and a large number of users. The small administrator group (the Core Team) collects requests and feedback from the large user base (sometimes taking a vote at http://postgresql.uservoice.com to determine priorities), sets the direction of the product, holds final approval rights over the code, and exercises the right of release. This is a different model from corporate-managed development processes such as those of MySQL and JBoss. The developer group consists of code committers and code developers/contributors, located in many countries, including the U.S., Japan, and across Europe.

Figure 7: Distribution of PostgreSQL Developers by Region.

Code developed by these contributors goes through a variety of review stages (submission review, usability review, feature test, performance review, coding review, architecture review, review review) and is reflected in the product after approval by the Core Team. The mailing lists the community has long relied on remain the usual channel, and a variety of documents, including the manuals, are well maintained on the official website.

Products in Competition

PostgreSQL wants to be compared with enterprise-level commercial databases, but it has mainly been compared with the popular open-source DBMSs. The following are the catchphrases of these open-source DBMSs, each of which reflects its character:

- PostgreSQL: The world's most advanced open source database
- MySQL: The world's most popular open source database
- CUBRID: Open Source Database Highly Optimized for Web Applications
- Firebird: The true open source database
- SQLite: self-contained library, serverless, zero-configuration, transactional SQL database engine

It is not easy to compare these products by their catchphrases alone, but you can see that PostgreSQL pursues progressiveness and openness. The following is a brief comparison of PostgreSQL and its competitors:
Table 3: Comparison of Products in Competition.
- Oracle: an enormous amount of long-proven code and a variety of references; high cost
- DB2, MS SQL Server: similar to Oracle
- MySQL: a variety of applications and references; corporate development model and the burden of licensing
- CUBRID: an alternative to MySQL; built-in HA and database sharding; dual licensing
- Other commercial DBs: in decline due to open-source DBMSs
- Other open-source DBs: struggle to attract developers

For a long time, the PostgreSQL community has made attempts to enter the enterprise DBMS market. In 2004, EnterpriseDB, a company built on PostgreSQL, was established, and it is striving to strengthen its position in the enterprise DBMS market. One of the company's main products is Postgres Plus Advanced Server, developed by adding Oracle-compatible functionality (PL/SQL, SQL statements, functions, DB Links, the OCI library, etc.) to open-source PostgreSQL. It features easy data and application migration and a cost reduction of 20% compared to Oracle (Figure 8).

Figure 8: Cost Reduction Compared to Oracle.

In addition, Postgres Plus Advanced Server provides differentiated services, including training, consulting, migration, and technical support from PostgreSQL experts. With approximately 300 reference sites in a variety of areas, the product is promoted as a database for all industries, with a growing base of users across the world.

Present Status and Trend

As you can see from most posts about PostgreSQL, most PostgreSQL users have a developer-like tendency and are very loyal to the product. In fact, they have good reason for their loyalty: PostgreSQL provides ample functionality and dependable performance compared to other products, and one of its advantages is that conditions are good enough to attract new developers. These conditions include a well-written manual on the project page, related documents, over 300 reference publications, and more than 10 seminars and conferences held in various countries every year. More recently, a PostgreSQL magazine has even appeared. All of these are results of the active PostgreSQL community.

The representative features that PostgreSQL users identify as important are as follows:

- Reliability as the top priority of the product
- ACID and transactions
- A variety of indexing techniques
- Flexible full-text search
- MVCC for better concurrency
- Diverse and flexible replication methods
- A variety of procedural languages (PL/pgSQL, Perl, Python, Ruby, TCL, etc.) and interfaces (JDBC, ODBC, C/C++, .Net, Perl, Python, etc.)
- Excellent community and commercial support
- Well-made documents and a thorough manual

A variety of expansion functionality, and the ease of developing such functionality, are also advantages of PostgreSQL. The following are the differentiated expansions of PostgreSQL:

- GIS add-on (PostGIS)
- Key-value store expansion (HStore)
- DBLink
- Support for a variety of functions and types, including crypto and UUID

There are many other practical and experimental expansions as well. Of these, here is a brief account of GIS (Geographic Information System), which has recently become a hot topic. PostGIS is a middleware expansion that enables PostgreSQL to conform to the OpenGIS standard and support geographic objects (Figure 9).

Figure 9: PostGIS Structure.

PostGIS has been under development since 2001, and with many functionality and performance improvements, it currently has the most users among comparable open-source products.
There are commercial alternatives, such as Oracle Spatial and the spatial features of DB2 and MS SQL Server, but the commercial products have not been as well received in terms of price-performance ratio. In addition, you can easily find benchmark data showing that the functionality and performance of PostGIS/PostgreSQL bear comparison with Oracle.

As a recent trend, PostgreSQL is also much talked about in relation to the cloud as well as GIS. With the recent increase in the number of companies providing DBaaS (Database as a Service), demand for PostgreSQL, which has advantages in cost and licensing, has increased. Accordingly, EnterpriseDB has released Postgres Plus Cloud Database into the cloud market, with the following features:

- Simple setup and web-based management
- Automatic scaling, load balancing and failover
- Automated online backup
- Database cloning

It is used on many platforms, including Amazon EC2, the Eucalyptus cloud, and the Red Hat OpenShift development platform cloud. Other cloud service providers, such as Heroku and dotCloud, also provide services using PostgreSQL.

Conclusion

As Sun, which had acquired MySQL, was in turn acquired by Oracle in 2009, MySQL began to be developed as a more closed corporate project, and many MySQL developers left the community around the same time. Wary of this change, MySQL users are paying attention not only to the MySQL forks (MariaDB, Drizzle, Percona, etc.) to which they can easily migrate, but also to migration to PostgreSQL. Looking at the trend of help-wanted ads related to PostgreSQL and MySQL on the most popular job-finding portal, http://www.indeed.com (Figure 10), we can see that the growth of MySQL-related ads is slowing down, while ads related to PostgreSQL continue to increase.

Figure 10: Trend of Help-wanted Ads.

According to the trend of search frequency on search sites (Figure 11), MySQL shows a continued downtrend, while PostgreSQL seems almost unchanged. In Korea, however, the search frequency for PostgreSQL has trended upward since mid-2010.

Figure 11: Search Frequency Trend (source).

Of course, the popularity and usage of MySQL are still much higher than those of PostgreSQL. Although you may not be able to determine the true status or prospects of these products from the above data alone, you could infer that if the popularity of MySQL declines, the popularity of PostgreSQL will increase. PostgreSQL is not yet popular enough to surpass MySQL, but the PostgreSQL open-source community continues to make the following efforts:

- Improving the reliability of basic DBMS functionality
- Providing progressive and differentiated functionality expansions
- Continuously attracting more open-source developers

In addition, EnterpriseDB, which has stronger business purposes, is striving to achieve the following objectives:

- Expanding its share of the enterprise market
- Expanding its share of the cloud market
- Replacing Oracle and MySQL

By Kim Sung Kyu, Senior Software Engineer at CUBRID DBMS Lab, NHN Corporation.
Posted almost 11 years ago by Esen Sagynov
Three weeks ago, on behalf of the CUBRID team, a few of my colleagues and I attended and gave talks at two international conferences. Today I would like to share my impressions of these events. I will write a separate post about the various sharding solutions introduced at these conferences, so stay tuned!

RIT++

The first presentation, at RIT++ (Russian Internet Technologies), was given on Monday, April 22nd, 2013, in Moscow, Russia. The second, at the Percona MySQL Conference & Expo, was given the same week on Wednesday, April 24th, 2013, in Santa Clara, CA, US. At both conferences the agenda was the same: "Easy MySQL Database Sharding with CUBRID SHARD". At RIT++, though, the presentation was given in Russian. Very exciting! The following is a list of resources related to the talks:

- The presentation abstract in English
- The presentation abstract in Russian
- Slideshare in English
- Slideshare in Russian

This was the third time we, the CUBRID team, have attended conferences organized by the Russian company Ontico; previously we attended RIT++ 2012 and HighLoad++ 2012. This year RIT++ 2013 drew over 800 attendees and offered 13 categories of talks, ranging from client-side development to server-side development, database scalability, project management, analytics, and so on. Every year after the conference the organizers conduct a survey to assess the event, and I think it is because of that feedback that this year's RIT++ organizers accepted more talks related to client-side development than usual. Besides us from Korea, there were presenters from the States, representing Facebook, and from Brazil, representing the PUC-Rio university. My personal impression was that there were fewer foreign speakers this year than last year at RIT++ or HighLoad++.

My session about MySQL database sharding with CUBRID SHARD drew over 100, I would guess close to 150, attendees. The audience welcomed my speech in Russian very warmly. Next time I should speak in Russian again; they liked it! When the presentation was over, there was a slew of questions. I think CUBRID SHARD was received very well as an easy sharding middleware for MySQL. To my surprise, there were also many questions unrelated to CUBRID SHARD: the audience asked a lot about CUBRID itself and its HA feature. Later I learned that many attendees had listened to my talks about the CUBRID open source relational database system the previous year. One person from the audience said that he had been looking into CUBRID for a while already and was considering using it in production; his favorite features in CUBRID were its built-in support for HA and its very clever 3-tier architecture. Overall, the unofficial Q&A session lasted for over an hour and a half. It was a great experience for me to present CUBRID SHARD at RIT++ this year, and a great opportunity for our CUBRID team. The conference lasted two days, but I could not attend the second day, as I had to head to Santa Clara, CA, to give a talk at the Percona MySQL Conference & Expo.

Percona

It was the first time I had spoken at Percona. Previously we had spoken at OSCON 2011 about CUBRID HA, and at the 2010 MySQL Conference & Expo about the CUBRID database. Compared to OSCON, the Percona MySQL conference was a lot more specific (obviously, about MySQL), with more quality talks about scalability and performance tuning. If I had to choose where to go next year, I would definitely select Percona. That is how interesting it was!
Unlike at RIT++, our session at the Percona conference attracted only about 20 attendees. The presentation went well, but I have to accept that the number of listeners plays a big role: there were fewer questions and less enthusiasm. On the other hand, the Facebook, two Percona, Continuent, and Tokutek presentations, which were held at the same time at 3:30 PM, attracted hundreds of listeners each. After realizing this, I came to the conclusion that it is brand recognition that plays a significant role in attracting listeners. Even though NHN is very popular in Korea, and in Asia in general, it is almost unknown in Western countries. In fact, when I asked the audience at Percona whether they had ever heard about NHN, the answer was negative. A great pity. I think NHN has to seriously reconsider its strategy for increasing its worldwide brand recognition. Nevertheless, I am very glad we had this chance to present our open source sharding middleware at a well-known conference like Percona. As I mentioned at the beginning of this post, I will write another post covering the various sharding solutions presented at the Percona conference. It was very interesting to learn about the different techniques used by large-scale service providers who have developed their own sharding solutions.

After my presentation was over and I had answered all the questions, I headed to one of the lounge rooms where I had made an appointment to meet Ryan Walsh, a Corporate Account Executive at Percona. We discussed various opportunities for cooperation between Percona and NHN, the company behind CUBRID development. Percona is a widely known and reputable MySQL support and consulting company. It is known to be the oldest and largest independent company which not only provides MySQL support, consulting and training, but also develops a custom MySQL server, i.e. provides patches to "backport changes to older MySQL versions to obtain a key patch without a full version upgrade". During our conversation, Ryan introduced his company and told me about the large-scale cases his company has worked on so far. One thing I would like to mention today is that some of the services at Amazon Web Services were said to have actually been developed by Percona; Amazon RDS in particular was said to have been developed by the Percona team, and Percona database tools seem to work with RDS natively. Percona is also cooperating with HP to build the RedDwarf DBaaS as part of the OpenStack open source cloud project. At the Percona conference, HP engineers presented how to use the RedDwarf APIs to use and administer the features of Percona Server. Such vast knowledge and experience in building cloud database services may be quite beneficial to NHN in developing and providing its own cloud computing service.

Overall, both presentations went well. I talked to many attendees and answered quite a lot of their questions about CUBRID SHARD and the CUBRID database. One thing which requires more attention from NHN is its global brand recognition: the more developers recognize NHN and its services, the more will be eager to listen to and learn from NHN engineers. If you have any feedback or suggestions, feel free to comment below. Also, you should follow us on twitter here.
Posted almost 11 years ago by Lee Jae Ik
At NHN we have a service called NELO (NHN Error Log System) to manage and search logs pushed to the system by various applications and other Web services. The search performance and functionality of NELO2, the second generation of the system, have been significantly improved through ElasticSearch. Today I would like to share our experience at NHN in deploying ElasticSearch in log search systems.

ElasticSearch is a distributed search engine based on Lucene, developed by Shay Banon. Shay and his team have recently released the long-awaited version 0.90. Here is a link to a one-hour recorded webinar where Clinton Gormley, one of the core ElasticSearch developers, explains what's new in ElasticSearch 0.90. If you are developing a system which requires search functionality, I would recommend ElasticSearch, as its installation and server expansion are very easy. Since it is a distributed system, ElasticSearch can easily cope with growth in the volume of search targets. At NHN, all logs coming into NELO2 are stored and indexed by ElasticSearch for fast, near real-time search results.

Features of ElasticSearch

Let's get started by familiarizing ourselves with the terms widely used in ElasticSearch. For those who are familiar with relational database systems, the following table maps relational database terms to ElasticSearch terms.

Table 1: Comparison of the terms of RDBMS and ElasticSearch.
- Database -> Index
- Table -> Type
- Row -> Document
- Column -> Field
- Schema -> Mapping
- Index -> Everything is indexed
- SQL -> Query DSL

JSON-based Schemaless Storage

ElasticSearch is a search engine but can be used like a NoSQL store. Since the data model is represented in JSON, both requests and responses are exchanged as JSON documents, and document sources are stored in JSON as well. Although a schema is not defined in advance, JSON documents are automatically indexed as they are transferred; number and date types are mapped automatically.

Multitenancy

ElasticSearch supports multitenancy. Multiple indexes can be stored on a single ElasticSearch server, and data from multiple indexes can be searched with a single query. NELO2 separates indexes by date when storing logs, and when executing a search it queries the indexes for the dates within the search scope in a single request.

Code 1: Multitenancy Example Query.

# Store logs in the log-2012-12-26 index
curl -XPUT http://localhost:9200/log-2012-12-26/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime": "2012-12-26T14:12:12",
    "host": "host1.nelo2",
    "body": "org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile"
}'

# Store logs in the log-2012-12-27 index
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime": "2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'

# Search the log-2012-12-26 and log-2012-12-27 indexes at once
curl -XGET http://localhost:9200/log-2012-12-26,log-2012-12-27/_search
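Because everything is exchanged as JSON over HTTP, the same multi-index search can be issued from Java with nothing but JDK classes. Below is a minimal illustrative sketch, assuming a local node on port 9200 and the index names from Code 1; it is not the official ElasticSearch Java client:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class MultiIndexSearch {
    public static void main(String[] args) throws Exception {
        // Query two daily indexes in a single request, as in Code 1.
        URL url = new URL("http://localhost:9200/log-2012-12-26,log-2012-12-27/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // raw JSON response from ElasticSearch
            }
        }
        conn.disconnect();
    }
}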
Scalability and Flexibility

ElasticSearch provides excellent scalability and flexibility. It enables functionality expansion through plugins, which was further improved in the recent 0.90 release. For example, by using the Thrift or Jetty plugin, you can change the transfer protocol, and if you install BigDesk or Head, which are must-have plugins, you can monitor ElasticSearch. As shown in the following Code 2, you can also adjust the number of replicas dynamically. The number of shards, however, is fixed for each index and cannot be changed afterwards, so an appropriate number of shards should be allocated from the start, taking the number of nodes and future server expansion into account.

Code 2: Dynamic Configuration Change Query.

$ curl -XPUT http://localhost:9200/log-2012-12-27/ -d '{
    "settings": {
        "number_of_shards": 10,
        "number_of_replicas": 1
    }
}'

Distributed Storage

ElasticSearch is a distributed search engine. It distributes data across multiple shards according to document keys. An index is built for each shard, and each shard has zero or more replicas. ElasticSearch also supports clustering: when a cluster runs, one of the nodes is elected as the master node to manage metadata. If the master node fails, another node in the cluster automatically becomes the master. Adding nodes is also very easy: when a node is added to the same network, it automatically discovers the cluster through multicast and joins it. If the same network is not used, the master node address should be specified through unicast (see a related video: http://youtu.be/l4ReamjCxHo).

Installing

Quick Start

ElasticSearch supports zero-configuration installation. As shown in the following snippets, all you have to do is download the archive from the official homepage, unzip it, and run it.

Download

~$ wget http://download.ElasticSearch.org/ElasticSearch/ElasticSearch/ElasticSearch-0.20.1.tar.gz
~$ tar xvzf ElasticSearch-0.20.1.tar.gz

Executing Server

~$ bin/ElasticSearch -f

Installing Plugins

You can easily expand the functionality of ElasticSearch through plugins: you can add management functionality, change the Lucene analyzer, or change the basic transfer module from Netty to Jetty. The following are the commands we use to install plugins for NELO2. head and bigdesk, found in the first and second lines, are the plugins required for ElasticSearch monitoring; it is strongly recommended to install them and check their functionality. After installing them, visit http://localhost:9200/_plugin/head/ and http://localhost:9200/_plugin/bigdesk/ to see the status of ElasticSearch in your Web browser.

bin/plugin -install Aconex/ElasticSearch-head
bin/plugin -install lukas-vlcek/bigdesk
bin/plugin -install ElasticSearch/ElasticSearch-transport-thrift/1.4.0
bin/plugin -install sonian/ElasticSearch-jetty/0.19.9
Main Configurations

You don't need to change the configuration for a simple functionality test, but when you carry out a performance test or move to production services, you should change some of the defaults. See the following snippet and note the settings that should be changed from the initial configuration file.

Code 5: Main Configurations (config/ElasticSearch.yml).

# As this name identifies the cluster, use a unique, meaningful name.
cluster.name: ElasticSearch-nelo2

# A node name is generated automatically, but it is recommended to use a name
# that is discernible in a cluster, such as the host name.
node.name: "xElasticSearch01.nelo2"

# The default value of the following two settings is true. node.master sets whether
# the node can become the master, while node.data sets whether the node stores data.
# Usually you set both to true; if the cluster is big, you adjust these values per node
# to configure three types of node. More details are given in the topology section below.
node.master: true
node.data: true

# You can change the number of shards and replicas. The following are the defaults:
index.number_of_shards: 5
index.number_of_replicas: 1

# To prevent JVM swapping, set the following value to true:
bootstrap.mlockall: true

# The timeout for checking the status of each node in the cluster. Set an appropriate
# value; if it is too small, nodes may frequently drop out of the cluster.
# The default value is 3 seconds.
discovery.zen.ping.timeout: 10s

# The default discovery mode is multicast, but in a real environment unicast should be
# used to avoid overlapping with other clusters. It is recommended to list the servers
# that can become master in the second setting.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3[portX-portY]"]

Using REST API

ElasticSearch provides a REST API in the form shown below. Most of its functionality is exposed through the REST API, including creation and deletion of indexes, mappings, search, and settings changes. In addition to the REST API, it also provides client APIs for Java, Python, Ruby and other languages.

Code 6: REST API Format in ES.

http://host:port/(index)/(type)/(action|id)

As mentioned earlier, NELO2 classifies indexes (databases in RDBMS terms) by date, and types (tables) are separated by project. Code 7 below shows how logs that came into the hadoop project on December 27, 2012 are handled at the level of individual documents using the REST API.

Code 7: An Example of Using ElasticSearch REST API.

# Creating, getting and deleting documents
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1
curl -XGET http://localhost:9200/log-2012-12-27/hadoop/1
curl -XDELETE http://localhost:9200/log-2012-12-27/hadoop/1

# Search
curl -XGET http://localhost:9200/log-2012-12-27/hadoop/_search
curl -XGET http://localhost:9200/log-2012-12-27/_search
curl -XGET http://localhost:9200/_search

# Seeing the status of indexes
curl -XGET http://localhost:9200/log-2012-12-27/_status

Creating Documents and Indexes

As shown in the following Code 8, when the request is sent, ElasticSearch creates the log-2012-12-27 index and the hadoop type automatically, without any pre-defined index or type. If you want to create them explicitly instead of relying on auto-creation, set action.auto_create_index and index.mapper.dynamic to false in the configuration file.

Code 8: Creating Documents.

# Request
curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime": "2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'

# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "_version": 1
}
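Since NELO2 keys its indexes by date, as in the examples above, the index an incoming log belongs to can be derived from its timestamp. A small illustrative helper in Java (the "log-" prefix is taken from the examples; the helper itself is hypothetical, not NELO2 code):

import java.text.SimpleDateFormat;
import java.util.Date;

public class IndexNames {
    // Builds daily index names such as "log-2012-12-27".
    static String dailyIndex(Date date) {
        return "log-" + new SimpleDateFormat("yyyy-MM-dd").format(date);
    }

    public static void main(String[] args) {
        System.out.println(dailyIndex(new Date()));  // e.g. log-2013-05-21
    }
}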
As shown in Code 9 below, you can also make the request with the type included in the document body.

Code 9: A Query Including Type.

curl -XPUT http://localhost:9200/log-2012-12-27/hadoop/1 -d '{
    "hadoop": {
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }
}'

If the id value is omitted, as in Code 10, an id is created automatically when the document is created. Note that the POST method is used instead of PUT in this case.

Code 10: A Query Creating a Document without an ID.

# Request
curl -XPOST http://localhost:9200/log-2012-12-27/hadoop/ -d '{
    "projectName": "hadoop",
    "logType": "hadoop-log",
    "logSource": "namenode",
    "logTime": "2012-12-27T02:02:02",
    "host": "host2.nelo2",
    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
}'

# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "kgfrarduRk2bKhzrtR-zhQ",
    "_version": 1
}

Deleting a Document

Code 11 below shows how to delete a document (a record in RDBMS terms) from a type (a table). You can delete the hadoop type document with id=1 from the log-2012-12-27 index by using the DELETE method.

Code 11: A Query to Delete a Document.

# Request
$ curl -XDELETE 'http://localhost:9200/log-2012-12-27/hadoop/1'

# Result
{
    "ok": true,
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "found": true
}

Getting a Document

You can get the hadoop type document with id=1 from the log-2012-12-27 index by using the GET method, as shown in Code 12.

Code 12: A Query to Get a Document.

# Request
curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/1'

# Result
{
    "_index": "log-2012-12-27",
    "_type": "hadoop",
    "_id": "1",
    "_source": {
        "projectName": "hadoop",
        "logType": "hadoop-log",
        "logSource": "namenode",
        "logTime": "2012-12-27T02:02:02",
        "host": "host2.nelo2",
        "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
    }
}

Search

When the Search API is called, ElasticSearch executes the search and returns the results that match the content of the query. Code 13 shows examples of using the Search API.

Code 13: An Example Query of Using Search API.

# All types of a specific index
$ curl -XGET 'http://localhost:9200/log-2012-12-27/_search?q=host:host2.nelo2'

# Specific types of a specific index
$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop,apache/_search?q=host:host2.nelo2'

# A specific type of all indexes
$ curl -XGET 'http://localhost:9200/_all/hadoop/_search?q=host:host2.nelo2'

# All indexes and types
$ curl -XGET 'http://localhost:9200/_search?q=host:host2.nelo2'

Search API by Using URI Request

Table 2: Main Parameters.
- q: the query string.
- default_operator: the operator used by default (AND or OR). The default is OR.
- fields: the fields to return in the result. The default is the "_source" field.
- sort: the sort method, e.g. fieldName:asc or fieldName:desc.
- timeout: the search timeout value. The default is unlimited.
- size: the number of results. The default is 10.

If you use a URI request, you can search easily by combining the parameters in Table 2 with a query string. As it does not provide all search options, it is most useful for tests; the curl form is shown in Code 14 below.
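Note that the query string in a URI request must be URL-encoded. A brief sketch of building such a search URL in Java (the field and value are taken from the examples above):

import java.net.URLEncoder;

public class SearchUrl {
    public static void main(String[] args) throws Exception {
        // Encode "host:host2.nelo2" so the colon survives the query string.
        String q = URLEncoder.encode("host:host2.nelo2", "UTF-8");
        String url = "http://localhost:9200/log-2012-12-27/hadoop/_search?q=" + q + "&size=10";
        System.out.println(url);
    }
}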
Code 14: Search Query by Using URI Request.

# Request
$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_search?q=host:host2.nelo2'

# Result
{
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "log-2012-12-27",
                "_type": "hadoop",
                "_id": "1",
                "_source": {
                    "projectName": "hadoop",
                    "logType": "hadoop-log",
                    "logSource": "namenode",
                    "logTime": "2012-12-27T02:02:02",
                    "host": "host2.nelo2",
                    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
                }
            }
        ]
    }
}

Search API by Using Request Body

When the HTTP body is used, the search is performed using the query DSL. As the query DSL is a large topic in itself, you are advised to refer to the guide on the official website.

Code 15: Search by Using Query DSL.

# Request
$ curl -XPOST 'http://localhost:9200/log-2012-12-27/hadoop/_search' -d '{
    "query": {
        "term": { "host": "host2.nelo2" }
    }
}'

# Result
{
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "log-2012-12-27",
                "_type": "hadoop",
                "_id": "1",
                "_source": {
                    "projectName": "hadoop",
                    "logType": "hadoop-log",
                    "logSource": "namenode",
                    "logTime": "2012-12-27T02:02:02",
                    "host": "host2.nelo2",
                    "body": "org.apache.hadoop.hdfs.server.namenode.FSNamesystem"
                }
            }
        ]
    }
}
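The same request-body search can be issued from Java using only JDK classes. A minimal sketch, again illustrating the REST interface rather than the official Java client API; the index, type and query come from Code 15:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TermQuerySearch {
    public static void main(String[] args) throws Exception {
        String body = "{ \"query\": { \"term\": { \"host\": \"host2.nelo2\" } } }";
        URL url = new URL("http://localhost:9200/log-2012-12-27/hadoop/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);  // allow a request body, as in Code 15
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON hits, as in the Code 15 result
            }
        }
    }
}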
Mapping

Put Mapping API

To add a mapping to a specific type, define the mapping in the form shown in Code 16.

Code 16: Query to Register a Mapping.

$ curl -XPUT 'http://localhost:9200/log-2012-12-27/hadoop/_mapping' -d '
{
    "hadoop": {
        "properties": {
            "projectName": {"type": "string", "index": "not_analyzed"},
            "logType": {"type": "string", "index": "not_analyzed"},
            "logSource": {"type": "string", "index": "not_analyzed"},
            "logTime": {"type": "date"},
            "host": {"type": "string", "index": "not_analyzed"},
            "body": {"type": "string"}
        }
    }
}'

Get Mapping API

To get defined mapping information, use a query in the form shown in Code 17.

Code 17: Query to Get a Mapping.

$ curl -XGET 'http://localhost:9200/log-2012-12-27/hadoop/_mapping'

Delete Mapping API

Code 18 shows an example of deleting a defined mapping.

Code 18: Query to Delete a Mapping.

$ curl -XDELETE 'http://localhost:9200/log-2012-12-27/hadoop'

How to Optimize Performance

Memory and the Number of Open Files

As the amount of data to search increases, you will need more memory, and when you run ElasticSearch you will encounter many memory-related problems. In the operating method recommended by the ElasticSearch community, when you run a server exclusively for ElasticSearch, you should allocate only half of the machine's memory to ElasticSearch and let the OS use the other half for the system cache. You can set the memory size through the ES_HEAP_SIZE environment variable or by using the -Xms and -Xmx JVM options.

Code 19: Execution by Specifying Heap Size.

bin/ElasticSearch -Xmx=2G -Xms=2G

When using ElasticSearch, you will see OutOfMemory errors frequently. This error occurs when the field cache exceeds the maximum heap size. If you change the setting for index.cache.field.type from resident (the default) to soft, soft references are used and the cache area is garbage-collected preferentially, which resolves this problem.

Code 20: Configuring Field Cache Type.

index.cache.field.type: soft

As the amount of data grows, the number of index files also increases, because Lucene, which ElasticSearch is built on, manages indexes in units of segments. The number can even exceed the MAX_OPEN files limit, so you need to raise the maximum open file limit using the ulimit command. The recommended value is 32000-64000, but a larger value may be needed depending on the size of the system or data.

Index Optimization

NELO2 manages indexes by date. If indexes are managed by date, old logs that no longer need to be kept can be deleted easily and quickly, as shown in Code 21. In this case, the overhead imposed on the system is smaller than when deleting logs by specifying a TTL value for each document.

Code 21: Deleting an Index.

$ curl -XDELETE 'http://localhost:9200/log-2012-10-01/'

When index optimization is performed, segments are merged, which enhances search performance. As index optimization imposes a burden on the system, it is better to perform it when the system is less busy.

Code 22: Index Optimization.

$ curl -XPOST 'http://localhost:9200/log-2012-10-01/_optimize'

Shards and Replicas

You can't change the number of shards after setting it, so you need to decide this value carefully, taking into account the current number of nodes in the system and the number of nodes expected to be added in the future. For example, if there are 5 nodes and the number is expected to reach 10, it is recommended to set the number of shards to 10 from the beginning. If you set it to 5 in the beginning, you can add 5 more nodes later, but you won't be able to use the added 5 nodes for shards; if you set the number of replicas to 1, of course, you can still utilize the added 5 nodes exclusively for replication. A larger number of shards is more advantageous for processing a large amount of data, because queries are distributed across the shards. But set this value appropriately, because performance can deteriorate due to increased traffic if the value is too high.
Configuring Cluster Topologies

The relevant part of the ElasticSearch configuration file is shown in Code 23 below. There are three types of node:

- Data node: does not act as the master and only stores data. When it receives a request from a client, it searches data in its shards or creates an index.
- Master node: maintains the cluster and dispatches indexing and search requests to the data nodes.
- Search balancer node: receives search requests, requests data from the other nodes, gathers the results, and delivers them.

You can have one node function both as a master and as a data node, but using the three node types separately reduces the burden on the data nodes. In addition, configuring the master nodes separately improves the stability of the cluster, and you can reduce operating costs by using low-spec server hardware for the master and search nodes.

Code 23: Settings Related to Topology.

# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
#    This will be the "workhorse" of your cluster.
#
# node.master: false
# node.data: true
#
# 2. You want this node to only serve as a master, to not store any data and
#    to have free resources. This will be the "coordinator" of your cluster.
#
# node.master: true
# node.data: false
#
# 3. You want this node to be neither a master nor a data node, but
#    to act as a "search load balancer" (fetching data from nodes,
#    aggregating results, etc.)
#
# node.master: false
# node.data: false

Figure 1 below shows the NELO2 topology built on ElasticSearch. The efficiency of equipment use and the stability of the entire cluster have been improved as follows: only ElasticSearch runs on the 20 data nodes (servers) so that they can deliver full performance, while other daemon server processes run alongside ElasticSearch on the 4 master nodes and 3 search nodes.

Figure 1: NELO2 ElasticSearch Topologies.

Configuring Routing

When a large amount of data needs to be indexed, increasing the number of shards improves overall indexing performance. On the other hand, as the number of shards grows, traffic among the nodes also goes up: with 100 shards, a single search request is sent to all 100 shards and the data is then aggregated, which imposes a burden on the entire cluster. If you use routing, data is stored only in a specific shard. Even as the number of shards increases, the application still sends each request to a single shard, and consequently the traffic can be reduced dramatically.

Figures 2, 3, and 4 are excerpted from the slides Rafal Kuc presented at Berlin Buzzwords 2012. Without routing, the application sends a request to all shards (Figure 2); with routing, it sends the request only to a specific shard (Figure 3). According to the cited material, with 200 shards the response time is over 10 times faster with routing than without (Figure 4). When routing is applied, the number of threads increases by 10 to 20 times compared to when it is not, but the CPU usage is much smaller. In some cases, however, performance is better without routing: for a search query whose result must be collected from multiple shards, it can be more advantageous to send the request to multiple shards. To balance this, NELO2 decides whether to use routing depending on the log usage of each project.

Figure 2: Before Using Routing.
Figure 3: After Using Routing.
Figure 4: Performance Comparison before and after Using Routing.
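Routing is requested per operation through the routing parameter on the request URL. A brief hedged sketch of the URL shapes involved; using the project name as the routing key is an assumption for illustration, not necessarily how NELO2 keys its routing:

public class RoutingUrls {
    public static void main(String[] args) {
        String project = "hadoop";
        // Index and search requests pinned to the shard chosen by the routing value.
        String indexUrl = "http://localhost:9200/log-2012-12-27/hadoop/1?routing=" + project;
        String searchUrl = "http://localhost:9200/log-2012-12-27/hadoop/_search?routing=" + project
                + "&q=host:host2.nelo2";
        System.out.println(indexUrl);
        System.out.println(searchUrl);
    }
}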
Conclusion

The number of ElasticSearch users is increasing rapidly, thanks to its easy installation and high scalability. Only a few days have passed since the release of the latest version, 0.90, and its functionality is improving very quickly thanks to its active community. In addition, more and more companies are beginning to use ElasticSearch for their services. Recently, some committers, including the creator Shay Banon, gathered together and established ElasticSearch.com, which provides consulting and training services. In this article I have explained the basics of installing ElasticSearch, how to use it, and how to tune its performance.

We have started testing the latest 0.90 release and will soon migrate our current 0.20.1 ES deployment. In the next post I will continue this topic and tell you about our experience with 0.90, as well as the critical split-brain problem we experienced earlier. Given the scarcity of published solutions to this problem, I believe it will be very useful for our readers.

By Lee Jae Ik, Senior Software Engineer at Global Platform Development Lab, NHN Corporation.

References

- Official guide: http://www.ElasticSearch.org/guide/
- Introduction to ElasticSearch and comparison of ElasticSearch and RDB terms: http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-ElasticSearch
- About ElasticSearch: http://www.slideshare.net/dadoonet/ElasticSearch-devoxx-france-2012-english-version
- Shay Banon's articles: http://2011.berlinbuzzwords.de/sites/2011.berlinbuzzwords.de/files/ElasticSearch-bbuzz2011.pdf
- Using ElasticSearch for logs: http://www.ElasticSearch.org/tutorials/2012/05/19/ElasticSearch-for-logging.html
- Concept of multitenancy: http://en.wikipedia.org/wiki/Multitenancy
- Shay Banon's ElasticSearch optimization: https://github.com/logstash/logstash/wiki/ElasticSearch-Storage-Optimization
- Rafal Kuc's performance tuning slides from Berlin Buzzwords 2012: http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
Posted about 11 years ago by Esen Sagynov
We are very glad to announce the immediate availability of CUBRID ALL-IN-ONE Windows Downloader version 1.0 beta. You can download CUBRID ALL-IN-ONE Windows Downloader from http://www.cubrid.org/wiki_tools/entry/cubrid-all-in-one-windows-downloader. The source code is available at http://svn.cubrid.org/cubridtools/cubrid-downloader/ and is open sourced under the BSD license, just like all other CUBRID Tools.

CUBRID ALL-IN-ONE Windows Downloader is an application that allows our users to easily download CUBRID components, including the server engine, drivers and GUI tools. All you have to do is select the components you want to download on your local Windows machine, and the Downloader will fetch them for you, one by one, with no other actions required.

The application's key features:

- Auto-updates itself whenever a new version is available (using ClickOnce technology).
- Retrieves all component information from a remote CUBRID online location, so it is always up-to-date with the latest application releases.
- Detects local machine specifics - CUBRID version, OS architecture - and automatically selects the appropriate list of components.
- Handles software prerequisite dependencies and downloads them as well.
- Supports both HTTP and FTP protocols for downloads.
- Provides additional information to users, such as links to online resources.
- Handles download errors and auto-retries in case of failures.
- Supports alternate download locations to try in case of failures.
- Saves the user's preferences and re-uses them next time.
- Provides comprehensive operations log information.
- Supports UI localization.

Here is a silent video which shows how to use the CUBRID ALL-IN-ONE Downloader.

If you have questions or suggestions, leave your comments below.
Posted about 11 years ago by Jaehee Ahn
For a long time, Java has provided security-related functions, and among them the Java Cryptography Architecture (JCA) is the core one. JCA uses a provider structure with a variety of security-related APIs. These functions are essential for modern IT communication encryption technology, including digital signatures, message digests (hashes), certificates and certificate validation, key creation and management, and secure random number generation. With JCA, even developers who do not have specialized knowledge of cryptography can successfully implement security-related functions. You do not need to implement the algorithms you once racked your brain over in computer science and cryptology classes; JCA allows you to use them with a few lines of code. Of course, utilizing these APIs well is highly valuable for business, but that does not mean you do not need to understand how JCA works: understanding how JCA runs internally is important for using its functions more efficiently. To be a better software developer and architect, it is worth tracing how JCA grew out of cryptology and security-related algorithms.

This article is a summary of the JCA architecture that I learned while building nClavis (a symmetric-key cryptography service) at NHN. Of course, I do not understand all of JCA yet, but I was happy enough with my level of understanding to decide to write this article and share my experience with you.

Design Principles

As I mentioned, JCA is a Java security platform based on a provider structure, offering implementation independence, implementation interoperability, and algorithm extensibility. An application can utilize information-protecting encryption technology simply by requesting security services from the Java platform, without implementing the security algorithms itself. The security services provided by JCA are implemented by providers mounted on the Java security platform, and an application can introduce a variety of security functions by combining several independent providers. The list of providers is described in the jre/lib/security/java.security file. The Java platform includes many providers, installed by default with the JRE.

Code 1: java.security file.

#
# List of providers and their preference orders (see above):
#
security.provider.1=sun.security.provider.Sun
security.provider.2=sun.security.rsa.SunRsaSign
security.provider.3=com.sun.net.ssl.internal.ssl.Provider
security.provider.4=com.sun.crypto.provider.SunJCE
security.provider.5=sun.security.jgss.SunProvider
security.provider.6=com.sun.security.sasl.Provider
security.provider.7=org.jcp.xml.dsig.internal.dom.XMLDSigRI
security.provider.8=sun.security.smartcardio.SunPCSC
security.provider.9=sun.security.mscapi.SunMSCAPI

The providers mounted on the Java security platform by default are compatible with all Java applications and widely enough used to be regarded as trusted. JCA also supports mounting custom providers, for applications that want to introduce the latest security technology not yet implemented by the default ones.
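To see which providers your own JRE has actually registered, here is a minimal runnable sketch using the standard java.security API:

import java.security.Provider;
import java.security.Security;

public class ListProviders {
    public static void main(String[] args) {
        // Prints the installed providers in preference order,
        // mirroring the java.security file shown in Code 1.
        for (Provider p : Security.getProviders()) {
            System.out.println(p.getName() + " - " + p.getInfo());
        }
    }
}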
Architecture

Cryptographic Service Providers

All providers are implementations of java.security.Provider, and each provider implementation includes a list of security algorithm implementations. When an instance of a specific algorithm is needed, the JCA framework finds the proper implementation class of that algorithm in the provider repository and creates a class instance. The providers defined in the java.security file are included in the repository by default; in this way, a provider can be statically registered. Providers can also be added dynamically at runtime. When several providers are defined, they may implement an identical encryption algorithm in different ways; in that case, an application can specify the provider, or rely on the preference order in the repository.

To use JCA, an application simply requests a specific object type (such as MessageDigest) and an algorithm or service (e.g., MD5), and obtains an implementation from one of the installed providers. Of course, it can also explicitly request an object from a specific provider.

Code 2: Requesting an Object of a Provider.

md = MessageDigest.getInstance("MD5");
md = MessageDigest.getInstance("MD5", "ProviderC");

Figure 1: Provider Framework of JCA (Source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

Oracle JRE (Sun JDK) includes a variety of providers by default (Sun, SunJSSE, SunJCE, SunRsaSign). The providers are classified by their production process; the functions and algorithms each provider offers do not differ much from one another. JREs other than Oracle's are not required to include these providers, so it is not recommended to implement an application in a provider-dependent way. All the encryption technology implementations an application requires are provided by default and implemented at a fully reliable level, so developers normally do not need to pay attention to the provider itself.
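Building on Code 2, here is a complete, runnable sketch that digests a string with MD5 through whichever provider the platform selects:

import java.security.MessageDigest;

public class DigestExample {
    public static void main(String[] args) throws Exception {
        // The JCA framework picks the highest-priority provider implementing MD5.
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest("hello".getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);  // 5d41402abc4b2a76b9719d911017c592
    }
}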
A Java keystore file and an OpenSSL-created certificate file serve the same purpose but use different file formats, and the formats are convertible. Java even ships a command-line utility similar to OpenSSL's: keytool, found in the JDK_HOME/bin directory. You can therefore handle certificates with keytool on Windows as well, whereas OpenSSL is used mainly on Linux. Of course, this tool runs on the KeyStore implementation provided by the Oracle JRE. If the JDK is installed on the system, a certificate can be created with keytool; note that keytool offers roughly the same level of functionality as OpenSSL, from within Java. A keystore file created with keytool is compatible with older Java versions, so you do not need to worry about that.

In-depth 1: Certificate.

Here I need to address what a certificate really means. In a narrow sense, a keystore is a container of certificates. A certificate is used for two purposes: as a "lock" required for encrypting information, and as a technical means of "identification" of the counterpart. The authenticated certificates used for bank transactions exploit both purposes. Cryptographically, a certificate is an electronic document in which the public key created by the RSA algorithm (asymmetric-key cryptography) is signed with the private key of a certificate authority (CA). A key pair created with the RSA algorithm is widely regarded as solid: it was proven long ago that computing one key from the other within any meaningful time is infeasible (a minimal key-pair generation sketch appears at the end of this section).

So why is a digital signature required? When "A" and "B" communicate by encrypting their data, A publishes its public key and protects the private key paired with it. B then encrypts the data to be sent to A using A's public key. The problem is how B can trust that the public key was really provided by A. If a malicious attacker "C" disguises its own public key as A's to deceive B, the public-key scheme breaks down. To solve this problem, a trusted certificate authority "D" is needed. CAs form a chain from the top root certificate authority down through subordinate certificate authorities, in which each upper layer certifies the lower layer with its signature. Because this structure took root as a worldwide standard long ago, the signature chains of certificates around the world converge on a few common top root CAs (e.g., VeriSign, Thawte), and countries whose IT infrastructure reaches a certain level operate their own national root CA (e.g., KISA in Korea). The top root CAs may cross-sign one another in a circular structure, and in some cases a CA signs its own public key with its own private key. A CA's signature should therefore not remain valid indefinitely; it is renewed regularly, or irregularly in response to security events. An individual or a company can establish a private certificate authority if required; certificates signed by a private certificate authority go through a more complex verification process. You may have seen the following on your browser:

Figure 2: Browser Display when a Private Certificate Authority is used.

This warning is displayed when the website uses a private certificate authority. In a browser, the user can simply click through it (even though that is not recommended).
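As promised above, here is a minimal sketch of generating such an RSA key pair with JCA. Everything here is standard java.security API; the key size is just a common choice:

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.SecureRandom;

public class GenerateRsaKeyPair {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        // 2048-bit keys are a common choice; SecureRandom (an engine
        // class described later) supplies the randomness.
        kpg.initialize(2048, new SecureRandom());
        KeyPair pair = kpg.generateKeyPair();
        System.out.println("public key format : " + pair.getPublic().getFormat());  // X.509
        System.out.println("private key format: " + pair.getPrivate().getFormat()); // PKCS#8
    }
}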
However, for server-to-server connections, the counterpart's certificate issued by a private certificate authority must be imported into JAVA_HOME/jre/lib/security/cacerts, or added to the SSLContext when the connection is created in program code. Certificates from official certificate authorities are widely accepted as trusted CAs and are included in the OS or the JRE by default, so the situation shown in Figure 2 does not occur for them. If a private certificate authority obtains a pkcs12-format certificate, it can be treated as a trusted CA and easily installed in the system.

Code 3: Certificate verification programs installed in the system.

Windows: certmgr.msc
Mac OS X: Keychain Access
Linux: keychain

In-depth 2: HTTPS

To understand encryption you need to understand SSL/TLS, the certificate-based cryptographic protocols, as well as certificates themselves. The purpose of a certificate is easily misunderstood when it is used to encrypt an HTTPS communication channel. When a server and a client communicate via HTTPS without identifying the client (for example, setting the clientAuth attribute of the HTTPS Connector to false in Tomcat's server.xml), only a simple certificate verification is executed, and symmetric-key cryptography is used to encrypt the data. Here is how a certificate is used in HTTPS:

1. A client connects to a server via the HTTPS protocol (using the server port defined for the SSL connection).
2. The server sends its certificate (public key) to the client, including some metadata for validation and the ciphers the server supports.
3. The client validates the server with the public key and metadata sent by the server, checking whether the public key is signed by a trusted official root CA. If the server's certificate is signed by an official CA, it passes (official CAs are registered in the system as trusted authorities by default). If the server's certificate is signed by a private CA, the trust manager of the SSL socket (SSLContext) created by the client is checked; if the certificate is registered there, it passes.
4. When the server's certificate passes validation, the client creates a symmetric key, encrypts the symmetric key and the chosen cipher with the server's public key, and sends them to the server (the symmetric key is created by selecting one of the cipher algorithms the server supports).
5. The server decrypts the symmetric key and cipher with its private key, obtaining the symmetric key to be used for encrypted communication.
6. From then on, data between the server and the client is encrypted with the symmetric key created by the client.

A client-side sketch of the private-CA case follows at the end of this section.

JCA Structure

The JCA structure can be described in terms of engine classes and algorithms. In JCA, an engine class provides the interface for one type of cryptographic service, regardless of a specific encryption algorithm or provider. An engine class provides one of the following:

Cryptographic operations (encryption, digital signatures, message digests, etc.).
Generators and converters for the material (keys and algorithm parameters) required for encryption.
Objects (keystores or certificates) that encapsulate cryptographic data and can be used at higher abstraction layers.

Let's take a look at the engine classes provided by JCA and talk about encryption in detail.
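Before moving on to the engine classes, here is the client-side sketch promised above: connecting to a server whose certificate is signed by a private CA, by loading that CA into a truststore and wiring it into an SSLContext. The truststore file, password, host, and port are placeholders:

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.TrustManagerFactory;

public class PrivateCaClient {
    public static void main(String[] args) throws Exception {
        // Truststore holding the private CA certificate (placeholder names).
        KeyStore trust = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream("truststore.jks")) {
            trust.load(in, "changeit".toCharArray());
        }
        TrustManagerFactory tmf =
                TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trust);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null); // default key managers and randomness
        try (SSLSocket socket = (SSLSocket)
                ctx.getSocketFactory().createSocket("internal.example.com", 443)) {
            // Fails unless the server certificate chains to the truststore.
            socket.startHandshake();
            System.out.println("handshake OK: " + socket.getSession().getCipherSuite());
        }
    }
}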
SecureRandom

The SecureRandom class is used to create pseudo random numbers. In Java, "random" in fact means pseudo random, to use the more accurate expression. Are random and pseudo random different, then? Technically, yes. There are two kinds of randomness: true random and pseudo random. A true random number cannot be forecast at all. Pseudo random numbers may look equally unforecastable, but a pseudo random progression is determined by a seed and a mathematical algorithm: the sequence eventually repeats, even if the period is very long or the probability of repetition is very low, and if you know the seed and the algorithm you can predict the sequence. True random generation is based on physical phenomena at the atomic level rather than on mathematics; without hardware that measures such phenomena, e.g., electromagnetic noise or radioactive decay, true randomness is impossible to produce.

The JCA SecureRandom class is the engine class that provides strong random number generation. As described, implementing a True Random Number Generator (TRNG) is not easy, so most implementations are Pseudo Random Number Generators (PRNGs). And as mentioned, the randomness of an ordinary PRNG is incomplete: the familiar Random class cannot meet even the minimum level required cryptographically. A SecureRandom implementation should therefore be verified to satisfy cryptographic-level requirements, i.e., to be a CSPRNG (cryptographically secure PRNG).

Figure 3: Classification of Encryption Type (source: http://en.wikipedia.org/wiki/Cipher).

You may ask why random number generation is considered so important for encryption. The core of modern cryptology is the key used for encryption. Older cryptography was based on conversion tables, in the spirit of Base64 or UTF-8 encoding; in modern cryptology such methods are no longer considered encryption at all. A key is a random sequence produced by a random sequence generator. When we think of cryptology we naturally think of encryption algorithms, yet an open symmetric-key encryption algorithm can be simplified to XOR (or multiplication/division) operations over the input values and key streams. As I said, the core is the key. If the quality of the random sequence generator is not assured, the entire outline of a key may be revealed to an attacker once part of a key or a few consecutive random values leak out. Because the key is the core of modern cryptography, the random sequence generator is critically important. JCA uses SecureRandom as its random sequence generator, and the random number algorithm implementations are supplied by providers, like the other encryption algorithms.

MessageDigest

MessageDigest is used to calculate the message digest (hash) of input data. The purpose of a message digest is an integrity check: verifying that the original data is preserved as it is. A message digest algorithm maps a variable-length original message to a fixed-length hash output. It is built on a one-way hash function, so the original value cannot be derived from the hash value. When A and B communicate, A sends B the original message, the hash value of the original message, and the name of the message digest algorithm used.
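A minimal sketch of what A computes, using the MessageDigest engine class directly (the algorithm choice and message are illustrative):

import java.security.MessageDigest;

public class DigestExample {
    public static void main(String[] args) throws Exception {
        byte[] original = "The original message".getBytes("UTF-8");
        MessageDigest md = MessageDigest.getInstance("SHA-1"); // or "MD5"
        byte[] hash = md.digest(original); // fixed-length output
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        // A sends 'original' plus this value (and the algorithm name) to B.
        System.out.println(hex);
    }
}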
B then calculates the hash value using the algorithm and the original message sent from A. If the hash value B calculates is identical to the hash value sent from A, the original message has not been changed or modified on its way to B over the network.

Figure 4: Example of Using MessageDigest at a Download Site.

The checksums or digital fingerprints you frequently see at download sites are alternative names for a message digest. MD5 and SHA1 are the best-known message digest algorithms.

Signature

Signature is used to sign data and to check the validity of a digital signature, with a key received during initialization. Receiving a key means that key-based cryptography is involved.

Figure 5: Flow of Actions of Signature Object (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

At initialization, a Signature object receives the private key and the original data to be signed as parameters, completing the preparation for signing. The sign() method then signs the original data with the private key and returns the signature bytes. To validate signed data, a verifying Signature object is initialized with the public key paired with the private key used for signing. It additionally receives the original data and the signature bytes, and its verify() method checks whether the two match, determining the authenticity of the original data. A signature can be made only by the holder of the private key, but verification uses the public key, so anyone who has acquired the public key can verify.

Digital Signature vs Cryptography vs MessageDigest

For encryption, users can select either the symmetric-key or the asymmetric-key method according to their needs. A digital signature is also a kind of cryptography, but asymmetric-key encryption is a prerequisite for it. More precisely, a digital signature is a combination of MessageDigest and asymmetric-key encryption: variable-length, possibly large data is first compressed by MessageDigest into a fixed-length form that is easy to handle, and then signed with a private key to create fixed-length signature bytes. You can see this principle in the signature algorithm names passed as the Signature.getInstance() parameter, such as SHA1withRSA, MD5withRSA, and SHA1withDSA: each name combines a MessageDigest algorithm (SHA1, MD5) with an asymmetric-key algorithm (RSA, DSA).

Signature dsa = Signature.getInstance("SHA1withDSA");

Signed Certificate

Keystore Type: jks
Keystore Provider: SUN

Keystore includes the following two items:

Alias: rootcaalias
Written on: 2012. 9. 26
Input Type: trustedCertEntry
Holder: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
Serial Number:
Opened on: Fri Apr 06 10:17:08 KST 2012
Expired on: Sun Mar 13 10:17:08 KST 2112
Certificate Fingerprint:
  MD5: 0C:FC:12:C5:68:E5:95:0B:95:7D:B0:2F:FA:4F:DB:B4
  SHA1: 90:37:1C:E6:F4:64:AD:E6:27:AA:4F:58:88:16:11:24:6D:A5:EB:2B

*******************************************
*******************************************

Alias: nplatform
Written on: 2012. 9. 26
Input Type: keyEntry
Length of Certificate Chain: 2
Certificate[1]:
Holder: O=NHN INC, OU=NHN NBP, CN=NPLAFORM, UID=1
Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
Serial Number:
Opened on: Fri Sep 21 17:26:22 KST 2012
Expired on: Sun Aug 28 17:26:22 KST 2112
Certificate Fingerprint:
  MD5: 48:8C:46:A3:E7:54:58:97:60:0D:5C:56:08:B0:D1:E7
  SHA1: 12:64:3C:DA:C1:2C:94:1A:2B:EB:E9:98:2B:DA:8F:06:78:6E:26:1E
Certificate[2]:
Holder: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
Issuer: CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC
Serial Number:
Opened on: Fri Apr 06 10:17:08 KST 2012
Expired on: Sun Mar 13 10:17:08 KST 2112
Certificate Fingerprint:
  MD5: 0C:FC:12:C5:68:E5:95:0B:95:7D:B0:2F:FA:4F:DB:B4
  SHA1: 90:37:1C:E6:F4:64:AD:E6:27:AA:4F:58:88:16:11:24:6D:A5:EB:2B

Let's review certificates and signatures, described in the earlier in-depth sections, in terms of the JCA Signature mechanism. The text box above is a keystore certificate file created with the Java keytool. From the Holder and Issuer of Certificate[1], you can see that "CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC" has signed the certificate of the Holder "O=NHN INC, OU=NHN NBP, CN=NPLAFORM, UID=1" with its private key. The Certificate Fingerprint shown is the digest of the certificate, and its length is determined by the MessageDigest algorithm used (MD5 or SHA1). Following the Certificate Chain, you can see that the Holder and Issuer of Certificate[2] are identical, which means that "CN=NSYMKEY Root CA, OU=NHN NBP, O=NHN INC" signed itself with its own private key. Since Certificate[2] is self-signed, the Certificate Chain ends here.

Cipher Class

Figure 6: Flow of Actions of Cipher Object (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

The Cipher class provides encryption/decryption functions. The algorithms fall into several classes: symmetric bulk encryption (AES, DES, DESede, Blowfish, IDEA), stream encryption (RC4), asymmetric encryption (RSA), and password-based encryption (PBE). I will not describe the classification into symmetric and asymmetric encryption here because it is so well known.

Stream vs. Block Cipher

Symmetric bulk encryption can be classified into stream ciphers and block ciphers. A block cipher encrypts data in fixed-length block units; data whose length does not fit the block size is padded with dummy values, and the padded bytes are removed during decryption. The padding is controlled by the padding type (e.g., PKCS5PADDING) passed as a parameter when initializing the Cipher. A stream cipher, on the contrary, processes input data byte by byte or bit by bit, so it can handle variable-length data without padding.

Modes of Operation

The important block cipher concept you should know is feedback modes. Assume a very simple block cipher: if two input blocks are identical, their encrypted results are identical too. This gives attackers a hint for decrypting encrypted data that contains repeated patterns. Feedback modes were introduced to avoid this vulnerability and make the cipher more complex. A feedback mode is an operation that combines (XORs) the Nth input data block (or the Nth ciphertext block) with the (N-1)th input data block (or the (N-1)th ciphertext block) in the Nth encryption step. As a result, even identical input blocks produce different ciphertext, depending on the values used in the previous encryption step.
One more thing to note: when N = 1, there is no (N-1)th encryption step to take a value from. In this case an initialization vector (IV) plays that role instead. To use a feedback mode, the IV should be created randomly and prepared before encryption, and the IV used for encryption must be stored, because it is needed for decryption as well. The feedback modes provided by JCA are CBC, CFB, and OFB; the mode with no feedback at all is called ECB (Electronic Codebook) for distinction. More detailed descriptions of each mode will not be given here. Figure 7 shows why feedback modes matter: if the original image data is encrypted without a feedback mode (ECB mode), identical input blocks yield identical ciphertext blocks, so the outline of the whole image remains visible.

Figure 7: Image Encryption (source: http://en.wikipedia.org/wiki/Modes_of_operation).

Creating a Cipher Object

The essential step in creating a Cipher instance is specifying the transformation. A transformation consists of the encryption algorithm, feedback mode, and padding described above ("algorithm/mode/padding"). The algorithm alone can also be specified, in which case the default feedback mode and padding (ECB/PKCS5Padding) are applied internally.

Cipher c1 = Cipher.getInstance("DES/ECB/PKCS5Padding");
// or
Cipher c1 = Cipher.getInstance("DES");

A Cipher instance is initialized in one of four operation modes (opmode): encryption, decryption, wrap, or unwrap.

WRAP_MODE: wraps a java.security.Key into bytes for secure key transmission.
UNWRAP_MODE: unwraps a previously wrapped key back into a java.security.Key object.

The instance is initialized by calling its init() method, which takes the opmode, a key (or certificate), params, and random as parameters. Note the params parameter, of type AlgorithmParameters: this instance stores the IV of the feedback mode, or the salt and iteration count of a PBE algorithm. These values are not required when initializing a Cipher in ENCRYPT_MODE; they can be created randomly by SecureRandom and used in the encryption process, in which case the generated values are stored in the AlgorithmParameters field of the encrypting Cipher object. For DECRYPT_MODE, on the other hand, the params value is required: decryption needs exactly the same params value that was used for encryption. When init() is called, all existing values in the Cipher object are cleared, so before re-initializing the instance you should call getParameters() to save the AlgorithmParameters object used for encryption.

To make these jobs simpler, SealedObject can be used for the encryption result. The SealedObject constructor receives the object to encrypt and a Cipher object as arguments (the sealing process). The SealedObject itself is encrypted data, and it keeps the algorithm parameters used for encryption. If the key used in the encryption process is passed in, the decrypted data can be obtained back (the unsealing process).

Code 4: Encryption using SealedObject.

// Create Cipher object
Cipher c = Cipher.getInstance("DES");
c.init(Cipher.ENCRYPT_MODE, sKey);
// Create SealedObject: it is the encrypted data
SealedObject so = new SealedObject("This is a secret", c);

Code 5: Decryption using SealedObject.

// Note: sKey is the same key that was used for encryption
// Note: so is the SealedObject created above

// Decryption using SealedObject #1: decrypt with a Cipher object
c.init(Cipher.DECRYPT_MODE, sKey);
try {
    String s = (String) so.getObject(c);
} catch (Exception e) {
    // handle the error
}

// Decryption using SealedObject #2: decrypt with the encryption key
try {
    String s = (String) so.getObject(sKey);
} catch (Exception e) {
    // handle the error
}

Message Authentication Codes (MAC)

A MAC is similar to a MessageDigest in that it produces a hash value, but it differs in that it requires a SecretKey (symmetric key) for initialization. MessageDigest allows any receiving party to check the integrity of a received message; a MAC allows only a party holding the identical SecretKey to do so. In other words, a MAC is used among those who share the SecretKey.

Figure 8: Flow of Actions of MAC (source: http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html).

HMAC is a MAC based on a cryptographic hash function (a MessageDigest algorithm such as MD5 or SHA1); that is, it is a combination of a MessageDigest algorithm and a shared SecretKey. Signature differs from HMAC in that Signature uses an asymmetric key. HMAC can identify the counterpart faster than a signature based on the RSA algorithm, which is why some services strategically use HMAC.

Conclusion

So far I have described about half of the JCA functions. The rest are just as important even though they are not covered here: this article does not include the other core parts of JCA, such as Key, KeyPair, KeyFactory, KeyGenerator, KeyStore, CertificateFactory, and CertStore. I think those functions deserve a deep and full description of their own; for lack of space I will not deal with them here, but they will be described in a next article if possible. It was quite difficult to study JCA and prepare this article; the more I prepared, the more there was to study and research, and I was left with even more questions while writing. I hope this article helps you understand the "vague" concept of encryption more clearly.

By Jaehee Ahn, Software Engineer at Web Platform Development Lab, NHN Corporation.

References

http://docs.oracle.com/javase/6/docs/technotes/guides/security/crypto/CryptoSpec.html
http://en.wikipedia.org/wiki/Modes_of_operation
http://en.wikipedia.org/wiki/Cipher
http://en.wikipedia.org/wiki/Stream_Cipher
http://luxsci.com/blog/how-does-secure-socket-layer-ssl-or-tls-work.html
Posted about 11 years ago by Woo Seongmin
Vert.x is a server framework that is rising rapidly. Every server framework claims high performance and support for a variety of protocols, but Vert.x takes a step beyond that: it considers the environment in which a server network is established and operated. In other words, Vert.x was designed with care not only for producing a single server process daemon, but also for producing several server process daemons that run in a clustered environment. It is therefore worth reviewing which network environment Vert.x targets, as well as how it delivers high performance, and I think it is worth spending some time examining the structure of Vert.x.

Philosophy of Vert.x

Vert.x is a project influenced by Node.js. Like Node.js, it is a server framework providing an event-based programming model, so the Vert.x API is very similar to the Node.js API: both provide asynchronous APIs. Node.js is built around JavaScript while Vert.x is built in Java, but it would be an oversimplification to regard Vert.x as a Java version of Node.js. Vert.x was influenced by Node.js, yet it has its own unique philosophy. The most characteristic design principles of Vert.x can be summarized as follows:

Polyglot - supports several languages. Vert.x itself is built in Java, but Java is not required to use it. Besides JVM-based languages such as Java and Groovy, Vert.x can be used with Ruby, Python, and even JavaScript. If you need to build a server application in JavaScript, it is an alternative to Node.js. In addition, Scala and Clojure support is planned.

Super Simple Concurrency Model. When building a server application with Vert.x, you can write code as if it were a single-threaded application; the effect of multi-threaded programming is achieved without synchronization, locks, or volatile variables. In Node.js, the JavaScript execution engine does not support multi-threading, so to utilize all CPU cores several copies of the same JavaScript program must be executed. Vert.x, in contrast, creates multiple threads according to the number of CPU cores while only one process is executed. It handles the multi-threading so that users can focus on implementing business logic.

Provides Event Bus. As described in the introduction, the goal of Vert.x is not only to produce a single server process daemon; Vert.x aims to make the various Vert.x-built server programs communicate well with each other. For this, Vert.x provides the Event Bus, so MQ-style functions such as point-to-point or pub/sub can be used (to provide the Event Bus, Vert.x uses Hazelcast, an in-memory data grid). Through this Event Bus, server applications built in different languages can easily communicate with each other.

Module System & Public Module Repository. Vert.x has a module system, which can be understood as a kind of component model: a Vert.x-built server application project is itself modularized, aiming at reusability. Modules can be registered to the Public Module Repository and shared through it.

What is the relationship and difference between Netty and Vert.x?

Before discussing Vert.x performance, we should clarify the relationship between Netty and Vert.x. Vert.x uses Netty; in other words, it processes all of its IO with Netty. It is therefore meaningless to compare the performance of Vert.x and Netty.
Vert.x is a server framework that provides API and functions different from and independent of Netty, and it was designed with a different purpose: Netty is a framework for processing low-level IO, while Vert.x processes IO at a higher level than Netty.

Comparison of Performance with Node.js

Even though the functions Vert.x provides differ from those of Node.js, comparing their performance is still meaningful. Figure 1 and Figure 2 below show the performance of Vert.x (Java, Ruby, Groovy) and Node.js (source: http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/). Figure 1 shows the comparison when an HTTP server returns only a 200/OK response; Figure 2 shows the comparison when a 72-byte static HTML file is returned as the response.

Figure 1: Comparison of Performance When Only a 200/OK Response Is Returned.

Figure 2: Comparison of Performance When a 72-byte Static HTML File Is Returned.

These figures were published by the Vert.x developers, and the tests were not performed under a strict environment, so look only at the relative differences in performance. A notable point is that even Vert.x-JavaScript performs better than Node.js. Still, even if the results are reliable, it is difficult to say that Vert.x is simply better than Node.js, because Node.js provides great models such as Socket.io and has many more references.

Vert.x Terminology

Vert.x defines its own terms, and redefines some general terms for its own use, so to understand Vert.x it is necessary to understand these terms. The following are common terms used in Vert.x:

Verticle

In Java terms, a Verticle is a class with a main method. A Verticle can also include other scripts referenced by the main method, as well as jar files or resources. An application may consist of one Verticle, or of several Verticles that communicate through the Event Bus. By analogy with Java, a Verticle can be understood as an independently executable class or jar file.

Vert.x Instance

A Verticle is executed within a Vert.x instance, and a Vert.x instance runs in its own JVM instance. Many Verticles can be executed simultaneously in a single Vert.x instance. Each Verticle can have its own class loader; this prevents direct interaction between Verticles through static members or global variables. Many Verticles can run simultaneously on several hosts on the network, and the Vert.x instances can be clustered through the Event Bus.

Concurrency

A Verticle instance is guaranteed to always be executed on the same thread. Since all code can be written in a single-threaded style, development with Vert.x becomes easy, and race conditions and deadlocks are prevented.

Event-based Programming Model

Like Node.js, Vert.x provides an event-based programming model, so most of the code you write for a Vert.x server is event handlers: for example, a handler to be called when data arrives on a TCP socket, and handlers notified when the Event Bus receives a message, when an HTTP message is received, when a connection is closed, or when a timer fires.

Event Loops

A Vert.x instance internally manages a thread pool.
Vert.x matches the size of the thread pool to the number of CPU cores as closely as possible. Each thread executes an event loop, which checks for events as it goes around the loop: whether there is data to read on a socket, for example, or which timer has fired. When there is an event to process, Vert.x calls the corresponding handler (of course, additional work is necessary if handler processing takes too long or involves blocking I/O).

Message Passing

Verticles use the Event Bus for communication. If a Verticle is viewed as an actor, message passing resembles the actor model made famous by the Erlang programming language. A Vert.x server instance hosts many Verticle instances and allows message passing among them, so the system can scale with the available cores without running Verticle code on multiple threads.

Shared Data

Message passing is very useful, but it is not the best approach for every kind of concurrency in an application. A cache is one of the most popular examples: if only one Verticle holds a certain cache, access is very inefficient, and if other Verticles need it, each would have to maintain the same cache data. Therefore, Vert.x provides a means of global access, the shared map, and Verticles share immutable data only.

Vert.x Core

As the name says, this is the core functionality of Vert.x. Functions that a Verticle can call directly are included in the core, and the core is accessible from each programming language API that Vert.x supports.

Vert.x Architecture

The overall architecture of Vert.x is shown in Figure 3.

Figure 3: Vert.x Architecture (source: http://www.javacodegeeks.com/2012/07/osgi-case-study-modular-vertx.html)

The basic execution unit of Vert.x is the Verticle, and several Verticles can be executed simultaneously on one Vert.x instance. The Verticles run on event-loop threads. Several Vert.x instances can be executed on several hosts on the network, as well as on one host, and the Verticles or modules communicate through the Event Bus. To sum up, a Vert.x application is a combination of Verticles and/or modules, and communication among them goes through the Event Bus.

Vert.x Project Structure

The following is the Vert.x project structure as seen in Eclipse after cloning the source code from the Vert.x GitHub page.

Figure 4: Vert.x source tree.

The overall layout is as follows: vertx-core is the core library; vertx-platform manages distribution and lifecycle; vertx-lang exposes the core Java API to the other supported languages. Gradle, which combines the advantages of Ant and Maven, is used as the build system.

Installing Vert.x and Executing Simple Examples

To use Vert.x, JDK 7 is required, because Vert.x uses the invokedynamic instruction introduced in JDK 7. Vert.x is easy to install: download the compressed installation file from http://vertx.io/downloads.html, decompress it in a desired location, and add the bin directory to the PATH environment variable. That is all there is to installing Vert.x. In a command window, execute "vertx version"; if the version information prints out, the installation is complete.

Example 1

Now, let's build and run a simple JavaScript web server that returns "Hello World". Write the following code and save it as server.js; it is almost identical to the equivalent Node.js code.
load('vertx.js');

vertx.createHttpServer().requestHandler(function(req) {
    req.response.end("Hello World!");
}).listen(8080, 'localhost');

Execute the created server.js application with the vertx command as follows:

$ vertx run server.js

Open a browser and connect to http://localhost:8080. If you see the 'Hello World!' message, you have succeeded.

Example 2

Let's see the same kind of example built in other languages. The following code, written in Java, shows a web server that reads a static file and returns it as an HTTP response.

Vertx vertx = Vertx.newVertx();
vertx.createHttpServer().requestHandler(new Handler<HttpServerRequest>() {
    public void handle(HttpServerRequest req) {
        String file = req.path.equals("/") ? "index.html" : req.path;
        req.response.sendFile("webroot/" + file);
    }
}).listen(8080);

The following code is written in Groovy and provides the same functionality:

def vertx = Vertx.newVertx()
vertx.createHttpServer().requestHandler { req ->
    def file = req.uri == "/" ? "index.html" : req.uri
    req.response.sendFile "webroot/$file"
}.listen(8080)

Future of Vert.x and NHN

At NHN we have been observing Vert.x development since before its official release, and we think highly of Vert.x. We have been communicating with the main developer, Tim Fox, since June 2012 to discuss ways to improve Vert.x. One example is Socket.io on Vert.x. Socket.io was available on Node.js only, so we ported it to Java and sent a pull request (https://github.com/vert-x/vert.x/pull/320) to the Vert.x repository on GitHub; it has now been merged into the Vert.x mods project. This socket.io vert.x module will be used for RTCS 2.0 (Vert.x + Socket.io), which is under development at NHN. Socket.io is one of the reasons Node.js has remained so popular; if Vert.x can use Socket.io, Vert.x may gain many more use cases. Furthermore, if this socket.io vertx module is used as an embedded library, it becomes meaningful to use Socket.io in Java-based applications.

What is RTCS? RTCS (Real Time Communication System) is a real-time web development platform created by NHN. It helps transfer messages between a browser and a server in real time. RTCS has been deployed for NHN web services such as Baseball 9, Me2Day Chatting, BAND Chatting, and so on.

Wrap-up

The first version of Vert.x was released in May 2012. Compared to Node.js, whose first version was released in 2009, the history of Vert.x is very short, so Vert.x does not have many references yet. However, Vert.x is supported by VMware and can run on Cloud Foundry, so we expect that many references will appear soon.

By Seongmin Woo, Software Engineer at Web Platform Development Lab, NHN Corporation.

References

"Main Manual" http://vertx.io/manual.html
"Installation Guide" http://vertx.io/install.html
"The C10K problem" http://www.kegel.com/c10k.html
Gim Seongbak, Song Jihun, "Java I/O & NIO Network Programming", Hanbit Media, 2004.
Posted about 11 years ago by Esen Sagynov
We released the CUBRID 9.0 Beta version in October last year. Since then we have been working hard on stabilizing the beta features, fixing bugs, and improving overall engine performance. Today I am excited to announce the immediate availability of the CUBRID 9.1 stable release. You can download the CUBRID Database Server from http://www.cubrid.org/?mid=downloads&item=cubrid&os=detect&cubrid=9.1.0.

I would also like to announce that we will give a talk about CUBRID Database Sharding at the Percona MySQL Conference on April 24, 2013, in Santa Clara, CA. Join us there to meet CUBRID engineers and get first-hand insight into the new CUBRID 9.1. Below I will provide an overview of the latest changes and improvements in CUBRID 9.1.

Overview

CUBRID 9.1 is an upgraded and stabilized version of CUBRID 9.0 Beta. To learn more about the biggest features introduced in the 9.x family, refer to the 9.0 official announcement. Issues found in the 9.0 Beta version have been fixed and stabilized in this new 9.1 stable release. With a variety of query-related functionalities, CUBRID 9.1 offers improved query processing performance as well as improved query optimization. In addition, its multi-language functionalities have been further improved. This new 9.1 release is accompanied by new CUBRID Tools and Drivers releases.

Backward Compatibility

Database compatibility: as the database volume of CUBRID 9.1 is not compatible with the database of CUBRID 9.0 Beta, users of CUBRID 9.0 Beta or previous versions should migrate their database. We have created migration instructions, which you can find in the Upgrade section of the Release Notes.

Driver compatibility: the JDBC and CCI drivers of CUBRID 9.1 are compatible with CUBRID 9.0 Beta and the CUBRID 2008 R4.x versions. Some features that were fixed or improved for 9.1 are not supported when 9.1 drivers connect to previous versions.

Major enhancements

New SQL functions and index hints:

New SQL analytic functions NTILE, LEAD, and LAG have been introduced in CUBRID 9.1.
The new SQL numeric function WIDTH_BUCKET is also introduced.
The TRUNC and ROUND functions now also accept date types.
New SQL hints: a new index hint clause is supported, along with SQL hints for multi-table UPDATE and DELETE statements and for the MERGE statement.

Performance improvements and optimizations:

The performance of data replication in an HA environment has been significantly improved in CUBRID 9.1.
Improved multi-key range optimization.
Enhanced optimization of the ORDER BY and GROUP BY clauses.
Improved analytic function performance.
Improved performance of the INSERT ON DUPLICATE KEY UPDATE and REPLACE statements.
Improved search and delete performance for non-unique indexes with many duplicate keys.
Improved delete performance when insert and delete operations are repeated.
The overall performance of SELECT operations has been improved by nearly 20%. Based on the results of our basic performance test, the performance of INSERT, DELETE, and UPDATE operations is almost the same as that of 9.0 Beta.

Multi-language support:

CUBRID 9.1 now supports collation at the table level.
The SHOW COLLATION statement and the new CHARSET, COLLATION, and COERCIBILITY functions are now supported.
Collation with expansion is supported, which sorts French with backward accent ordering.
Restrictions and issues of the 9.0 Beta version have been improved and fixed.

CUBRID SHARD:

We have added the cubrid shard getid command to verify the shard ID of a shard key.
CUBRID SHARD is now available on Windows as well.
Administration utilities:

The cubrid applyinfo utility now also shows information about the replication delay.
The cubrid killtran utility can now show the query execution information of each transaction, as well as remove transactions which execute a designated SQL statement.
When a query timeout occurs, the query execution information is now logged to the server error log and the CAS log files.

Behavioral Changes

The CUBRID_LANG environment variable is no longer used. Instead, the CUBRID_CHARSET environment variable sets the database charset, and the CUBRID_MSG_LANG environment variable sets the charset for utility and error messages.
Array execution functions, such as the cci_execute_array and cci_execute_batch functions and the Statement.executeBatch and PreparedStatement.executeBatch methods of JDBC, now commit after each individual query under auto-commit mode, whereas previous versions committed once for the entire execution.
The behavior of the cci_execute_array, cci_execute_batch, and cci_execute_result functions when an error occurs while executing multiple statements has changed: these functions now continue executing all of the given queries, whereas previous versions stopped execution and returned an error. Users can access the results and identify the errors with the CCI_QUERY_RESULT_* macros.
OFF is no longer supported for the KEEP_CONNECTION broker parameter.
The SELECT_AUTO_COMMIT broker parameter is no longer supported.
The allowed value range of the broker parameter APPL_SERVER_MAX_SIZE_HARD_LIMIT is now 1 - 2,097,151.
The default value of the broker parameter SQL_LOG_MAX_SIZE has changed from 100 MB to 10 MB.
The behavior of the call_stack_dump_activation_list parameter has changed.

Numerous Improvements and Bug Fixes

Many critical issues of the previous versions have been fixed, and many issues of stability, SQL, partitioning, HA, sharding, utilities, and drivers have been improved or fixed. For more details on the changes, see the Release Notes in English or Korean.

So far CUBRID 9.1 is our biggest release, and we would like you to try it. In fact, we have also released new, improved drivers for Node.js, PHP, PDO, Python, Perl, JDBC, ODBC, OLEDB, ADO.NET, and C. So you should definitely try the new, more performant and stable CUBRID 9.1 Database. If you have any questions, feel free to leave your comment below.
Posted about 11 years ago by Esen Sagynov
I would like to announce that on April 24, 2013, six weeks from now, we will talk at the Percona MySQL Conference & Expo in Santa Clara, CA. The topic of the presentation is Easy MySQL Database Sharding with CUBRID SHARD. The presentation will be at 3:30 PM in Ballroom A. Come and join us!

Abstract

If you ask companies who operate mission-critical services, they will tell you: that a relational database system is still the best choice for mission-critical data; that service availability is more important than performance; that high performance is good, but predictable performance is king. This is a fact, and we know it. At NHN we have over 30,000 web servers that operate over 150 large-scale web and mobile services. At such a scale we must know what scales, how to provide high availability, and how to operate at predictable speed.

At Percona Live MySQL Conference 2013 I will talk about CUBRID SHARD, a universal database sharding solution for CUBRID, MySQL, and Oracle. CUBRID SHARD can be used with a heterogeneous database backend, i.e., some shards can be stored in CUBRID, some in MySQL or even Oracle. At NHN we deploy various combinations: MySQL only, MySQL + Oracle, MySQL + CUBRID, CUBRID only, and Oracle only. I will explain how DBAs can easily configure it, and how we have implemented this feature. CUBRID SHARD allows storing an unlimited number of database shards and distributing data based on modulo, DATETIME, or hash/range calculations. Developers can even feed in their own library to calculate the SHARD_ID using a complicated custom algorithm. At the session I will show how easy it is to set all of this up; there is no need for a third-party management tool. With CUBRID SHARD, application developers do not need to modify the application logic to get data sharding. It is a DBA's job, as all of this is handled by the database system automatically.

CUBRID SHARD is designed to be very efficient. It provides built-in distributed load balancing as well as connection and statement pooling. At the conference I will present several cases where CUBRID SHARD is deployed as a shard manager and a connection manager, or where it is used for seamless data migration between different systems.

Who should come to the session? If you run a service which spends money on a database solution, or on tools you need to shard databases or manage connections, you should come and learn how CUBRID SHARD can give your applications native scale-out through a single database view.

If you would like to learn more about CUBRID Database Sharding, see Database Sharding the Right Way: Easy, Reliable, and Open Source, which I presented at the 2012 HighLoad++ conference. You can find more about CUBRID at Important Facts to Know about CUBRID. If you have questions, feel free to leave your comment below.
Posted about 11 years ago by Esen Sagynov
Hello reader, what do you know about CUBRID Database? Let this be your first introduction to CUBRID. Today I would like to tell a story about what we do at CUBRID to improve the experience our users have when they get started with CUBRID Database.

So far we have published numerous installation instructions and short HOWTO tutorials which help our users quickly get started with installing and configuring CUBRID Database, including instructions for the apt-get and yum package managers. Once their server is up and running, users can continue their learning experience with more tutorials.

To improve the user experience, a couple of months ago we wrote multiple Vagrant and Chef cookbook tutorials which provide easy step-by-step instructions on how to create a clean virtual machine image for VirtualBox and install CUBRID and any other necessary software on a new Linux operating system in a matter of minutes. Vagrant in combination with Chef cookbooks is a great tool and time saver for developers, especially for testers and those users who would like to just try new software in a VM without polluting their host machine. You can fire up a single command like vagrant up, and Vagrant will build a new VM with all the software you need preinstalled and configured for you. How cool is that! In fact, I use Vagrant and Chef on an everyday basis to reproduce issues users have reported on the CUBRID forums, or to quickly start hacking on new features for another project.

While Vagrant is a great tool for local development, Knife Solo is the guy you need for remote server provisioning. Just like Vagrant, Knife Solo prepares and cooks the Chef cookbooks on a remote server, be it a VM on your local machine or a remote Amazon EC2 server. We have written a Knife Solo tutorial which introduces this tool and shows how to install CUBRID Database, its tools, and drivers on a remote machine.

For those users who wish to directly download a virtual machine with a preconfigured CUBRID Database, we have built and uploaded CentOS and Ubuntu VirtualBox images with different versions of CUBRID Server.

To further improve the user experience, today I am immensely happy to announce the CUBRID Cloud Database Service at http://cloud.cubrid.org. We have come from installing CUBRID manually on a user machine, to installing it automatically on a VM or a remote server, to not having to install CUBRID at all. Now you can request connection information for a remote CUBRID cloud database for free, as soon as you need it. All you need is a valid email address.

Figure 1: CUBRID Cloud Database Service front page.

Once you have made a request, you will receive a confirmation email. When you confirm your email, we will start cooking your very own CUBRID cloud database. In a minute or so you will receive a second email with the database credentials: the remote database host IP address, the port number, your database name, a username, and a password. We will also include a short getting-started tutorial so you can take off quickly.

We have built this CUBRID Cloud Service for educational purposes. We want our users to be able to get their hands on a CUBRID database as soon as they need one. You can use this cloud database for testing, for learning CUBRID, or for building non-critical demo applications. To discuss the CUBRID Cloud Service, we have created a dedicated forum thread. You are welcome to join us! If you want to chat with our engineers, head to the #cubrid freenode chat room. We will be glad to see you there!
So, go ahead and create your first CUBRID cloud database!
Posted about 11 years ago by Hyeongyeop Kim
We cannot imagine Internet services without TCP/IP. All Internet services we have developed and used at NHN rest on a solid basis, TCP/IP. Understanding how data is transferred across the network will help you improve performance through tuning, troubleshoot problems, and adopt new technology. This article describes the overall operation of the network stack, following the data flow and control flow through the Linux OS and the hardware layer.

Key Characteristics of TCP/IP

How would you design a network protocol to transmit data quickly, while keeping the data in order and without any loss? TCP/IP was designed with exactly these considerations. The following are the key characteristics of TCP/IP needed to understand the stack. Technically, since TCP and IP have different layer structures, it would be more correct to describe them separately; here, however, we describe them as one.

1. Connection-oriented. First, a connection is made between two endpoints (local and remote), and then data is transferred. The "TCP connection identifier" is a combination of the addresses of the two endpoints, of the form <local IP, local port, remote IP, remote port>.

2. Bidirectional byte stream. Bidirectional data communication is carried out over byte streams.

3. In-order delivery. A receiver receives data in the order in which the sender sent it. To mark the order, a 32-bit integer is used.

4. Reliability through ACK. When a sender does not receive an ACK (acknowledgement) after sending data, the sender TCP re-sends the data to the receiver. The sender TCP therefore buffers data that has not yet been acknowledged by the receiver.

5. Flow control. A sender sends only as much data as the receiver can afford. The receiver tells the sender the maximum number of bytes it can receive (the unused buffer size, or receive window), and the sender sends only as much data as the receiver's receive window allows.

6. Congestion control. The congestion window is used separately from the receive window to limit the volume of data flowing into the network and prevent network congestion. Like the receive window, the sender sends only as much data as the congestion window allows, using a variety of algorithms such as TCP Vegas, Westwood, BIC, and CUBIC. Unlike flow control, congestion control is implemented by the sender alone.

Data Transmission

As its name implies, a network stack has many layers. Figure 1 shows the layer types.

Figure 1: Operation Process by Each Layer of TCP/IP Network Stack for Data Transmission.

The layers can be roughly classified into three areas: the user area, the kernel area, and the device area. Tasks in the user area and the kernel area are performed by the CPU, and these two areas together are called the "host" to distinguish them from the device area. Here, the device is the Network Interface Card (NIC) that sends and receives packets; "NIC" is a more accurate term than the commonly used "LAN card".

Let's start from the user area. First, the application creates the data to send (the "User data" box in Figure 1) and then calls the write() system call to send it (assume that the socket, fd in Figure 1, has already been created). When the system call is invoked, execution switches to the kernel area. POSIX-series operating systems, including Linux and Unix, expose the socket to the application as a file descriptor (a minimal application-side sketch follows below).
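From the application's point of view, all of the machinery described in this article hides behind a few calls. As a minimal sketch (the host and port are placeholders), the Java equivalent of the write() call above looks like this; OutputStream.write() is what eventually reaches the write system call and copies the data into the send socket buffer:

import java.io.OutputStream;
import java.net.Socket;

public class SendData {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            byte[] userData = "GET / HTTP/1.0\r\n\r\n".getBytes("US-ASCII");
            OutputStream out = socket.getOutputStream();
            // write() switches to the kernel, appends the data to the
            // send socket buffer, and returns; TCP decides when the
            // segments actually leave the host.
            out.write(userData);
            out.flush();
        }
    }
}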
In POSIX-series operating systems, the socket is a kind of file. The file layer performs a simple check and calls the socket function through the socket structure connected to the file structure.

A kernel socket has two buffers: the send socket buffer, for sending, and the receive socket buffer, for receiving. When the write system call is invoked, the data in the user area is copied into kernel memory and appended to the end of the send socket buffer, so that data is sent in order. In Figure 1, the light-gray boxes denote data in the socket buffers. Then TCP is called.

There is a TCP Control Block (TCB) structure connected to the socket. The TCB holds the data required to process the TCP connection: the connection state (LISTEN, ESTABLISHED, TIME_WAIT, etc.), the receive window, the congestion window, sequence numbers, the retransmission timer, and so on. If the current TCP state permits data transmission, a new TCP segment (in other words, a packet) is created. If transmission is impossible, due to flow control or a similar reason, the system call ends here and control returns to user mode (in other words, to the application).

A TCP segment has two parts, as shown in Figure 2: the TCP header and the payload.

Figure 2: TCP Frame Structure (source).

The payload carries data from the send socket buffer that has not yet been acknowledged. The maximum length of the payload is the minimum of the receive window, the congestion window, and the maximum segment size (MSS). Then the TCP checksum is computed; the computation includes pseudo-header information (IP addresses, segment length, and protocol number). One or more packets can be transmitted, depending on the TCP state. In fact, since today's network stacks use checksum offload, the TCP checksum is computed by the NIC, not by the kernel; we assume here that the kernel computes it, for convenience.

The created TCP segment goes down to the IP layer. The IP layer adds an IP header to the TCP segment and performs IP routing, the procedure of finding the next-hop IP on the way to the destination IP. After the IP layer has computed and added the IP header checksum, it hands the data to the Ethernet layer. The Ethernet layer looks up the MAC address of the next-hop IP using the Address Resolution Protocol (ARP) and then adds the Ethernet header to the packet. With the Ethernet header, the host-side packet is complete.

IP routing also yields the transmit interface (NIC) as its result: the interface used to reach the next-hop IP. So the driver of that transmit NIC is called. If a packet capture program such as tcpdump or Wireshark is running at this point, the kernel copies the packet data into the memory buffer that the program uses; in the same way, received packets are captured directly at the driver. Generally, the traffic shaper function is also implemented at this layer.

The driver requests packet transmission following the driver-NIC communication protocol defined by the NIC manufacturer. After receiving the request, the NIC copies the packet from main memory to its own memory and sends it onto the network line. Complying with the Ethernet standard, it adds the IFG (Inter-Frame Gap), the preamble, and the CRC to the packet.
The IFG and preamble are used to detect the start of a packet (in networking terms, framing), and the CRC protects the data (serving the same purpose as the TCP and IP checksums). Packet transmission starts subject to the physical speed of the Ethernet and the state of Ethernet flow control; it is like getting the floor before speaking in a conference room.

When the NIC sends a packet, it raises an interrupt on the host CPU. Every interrupt has its own interrupt number, and the OS uses that number to find the proper driver to handle it. The driver registers a function to handle the interrupt (an interrupt handler) when it starts. The OS calls the interrupt handler, and the handler then returns the transmitted packet to the OS, i.e., frees it.

So far we have discussed how data travels through the kernel and the device when the application performs a write. The kernel can also transmit a packet without a direct write request from the application, by calling TCP directly: for example, when an ACK is received and the receive window expands, the kernel creates a TCP segment from the data remaining in the socket buffer and sends it to the receiver.

Data Receiving

Now let's look at how data is received, i.e., how the network stack handles an incoming packet. Figure 3 shows the process.

Figure 3: Operation Process by Each Layer of TCP/IP Network Stack for Handling Data Received.

First, the NIC writes the packet into its own memory. It checks the packet's validity with the CRC check and then sends the packet to a memory buffer on the host. This buffer is memory that the driver has requested from the kernel in advance and allocated for receiving packets; after allocating it, the driver tells the NIC its address and size. If there is no host memory buffer allocated by the driver when the NIC receives a packet, the NIC may drop the packet.

After sending the packet to the host memory buffer, the NIC sends an interrupt to the host OS. Then the driver checks whether it can handle the new packet; up to this point, the driver-NIC communication protocol defined by the manufacturer is used. When the driver hands a packet up to the upper layer, it must wrap the packet in the packet structure that the OS uses, so that the OS can understand it. For example, sk_buff in Linux, mbuf in BSD-series kernels, and NET_BUFFER_LIST in Microsoft Windows are the packet structures of the corresponding OS. The driver then passes the wrapped packets to the upper layer.

The Ethernet layer checks whether the packet is valid and then de-multiplexes the upper protocol (the network protocol), using the ethertype value of the Ethernet header; the IPv4 ethertype value is 0x0800. It removes the Ethernet header and passes the packet to the IP layer.

The IP layer also checks whether the packet is valid, i.e., it checks the IP header checksum. It then logically determines whether the local system should handle the packet or whether IP routing should forward it to another system. If the packet is to be handled by the local system, the IP layer de-multiplexes the upper protocol (the transport protocol) by the proto value of the IP header; the TCP proto value is 6. It removes the IP header and passes the packet to the TCP layer.

Like the lower layers, the TCP layer checks whether the packet is valid.
Data Receiving

Now, let's take a look at how data is received: how the network stack handles an incoming packet. Figure 3 shows the process.

Figure 3: Operation Process by Each Layer of TCP/IP Network Stack for Handling Data Received.

First, the NIC writes the packet into its own memory. It checks whether the packet is valid by performing the CRC check and then sends it to a memory buffer on the host. This buffer is memory that the driver has already requested from the kernel and allocated for receiving packets; after allocating it, the driver tells the NIC its address and size. If a packet arrives while there is no host memory buffer allocated by the driver, the NIC may drop the packet.

After copying the packet to the host memory buffer, the NIC sends an interrupt to the host OS. The driver then checks whether it can handle the new packet. Up to this point, the manufacturer-defined driver-NIC communication protocol is used.

To hand a packet up to the upper layer, the driver must wrap it in the packet structure the OS understands: sk_buff in Linux, mbuf in BSD-family kernels, and NET_BUFFER_LIST in Microsoft Windows. The driver wraps the packets accordingly and sends them to the upper layer.

The Ethernet layer checks whether the packet is valid and then demultiplexes to the upper (network) protocol, using the ethertype value in the Ethernet header; the IPv4 ethertype is 0x0800. It removes the Ethernet header and passes the packet to the IP layer.

The IP layer also checks whether the packet is valid, i.e., it checks the IP header checksum. It then decides whether to perform IP routing and forward the packet to another system, or to let the local system handle it. If the packet must be handled locally, the IP layer demultiplexes to the upper (transport) protocol by the proto value of the IP header; the TCP proto value is 6. It removes the IP header and passes the packet to the TCP layer.

Like the lower layers, the TCP layer checks whether the packet is valid, including the TCP checksum. (As mentioned before, since current network stacks use checksum offload, the TCP checksum is actually computed by the NIC, not by the kernel.) It then searches for the TCP control block the packet belongs to, using the packet's 4-tuple (source IP, source port, destination IP, destination port) as the identifier. Once the connection is found, the protocol is executed to handle the packet: if new data has been received, the data is added to the receive socket buffer, and depending on the TCP state a new TCP packet (for example, an ACK) may be sent. With this, TCP/IP receive processing is complete.

The size of the receive socket buffer determines the TCP receive window, and up to a point TCP throughput increases when the receive window is large. In the past the socket buffer size was adjusted by the application or in the OS configuration; the latest network stacks adjust the receive socket buffer size, i.e., the receive window, automatically.

When the application calls the read system call, execution changes to the kernel area, data in the socket buffer is copied to user-area memory, and the copied data is removed from the socket buffer. TCP is then called: since there is new space in the socket buffer, it increases the receive window, and it sends a packet if the protocol state calls for one. If no packet needs to be transmitted, the system call ends.
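The receive buffer is still adjustable per socket. A minimal sketch, assuming Linux: explicitly setting SO_RCVBUF caps the receive window and disables the kernel's automatic receive-buffer tuning for that socket, so it is usually best left alone unless measurements say otherwise. (Linux also doubles the requested value internally to account for bookkeeping overhead, which the getsockopt below makes visible.)

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int rcvbuf = 1 << 20;             /* ask for 1 MiB (placeholder value) */
    socklen_t len = sizeof(rcvbuf);

    /* Explicit sizing caps the receive window and, on Linux, turns
     * off automatic tuning for this socket. */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    /* Linux reports back twice the requested size; the extra half
     * accounts for kernel bookkeeping overhead. */
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
    printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);
    return 0;
}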
Network Stack Development Direction

The layer functions described so far are the most basic ones. The network stack of the early 1990s had little more than these; the latest stacks have many more functions, and their implementations have grown correspondingly more complex. Recent network stack work can be classified by purpose as follows.

Packet Processing Procedure Manipulation

These are functions like Netfilter (firewall, NAT) and traffic control. By inserting user-controllable code into the basic processing flow, they make the stack work differently according to the user configuration.

Protocol Performance

This aims to improve the throughput, latency, and stability that the TCP protocol can achieve in a given network environment. The various congestion control algorithms and additional TCP features such as SACK are typical examples. Protocol improvements are out of scope here and will not be discussed.

Packet Processing Efficiency

This aims to raise the maximum number of packets a system can process per second by reducing the CPU cycles, memory usage, and memory accesses consumed per packet; there have also been several attempts to reduce per-packet latency in the system. Examples include stack parallel processing, header prediction, zero-copy, single-copy, checksum offload, TSO, LRO, and RSS.

Control Flow in the Stack

Now we will take a more detailed look at the internal flow of the Linux network stack. Like most kernel subsystems, the network stack is event-driven: it reacts when events occur, and there is no separate thread dedicated to executing the stack. Figure 1 and Figure 3 showed simplified diagrams of the control flow; Figure 4 below illustrates it more precisely.

Figure 4: Control Flow in the Stack.

At Flow (1) in Figure 4, an application calls a system call to execute (use) TCP: for example, it calls the read or write system call and TCP code runs, but no packet transmission is required.

Flow (2) is the same as Flow (1), except that executing TCP results in packet transmission. TCP creates a packet and sends it down to the driver. A queue sits in front of the driver: the packet enters the queue first, and the queue implementation decides when to pass it on to the driver. This is the Linux queue discipline (qdisc), and the Linux traffic control function amounts to manipulating qdiscs. The default qdisc is a simple First-In-First-Out (FIFO) queue; by using other qdiscs, operators can achieve effects such as artificial packet loss, packet delay, or transmission rate limits. In Flows (1) and (2), the application's process thread also executes the driver code.

Flow (3) is the case in which a timer used by TCP has expired: for example, when the TIME_WAIT timer expires, TCP is called to delete the connection.

Flow (4), like Flow (3), is the case in which a TCP timer has expired, but the result of executing TCP is a packet that must be transmitted: for example, when the retransmission timer expires, the segment whose ACK has not been received is retransmitted. Flows (3) and (4) run as the timer softirq that processes the timer interrupt.

When the NIC driver receives an interrupt, it frees the transmitted packets; in most cases the driver's execution ends there. Flow (5) is the case of packets accumulating in the transmit queue: the driver requests a softirq, and the softirq handler runs the transmit queue to push the accumulated packets down to the driver.

When the NIC driver receives an interrupt and finds a newly received packet, it requests a softirq. The softirq that processes received packets calls the driver and passes the received packets up to the upper layer. In Linux, this way of processing received packets is called New API (NAPI). It resembles polling, because the driver does not push the packets to the upper layer directly; instead, the upper layer pulls them. In the code this is the NAPI poll, or simply poll. Flow (6) is the case in which TCP execution completes, and Flow (7) is the case in which it additionally requires packet transmission. Flows (5), (6), and (7) are all executed by the softirq that processed the NIC interrupt.

How to Process Interrupt and Received Packet

Interrupt processing is complex, but you need to understand it to follow the performance issues around received-packet processing. Figure 5 shows the procedure.

Figure 5: Processing Interrupt, softirq, and Received Packet.

Assume that CPU 0 is executing an application (user program). At that moment the NIC receives a packet and raises an interrupt for CPU 0. The CPU runs the kernel interrupt (irq) handler, which looks up the interrupt number and calls the driver's interrupt handler. The driver frees the transmitted packets and then calls the napi_schedule() function to process the received packet; this function requests the softirq (software interrupt). After the driver's interrupt handler finishes, control passes back to the kernel handler, which runs the handler for the softirq.

After the interrupt context has executed, the softirq context executes. The two contexts run on the same thread but use different stacks, and while the interrupt context blocks hardware interrupts, the softirq context allows them.

The softirq handler that processes received packets is the net_rx_action() function. It calls the driver's poll() function, which in turn calls netif_receive_skb() to send the received packets one by one to the upper layer. After the softirq finishes, the application resumes execution from the point where it stopped, in order to run the system call it requested.

Therefore, the CPU that received the interrupt processes the received packets from first to last; Linux, BSD, and Microsoft Windows all follow basically the same procedure. If you check server CPU utilization, you may sometimes see only one CPU among many working hard on softirq. This phenomenon is caused by the receive-processing scheme just described, and multi-queue NICs, RSS, and RPS have been developed to solve it.
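As a rough, hypothetical skeleton of how a driver participates in this flow, the sketch below uses the classic NAPI API of kernels from roughly this article's era (netif_napi_add() with a weight argument; exact signatures have changed across kernel versions). The device-specific parts (my_dev, my_fetch_rx, register access) are placeholders, not a real driver.

#include <linux/interrupt.h>
#include <linux/netdevice.h>

struct my_dev {                      /* hypothetical driver state */
    struct net_device *ndev;
    struct napi_struct napi;
};

/* Placeholder: a real driver would read a descriptor from its RX ring. */
static struct sk_buff *my_fetch_rx(struct my_dev *md)
{
    return NULL;
}

/* Hardware interrupt: do as little as possible, then hand off to NAPI. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_dev *md = data;
    /* (a real driver would also ack/mask the NIC interrupt here) */
    napi_schedule(&md->napi);        /* request the RX softirq */
    return IRQ_HANDLED;
}

/* Softirq context: net_rx_action() calls this poll function. */
static int my_poll(struct napi_struct *napi, int budget)
{
    struct my_dev *md = container_of(napi, struct my_dev, napi);
    int work = 0;

    while (work < budget) {
        struct sk_buff *skb = my_fetch_rx(md);
        if (!skb)
            break;
        netif_receive_skb(skb);      /* hand the packet to the upper layer */
        work++;
    }

    if (work < budget)
        napi_complete(napi);         /* done; NIC interrupts re-enabled */
    return work;
}

/* At probe time, roughly: netif_napi_add(md->ndev, &md->napi, my_poll, 64); */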
Data Structure

The following are some of the key data structures. After looking at them, we will review the code.

sk_buff Structure

First, the sk_buff (or skb) structure represents a packet. Figure 6 shows part of it. As the stack's functions have advanced, the structure has grown more complicated, but its basic roles are ones anyone would expect.

Figure 6: Packet Structure sk_buff.

Including Packet Data and Metadata

The structure either contains the packet data directly or references it through a pointer. In Figure 6, part of the packet (from the Ethernet header to the payload buffer) is referenced via the data pointer, while additional data (frags) points to actual pages. Necessary information such as header offsets and payload length is kept in the metadata area. For example, mac_header, network_header, and transport_header in Figure 6 are pointers to the start of the Ethernet header, the IP header, and the TCP header, respectively. This layout makes TCP protocol processing easy.

How to Add or Delete a Header

Headers are added or removed as the packet travels down and up the layers of the network stack. Pointers are used for efficiency: to remove the Ethernet header, for example, it is enough to advance the data pointer; no copy is required.

How to Combine and Divide Packets

Linked lists are used to perform tasks efficiently, such as adding packet payload data to a socket buffer, deleting it, or chaining packets; the next and prev pointers serve this purpose.

Quick Allocation and Free

Since a structure is allocated every time a packet is created, a fast allocator is used. For example, if data is transmitted at 10-Gigabit Ethernet speed, more than one million packets must be created and freed per second.
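The header-manipulation trick described above is easy to model in plain C. The sketch below is a userspace illustration of the idea, not the kernel's sk_buff code: removing a header is just advancing the data pointer, and prepending one is moving it back.

#include <assert.h>
#include <string.h>

/* Illustrative model of sk_buff's pointer arithmetic. */
struct pkt {
    unsigned char buf[2048]; /* backing storage (like skb->head) */
    unsigned char *data;     /* current start of the packet */
    unsigned int len;        /* bytes from data to the end of the packet */
};

/* Like skb_push: prepend a header by moving data backwards. */
static unsigned char *pkt_push(struct pkt *p, unsigned int hdr_len)
{
    assert(p->data - p->buf >= (long)hdr_len); /* enough headroom? */
    p->data -= hdr_len;
    p->len += hdr_len;
    return p->data;          /* caller fills in the header here */
}

/* Like skb_pull: strip a header by moving data forwards. No copy. */
static unsigned char *pkt_pull(struct pkt *p, unsigned int hdr_len)
{
    p->data += hdr_len;
    p->len -= hdr_len;
    return p->data;
}

int main(void)
{
    struct pkt p;
    p.data = p.buf + 128;    /* reserve headroom for lower-layer headers */
    p.len = 0;

    memcpy(pkt_push(&p, 20), "....twenty byte hdr.", 20); /* add a header */
    pkt_pull(&p, 20);        /* remove it again: O(1), no memmove */
    return 0;
}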
TCP Control Block

Second, there is a structure that represents a TCP connection; earlier it was abstractly called the TCP control block. Linux uses tcp_sock for this structure. In Figure 7 you can see the relationship among the file, the socket, and the tcp_sock.

Figure 7: TCP Connection Structure.

When a system call occurs, the kernel looks up the file behind the file descriptor used by the application that made the call. In Unix-family OSes, the socket, the ordinary file, and the device backing a file system for storage are all abstracted as files, so the file structure itself holds minimal information. For a socket, a separate socket structure stores the socket-related information, and the file refers to the socket with a pointer. The socket in turn refers to the tcp_sock. The tcp_sock is specialized from sock, inet_sock, and so on, to support protocols other than TCP; this can be seen as a kind of polymorphism.

All state used by the TCP protocol is kept in the tcp_sock: for example, the sequence numbers, the receive window, the congestion control state, and the retransmission timer. The send socket buffer and the receive socket buffer are sk_buff lists and are included in the tcp_sock.

The dst_entry, the result of IP routing, is referenced to avoid routing too frequently. It also allows easy lookup of the ARP result, i.e., the destination MAC address. The dst_entry is part of the routing table; the routing table's structure is complex enough that it will not be discussed in this document. The NIC to use for packet transmission is found through the dst_entry, and the NIC is represented by the net_device structure.

Therefore, starting from just the file, all the structures required to process a TCP connection (from the file down to the driver) can be reached by following pointers. The combined size of these structures is the memory consumed by one TCP connection: a few KB, excluding packet data. The memory usage has gradually grown as more functions have been added.

Finally, let's look at the TCP connection lookup table. It is a hash table used to find the TCP connection that a received packet belongs to. The hash value is computed over the packet's 4-tuple (source IP, source port, destination IP, destination port) using the Jenkins hash algorithm; the hash function is said to have been chosen with defense against attacks on the hash table in mind.
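A hedged illustration of the lookup idea: hash the 4-tuple and index a table of connections. The mixing function below is a toy stand-in (the real stack uses the Jenkins hash keyed with a random secret), and the structures are invented for illustration, not Linux's.

#include <stddef.h>
#include <stdint.h>

/* The connection identifier: the TCP/IP 4-tuple. */
struct four_tuple {
    uint32_t saddr, daddr;   /* source/destination IP */
    uint16_t sport, dport;   /* source/destination port */
};

struct tcb;                  /* stand-in for tcp_sock */

#define HASH_BUCKETS 1024
static struct bucket {
    struct four_tuple key;
    struct tcb *tcb;
} table[HASH_BUCKETS];

/* Toy mixing function; the kernel uses the Jenkins hash with a
 * random seed to resist hash-collision attacks. */
static uint32_t tuple_hash(const struct four_tuple *t)
{
    uint32_t h = t->saddr * 2654435761u;
    h ^= t->daddr * 2246822519u;
    h ^= ((uint32_t)t->sport << 16) | t->dport;
    return h & (HASH_BUCKETS - 1);
}

/* On receive: find the control block this packet belongs to
 * (roughly what __inet_lookup_skb does, minus collision chains). */
static struct tcb *lookup(const struct four_tuple *t)
{
    struct bucket *b = &table[tuple_hash(t)];
    return b->tcb; /* real code would compare keys and walk a chain */
}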
Following Code: How to Transmit Data

We will now examine the key tasks the stack performs by following the actual Linux kernel source code, observing two frequently used paths. First, the path used to transmit data when an application calls the write system call.

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, ...)
{
    struct file *file;
    [...]
    file = fget_light(fd, &fput_needed);
    [...]
    ===>
    ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);

struct file_operations {
    [...]
    ssize_t (*aio_read) (struct kiocb *, const struct iovec *, ...)
    ssize_t (*aio_write) (struct kiocb *, const struct iovec *, ...)
    [...]
};

static const struct file_operations socket_file_ops = {
    [...]
    .aio_read = sock_aio_read,
    .aio_write = sock_aio_write,
    [...]
};

When the application calls the write system call, the kernel executes the write() function of the file layer. First, the actual file structure for the file descriptor fd is fetched, and then aio_write — a function pointer — is called. In the file structure you find the file_operations structure pointer; this structure, generally called a function table, holds function pointers such as aio_read and aio_write. The table for sockets is socket_file_ops, so the aio_write used by a socket is sock_aio_write. Such function tables serve a purpose similar to a Java interface and are commonly used in the kernel for code abstraction and refactoring.

static ssize_t sock_aio_write(struct kiocb *iocb, const struct iovec *iov, ..)
{
    [...]
    struct socket *sock = file->private_data;
    [...]
    ===>
    return sock->ops->sendmsg(iocb, sock, msg, size);

struct socket {
    [...]
    struct file *file;
    struct sock *sk;
    const struct proto_ops *ops;
};

const struct proto_ops inet_stream_ops = {
    .family = PF_INET,
    [...]
    .connect = inet_stream_connect,
    .accept = inet_accept,
    .listen = inet_listen,
    .sendmsg = tcp_sendmsg,
    .recvmsg = inet_recvmsg,
    [...]
};

struct proto_ops {
    [...]
    int (*connect) (struct socket *sock, ...)
    int (*accept) (struct socket *sock, ...)
    int (*listen) (struct socket *sock, int len);
    int (*sendmsg) (struct kiocb *iocb, struct socket *sock, ...)
    int (*recvmsg) (struct kiocb *iocb, struct socket *sock, ...)
    [...]
};

The sock_aio_write() function gets the socket structure from the file and calls sendmsg, again through a function pointer: the socket structure includes the proto_ops function table, the proto_ops implementation for IPv4 TCP is inet_stream_ops, and its sendmsg is implemented by tcp_sendmsg.

int tcp_sendmsg(struct kiocb *iocb, struct socket *sock,
                struct msghdr *msg, size_t size)
{
    struct sock *sk = sock->sk;
    struct iovec *iov;
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    [...]
    mss_now = tcp_send_mss(sk, &size_goal, flags);

    /* Ok commence sending. */
    iovlen = msg->msg_iovlen;
    iov = msg->msg_iov;
    copied = 0;
    [...]
    while (--iovlen >= 0) {
        int seglen = iov->iov_len;
        unsigned char __user *from = iov->iov_base;

        iov++;
        while (seglen > 0) {
            int copy = 0;
            int max = size_goal;
            [...]
            skb = sk_stream_alloc_skb(sk, select_size(sk, sg),
                                      sk->sk_allocation);
            if (!skb)
                goto wait_for_memory;
            /*
             * Check whether we can use HW checksum.
             */
            if (sk->sk_route_caps & NETIF_F_ALL_CSUM)
                skb->ip_summed = CHECKSUM_PARTIAL;
            [...]
            skb_entail(sk, skb);
            [...]
            /* Where to copy to? */
            if (skb_tailroom(skb) > 0) {
                /* We have some space in skb head. Superb! */
                if (copy > skb_tailroom(skb))
                    copy = skb_tailroom(skb);
                if ((err = skb_add_data(skb, from, copy)) != 0)
                    goto do_fault;
            [...]
    if (copied)
        tcp_push(sk, flags, mss_now, tp->nonagle);
    [...]
}

tcp_sendmsg gets the tcp_sock (i.e., the TCP control block) from the socket and copies the data the application asked to transmit into the send socket buffer. How many bytes go into one sk_buff when copying the data? One sk_buff copies and holds MSS (tcp_send_mss) bytes, to help the code that actually creates packets. Maximum Segment Size (MSS) is the maximum payload size that one TCP packet carries. With TSO and GSO, one sk_buff can hold more than MSS bytes; that will be discussed in the next article, not here.

The sk_stream_alloc_skb function creates a new sk_buff, and skb_entail appends the new sk_buff to the tail of the send socket buffer. The skb_add_data function copies the actual application data into the sk_buff's data buffer. All the data is copied by repeating this procedure (create an sk_buff, append it to the send socket buffer), so MSS-sized sk_buffs sit in the send socket buffer as a list. Finally, tcp_push is called to turn the data that can be transmitted now into a packet and send it.

static inline void tcp_push(struct sock *sk, int flags, int mss_now, ...)
[...]
===>
static int tcp_write_xmit(struct sock *sk, unsigned int mss_now,
                          int nonagle, ...)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct sk_buff *skb;
    [...]
    while ((skb = tcp_send_head(sk))) {
        [...]
        cwnd_quota = tcp_cwnd_test(tp, skb);
        if (!cwnd_quota)
            break;

        if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
            break;
        [...]
        if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
            break;

        /* Advance the send_head. This one is sent out.
         * This call will increment packets_out.
         */
        tcp_event_new_data_sent(sk, skb);
        [...]
The tcp_push function transmits, in sequence, as many of the sk_buffs in the send socket buffer as TCP allows. First, tcp_send_head is called to get the first sk_buff in the socket buffer, and tcp_cwnd_test and tcp_snd_wnd_test check whether the congestion window and the receive window of the receiving TCP allow new packets to be transmitted. Then tcp_transmit_skb is called to create a packet.

static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb,
                            int clone_it, gfp_t gfp_mask)
{
    const struct inet_connection_sock *icsk = inet_csk(sk);
    struct inet_sock *inet;
    struct tcp_sock *tp;
    [...]

    if (likely(clone_it)) {
        if (unlikely(skb_cloned(skb)))
            skb = pskb_copy(skb, gfp_mask);
        else
            skb = skb_clone(skb, gfp_mask);
        if (unlikely(!skb))
            return -ENOBUFS;
    }
    [...]
    skb_push(skb, tcp_header_size);
    skb_reset_transport_header(skb);
    skb_set_owner_w(skb, sk);

    /* Build TCP header and checksum it. */
    th = tcp_hdr(skb);
    th->source  = inet->inet_sport;
    th->dest    = inet->inet_dport;
    th->seq     = htonl(tcb->seq);
    th->ack_seq = htonl(tp->rcv_nxt);
    [...]
    icsk->icsk_af_ops->send_check(sk, skb);
    [...]
    err = icsk->icsk_af_ops->queue_xmit(skb);
    if (likely(err <= 0))
        return err;

    tcp_enter_cwr(sk, 1);
    return net_xmit_eval(err);
}

tcp_transmit_skb makes a copy of the given sk_buff (pskb_copy); at this time it copies only the metadata, not the entire application data. Then it calls skb_push to reserve the header area and fills in the header field values. send_check computes the TCP checksum; with checksum offload, the checksum over the payload data is not computed here. Finally, queue_xmit is called to send the packet down to the IP layer. queue_xmit for IPv4 is implemented by the ip_queue_xmit function.

int ip_queue_xmit(struct sk_buff *skb)
    [...]
    rt = (struct rtable *)__sk_dst_check(sk, 0);
    [...]
    /* OK, we know where to send it, allocate and build IP header. */
    skb_push(skb, sizeof(struct iphdr) + (opt ? opt->optlen : 0));
    skb_reset_network_header(skb);
    iph = ip_hdr(skb);
    *((__be16 *)iph) = htons((4 << 12) | (5 << 8) | (inet->tos & 0xff));
    if (ip_dont_fragment(sk, &rt->dst) && !skb->local_df)
        iph->frag_off = htons(IP_DF);
    else
        iph->frag_off = 0;
    iph->ttl      = ip_select_ttl(inet, &rt->dst);
    iph->protocol = sk->sk_protocol;
    iph->saddr    = rt->rt_src;
    iph->daddr    = rt->rt_dst;
    [...]
    res = ip_local_out(skb);
    [...]
===>
int __ip_local_out(struct sk_buff *skb)
    [...]
    ip_send_check(iph);
    return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL,
                   skb_dst(skb)->dev, dst_output);
    [...]
===>
int ip_output(struct sk_buff *skb)
{
    struct net_device *dev = skb_dst(skb)->dev;
    [...]
    skb->dev = dev;
    skb->protocol = htons(ETH_P_IP);

    return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING, skb, NULL, dev,
                        ip_finish_output,
    [...]
===>
static int ip_finish_output(struct sk_buff *skb)
    [...]
    if (skb->len > ip_skb_dst_mtu(skb) && !skb_is_gso(skb))
        return ip_fragment(skb, ip_finish_output2);
    else
        return ip_finish_output2(skb);

The ip_queue_xmit function performs the tasks required at the IP layer. __sk_dst_check checks whether the cached route is still valid; if there is no cached route, or the cached route is invalid, it performs IP routing. It then calls skb_push to reserve the IP header area and fills in the IP header field values. After that, following the call chain, ip_send_check computes the IP header checksum and the netfilter hooks are invoked. The ip_finish_output function creates IP fragments when fragmentation is needed; when TCP is used, no fragmentation is generated, because TCP segments respect the MSS. Therefore ip_finish_output2 is called, and it adds the Ethernet header. At this point, a complete packet exists.
int dev_queue_xmit(struct sk_buff *skb)
    [...]
===>
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q, ...)
    [...]
    if (...) {
        ....
    } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
               qdisc_run_begin(q)) {
        [...]
        if (sch_direct_xmit(skb, q, dev, txq, root_lock)) {
        [...]
===>
int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q, ...)
    [...]
    HARD_TX_LOCK(dev, txq, smp_processor_id());
    if (!netif_tx_queue_frozen_or_stopped(txq))
        ret = dev_hard_start_xmit(skb, dev, txq);
    HARD_TX_UNLOCK(dev, txq);
    [...]
}

int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, ...)
    [...]
    if (!list_empty(&ptype_all))
        dev_queue_xmit_nit(skb, dev);
    [...]
    rc = ops->ndo_start_xmit(skb, dev);
    [...]
}

The completed packet is transmitted through the dev_queue_xmit function. First the packet passes via the qdisc; if the default qdisc is in use and the queue is empty, the sch_direct_xmit function is called to send the packet straight down to the driver, skipping the queue. The dev_hard_start_xmit function then calls the actual driver. Before the driver is called, the device's TX is locked to prevent several threads from accessing the device simultaneously; because the kernel takes the TX lock, the driver's transmission code needs no additional locking. This is closely related to the parallel processing that will be discussed next time.

The ndo_start_xmit function calls the driver code. Just before it, you can see ptype_all and dev_queue_xmit_nit: ptype_all is a list that includes modules such as packet capture, and if a capture program is running, each packet is copied via ptype_all to that program. Therefore, the packets tcpdump shows are the packets handed to the driver. When checksum offload or TSO is used, the NIC manipulates the packet afterwards, so the tcpdump packet can differ from the packet transmitted on the network line. After packet transmission completes, the driver interrupt handler returns the sk_buff.
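For reference, this is roughly how a capture program ends up reachable from the ptype_all list: it opens an AF_PACKET socket, after which the kernel copies packets to it at the points described above. A minimal, Linux-specific sketch (requires root; error handling trimmed):

#include <arpa/inet.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Registers this process to receive a copy of every packet,
     * which is what tcpdump and Wireshark do under the hood. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket (root required)");
        return 1;
    }

    unsigned char frame[2048];
    /* Each read returns one copied frame, starting at the Ethernet
     * header, just as the capture programs see it. */
    ssize_t n = recv(fd, frame, sizeof(frame), 0);
    printf("captured %zd bytes\n", n);

    close(fd);
    return 0;
}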
Following Code: How to Receive Data

The general execution path on receive is to accept a packet and add its data to the receive socket buffer. After the driver interrupt handler runs, follow the NAPI poll handling first.

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    void *have;

    local_irq_disable();

    while (!list_empty(&sd->poll_list)) {
        struct napi_struct *n;
        [...]
        n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

        if (test_bit(NAPI_STATE_SCHED, &n->state)) {
            work = n->poll(n, weight);
            trace_napi_poll(n);
        }
        [...]
}

int netif_receive_skb(struct sk_buff *skb)
    [...]
===>
static int __netif_receive_skb(struct sk_buff *skb)
{
    struct packet_type *ptype, *pt_prev;
    [...]
    __be16 type;
    [...]
    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
    [...]
    type = skb->protocol;
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }

    if (pt_prev) {
        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);

static struct packet_type ip_packet_type __read_mostly = {
    .type = cpu_to_be16(ETH_P_IP),
    .func = ip_rcv,
    [...]
};

As mentioned before, the net_rx_action function is the softirq handler that processes received packets. First, the drivers that requested NAPI polling are retrieved from poll_list, and each driver's poll handler is called. The driver wraps each received packet in an sk_buff and calls netif_receive_skb. When a module has requested all packets, netif_receive_skb delivers the packets to that module: as with transmission, packets are handed to the modules registered on the ptype_all list, and this is where capture happens.

The packets are then passed to the upper layer according to the packet type. An Ethernet packet carries a 2-byte ethertype field in its header, whose value indicates the packet type; the driver records it in the sk_buff (skb->protocol). Each protocol has its own packet_type structure and registers a pointer to it in the ptype_base hash table. IPv4 uses ip_packet_type, whose type field holds the IPv4 ethertype (ETH_P_IP) value; therefore an IPv4 packet results in a call to the ip_rcv function.

int ip_rcv(struct sk_buff *skb, struct net_device *dev, ...)
{
    struct iphdr *iph;
    u32 len;
    [...]
    iph = ip_hdr(skb);
    [...]
    if (iph->ihl < 5 || iph->version != 4)
        goto inhdr_error;

    if (!pskb_may_pull(skb, iph->ihl*4))
        goto inhdr_error;

    iph = ip_hdr(skb);

    if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
        goto inhdr_error;

    len = ntohs(iph->tot_len);
    if (skb->len < len) {
        IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INTRUNCATEDPKTS);
        goto drop;
    } else if (len < (iph->ihl*4))
        goto inhdr_error;
    [...]
    return NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, dev, NULL,
                   ip_rcv_finish);
    [...]
===>
int ip_local_deliver(struct sk_buff *skb)
    [...]
    if (ip_hdr(skb)->frag_off & htons(IP_MF | IP_OFFSET)) {
        if (ip_defrag(skb, IP_DEFRAG_LOCAL_DELIVER))
            return 0;
    }

    return NF_HOOK(NFPROTO_IPV4, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                   ip_local_deliver_finish);
    [...]
===>
static int ip_local_deliver_finish(struct sk_buff *skb)
    [...]
    __skb_pull(skb, ip_hdrlen(skb));
    [...]
    int protocol = ip_hdr(skb)->protocol;
    int hash, raw;
    const struct net_protocol *ipprot;
    [...]
    hash = protocol & (MAX_INET_PROTOS - 1);
    ipprot = rcu_dereference(inet_protos[hash]);
    if (ipprot != NULL) {
        [...]
        ret = ipprot->handler(skb);
        [...]
===>
static const struct net_protocol tcp_protocol = {
    .handler = tcp_v4_rcv,
    [...]
};

The ip_rcv function performs the tasks required at the IP layer: it validates packets, checking, for example, the length and the header checksum. After passing through the netfilter code, it executes the ip_local_deliver function, which reassembles IP fragments if required, and then calls ip_local_deliver_finish, again through the netfilter code. The ip_local_deliver_finish function removes the IP header with __skb_pull and looks up the upper protocol whose value matches the IP header's protocol value. Similar to ptype_base, each transport protocol registers its own net_protocol structure in inet_protos; IPv4 TCP uses tcp_protocol, so tcp_v4_rcv, registered as its handler, is called.
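The registration pattern repeats at each layer: a protocol fills in a small structure with its identifier and handler, and the demultiplexer indexes by that identifier. A condensed, simplified illustration (the kernel uses hash tables and RCU lists rather than flat arrays):

#include <stdint.h>

struct sk_buff;                          /* opaque here */

/* Mirrors the shape of packet_type / net_protocol: an id plus a handler. */
struct proto_entry {
    uint16_t id;                         /* ethertype or IP proto number */
    int (*handler)(struct sk_buff *skb);
};

static int ip_rcv_stub(struct sk_buff *skb)  { (void)skb; return 0; }
static int tcp_rcv_stub(struct sk_buff *skb) { (void)skb; return 0; }

/* Ethernet layer demux: ethertype -> network protocol. */
static struct proto_entry ptype_model[] = {
    { 0x0800, ip_rcv_stub },             /* IPv4, like ip_packet_type */
};

/* IP layer demux: protocol field -> transport protocol. */
static struct proto_entry inet_protos_model[] = {
    { 6, tcp_rcv_stub },                 /* TCP, like tcp_protocol */
};

static int deliver(struct proto_entry *tbl, int n, uint16_t id,
                   struct sk_buff *skb)
{
    for (int i = 0; i < n; i++)
        if (tbl[i].id == id)
            return tbl[i].handler(skb);  /* e.g. ip_rcv, then tcp_v4_rcv */
    return -1;                           /* unknown protocol: drop */
}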
When a packet arrives at the TCP layer, the processing flow varies depending on the TCP state and the packet type. Here we will follow the processing of the expected next data packet arriving while the TCP connection is in the ESTABLISHED state. This is the path the receiving server executes most frequently when there is no packet loss or out-of-order delivery.

int tcp_v4_rcv(struct sk_buff *skb)
{
    const struct iphdr *iph;
    struct tcphdr *th;
    struct sock *sk;
    [...]
    th = tcp_hdr(skb);

    if (th->doff < sizeof(struct tcphdr) / 4)
        goto bad_packet;
    if (!pskb_may_pull(skb, th->doff * 4))
        goto discard_it;
    [...]
    th = tcp_hdr(skb);
    iph = ip_hdr(skb);
    TCP_SKB_CB(skb)->seq = ntohl(th->seq);
    TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
                                skb->len - th->doff * 4);
    TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
    TCP_SKB_CB(skb)->when = 0;
    TCP_SKB_CB(skb)->flags = iph->tos;
    TCP_SKB_CB(skb)->sacked = 0;

    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    [...]
    ret = tcp_v4_do_rcv(sk, skb);

First, the tcp_v4_rcv function validates the received packet; for instance, if the header is smaller than the data offset (th->doff < sizeof(struct tcphdr) / 4), it is a header error. Then __inet_lookup_skb is called to search the TCP connection hash table for the connection the packet belongs to. From the sock structure found, all required structures, such as the tcp_sock and the socket, can be reached.

int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
    [...]
    if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
        sock_rps_save_rxhash(sk, skb->rxhash);
        if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
    [...]
===>
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
    [...]
    /*
     * Header prediction.
     */
    if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
        TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
        !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt))) {
        [...]
        if ((int)skb->truesize > sk->sk_forward_alloc)
            goto step5;

        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);

        /* Bulk data transfer: receiver */
        __skb_pull(skb, tcp_header_len);
        __skb_queue_tail(&sk->sk_receive_queue, skb);
        skb_set_owner_r(skb, sk);
        tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
        [...]
        if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
            __tcp_ack_snd_check(sk, 0);
        [...]
step5:
    if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
        goto discard;

    tcp_rcv_rtt_measure_ts(sk, skb);

    /* Process urgent data. */
    tcp_urg(sk, skb, th);

    /* step 7: process the segment text */
    tcp_data_queue(sk, skb);
    tcp_data_snd_check(sk);
    tcp_ack_snd_check(sk);
    return 0;
    [...]
}

The actual protocol runs from the tcp_v4_do_rcv function. If the connection is in the ESTABLISHED state, tcp_rcv_established is called; the ESTABLISHED state is handled separately and is optimized because it is by far the most common state.

tcp_rcv_established first runs the header prediction code, a fast check that detects the common case. The common case here is that there is no data waiting to be transmitted and the arriving data packet is exactly the one the receiving TCP expects next, i.e., its sequence number equals the sequence number the receiver expects. In that case the procedure completes by adding the data to the socket buffer and transmitting an ACK.

Further on you can see the statement comparing truesize with sk_forward_alloc: it checks whether the receive socket buffer has free space for the new packet data. If it does, header prediction "hits" (the prediction succeeded). Then __skb_pull is called to remove the TCP header, __skb_queue_tail appends the packet to the receive socket buffer, and finally __tcp_ack_snd_check transmits an ACK if necessary. With this, packet processing is complete.

If there is not enough free space, the slow path is taken. The tcp_data_queue function allocates new buffer space and adds the data packet to the socket buffer, automatically enlarging the receive socket buffer size when possible. Unlike the fast path, tcp_data_snd_check is then called to transmit a new data packet if possible, and finally tcp_ack_snd_check creates and transmits an ACK packet if necessary.

The amount of code executed on these two paths is small. This is achieved by optimizing for the common case, which conversely means the uncommon cases are processed considerably more slowly; out-of-order delivery is one of those uncommon cases.
How to Communicate between Driver and NIC

Communication between the driver and the NIC sits at the bottom of the stack, and most people never look at it. However, the NIC takes on more and more tasks to solve performance problems, and understanding its basic operation helps in understanding those additional techniques.

The driver and the NIC communicate asynchronously. For transmission, the driver requests packet transmission (the call) and the CPU performs other work without waiting for a response; when the NIC has sent the packets it notifies the CPU, and the driver returns the transmitted packets (the result). Receiving is asynchronous in the same way: the driver requests packet reception (the call) and the CPU performs other work; when the NIC has received packets it notifies the CPU, and the driver processes the received packets (the result).

Therefore, a place to store the requests and the responses is needed. In most cases, the NIC uses a ring structure. A ring is similar to an ordinary queue: it has a fixed number of entries, each holding one request or one response, and the entries are used sequentially in turn. The name "ring" comes from reusing the fixed entries in rotation.

Following the transmission procedure in Figure 8 below, you can see how the ring is used.

Figure 8: Driver-NIC Communication: How to Transmit Packets.

The driver receives a packet from the upper layer and creates a send descriptor that the NIC can understand; by default, the send descriptor contains the packet size and its memory address. Since the NIC needs a physical address to access memory, the driver translates the packet's virtual address into a physical address. It then adds the send descriptor to the TX ring (1); the TX ring is the ring of send descriptors.

Next, the driver notifies the NIC of the new request (2) by writing directly to a specific NIC memory address. This is Programmed I/O (PIO): the data transfer method in which the CPU itself writes data to the device.

The notified NIC fetches the send descriptor from the TX ring in host memory (3). Since the device accesses the memory without CPU intervention, this access is called Direct Memory Access (DMA). From the descriptor, the NIC determines the packet address and size and then fetches the actual packet from host memory (4). With checksum offload, the NIC computes the checksum while fetching the packet data from memory, so the offload adds almost no overhead.

The NIC sends the packets (5) and then writes the number of packets sent to host memory (6). It then raises an interrupt (7). The driver reads the number of packets sent and returns the packets that have been transmitted so far.
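Below is a minimal model of a descriptor ring, in userspace C for illustration: the field names and sizes are invented (real NICs define their own descriptor formats), but the pattern is the one described above: the driver produces descriptors at the head, and the NIC consumes them at the tail.

#include <stdint.h>

#define RING_SIZE 256                 /* fixed number of entries */

/* Hypothetical send descriptor: where the packet is and how big. */
struct tx_desc {
    uint64_t dma_addr;                /* physical address of the packet */
    uint16_t len;                     /* packet length in bytes */
    uint16_t flags;                   /* e.g. "compute checksum" */
};

struct tx_ring {
    struct tx_desc desc[RING_SIZE];
    uint32_t head;                    /* next slot the driver fills */
    uint32_t tail;                    /* next slot the NIC consumes */
};

/* Driver side: post one transmit request (step 1 in Figure 8).
 * Returns 0 if the ring is full, which is where backpressure starts. */
static int tx_post(struct tx_ring *r, uint64_t addr, uint16_t len)
{
    uint32_t next = (r->head + 1) % RING_SIZE;
    if (next == r->tail)
        return 0;                     /* no free entry: caller must wait */
    r->desc[r->head] = (struct tx_desc){ .dma_addr = addr, .len = len };
    r->head = next;
    /* Step 2 would follow here: a PIO write to a NIC doorbell
     * register telling the device that new descriptors are ready. */
    return 1;
}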
Figure 9 below shows the procedure for receiving packets.

Figure 9: Driver-NIC Communication: How to Receive Packets.

First, the driver allocates a host memory buffer for receiving packets and creates a receive descriptor, which by default contains the buffer size and memory address. As with the send descriptor, the physical address that DMA uses is stored in the receive descriptor. The driver then adds the descriptor to the RX ring (1); it is a receive request, and the RX ring is the ring of receive requests. Through PIO, the driver notifies the NIC that there is a new descriptor (2).

The NIC fetches the new descriptor from the RX ring and stores the size and location of the associated buffer in its own memory (3). When a packet arrives (4), the NIC copies it into the host memory buffer (5). If checksum offload is enabled, the NIC computes the checksum at this point. The actual size of the received packet, the checksum result, and other information are saved in a separate ring, the receive return ring (6), which stores the responses to receive requests. The NIC then raises an interrupt (7). The driver reads the packet information from the receive return ring and processes the received packets; if necessary, it allocates new memory buffers and repeats steps (1) and (2).

To tune the stack, most people suggest adjusting the ring and interrupt settings. A large TX ring allows many send requests to be queued at once, and a large RX ring allows many packets to be received at once, so large rings help workloads with large bursts of packet transmission or reception. In most cases the NIC also uses a timer to reduce the number of interrupts, since the CPU can suffer significant overhead processing them: to avoid flooding the host system with too many interrupts, the NIC collects interrupts and raises them at intervals (interrupt coalescing) while sending and receiving packets.
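These ring sizes are what `ethtool -g`/`-G` reads and writes on Linux. A hedged sketch of the underlying ioctl (the interface name "eth0" is a placeholder; error handling is trimmed):

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct ethtool_ringparam ring = { .cmd = ETHTOOL_GRINGPARAM };
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder NIC name */
    ifr.ifr_data = (void *)&ring;

    /* Same data ethtool -g shows: configured vs. maximum ring sizes. */
    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("RX ring %u/%u, TX ring %u/%u (current/max)\n",
               ring.rx_pending, ring.rx_max_pending,
               ring.tx_pending, ring.tx_max_pending);
    return 0;
}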
Stack Buffer and Flow Control

Flow control is executed at several stages of the stack. Figure 10 shows the buffers used to transmit data. First, an application creates data and adds it to the send socket buffer. If the buffer has no free space, the system call fails or the application thread blocks. Therefore, the rate at which application data flows into the kernel is controlled by the socket buffer size limit.

Figure 10: Buffers Related to Packet Transmission.

TCP creates packets and sends them toward the driver through the transmit queue (qdisc). It is a typical FIFO queue whose maximum length is the txqueuelen value, which can be checked with the ifconfig command; generally, it is several thousand packets. Between the driver and the NIC sits the TX ring which, as mentioned before, can be regarded as a transmission request queue. If the ring has no free entry, no transmission request can be made, and packets accumulate in the transmit queue; if too many accumulate, packets are dropped.

The NIC keeps the packets to transmit in an internal buffer. The packet rate out of this buffer is bounded by the physical rate (e.g., a 1 Gb/s NIC cannot deliver 10 Gb/s), and with Ethernet flow control, packet transmission stops when the receiving NIC's buffer has no free space. When packets arrive from the kernel faster than the NIC can send them, they accumulate in the NIC buffer. Once that buffer is full, processing of transmission requests from the TX ring stops; more and more requests then pile up in the TX ring until it, too, has no free entry; the driver can make no more transmission requests, and packets accumulate in the transmit queue. In this way, backpressure propagates from the bottom to the top through the chain of buffers.

Figure 11 shows the buffers that received packets pass through. Packets are first stored in the NIC's receive buffer. From the viewpoint of flow control, the RX ring between the driver and the NIC can be regarded as a packet buffer: the driver takes packets arriving in the RX ring and passes them to the upper layer. Since the NIC drivers used on server systems use NAPI by default, there is no buffer between the driver and the upper layer; the upper layer effectively takes packets straight from the RX ring. The payload data of the packets is stored in the receive socket buffer, from which the application later reads it.

Figure 11: Buffers Related to Packet Receiving.

A driver that does not support NAPI stores packets in the backlog queue, from which the NAPI handler later takes them; in that case the backlog queue can be regarded as a buffer between the driver and the upper layer.

If the kernel's packet processing rate is slower than the rate at which packets flow in from the NIC, the RX ring fills up, and then the NIC's internal buffer fills as well. When Ethernet flow control is in use, the NIC sends the transmitting NIC a request to pause transmission, or it drops packets. With TCP, no packets are dropped for lack of receive socket buffer space, because TCP provides end-to-end flow control; with UDP, which has no flow control, packets are dropped for lack of socket buffer space whenever the application is too slow.

The sizes of the TX ring and the RX ring used by the driver in Figure 10 and Figure 11 are the ring sizes shown by ethtool. For most throughput-oriented workloads, increasing the ring sizes and the socket buffer sizes helps: larger buffers reduce the chance of failures caused by running out of space while transmitting and receiving many packets at high rates.
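Applications observe this backpressure directly. A short sketch: with a non-blocking socket, send() reports EAGAIN/EWOULDBLOCK when the send socket buffer is full, which is the stack telling the application to slow down. (The fd below is assumed to be a connected, O_NONBLOCK TCP socket.)

#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>

/* Send len bytes, yielding to poll() whenever the send socket
 * buffer is full. Assumes fd is a connected, non-blocking TCP socket. */
static int send_all(int fd, const char *buf, size_t len)
{
    size_t off = 0;
    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, 0);
        if (n > 0) {
            off += n;
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Socket buffer full: backpressure from the stack. Wait
             * until ACKs free space and the socket is writable again. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            poll(&pfd, 1, -1);
        } else {
            return -1;       /* real error */
        }
    }
    return 0;
}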
Conclusion

Initially, I planned to explain only the things that would help you develop network programs, run performance tests, and troubleshoot. Despite that plan, the amount of material in this document turned out to be considerable. I hope it helps you develop network applications and monitor their performance.

The TCP/IP protocol itself is very complicated and has many exceptions. However, you do not need to understand every line of the OS's TCP/IP code to reason about performance and analyze what you observe; just understanding the context will be very helpful.

With the continuous advancement of system performance and OS network stack implementations, the latest servers can deliver 10-20 Gb/s of TCP throughput without any problem. These days there is a confusing alphabet soup of performance-related technologies: TSO, LRO, RSS, GSO, GRO, UFO, XPS, IOAT, DDIO, TOE, and more. In the next article, I will look at the network stack from the performance perspective and discuss the problems and effects of these technologies.

By Hyeongyeop Kim, Senior Engineer at Performance Engineering Lab, NHN Corporation.