Operation and maintenance development analysis

Key technical points: 1. operating a large number of high-concurrency websites; 2. designing a highly reliable, highly scalable network architecture; 3. website security: how to avoid being hacked; 4. north-south interconnection and dynamic CDN solutions; 5. massive data storage architecture.

I. What is large-website operation and maintenance?

First of all, the "operation and maintenance" discussed throughout this article means the operation and maintenance of large websites, which differs from other kinds of operations work. So we first need to distinguish large websites from small ones. The distinction is drawn mainly from the perspective of operational complexity: website scale, popularity, number of servers, PV volume, and so on; other factors are not the focus. Here we define a large website as one with more than 1,000 servers and at least 100 million PV per day (roughly the top 10 in China, such as Sina, Baidu, QQ, and http://51.com). Operation and maintenance engineers are "compound talents" who combine network, system, and development skills. Some companies even place contract procurement within the scope of operation and maintenance duties, and IDC network planning is included as well. It is therefore very important for operation and maintenance engineers to be familiar with the related disciplines: network, system, system development, storage, security, DB, and so on. The operation and maintenance engineer I am talking about here is a full-time one.

Let's talk about the "birth" process of a typical product:

1. First, the company's management sets the guiding direction; the PM researches and analyzes market demand (or copies a mature application) and finally produces a detailed design.

2. Architects complete the network planning and architecture design according to the product design, for example PV estimates, server scale, and application architecture. (The network usually changes little unless it is a big project.)

3. Development engineers write the code for the design, and test engineers test the application.

4. Finally, it is the operation and maintenance engineer's turn. To be clear, this does not mean the first three steps have nothing to do with operation and maintenance. On the contrary, they are closely related: operation and maintenance takes part in the early application architecture design, the evaluation and requisition of software/hardware resources, the evaluation of performance hazards in the application design, IDC matters, service performance and security tuning, and server system-level optimization (related to the specific application), and leads the entire application project. The operation and maintenance engineer is responsible for the product's servers: server system installation, network, IP, and installation of the common toolset. The engineer also needs to make a reasonable assessment of the architecture of the application going online, taking responsibility for its scalability and security hazards, and is responsible for the final joint debugging and optimization of the three parts together: product (program), network, and system. The product then goes online for users, and the revision cycle begins: requirements -> development (upgrade) -> test -> launch (performance problems, security problems, and so on surface gradually). Here you will find that the website development model is completely different from traditional software development. It is commonplace for a live website to ship 1 to 5 upgraded versions, and user experience is king: if an online problem took a year to fix, as at M$, the users would have left long ago. After the application goes online, the operation and maintenance work has only just begun.

The specific work may include: version upgrades and releases, service monitoring, application statistics, daily service status inspection, incident handling, routine service change adjustments, cluster management, service performance assessment and optimization, database management and optimization, scaling the application architecture as the application's PV rises and falls, security, and operation and maintenance development work:

a. Automate routine, mechanical manual work with tools (such as service monitoring, application statistics, and service releases) to improve efficiency.

b. Solve problems in real-world services, such as high reliability and scalability.

c. Develop large-scale cluster management tools. For example: how can 10,000 machines have their passwords changed, or run a specified task, within 1 minute? How can 2,000 servers have their operating systems installed? With distributed IDCs, storage clusters, and data on the multi-petabyte scale, how can storage, sharing, and analysis be done quickly? A whole series of such challenges awaits the efforts of operation and maintenance engineers.
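The "run a specified task on many machines" problem in point c comes down to fanning the work out in parallel and collecting per-host results. The following is a minimal sketch of that idea, not a production tool. The names `run_on_hosts` and `fake_task` and the host names are hypothetical, and `fake_task` only simulates a remote action; a real tool would replace it with an SSH or agent call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_on_hosts(hosts, task, max_workers=64):
    """Run task(host) on every host in parallel; return (results, errors)."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the host it belongs to.
        futures = {pool.submit(task, host): host for host in hosts}
        for future in as_completed(futures):
            host = futures[future]
            try:
                results[host] = future.result()
            except Exception as exc:  # one bad host must not stop the batch
                errors[host] = str(exc)
    return results, errors


def fake_task(host):
    """Hypothetical stand-in for a real remote action (e.g. an SSH command)."""
    if host.endswith("-13"):
        raise RuntimeError("unreachable")
    return f"ok:{host}"


hosts = [f"web-{i}" for i in range(100)]
results, errors = run_on_hosts(hosts, fake_task)
print(len(results), "succeeded;", len(errors), "failed")
```

Collecting failures separately instead of aborting matters at this scale, because with thousands of machines some hosts will almost always be unreachable.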

Now look at how the other roles work. Throughout the project, the front-end application is a black box to network/system engineers. Development engineers are only responsible for completing the application's functional development and for the performance, security, and other aspects of the application itself; they are not responsible for, and do not care about, the network/system architecture. Other colleagues, such as software/hardware procurement staff, care about these issues even less; everyone keeps to their own duties. The core of the project is the operation and maintenance engineer: the bridge between all the other departments.

That was a lot, and I think everyone now has some idea of what operation and maintenance is. Let's take an example: if the website is a car speeding along the highway, the operation and maintenance engineer is both its driver and its mechanic. This driver's job is not simple. Sometimes they must change a tire while driving at high speed, shift gears according to road conditions, upgrade the car's performance or parts when the car can no longer keep up with the increasing speed, fix faults and performance problems while still moving, and always watch for safety hazards ahead and take evasive action. This is operation and maintenance work!

Finally, the responsibility of the operation and maintenance engineer is to "keep the online service stable". It sounds simple, but it is not easy, because the engineer must strike a balance among many adverse factors: the impact of new product models on existing architecture and technology; the hidden bugs brought by frequent releases; the low level of operation and maintenance automation; the lack of process enforcement caused by the IT industry's pursuit of efficiency; the pressure that user growth puts on performance and architecture; the IT industry's loose technology management culture; the risks of innovation; Internet security issues; and so on. This demands a strong sense of responsibility, principles, and coordination skills. Anyone who can find the best balance among these factors is an excellent operation and maintenance engineer.

In addition, I have seen many people ask for the operation and maintenance experience of Sina, QQ, Baidu, http://51.com, and others. In fact, this is hard to provide, for the following reasons:

a. Each company's network architecture and scale are, to a greater or lesser degree, core company secrets and must be kept confidential. As for general-purpose software and architecture, many companies do secondary development (on Apache, PHP, or MySQL, for example) according to their own business needs, and the operating system kernel is customized for different business types: some applications are compute-intensive, some are IO-heavy, and some need large storage or large memory, and the kernel is tuned for these characteristics. For example, Sina did secondary development on Memcache and produced MemcacheDB. Leaving aside how well it works, open-sourcing it deserves praise; domestic companies basically only take and make no contribution. Servers are not standard either: most Dell/HP/IBM machines are customized according to business characteristics. For distributed storage there are also in-house solutions: companies either use the ready-made Hadoop solution or develop their own. But 90% of them learn from Google: distributed storage (GFS), distributed computing, and big tables.

b. Companies have different business directions, which leads to different operation and maintenance modes and methods. Take http://51.com and Baidu: their business models determine their architecture, number of servers, IDC distribution, and network structure, and even their general technology will not be the same. The operation and maintenance modes of Sina, mainly a news portal, and http://51.com, mainly an SNS, differ greatly, and even the responsibilities are quite different. That said, the general technology and rough architectures are similar, so there is no need to be overawed; many companies are just playing the game, with little technical content.

c. As mentioned above, large-scale website operation and maintenance is still in its infancy, and there is no mature body of knowledge. People may still be working out how to do it, or may not have thought it through at all. What actually gets discussed is only the tip of the operation and maintenance iceberg: it is limited to specific technical details or the framework of some famous website, while a true operation and maintenance system is absent. This may be why relatively little operation and maintenance material is available online, why operation and maintenance staff are hard to recruit domestically, and one of the reasons top experts are relatively rare.

II. What skills and qualities does an operation and maintenance engineer need?

First, skills. As everyone can see, operation and maintenance is a position that brings together many IT skills. Every link in the chain needs to be understood: system -> network -> storage -> protocol -> requirements -> development -> test. For some links you need to be familiar or even proficient, such as systems (fluent use of basic operating systems: *nix, Windows, ...), protocols, system development (an important part of daily work is developing automated operation and maintenance platforms and developing and managing large-scale cluster tools), common applications (such as LVS, HA, web servers, DB, middleware, and storage), networks, and IDC topology.

The main points are as follows:

1. Development ability. This is very important, because operation and maintenance tools have to be developed in-house. Development languages: Perl, Python, PHP (pick one), and Shell (awk, sed, expect, etc.). You need experience on real development projects; otherwise the work will be very painful.

2. Understanding of common applications: operating systems (currently Linux, BSD), web servers and related software (Nginx, Apache, PHP, Lighttpd, Java, ...), databases (MySQL, Oracle), and other miscellaneous things, as well as system optimization and high reliability. These are just a bonus, not a requirement; you can learn them gradually on the job, and they are not difficult. Of course, within operation and maintenance there are also different divisions of labor.

3. A solid understanding of systems, networks, security, storage, CDN, DB, and so on, including the underlying principles.

Personal qualities:

1. Communication skills and teamwork: operation and maintenance work crosses departments and job types, so it needs a lot of communication and strong teamwork. This should be a basic requirement in any modern enterprise, so I won't say more.

2. Be bold yet careful. Bold: dare to innovate and do not take the usual path; especially for a new type of work like operation and maintenance, more innovation drives progress. Careful: the operation and maintenance engineer is the website's admin, with the highest online privileges, and one careless operation can leave you regretting it for life, or in the 18th level of hell.

3. Initiative, execution, energy, and strong resistance to pressure: the IT industry changes fast, and plans often cannot keep up with change. This is even more pronounced in operation and maintenance. For example, the servers of China's major companies are often spread all over the country, and a large-scale service migration (with all the servers involved) is a real headache that still has to be completed. In this situation, high demands are placed on the initiative and execution of operation and maintenance engineers: planning, the migration scheme, seamless service migration, machine relocation, environment preparation, security assessment, performance assessment, infrastructure, coordination with all related departments, 7x24 emergency response, and so on.

4. Other basic qualities: a quick mind, strong logical thinking, and being humble, approachable, helpful, and able to see the big picture.

5. Finally, website operation and maintenance requires a spirit of exploration and innovation, solving real problems through innovative thinking. This is a profession still in its infancy (the same is true abroad, although it started earlier there), with no mature system or methodology to draw on, so progress depends on everyone's exploration and effort.

III. How to be a qualified operation and maintenance engineer

1. Guarantee that the service meets its availability requirement, such as 99.9%, and keep the online service stable. This is the operation and maintenance engineer's basic duty.

2. Continuously improve the application's reliability and robustness, optimize performance, and strengthen security. This really tests initiative and innovative thinking.

3. Make the website fully monitored: software, hardware, and running status all need monitoring and statistics, so that there are no monitoring blind spots and the application's operation can be understood in real time.

4. Solve operation and maintenance efficiency problems through innovative thinking. Most of a company's main operations still depend on manual intervention, and people need to be freed from that as much as possible.

5. Accumulate operation and maintenance experience and keep documentation complete. Operation and maintenance is a very experience-driven job; good practices and pitfalls need to be recorded to avoid repeating mistakes.

6. Planning and execution: work to a plan, do everything you can to reach the goal, and do not look for excuses.

7. Automated operation and maintenance: distill routine, mechanical work, design and develop it into tools and systems, and let the systems do it entirely, so that everyone has more time to think, to innovate, and to do the things they enjoy.

The above covers only the technical side; of course, personal awareness is also very important.
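An availability target such as the 99.9% in point 1 is easier to reason about as a downtime budget. The sketch below does only that arithmetic; the function name `downtime_budget_minutes` and the 30-day and 365-day periods are illustrative choices, not something the original text specifies.

```python
def downtime_budget_minutes(availability_percent, period_days=30):
    """Minutes of allowed downtime in the period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.9% over a 30-day month and over a 365-day year
print(round(downtime_budget_minutes(99.9, 30), 1))        # 43.2 minutes
print(round(downtime_budget_minutes(99.9, 365) / 60, 2))  # 8.76 hours
```

In other words, a 99.9% target leaves roughly 43 minutes per month for all faults and maintenance combined.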

IV. The current state and development prospects of operation and maintenance

Unlike other positions such as R&D engineer or test engineer, the operation and maintenance position does not have very clear duties or a clear career plan, and it carries less professional identity and sense of accomplishment. Operation and maintenance work may give the impression of knowing a bit of everything without being as proficient as a specialist engineer, and it gets little attention in normal times (unless something breaks online). Gradually people become confused about their career development. Why does this happen? Apart from the characteristics of the occupation itself, it is mainly caused by an insufficient understanding of operation and maintenance. The same problem appears in other positions too, but I have found it more typical of, and more likely in, operation and maintenance.

On this issue, I will talk about the current state and development prospects of website operation and maintenance. (I am still thinking this through myself, so it may not be deep or comprehensive; additions are welcome.)

Current state of operation and maintenance:

1. It is at an initial stage. Major companies have this full-time position, but the attention or importance given to it is not high, and it is seen as replaceable. At small companies the work is mostly done by people in other positions, with nobody full-time, so it cannot go deep.

2. The technical level is relatively low. The field is mainly in a stage of technical exploration and accumulation, with no systematic body of technique.

3. It involves a lot of manual labor. This is mainly related to the second point: many things still rely on manpower because good tooling has not been developed, and there is no mature automated management method for large-scale clusters. This also shows how different operating a large-scale cluster is from maintaining a dozen machines, where there is little room for operation and maintenance to develop.

4. Excellent operation and maintenance talent is extremely scarce. Major companies basically rely on training their own people, which leads to low mobility of operation and maintenance talent in the industry, so very good techniques stay locked inside the major companies, such as Google's scientific management of 500,000 machines or the operation and maintenance experience of China's top 10 Internet companies. This experience is very valuable and determines a company's core competitiveness. These problems hinder the circulation, spread, and reuse of advanced operation and maintenance techniques across the industry, and will ultimately limit the development of operation and maintenance.

5. Much excellent operation and maintenance experience is in the hands of the big companies. This is not a matter of a company's technical strength but of scale: massive PV and hardware scale, such as Baidu's terrifying traffic and http://51.com's massive data. These factors mean that the problems they encounter are ones that other, smaller companies have not encountered, or will only encounter later, whereas the big companies may already have a good solution or system.

Development prospects:

1. From the industry perspective: with the rapid development of China's Internet (the number of Chinese netizens has now jumped to first in the world), websites are growing larger and their architectures more complex. The demand for full-time website operation and maintenance engineers and website architects will become more urgent, especially for experienced, excellent operation and maintenance talent, and the more senior, the scarcer. At present China mostly recruits graduates (limited to big companies), which means high training costs, and inexperienced staff lead to slow technology updates and affect the company's technical development. Of course, graduates also have advantages: they are a blank sheet of paper, highly malleable, and integrate into corporate culture relatively easily.

2. From the personal perspective: the technical content of, and requirements for, operation and maintenance engineers will keep rising. They are also the people most familiar with the company's applications, and they will receive more and more attention.

3. Website operation and maintenance will become a comprehensive technical position that integrates multiple disciplines (network, system, development, security, application architecture, storage, etc.), giving you good room to develop personal ability and technical breadth.

4. Operation and maintenance experience will become very important and will become a personal core competitiveness: solutions and approaches to all kinds of problems, the ability to think globally, and so on.

5. Cultivating expertise and interests: because the knowledge involved in an operation and maintenance position is very broad, it is easier to cultivate or apply your own specialty or hobby, such as the kernel, networking, development, or databases. You can go very deep in one area and become an expert in it.

6. If you really no longer want to do this work, it is relatively easy to transfer to other positions without too many limitations. Of course, you have to put your heart into it.

7. Technical career direction: website/system architect.

V. Analysis of key operation and maintenance technical points

1. Large-scale cluster management

First we must clarify the concept of a cluster. A cluster is not simply the sum of a set of servers with the same function; it is the integration of the resources of multiple servers, such as their hard disks (with two or more machines), into a single whole. The common clusters today can be divided into: high-availability clusters (HA), load-balancing clusters (such as LVS), distributed storage and computing clusters (DFS, such as Google GFS and Yahoo Hadoop), and application-specific clusters (a combination of servers with a specific function, such as the DB or cache layer). The Internet industry today is mainly based on these four types. The first two are similar: if the business is simple and POST operations are relatively few, a layer-4 switch (such as F5) can solve the problem and provide high availability and load balancing. Companies on a tight budget also have open-source solutions such as LVS + HA, which are very flexible. The latter two test a company's technical strength and its understanding of its applications. The third, DFS, is mainly used for massive-data applications such as email and search. Search in particular requires not just mass storage but also data mining and user behavior analysis. For example, Google and Yahoo can keep analysis data for nearly one year, Baidu probably less than 30 days, and Sogou even less. These are critical for search accuracy and user experience.

Next, let's talk about how to manage clusters scientifically. The key points are as follows:
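To make the combination of load balancing and high availability more concrete, here is a minimal sketch of round-robin selection that skips backends marked as down. It illustrates the general idea only and is not how LVS or F5 are implemented; the class `RoundRobinBalancer` and the backend names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-a", "web-b", "web-c"])
first = [lb.pick() for _ in range(4)]   # web-a, web-b, web-c, web-a
lb.mark_down("web-b")                   # simulate a failed health check
after = [lb.pick() for _ in range(3)]   # web-b is skipped from now on
print(first, after)
```

In a real deployment the `mark_down` and `mark_up` calls would be driven by health checks rather than called by hand.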

i. Monitoring

This mainly includes fault monitoring and the monitoring of performance, traffic, load, and other status. Such monitoring bears on the healthy operation of the cluster and allows potential problems to be discovered and handled in time.

a. Service fault and status monitoring: this mainly covers the server itself, the upper-layer applications, and data interaction with associated services. For a front-end web server, for example, many types of monitoring are possible. Monitoring application port status makes it easy to discover whether the server or the application itself has crashed; ICMP packets can be used to check server health; and at a higher layer, monitoring may also cover each channel's service, using feature codes in the page to judge its status, or signing key pages to detect whether the website has been tampered with (raising an alarm and automatically restoring the tampered data). These are only some of the many possible monitoring methods, chosen according to the application's characteristics. There are also problems still to be solved; for example, when the cluster is very large, how to monitor it with high performance is itself a real problem.

b. The rest is monitoring or statistics of cluster status, which gives us the basis for managing and tuning the cluster sensibly, covering service bottlenecks, performance problems, abnormal traffic, attacks, and other issues.
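Two of the checks mentioned above, port status monitoring and signing key pages to detect tampering, can be sketched with the Python standard library. This is a minimal illustration under assumptions: the function names are hypothetical, SHA-256 is just one reasonable choice of signature, and a real system would add scheduling, alerting, and the automatic restore step.

```python
import hashlib
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def page_fingerprint(content: bytes) -> str:
    """Signature of a key page: here simply its SHA-256 digest."""
    return hashlib.sha256(content).hexdigest()


def is_tampered(content: bytes, expected_fingerprint: str) -> bool:
    """Compare the page as served now with the fingerprint recorded at release."""
    return page_fingerprint(content) != expected_fingerprint


# Record the fingerprint when the page is published, then check it later.
published = b"<html><body>Welcome</body></html>"
baseline = page_fingerprint(published)
print(is_tampered(published, baseline))                # False
print(is_tampered(b"<html>defaced</html>", baseline))  # True
```

A monitor would call `port_is_open` against each web server on a schedule and fetch the key pages to pass to `is_tampered`, raising an alarm when either check fails.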

ii. Fault management

a. Hardware faults. For clusters of thousands or tens of thousands of machines, the probability of server crashes and hardware failures is very high; server hardware problems such as crashes and damaged hard disks, power supplies, memory, and switches occur almost every day. We therefore need to take these problems into account when designing the website architecture and treat them as normal, relying on redundancy mechanisms to avoid the risk while giving system engineers enough time to handle them. (Google, for example, is said to be able to lose 800 machines at the same time without the service being affected.) This is where the skill of the operation and maintenance engineer and the website architect is tested: a good design can achieve the self-recovery capabilities described by Google, as in GFS, while with a bad design a single server crash may cause a chain failure across a large area of services that is felt directly by users.

b. Application faults. A bug may be triggered, a performance threshold exceeded, or an attack launched; the causes vary, but the important point is to have preventive measures for these problems. You cannot take it for granted that nothing will go wrong. And if something really does go wrong, how do you deal with it? This requires operation and maintenance engineers to be well prepared, including the speed of emergency response, a methodical approach to fault handling, and an effective contingency plan.
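The claim that hardware failures become a near-daily event in large clusters can be checked with simple arithmetic. The sketch below assumes that machines fail independently and evenly over the year, and the 5% annual failure rate is an invented illustrative figure, not a number from the original text or from any vendor.

```python
def expected_failures_per_day(machines, annual_failure_rate):
    """Average number of machines failing per day."""
    return machines * annual_failure_rate / 365


def prob_any_failure_today(machines, annual_failure_rate):
    """Probability that at least one machine fails on a given day."""
    daily_rate = annual_failure_rate / 365
    return 1 - (1 - daily_rate) ** machines


ASSUMED_ANNUAL_FAILURE_RATE = 0.05  # illustrative assumption only

for n in (100, 1_000, 10_000):
    print(
        n,
        round(expected_failures_per_day(n, ASSUMED_ANNUAL_FAILURE_RATE), 2),
        round(prob_any_failure_today(n, ASSUMED_ANNUAL_FAILURE_RATE), 2),
    )
```

Under these assumptions a 10,000-machine cluster averages more than one failure per day, which is why the architecture has to treat failure as normal rather than exceptional.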

iii. Automation

Automation, in short, means having tools and systems automatically complete some of our daily manual work, freeing our hands from boring, repetitive labor. For example, before tools existed, installing systems meant installing on bare metal machine by machine; 200,000 machines might take 10 people 10 days, and the labor cost is high. Now only a few simple commands are needed. There are also robot-like programs that automatically complete work that used to need manual intervention and report the results, and that have some expert-system capability to make simple judgments, choose optimizations, and so on. The benefits are obvious. It should be said that automated operation and maintenance is a goal every operation and maintenance engineer pursues, and it pays off, although it is an extremely difficult task: constantly changing business, non-standard application design and development models, network architecture changes, IDC changes, and changes in standards may all affect existing automation systems. Such systems therefore need to be modular, interface-based, and adaptable to change. Automation-related work is one of the core tasks of operation and maintenance engineers and is well worth careful thought.
