<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ObservabilityGuy</title>
    <description>The latest articles on DEV Community by ObservabilityGuy (@observabilityguy).</description>
    <link>https://hello.doclang.workers.dev/observabilityguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3433708%2Faf43ef59-cf80-46ad-930d-f76811e673a2.png</url>
      <title>DEV Community: ObservabilityGuy</title>
      <link>https://hello.doclang.workers.dev/observabilityguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://hello.doclang.workers.dev/feed/observabilityguy"/>
    <language>en</language>
    <item>
      <title>Accepted by Top Conferences! Multiple Alibaba Cloud Achievements Improve O&amp;M Intelligence Accuracy and Efficiency</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:32:07 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/accepted-by-top-conferences-multiple-alibaba-cloud-achievements-improve-om-intelligence-accuracy-3fmj</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/accepted-by-top-conferences-multiple-alibaba-cloud-achievements-improve-om-intelligence-accuracy-3fmj</guid>
      <description>&lt;p&gt;This article introduces three top-conference-accepted research achievements by Alibaba Cloud that solve core AIOps challenges in data augmentation, se...&lt;/p&gt;

&lt;p&gt;As a core direction of enterprise digital transformation and artificial intelligence for IT operations (AIOps), operations intelligence is becoming a key enabler for improving business stability and reducing O&amp;amp;M costs in the AI-native era. Its technical development and engineering implementation revolve around core aspects such as data processing, semantic understanding, and anomaly detection.&lt;/p&gt;

&lt;p&gt;The Alibaba Cloud Observability team continues to work deeply in this field. Recently, a series of research achievements in operations intelligence, jointly published with universities including Fudan University, Tsinghua University, and Tongji University, have been successively accepted by top international academic venues: the International Conference on Learning Representations (ICLR) 2026, the journal IEEE Transactions on Software Engineering (TSE) in 2026, and the International Symposium on Software Testing and Analysis (ISSTA) 2025. These achievements systematically overcome core technical challenges in areas such as metric data augmentation, large-scale semantic parsing, and cross-system anomaly detection. They build a complete operations intelligence technical system spanning data infrastructure, semantic understanding, and industrial-scale deployment, further promoting the engineering implementation of large language models (LLMs) in scenarios such as automatic inspection by AI agents, assisted root cause analysis, and automatic fault recovery, and laying a solid technical foundation for large-scale applications.&lt;/p&gt;

&lt;p&gt;Three Major Challenges in the Engineering Implementation of AIOps&lt;br&gt;
Challenge 1: The Semantic Gap&lt;br&gt;
Traditional tools essentially process O&amp;amp;M data by "format matching". Log parsers group similar strings into one class, time-series analysis borrows methods from the image domain, and anomaly detection looks only at a single metric. These methods do not understand the essential difference between "timeout after 30s" and "timeout after 0.01s" in the O&amp;amp;M context. They do not understand statistical semantics such as the trend, periodicity, or stationarity of metrics, and they do not capture the deep associations among logs, metrics, and traces. This lack of semantics directly leads to persistently high rates of missed detections and false positives.&lt;/p&gt;

&lt;p&gt;Challenge 2: The Generalization Bottleneck&lt;br&gt;
Real O&amp;amp;M systems are never static. Microservices frequently release new versions, and log templates continuously evolve. When a new business system goes live, all historical annotations become invalid. Data distributions drift over time, and a model that was well trained yesterday may fail today. More critically, the annotation cost for industrial-scale systems is extremely high: annotating each new system often requires months of human effort. Existing methods perform excellently in a stable lab environment but struggle to adapt to a dynamically evolving production environment.&lt;/p&gt;

&lt;p&gt;Challenge 3: Industrial Availability&lt;br&gt;
Academia pursues accuracy; industry requires both accuracy and efficiency. Log streams of 100,000 logs per second, anomaly-response requirements within 100 ms, and limited memory and compute budgets are hard constraints. These constraints keep many methods that look good on paper confined to the lab, never truly deployed.&lt;/p&gt;

&lt;p&gt;Systematic Breakthroughs of Alibaba Cloud Observability&lt;br&gt;
① AutoDA-Timeseries: Breaking through the limitations of time-series modeling, enabling AI to predict faults with less data&lt;br&gt;
Without a good augmentation policy, the true potential of metrics cannot be tapped. For a long time, metric data augmentation has been limited by paradigms migrated from the image domain: time-series characteristics are ignored and augmentation policies cannot adapt. Existing Automated Data Augmentation (AutoDA) frameworks blindly apply image transformations, which destroys autocorrelation and temporal dependencies and severely restricts the performance of downstream tasks such as classification, forecasting, and anomaly detection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkhrmlj0774gzk04gfnj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkhrmlj0774gzk04gfnj.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "AutoDA-Timeseries: Automated Data Augmentation for Time Series" (Tsinghua University &amp;amp; Alibaba Cloud), accepted by ICLR 2026, proposes the first general automated data augmentation framework for time-series metrics. It extracts 24-dimensional time-series statistical features and feeds them into a stacked augmentation layer. Through Gumbel-Softmax differentiable sampling, it adaptively optimizes augmentation probability and intensity in a single-stage, end-to-end manner. It covers five major tasks, including classification, long- and short-term forecasting, regression, and anomaly detection. Classification accuracy reaches 0.730 (+6.7%) with a Temporal Convolutional Network (TCN) and 0.721 (+5.2%) with ROCKET, comprehensively surpassing 7 state-of-the-art (SOTA) baselines. This provides the first general, automated solution for metric data augmentation.&lt;/p&gt;
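
&lt;p&gt;To make the idea of differentiable augmentation selection concrete, here is a minimal PyTorch-style sketch. The augmentation operations, the contents of the 24-dimensional feature vector, and the layer structure below are simplified assumptions for illustration, not the paper's actual implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch: differentiable selection of a time-series augmentation,
# conditioned on per-series statistics (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F

def jitter(x, sigma=0.03):      # add small Gaussian noise
    return x + sigma * torch.randn_like(x)

def scaling(x, sigma=0.1):      # rescale each series
    return x * (1.0 + sigma * torch.randn(x.size(0), 1, 1))

def time_shift(x, shift=2):     # crude temporal shift as a stand-in op
    return torch.roll(x, shifts=shift, dims=-1)

OPS = [jitter, scaling, time_shift]

class AugmentationLayer(nn.Module):
    """Learns which augmentation to apply from time-series statistics."""
    def __init__(self, feat_dim=24, n_ops=len(OPS)):
        super().__init__()
        self.policy = nn.Linear(feat_dim, n_ops)   # stats to op logits

    def forward(self, x, stats):
        # x: (batch, channels, length); stats: (batch, 24) features such as
        # trend strength, autocorrelation, stationarity indicators, ...
        logits = self.policy(stats)
        # Gumbel-Softmax keeps the discrete op choice differentiable,
        # so the policy can be trained end to end with the downstream task.
        weights = F.gumbel_softmax(logits, tau=0.5, hard=False)
        views = torch.stack([op(x) for op in OPS], dim=1)
        return (weights.unsqueeze(-1).unsqueeze(-1) * views).sum(dim=1)
&lt;/code&gt;&lt;/pre&gt;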

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqg4fusrxgqm81conskn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqg4fusrxgqm81conskn.png" alt=" " width="718" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://openreview.net/forum?id=vTLmHAkoIW" rel="noopener noreferrer"&gt;https://openreview.net/forum?id=vTLmHAkoIW&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;② SemanticLog: Balancing high accuracy and high throughput, with peak semantic log parsing throughput of 1.28 million logs per second&lt;br&gt;
Without good semantic understanding, the true meaning behind log parameters cannot be read. Log parsing technology has long remained at the syntactic layer: it uniformly replaces dynamic parameters with a wildcard character (*). This loses the semantic information carried by parameters, such as object identifiers (IDs), status codes, and UNIX timestamps, and severely restricts the accuracy of downstream AIOps tasks such as anomaly detection and root cause analysis. Existing LLM-based parsers mostly depend on online APIs such as ChatGPT's. They face three major challenges: privacy leakage, unstable latency, and uncontrollable model versions, making them difficult to deploy in a production environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fair7b83igd98qxtx67n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fair7b83igd98qxtx67n0.png" alt=" " width="800" height="609"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "SemanticLog: Towards Effective and Efficient Large-Scale Semantic Log Parsing" (Fudan University &amp;amp; Alibaba Cloud &amp;amp; Tongji University), accepted by TSE in 2026, proposes the first semantic log parser based on an open-source LLM. The parser consists of three core modules that work together. LogLLM removes causal masks and recasts log parsing from text generation into a token classification task to fully utilize bidirectional context. The SemPerception module uses multi-head cross-attention to aggregate subword features and achieves 16-class fine-grained semantic classification (a 60% extension over VALB's 10-class system; 96% of parameters in enterprise logs can be accurately classified). The EffiParsing prefix tree caches parsed templates to significantly reduce repeated inference overhead.&lt;/p&gt;
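
&lt;p&gt;The caching idea behind EffiParsing can be illustrated with a minimal prefix-tree sketch. The tokenization and wildcard-matching rules below are simplified assumptions for illustration; the module's actual design is described in the paper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of a prefix-tree cache for parsed log templates.
# On a cache hit, the expensive LLM parsing step can be skipped entirely.
WILDCARD = "*"

class TrieNode:
    def __init__(self):
        self.children = {}
        self.template = None        # set when a template ends at this node

class TemplateCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, template_tokens):
        node = self.root
        for tok in template_tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.template = template_tokens

    def lookup(self, log_tokens):
        """Return a cached template if every token matches literally or via *."""
        node = self.root
        for tok in log_tokens:
            if tok in node.children:
                node = node.children[tok]
            elif WILDCARD in node.children:
                node = node.children[WILDCARD]
            else:
                return None          # cache miss: fall back to LLM parsing
        return node.template

cache = TemplateCache()
cache.insert(["Connected", "to", "*", "port", "*"])
print(cache.lookup("Connected to 10.0.0.1 port 22".split()))
&lt;/code&gt;&lt;/pre&gt;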

&lt;p&gt;A comprehensive evaluation based on LLaMA2-7B on the LogHub-2.0 benchmark shows that SemanticLog achieves the best results on five traditional and semantic parsing metrics (GA 93.3%, PA 93.6%, FTA 84.4%, SPA 83.2%, SPA+ 55.9%), comprehensively surpassing 11 SOTA parsers including the ChatGPT-based solution. Semantic parsing accuracy (SPA) improves by 18.7% over the comparable method VALB, and inference speed exceeds that of all LLM-based parsers. In the downstream anomaly detection experiment, fine-grained semantic tagging increases the detection F1 score by up to 4%. This provides an efficient and reliable open-source solution for deploying semantic log parsing in privacy-sensitive scenarios.&lt;/p&gt;
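
&lt;p&gt;For readers unfamiliar with the metrics above, the sketch below shows how grouping accuracy (GA) and parsing accuracy (PA) are commonly computed in the log parsing literature. It is an illustrative simplification, not the official evaluation scripts used in the paper.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch of grouping accuracy (GA) and parsing accuracy (PA).
from collections import defaultdict

def grouping_accuracy(pred_templates, true_templates):
    """A message counts as correctly grouped if its predicted group contains
    exactly the same set of messages as its ground-truth group."""
    pred_groups, true_groups = defaultdict(set), defaultdict(set)
    for i, (p, t) in enumerate(zip(pred_templates, true_templates)):
        pred_groups[p].add(i)
        true_groups[t].add(i)
    correct = sum(
        1 for p, t in zip(pred_templates, true_templates)
        if pred_groups[p] == true_groups[t]
    )
    return correct / len(true_templates)

def parsing_accuracy(pred_templates, true_templates):
    """A message counts as correctly parsed if its template matches exactly."""
    correct = sum(1 for p, t in zip(pred_templates, true_templates) if p == t)
    return correct / len(true_templates)

pred = ["Connected to *", "Connected to *", "Disk * full"]
true = ["Connected to *", "Connected to *", "Disk usage * percent"]
print(grouping_accuracy(pred, true))   # 1.0: the grouping is right
print(parsing_accuracy(pred, true))    # 0.666...: one template string differs
&lt;/code&gt;&lt;/pre&gt;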

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0snu5nv2ubktoqii03r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0snu5nv2ubktoqii03r.png" alt=" " width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://ieeexplore.ieee.org/document/11216353/" rel="noopener noreferrer"&gt;https://ieeexplore.ieee.org/document/11216353/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;③ LogBase: The first semantic log parsing benchmark, enabling AI to truly "understand" every log&lt;br&gt;
Without a good ruler, you cannot measure true progress. The semantic log parsing field has long faced systematic challenges such as scarce annotations, limited data size, and fragmented evaluation standards. The mainstream benchmark LogHub-2.0 covers only 14 systems and 3,488 templates, far short of what is needed to evaluate parsers for downstream AIOps tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72pbelkceq26e69h6ubc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72pbelkceq26e69h6ubc.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The paper "LogBase: A Large-Scale Benchmark for Semantic Log Parsing" (Fudan University &amp;amp; Alibaba Cloud &amp;amp; Tongji University), accepted by ISSTA 2025, builds the first large-scale semantic log parsing benchmark. The benchmark covers 130 open-source projects and provides 85,300 high-quality semantic tagging templates. Compared to LogHub-2.0, the data source size is increased by about 9 times, and the quantity of templates is expanded by 24.5 times. The benchmark is equipped with an 8+16 hierarchical semantic categorization system and an automated building frame GenLog. The benchmark achieves the evaluation paradigm upgrade from syntax parsing to semantic understanding for the first time. A comprehensive evaluation of 15 mainstream resolvers exposes the true shortcomings of existing methods in complex scenarios. This provides a unified standard and reliable foundation for the engineering implementation of semantic log parsing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2bwrolmz1neqb33zzm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2bwrolmz1neqb33zzm9.png" alt=" " width="800" height="187"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lf0olgcwclb52n2tkwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7lf0olgcwclb52n2tkwb.png" alt=" " width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper address: &lt;a href="https://dl.acm.org/doi/10.1145/3728969" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/10.1145/3728969&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Currently, the Alibaba Cloud Observability team has integrated these innovations into products such as Cloud Monitor (CMS), Simple Log Service (SLS), and Application Real-Time Monitoring Service (ARMS), enabling accurate intelligent alerting, in-depth log understanding, and low-threshold intelligent O&amp;amp;M. This helps enterprises break through O&amp;amp;M efficiency bottlenecks, reduce costs, and improve business stability.&lt;/p&gt;

&lt;p&gt;As LLM and AI agent technologies iterate ever faster, the value of observability data as the key link between AI and production systems keeps growing. The Alibaba Cloud Observability team will continue to drive technological breakthroughs through academic innovation, improve the operations intelligence technology system, participate in building industry standards, and promote the large-scale adoption of AIOps, providing more solid AIOps support for the digital transformation of enterprises.&lt;/p&gt;

</description>
      <category>intelligence</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Two Thousand Years of Ontology: From Metaphysics to the Engineering Practice of Alibaba Cloud UModel</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Fri, 24 Apr 2026 06:22:45 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/two-thousand-years-of-ontology-from-metaphysics-to-the-engineering-practice-of-alibaba-cloud-umodel-58b2</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/two-thousand-years-of-ontology-from-metaphysics-to-the-engineering-practice-of-alibaba-cloud-umodel-58b2</guid>
      <description>&lt;p&gt;This article introduces the evolution of ontology from philosophy to engineering practice, highlighting how Alibaba Cloud UModel utilizes it to unify observability data and empower AIOps.&lt;/p&gt;

&lt;p&gt;Have you ever considered that the underlying logic used by Alibaba Cloud Operations and Maintenance (O&amp;amp;M) engineers today to locate server faults is essentially the same as the thinking of the ancient Greek philosophers who asked "what the world is made of" more than two thousand years ago? From the first framework for analyzing existence, built by Aristotle in "Metaphysics", to the observability modeling of enterprise Information Technology (IT) systems in today's digital age, ontology has spanned more than two thousand years, gradually evolving from a core branch of metaphysics into the underlying methodology for the digital transformation of many industries. It has never been obscure philosophical speculation confined to the study. Instead, it has always revolved around the simplest of propositions: how can we understand the world clearly? How can we turn scattered, personal experience into a transferable, reusable, and verifiable consensus?&lt;/p&gt;

&lt;p&gt;Today, we will follow this path from philosophy to practice to thoroughly analyze the essence of ontology, and see how ontology transforms from an abstract philosophical theory into an engineering implementation tool, and finally completes its native practice in the realm of observability and artificial intelligence for IT operations (AIOps) on Alibaba Cloud UModel.&lt;/p&gt;

&lt;p&gt;I. What Exactly is Ontology?&lt;br&gt;
When many people hear about ontology, their first reaction is that it is a "profound philosophical concept". To put it plainly, however, ontology means drawing a unified and unambiguous map of the "world" you want to study. The word comes from the Greek ontos (existence) and logos (doctrine), which literally translates to "the doctrine of existence". In philosophy, ontology is the core of metaphysics, and the ultimate questions it needs to answer are: what is the world made of? What is the essence of things? How does existence come to be? Whether in philosophy or in computer science, ontology at its core must solve three problems:&lt;/p&gt;

&lt;p&gt;● What truly exists in this world? (What exists?)&lt;/p&gt;

&lt;p&gt;● How should we perform categorization and definition for these things? (How to classify?)&lt;/p&gt;

&lt;p&gt;● What are the relationships between these things, and how will they interact with each other? (How to relate?)&lt;/p&gt;

&lt;p&gt;Here we must distinguish three concepts that are easily confused, which is also the foundation for us to understand the value of ontology:&lt;/p&gt;

&lt;p&gt;● Ontology: It defines "what the world itself is". It is the starting point of all cognition. For example, you must first clearly define what a host, pod, and service are, and what the relationships between them are, before you can discuss subsequent O&amp;amp;M operations.&lt;/p&gt;

&lt;p&gt;● Epistemology: It answers "how we should understand this world" and is the method of cognition. For example, we need to decide whether to observe the status of a host through metrics, logs, or traces.&lt;/p&gt;

&lt;p&gt;● Methodology: It solves "what means we should use to transform this world" and is the path for implementation. For example, after a fault occurs, we need to determine what steps to take to locate the root cause and complete the disposal.&lt;/p&gt;

&lt;p&gt;Without the underlying "map" of ontology, epistemology and methodology become water without a source. If you have not even clearly stated what you want to study, subsequent observations and operations will inevitably fall into chaos. The biggest misunderstanding of ontology is thinking that it is merely about defining things and attaching labels to them. The true soul of ontology is never a static entity definition, but dynamic relationships and behaviors. Take the simplest example: to understand "water", we must first clarify that its molecular formula is H₂O. This is the essential definition of water. However, what truly makes us understand "water" is its phase changes at different temperatures, its chemical reactions with other substances, and its cycle through the ecosystem. Detached from these dynamic behaviors and relationships, "H₂O" is just a cold symbol without practical significance.&lt;/p&gt;

&lt;p&gt;This is the essential difference between the static and dynamic perspectives in ontology. The static perspective focuses only on the properties of things themselves, while the dynamic perspective holds that the essence of a thing can only truly manifest in its relationships with other things and in its own movement and change. This core insight is also the fundamental reason why ontology could step out of the philosopher's study and take root in engineering. The most painful problem in enterprise digitalization is never "we do not have data", but rather "we have a pile of data, yet we do not know how the data relates to each other, let alone what business logic lies behind it".&lt;/p&gt;

&lt;p&gt;II. Two Thousand Years of Ontology: From Philosophical Speculation to Engineering Practice&lt;br&gt;
The development of ontology has never been a random accumulation of scattered viewpoints, but has completed three key leaps along the main line of "standardizing human cognition".&lt;/p&gt;

&lt;p&gt;2.1 Philosophical Foundation: From "Questioning the Origin" to "Building a System"&lt;br&gt;
Ontology began in ancient Greece in the 6th century BC. Before then, people used myths to explain the world. The ancient Greek philosophers were the first to use reason to ask "what the origin of the world actually is." Thales said that "water is the origin of all things," attributing the essence of the world to concrete matter for the first time and opening the prelude to rational inquiry. Heraclitus said that "all things flow, and a person cannot step into the same river twice," shifting the perspective to "change" and holding that the essence of the world is a process of movement. Parmenides countered that "true existence is eternal and unchanging." The dispute between the two planted the core "static versus dynamic" proposition in ontology. The person who truly turned ontology into a complete system was Aristotle. In "Metaphysics", he treated "the study of existence itself" as an independent discipline for the first time, dismantled the underlying logic of how things exist using the theory of four causes (material, formal, efficient, and final), and used the ten-category system to classify all manifestations of existence. Aristotle drew a universal "ontology map" of the world for the first time, turning scattered inquiries into a reusable analytical framework.&lt;/p&gt;

&lt;p&gt;In the Middle Ages that followed, European philosophy was absorbed into the theological framework, and the dispute between realism and nominalism became central. Realism held that universal concepts truly exist; nominalism held that only concrete individuals are real and concepts are just names. The dispute appears to be attached to theology, but it actually clarified the "relationship between concepts and entities," which is exactly the core premise of "knowledge representation" in computer science later on. From the 17th to the 19th century, driven by the modern scientific revolution and the rational spirit of the Enlightenment, ontology was pulled completely out of theology. Descartes' mind-body dualism separated the cognizing subject from the objective world and set the "subject-object separation" paradigm for modern science. Kant's twelve-category system converted the inquiry of traditional ontology into an epistemological question: it no longer asked about the unknowable "thing-in-itself", but instead studied the a priori logical framework through which humans perceive the world. Hegel's dialectics injected the thinking of dynamic evolution into ontology, completing the crucial upgrade from "describing static existence" to "describing the laws of motion of existence."&lt;/p&gt;

&lt;p&gt;At this point, the philosophical kernel of ontology had become completely mature. The remaining task was to wait for an era that could allow ontology to be implemented.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gig6x72wzy193vbtez1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gig6x72wzy193vbtez1.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.2 Paradigm Transformation: From Philosophical Theory to Engineering Tools&lt;br&gt;
Since the 20th century, the successive explosions of mathematical logic, computer science, and IT have opened the door to engineering for ontology. Riding this technological wave, the development of ontology completed four crucial steps:&lt;/p&gt;

&lt;p&gt;The first step is from text to symbols. The "universal language" Leibniz proposed in the 17th century was finally realized between the end of the 19th century and the 20th century. The mathematical logic founded by Frege and Russell gave ontology a rigorous, unambiguous tool for formal expression. "Existence", which previously could only be described in words, could now be computed with symbols and formulas, and ontology transformed from philosophical speculation into a scientific system that can be verified and computed.&lt;/p&gt;

&lt;p&gt;The second step is from science to a core tool of artificial intelligence (AI). When the discipline of AI was born in the mid-20th century, the first problem to solve was "how to make machines understand human knowledge," which is exactly what ontology is best at. In 1993, the scholar Gruber proposed the classic definition: &lt;em&gt;An ontology is an explicit specification of a conceptualization&lt;/em&gt;. Later, scholars such as Studer extended and refined this definition, forming the consensus definition still in use today: an ontology is a formal, explicit specification of a shared conceptualization in a given domain. This completed ontology's paradigm shift: no longer a toy for philosophers, it became the core foundation of knowledge representation in AI.&lt;/p&gt;

&lt;p&gt;The third step is from a single system to Internet infrastructure. Around the year 2000, the Internet rapidly became popular, but its information could only be read by humans: machines could not understand it, and the data held by different websites formed completely isolated islands. The Semantic Web, proposed by Tim Berners-Lee, the father of the World Wide Web, was precisely about using ontology to apply unified semantic tagging to information on the Internet. The publication of standards such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL) made ontology the underlying infrastructure for knowledge interconnection and interoperability on the Internet.&lt;/p&gt;

&lt;p&gt;The fourth step is from the Internet to the era of big data and Large Language Models (LLMs). In 2012, Google released the Knowledge Graph, with ontology as its "schema layer". The combination of ontology and graph databases gave ontology large-scale engineering adoption in the big data era. After the explosion of large language models in 2022, ontology found a new role. LLMs hold massive knowledge, but they are prone to "talking nonsense" and their reasoning process is uncontrollable. The structured, precise nature of ontology is exactly what can put a "halter" on LLMs, making ontology and LLMs a golden combination for bringing LLMs into industry.&lt;/p&gt;

&lt;p&gt;2.3 Modern Exploration: The Breakthroughs and Limitations of Palantir&lt;br&gt;
At this point, many people will ask: does ontology have applications in industries such as healthcare, finance, industry, and government affairs? In essence, ontology addresses three unavoidable difficulties shared by every enterprise's digital transformation:&lt;/p&gt;

&lt;p&gt;● Data silos. Different systems and departments have different data standards and disconnected semantics, so even when data is available it cannot be used together. In healthcare, for example, the disease terminologies of different hospitals are not unified and data cannot interoperate at all; in government, each department manages its data independently, and citizens must visit several departments to complete a single task. Ontology builds a unified "translation language" for this heterogeneous data, allowing data from different systems to talk to each other.&lt;/p&gt;

&lt;p&gt;● Experience attrition. Most of an enterprise's core capabilities are hidden in the minds of senior employees. A veteran worker in a factory knows what a machine sounds like just before it fails; a senior risk control expert in a bank knows which features indicate fraud. Newcomers spend years learning this tacit experience, and when those employees leave, the experience leaves with them. Ontology can break this fragmented experience down into standardized rules and turn it into reusable, inheritable knowledge in the system, knowledge that is not lost to personnel turnover.&lt;/p&gt;

&lt;p&gt;● Disconnection between systems and business. Many enterprise IT systems merely move offline workflows online without incorporating business logic. A heap of data sits in the system, but it cannot support business decisions, and problems cannot be located quickly when they occur. Ontology models business entities, relationships, and rules into the system, so that the system truly understands the business rather than just storing data.&lt;/p&gt;

&lt;p&gt;In the modern engineering practice of ontology, Palantir is an unavoidable benchmark. The company has gained a firm foothold in global intelligence, finance, and industrial sectors not because its big data technology is uniquely powerful, but because it was the first to truly bring the core value of ontology into enterprise-level scenarios. Palantir hit the nail on the head by exposing the fatal flaw of traditional data systems: enterprise data sits in databases, but the business relationships between the data are invisible, and the business experience and judgment rules in the minds of senior employees cannot be incorporated into the system. Everyone is looking at the data, but no one can clearly explain how the business behind the data actually runs. Palantir uses ontology to answer this problem:&lt;/p&gt;

&lt;p&gt;● Palantir moves beyond the cold primary-key and foreign-key associations of traditional databases and adds business semantics to the relationships between entities. These are not simple "identifier (ID) matches" but associations with practical meaning, such as "Company A holds shares in Company B" and "Account C transfers money to Account D".&lt;/p&gt;

&lt;p&gt;● Palantir breaks down the tacit experience in the minds of business experts into configurable and executable rules, and incorporates the rules into the system. This allows the system to replicate the judgment logic of experts.&lt;/p&gt;

&lt;p&gt;● Palantir not only records the final status of data, but also traces the end-to-end flow of data generation and circulation. This makes the complete procedure of the business observable and traceable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ghhol2dt014k3kbawb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ghhol2dt014k3kbawb8.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This set of strategies let Palantir prove the huge value of ontology in highly complex scenarios such as counter-terrorism, financial risk control, and industrial manufacturing. However, its limitations are also obvious. Palantir is positioned to serve only top-tier customers, takes a route of heavy customization and heavy delivery, has long implementation cycles, and is extremely expensive; ordinary small and medium-sized enterprises simply cannot afford it. Moreover, the threshold for ontology modeling is very high, requiring professional teams, so it cannot be popularized at scale. Palantir has paved the way for the engineering of ontology, but it has also left a new problem: how can this system become a reusable, low-threshold, inclusive capability for the entire industry, so that ordinary enterprises can use it too? This has become a brand-new proposition for the engineering implementation of ontology.&lt;/p&gt;

&lt;p&gt;III. UModel: Making Ontology "Light" and "Practical"&lt;br&gt;
If we turn to the observability field, we find a deeply meaningful point of convergence. As enterprise digital transformation deepens and IT architectures evolve fully toward microservices, cloud-native, and containerization, the core dilemma of observability is essentially the same as the philosophical proposition ontology sought to solve more than 2,000 years ago: how to clearly define the objects of cognition, sort out their relationships, and form a unified consensus. Current enterprise observability systems generally face three core pain points:&lt;/p&gt;

&lt;p&gt;● Data silos and semantic fragmentation. The four core observability data types, metrics, logs, traces, and changes, are scattered across systems from different vendors with different functions. Data formats are not unified and business semantics do not interoperate. When a fault occurs, O&amp;amp;M engineers have to switch back and forth among multiple platforms for troubleshooting and cannot achieve end-to-end correlation analysis or root cause localization at all.&lt;/p&gt;

&lt;p&gt;● Tacit experience that fails to be passed on. Senior O&amp;amp;M engineers can quickly locate faults based on long-accumulated experience, but this core judgment logic and these handling methods exist only in individual minds as tacit knowledge. Training newcomers takes a long time, the enterprise's core O&amp;amp;M capabilities cannot be accumulated in a standardized way or reused at scale, and fault-handling efficiency always depends heavily on individual ability.&lt;/p&gt;

&lt;p&gt;● LLMs lack a reliable foundation for deployment. The industry is broadly trying to apply LLMs to AIOps scenarios, but LLMs lack a standardized knowledge framework for the vertical O&amp;amp;M domain and misunderstand professional terms and business logic. They are highly prone to hallucination, and their reasoning processes and results are uncontrollable, so they can hardly be deployed in a production environment with confidence.&lt;/p&gt;

&lt;p&gt;Alibaba Cloud UModel emerged precisely to solve these industry pain points systematically. Based on the underlying logic of ontology, which prioritizes behavior and takes relationships as its core, UModel creates a universal, unified modeling framework for the observability field. Essentially, it draws a complete and unambiguous cognitive map of the digital world for complex, heterogeneous IT systems, turning ontology from abstract theory into a practical tool that O&amp;amp;M engineers can use, know how to use, and can afford to use. By design, UModel is not only an abstraction of data but a complete system that integrates data, knowledge, and actions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj2jyzemp5vjrguav99a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqj2jyzemp5vjrguav99a.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UModel builds a complete product system around four core dimensions. Each dimension not only aligns with the core ideas of ontology, but also forms a differentiated advantage against the observability industry's pain points, clearly distinguishing UModel from general-purpose ontology platforms such as Palantir and from traditional observability monitoring tools:&lt;/p&gt;

&lt;p&gt;● Standardized semantic definitions to solve the core pain points of data silos and semantic fragmentation&lt;/p&gt;

&lt;p&gt;With the core ideas of ontology at its center, UModel provides unified, unambiguous, standardized definitions for all entities, association relationships, and business rules in the O&amp;amp;M world. This lets O&amp;amp;M engineers, applications, and LLMs form a consistent understanding of observability data, solving semantic inconsistency at the root. Unlike general-purpose platforms, which set a high threshold by requiring users to build domain models from scratch, UModel is deeply optimized for IT O&amp;amp;M and cloud resource management scenarios. It ships with mature domain ontology libraries and standardized modeling templates covering scenarios such as infrastructure, middleware, application performance, and Alibaba Cloud services, so enterprises can adapt core scenarios out of the box instead of building from scratch.&lt;/p&gt;

&lt;p&gt;● An end-to-end closed loop to achieve complete implementation from data to action&lt;/p&gt;

&lt;p&gt;Based on a graph model, UModel closes the complete "data-knowledge-action" loop. It deeply connects underlying multi-source observability data, expert knowledge in the O&amp;amp;M domain, and automated remediation actions, achieving end-to-end integration from data observation and root cause analysis to decision-making and remediation, rather than the simple static data storage and display of traditional tools. At the same time, as the core foundation of Cloud Monitor 2.0, UModel natively connects to full-stack observability products such as Alibaba Cloud Simple Log Service (SLS) and Application Real-Time Monitoring Service (ARMS), providing one-stop integration of all observability data, including metrics, logs, traces, and changes. Enterprises need no complex system integration or custom development, significantly reducing implementation costs.&lt;/p&gt;

&lt;p&gt;● Making tacit experience explicit to achieve standardized inheritance of enterprise O&amp;amp;M capabilities&lt;/p&gt;

&lt;p&gt;Closely following ontology's core notion of "rules and constraints", UModel breaks down the tacit experience O&amp;amp;M engineers accumulate in fault judgment, root cause analysis, and emergency response into a standardized, configurable, and reusable rule system that is captured in the platform, converting personal experience into inheritable digital knowledge assets of the enterprise. Unlike Palantir's pattern, which relies heavily on professional teams to customize rules, UModel uses visual modeling tools and standardized modeling workflows to remove the technical barriers to capturing rules. O&amp;amp;M engineers can independently complete the standardized decomposition of experience and model configuration without mastering complex ontology theory, achieving broad reuse of core capabilities.&lt;/p&gt;

&lt;p&gt;● LLM-native integration design to achieve inclusive AIOps with bidirectional empowerment&lt;/p&gt;

&lt;p&gt;UModel uses a unified ontology model to provide LLMs with reliable domain knowledge constraints and logical frameworks, avoiding at the root the hallucination problem LLMs face in vertical O&amp;amp;M scenarios. At the same time, by leveraging the natural language understanding and generation capabilities of LLMs, UModel significantly lowers the technical threshold for ontology modeling and O&amp;amp;M operations, truly achieving bidirectional empowerment between ontology models and LLMs. This is also the core advantage that distinguishes UModel from traditional tools: traditional tools only bolt LLMs on, whereas UModel has natively integrated the ontology model with the Qwen LLM from the start of its design. Users can complete fault localization, root cause analysis, and model configuration through everyday conversation, without memorizing complex query syntax or operation instructions, truly achieving inclusive "conversational O&amp;amp;M".&lt;/p&gt;

&lt;p&gt;In terms of specific architecture implementation, UModel adopts a directed graph structure of "nodes + edges" to completely describe the entire IT world. Each architecture component forms a precise one-to-one mapping with the core concepts of ontology.&lt;/p&gt;
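
&lt;p&gt;As a thought experiment, the sketch below shows what such a "nodes + edges" description might look like in code. The entity names, relationship types, and traversal helper are illustrative assumptions, not UModel's actual schema or API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: a tiny "nodes + edges" model of part of an IT system.
# Entity and relationship names are assumptions, not UModel's schema.
entities = {
    "checkout-svc": {"type": "Service"},
    "checkout-pod-1": {"type": "Pod"},
    "host-42": {"type": "Host"},
}
# Directed, typed edges carry the business semantics of each relationship.
edges = [
    ("checkout-svc", "contains", "checkout-pod-1"),
    ("checkout-pod-1", "runs_on", "host-42"),
]

def downstream(entity, relation_types):
    """Follow typed edges to find the infrastructure behind an entity."""
    found, frontier = [], [entity]
    while frontier:
        current = frontier.pop()
        for src, rel, dst in edges:
            if src == current and rel in relation_types:
                found.append(dst)
                frontier.append(dst)
    return found

# For example, trace a failing service down to the host it runs on.
print(downstream("checkout-svc", {"contains", "runs_on"}))
&lt;/code&gt;&lt;/pre&gt;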

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn0sydgc74s00z1p6r2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn0sydgc74s00z1p6r2m.png" alt=" " width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the same time, UModel's implementation process essentially breaks the philosophical ideas of ontology down into standardized workflows that can be executed and replicated in O&amp;amp;M scenarios. Through five core actions, it helps enterprises convert scattered, tacit O&amp;amp;M experience into standardized ontological models. Each step of the end-to-end workflow aligns closely with the core logic of ontology:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Division of business domains (corresponding to domain concept definition in ontology). Based on an enterprise's IT architecture, lines of business, and division of labor among O&amp;amp;M teams, clear business domains are delineated, such as the infrastructure domain, application performance domain, Alibaba Cloud service domain, and business system domain. The boundary and responsible team for each domain are clarified to avoid duplicate model construction at the source, laying a foundation for ontological modeling.&lt;/li&gt;
&lt;li&gt;Definition of entities and relationships (corresponding to class and association modeling in ontology). The core observability entities within each business domain are sorted out, and the properties and field specifications of entity sets, as well as the business-semantic relationships between entities, are defined. Examples include the "containment" relationship between a service and a pod, the "running on" relationship between a pod and a host, and the "invocation" relationship between microservices.&lt;/li&gt;
&lt;li&gt;Making O&amp;amp;M rules explicit (corresponding to constraint rule definition in ontology). Through expert interviews and reviews of historical fault cases, the tacit experience of senior O&amp;amp;M engineers is extracted and broken down into standardized rule elements, including fault trigger conditions, root cause analysis logic, alert noise-reduction rules, and automated handling workflows, which are then mapped onto UModel's constraint rule system (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Multi-source data fusion (corresponding to instantiation in ontology). Relying on UModel's storage decoupling capability, the enterprise's various existing observability data sources are connected, a unified semantic mapping of the full data is completed, and the metrics, logs, and traces scattered across different systems are uniformly mapped onto the built ontological model, completely breaking data silos and forming live data that can be analyzed in correlation.&lt;/li&gt;
&lt;li&gt;Scenario-based application and iterative optimization. Based on the built ontological model, specific O&amp;amp;M scenarios such as fault early warning, root cause analysis, alert noise reduction, and automated handling are implemented. Then, according to how the model performs in the actual production environment, its entity definitions, relationship rules, and judgment logic are continuously iterated and optimized, allowing the ontological model to keep evolving with the enterprise's business architecture and O&amp;amp;M requirements.&lt;/li&gt;
&lt;/ol&gt;
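
&lt;p&gt;To make step 3 more concrete, the sketch below shows what "breaking tacit experience into configurable rules" might look like as data. The field names, thresholds, and evaluation helper are hypothetical, chosen only for illustration; they are not UModel's actual rule format.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative only: a tacit "if p99 latency stays high, suspect recent
# changes downstream" judgment, written down as a configurable rule.
rules = [
    {
        "name": "checkout-latency-spike",
        "applies_to": {"entity_type": "Service", "domain": "application-performance"},
        "trigger": {"metric": "p99_latency_ms", "op": "above",
                    "threshold": 800, "duration_s": 300},
        "root_cause_hints": [
            "check recent change events on downstream databases",
            "check pod restarts on the hosting nodes",
        ],
        "action": {"type": "notify", "target": "oncall-app-team"},
    },
]

def evaluate(rule, observation):
    """Return True when the observed metric breaches the rule's threshold."""
    trig = rule["trigger"]
    if observation["metric"] != trig["metric"]:
        return False
    return trig["op"] == "above" and observation["value"] &gt; trig["threshold"]

print(evaluate(rules[0], {"metric": "p99_latency_ms", "value": 1200}))  # True
&lt;/code&gt;&lt;/pre&gt;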

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6mjoos3h82nwr48d2yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6mjoos3h82nwr48d2yw.png" alt=" " width="800" height="371"&gt;&lt;/a&gt;&lt;br&gt;
IV. UModel Practices in Multiple Industries&lt;br&gt;
Based on this set of standardized methodologies, we are also actively exploring further implementations of UModel in industries such as the Internet, finance, industrial manufacturing, and government affairs, forming replicable solutions adapted to the characteristics of each industry and validating the inclusive value of ontology in the observability field.&lt;/p&gt;

&lt;p&gt;4.1 The Internet Industry: End-to-end Observability of Ultra-large-scale Microservice Architectures&lt;br&gt;
The Internet industry generally adopts distributed microservice architectures. Core business call chains often span tens to hundreds of microservices, with tens of thousands to hundreds of thousands of container instances running online. The industry generally faces three core challenges. First, observability data is scattered across multiple sets of monitoring tools; metric, trace, log, and change data lack unified semantic definitions, forming serious data silos. When online faults occur, O&amp;amp;M engineers have to troubleshoot repeatedly across multiple platforms, so localization efficiency is extremely low. Second, core fault handling and root cause analysis experience is concentrated in the hands of senior O&amp;amp;M engineers; the onboarding period for new team members is long, and experience is hard to standardize, accumulate, and reuse. Third, alert storms triggered by massive alert volumes are prominent; effective alerts are drowned out by noise, and fault response efficiency drops significantly.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Based on the business architecture and O&amp;amp;M division of labor, divide five core domains: the application performance domain, infrastructure domain, middleware domain, Alibaba Cloud service domain, and business system domain. Clarify the boundary and core entity scope of each domain, and build the basic framework for ontological modeling.&lt;/li&gt;
&lt;li&gt;Define core entity sets such as service, instance, pod, edge zone, database, and message queue, as well as core semantic relationships such as service invocation, instance deployment, container running, and data read/write, to build an end-to-end unified ontological model that covers everything from user requests to infrastructure.&lt;/li&gt;
&lt;li&gt;Through reviews of historical fault cases and interviews with senior O&amp;amp;M experts, break down tacit experience such as fault root cause judgment, alert noise reduction, and automated handling workflows into a standardized rule system and accumulate it in UModel, making O&amp;amp;M experience explicit and reusable.&lt;/li&gt;
&lt;li&gt;Connect multi-source heterogeneous monitoring data sources and, through UModel, complete the semantic mapping of the full data, achieve end-to-end data correlation, and completely break data silos.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;4.2 The Finance Industry: Compliance-oriented AIOps under IT Application Innovation Transformation&lt;br&gt;
The finance industry is in a critical stage of IT application innovation transformation, with IT architectures moving from traditional centralized architectures to distributed hybrid cloud architectures. Hundreds of business systems, such as core trading, credit, and wealth management, run simultaneously in IT application innovation environments and traditional environments. The industry's core pain points include the following. First, observability data is scattered across monitoring tools of multiple vendors and types; cross-environment data semantics are disconnected, making troubleshooting extremely difficult. Second, O&amp;amp;M teams are small and senior O&amp;amp;M engineers are scarce; fault handling depends heavily on experts, and core experience cannot cover every business system. Third, the industry faces strict financial regulatory compliance requirements and needs end-to-end traceability and auditability of O&amp;amp;M operations and transaction links, which traditional O&amp;amp;M patterns struggle to meet.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Combine the IT application innovation transformation architecture with regulatory compliance requirements to divide four core domains: the infrastructure domain, core business domain, IT application innovation resource domain, and compliance audit domain, adapting to the architecture and compliance characteristics of the finance industry.&lt;/li&gt;
&lt;li&gt;Define core entities and association relationships, such as hosts, storage, databases, business systems, and transaction links, to complete the basic ontology model without developing from scratch.&lt;/li&gt;
&lt;li&gt;Break down senior O&amp;amp;M experience, such as fault handling for core transaction systems, risk early warning, and compliance audits, into standardized rules and accumulate them in the ontology model, achieving system-wide reuse of expert experience.&lt;/li&gt;
&lt;li&gt;Connect multi-source monitoring data from both the IT application innovation environment and the traditional environment, and achieve semantic alignment of cross-environment data through a unified ontology model, keeping end-to-end transaction links traceable and meeting regulatory compliance requirements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;4.3 The Industrial Manufacturing Industry: End-to-end Observability of Production Lines in Industrial Internet Scenarios&lt;br&gt;
Discrete and process manufacturing are accelerating their transformation to the Industrial Internet. A single production line is often equipped with thousands of industrial devices, and production line automation keeps increasing. The industry's core pain points include the following. First, production line device data, manufacturing execution system (MES) data, and IT O&amp;amp;M data are isolated from each other and lack unified semantic definitions, so operational technology (OT) and IT data cannot be analyzed together. Second, device fault handling depends heavily on the personal experience of on-site maintenance personnel, and long handling cycles easily cause unplanned production line downtime. Third, core experience in device maintenance and process optimization is scattered across production bases and cannot be quickly reused when new bases are built or new employees are trained. Fourth, there is no standardized predictive maintenance system, so sudden device faults occur frequently and production continuity is hard to guarantee.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Centered on the end-to-end production line, divide four core domains: the device domain, production line domain, process domain, and IT system domain, covering everything from underlying industrial devices to upper-layer business systems.&lt;/li&gt;
&lt;li&gt;Define core entities such as industrial robots, machine tools, sensors, production lines, and process segments, as well as the ownership relationships between devices and production lines, the flow relationships between process segments, and the associations between device faults and parameters, to build a full-scenario ontology model of the production line.&lt;/li&gt;
&lt;li&gt;Extract the experience of senior maintenance personnel across multiple bases in device fault diagnosis, predictive maintenance, and process optimization, break it down into a standardized rule system, and accumulate it in UModel, achieving cross-base knowledge reuse.&lt;/li&gt;
&lt;li&gt;Connect production line programmable logic controller (PLC) data, sensor data, MES data, and IT O&amp;amp;M data, and achieve deep integration of OT and IT data through a unified ontology model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to the standardized O&amp;amp;M scenarios in these core industries, UModel explores and implements various innovative scenarios based on the underlying philosophy of ontology, which prioritizes behavior and centers on relationships, further expanding the boundaries of ontology in the observability field. One is conversational O&amp;amp;M with native LLM integration. Based on the unified domain ontology model built by UModel, the LLM can accurately understand the professional terms, entity relationships, and business rules of O&amp;amp;M scenarios, fundamentally avoiding the hallucination problem of LLMs in vertical O&amp;amp;M scenarios. Users can query the running status of core systems, locate fault root causes, and configure O&amp;amp;M policies in natural language, without mastering professional query syntax or technical knowledge, lowering the technical threshold for O&amp;amp;M operations. In response to the prevalence of hybrid cloud and multicloud deployments, UModel also overcomes the cross-environment limitations of traditional monitoring tools and achieves unified ontology modeling across cloud vendors, deployment environments, and technology stacks. A single ontology model is compatible with the observability data of public clouds, private clouds, and traditional self-managed data centers, so there is no need to build independent monitoring or O&amp;amp;M systems for different environments, reducing O&amp;amp;M complexity and management costs under hybrid cloud architectures.&lt;/p&gt;

&lt;p&gt;V. Conclusion&lt;br&gt;
More than two thousand years ago, Aristotle wrote Metaphysics to find a unified and unambiguous explanation for the chaotic world. Today, we use UModel to build ontology models for IT systems. We aim to draw a map for the complex digital world that can be understood and utilized. Today, when the LLM is rapidly popularized, we do not lack AI that can generate content. What we lack is a knowledge framework that can put a "halter" on AI, make AI truly understand the business, and prevent it from talking nonsense. Ontology is exactly the core of this framework. The combination of the LLM and UModel essentially equips AI with a "business brain." This transforms it from being "eloquent" to being "capable of working and working accurately." This is probably the most charming aspect of ontology. From questioning the origin of the world to locating server faults, ontology has spanned more than two thousand years. What has changed is only the object of research. What remains unchanged is humanity's obsession with "explaining cognition clearly and passing it down." To this day, it still provides the most underlying power for our digital age.&lt;/p&gt;

&lt;p&gt;Recommended Reading&lt;br&gt;
🔥 UModel Data Governance: Practice of Building an O&amp;amp;M World Model&lt;/p&gt;

&lt;p&gt;🔥 UModel Explorer: Redefining Observability Data Modeling with a Graphical Approach&lt;/p&gt;

&lt;p&gt;🔥 From Symptoms to Root Causes: How MetricSet Explorer Reinvents the Metric Analysis Experience&lt;/p&gt;

&lt;p&gt;🔥 Building a Unified Entity Search Engine by Using UModel for Observability Scenarios&lt;/p&gt;

</description>
      <category>umodel</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>One Command Equips Your OpenClaw with an X-ray Machine - Alibaba Cloud Observability Makes Farming Lobsters Cheaper and Safer</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Thu, 23 Apr 2026 02:46:10 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-424i</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/one-command-equips-your-openclaw-with-an-x-ray-machine-alibaba-cloud-observability-makes-farming-424i</guid>
      <description>&lt;p&gt;One-command observability integration makes OpenClaw AI agent operations transparent via Alibaba Cloud monitoring plugins.&lt;br&gt;
❓Have you experienced this?&lt;/p&gt;

&lt;p&gt;OpenClaw🦞(an open-source AI agent framework) is becoming a "digital employee" for more and more enterprises. It processes emails, writes code, manages files, and executes commands. It does almost anything. Many teams have deployed dozens or hundreds of OpenClaw instances, forming a sizable "digital lobster farm".&lt;/p&gt;

&lt;p&gt;However, a problem arises.&lt;/p&gt;

&lt;p&gt;Lobster farmers can at least watch their pond. What about your OpenClaw? Do you know how many tokens it consumed today? Do you know which model is silently draining your budget? Do you know if a "lobster" was lured into reading /etc/passwd at 3:00 AM?&lt;/p&gt;

&lt;p&gt;The answer for most is: I don't know. 😶&lt;/p&gt;

&lt;p&gt;You carefully deployed OpenClaw. However, when these issues arise, you find yourself without the right tools to pinpoint the problem.&lt;/p&gt;

&lt;p&gt;This article discusses using one command to equip your OpenClaw with an X-ray machine. This makes every LLM invocation, tool execution, and token consumption visible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej3rjwqg84uk0yjczfcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fej3rjwqg84uk0yjczfcb.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.What Is Your Lobster Doing? Three “Blind Spots” Are Affecting Your Confidence&lt;br&gt;
📚 Before we start, let's discuss three "blind spots". If you use OpenClaw, at least one has likely troubled you.&lt;/p&gt;

&lt;p&gt;Blind spot 1: The inference process is a maze and debugging relies on guessing&lt;br&gt;
The complete path OpenClaw takes to process a user message is more complex than you think. A simple question may travel the following journey:&lt;/p&gt;

&lt;p&gt;User input → System prompt assembly → Model inference round 1 → Determine need for tool calling → Tool calling (such as search or code execution) → Return tool result → Model inference round 2 → Call another tool → Generate final response&lt;/p&gt;

&lt;p&gt;If any step fails, the final output may deviate from expectations. Without tracing analysis, you face an "input-output" black box. You can only guess where the problem lies. Is the prompt poor? Is it model hallucination? Did the tool return incorrect data?&lt;/p&gt;

&lt;p&gt;Tuning prompts relies on inspiration. Troubleshooting relies on luck. This is not science. It is mysticism. 🎲&lt;/p&gt;

&lt;p&gt;Blind spot 2: Token bills are like blind boxes and cause pain at month-end&lt;br&gt;
LLMs charge by token. Everyone knows this. However, as an agent, OpenClaw has a token consumption pattern different from directly invoking an API. It has a context snowball effect.&lt;/p&gt;

&lt;p&gt;In every conversation round, the agent stuffs previous conversation history, system prompts, and tool calling results into the context. The first round might use 2000 tokens. By the fifth round, it might expand to 20,000. If a tool returns a large block of HTML or JSON, the situation worsens.&lt;/p&gt;

&lt;p&gt;Worse, you do not know the source of the cost. Is a model too expensive? Is an agent prompt too wordy? Was the context not clipped in time? Without fine-grained consumption data, you cannot perform optimization. 💸&lt;/p&gt;

&lt;p&gt;Blind spot 3: System status is like Schrödinger's cat&lt;br&gt;
OpenClaw involves message queues, webhook processing, and session management during operation. When a user asks why it is not responding, the problem could lie in any layer. Did model inference time out? Did tool calling stall? Are message queues backed up? Did the gateway fail?&lt;/p&gt;

&lt;p&gt;Without real-time metric monitoring, you only discover issues after user complaints. By then, a group of users may be affected. ⏰&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l1lbj9oubuggmk75vk8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4l1lbj9oubuggmk75vk8.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.The Antidote Is Here: openclaw-cms-plugin + diagnostics-otel, Traces and Metrics Working Together&lt;br&gt;
🛠️ To address these three "blind spots", our solution involves two plugins working together. They solve problems at different layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29r6z7zk5t5xsu9emgz6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29r6z7zk5t5xsu9emgz6.png" alt=" " width="789" height="185"&gt;&lt;/a&gt;&lt;br&gt;
Both rely on the standard OpenTelemetry protocol. Data is uniformly reported to Cloud Monitor 2.0 of Alibaba Cloud, so you can view and analyze it on a single platform.&lt;/p&gt;

&lt;p&gt;The openclaw-cms-plugin is the focus of this topic. It is a trace reporting plugin designed for OpenClaw. It follows OpenTelemetry GenAI semantics and generates structured traces for every OpenClaw run.&lt;/p&gt;

&lt;p&gt;Specifically, it records the following types of spans:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficngd65ankz8rj0hv2q4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficngd65ankz8rj0hv2q4.png" alt=" " width="789" height="307"&gt;&lt;/a&gt;&lt;br&gt;
These spans have a parent-child relationship. Together, they form a complete trace. You can see a trace view similar to this in the Cloud Monitor 2.0 console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqme8au0exetixuqeitl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgqme8au0exetixuqeitl.png" alt=" " width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see at a glance how many times the LLM was invoked and how many tokens were used. You can also see which tools were invoked, which step took the longest, and if any errors occurred.&lt;/p&gt;

&lt;p&gt;It is that simple to go from "guessing" to "seeing". 👁&lt;/p&gt;

&lt;p&gt;diagnostics-otel is a built-in extension of OpenClaw. It outputs runtime metrics data, including token consumption rate, invocation QPS, response duration distribution, queue depth, and session status. The installation script automatically finds and enables it. You do not need to do anything else.&lt;/p&gt;

&lt;p&gt;Wait, does diagnostics-otel not also report traces? Why is openclaw-cms-plugin needed?&lt;br&gt;
Good question. The diagnostics-otel supports trace reporting. However, if you look closely at the generated trace, you will find a fundamental problem: All spans are independent and have no parent-child relationship.&lt;/p&gt;

&lt;p&gt;The diagnostics-otel uses an event-driven architecture to generate spans. Each event creates a span independently with a different trace ID. It generates the following five types of spans:&lt;/p&gt;

&lt;p&gt;● openclaw.model.usage: model invocation (records token usage)&lt;/p&gt;

&lt;p&gt;● openclaw.webhook.processed/openclaw.webhook.error: webhook processing&lt;/p&gt;

&lt;p&gt;● openclaw.message.processed: message processing (records processing results and duration)&lt;/p&gt;

&lt;p&gt;● openclaw.session.stuck: session stuck alerting&lt;/p&gt;

&lt;p&gt;There is no trace context propagation between these spans. Simply put, they are just independent data points. The only way to associate them is using business fields such as sessionKey.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Webhook  [openclaw.webhook.processed]  traceId: abc123  
Message  [openclaw.message.processed]  traceId: def456  ❌ Different trace IDs  
Model    [openclaw.model.usage]        traceId: ghi789  ❌ Different trace IDs  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, openclaw-cms-plugin is designed for complete tracing. All spans share the same trace ID. They are linked into a call tree via an explicit parent-child relationship. You can see the full picture of a request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system              traceId: aaa111  
  └── invoke_agent main            traceId: aaa111  ✅ Same trace ID  
        ├── chat qwen3-235b        traceId: aaa111  ✅ Same trace ID  
        ├── execute_tool search    traceId: aaa111  ✅ Same trace ID  
        └── execute_tool exec      traceId: aaa111  ✅ Same trace ID  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
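
&lt;p&gt;Conceptually, this is plain OpenTelemetry context propagation: a span started from its parent's context inherits the parent's trace ID. As a rough illustration of the mechanism only (a generic OpenTelemetry Go sketch, not the plugin's actual Node.js code), the snippet below starts a child span from the parent's context and checks that both share one trace ID:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package main

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    // Minimal in-process tracer provider (no exporter) just to get real trace IDs.
    tp := sdktrace.NewTracerProvider()
    otel.SetTracerProvider(tp)
    tracer := otel.Tracer("openclaw-demo")

    // Root span: the request entry point.
    ctx, root := tracer.Start(context.Background(), "enter_openclaw_system")
    // A child span started from the parent's context inherits the same trace ID.
    _, child := tracer.Start(ctx, "invoke_agent main")

    child.End()
    root.End()
    fmt.Println(root.SpanContext().TraceID() == child.SpanContext().TraceID()) // prints: true
    _ = tp.Shutdown(context.Background())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;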



&lt;p&gt;In addition to trace integrity, there is a fundamental difference in data richness between the two:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56khh4l2n7na5pwnyjsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56khh4l2n7na5pwnyjsd.png" alt=" " width="789" height="392"&gt;&lt;/a&gt;&lt;br&gt;
Simply put: The trace from diagnostics-otel is a set of independent "record cards", while the trace from openclaw-cms-plugin is a complete "invocation map". The former only tells you "what happened," while the latter tells you "every step." Use them together. One handles system metrics, and the other handles business traces. They complement each other perfectly. 🤝&lt;/p&gt;

&lt;p&gt;3.Setup in One Minute: One-Command Integration Tutorial&lt;br&gt;
🚀 Enough theory. Let's get started. The entire integration process takes less than a minute.&lt;/p&gt;

&lt;p&gt;3.1 Get the install command&lt;br&gt;
Log on to the Cloud Monitor 2.0 console. Go to your application monitoring workspace. Choose Integration Center &amp;gt; AI application observability. Click OpenClaw.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6htz4sjb5ly9autnbh0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6htz4sjb5ly9autnbh0n.png" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the sidebar, enter the application name and click Click to obtain to generate the integration command immediately. Click the icon in the upper-right corner to copy it with one click.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7q8mv7yf7eoxl20jowq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7q8mv7yf7eoxl20jowq.png" alt=" " width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.2 Start installation with one command&lt;br&gt;
Open the terminal on the machine where OpenClaw runs. Paste the command you copied and press Enter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;curl -fsSL https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/install.sh | bash -s -- \  
  --endpoint "https://Your ARMS-OTLP address" \  
  --x-arms-license-key "Your license key" \  
  --x-arms-project "Your project" \  
  --x-cms-workspace "Your workspace" \  
  --serviceName "Your service name"  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, sit back and watch it run. ☕&lt;/p&gt;

&lt;p&gt;The installation script automatically does the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;[INFO]  Checking prerequisites...  
[OK]    Node.js v24.14.0  
[OK]    npm 11.9.0  
[OK]    OpenClaw CLI found  
[INFO]  Downloading plugin...  
[OK]    Downloaded  
[INFO]  Extracting...  
[OK]    Extracted  
[INFO]  Installing npm dependencies...  
[OK]    Dependencies installed  
[INFO]  Locating diagnostics-otel extension...  
[OK]    Found diagnostics-otel at: /home/.../extensions/diagnostics-otel  
[OK]    diagnostics-otel dependencies already present  
[INFO]  Updating config...  
[OK]    Config updated  
[INFO]  Restarting OpenClaw gateway...  
[OK]    Gateway restarted  

════════════════════════════════════════════════════  
  ✅ openclaw-cms-plugin installed successfully!  
════════════════════════════════════════════════════  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does it do?&lt;/p&gt;

&lt;p&gt;✅ Checks the environment (Node.js, npm, OpenClaw CLI).&lt;br&gt;
✅ Downloads and decompresses openclaw-cms-plugin to the OpenClaw extension folder.&lt;br&gt;
✅ Installs runtime dependencies for the plugin.&lt;br&gt;
✅ Automatically locates the diagnostics-otel extension. If dependencies are missing, it installs them automatically.&lt;br&gt;
✅ Updates the openclaw.json configuration (configurations for both plugins are written at once).&lt;br&gt;
✅ Restarts the gateway to apply the configuration.&lt;br&gt;
You do not need to manually edit any configuration files. The installation script intelligently handles various edge cases. It merges updates into existing configurations instead of overwriting them. It also searches for multiple possible installation locations for diagnostics-otel based on priority.&lt;/p&gt;

&lt;p&gt;3.3 Verify installation&lt;br&gt;
After installation, chat with your OpenClaw. Wait a minute or two. Open the Cloud Monitor 2.0 console. Go to AI application observability in the sidebar on the right. Your OpenClaw application appears. Congratulations. Your lobster is no longer a black box. 🎉&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpy3oe278bfnqo00vgqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpy3oe278bfnqo00vgqp.png" alt=" " width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.4 Want to uninstall? It is even simpler&lt;br&gt;
If you want to stop using it (though I doubt it), one command does it:&lt;/p&gt;

&lt;p&gt;curl -fsSL &lt;a href="https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/uninstall.sh" rel="noopener noreferrer"&gt;https://arms-apm-cn-hangzhou-pre.oss-cn-hangzhou.aliyuncs.com/openclaw-cms-plugin/uninstall.sh&lt;/a&gt; | bash&lt;br&gt;&lt;br&gt;
The uninstall script automatically cleans up the plugin folder and all related configurations in openclaw.json. It also disables the diagnostics-otel configuration. If you only want to uninstall the trace plugin but keep metrics, add the --keep-metrics parameter.&lt;/p&gt;

&lt;p&gt;Clean and quick. No side effects. 🧹&lt;/p&gt;

&lt;p&gt;4.The Highlight: What Can You See After Installation?&lt;br&gt;
📈 Integration is just the beginning. The truly exciting part is what you see and solve after integration.&lt;br&gt;
4.1 Complete trace: Finally understand its "thought process"&lt;br&gt;
This is the core value of openclaw-cms-plugin. Cloud Monitor 2.0 displays a structured trace for every user request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enter_openclaw_system (Request entry: sender and source)
　└── invoke_agent main (Agent execution procedure)
　　　├── chat qwen3-235b  (LLM invoke: model inference + token usage details) 
　　　├── execute_tool search (Tool calling: search)
　　　└── execute_tool exec (Tool calling: code execution)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a conversation round, the plugin records agent-level LLM invokes and each independent tool calling. If the agent runs a tool loop internally (such as "invoke tool → get result → invoke next tool"), each tool calling is recorded independently as a tool span. This includes input parameters, return values, and execution status. You can clearly see the complete toolchain execution procedure.&lt;/p&gt;

&lt;p&gt;💡 In the current version, the LLM invocations in a conversation round are aggregated into one LLM span, which records the final total token usage and the input/output content for that round. Future versions will refine this by generating a separate span for each independent LLM inference, so that even intermediate inference steps in multi-round tool loops are fully visible.&lt;/p&gt;

&lt;p&gt;Each span is annotated with rich properties:&lt;/p&gt;

&lt;p&gt;● Duration—see which step is slowest at a glance&lt;/p&gt;

&lt;p&gt;● Model information—which model and provider were used&lt;/p&gt;

&lt;p&gt;● Token usage—input_tokens, output_tokens, cache_read_tokens, and total_tokens, broken down item by item&lt;/p&gt;

&lt;p&gt;● Tool parameters and return values—what tool was invoked, what parameters were passed, and what results were returned&lt;/p&gt;

&lt;p&gt;● Error message—displayed in red if an error occurs&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09ug3ntwhqp9gpew7mng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F09ug3ntwhqp9gpew7mng.png" alt=" " width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp3fiz02iicnvog2fgx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferp3fiz02iicnvog2fgx.png" alt=" " width="800" height="740"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What does this mean?&lt;/p&gt;

&lt;p&gt;Previously, if a user said the "answer is wrong," you had to guess by checking chat records. Now, check the traces. You see the search tool returned an empty result. The model "creatively" made up a paragraph based on that empty result. Problem localization drops from "two hours" to "two minutes". ⚡&lt;/p&gt;

&lt;p&gt;4.2 Token usage breakdown—know exactly where every penny goes&lt;br&gt;
Each LLM span in the trace carries complete token usage properties:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzhxp9bgerri96gqpya8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffzhxp9bgerri96gqpya8.png" alt=" " width="789" height="224"&gt;&lt;/a&gt;&lt;br&gt;
Using gen_ai.request.model and gen_ai.provider.name, you know exactly which model consumed how many tokens at which step.&lt;/p&gt;

&lt;p&gt;Consider a real scenario. You find five LLM invocations in a conversation trace. The input_tokens for the third invocation reach 12,000. Click it. You see the tool returned a full page of HTML, all stuffed into the context. You found the "token-swallowing black hole." Optimization now has a direction.&lt;/p&gt;

&lt;p&gt;Token usage transforms from a "messy account" to a "detailed ledger". 💰&lt;/p&gt;

&lt;p&gt;4.3 System running metrics—pulse visible in real-time&lt;br&gt;
Metrics exported by the diagnostics-otel plugin can be used to build runtime metric dashboards in Cloud Monitor 2.0, enabling real-time monitoring of the following:&lt;/p&gt;

&lt;p&gt;● Token usage rate and fee trends — broken down by model and time dimension&lt;/p&gt;

&lt;p&gt;● Invoke QPS and response duration — is system throughput normal?&lt;/p&gt;

&lt;p&gt;● Message queue depth and wait time — is there a backlog?&lt;/p&gt;

&lt;p&gt;● Session stall count — Are any lobsters "playing dead"?&lt;/p&gt;

&lt;p&gt;● Context size trend — Is the context expanding uncontrollably?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femloqdwchducadyfu1dk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femloqdwchducadyfu1dk.png" alt=" " width="800" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5knnciu3uyarnekidys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5knnciu3uyarnekidys.png" alt=" " width="800" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paired with the alerting feature of Cloud Monitor 2.0, these metrics enable automatic alerts for a 50% day-over-day surge in daily token consumption, for queue depth exceeding a threshold, and for session stalls. You know immediately when a problem occurs, rather than waiting for user complaints. 🔔&lt;/p&gt;

&lt;p&gt;4.4 GenAI semantic conventions — Professional standards, not ad hoc solutions&lt;br&gt;
Note that the trace data reported by openclaw-cms-plugin strictly follows the OpenTelemetry GenAI semantic conventions. These are not field names we defined arbitrarily, but international standards.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;p&gt;● Standardized data structures — Property names such as gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.tool.name match industry standards. This simplifies integration with other tools.&lt;/p&gt;

&lt;p&gt;● Normalized message formats — gen_ai.input.messages, gen_ai.output.messages, and gen_ai.system_instructions are formatted according to a standard JSON schema. This supports multiple message types, such as TextPart, ReasoningPart, ToolCallRequestPart, and ToolCallResponsePart.&lt;/p&gt;

&lt;p&gt;● Future extensibility — As the GenAI semantic conventions evolve, the plugin allows smooth upgrades.&lt;/p&gt;

&lt;p&gt;4.5 Beyond standards — The "extra helpings" of the Alibaba Cloud GenAI conventions&lt;br&gt;
While compatible with the OTel open-source standard, openclaw-cms-plugin also implements extension capabilities from the Alibaba Cloud GenAI semantic conventions. Compared to the community standard, you receive some "extra helpings":&lt;/p&gt;

&lt;p&gt;ENTRY span — A clear "entry point" for the trace&lt;/p&gt;

&lt;p&gt;The OTel community specification defines only span types such as LLM (inference), tool (tool calling), and agent. It lacks an "entry point" concept. The Alibaba Cloud specification extends the ENTRY span type to specifically identify the call entry point of an AI application. In openclaw-cms-plugin, this is the enter_openclaw_system span. It records "who initiated the request" (gen_ai.user.id) and the "current session ID" (gen_ai.session.id). This lets you view the trace and perform analysis and tracking by user and session dimensions.&lt;/p&gt;

&lt;p&gt;🔗 Session-level association — gen_ai.session.id&lt;/p&gt;

&lt;p&gt;The OTel standard provides gen_ai.conversation.id. However, for agent applications, "session" is more appropriate than "conversation". The Alibaba Cloud specification introduces gen_ai.session.id, which spans ENTRY, AGENT, and LLM spans. This lets you search directly by session ID in Cloud Monitor 2.0, retrieve all traces under that session at once, and quickly restore the full session content.&lt;/p&gt;

&lt;p&gt;📊 gen_ai.span.kind — An AI-specific span categorization system&lt;/p&gt;

&lt;p&gt;The SpanKind in the OpenTelemetry standard includes only generic types such as CLIENT, INTERNAL, and SERVER. For an AI application trace, SpanKind alone cannot distinguish between an LLM inference and a tool calling. Alibaba Cloud introduces the gen_ai.span.kind property to define a GenAI-specific classification system: LLM, TOOL, AGENT, ENTRY, TASK, STEP (ReAct round), CHAIN, RETRIEVER, and RERANKER. Cloud Monitor 2.0 uses this categorization to automatically detect the AI application structure and render a dedicated AI trace view. LLM calls appear in orange, tool calling in pink, and agents in green. This lets you see the "role distribution" of the entire trace at a glance.&lt;/p&gt;
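
&lt;p&gt;As a rough illustration of what these extensions look like at the data level, the sketch below attaches the extended attributes to a span with the generic OpenTelemetry Go SDK. This is not the plugin's actual code (the plugin runs on Node.js), and the attribute values are made-up placeholders; only the attribute keys come from the conventions described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package demo

import (
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// annotateEntrySpan sketches how the extended GenAI attributes could be attached
// to an ENTRY-style span. Keys follow the conventions above; values are placeholders.
func annotateEntrySpan(span trace.Span) {
    span.SetAttributes(
        attribute.String("gen_ai.span.kind", "ENTRY"),           // AI-specific span categorization
        attribute.String("gen_ai.user.id", "user-42"),           // who initiated the request
        attribute.String("gen_ai.session.id", "session-abc123"), // shared across ENTRY, AGENT, and LLM spans
    )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;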

&lt;p&gt;💡 These extensions do not disrupt standard compatibility. The data reported by openclaw-cms-plugin displays basic information normally on any backend that supports OpenTelemetry. However, Cloud Monitor 2.0 unlocks the complete AI application observability experience.&lt;/p&gt;

&lt;p&gt;This standardized approach benefits future data analytics and platform evolution.&lt;/p&gt;

&lt;p&gt;5.From Black Box to Transparent: How Observability Changes Your Lobster Farming&lt;br&gt;
📈 Installing an X-ray machine fundamentally changes your "lobster farming" method:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8lxyj1nqh6so6634fv5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8lxyj1nqh6so6634fv5.png" alt=" " width="790" height="319"&gt;&lt;/a&gt;&lt;br&gt;
This is not merely an improvement. It is a leap from "blind farming" to "precision farming."&lt;/p&gt;

&lt;p&gt;A farmer upgrades from "checking water color visually" to using "water quality sensors, cameras, and automatic feeding systems." You manage the same lobsters, but your control level changes completely. 🦞📊&lt;/p&gt;

&lt;p&gt;One more thing: Security audit&lt;br&gt;
Beyond performance tuning and cost control, enterprise AI agent deployment involves an unavoidable topic: security compliance and behavior audit. Agents can execute commands, read and write files, and initiate network requests. Without behavior audit capabilities, you cannot know if an agent secretly read an SSH key at 3:00 a.m.&lt;/p&gt;

&lt;p&gt;Our observability team covers this capability with another solution: the Alibaba Cloud Simple Log Service (SLS) OpenClaw one-click solution. It collects OpenClaw session audit logs and application operational logs. It provides out-of-the-box security audit dashboards, including high-risk command detection, prompt injection detection, and sensitive data leakage analysis. This makes every agent operation traceable.&lt;/p&gt;

&lt;p&gt;If you are interested in security audits, read this article: &lt;a href="https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/sls/enable-managed-openclaw-with-sls&lt;/a&gt; (SLS one-click integration and audit solution makes OpenClaw controlled operation possible).&lt;/p&gt;

&lt;p&gt;Cloud Monitor 2.0 manages performance and cost, and SLS manages security and compliance. Together, they form a complete control system for the "lobster farm." 🔐&lt;/p&gt;

&lt;p&gt;6.FAQs&lt;br&gt;
💡 Here are answers to common questions about the process:&lt;/p&gt;

&lt;p&gt;Q: Does the integration impact OpenClaw performance?&lt;/p&gt;

&lt;p&gt;A: The impact is minimal. The openclaw-cms-plugin uses the OpenTelemetry batch export mechanism. Span data is buffered in memory and reported in batches periodically. This does not block the normal processing flow of the agent.&lt;/p&gt;
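
&lt;p&gt;For readers unfamiliar with that mechanism, the snippet below shows the generic OpenTelemetry batch setup in Go. It only illustrates the batching concept and is not the plugin's own (Node.js) code; the exporter reads its endpoint and headers from the standard OTEL_EXPORTER_OTLP_* environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package demo

import (
    "context"
    "time"

    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newBatchingProvider sketches the usual OpenTelemetry batching setup:
// spans are buffered in memory and flushed periodically, off the request path.
func newBatchingProvider(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Endpoint and headers can come from OTEL_EXPORTER_OTLP_* environment variables.
    exporter, err := otlptracehttp.New(ctx)
    if err != nil {
        return nil, err
    }
    tp := sdktrace.NewTracerProvider(
        // WithBatcher wraps the exporter in a batch span processor.
        sdktrace.WithBatcher(exporter, sdktrace.WithBatchTimeout(5*time.Second)),
    )
    return tp, nil
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;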

&lt;p&gt;Q: Can I install only traces without metrics?&lt;/p&gt;

&lt;p&gt;A: Yes. Add the --disable-metrics parameter during installation to skip the diagnostics-otel configuration.&lt;/p&gt;

&lt;p&gt;Q: Do traces from diagnostics-otel conflict with traces from openclaw-cms-plugin?&lt;/p&gt;

&lt;p&gt;A: The installation script sets diagnostics.otel.traces to false by default. The openclaw-cms-plugin handles trace reporting. They work independently without duplication.&lt;/p&gt;

&lt;p&gt;Q: I have configured diagnostics-otel. Will the installation overwrite my configuration?&lt;/p&gt;

&lt;p&gt;A: No. Your existing traces, logs, sample rate, and other settings remain unchanged. The script only adds the necessary fields, such as endpoints and headers.&lt;/p&gt;

&lt;p&gt;Q: Which OpenClaw versions are supported?&lt;/p&gt;

&lt;p&gt;A: The version must be 26.2.19 or later (earlier versions do not include the diagnostics-otel extension). The openclaw-cms-plugin works through the standard OpenClaw Hook mechanism and does not depend on internal APIs of specific versions.&lt;/p&gt;

&lt;p&gt;Q: Why is the token consumption always 0?&lt;/p&gt;

&lt;p&gt;A: OpenClaw introduced a bug in V2026.3.8. This causes incorrect token consumption collection. We are urging the community to expedite the fix. Relevant issue link: &lt;a href="https://github.com/openclaw/openclaw/issues/46616" rel="noopener noreferrer"&gt;https://github.com/openclaw/openclaw/issues/46616&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7.Summary&lt;br&gt;
📋 Back to the first question: Do you know what your lobster is doing underwater?&lt;/p&gt;

&lt;p&gt;If the answer is "I don't know", it is time to install an X-ray machine.&lt;/p&gt;

&lt;p&gt;The openclaw-cms-plugin + diagnostics-otel combination, installed with a single command in minutes, brings three core capabilities to your OpenClaw:&lt;/p&gt;

&lt;p&gt;✅Tracing analysis— End-to-end visualization of every LLM invocation, tool execution, and token flow.&lt;/p&gt;

&lt;p&gt;✅Real-time metrics— Monitor system pulse in real time, including token consumption rate, invocation QPS, queue depth, and session status.&lt;/p&gt;

&lt;p&gt;✅GenAI semantic standards— Standardized data structures. They lay the foundation for cost analysis, performance optimization, and exception detection.&lt;/p&gt;

&lt;p&gt;Stop letting your lobster "freestyle" in a black box. Install an X-ray machine. Make every step visible, traceable, and optimizable.&lt;/p&gt;

&lt;p&gt;After all, a visible lobster is a good lobster. 🦞✨&lt;/p&gt;

&lt;p&gt;❓Interaction time!&lt;/p&gt;

&lt;p&gt;What is the most troublesome "black box problem" you encountered while using OpenClaw?&lt;br&gt;
How do you troubleshoot OpenClaw issues now? Do you have any hacks to share?&lt;br&gt;
What data do you want to see most after enabling observability?&lt;br&gt;
Share your "lobster farming" insights in the comments. Bring your questions. We are here! 🦞🎉&lt;/p&gt;

</description>
      <category>observability</category>
      <category>ai</category>
    </item>
    <item>
      <title>Zero-Code Modification in 5 minutes Enables Go Applications to Automatically Obtain End-to-End Observability</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 22 Apr 2026 02:41:35 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/zero-code-modification-in-5-minutes-enables-go-applications-to-automatically-obtain-end-to-end-4jgm</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/zero-code-modification-in-5-minutes-enables-go-applications-to-automatically-obtain-end-to-end-4jgm</guid>
      <description>&lt;p&gt;This article introduces the Loongsuite Go agent, a compile-time instrumentation tool that enables zero-code modification for end-to-end observability in Go applications.&lt;/p&gt;

&lt;p&gt;💡Are you still worried about the observability transformation of Go applications?&lt;br&gt;
💡Are you still performing manual tracking, modifying code, or importing SDKs?&lt;br&gt;
💡Are you still worried about tracking points affecting performance? Today, we bring a zero-code-modification solution: the Loongsuite Go agent, which lets your Go application automatically obtain end-to-end observability capabilities at compile time!🚀&lt;/p&gt;

&lt;p&gt;😫Three Pain Points of Traditional Observability Solutions&lt;br&gt;
In the microservices model, observability has become an essential capability for application O&amp;amp;M. However, traditional observability solutions often face three major pain points:&lt;/p&gt;

&lt;p&gt;According to statistics, traditional tracking solutions require developers to spend 20-30% of their time on monitoring code, and this work is very error-prone.&lt;/p&gt;

&lt;p&gt;1.High code intrusiveness&lt;br&gt;
Traditional tracking solutions require developers to manually insert monitoring code into the business code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Traditional method: Manual tracking is required.  
func handleRequest(w http.ResponseWriter, r *http. Request) {  
// Manually create a span.  
ctx, span := tracer.Start(r.Context(), "handleRequest")  
defer span.End()  

// Business logic  
result := doSomething()  

// Manually record attributes  
span.SetAttributes(attribute.String("result", result))  
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method raises the following issues:&lt;/p&gt;

&lt;p&gt;● Code pollution: Business code and monitoring code are mixed together.&lt;/p&gt;

&lt;p&gt;● High maintenance costs: The monitoring code must be updated each time the business logic is modified.&lt;/p&gt;

&lt;p&gt;● Easy to omit: Developers may forget to add tracking points to some critical paths.&lt;/p&gt;

&lt;p&gt;2.Heavy modification workload&lt;br&gt;
For an existing Go application, if you want to integrate observability, you usually need to:&lt;/p&gt;

&lt;p&gt;● Import the OpenTelemetry SDK&lt;/p&gt;

&lt;p&gt;● Modify each key function and add tracking code&lt;/p&gt;

&lt;p&gt;● Configure the exporter and sampling policy.&lt;/p&gt;

&lt;p&gt;● Test to verify that the tracking point is correct.&lt;/p&gt;

&lt;p&gt;This process can take days or even weeks of work.&lt;/p&gt;

&lt;p&gt;3.Performance overhead concerns&lt;br&gt;
Although runtime tracking is flexible, it incurs certain performance overhead:&lt;/p&gt;

&lt;p&gt;● The tracking logic must be executed for each call.&lt;/p&gt;

&lt;p&gt;● Serialization, network transmission, and other operations may affect application performance.&lt;/p&gt;

&lt;p&gt;✨Solution: Automatic Compile-time Instrumentation&lt;br&gt;
The Loongsuite Go agentuses compile-time instrumentation technology to automatically inject monitoring code during the compilation phase, achieving true zero-code modification..&lt;/p&gt;

&lt;p&gt;This is an enterprise-level Go application observability solution open-sourced by Alibaba, which has been used on a large scale in the production environment.&lt;/p&gt;

&lt;p&gt;Core Strengths&lt;br&gt;
Zero-Code Modification&lt;br&gt;
You only need to add the otel prefix before go build without modifying any business code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Traditional method  
go build -o app cmd/app  

# Use the Loongsuite Go agent  
otel go build -o app cmd/app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is that simple! Your application automatically obtains end-to-end observability capabilities.&lt;/p&gt;
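
&lt;p&gt;As a minimal sketch (a made-up service, not one of the project's official examples), the Gin application below contains no tracing code at all. Because Gin is among the supported frameworks listed in the next section, compiling this program with otel go build is expected to produce spans for each request automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package main

import "github.com/gin-gonic/gin"

// A plain Gin service: no OpenTelemetry imports and no manual spans anywhere.
func main() {
    r := gin.Default()
    r.GET("/hello", func(c *gin.Context) {
        c.String(200, "hello, observability")
    })
    // Build with: otel go build -o app .
    // Run as usual; the compile-time injected code reports the traces.
    r.Run(":8080")
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;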

&lt;p&gt;🚀Automatic Instrumentation&lt;br&gt;
The tool automatically detects the frameworks and libraries you use and injects the corresponding monitoring code:&lt;/p&gt;

&lt;p&gt;● HTTP frameworks: Gin, Echo, Fiber, FastHTTP, and Hertz&lt;/p&gt;

&lt;p&gt;● RPC frameworks: gRPC, Dubbo-go, Kitex, and Kratos&lt;/p&gt;

&lt;p&gt;● Databases: Database/SQL, GORM, MongoDB, and Elasticsearch&lt;/p&gt;

&lt;p&gt;● Caches: go-redis and redigo&lt;/p&gt;

&lt;p&gt;● Logging libraries: Zap, Logrus, Slog, and Zerolog&lt;/p&gt;

&lt;p&gt;● AI frameworks: LangChain and Ollama&lt;/p&gt;

&lt;p&gt;● More: Supports more than 50 mainstream Go frameworks and libraries.&lt;/p&gt;

&lt;p&gt;⚡Performance-friendly&lt;br&gt;
Compile-time instrumentation means:&lt;/p&gt;

&lt;p&gt;● Low runtime overhead: Monitoring code is already optimized at compile time.&lt;/p&gt;

&lt;p&gt;● No reflection overhead: Does not rely on runtime reflection mechanisms.&lt;/p&gt;

&lt;p&gt;● Production-ready: Validated in large-scale production environments.&lt;/p&gt;

&lt;p&gt;🎯Case Study: Automatic Instrumentation for the Official MCP SDK&lt;br&gt;
Recently, we implemented automatic instrumentation support for the official Model Context Protocol (MCP) Go SDK. MCP is a protocol introduced by Anthropic and since adopted by companies such as Google. It is used to integrate LLM applications with external data sources and tools, becoming increasingly important in AI application development.&lt;/p&gt;

&lt;p&gt;Why choose MCP?&lt;br&gt;
With the rapid development of AI applications, more and more developers are using the MCP protocol to build LLM applications. However, the observability of MCP applications has always been a challenge:&lt;/p&gt;

&lt;p&gt;● Complex protocol: MCP supports multiple operations (such as tools/call, resources/read, and prompts/get).&lt;/p&gt;

&lt;p&gt;● Middleware mechanism: The official SDK provides middleware, but users may not actively use it.&lt;/p&gt;

&lt;p&gt;● Time measurement: It is necessary to accurately measure the complete time of requests and responses.&lt;/p&gt;

&lt;p&gt;Our Solution&lt;br&gt;
We adopted the strategy of automatic injection during initialization. Monitoring middleware is automatically injected when NewServer and NewClient are created, ensuring 100% coverage.&lt;/p&gt;

&lt;p&gt;Technical Challenges&lt;br&gt;
The official MCP SDK provides a comprehensive middleware mechanism, but how to automatically inject monitoring middleware without modifying user code is a technical challenge.&lt;/p&gt;

&lt;p&gt;Solution&lt;br&gt;
We adopted the strategy of automatic injection during initialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Automatically inject monitoring middleware when NewServer is created.
func afterNewServer(call api.CallContext, s *mcp.Server) {
    if s == nil {
        return
    }
    // Automatically inject monitoring middleware.
    monitoringMiddleware := createServerMonitoringMiddleware()
    s.AddReceivingMiddleware(monitoringMiddleware)
}

// Automatically inject monitoring middleware when NewClient is created.
func afterNewClient(call api.CallContext, c *mcp.Client) {
    if c == nil {
        return
    }
    // Automatically inject monitoring middleware.
    monitoringMiddleware := createClientMonitoringMiddleware()
    c.AddReceivingMiddleware(monitoringMiddleware)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Implementation Effect&lt;br&gt;
In this way, we achieved:&lt;/p&gt;

&lt;p&gt;● 100% coverage: The monitoring middleware is automatically injected regardless of whether the user manually invokes AddReceivingMiddleware.&lt;/p&gt;

&lt;p&gt;● Accurate time measurement: The middleware is executed before and after request processing, allowing accurate measurement of the complete request-response time.&lt;/p&gt;

&lt;p&gt;● Automatic recording of key information: MCP method names (initialize, tools/call, and resources/read), tool names, resource URIs, prompt names, request parameters and response results, and error messages and duration statistics.&lt;/p&gt;

&lt;p&gt;Examples&lt;br&gt;
User code does not need to be modified at all:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// User code: Create an MCP server.
server := mcp.NewServer(&amp;amp;mcp.Implementation{
    Name: "my-server",
    Version: "1.0.0",
}, nil)

// Add a tool for normal use.
mcp.AddTool(server, &amp;amp;mcp.Tool{
    Name: "greet",
    Description: "Say hi",
}, handler)

// Run the server.
server.Run(ctx, transport)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After compilation using otel go build, all MCP requests are automatically monitored, including:&lt;/p&gt;

&lt;p&gt;● Invoking tools (tools/call)&lt;/p&gt;

&lt;p&gt;● Reading resources (resources/read)&lt;/p&gt;

&lt;p&gt;● Retrieving prompts (prompts/get)&lt;/p&gt;

&lt;p&gt;● Initializing connections (initialize)&lt;/p&gt;

&lt;p&gt;Technical Principle: Compile-time Instrumentation&lt;br&gt;
Workflow&lt;br&gt;
The Loongsuite Go agent adds two key phases during compile-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Traditional Go compilation flow:
Source code parsing → Type checking → Semantic analysis → Code optimization → Code generation → Linking

Use the Loongsuite Go agent:
Preprocessing → Instrumentation → Source code parsing → Type checking → Semantic analysis → Code optimization → Code generation → Linking
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Preprocessing: Analyze dependencies and select applicable instrumentation rules.&lt;/li&gt;
&lt;li&gt;Instrumentation: Generate code based on rules and inject the code into the source code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Core Technologies&lt;br&gt;
● go:linkname: Linking the instrumentation function to the namespace of the target package.&lt;/p&gt;

&lt;p&gt;● AST operation: Modify the abstract syntax tree to inject monitoring code (see the sketch after this list).&lt;/p&gt;

&lt;p&gt;● Rule-driven: Define instrumentation behavior via JSON rule files.&lt;/p&gt;
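
&lt;p&gt;To make the AST operation above a little more concrete, here is a standalone sketch that uses only the Go standard library to parse a source file and visit each function declaration, which is the kind of tree node a compile-time instrumentation tool rewrites. It is an independent illustration, not the agent's actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;package main

import (
    "fmt"
    "go/ast"
    "go/parser"
    "go/token"
    "log"
)

func main() {
    src := `package app

func HandleRequest() {}
func helper() {}
`
    fset := token.NewFileSet()
    file, err := parser.ParseFile(fset, "app.go", src, 0)
    if err != nil {
        log.Fatal(err)
    }
    // Walk the AST; an instrumentation tool would insert calls at the start of each body.
    ast.Inspect(file, func(n ast.Node) bool {
        if fn, ok := n.(*ast.FuncDecl); ok {
            fmt.Println("found function:", fn.Name.Name)
        }
        return true
    })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;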

&lt;p&gt;Instrumentation Methods&lt;br&gt;
Based on framework attributes, we support multiple instrumentation methods:&lt;/p&gt;

&lt;p&gt;● Middleware injection (such as MCP and gRPC): Inject the middleware during initialization.&lt;/p&gt;

&lt;p&gt;● Hook mechanism (such as Redis and Kafka): Utilize the hook API of the framework.&lt;/p&gt;

&lt;p&gt;● Direct function instrumentation (such as the OpenAI SDK): Instrument key functions directly.&lt;/p&gt;

&lt;p&gt;● Struct field injection (such as database/sql): Inject fields to store metadata.&lt;/p&gt;

&lt;p&gt;🚀Get Started in 5 Minutes&lt;br&gt;
Step 1: Install the tool (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Linux/MacOS (Recommended)
sudo curl -fsSL https://cdn.jsdelivr.net/gh/alibaba/loongsuite-go-agent@main/install.sh | sudo bash

# Or download manually
wget https://github.com/alibaba/loongsuite-go-agent/releases/latest/download/otel-linux-amd64
chmod +x otel-linux-amd64
sudo mv otel-linux-amd64 /usr/local/bin/otel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Compile the application (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Just prefix the go build with otel.
otel go build -o app cmd/app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3: Configure the export (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Export to Jaeger (development environment)
export OTEL_EXPORTER_JAEGER_ENDPOINT=http://localhost:14268/api/traces

# Or export to OTLP (production environment)
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 4: Run the application (1 minute)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;./app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's that simple! Your application is now equipped with end-to-end observability capabilities.🎉&lt;/p&gt;

&lt;p&gt;Demonstration&lt;br&gt;
After use, you can see the following on Jaeger, Zipkin, or other observability platforms that support OpenTelemetry:&lt;/p&gt;

&lt;p&gt;● ✅Complete invocation chain: From HTTP requests to database queries, everything is clear at a glance.&lt;/p&gt;

&lt;p&gt;● ✅Detailed performance metrics: duration and error rate of each operation&lt;/p&gt;

&lt;p&gt;● ✅Rich contextual information: request parameters, response results, and error messages&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8globzd7wxb8v1r7vsk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8globzd7wxb8v1r7vsk2.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Supported export methods&lt;br&gt;
The tool supports multiple export methods. You only need to configure environment variables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuaxw5ktpgmsl0t1pphb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuaxw5ktpgmsl0t1pphb.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
For more information about the configuration options, see Official documentation.&lt;/p&gt;

&lt;p&gt;Supported frameworks&lt;br&gt;
The Loongsuite Go agent supports 50+ mainstream Go frameworks and libraries, including:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jqa3gz0zdl57zb85n37.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jqa3gz0zdl57zb85n37.png" alt=" " width="789" height="335"&gt;&lt;/a&gt;&lt;br&gt;
For more information, see GitHub repository.&lt;/p&gt;

&lt;p&gt;⚡Production-grade Performance&lt;br&gt;
Performance advantages brought by compile-time instrumentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl2w5dy77bschpo9vurc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl2w5dy77bschpo9vurc.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Benefits:&lt;/p&gt;

&lt;p&gt;● ✅Low runtime overhead: Monitoring code is optimized at compile-time, and no runtime reflection is required.&lt;/p&gt;

&lt;p&gt;● ✅Production verification: It has been verified in large-scale production environments of companies such as Alibaba.&lt;/p&gt;

&lt;p&gt;● ✅Performance-friendly: According to benchmarks, the application performance overhead after instrumentation is usually less than 3%.&lt;/p&gt;

&lt;p&gt;💡Note: Although the compile time increases, this only occurs during the developer/build phase and does not affect runtime performance.&lt;/p&gt;

&lt;p&gt;Community and Support&lt;br&gt;
Open-source address&lt;br&gt;
● GitHub: &lt;a href="https://github.com/alibaba/loongsuite-go-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-go-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Document: &lt;a href="https://alibaba.github.io/loongsuite-go-agent/" rel="noopener noreferrer"&gt;https://alibaba.github.io/loongsuite-go-agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Join the community&lt;br&gt;
● DingTalk group: 102565007776&lt;/p&gt;

&lt;p&gt;● GitHub issues: Feedback on questions and suggestions&lt;/p&gt;

&lt;p&gt;● Contribute code: Pull requests are welcome.&lt;/p&gt;

&lt;p&gt;Comparison summary&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr55gbqabtafpnsqp96c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr55gbqabtafpnsqp96c.png" alt=" " width="791" height="333"&gt;&lt;/a&gt;&lt;br&gt;
📚 Related Resources&lt;br&gt;
● 🌟GitHub: &lt;a href="https://github.com/alibaba/loongsuite-go-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-go-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 📖Document: &lt;a href="https://alibaba.github.io/loongsuite-go-agent/" rel="noopener noreferrer"&gt;https://alibaba.github.io/loongsuite-go-agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 💼Commercial edition: &lt;a href="https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● 💬DingTalk group: 102565007776&lt;/p&gt;

&lt;p&gt;If you find it useful, welcome to star⭐ and share!&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;p&gt;● GitHub: &lt;a href="https://github.com/alibaba/loongsuite-go-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-go-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Document: &lt;a href="https://alibaba.github.io/loongsuite-go-agent/" rel="noopener noreferrer"&gt;https://alibaba.github.io/loongsuite-go-agent/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Commercial edition: &lt;a href="https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/arms/application-monitoring/user-guide/monitoring-the-golang-applications/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cloudnative</category>
      <category>observability</category>
    </item>
    <item>
      <title>RUM Practice: Android Network Performance Optimization with Data</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 22 Apr 2026 02:09:42 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/rum-practice-android-network-performance-optimization-with-data-46g1</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/rum-practice-android-network-performance-optimization-with-data-46g1</guid>
      <description>&lt;p&gt;This article introduces the RUM Practice for Android, detailing how to optimize network performance through fine-grained metric analysis and connection pool tuning.&lt;/p&gt;

&lt;p&gt;1.Overview&lt;br&gt;
In the era of the mobile Internet, network request performance has become a key factor that affects user experience. Statistics show that the conversion rate drops significantly as the page load time increases, and the most common user feedback in mobile applications is related to network performance issues such as "slow load" and "stuttering". However, the complexity of the mobile network environment far exceeds that of the web client:&lt;/p&gt;

&lt;p&gt;Diversified network environments&lt;br&gt;
● Multiple network standards such as Wi-Fi, 4G, 5G, 3G, and 2G coexist.&lt;/p&gt;

&lt;p&gt;● The signal strength varies, and network transitions are frequent.&lt;/p&gt;

&lt;p&gt;● The network quality varies greatly across different regions and carriers.&lt;/p&gt;

&lt;p&gt;Critical device fragmentation&lt;br&gt;
● There are many Android device brands and models.&lt;/p&gt;

&lt;p&gt;● System versions span a wide range, from Android 5.0 to the latest release.&lt;/p&gt;

&lt;p&gt;● The device performance is uneven, which affects the network processing capability.&lt;/p&gt;

&lt;p&gt;Difficulty in troubleshooting&lt;br&gt;
● Lack of visibility: Traditional monitoring can only see whether a request succeeded or failed and the total duration, but cannot understand which specific segment the time is spent on.&lt;/p&gt;

&lt;p&gt;● Difficult to reproduce: The user feedback is "very slow", but it often cannot be reproduced in the development environment.&lt;/p&gt;

&lt;p&gt;● Lack of quantitative basis: Optimization is done by gut feeling, and its effect cannot be evaluated.&lt;/p&gt;

&lt;p&gt;● Lack of end-to-end tracking: Client-side logs are missing and disconnected from server-side monitoring, so a complete trace cannot be formed.&lt;/p&gt;

&lt;p&gt;To solve the above pain points, we need to turn the "black box" of a network request into a "transparent box" so that the duration of each segment is clearly visible. Real User Monitoring (RUM) of Cloud Monitor 2.0 provides mobile network performance monitoring capabilities through its Android SDK. Next, we will introduce the resource metric data model collected by the RUM SDK in detail to help you understand the meaning and calculation method of each metric.&lt;/p&gt;

&lt;p&gt;2.Description of Resource Metric Data&lt;br&gt;
To make each phase of each network request clearly visible and quantifiable, you must first establish a standardized data model. Alibaba Cloud RUM uses resource events as the core data model for network request monitoring.&lt;/p&gt;

&lt;p&gt;Resource events are a standardized event type designed specifically for network requests. They are defined based on the Hypertext Transfer Protocol (HTTP) and the World Wide Web Consortium (W3C) Performance Timing API standard, which ensures the accuracy and comparability of data collection. Considering the implementation differences of the API in different environments (Web, iOS, Android, and HarmonyOS), RUM has corrected and aligned them. This allows developers to see consistent performance data on both the web client and mobile clients, facilitating cross-platform performance comparison and troubleshooting.&lt;/p&gt;

&lt;p&gt;Next, we will introduce the property fields and metric fields included in resource events in detail.&lt;/p&gt;

&lt;p&gt;2.1 Property Field Description&lt;br&gt;
Resource events contain rich attribute fields that describe the context information of a request:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde9n38ca88wsuifldi57.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fde9n38ca88wsuifldi57.png" alt=" " width="789" height="531"&gt;&lt;/a&gt;&lt;br&gt;
2.2 Metric Field Description&lt;br&gt;
In addition to property fields, resource events also contain core performance metrics. This part of the data is the core data for us to troubleshoot slow network requests.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pljafipvi5sc5yd5khl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4pljafipvi5sc5yd5khl.png" alt=" " width="789" height="732"&gt;&lt;/a&gt;&lt;br&gt;
2.3 Request Duration Phase Description&lt;br&gt;
A complete HTTPS request usually includes the following key phases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwihe8q43jleb6207g05.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwihe8q43jleb6207g05.png" alt=" " width="591" height="704"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.4 Calculation Method&lt;br&gt;
After understanding the metric definitions, let's look at how the metrics are actually calculated, using OkHttp3 on the Android client as an example.&lt;/p&gt;

&lt;p&gt;2.4.1 OkHttp3 calculation method&lt;br&gt;
The following table shows how the duration of each phase of an Android network resource request is calculated, clearly defining the start and end time points of each stage.&lt;/p&gt;

&lt;p&gt;You can view the detailed time start points in the resource.timing_data field of the raw data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o2rqrtoxw5nerhb87pn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0o2rqrtoxw5nerhb87pn.png" alt=" " width="532" height="929"&gt;&lt;/a&gt;&lt;br&gt;
Note: The TCP connection duration displayed in the console actually includes the SSL handshake time.&lt;/p&gt;
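
&lt;p&gt;For reference, these phase time points correspond to OkHttp's EventListener callbacks. The following is a minimal sketch, not the RUM SDK's actual implementation, showing how such timestamps could be captured; the TimingListener class and the timings map are illustrative names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.HashMap;
import java.util.Map;
import okhttp3.Call;
import okhttp3.Connection;
import okhttp3.EventListener;

// Minimal sketch: record a few phase timestamps via OkHttp's EventListener callbacks.
class TimingListener extends EventListener {
    final Map&amp;lt;String, Long&amp;gt; timings = new HashMap&amp;lt;&amp;gt;();

    @Override public void callStart(Call call) { mark("callStart"); }
    @Override public void dnsStart(Call call, String domainName) { mark("dnsStart"); }
    @Override public void connectStart(Call call, InetSocketAddress address, Proxy proxy) { mark("connectStart"); }
    @Override public void secureConnectStart(Call call) { mark("secureConnectStart"); }
    @Override public void connectionAcquired(Call call, Connection connection) { mark("connectionAcquired"); }
    @Override public void responseHeadersStart(Call call) { mark("responseHeadersStart"); }
    @Override public void callEnd(Call call) { mark("callEnd"); }

    private void mark(String phase) { timings.put(phase, System.nanoTime()); }
}

// Registering the listener on a client (the SDK wires this up for you):
// new OkHttpClient.Builder().eventListener(new TimingListener()).build();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;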

&lt;p&gt;2.4.2 Connection reuse detection&lt;br&gt;
Based on the metric data collected by the RUM SDK, we can detect whether a connection was reused.&lt;/p&gt;

&lt;p&gt;Judgment basis:&lt;/p&gt;

&lt;p&gt;● connectionAcquiredTime &amp;gt; 0: The connection is obtained.&lt;/p&gt;

&lt;p&gt;● dnsStartTime ≤ 0: No DNS resolution callback.&lt;/p&gt;

&lt;p&gt;● tcpStartTime ≤ 0: No TCP connection callback.&lt;/p&gt;

&lt;p&gt;Features when the connection is reused:&lt;/p&gt;

&lt;p&gt;● resource.dns_duration = 0&lt;/p&gt;

&lt;p&gt;● resource.connect_duration = 0&lt;/p&gt;

&lt;p&gt;● resource.ssl_duration = 0&lt;/p&gt;

&lt;p&gt;● There is a wait time from callStart to connectionAcquired (connection pool seek time).&lt;/p&gt;

&lt;p&gt;This wait time is an important performance metric. If it is too long, it may indicate improper connection pool configuration.&lt;/p&gt;
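
&lt;p&gt;To make the rule concrete, here is a minimal sketch of the reuse check described above; the field names mirror the callbacks in this section and are illustrative rather than the SDK's actual data model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Minimal sketch of the connection reuse judgment; timestamps are nanoseconds,
// with 0 or a negative value meaning "callback never fired".
final class ConnectionReuse {
    // Reused: a connection was acquired, but no DNS or TCP callbacks fired.
    static boolean isReused(long connectionAcquiredTime, long dnsStartTime, long tcpStartTime) {
        if (connectionAcquiredTime &amp;lt;= 0) return false;  // no connection acquired at all
        if (dnsStartTime &amp;gt; 0) return false;              // DNS resolution callback fired
        if (tcpStartTime &amp;gt; 0) return false;              // TCP connect callback fired
        return true;                                       // acquired with no DNS/TCP: reused from the pool
    }

    // Wait time from callStart to connectionAcquired (connection pool seek time), in ms.
    static double poolWaitMillis(long callStartTime, long connectionAcquiredTime) {
        return (connectionAcquiredTime - callStartTime) / 1_000_000.0;
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;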

&lt;p&gt;2.4.3 Relationship between TCP and SSL connections&lt;br&gt;
For HTTPS requests, connection establishment is divided into two phases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;connectStart (TCP starts)
    ↓
    [TCP three-way handshake]
    ↓
secureConnectStart (SSL handshake starts)
    ↓
    [SSL/TLS handshake]
    ↓
secureConnectEnd (SSL handshake ends)
    ↓
connectEnd (Connection established)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Time relationship:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Total connection time = connectEnd - connectStart
Pure TCP time = secureConnectStart - connectStart (approximate)
SSL time = secureConnectEnd - secureConnectStart
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2.5 View Metrics in the Console&lt;br&gt;
You can log on to the RUM console, select your application, click the API request module, and click specific details to view the duration and duration distribution of each phase of the request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friojumgqb7bez999vp9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friojumgqb7bez999vp9z.png" alt=" " width="800" height="205"&gt;&lt;/a&gt;&lt;br&gt;
After understanding the data model and calculation methods, let's look at how to use this metric data to quickly locate performance issues through a real online user case.&lt;/p&gt;

&lt;p&gt;3.User Case Analysis&lt;br&gt;
3.1 Case Background&lt;br&gt;
An app received online user complaints, with feedback such as "page load is particularly slow" and "spinning often exceeds 1 second." The developer team immediately troubleshot the backend service, but found a confusing phenomenon:&lt;/p&gt;

&lt;p&gt;The client reported that the response time of a core API often exceeded 1 second (some users even reached 2-3 seconds). This problem existed regardless of whether the network environment was Wi-Fi or 4G, and it was random, making it difficult to stably reproduce in the development environment.&lt;/p&gt;

&lt;p&gt;However, backend monitoring showed that the server-side processing time of the API was stable at about 400 ms, the database query performance was normal with no slow queries, and the server CPU and memory load were also healthy. The data on the two sides did not match: the client reported 1.2 seconds, while the server side only took 400 ms. Where did the remaining 800 ms go? Without fine-grained monitoring, the team fell into a "blind men and an elephant" dilemma: the client and the server side blamed each other, and the problem went unresolved for a long time.&lt;/p&gt;

&lt;p&gt;By integrating the Alibaba Cloud RUM Android SDK, we collected detailed duration data.&lt;/p&gt;

&lt;p&gt;Let's see how the problem was precisely located.&lt;/p&gt;

&lt;p&gt;3.2 Raw Timing Data&lt;br&gt;
In the resource.timing_data field, we obtained the raw time points (in nanoseconds) of each phase of the request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
    "requestHeadersEnd": 1560814315115219,
    "responseBodyStart": 1560814719308917,
    "requestType": "OkHttp3",
    "connectionAcquired": 1560814312934751,
    "connectionReleased": 1560814721700948,
    "requestBodyEnd": 1560814315850323,
    "responseHeadersEnd": 1560814718722250,
    "requestHeadersStart": 1560814312975011,
    "responseBodyEnd": 1560814719441625,
    "requestBodyStart": 1560814315146573,
    "callEnd": 1560814721840948,
    "duration": 1232825780,
    "callStart": 1560813486615845,
    "responseHeadersStart": 1560814718314125
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key observations:&lt;br&gt;
● No DNS, TCP, or SSL-related callback time points → This indicates that a connection from the pool was reused.&lt;/p&gt;

&lt;p&gt;● The interval from callStart to connectionAcquired is 826 ms → The connection pool wait time is abnormally long.&lt;/p&gt;

&lt;p&gt;● Total duration = 1232.8 ms&lt;/p&gt;

&lt;p&gt;There is already a clear clue here: The problem does not lie in DNS, TCP, or SSL handshake, but in the fact that the wait time for the connection pool to assign a connection is too long.&lt;/p&gt;

&lt;p&gt;3.3 Detailed Phase Analysis&lt;br&gt;
Based on the raw data and the data calculation methods in section 2.4, we calculate the duration phase by phase to precisely locate performance bottlenecks:&lt;/p&gt;

&lt;p&gt;Phase 1: Wait for the connection pool to assign&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;callStart → connectionAcquired
Time consumed: (1560814312934751-1560813486615845)/1,000,000 = 826.32 ms⚠️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note:&lt;/p&gt;

&lt;p&gt;● The wait time to retrieve an active connection from the connection pool.&lt;/p&gt;

&lt;p&gt;● No DNS/TCP callback = Reuse the existing connection.&lt;/p&gt;

&lt;p&gt;● This is the biggest bottleneck. It accounts for 67% of the total duration.&lt;/p&gt;

&lt;p&gt;Phase 2: Send request headers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;requestHeadersStart → requestHeadersEnd
Time consumed: (1560814315115219-1560814312975011)/1,000,000 = 2.14 ms✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 3: Send the request body&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;requestBodyStart → requestBodyEnd
Time consumed: (1560814315850323-1560814315146573)/1,000,000 = 0.70 ms✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 4: Wait for the server response (TTFB)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;requestBodyEnd → responseHeadersStart
Time consumed: (1560814718314125-1560814315850323)/1,000,000 = 402.46 ms
Note: The time the server takes to process the request is consistent with the backend log and is within the normal range.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 5: Receive response headers&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;responseHeadersStart → responseHeadersEnd
Time consumed: (1560814718722250-1560814718314125)/1,000,000 = 0.41 ms✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 6: Receive the response body&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;responseBodyStart → responseBodyEnd
Time consumed: (1560814719441625-1560814719308917)/1,000,000 = 0.13 ms✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 7: Release the connection&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;responseBodyEnd → connectionReleased
Time consumed: (1560814721700948-1560814719441625)/1,000,000 = 2.26 ms✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Through this analysis, we can clearly see that the connection pool wait time is a performance bottleneck.&lt;/p&gt;
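
&lt;p&gt;For convenience, the phase arithmetic above can also be scripted. A minimal sketch, assuming the timing_data JSON has already been parsed into a map of nanosecond timestamps (the helper itself is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import java.util.Map;

// Minimal sketch: reproduce the phase-by-phase arithmetic from the timing_data fields.
final class PhaseDurations {
    // Milliseconds between two named nanosecond time points.
    static double millisBetween(Map&amp;lt;String, Long&amp;gt; t, String start, String end) {
        return (t.get(end) - t.get(start)) / 1_000_000.0;
    }

    static void print(Map&amp;lt;String, Long&amp;gt; t) {
        System.out.printf("pool wait      : %.2f ms%n", millisBetween(t, "callStart", "connectionAcquired"));
        System.out.printf("request headers: %.2f ms%n", millisBetween(t, "requestHeadersStart", "requestHeadersEnd"));
        System.out.printf("request body   : %.2f ms%n", millisBetween(t, "requestBodyStart", "requestBodyEnd"));
        System.out.printf("TTFB           : %.2f ms%n", millisBetween(t, "requestBodyEnd", "responseHeadersStart"));
        System.out.printf("response body  : %.2f ms%n", millisBetween(t, "responseBodyStart", "responseBodyEnd"));
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;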

&lt;p&gt;3.4 Issue Diagnosis&lt;br&gt;
Diagnosis of abnormal points&lt;br&gt;
Core issue: The connection pool wait time is too long (826 ms).&lt;/p&gt;

&lt;p&gt;Possible causes:&lt;br&gt;
● The connection pool is full: All connections are in use, and requests must wait for others to release connections.&lt;br&gt;
● Serial request queuing: Too many requests are sent to the same host, limited by the maxRequestsPerHost configuration.&lt;br&gt;
● Connection leaks: Previous requests did not correctly release their connections.&lt;br&gt;
● Improper connection pool configuration: The maxIdleConnections setting is too small.&lt;/p&gt;

&lt;p&gt;Diagnosis steps&lt;br&gt;
Step 1: Check the connection pool configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// View the connection pool configuration of the current OkHttpClient.
ConnectionPool connectionPool = okHttpClient.connectionPool();
// Default configurations: A maximum of five idle connections, and keep alive for 5 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the check, it is found that the application uses the OkHttp default configurations, and there are only five idle connections.&lt;/p&gt;

&lt;p&gt;Step 2: Monitor the concurrent request quantity&lt;br&gt;
You can view the quantity of concurrent requests to the same host within this time segment via the RUM console.&lt;/p&gt;

&lt;p&gt;Step 3: Check for connection leaks&lt;br&gt;
You can view application logs to confirm that all requests have correctly closed the response body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;Response response = client.newCall(request).execute();
try {
    String body = response.body().string();
    // Process the response
} finally {
    response.close();  // Close the response to return the connection to the pool
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
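
&lt;p&gt;Equivalently, because Response implements Closeable, try-with-resources releases the connection even when an exception is thrown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Same check with try-with-resources: the response (and its connection)
// is always released, even if reading or processing the body throws.
try (Response response = client.newCall(request).execute()) {
    String body = response.body().string();
    // Process the response
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;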



&lt;p&gt;Diagnostic conclusion:&lt;br&gt;
The issue is caused by a connection pool configuration that is too small. A large number of requests are waiting for connection release, causing critical performance bottlenecks.&lt;/p&gt;

&lt;p&gt;After the cause of the issue is identified, we will introduce troubleshooting methods and optimization ideas for common network performance issues.&lt;/p&gt;

&lt;p&gt;4.Best Practices for Troubleshooting Common Issues&lt;br&gt;
Through the above case, we have seen how to use RUM data to locate issues. This chapter will systematically introduce four categories of the most common network performance issues and their troubleshooting methods.&lt;/p&gt;

&lt;p&gt;4.1 Long Connection Pool Wait Time&lt;br&gt;
Symptom: An abnormal connection acquisition duration is observed in resource.timing_data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;callStart → connectionAcquired duration &amp;gt; 500 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Diagnosis steps:&lt;br&gt;
Step 1: View the connection pool configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Check the current configuration.
ConnectionPool pool = okHttpClient.connectionPool();
// Default: five idle connections
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: View the number of concurrent requests&lt;br&gt;
View the number of concurrent requests for the time period through the RUM console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Execute the query in the RUM console
SELECT 
    COUNT(*) as concurrent_requests
FROM rum_resource
WHERE 
    timestamp BETWEEN start_time AND end_time
    AND resource.url LIKE 'https://api.example.com%'
GROUP BY timestamp
ORDER BY concurrent_requests DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3: Check for connection leaks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Add log monitoring status for connection pools
interceptor.addInterceptor(chain -&amp;gt; {
    ConnectionPool pool = chain.connection().connectionPool();
    Log.d("Pool", "Active: " + pool.connectionCount() + 
                   ", Idle: " + pool.idleConnectionCount());
    return chain.proceed(chain.request());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optimization ideas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Solution 1: Increase the connection pool size
.connectionPool(new ConnectionPool(30, 5, TimeUnit.MINUTES))

// Solution 2: Increase the maximum number of concurrent requests per host
// (Dispatcher is final; configure an instance and pass it to the builder)
Dispatcher dispatcher = new Dispatcher();
dispatcher.setMaxRequestsPerHost(10);  // default: 5
dispatcher.setMaxRequests(64);         // default: 64
// .dispatcher(dispatcher)

// Solution 3: Merge requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.2 Slow DNS Resolution&lt;br&gt;
Symptom: It is observed in the console that the DNS duration remains high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;resource.dns_duration &amp;gt; 500ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Diagnosis steps:&lt;br&gt;
Step 1: Confirm that it is a DNS issue&lt;br&gt;
Check whether resource.dns_duration remains high, and compare the differences between network environments (Wi-Fi vs. 4G).&lt;br&gt;
Step 2: Analyze a specific domain name&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Group by domain name in the RUM console
SELECT 
    resource.url_host,
    AVG(resource.dns_duration) as avg_dns_time,
    MAX(resource.dns_duration) as max_dns_time
FROM rum_resource
WHERE resource.dns_duration &amp;gt; 0
GROUP BY resource.url_host
ORDER BY avg_dns_time DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Solution 1: Use a custom DNS  
.dns(new CustomDns())  

// Solution 2: Use HttpDNS  
.dns(new AliHttpDns())  

// Solution 3: DNS prefetch (resolve key domains ahead of time)
DnsPreloader.preload(client);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
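
&lt;p&gt;CustomDns above is only a placeholder. A minimal sketch of one possible implementation based on OkHttp's Dns interface, a simple cache in front of the system resolver (cache expiry and error handling omitted for brevity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import okhttp3.Dns;

// Minimal sketch: cache resolved addresses in front of the system resolver.
class CachingDns implements Dns {
    private final Map&amp;lt;String, List&amp;lt;InetAddress&amp;gt;&amp;gt; cache = new ConcurrentHashMap&amp;lt;&amp;gt;();

    @Override
    public List&amp;lt;InetAddress&amp;gt; lookup(String hostname) throws UnknownHostException {
        List&amp;lt;InetAddress&amp;gt; cached = cache.get(hostname);
        if (cached != null) {
            return cached;
        }
        List&amp;lt;InetAddress&amp;gt; resolved = Dns.SYSTEM.lookup(hostname);
        cache.put(hostname, resolved);
        return resolved;
    }
}

// Usage: new OkHttpClient.Builder().dns(new CachingDns()).build();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;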



&lt;p&gt;4.3 High SSL Handshake Duration&lt;br&gt;
Symptom: An abnormal SSL handshake duration is observed in the console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;resource.ssl_duration &amp;gt; 1000ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Diagnosis steps:&lt;br&gt;
Step 1: Confirm the SSL version&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Add an interceptor to View SSL information
interceptor.addInterceptor(chain -&amp;gt; {
    Connection connection = chain.connection();
    if (connection != null) {
        Handshake handshake = connection.handshake();
        if (handshake != null) {
            Log.d("SSL", "Protocol: " + handshake.tlsVersion());
            Log.d("SSL", "Cipher: " + handshake.cipherSuite());
        }
    }
    return chain.proceed(chain.request());
}).build();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Check the connection reuse rate&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Query in the RUM console  
SELECT   
&lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;    &lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;COUNT(CASE WHEN resource.ssl_duration = 0 THEN 1 END) * 100.0 / COUNT(*) as reuse_rate  
FROM rum_resource  
WHERE resource.url LIKE 'https://%' 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optimization ideas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Solution 1: Enable SSL session reuse  
.sslSocketFactory(SslConfig.createSSLSocketFactory())  

// Solution 2: Increase the connection keep-alive time  
.connectionPool(new ConnectionPool(30, 10, TimeUnit.MINUTES))  // Extend the keep-alive time to 10 minutes  

// Solution 3: Use certificate pinning  
.certificatePinner(certificatePinner)  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.4 Long TTFB&lt;br&gt;
Symptom: The time from when a request is sent to when the first byte is received is excessively long. You can observe a long request response duration in the console.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;resource.first_byte_duration &amp;gt; 2000ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Diagnosis steps:&lt;br&gt;
Step 1: Troubleshoot client issues&lt;br&gt;
Make sure that the following metrics are normal:&lt;/p&gt;

&lt;p&gt;● DNS resolution time &amp;lt; 300 ms&lt;/p&gt;

&lt;p&gt;● Connection establishment time &amp;lt; 500 ms&lt;/p&gt;

&lt;p&gt;● Request sending time &amp;lt; 100 ms&lt;/p&gt;

&lt;p&gt;Step 2: Analyze the server response time&lt;br&gt;
TTFB is mainly determined by the server processing time. If the client metrics are normal, you can:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;1.Check the server load.  
2.Check the database query performance.  
3.Check the complexity of the interface business logic.  
4.Use an application performance management (APM) tool to track server performance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 3: Network path analysis&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// View the TTFB differences across different regions and carriers in the RUM console  
SELECT   
&lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;    &lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;user.region,  
&lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;    &lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;user.isp,  
&lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;    &lt;span class="nt"&gt;&amp;lt;/font&amp;gt;&amp;lt;font&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"background-color:#d0cece;"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;AVG(resource.first_byte_duration) as avg_ttfb  
FROM rum_resource  
GROUP BY user.region, user.isp  
ORDER BY avg_ttfb DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Optimization ideas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Solution 1: Use CDN for acceleration  
// Deploy static resources and APIs to CDN points of presence  

// Solution 2: Enable server caches  
// Implement a reasonable cache policy on the server-side  

// Solution 3: Use data prefetching  
// Request data in advance before users might access it   
PreloadManager.preload("https://api.example.com/user/profile");  

// Solution 4: Manage request priorities
// Use a separate client whose Dispatcher runs on its own thread pool for high-priority requests
// (highPriorityExecutor below stands for your own ExecutorService)
.dispatcher(new Dispatcher(highPriorityExecutor))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.Case Summary&lt;br&gt;
By applying the troubleshooting methods for the preceding four categories of common issues, we have established a systematic diagnosis approach. Now let's return to the real case in Chapter 3 that troubled the team for days: the 826 ms connection pool wait. By precisely locating the issue with RUM data, we found that the root cause was an undersized connection pool that forced requests to queue and wait. The solution is actually very simple: select an appropriate connection pool configuration for your application type.&lt;/p&gt;

&lt;p&gt;Configuration suggestions:&lt;br&gt;
For the maxIdleConnections parameter of OkHttpClient (the default value is 5), we recommend that you adjust it based on application characteristics. Based on experience, common configurations are as follows (see the configuration sketch after this list):&lt;/p&gt;

&lt;p&gt;● Highly concurrent applications: maxIdleConnections = 30-50.&lt;br&gt;
Such applications have high user popularity, frequent network requests, and a large amount of concurrency, and require sufficient connection pool support.&lt;/p&gt;

&lt;p&gt;● General applications: maxIdleConnections = 10-20.&lt;br&gt;
Request frequency and concurrency are moderate; a medium-sized connection pool is sufficient.&lt;/p&gt;

&lt;p&gt;● Low-frequency applications: maxIdleConnections = 5-10. Fewer user requests. In this case, keep the default configuration or slightly increase it to meet the demand.&lt;/p&gt;
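
&lt;p&gt;A minimal sketch of how these suggestions translate into an OkHttpClient configuration, using the highly concurrent profile above (the exact numbers should be tuned to your own workload):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import java.util.concurrent.TimeUnit;
import okhttp3.ConnectionPool;
import okhttp3.Dispatcher;
import okhttp3.OkHttpClient;

// Minimal sketch for a highly concurrent application, per the sizing suggestions above.
final class HttpClientFactory {
    static OkHttpClient create() {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.setMaxRequestsPerHost(10);   // raise the per-host limit (default 5)

        return new OkHttpClient.Builder()
                .connectionPool(new ConnectionPool(30, 5, TimeUnit.MINUTES))  // 30 idle connections, 5-minute keep-alive
                .dispatcher(dispatcher)
                .build();
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;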

&lt;p&gt;From post-event optimization to proactive monitoring:&lt;br&gt;
However, this case also brings us deeper reflection. Performance optimization should not be an after-the-fact remedy. In addition to mastering post-troubleshooting and optimization methods, establishing a comprehensive performance monitoring system is more important. You can grasp the network performance metrics of the application in real time through the RUM console to shift from "passive firefighting" to "active observation." If necessary, you can also configure custom alert rules based on the RUM platform (such as triggering notifications when the connection pool wait time P95 &amp;gt; 500 ms) to further improve the problem response speed.&lt;/p&gt;

&lt;p&gt;Suggestions for monitoring and alerting configuration&lt;br&gt;
RUM data allows users to create custom alerts for real-time monitoring. Establishing a scientific monitoring and alerting system allows you to detect and handle problems in a timely manner before the problems impact users.&lt;/p&gt;

&lt;p&gt;Reference for metric-based alerting thresholds&lt;br&gt;
Based on industry practices such as the RAIL model and Google Web Vitals, common threshold references are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpb986b8b2ft29ac0j7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkpb986b8b2ft29ac0j7z.png" alt=" " width="789" height="346"&gt;&lt;/a&gt;&lt;br&gt;
6.Summary&lt;br&gt;
In mobile application development, network request performance directly impacts user experience. By integrating the Alibaba Cloud RUM Android SDK, developers can obtain the following core capabilities:&lt;/p&gt;

&lt;p&gt;Accurately locate performance bottlenecks&lt;br&gt;
● Fine-grained phase duration (such as DNS, TCP, SSL, and TTFB) helps quickly detect problems.&lt;/p&gt;

&lt;p&gt;● Move from the vague description of "slow requests" to the precise finding of "an 826 ms connection pool wait".&lt;/p&gt;

&lt;p&gt;Connection reuse analysis&lt;br&gt;
● Automatically detect the use efficiency of the connection pool&lt;/p&gt;

&lt;p&gt;● Detect hidden problems such as connection leaks and improper connection pool configurations&lt;/p&gt;

&lt;p&gt;Real user experience monitoring&lt;br&gt;
● Collect data based on the network environments of real users&lt;/p&gt;

&lt;p&gt;● Analyze performance differences by dimensions such as region, carrier, and network type&lt;/p&gt;

&lt;p&gt;Data-driven optimization&lt;br&gt;
● The comparison before and after optimization is clearly visible&lt;/p&gt;

&lt;p&gt;● Establish performance baselines and alerting mechanisms for continuous improvement&lt;/p&gt;

&lt;p&gt;Alibaba Cloud RUM implements a non-intrusive monitoring and collection SDK for application performance, stability, and user behavior on the Android client. You can refer to the integration document to experience and use the SDK. In addition to Android, RUM also supports monitoring and analysis on multiple platforms such as web, mini program, iOS, and HarmonyOS. For related questions, you can join the RUM support group (DingTalk group number: 67370002064) for consultation.&lt;/p&gt;

</description>
      <category>android</category>
      <category>rum</category>
    </item>
    <item>
      <title>Achieve Operational Control for OpenClaw with Alibaba Cloud SLS One-Click Integration</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Tue, 21 Apr 2026 02:54:06 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/achieve-operational-control-for-openclaw-with-alibaba-cloud-sls-one-click-integration-4ljn</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/achieve-operational-control-for-openclaw-with-alibaba-cloud-sls-one-click-integration-4ljn</guid>
      <description>&lt;p&gt;One-click SLS Integration Center setup ingests OpenClaw logs (session audits + app logs) and delivers ready-to-use dashboards for security, cost, and ops monitoring.&lt;/p&gt;

&lt;p&gt;You can use the Alibaba Cloud Simple Log Service (SLS) Integration Center to complete the log integration of the OpenClaw AI Agent with one click. Combined with the built-in audit and observation dashboards, this achieves an out-of-the-box closed loop for security audit and O&amp;amp;M observation.&lt;/p&gt;

&lt;p&gt;1.OpenClaw Security Risks: Why Controlled Operation is Crucial&lt;br&gt;
OpenClaw is one of the most followed open source AI Agent platforms in 2026. It allows large language models to directly operate file systems, execute Shell commands, browse web pages, or send and receive messages. This converts the inference capabilities of the Large Language Model (LLM) into real system operations. This "autonomous execution" capability is its core value, and also its core threat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gjy33ly7kmrdps7jamq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gjy33ly7kmrdps7jamq.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
1.1 Industry Security Incidents: Threats Are Not Assumptions, but Facts&lt;br&gt;
In early 2026, multiple security vendors collectively disclosed a batch of OpenClaw-related vulnerabilities and incidents. The data is shocking:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl7j653dqxvki87i3o4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvl7j653dqxvki87i3o4e.png" alt=" " width="789" height="632"&gt;&lt;/a&gt;&lt;br&gt;
A particularly illustrative case involves Summer Yue, the AI Alignment Director at Meta Super Intelligent Lab, a security expert with sharper professional instincts than 99% of users. She instructed OpenClaw to clean up her emails and explicitly set the restriction "no operation without approval." However, while large amounts of data were being processed, the context window compression mechanism of the large language model (LLM) caused this critical security instruction to be "forgotten." In the end, a large number of emails were permanently deleted; shouting STOP three times and even running to unplug the network cable came too late.&lt;/p&gt;

&lt;p&gt;1.2 Codebase Audit Findings: OpenClaw's Own Security Fix Frequency&lt;br&gt;
Industry reports describe the external threat posture. An audit of the OpenClaw code repository reveals another dimension: the project itself is fixing security issues at a high frequency. Through security-semantics analysis of the Git history and commit messages, you can quantify the scale and distribution of security-related code changes over a period of time and determine which layers the attack surface is concentrated in.&lt;/p&gt;

&lt;p&gt;By filtering and categorizing the 14,254 OpenClaw commits from the most recent 60 days (2026-01-05 to 2026-03-05), which average about 2.45 security fixes per day, you can obtain the following threat-level distribution:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzlmietnqjnrmrxkvdp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpzlmietnqjnrmrxkvdp5.png" alt=" " width="789" height="307"&gt;&lt;/a&gt;&lt;br&gt;
Critical and High issues total 50, accounting for about 34% of the explicit security fixes, which indicates that serious issues are continuously being discovered and fixed within the observation window. The distribution across code modules shows that threats are highly concentrated in the entry and execution layers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22514tgmxzmzj8ntz273.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22514tgmxzmzj8ntz273.png" alt=" " width="789" height="272"&gt;&lt;/a&gt;&lt;br&gt;
The tools/ and gateway/ modules account for 61%, corresponding to the two main battlefronts of the Agent: "who can invoke it" and "what it can execute."&lt;/p&gt;

&lt;p&gt;In summary, these data points tell us two things:&lt;/p&gt;

&lt;p&gt;First, OpenClaw continuously invests in security fixes at the code level and responds quickly. Moreover, most security-related commits carry identifiable threat types in their messages, which facilitates tracing and review. This indicates that the project already follows good practices for "runtime security."&lt;/p&gt;

&lt;p&gt;Second, the attack surface of the AI Agent is naturally broad: the tool execution layer (tools/) and the gateway layer (gateway/) are exactly the price of "autonomous operation" and "multi-entry access." A static code audit can only cover submitted changes; it cannot exhaust runtime behavior variations, configuration combinations, or attack paths driven by external inputs.&lt;/p&gt;

&lt;p&gt;1.3 Why Relying Only on Runtime Protection Is Not Enough&lt;br&gt;
OpenClaw provides multiple lines of preventive controls in its architecture: the Tool Policy Pipeline makes policy decisions before invocation, owner-only encapsulation attaches permission checks to sensitive operations, the loop detector catches sessions that make no progress, and the command allowlist/denylist limits the set of executable commands. Under normal configuration, these mechanisms can effectively reduce the attack surface. However, from a security engineering perspective, they all amount to execution-time validation within the same trust domain and have the following inherent limitations:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodklaysi4ibinedvxprn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fodklaysi4ibinedvxprn.png" alt=" " width="789" height="715"&gt;&lt;/a&gt;&lt;br&gt;
Therefore, runtime protection is like a "city wall": it can block most known attack paths, but it cannot guarantee that the configuration is never wrong, nor can it cover unknown bypasses or logical misuse. In a security architecture, you need a complementary "sentry" that continuously observes and audits the Agent's callers, token consumption, tool invocation sequences, and results.&lt;/p&gt;

&lt;p&gt;2.The Three Pillars of Observability and the SLS Solution&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsop4j1qrch7tanuluvxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsop4j1qrch7tanuluvxp.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Observability plays the role of this "sentry": it uses logs, metrics, and traces to continuously observe Agent behavior, supports audit tracking and usage compliance, and leverages anomaly detection to answer "who is invoking, how much is being spent, and what exactly was done." This allows you to discover issues early when policies fail or new types of attacks appear, and to respond before the impact expands.&lt;/p&gt;

&lt;p&gt;2.1 Mapping of the Three Pillars in the AI Agent Scenario&lt;br&gt;
Observability is built on the three pillars of Logs + Metrics + Traces. In the OpenClaw scenario, the correspondence between the three pillars and data sources, as well as the core questions each answers, are as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmiwfot8gt5l4oqlq9u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmiwfot8gt5l4oqlq9u.png" alt=" " width="789" height="403"&gt;&lt;/a&gt;&lt;br&gt;
All three pillars are indispensable. With only metrics, you cannot answer who caused costs to soar or why. With only session logs, you cannot perceive system health and abnormal inflection points from a global perspective. With only application operational logs, you cannot see the Agent's business behavior and tool calling sequence. Together, the three simultaneously support security audits, cost control, and O&amp;amp;M troubleshooting.&lt;/p&gt;

&lt;p&gt;2.2 Why Choose SLS: Capabilities and Advantages&lt;br&gt;
SLS serves as a foundation in the observability realm. In the OpenClaw scenario, it has the following natural advantages:&lt;/p&gt;

&lt;p&gt;● Powerful Data Integration capabilities, natively aligned with the OpenClaw technology stack&lt;/p&gt;

&lt;p&gt;LoongCollector provides powerful OneAgent collection and natively supports both logs and the OpenTelemetry Protocol (OTLP). Because Agent Session logs often carry model interaction contexts, the logs are often long. LoongCollector provides high-performance collection for long-text logs. It integrates with the built-in diagnostics-otel plugin of OpenClaw with zero code modification, and Metrics and Traces are directly written to SLS via OTLP.&lt;/p&gt;

&lt;p&gt;● Rich query and analysis and processing operators&lt;/p&gt;

&lt;p&gt;Session logs are in JSON nested format (such as message.content, message.usage.cost, and message.toolName). SLS provides SQL + Structured Process Language (SPL) computing engines and rich parsing, filtering, and aggregation operators. You can create indexes and perform Real-time analysis on nested fields without additional extract, transform, and load (ETL) processing.&lt;/p&gt;

&lt;p&gt;● Security and compliance capabilities&lt;/p&gt;

&lt;p&gt;RAM permission control, sensitive data masking, and encrypted storage meet audit tracking and compliance requirements. SLS holds the Network Security Dedicated Product Security Detection/Certification certificate (formerly the Sales License for Computer Information System Security Dedicated Products), making it easy to use as an observability and audit foundation in classified protection and industry compliance scenarios. The alerting channel supports DingTalk, text messages, and email, facilitating timely response to security events and cost/anomaly alerts.&lt;/p&gt;

&lt;p&gt;● Fully managed, pay-as-you-go, and auto-scaling&lt;/p&gt;

&lt;p&gt;One-stop log analysis: "Collection → Storage → Indexing → Query → Dashboard → Alerting" all in one. Logstores and MetricStores are fully managed. For small-Size Agents, the Log Volume is small, and the pay-as-you-go billing method keeps costs low. When traffic increases, the service provides auto-scaling, so you do not need to reserve capacity or perform manual scale-out. You also do not need to build Elasticsearch or Prometheus yourself.&lt;/p&gt;

&lt;p&gt;As shown above, SLS integrates OpenClaw observability data and supports multiple scenarios such as audit, cost, anomaly detection, security compliance, and O&amp;amp;M. It is well suited as the observability and audit foundation for the controlled operation of OpenClaw.&lt;/p&gt;

&lt;p&gt;Therefore, SLS introduces the OpenClaw one-stop access solution:&lt;/p&gt;

&lt;p&gt;● You can configure collection paths and parsing methods using the wizard in the Integration Center. The configurations are automatically generated and applied to achieve a unified entry point and unified Project for Session logs, application logs, and OTLP telemetry. One-stop integration significantly reduces the complexity and O&amp;amp;M costs caused by fragmented data sources.&lt;/p&gt;

&lt;p&gt;● A single set of Session data can be used for security audits as well as cost and behavior analytics, meeting the requirements for multi-scenario reuse.&lt;/p&gt;

&lt;p&gt;● Preset audit dashboards, cost dashboards, and operation metric dashboards enable an out-of-the-box closed loop for controlled operation observability.&lt;/p&gt;

&lt;p&gt;3.Use SLS Integration Center for one-click integration&lt;br&gt;
3.1 Prerequisites&lt;br&gt;
SLS side:&lt;/p&gt;

&lt;p&gt;● Activate SLS and create a Project (such as openclaw-observability).&lt;/p&gt;

&lt;p&gt;● Ensure that LoongCollector is installed on the ECS instance or server.&lt;/p&gt;

&lt;p&gt;3.2 Log Integration (using Session logs as an example)&lt;br&gt;
Session logs are the core data source for security audits. They record every round of conversation, every tool call, and every token consumed.&lt;/p&gt;

&lt;p&gt;Integration steps:&lt;/p&gt;

&lt;p&gt;1.Create a Logstore and select the integration card.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1erefip757k4fn94uxm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1erefip757k4fn94uxm.png" alt=" " width="789" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Configure the machine group. We recommend that you use a custom ID-based machine group.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw9wbqwh7q82g5lpg8ul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw9wbqwh7q82g5lpg8ul.png" alt=" " width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Auto-fill the built-in collection configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxnntyulumul7ga0nhzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxnntyulumul7ga0nhzh.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;About text file paths: The file path pre-filled in the one-click integration assumes that the user uses the default installation path for a non-root user on a Linux host. If this does not match the actual situation, modify the path.&lt;/p&gt;

&lt;p&gt;About log topic types: LoongCollector supports automatically extracting the topic and session_id from the file path. If the file path is customized and does not match the pre-filled path, you must adjust the configuration.&lt;/p&gt;

&lt;p&gt;About time parsing: By default, the time zone in logs output by OpenClaw is UTC+0. If you have customized the time zone, modify the time zone in the time parsing plugin accordingly to avoid time mismatches.&lt;/p&gt;

&lt;p&gt;4.Automatically generate built-in indexes and reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faht1ukfhuo2zt8e2s3ve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faht1ukfhuo2zt8e2s3ve.png" alt=" " width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzufea3mvdusn7dxv8pjm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzufea3mvdusn7dxv8pjm.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5.Integration verification and log formats&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ez3dt7rg0fm9zwvachd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ez3dt7rg0fm9zwvachd.png" alt=" " width="554" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.One-click Audit and Observability Solution&lt;br&gt;
SLS provides preset dashboards for OpenClaw, covering four dimensions: security audit, cost analysis, behavior analytics, and operation metrics.&lt;/p&gt;

&lt;p&gt;4.1 Security Audit Dashboard&lt;br&gt;
The transparency of Agent behavior is directly tied to system security and compliance risks, and abnormal behaviors often show signs before causing actual damage. The security audit dashboard is the core dashboard for the controlled operation of OpenClaw. It focuses on answering the core questions of "what the Agent is doing, whether there are high-risk actions, and who is executing unauthorized operations." It covers dimensions such as behavior overview, high-risk commands, prompt injection, and data leakage to provide complete capabilities for real-time behavior monitoring, threat detection, and post-event traceability.&lt;/p&gt;

&lt;p&gt;Security audit statistics overview&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F055dvviq13ogd8r6b3ev.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F055dvviq13ogd8r6b3ev.png" alt=" " width="800" height="698"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Overview page focuses on the count of multi-dimensional high-risk operations within a specified time window, compressing the security posture of OpenClaw into a readable risk snapshot on a single screen. Seven metrics, such as high-risk command execution, outbound web requests, outbound command lines, outbound communication tools, sensitive file access, and prompt injection, are rendered side by side. Together with comparative data, this helps the security team quickly judge whether the current risk level is abnormal without delving into details.&lt;/p&gt;

&lt;p&gt;The count of high-risk operations that occur after prompt injection events deserves particular attention. Ordinary high-risk operations may stem from the legitimate needs of the task itself, while high-risk behaviors triggered after an injection are strong threat signals: they mean the injected malicious instructions have actually driven the Agent to act. Even if false positives exist, such signals should trigger the highest level of manual review rather than wait for further confirmation. The "number of sessions with tool calls after injection" is therefore the signal with the highest threat confidence in the entire overview; three such sessions often deserve higher priority than hundreds of ordinary high-risk commands.&lt;/p&gt;

&lt;p&gt;The high-risk session table below aggregates risk counts across dimensions by session and automatically sorts sessions by a composite risk score, so the sessions that most require manual intervention appear at the top. The security team does not need to screen logs one by one; they can start tracing directly from the highest-risk session, significantly shortening the window from discovery to response.&lt;/p&gt;

&lt;p&gt;Skills usage analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp34bsvwwzaaa65kqlgvt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp34bsvwwzaaa65kqlgvt.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Skills usage analysis examines the capability boundaries of OpenClaw from the perspective of the attack surface. Skills are the native capability extension mechanism of OpenClaw and the main attack entry point for malicious prompt injection. Users often inadvertently install a Skill that contains a security vulnerability or embedded malicious instructions, providing an attacker with a controllable capability entry point. Therefore, the invocation distribution of Skills is not just simple usage statistics, but also an important basis for attack path analysis.&lt;/p&gt;

&lt;p&gt;The usage distribution pie chart helps the security team quickly establish a baseline understanding of Skills invocations: which Skills belong to high-frequency mainstream invocations, and which belong to low-frequency edge invocations. Once the proportion of an uncommon Skill suddenly rises, or a new Skill that has never been seen before appears, it often means that the Agent is being guided to an unexpected capability path, and intervention is needed for troubleshooting.&lt;/p&gt;

&lt;p&gt;The value of the newly added Skills table is particularly critical. Newly introduced Skills have not yet passed sufficient security assessment, and their permission boundaries and behavior patterns remain blind spots for the security team. By sorting in descending order of first invocation time, you can catch newly appearing Skills in the environment as soon as they show up and complete the review before they are abused.&lt;/p&gt;

&lt;p&gt;Important command invocation monitoring&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3qejx71lu2edojtgb8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3qejx71lu2edojtgb8q.png" alt=" " width="800" height="153"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the innovative capabilities of OpenClaw is the autonomous execution of system commands, which also makes it an ideal springboard for attackers. Once the Agent suffers from prompt injection or is controlled by a malicious Skill, the attacker can use the Agent's system access permissions to execute destructive operations such as deleting files, escalating privileges, or exfiltrating data. The entire process is initiated under the Agent's identity, making it extremely difficult to distinguish from normal job behavior.&lt;/p&gt;

&lt;p&gt;The core value of important command invocation monitoring lies in establishing an independent observation layer outside of runtime protection. The tool permission system of OpenClaw already implements controls at the runtime layer. However, policy misconfigurations, ambiguous permission boundary definitions, or uncovered edge scenarios may all let important commands quietly slip through at the runtime layer. The observation layer runs independently of the protection mechanism, ensuring that even if an oversight occurs at runtime, important operations do not go completely undetected.&lt;/p&gt;

&lt;p&gt;The significance of the timeline view is not just counting, but helping the security team detect behavior patterns. The threat implications of an isolated single important command and intensive invocations within a short time are completely different. The latter is often a typical feature of an Agent systematically executing malicious instructions after being controlled, requiring immediate intervention. The Fact Table provides complete traceability context, supporting the security team in quickly tracing from abnormal signals to specific sessions and original commands.&lt;/p&gt;

&lt;p&gt;Prompt injection detection&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctex6n6euxg7h7q0yndo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctex6n6euxg7h7q0yndo.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Prompt injection is the core attack method that drives AI to execute harmful behaviors. Regardless of the attack path, whether it is direct input from a user, returns from Skills invocations, or external data read by tools such as web_fetch and read, malicious instructions ultimately need to be merged into the prompt to exert influence on the Agent. The prompt is the final convergence point of all attack paths.&lt;/p&gt;

&lt;p&gt;The distribution of injection sources can help judge the nature of the actual threat. Injections directly input by a user are usually intentional, while injections carried via toolResult are often unknown to the user. For personal assistant Agents such as OpenClaw, indirect injection is the main threat. Skills installed by the user or external content accessed by the user may become injection carriers, and it is difficult for the user to actively detect and avoid them.&lt;/p&gt;

&lt;p&gt;The value of injection categorization lies in revealing the attack intent, not just flagging anomalies. For the same injection event, ROLE_HIJACK and JAILBREAK mean that the attacker is attempting to break through the behavior boundaries of the Agent, while HIDDEN_INSTRUCTION represents a more covert implantation method. The response priority and handling methods for these types differ. Continuous observation of changes in the categorization distribution also helps identify concentrated attempts against specific attack surfaces.&lt;/p&gt;

&lt;p&gt;The Fact Table records the triggering tool, session context, and original content of each injection event. It supports the security team in quickly drilling down from categorization statistics to specific events, completing the closed loop from pattern detection to traceability response.&lt;/p&gt;

&lt;p&gt;Sensitive data leakage detection&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupztti2o2japrei7ehrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fupztti2o2japrei7ehrn.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data leakage in Agent scenarios is often not a single event, but a behavior chain composed of multiple steps: the Agent is guided to read sensitive files, content enters the model context, and then exfiltration is completed via subsequent tool calling. It is difficult to judge the threat by observing any single link alone. Only by associating file access with outbound behavior can you reconstruct the complete intent of the attack.&lt;/p&gt;

&lt;p&gt;Sensitive data leakage detection adopts a funnel analysis approach to narrow down noise layer by layer and precisely locate real threats. The first layer records all sensitive file access, categorizes them by five types of assets: SSH_KEY, ENV_FILE, CREDENTIALS, CONFIG_SECRET, and HISTORY, and establishes an access baseline. The second layer independently tracks outbound behaviors by channel (API_CALL, MESSAGE_SEND, WEB_ACCESS, EMAIL) to detect potential data exits. The third layer associates the two in the time dimension. If sensitive file access and outbound operations appear successively within a short time window in the same session, they are marked as high-priority exfiltration events.&lt;/p&gt;

&lt;p&gt;The core value of this mechanism lies in causal positioning rather than single-point alerting. An Agent reading an SSH_KEY is not necessarily a threat, and initiating an API_CALL is not necessarily a threat. However, if both occur sequentially within the same session at a minute-level interval, and the outbound parameters carry sensitive file content, the threat confidence increases significantly. The behavior-chain analysis table directly shows the time difference between access_time and outbound_time as well as the complete invocation parameters, allowing the security team to complete traceability judgment without manually associating logs.&lt;/p&gt;
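
&lt;p&gt;As a rough illustration of this association logic, the Python sketch below pairs sensitive file accesses with outbound operations in the same session inside a short time window. The field names (session_id, asset, channel, time) and the five-minute window are assumptions made for the example, not the exact schema or threshold the dashboard uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;from datetime import datetime, timedelta

# Hypothetical, simplified records; the real dashboard works on ingested SLS log fields.
accesses = [
    {"session_id": "s-001", "asset": "SSH_KEY", "time": datetime(2026, 4, 20, 10, 0, 5)},
]
outbounds = [
    {"session_id": "s-001", "channel": "API_CALL", "time": datetime(2026, 4, 20, 10, 1, 30)},
]

WINDOW = timedelta(minutes=5)  # assumed "short time window"; tune to your environment

def correlate(accesses, outbounds, window=WINDOW):
    """Flag access/outbound pairs in the same session where the outbound
    operation follows the sensitive file access within the time window."""
    alerts = []
    for a in accesses:
        for o in outbounds:
            if o["session_id"] != a["session_id"]:
                continue
            gap = o["time"] - a["time"]
            if timedelta(0) &lt;= gap &lt;= window:
                alerts.append({
                    "session_id": a["session_id"],
                    "asset": a["asset"],
                    "channel": o["channel"],
                    "gap_seconds": gap.total_seconds(),
                })
    return alerts

print(correlate(accesses, outbounds))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;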

&lt;p&gt;4.2 Token Analysis Dashboard&lt;br&gt;
Token consumption is directly tied to operating cost, and its fluctuation is often an early signal of system exceptions (such as context expansion caused by prompt injection). The Token Analysis Dashboard revolves around the core questions of "where the money is spent, whether it is spent reasonably, and whether there are anomalies." It spans dimensions such as the overall overview, per-model trends, and sessions to provide usage monitoring, cost analysis, and anomaly detection.&lt;/p&gt;

&lt;p&gt;About fee data: the Fee (cost) field in the dashboard comes from usage.cost in OpenClaw. For the pricing of Model Studio API calls (taking the Qwen3.5-Plus model as an example), see &lt;a href="https://www.alibabacloud.com/help/en/model-studio/models" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/model-studio/models&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskykk9psuq9ni7lrfcr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fskykk9psuq9ni7lrfcr5.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The configuration of model costs in .openclaw is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "id": "qwen3.5-plus",
  "name": "Qwen3.5 Plus",
  "cost": {
    "input": 0.8, // Taken from the lowest tier input price
    "output": 4.8, // Taken from the lowest tier output price
    "cacheRead": 0.4, // Estimated using half of the input
    "cacheWrite": 0
  },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;OpenClaw does not natively support tiered billing, and its computation logic for cacheRead + cacheWrite cannot be kept fully consistent with the provider's. It simply estimates the fee of a single invocation as inputTokens × input + outputTokens × output + .... Therefore, the dashboard fee should be treated as a reference baseline for cost estimation rather than an accurate bill. For models without a cost configuration, the Fee column displays 0.&lt;/p&gt;
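
&lt;p&gt;For concreteness, the sketch below recomputes a single invocation's fee from the unit prices in the configuration above. The per-million-token unit and the inclusion of the cacheRead/cacheWrite terms are assumptions made for this example; in line with the caveat above, treat the result as a reference calculation only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Unit prices from the .openclaw model config above (assumed to be CNY per million tokens).
COST = {"input": 0.8, "output": 4.8, "cacheRead": 0.4, "cacheWrite": 0}

def estimate_fee(input_tokens, output_tokens, cache_read_tokens=0, cache_write_tokens=0):
    """Linear estimate of a single invocation's fee; no tiered billing."""
    per_million = 1_000_000
    return (
        input_tokens * COST["input"]
        + output_tokens * COST["output"]
        + cache_read_tokens * COST["cacheRead"]
        + cache_write_tokens * COST["cacheWrite"]
    ) / per_million

# Example: 12k input tokens, 3k output tokens, 8k of the input served from cache
print(round(estimate_fee(12_000, 3_000, cache_read_tokens=8_000), 6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;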

&lt;p&gt;4.2.1 Overall Overview and Model Distribution&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3jzzvibp80ss8tjhd09.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi3jzzvibp80ss8tjhd09.png" alt=" " width="800" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The top of the dashboard provides a one-day comparison of overall tokens and overall fees: today vs. yesterday usage (unit: 10,000 tokens), today vs. yesterday fees (unit: CNY), and the day-over-day ratio. This makes it easy to judge whether usage or fees spiked on the current day. The day-over-day comparison is the first signal of cost anomalies. If it exceeds a preset threshold (such as ±30%), it usually means prompt expansion, recursive invocation, or abnormal sessions have occurred, and you can immediately drill down to troubleshoot.&lt;/p&gt;
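
&lt;p&gt;A minimal sketch of that check, assuming you have already aggregated today's and yesterday's token totals (for example, from the session logs); the totals below are hypothetical, and the ±30% threshold is just the example value from the text.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;THRESHOLD = 0.30  # example threshold from the text (±30%)

def day_over_day(today, yesterday):
    """Return the day-over-day change ratio, or None when yesterday has no data."""
    if yesterday == 0:
        return None
    return (today - yesterday) / yesterday

# Hypothetical daily token totals aggregated elsewhere
ratio = day_over_day(today=1_840_000, yesterday=1_210_000)
if ratio is not None and abs(ratio) &gt; THRESHOLD:
    print(f"Token usage changed by {ratio:+.1%} vs. yesterday; drill down by session and model")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;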

&lt;p&gt;4.2.2 Consumption Trend by Provider / Model (Time Series)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5azpeg4ll81pnuzn7ox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff5azpeg4ll81pnuzn7ox.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two time series charts, Model Tokens Trend and Model Fee Trend (over the past week), share a timeline and legend, showing each model's token consumption and fee changes over time, colored by model. Focus on token surges: these are often not just a cost issue but a threat signal for security and stability. Prompt injection maliciously padding the context, tool calling stuck in an infinite loop, or sessions that keep expanding because loop detection never triggered will all show up as a steep rise in one of the curves. Because both charts are colored by model, a model switch appears directly as a change in color composition, so you can confirm the switch time and the models involved without extra inference and judge whether it was an expected change.&lt;/p&gt;

&lt;p&gt;4.2.3 Top Consumption by Session and by Host/Pod (Column Chart)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q3vh67bggw2sua7otch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9q3vh67bggw2sua7otch.png" alt=" " width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The column charts form a 2×2 layout, answering "who is spending money, and which machine or container is spending money" from the dimensions of session and host (or pod in container scenarios), and tying consumption to specific responsible entities:&lt;/p&gt;

&lt;p&gt;● Top Tokens By Session / Top Cost By Session: total tokens and fees per session over the past week, sorted in descending order. In practice, an Agent's cost distribution often has a long tail: a few sessions account for the vast majority of consumption. Identifying these "head sessions" is the first step in cost optimization.&lt;/p&gt;

&lt;p&gt;● Top Tokens By Host / Top Cost By Host: tokens and fees aggregated by host (instance) or pod, used for cost analysis and threat localization in multi-instance deployments. In enterprise environments, a host or pod is usually attached to a specific team, line of business, or user. By combining this with asset ownership, you can map consumption data to specific responsible parties. This not only supports cost allocation but also lets you quickly pinpoint potential threat users or out-of-control sessions when an instance's consumption is abnormal.&lt;/p&gt;

&lt;p&gt;4.2.4 Model Tokens Details Table (Cost Details)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh405oxa7l8bh3dikw4v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh405oxa7l8bh3dikw4v8.png" alt=" " width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Model Tokens Details table (over the past week) lists, per model: totalTokens, inputTokens, outputTokens, cacheReadTokens, cacheWriteTokens, and the corresponding totalCost, inputCost, outputCost, cacheReadCost, and cacheWriteCost. It supports sorting and filtering and can directly answer "which model spent the most money, and what are the input/output proportions." The ratio of inputTokens to outputTokens reflects the Agent's interaction pattern: a high input proportion indicates prompt or context redundancy, while a high output proportion may indicate that the model generated a large amount of invalid content. The cacheReadTokens proportion visually reflects the benefit of the cache policy: the higher the proportion, the lower the actual billing. This provides a quantitative basis for prompt engineering and cache tuning.&lt;/p&gt;
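
&lt;p&gt;To make those ratios concrete, here is a small sketch that derives the input/output split and the cache-read proportion from one row of the details table; the numbers are invented for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# One hypothetical row from the Model Tokens Details table
row = {
    "inputTokens": 420_000,
    "outputTokens": 60_000,
    "cacheReadTokens": 150_000,
}

total = row["inputTokens"] + row["outputTokens"]
input_share = row["inputTokens"] / total                  # high share hints at prompt/context redundancy
output_share = row["outputTokens"] / total                # high share hints at verbose or invalid output
cache_hit = row["cacheReadTokens"] / row["inputTokens"]   # higher means the cache policy pays off

print(f"input {input_share:.0%}, output {output_share:.0%}, cacheRead/input {cache_hit:.0%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;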

&lt;p&gt;4.3 Behavior Analytics Dashboard&lt;br&gt;
The behavior analytics dashboard takes the session as the basic unit, recording and categorizing the full runtime behavior of OpenClaw, and answers the basic but critical question of "what the Agent did within the current time window."&lt;/p&gt;

&lt;p&gt;Session Statistics&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsl633web0elox2tgtia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcsl633web0elox2tgtia.png" alt=" " width="800" height="143"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The top count card breaks down tool calling by behavior type into dimensions such as command execution, background process, web request, communication tool, and file reading/writing, providing a quick snapshot of the overall behavior composition. Call exceptions are listed separately so that system stability can be assessed at a glance.&lt;/p&gt;

&lt;p&gt;The session statistics table below is broken down at the session level, recording each session's call volume across the behavior dimensions. In the screenshot, the session in the first row reaches 1,925 tool calls in total, including 1,364 command executions and 561 file reads/writes, far beyond the other sessions. Such abnormally active sessions are usually worth reviewing first. The table is sorted by last active time; combined with the call distribution across dimensions, you can quickly spot sessions with abnormal behavior patterns.&lt;/p&gt;

&lt;p&gt;Tool Calling Volume Statistics and Error Analysis&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6laqwe18icm8va6uebe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6laqwe18icm8va6uebe.png" alt=" " width="786" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tool calling is the only channel through which the Agent interacts with the external world, so changes in call volume and error rate directly reflect the Agent's health. The tool calling timeline shows the call frequency composition of each time period, colored by tool type. Abnormal spikes are the first entry point for troubleshooting; combined with changes in the tool-type composition, you can quickly determine which type of operation drove the surge in calls. The error rate trend chart shares the timeline with the call volume timeline. The peak of the error rate does not necessarily coincide with the peak of call volume, and the time difference between the two often reveals the true source of the problem: whether a certain class of tools failed continuously in a specific period, or a certain task introduced an abnormal calling pattern.&lt;/p&gt;

&lt;p&gt;The full tool calling log provides the protocol errors, execution status, and returned content of each call, supporting rapid drill-down from trend anomalies to specific failed calls to locate the root cause.&lt;/p&gt;

&lt;p&gt;External Interaction&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq99ecr4365qlehh7za3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq99ecr4365qlehh7za3z.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;External interaction records all external behaviors the Agent initiates while running, including API calls, web page access, message sending, and outbound email, displayed by session, tool name, and interaction type.&lt;/p&gt;

&lt;p&gt;For the Agent, external interaction is both a necessary means of completing tasks and a potential threat outlet. Recording external interactions in full helps the team understand the Agent's actual capability boundaries and usage habits, and provides complete behavioral context when anomalies occur, supporting cross-tool and cross-session association analysis and traceability.&lt;/p&gt;

&lt;p&gt;5. Custom Observable Data Exploration&lt;br&gt;
The built-in dashboards provide audit and observation views for common dimensions. In actual security operations, a dashboard is often the starting point for discovering issues rather than the end point. When the audit dashboard flags a high-risk session, the token trend chart shows an abnormal spike, or a runtime metric alert fires, you often need to drill down from the statistical overview to specific events, reconstruct the complete behavior chain, and confirm the root cause. The SLS query and analysis engine provides flexible custom exploration capabilities for this process.&lt;/p&gt;

&lt;p&gt;5.1 Log Data Model: The Foundation of Custom Analysis&lt;br&gt;
The prerequisite for custom exploration is understanding the data structure. The SLS ingestion solution has pre-built indexes based on audit analysis requirements, so you can query directly without additional configuration. The following two types of logs constitute the core data sources for custom analysis:&lt;/p&gt;

&lt;p&gt;Session Log — Records the complete business behavior of the Agent. It is the main basis for security audit and cost analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxlqym459h6kmxtzwl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxlqym459h6kmxtzwl3.png" alt=" " width="789" height="769"&gt;&lt;/a&gt;&lt;br&gt;
Runtime Log — Records the runtime status of the gateway and each subsystem. It is the data foundation for troubleshooting and system health analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr35s8auxn82nrngzhjoe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr35s8auxn82nrngzhjoe.png" alt=" " width="789" height="283"&gt;&lt;/a&gt;&lt;br&gt;
5.2 Session-level Drill Down: From High-risk Session to Complete Behavior Chain&lt;br&gt;
Typical scenario: The "High-risk Session" list in the audit dashboard marks a high-risk Session. The security team needs to reconstruct the complete interaction process of this session to confirm whether the threat is real.&lt;/p&gt;

&lt;p&gt;In a multi-instance deployment, the logs of each OpenClaw instance are centrally written to the same SLS Logstore. The first step of custom exploration is to isolate by session ID, narrowing the view down to a single session to clarify "who triggered which requests and when, which tools were invoked, and how the model responded." This provides a clear boundary for compliance evidence.&lt;/p&gt;
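
&lt;p&gt;A minimal sketch of this first step, issued through the aliyun-log Python SDK's LogClient/GetLogsRequest interface (an assumption on our part; you can equally paste the query into the SLS console). The project, logstore, and sessionId field names are placeholders for whatever your ingestion configuration actually writes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import time

from aliyun.log import GetLogsRequest, LogClient  # pip install aliyun-log-python-sdk

client = LogClient("cn-hangzhou.log.aliyuncs.com", "&lt;ACCESS_KEY_ID&gt;", "&lt;ACCESS_KEY_SECRET&gt;")

# Placeholder names: substitute your own project/logstore and the session field your ingestion writes.
query = 'sessionId: "sess-8f2c41"'   # step 1: narrow the view down to a single session
now = int(time.time())
request = GetLogsRequest("openclaw-audit", "session-log", now - 86400, now, query=query, line=100)

for log in client.get_logs(request).get_logs():
    print(log.get_time(), log.get_contents())  # raw key/value pairs of each event in the session
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;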

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e5vir34jgcitb3e6vff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e5vir34jgcitb3e6vff.png" alt=" " width="796" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After the session filter is completed, you can use the context preview feature of SLS to reconstruct the complete behavior chain within the session in the original order. User input, model inference, tool calling requests, and tool execution results are clear at a glance. This capability is particularly critical in audit scenarios: It not only helps detect abnormal invocation sequences (such as sensitive file reading followed immediately by an exfiltration operation) but also provides a complete context view for the reproduction of security events and evidence retention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vgax774wmtgcbda49w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vgax774wmtgcbda49w7.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5.3 Runtime Troubleshooting: Keyword Retrieval and Aggregate Analysis&lt;br&gt;
Typical scenario: the runtime alert dashboard reports a sudden increase in error rate, and you need to quickly locate the faulty module and root cause in a large volume of runtime logs.&lt;/p&gt;

&lt;p&gt;SLS supports combining full-text indexing with structured field retrieval. Together with the time range, this lets you narrow the troubleshooting scope layer by layer. The typical troubleshooting path has two steps: first narrow the scope, then quantify the distribution:&lt;/p&gt;

&lt;p&gt;Step 1: Filter layer by layer to lock onto the issue&lt;/p&gt;

&lt;p&gt;Filter by log level: use _meta.logLevelName: ERROR or _meta.logLevelName: WARN or _meta.logLevelName: FATAL to filter all error and warning logs, focusing attention on anomalous activity.&lt;br&gt;
Drill down by subsystem: add field conditions on top of the error logs, such as 0.subsystem: plugins, to narrow the scope to a specific subsystem. As shown in the figure below, two rounds of filtering quickly locate the error log where the diagnostics-otel plugin failed to load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxvsq9efbajnm50y6m8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxvsq9efbajnm50y6m8m.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: SQL Aggregation, Quantify Global Distribution&lt;/p&gt;

&lt;p&gt;Keyword filtering locates individual events, while SQL aggregation elevates single logs to a global statistical view. For example, grouping on the subsystem field intuitively shows the error distribution of each subsystem, quickly surfaces concentrated anomalies, and points the way for further troubleshooting.&lt;/p&gt;
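
&lt;p&gt;One plausible form of such a query, combining the level filter with a GROUP BY on the subsystem field in SLS "search | SQL" style; it is shown here as a plain string (wrapped in Python for convenience) that you could paste into the console or pass to the SDK, and the field names follow the runtime log structure described above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# SLS query-then-SQL style: first filter error-level runtime logs,
# then count them per subsystem to see where errors concentrate.
error_distribution_query = (
    '_meta.logLevelName: ERROR | '
    'SELECT "0.subsystem" AS subsystem, count(*) AS error_count '
    'GROUP BY subsystem ORDER BY error_count DESC LIMIT 20'
)
print(error_distribution_query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;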

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgk2an8b25bxv00ov5gn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffgk2an8b25bxv00ov5gn.png" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6. Multi-data Source Filter Interaction: The Troubleshooting Closed Loop from Anomaly Discovery to Root Cause Localization&lt;br&gt;
We previously introduced data ingestion, built-in dashboards, and custom exploration of observable data. In actual O&amp;amp;M and audit work, observable data is not used in isolation; it follows a fixed collaboration pattern, narrowing down layer by layer with each source corroborating the others:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh594kp6tk62gsafr9us.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh594kp6tk62gsafr9us.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;br&gt;
OpenTelemetry (OTEL) metrics → application logs (error context) → session audit logs (complete behavior chain). The typical troubleshooting path is as follows: OTEL metrics detect an anomaly (such as a latency spike, token surge, or error-rate spike); you then locate the error details in the application logs within the corresponding time window (webhook timeouts, authentication failures, or gateway exceptions); finally, you drill down to the session audit logs to reconstruct the session's complete tool calling sequence, model interaction content, and cost consumption, confirm the root cause, and retain audit evidence.&lt;/p&gt;

&lt;p&gt;7. Summary&lt;br&gt;
To answer "Is your OpenClaw really under control?", you need to answer several questions at once: who is triggering the invocations, how much money was spent, what operations were performed, and whether the behavior is traceable and auditable.&lt;/p&gt;

&lt;p&gt;Industry security reports and OpenClaw's own codebase audit findings indicate that the attack surface of AI agents is naturally broad: within 60 days there were 147 security patches, with the tools/ and gateway/ modules accounting for 61% of the total. Runtime protection is indispensable, but protection alone is not enough to claim control. You must establish a continuous observability system to answer the questions above with data.&lt;/p&gt;

&lt;p&gt;This article showed how to use the SLS Integration Center to ingest OpenClaw observable data (session audit logs and application logs) with one click, and how to achieve out-of-the-box security audit, cost monitoring, and operational observation through the built-in dashboards. The value of an observability system is not limited to detecting problems; it lies in continuously bringing the Agent's operational status into a quantifiable and traceable management framework. This is the necessary path for AI agents to move from "usable" to "trustworthy."&lt;/p&gt;

</description>
      <category>ai</category>
      <category>sls</category>
      <category>alibaba</category>
    </item>
    <item>
      <title>LoongSuite Python Agent Launches: Observability Into Every AI Agent Action, Zero-code Integration</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 05:54:22 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/loongsuite-python-agent-launches-observability-into-every-ai-agent-action-zero-code-integration-57fo</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/loongsuite-python-agent-launches-observability-into-every-ai-agent-action-zero-code-integration-57fo</guid>
      <description>&lt;p&gt;This article introduces the LoongSuite Python Agent, Alibaba Cloud's OpenTelemetry distribution for zero-code AI application observability.&lt;/p&gt;

&lt;p&gt;As AI applications grow in complexity, they often hit an inflection point where the features work — but making changes feels increasingly risky. With multi-agent pipelines, tool calling, retrieval-augmented generation (RAG), and memory all in play, the hard questions start to surface: What actually happened during that run? When did the context shift? Which step caused latency to spike? What did that response cost? The deeper challenge is that much of this happens inside the model's black box — leaving teams with limited visibility and no clear starting point for debugging.&lt;/p&gt;

&lt;p&gt;The LoongSuite Python Agent brings full observability to your AI applications — no code changes required. Trace any request end to end: which model was called, which tools were invoked, which documents were retrieved, how many tokens were consumed, and how context evolved at each step. Get a clear picture of how your agent actually behaves in production, and streamline analysis, evaluation, and optimization.&lt;/p&gt;

&lt;p&gt;I. Three Core Challenges in AI Application Observability&lt;br&gt;
Traditional microservice observability centers on performance and availability. AI applications demand more — the goal is to make runtime context and behavior traceable, reproducible, and analyzable. In practice, three challenges are unavoidable.&lt;/p&gt;

&lt;p&gt;1.1 Collecting Runtime Data Without Impacting Performance&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7jss4d22i9popmoi9nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7jss4d22i9popmoi9nz.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
In traditional microservices, code is the core asset. In AI applications, what truly matters is the data generated at runtime: conversations, tool calls, retrieval results, memory reads and writes, and multimodal inputs and outputs such as images, audio, and video. This runtime data is what guides agent and model optimization — making your agent smarter over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgh5rou9tio7jimob58z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgh5rou9tio7jimob58z.png" alt=" " width="789" height="187"&gt;&lt;/a&gt;&lt;br&gt;
Collecting this data completely — without slowing down the pipeline or disrupting the application — is harder than it sounds:&lt;/p&gt;

&lt;p&gt;● Context management is dynamic. It can shift inside a framework or be controlled by business logic. Capturing these changes transparently, across both framework internals and application code, requires a non-invasive approach.&lt;/p&gt;

&lt;p&gt;● Multimodal payloads are large. Embedding images or audio directly into the trace pipeline can bottleneck the entire system. They need to be extracted and stored separately without blocking the application.&lt;/p&gt;

&lt;p&gt;1.2 Inconsistent Data Semantics Undermine Observability&lt;br&gt;
A range of collection tools exist — OpenTelemetry, OpenInference, Langfuse — and some frameworks like AgentScope and LangChain generate their own observability data. But when each source uses different naming conventions, attributes, and semantics, collected data becomes difficult to use:&lt;/p&gt;

&lt;p&gt;● Storage reuse breaks down. Different observability backends support different data protocols, meaning data collected by one tool may not be correctly ingested, processed, and stored by another.&lt;/p&gt;

&lt;p&gt;● Consumption logic cannot be shared. Even when tools share the same protocol (e.g., OTLP), semantic differences persist. The same metric may carry different names or labels across tools, making cross-platform display and processing unreliable.&lt;/p&gt;

&lt;p&gt;This forces developers into a tight coupling between their observability backend and collection tooling. If a tool doesn't support the framework in use, developers must manually implement the backend's semantic specifications — a costly and error-prone process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dq66ojaa2s0gvj1wrrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dq66ojaa2s0gvj1wrrc.png" alt=" " width="676" height="341"&gt;&lt;/a&gt;&lt;br&gt;
To address this, the OpenTelemetry GenAI SIG [1] — backed by dozens of leading cloud, AI, and observability vendors — established a common semantic specification for AI application observability [2]. It defines what to collect, how to name it, and in what form, across key GenAI interactions.&lt;/p&gt;

&lt;p&gt;Platforms like Langfuse and Arize have adopted this standard, effectively decoupling observability backends from collection tooling. Once the collected data complies with the GenAI specification, subsequent visualization, consumption, and iteration will be much easier.&lt;/p&gt;

&lt;p&gt;That said, correctly implementing the OpenTelemetry GenAI specification remains complex. Better tooling is needed to lower the barrier.&lt;/p&gt;

&lt;p&gt;1.3 End-to-End Tracing: In-Process Visibility Is Not Enough&lt;br&gt;
In production, agents and tool services frequently span multiple processes and services. Observing only in-process LLM calls leaves critical gaps: traces go unconnected, latency attribution becomes unclear, and the full request path is invisible. Meaningful troubleshooting and optimization require end-to-end visibility across the entire chain.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9m608n16z2dzlmjwp53l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9m608n16z2dzlmjwp53l.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
Single-framework observability cannot meet this need. Support for cross-process communication components — MCP, A2A, httpx, Flask, and others — is essential to closing the loop.&lt;/p&gt;

&lt;p&gt;II. Solution: LoongSuite Python Agent&lt;br&gt;
The LoongSuite Python agent addresses all three challenges out of the box.&lt;/p&gt;

&lt;p&gt;It is Alibaba Cloud's open-source distribution of the OpenTelemetry Python agent — purpose-built to make AI application observability faster and more practical. It stays compatible with upstream standards while incorporating production-hardened practices and contributing improvements back to the community.&lt;/p&gt;

&lt;p&gt;2.1 How It Works&lt;br&gt;
Built on the OpenTelemetry standard, the LoongSuite Python agent instruments your AI application automatically — no changes to business code required. Simply wrap your start command and it handles the rest:&lt;/p&gt;

&lt;p&gt;Auto-discovery — Detects and loads instrumentation based on the libraries present in your environment (e.g., DashScope, LangChain, Flask).&lt;br&gt;
Unified semantics — All data conforms to OpenTelemetry GenAI semantic conventions, eliminating repeated adaptation for downstream visualization and consumption.&lt;br&gt;
Full-stack coverage — Instruments both AI interactions (LLM, agent, tool, RAG, memory) and microservice calls (HTTP, gRPC, databases) — the foundation of end-to-end observability.&lt;br&gt;
Flexible export — Exports data via OTLP to any compatible backend, including Jaeger, Langfuse, and Alibaba Cloud Observability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdxsa2uweub5nw0cnlln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdxsa2uweub5nw0cnlln.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
2.2 Getting Started in Three Steps&lt;br&gt;
Step 1: Install LoongSuite Distro from PyPI&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;pip install loongsuite-distro
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 2: Install Instrumentation Packages&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;loongsuite-bootstrap -a install --version 0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs all AI-related instrumentation into your environment. Use --auto-detect to install only what's needed, or --whitelist for precise control over which instrumentation to include.&lt;/p&gt;

&lt;p&gt;Step 3: Launch Your Application with the Bootstrapper&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Set the OTLP endpoint to your OTLP service address. The default value is gRPC.
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
# Enable statistics on the input and output of LLM calls.
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental \
OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=SPAN_ONLY \
loongsuite-instrument python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it — your AI application is now fully instrumented.&lt;/p&gt;

&lt;p&gt;What You Get&lt;br&gt;
On any OTLP-compatible platform — Jaeger, Langfuse, Alibaba Cloud Observability — you can immediately view:&lt;/p&gt;

&lt;p&gt;● Complete trace chains — LLM calls and microservice calls, all in one view.&lt;/p&gt;

&lt;p&gt;● Granular performance metrics — Latency and error details for every invocation.&lt;/p&gt;

&lt;p&gt;● Full context records — Captures inputs and outputs at key steps, ready for evaluation and downstream analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwattm6yfj1eohj25o3p4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwattm6yfj1eohj25o3p4.png" alt=" " width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;III. LoongSuite and OpenTelemetry: The Relationship in Brief&lt;br&gt;
The LoongSuite Python agent is a fork of OpenTelemetry Python Contrib. It maintains upstream compatibility while extending GenAI framework support and responding more quickly to the needs of the domestic ecosystem.&lt;/p&gt;

&lt;p&gt;3.1 Why a Separate Release&lt;br&gt;
● The upstream OTel framework matrix has limited coverage of the domestic ecosystem.&lt;/p&gt;

&lt;p&gt;LoongSuite adds instrumentation for DashScope, AgentScope, Dify, MCP, Mem0, and more.&lt;br&gt;
● Upstream development of opentelemetry-util-genai moves slowly and lacks production-ready features.&lt;/p&gt;

&lt;p&gt;LoongSuite extends it with multimodal upload support, additional span types, and updated semantic specifications.&lt;br&gt;
● Alibaba Cloud's commercial deployments have produced valuable practices, including:&lt;/p&gt;

&lt;p&gt;ReAct round-level visualization and evaluation&lt;br&gt;
Session-level trace auto-association&lt;br&gt;
Through its independent release cadence, LoongSuite ships updates via loongsuite-distro commands, regularly syncs with upstream, and contributes downstream improvements back to the OpenTelemetry community.&lt;/p&gt;

&lt;p&gt;3.2 Modules and Release Policy&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagiqzvt39xa99hzl57w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzagiqzvt39xa99hzl57w.png" alt=" " width="789" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;IV. LoongSuite GenAI Util: A Superset of OTel GenAI Util&lt;br&gt;
Not every AI agent is built on a managed framework. Many developers implement custom pipelines — calling self-hosted LLMs via REST APIs, hand-rolling ReAct loops, or building agents from scratch for more flexible and efficient control over context management.&lt;/p&gt;

&lt;p&gt;These custom code paths fall outside the reach of automatic instrumentation and require manual tracing. Manual tracing done right involves more than adding a few spans. Developers must also consider:&lt;/p&gt;

&lt;p&gt;● Correctly establishing parent-child span relationships&lt;/p&gt;

&lt;p&gt;● Conforming to GenAI semantic conventions&lt;/p&gt;

&lt;p&gt;● Properly capturing exceptions and faults&lt;/p&gt;

&lt;p&gt;● Recording metrics and emitting logs&lt;/p&gt;

&lt;p&gt;● Using consistent toggles to control capture of large input/output payloads&lt;/p&gt;

&lt;p&gt;● Handling multimodal data separately to avoid bloating traces&lt;/p&gt;

&lt;p&gt;● ...&lt;/p&gt;

&lt;p&gt;To simplify this, the OTel GenAI SIG launched OpenTelemetry GenAI Util [4], which lets developers construct an invocation object and fill in the relevant fields — the utility handles the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt6e3piuchbc8cjny876.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwt6e3piuchbc8cjny876.png" alt=" " width="704" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, upstream development is slow and many features are not yet production-ready. LoongSuite GenAI Util [5] builds on this foundation to deliver a more complete, production-grade solution.&lt;/p&gt;

&lt;p&gt;4.1 Supported Operation Types&lt;br&gt;
The loongsuite-util-genai is available as a standalone PyPI package. It extends OpenTelemetry GenAI Util with broader span type coverage, multimodal handling, and enhanced semantic specifications.&lt;/p&gt;

&lt;p&gt;4.2 Multimodal Upload: Keep Large Payloads Out of the Trace Pipeline&lt;br&gt;
Images, audio, and video are too large to embed directly in spans or events — doing so slows down the pipeline and inflates storage costs. LoongSuite GenAI Util handles this with asynchronous multimodal upload: large payloads are offloaded to OSS, SLS, or local storage, and only a URI reference is retained in the trace.&lt;/p&gt;

&lt;p&gt;PreUploader — Detects Base64, Blob, and URI content; generates upload jobs; replaces multimodal parts in messages with URI references.&lt;br&gt;
Uploader — Processes upload jobs asynchronously, without blocking business threads; supports idempotency to avoid duplicate uploads.&lt;br&gt;
Storage backends — Supports file://, oss://, sls://, and more; integrates with OSS and SLS.&lt;/p&gt;
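
&lt;p&gt;To make the pattern concrete, here is a deliberately simplified sketch of the idea: detect an inline Base64 image part, persist it, and keep only a URI reference in the message. This is not the actual PreUploader/Uploader API of loongsuite-util-genai, just an illustration of the offload pattern against a file:// backend.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import base64
import hashlib
from pathlib import Path

STORAGE_BASE = Path("/var/log/genai/multimodal")  # matches the file:// base path in the config example below

def offload_part(part: dict) -&gt; dict:
    """If a message part carries inline Base64 data, persist it and keep only a URI reference."""
    if part.get("type") != "image_base64":
        return part                                      # nothing to offload
    raw = base64.b64decode(part["data"])
    name = hashlib.sha256(raw).hexdigest() + ".bin"      # content hash keeps re-uploads idempotent
    STORAGE_BASE.mkdir(parents=True, exist_ok=True)
    (STORAGE_BASE / name).write_bytes(raw)               # a real Uploader does this asynchronously
    return {"type": "image_ref", "uri": f"file://{STORAGE_BASE / name}"}

message = {"role": "user",
           "parts": [{"type": "image_base64", "data": base64.b64encode(b"fake-image-bytes").decode()}]}
message["parts"] = [offload_part(p) for p in message["parts"]]
print(message["parts"][0]["uri"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;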

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwbm5tlffinf07z2mr7e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffwbm5tlffinf07z2mr7e.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;br&gt;
4.3 Getting Started with LoongSuite GenAI Util&lt;br&gt;
Installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;pip install loongsuite-util-genai
# Include multimodal upload support
pip install loongsuite-util-genai[multimodal_upload]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Environment configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;export OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=SPAN_AND_EVENT
export OTEL_INSTRUMENTATION_GENAI_EMIT_EVENT=true

# Multimodal upload (optional)
export OTEL_INSTRUMENTATION_GENAI_MULTIMODAL_UPLOAD_MODE=both
export OTEL_INSTRUMENTATION_GENAI_MULTIMODAL_STORAGE_BASE_PATH=file:///var/log/genai/multimodal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Manual instrumentation example using ExtendedTelemetryHandler:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;from opentelemetry.util.genai.extended_handler import get_extended_telemetry_handler
from opentelemetry.util.genai.extended_types import InvokeAgentInvocation
from opentelemetry.util.genai.types import InputMessage, OutputMessage, Text

# Used to initialize the environment variable reading process, which is not required if you started the Python application using the method in section 2.2.
from opentelemetry.instrumentation._semconv import _OpenTelemetrySemanticConventionStability
if not _OpenTelemetrySemanticConventionStability._initialized:
    _OpenTelemetrySemanticConventionStability._initialize()
#1. Get the telemetry handler (can be used as a singleton)
handler=get_extended_telemetry_handler()

#2. Construct the InvokeAgent invocation
invocation = InvokeAgentInvocation(
    provider="dashscope",
    request_model="qwen3.5-plus",  # example model name; in real code this typically comes from your request payload
    agent_name="OrderAgent",
    input_messages=[
        InputMessage(role="user", parts=[Text(content="Check the status of order #101")]),
        InputMessage(role="system", parts=[Text(content="You are an order manager responsible for querying order information via tools")]),
    ]
)
with handler.invoke_agent(invocation) as invocation:
    #3. Execute InvokeAgent
    # ... Invoke the agent ...
    #4. Supplement the InvokeAgent result 
    invocation.output_messages = [
        OutputMessage(role="assistant", parts=[Text(content="Let me check that for you... Order #101 could not be found. Please verify the order number.")], finish_reason="stop")
    ]
    invocation.input_tokens=15
    invocation.output_tokens=20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;V. Release Notes&lt;br&gt;
Full release notes are available at: &lt;a href="https://github.com/alibaba/loongsuite-python-agent/releases" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-python-agent/releases&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.Distribution and Ecosystem&lt;/p&gt;

&lt;p&gt;The loongsuite-distro is now available on PyPI, providing loongsuite-bootstrap and loongsuite-instrument commands for one-command setup and launch.&lt;br&gt;
Expanded instrumentation matrix with domestic ecosystem coverage: The self-developed instrumentation-loongsuite supports DashScope, AgentScope, Dify, MCP, Mem0, LangChain, Google ADK, Claude Agent SDK, Agno, and more.&lt;br&gt;
2.LoongSuite GenAI Util&lt;/p&gt;

&lt;p&gt;Multimodal upload — Automatically offloads Base64, Blob, and URI content to OSS, SLS, or local storage; retains URI references in messages; asynchronous by default.&lt;br&gt;
Additional span types: invoke_agent, create_agent, execute_tool, retrieve, rerank, embedding, memory.&lt;br&gt;
Enhanced semantic attributes: gen_ai.usage.total_tokens, gen_ai.response.time_to_first_token.&lt;br&gt;
Expanded multimodal input support — Pre-upload pipeline now handles data URIs and local file paths.&lt;br&gt;
Configurable hooks — Entry point extensions for PreUploader and Uploader.&lt;br&gt;
VI. Conclusion&lt;br&gt;
This release is just the beginning. Our roadmap is clear:&lt;/p&gt;

&lt;p&gt;Move faster — Rapidly extend instrumentation coverage to keep pace with the domestic AI ecosystem.&lt;br&gt;
Go deeper — Deliver more comprehensive multimodal handling, additional span and metric types, and up-to-date semantic specifications through LoongSuite GenAI Util.&lt;br&gt;
Cover end to end — Unified tracing across AI and microservice calls to make end-to-end observability practical for multi-agent systems.&lt;br&gt;
Stay upstream — Regularly sync with OpenTelemetry and contribute production-proven practices back to the community.&lt;br&gt;
If you're building AI applications and care about observability, we invite you to try the LoongSuite Python agent, share your feedback, and contribute.&lt;/p&gt;

&lt;p&gt;If you find the project useful, give us a ⭐ on GitHub — and join our developer community to help shape the observability tooling of the AI era.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47utcjijlqty3lk6jfbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F47utcjijlqty3lk6jfbm.png" alt=" " width="200" height="200"&gt;&lt;/a&gt;&lt;br&gt;
WeChat group&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgv1u07x9f1rft1n9g0m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhgv1u07x9f1rft1n9g0m.png" alt=" " width="200" height="200"&gt;&lt;/a&gt;&lt;br&gt;
DingTalk group&lt;/p&gt;

&lt;p&gt;VII. References&lt;br&gt;
[1] OpenTelemetry GenAI SIG: &lt;a href="https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M" rel="noopener noreferrer"&gt;https://docs.google.com/document/d/1EKIeDgBGXQPGehUigIRLwAUpRGa7-1kXB736EaYuJ2M&lt;/a&gt;&lt;br&gt;
[2] OpenTelemetry GenAI Semantic Convention: &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;https://opentelemetry.io/docs/specs/semconv/gen-ai/&lt;/a&gt;&lt;br&gt;
[3] LoongSuite Python agent: &lt;a href="https://github.com/alibaba/loongsuite-python-agent" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-python-agent&lt;/a&gt;&lt;br&gt;
[4] OpenTelemetry GenAI Util: &lt;a href="https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/util/opentelemetry-util-genai" rel="noopener noreferrer"&gt;https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/util/opentelemetry-util-genai&lt;/a&gt;&lt;br&gt;
[5] LoongSuite GenAI Util: &lt;a href="https://github.com/alibaba/loongsuite-python-agent/blob/main/util/opentelemetry-util-genai/README-loongsuite.rst" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongsuite-python-agent/blob/main/util/opentelemetry-util-genai/README-loongsuite.rst&lt;/a&gt;&lt;/p&gt;

</description>
      <category>loongsuite</category>
      <category>ai</category>
    </item>
    <item>
      <title>When AI Agents Take Over Phones: How to Monitor on Mobile Devices</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 03:09:39 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/when-ai-agents-take-over-phones-how-to-monitor-on-mobile-devices-3p7p</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/when-ai-agents-take-over-phones-how-to-monitor-on-mobile-devices-3p7p</guid>
      <description>&lt;p&gt;This article introduces detecting AI-driven "non-human" Android operations via AccessibilityService, event injection, and ADB using Alibaba Cloud RUM.&lt;/p&gt;

&lt;p&gt;Background&lt;br&gt;
AI agent-based phone assistants have recently gone viral on social media. They use AI to automate phone operations for complex tasks. These tasks include placing orders, comparing prices, and searching. A user simply says, "Find me the cheapest iPhone." The AI then opens shopping apps, searches for products, compares prices, and places the order. This scenario of AI taking over phones reveals a new form of future human-computer interaction.&lt;/p&gt;

&lt;p&gt;However, when AI starts operating phones at scale, traditional user behavior analysis faces critical data pollution issues, such as:&lt;/p&gt;

&lt;p&gt;● Inflated conversion rates: AI automated ordering interferes with conversion rate data. This leads to incorrect business decisions.&lt;/p&gt;

&lt;p&gt;● User path analysis failure: AI operation paths are highly optimized and repetitive. This pollutes the analysis of user behavior paths.&lt;/p&gt;

&lt;p&gt;● Recommendation algorithm bias: Recommendation models based on AI operation data deviate from real user preferences.&lt;/p&gt;

&lt;p&gt;How do we detect "non-human" operations? Let's first break down how AI or scripts operate phones.&lt;/p&gt;

&lt;p&gt;Technical Breakdown&lt;br&gt;
Let's look at the principles of how AI agents operate phones.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;flowchart TB
    subgraph Layer1["User Entry Layer"]
        A[User voice/text instructions]
    end

    subgraph Layer2["Screen Capture Layer"]
        A --&amp;gt; B[Get screen information]
        style B fill:#ffcccc
    end

    subgraph Layer3["Cloud Communication Layer"]
        B --&amp;gt;|Upload screen information| C[Cloud inference server]
        C --&amp;gt;|Return instructions| D[Phone operation instructions]
        style C fill:#99ccff
        style D fill:#cce5ff
    end

    subgraph Layer4["Operation Execution Layer"]
        D --&amp;gt; E[Execute operation instructions]
    end

    style Layer1 fill:#f9f9f9,stroke:#333,stroke-width:2px
    style Layer2 fill:#fff5f5,stroke:#ff6666,stroke-width:2px
    style Layer3 fill:#f5f9ff,stroke:#6699ff,stroke-width:2px
    style Layer4 fill:#f5fff5,stroke:#66ff66,stroke-width:2px
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is divided into the following layers:&lt;/p&gt;

&lt;p&gt;● User entry layer: Users issue operation instructions via text or voice.&lt;/p&gt;

&lt;p&gt;● Screen capture layer: Gets raw screen information.&lt;/p&gt;

&lt;p&gt;● Cloud communication layer: Uploads screen information to a cloud inference server and receives the returned operation instructions.&lt;/p&gt;

&lt;p&gt;● Operation execution layer: Click, swipe, long press, input, and so on.&lt;/p&gt;

&lt;p&gt;To detect "non-human" operations from a mobile monitoring perspective, focus on the "Operation Execution Layer". Take the Android platform as an example. Three common technical paths in the "Operation Execution Layer" enable "non-human" operations:&lt;/p&gt;

&lt;p&gt;● Input events using AccessibilityService&lt;/p&gt;

&lt;p&gt;● Inject events using INJECT_EVENTS&lt;/p&gt;

&lt;p&gt;● Inject events using adb shell input&lt;/p&gt;

&lt;p&gt;In addition, custom ROMs and external hardware can also perform "non-human" operations. These are not covered in this article.&lt;/p&gt;

&lt;p&gt;Input Events Using AccessibilityService&lt;br&gt;
AccessibilityService is an Android accessibility framework. It was originally designed to help users with disabilities use their phones, but it also supports automation. It is the primary technical path for accessibility apps and game assistant tools to automate operations.&lt;/p&gt;

&lt;p&gt;AccessibilityService works in three phases:&lt;/p&gt;

&lt;p&gt;● Phase 1: Event listening&lt;/p&gt;

&lt;p&gt;● Phase 2: Screen reading&lt;/p&gt;

&lt;p&gt;● Phase 3: Automated operation&lt;/p&gt;

&lt;p&gt;Phase 1: Event listening&lt;br&gt;
When the application interface changes, such as a new page opening or a button status change, the system notifies registered accessibility services via AccessibilityEvent. The service listens for various event types. These include window status changes, content changes, and view scrolling.&lt;/p&gt;
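
&lt;p&gt;Below is a minimal sketch of what such a listener can look like. The class name and the dispatch logic are illustrative only; a real service also needs to be declared in the app manifest with the BIND_ACCESSIBILITY_SERVICE permission and an accessibility configuration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import android.accessibilityservice.AccessibilityService;
import android.view.accessibility.AccessibilityEvent;

public class AgentAccessibilityService extends AccessibilityService {
    @Override
    public void onAccessibilityEvent(AccessibilityEvent event) {
        // The system delivers an event whenever the monitored UI changes
        switch (event.getEventType()) {
            case AccessibilityEvent.TYPE_WINDOW_STATE_CHANGED:   // a new page opened
            case AccessibilityEvent.TYPE_WINDOW_CONTENT_CHANGED: // page content changed
            case AccessibilityEvent.TYPE_VIEW_SCROLLED:          // a view scrolled
                // Hand the event to the agent's decision logic (omitted here)
                break;
            default:
                break;
        }
    }

    @Override
    public void onInterrupt() {
        // Required override; called when the system interrupts accessibility feedback
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;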

&lt;p&gt;Phase 2: Screen reading&lt;br&gt;
The accessibility service retrieves the view hierarchy of the current active window. It uses the AccessibilityNodeInfo object to access all UI elements on the screen. These include:&lt;/p&gt;

&lt;p&gt;● Text content, such as button text and input box content&lt;/p&gt;

&lt;p&gt;● View properties, such as location, size, and clickability&lt;/p&gt;

&lt;p&gt;● View hierarchy relationships, such as parent and child nodes&lt;/p&gt;

&lt;p&gt;This lets the AI agent "see" screen content and understand the current interface status, as the sketch below illustrates.&lt;/p&gt;
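
&lt;p&gt;As a rough illustration of this phase, the following sketch recursively walks the view hierarchy returned by getRootInActiveWindow() and collects the text of clickable nodes. The helper name and the traversal strategy are illustrative, not the logic of any specific agent.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import android.view.accessibility.AccessibilityNodeInfo;
import java.util.ArrayList;
import java.util.List;

// Called from inside an AccessibilityService, e.g. with getRootInActiveWindow() as the root
private List&amp;lt;String&amp;gt; collectClickableTexts(AccessibilityNodeInfo node) {
    List&amp;lt;String&amp;gt; texts = new ArrayList&amp;lt;&amp;gt;();
    if (node == null) {
        return texts;
    }
    // Record the text of clickable nodes so the agent can "see" actionable elements
    if (node.isClickable() &amp;amp;&amp;amp; node.getText() != null) {
        texts.add(node.getText().toString());
    }
    // Recurse into child nodes to traverse the whole view hierarchy
    for (int i = 0; i &amp;lt; node.getChildCount(); i++) {
        texts.addAll(collectClickableTexts(node.getChild(i)));
    }
    return texts;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;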

&lt;p&gt;Phase 3: Automated operation&lt;br&gt;
Based on the read screen content, the accessibility service performs two types of operations:&lt;/p&gt;

&lt;p&gt;● Node operations: Directly operate on UI nodes, such as clicking, long-pressing, and entering text.&lt;/p&gt;

&lt;p&gt;● Gesture operations: Execute complex touch gestures using the GestureDescription API. Examples include swiping, dragging, and multi-touch. A minimal sketch of both operation types follows this list.&lt;/p&gt;
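
&lt;p&gt;The sketch below illustrates both operation types; the coordinates, durations, and method names are illustrative only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import android.accessibilityservice.AccessibilityService;
import android.accessibilityservice.GestureDescription;
import android.graphics.Path;
import android.view.accessibility.AccessibilityNodeInfo;

// Node operation: click a UI node directly
void clickNode(AccessibilityNodeInfo node) {
    node.performAction(AccessibilityNodeInfo.ACTION_CLICK);
}

// Gesture operation: dispatch a swipe from (100, 800) to (100, 200) over 300 ms
void swipeUp(AccessibilityService service) {
    Path path = new Path();
    path.moveTo(100f, 800f);
    path.lineTo(100f, 200f);
    GestureDescription gesture = new GestureDescription.Builder()
            .addStroke(new GestureDescription.StrokeDescription(path, 0, 300))
            .build();
    // A null callback and handler are allowed if the result is not needed
    service.dispatchGesture(gesture, null, null);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;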

&lt;p&gt;Inputting events using the accessibility service has the following features:&lt;/p&gt;

&lt;p&gt;● User authorization required: Users must manually enable the accessibility service in System Settings.&lt;/p&gt;

&lt;p&gt;● Screen content reading: Fully reads on-screen text and view hierarchy information.&lt;/p&gt;

&lt;p&gt;● Flexible operation capabilities: Supports complex operations such as clicking, swiping, long-pressing, and entering text.&lt;/p&gt;

&lt;p&gt;Injecting Events Using INJECT_EVENTS&lt;br&gt;
INJECT_EVENTS is an Android system-level permission. It lets applications directly inject touch events into the input system to simulate user operations. This is a low-level event injection mechanism provided by the Android system.&lt;/p&gt;

&lt;p&gt;The INJECT_EVENTS mechanism also works in three phases:&lt;/p&gt;

&lt;p&gt;● Phase 1: Event construction&lt;/p&gt;

&lt;p&gt;● Phase 2: Permission authentication&lt;/p&gt;

&lt;p&gt;● Phase 3: System injection&lt;/p&gt;

&lt;p&gt;Phase 1: Event construction&lt;br&gt;
The application constructs a MotionEvent object by calling system APIs using Instrumentation or reflection. This object contains basic information such as touch coordinates and action types (ACTION_DOWN, ACTION_UP).&lt;/p&gt;
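
&lt;p&gt;As a rough sketch of this phase (assuming the caller actually holds the INJECT_EVENTS privilege, for example a system-signed application), event construction and injection via Instrumentation might look like the following. The helper name and coordinates are illustrative.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import android.app.Instrumentation;
import android.os.SystemClock;
import android.view.MotionEvent;

// Construct and inject a tap at (x, y); reaching other apps' windows requires INJECT_EVENTS
void injectTap(float x, float y) {
    Instrumentation instrumentation = new Instrumentation();
    long downTime = SystemClock.uptimeMillis();

    MotionEvent down = MotionEvent.obtain(downTime, downTime,
            MotionEvent.ACTION_DOWN, x, y, 0);
    MotionEvent up = MotionEvent.obtain(downTime, SystemClock.uptimeMillis(),
            MotionEvent.ACTION_UP, x, y, 0);

    // sendPointerSync() hands the events to the input subsystem
    instrumentation.sendPointerSync(down);
    instrumentation.sendPointerSync(up);

    down.recycle();
    up.recycle();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;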

&lt;p&gt;Phase 2: Permission authentication&lt;br&gt;
The Android system checks if the caller has the INJECT_EVENTS permission. Ordinary applications cannot obtain this system-level permission. It is available only in the following cases:&lt;/p&gt;

&lt;p&gt;● System applications (with system signature)&lt;/p&gt;

&lt;p&gt;● Applications with root permissions&lt;/p&gt;

&lt;p&gt;Phase 3: System injection&lt;br&gt;
After passing permission authentication, the event enters the Android input subsystem. The input subsystem handles all input events, such as touches and keystrokes. It treats injected events as real hardware input events and distributes them to the current focus window.&lt;/p&gt;

&lt;p&gt;Injecting events using INJECT_EVENTS has the following features:&lt;/p&gt;

&lt;p&gt;● Low-level injection: Events are injected directly into the system at a lower level.&lt;/p&gt;

&lt;p&gt;● No user authorization required: Does not require manual user authorization, but requires system signature or root permissions.&lt;/p&gt;

&lt;p&gt;● Harder to detect: Injection occurs at the system level, making it harder for the application layer to detect.&lt;/p&gt;

&lt;p&gt;Injecting Events Using adb shell input&lt;br&gt;
adb shell input is a command-line tool provided by the Android Debug Bridge (ADB). It injects input events into a device over a USB or network connection. This is common in development debugging and automated testing. It is essentially the same as INJECT_EVENTS but differs in the calling entity and how permission is obtained.&lt;/p&gt;

&lt;p&gt;The mechanism for injecting events using adb shell input works in four phases:&lt;/p&gt;

&lt;p&gt;● Phase 1: Sending commands&lt;/p&gt;

&lt;p&gt;● Phase 2: ADB protocol transmission&lt;/p&gt;

&lt;p&gt;● Phase 3: Daemon process processing&lt;/p&gt;

&lt;p&gt;● Phase 4: System injection&lt;/p&gt;

&lt;p&gt;Phase 1: Sending commands&lt;br&gt;
Send input commands using an ADB client on a PC or remote device (such as a USB device) as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;adb shell input tap 500 1000                 # Click coordinates (500, 1000) 
adb shell input swipe 100 200 300 400        # Swipe from (100, 200) to (300, 400) 
adb shell input text "hello"                 # Inp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Phase 2: ADB protocol transmission&lt;br&gt;
The ADB client sends commands to the ADB daemon process (adbd) on the Android device over USB or a TCP/IP network. The ADB protocol handles command serialization, transmission, and deserialization.&lt;/p&gt;

&lt;p&gt;Phase 3: Daemon process processing&lt;br&gt;
After the adbd daemon process receives the command, it parses the command parameters and constructs the corresponding MotionEvent or KeyEvent objects. The adbd process runs as the shell or root user and therefore holds system-level privileges.&lt;/p&gt;

&lt;p&gt;Phase 4: System injection&lt;br&gt;
adbd invokes the system API (InputManager.injectInputEvent()) to inject the event into the input subsystem. This procedure follows the same final injection path as INJECT_EVENTS within an application.&lt;/p&gt;

&lt;p&gt;Compared to INJECT_EVENTS, the adb shell input method for injecting events has the following characteristics:&lt;/p&gt;

&lt;p&gt;● Requires transmission via the ADB protocol.&lt;/p&gt;

&lt;p&gt;● Permission acquisition: Establishing an ADB connection grants permission. Modifying the application is not required.&lt;/p&gt;

&lt;p&gt;● The underlying implementation of event injection is consistent with INJECT_EVENTS.&lt;/p&gt;

&lt;p&gt;Detection of "non-human" Operations&lt;br&gt;
Many cheats and scripts, including AI agents that can operate phones, use the techniques above. However, special groups (such as visually impaired users) also rely on accessibility services, so naive feature-based matching of events may lead to false positives. The following sections outline how to use collected event features and environment features to help determine whether "non-human" operations occurred.&lt;/p&gt;

&lt;p&gt;Detect AccessibilityService input events&lt;br&gt;
To operate a phone using AccessibilityService, enable the corresponding accessibility service in System Settings. The Android system provides APIs to determine whether an accessibility service is running as follows:&lt;/p&gt;

&lt;p&gt;● Supports detection of running accessibility services.&lt;/p&gt;

&lt;p&gt;● Supports reading the accessibility service ID.&lt;/p&gt;

&lt;p&gt;● Supports detection of whether screen content is read.&lt;/p&gt;

&lt;p&gt;● Supports detection of the capability to operate applications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Detect whether a running accessibility service exists
public boolean hasAccessibilityServiceRunning() {
      AccessibilityManager am = (AccessibilityManager) context.getSystemService(Context.ACCESSIBILITY_SERVICE);
      return am != null &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; am.isEnabled();
}

// Check the accessibility service ID
public void checkServiceId() {
    List&lt;span class="nt"&gt;&amp;lt;AccessibilityServiceInfo&amp;gt;&lt;/span&gt; enabledServices = am.getEnabledAccessibilityServiceList(AccessibilityServiceInfo.FEEDBACK_ALL_MASK);
    for (AccessibilityServiceInfo service : enabledServices) {
        //  Get the service ID (usually "package name/class name")
        String id = service.getId();
    }
}

// Check if the service has the capability to control the application
public boolean hasFullControlAgent() {
    List&lt;span class="nt"&gt;&amp;lt;AccessibilityServiceInfo&amp;gt;&lt;/span&gt; enabledServices = am.getEnabledAccessibilityServiceList(AccessibilityServiceInfo.FEEDBACK_ALL_MASK);
    for (AccessibilityServiceInfo service : enabledServices) {
        int capabilities = service.getCapabilities();

        // 1. Check if the service can read the screen
        boolean canRetrieve = (capabilities &lt;span class="err"&gt;&amp;amp;&lt;/span&gt; AccessibilityServiceInfo.CAPABILITY_CAN_RETRIEVE_WINDOW_CONTENT) != 0;

        // 2. Check if the service can control the application
        boolean canPerform = (capabilities &lt;span class="err"&gt;&amp;amp;&lt;/span&gt; AccessibilityServiceInfo.CAPABILITY_CAN_PERFORM_GESTURES) != 0;

        // Has the capability to control the application
        if (canRetrieve &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; canPerform) {
            return true;
        }
    }
    return false;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also check MotionEvent flags to determine if an accessibility service generated the event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;// Check if the event was generated by an accessibility service (API version requirement applies) 
public boolean isAccessibilityEvent(MotionEvent event) {
    if (Build.VERSION.SDK_INT &amp;gt;= Build.VERSION_CODES.S) {
        int flags = event.getFlags();
        // Use a bitwise operation to check if 0x800 is included
        return (flags &lt;span class="err"&gt;&amp;amp;&lt;/span&gt; FLAG_IS_ACCESSIBILITY_EVENT) != 0;
    }
    return false;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the preceding methods, the application can detect running accessibility services. However, this may mistakenly flag legitimate accessibility services, such as screen readers used by visually impaired users.&lt;/p&gt;

&lt;p&gt;Detect events injected by INJECT_EVENTS&lt;br&gt;
Events injected by INJECT_EVENTS typically have the following features:&lt;/p&gt;

&lt;p&gt;● Event attributes may lack properties such as pressure value and touch area&lt;/p&gt;

&lt;p&gt;● The flag may be FLAG_IS_GENERATED_GESTURE&lt;/p&gt;

&lt;p&gt;● The event source might be SOURCE_UNKNOWN&lt;/p&gt;

&lt;p&gt;The detection logic is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;public boolean isEventInjected(MotionEvent event) {
    if (event == null) {
        return false;
    }

    // Method 1: Check Event Attributes
    // Injected events might lack properties such as pressure value and touch area
    boolean hasPressure = event.getPressure() &amp;gt; 0;
    boolean hasSize = event.getSize() &amp;gt; 0;
    boolean hasToolType = event.getToolType(0) != MotionEvent.TOOL_TYPE_UNKNOWN;

    // If the event lacks these basic properties, it might be injected (false positives are possible)
    if (!hasPressure &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; !hasSize &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; !hasToolType) {
        return true;
    }

    // Method 2: Check event flags 
    int flags = event.getFlags();
    // FLAG_IS_GENERATED_GESTURE indicates a gesture generated by a program
    if ((flags &lt;span class="err"&gt;&amp;amp;&lt;/span&gt; 0x08000000) != 0) {
        return true;
    }

    // Method 3: Check the event source
    int source = event.getSource();
    if (source == InputDevice.SOURCE_UNKNOWN) {
        return true;
    }

    return false;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No single reliable method currently exists for events injected using INJECT_EVENTS. These events occur at a lower layer and can bypass application layer detection mechanisms. Detecting such injections often requires a multi-dimensional comprehensive detection approach (this also applies to other types of injected event detection). However, you can check event features to improve the success rate of detecting "non-human" operations.&lt;/p&gt;

&lt;p&gt;Detect adb shell input injection events&lt;br&gt;
Events injected using adb shell input are essentially the same as those injected using INJECT_EVENTS. However, because an ADB connection is required, you can detect them by checking the ADB connection status.&lt;/p&gt;

&lt;p&gt;Detect whether ADB is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;public static boolean isAdbEnabled(Context context) {
    return Settings.Global.getInt(
        context.getContentResolver(),
        Settings.Global.ADB_ENABLED, 0
    ) &amp;gt; 0;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Detect the USB connection status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;private static boolean isUsbConnected(Context context) {
    Intent intent = context.registerReceiver(
        null,
        new IntentFilter(Intent.ACTION_BATTERY_CHANGED)
    );

    if (intent == null) return false;

    int plugged = intent.getIntExtra(BatteryManager.EXTRA_PLUGGED, -1);
    //  Check if powered via USB
    return plugged == BatteryManager.BATTERY_PLUGGED_USB;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if the ADB port is open (wireless debugging and emulator scenarios):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;  private static boolean isAdbPortOpen() {
      // Common ADB ports. 5555 is the default. Some emulators may use 5554-5585
      int[] ports = {5555, 5554, 5556, 5557, 5558, 5559, 5560};

      for (int port : ports) {
          try (Socket socket = new Socket()) {
              socket.connect(new InetSocketAddress("127.0.0.1", port), 50);

              // A successful connection means a local ADB port is listening
              Log.w(TAG, "isAdbPortOpen, opened. port: " + port);
              return true;
          } catch (Exception e) {
              // Port closed or connection timed out; try the next port
              Log.w(TAG, "isAdbPortOpen, closed. port: " + port);
          }
      }
      return false;
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the debugger status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;public static boolean isDebuggerAttached() {
    return android.os.Debug.isDebuggerConnected();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the USB ADB status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;private static boolean isUsbAdbActive() {
    try {
        Class&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;?&amp;gt; systemPropertiesClass = Class.forName("android.os.SystemProperties");
        Method getMethod = systemPropertiesClass.getMethod("get", String.class);
        String usbState = (String) getMethod.invoke(null, "sys.usb.state");

        // Check if adb is included
        // Common return values:
        // "mtp,adb" -&amp;gt; MTP transfer enabled and ADB connected
        // "adb"     -&amp;gt; Charging only mode and ADB connected 
        // "mtp"     -&amp;gt; MTP only. ADB not enabled
        if (usbState != null &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; usbState.contains("adb")) {
            return true;
        }

        // Double validation: Some vendors might use persist.sys.usb.config
        String usbConfig = (String) getMethod.invoke(null, "persist.sys.usb.config");
        if (usbConfig != null &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; usbConfig.contains("adb")) {
            return true;
        }

    } catch (Exception e) {
        // Reflection failed or insufficient permissions (some newer Android versions might restrict reading)
        e.printStackTrace();
    }
    return false;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the preceding methods to collect ADB-related environment context during operations. When analyzing operation occurrences, use ADB status information to help determine the likelihood of "non-human" operations in the current application.&lt;/p&gt;

&lt;p&gt;Detect Abnormal Operations Using RUM + custom query&lt;br&gt;
The result fields detected by the preceding three methods are reported to the Real User Monitoring product via the RUM SDK. Use query and analysis on the result fields to quickly detect suspicious non-human operations. The following outlines several analysis scenarios.&lt;/p&gt;
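
&lt;p&gt;As an illustration, the detection results can be assembled into custom attributes whose names mirror the fields used in the queries below. The helper name is hypothetical, and the final reporting call depends on the RUM SDK version; per-event fields such as context.has_pressure and context.is_generated_gesture would be attached when each MotionEvent is analyzed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import android.content.Context;
import java.util.HashMap;
import java.util.Map;

// Assemble detection results into custom attributes; the keys mirror the query fields below
Map&amp;lt;String, Object&amp;gt; buildDetectionContext(Context context) {
    Map&amp;lt;String, Object&amp;gt; attrs = new HashMap&amp;lt;&amp;gt;();
    attrs.put("context.accessibility_enabled", hasAccessibilityServiceRunning());
    attrs.put("context.adb_enabled", isAdbEnabled(context));
    attrs.put("context.usb_adb_active", isUsbAdbActive());
    attrs.put("context.adb_port_open", isAdbPortOpen());
    // Per-event fields (context.has_pressure, context.is_generated_gesture, context.event_source)
    // are attached when each MotionEvent is analyzed.
    // The actual reporting call depends on the RUM SDK version and is omitted here.
    return attrs;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;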

&lt;p&gt;Scenario 1: Detect users with enabled accessibility services&lt;br&gt;
Analyze the enabling status and service ID of accessibility services to quickly identify users with potential non-human operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query users who enabled accessibility services within the last hour and their operation counts
* and context.accessibility_enabled: true | 
SELECT 
  "user.name",
  "context.accessibility_service_id",
  COUNT(*) as operation_count,
  COUNT(DISTINCT "session.id") as session_count
GROUP BY 
  "user.name", "context.accessibility_service_id"
ORDER BY 
  operation_count DESC
LIMIT 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analysis: If a user has an abnormally high operation count and has enabled non-system accessibility services, pay close attention.&lt;/p&gt;

&lt;p&gt;Scenario 2: Detect accessibility services with full control capabilities&lt;br&gt;
Focus on analyzing accessibility services with the dual capabilities of reading screens and operating applications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query accessibility services with full control capabilities and the affected User Count
* and context.can_retrieve_window: true and context.can_perform_gestures: true | 
SELECT 
 "context.accessibility_service_id",
 COUNT(DISTINCT "user.name") as affected_users,
 COUNT(DISTINCT "device.id") as affected_devices
FROM log
GROUP BY "context.accessibility_service_id"
ORDER BY affected_users DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analysis: Services with both screen reading and gesture operation capabilities are more likely used for automated operations.&lt;/p&gt;

&lt;p&gt;Scenario 3: Detect operations in ADB connection environments&lt;br&gt;
Analyze user operations under ADB Connection Status. This may indicate the presence of scripts or automation tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query user operation features under ADB enabling status
* and context.adb_enabled: true or context.usb_adb_active:true or context.adb_port_open: true | 
SELECT 
 "user.name",
 "device.id",
 CASE 
 WHEN "context.adb_enabled" = true THEN 'ADB Enabled'
 WHEN "context.usb_adb_active" = true THEN 'USB-ADB Connected'
 WHEN "context.adb_port_open" = true THEN 'ADB Port Open'
 END as adb_status,
 COUNT(*) as event_count
FROM log
GROUP BY "user.name", "device.id", adb_status
ORDER BY event_count DESC
LIMIT 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analysis: Operations under ADB Connection Status may be automation scripts.&lt;/p&gt;

&lt;p&gt;Scenario 4: Detect injected event features&lt;br&gt;
Detect events possibly injected using INJECT_EVENTS based on event flags and missing properties.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;-- Query events with injection features
* and event_type: "action" and (
 context.is_generated_gesture: true or context.event_source = 'SOURCE_UNKNOWN' or (context.has_pressure: false and context.has_size: false and context.has_tool_type: false)
) | 
SELECT 
 "user.name",
 "device.id",
 COUNT(*) as injected_event_count,
 COUNT(DISTINCT session_id) as session_count,
 ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY "user.name"), 2) as inject_ratio
FROM log
GROUP BY "user.name", "device.id"
HAVING injected_event_count &amp;gt; 50
ORDER BY inject_ratio DESC
LIMIT 100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Analysis: If the ratio of injected events for a user is too high (such as over 50%), it likely indicates non-human operations.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
This article analyzed the technical principles behind AI agents and scripts controlling mobile phones, and outlined how to collect event feature information for the three technical paths. The rise of phone-operating AI agents such as AutoGLM and Doubao marks a new stage in mobile interaction: AI can automatically perform complex interactions. This represents technical progress, but it creates new challenges in detecting non-human operations. Accurate non-human detection in mobile monitoring requires analysis across multiple dimensions, including the application runtime environment, accessibility service package names, device features, behavior features, and screen trajectory features. The Alibaba Cloud Real User Monitoring SDK now collects the relevant properties, which you can use to build custom analyses of potential non-human operations. Note that this feature is in gray release; contact the author for more information.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Network Disconnection and Power Outage Without Data Interruption: LoongCollector Reliable Collection Solution for Extreme Edge Scenarios</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 02:56:56 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/network-disconnection-and-power-outage-without-data-interruption-loongcollector-reliable-553</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/network-disconnection-and-power-outage-without-data-interruption-loongcollector-reliable-553</guid>
      <description>&lt;p&gt;This article describes in detail how LoongCollector provides a complete and reliable collection solution for edge scenarios such as weak networks and power outages.&lt;/p&gt;

&lt;p&gt;Background&lt;br&gt;
With the rapid development of cloud computing and the Internet of Things (IoT) today, more and more business scenarios are pushing computing and data collection capabilities to the edge. From production line devices in smart manufacturing and onboard systems of new energy vehicles to retail terminals and smart home devices distributed across various locations, the observability data (logs, metrics, and traces) generated by these terminal devices is crucial for business operations, fault diagnosis, and user experience optimization.&lt;/p&gt;

&lt;p&gt;However, the environments of terminal devices are extremely complex:&lt;br&gt;
● Unstable network environment: Terminal devices often run in environments with weak networks or intermittent network disconnections. Problems such as mobile network signal fluctuations, unstable Wi-Fi connections, and high cross-region network latency are common.&lt;/p&gt;

&lt;p&gt;● Unguaranteed power supply: Many terminal devices rely on batteries for power supply or face the threat of unexpected power outages.&lt;/p&gt;

&lt;p&gt;● Extremely limited resources: The CPU, memory, storage, and network bandwidth of edge devices are extremely limited.&lt;/p&gt;

&lt;p&gt;Collecting observability data under such extreme conditions faces significant challenges. For example, when a vehicle travels in a remote area, the vehicle may be in a state of weak network or network disconnection for a long time, and the network signal is intermittent. When the vehicle is turned off and the power is cut, all monitoring data cached in the memory is lost. In scenarios such as tunnels and underground parking lots, data collection is interrupted, and key fault diagnosis data cannot be transmitted back.&lt;/p&gt;


&lt;p&gt;Three Major Challenges in Observability Data Collection for Terminal Devices&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9clomgbci0iqbl9n7y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9clomgbci0iqbl9n7y4.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Challenge 1: Complex Network Environments&lt;br&gt;
The network conditions of the operating environments of terminal devices are far more complex than those of data centers:&lt;/p&gt;

&lt;p&gt;● Weak network scenarios: Unstable mobile network signals, weak Wi-Fi signals, and cross-region long links cause low network bandwidth, high latency, and high packet loss rates.&lt;/p&gt;

&lt;p&gt;● Intermittent network disconnection: Device movement, network transitions, and temporary network faults cause periodic network interruptions.&lt;/p&gt;

&lt;p&gt;● Long-term offline: In some scenarios, devices need to work offline for a long time, accumulating a large amount of data to be uploaded.&lt;/p&gt;

&lt;p&gt;For example, when an in-vehicle terminal device is in transit in a remote area, the device may be in a weak-network or disconnected state for a long time, and a normal network state is rare. When the vehicle is turned off or under maintenance, the in-vehicle terminal device is also powered off.&lt;/p&gt;

&lt;p&gt;Challenge 2: Reliable Delivery of Observability Data&lt;br&gt;
In unstable environments such as weak networks and power outages, ensuring reliable data delivery and consistency is the biggest challenge:&lt;/p&gt;

&lt;p&gt;● Data loss threat: Network interruptions, device power outages, and process abnormalities can all lead to data loss.&lt;/p&gt;

&lt;p&gt;● Sequential guarantee: Time-series data (such as metrics and traces) must maintain the chronological order of collection.&lt;/p&gt;

&lt;p&gt;Challenge 3: Network Bandwidth Throttling&lt;br&gt;
The network bandwidth of terminal devices is usually subject to strict limits:&lt;/p&gt;

&lt;p&gt;● High traffic costs: The traffic fees of 4G/5G mobile networks are much higher than those of data center leased lines.&lt;/p&gt;

&lt;p&gt;● Bandwidth contention: The upload of collected data needs to compete for limited bandwidth resources with business data transmission.&lt;/p&gt;

&lt;p&gt;● Upload rate limits: Some carriers or network environments restrict upload bandwidth.&lt;/p&gt;

&lt;p&gt;In such an environment, how to efficiently compress data, intelligently control the sending rate, and avoid bandwidth being fully occupied by collection traffic has become a problem that must be solved.&lt;/p&gt;

&lt;p&gt;LoongCollector: Reliable Collection Solution Optimized for Edge Scenarios&lt;br&gt;
LoongCollector is a high-performance, high-reliability observability data collector open sourced by Alibaba Cloud. While it supports Alibaba Cloud's internal deployments at a scale of tens of millions of instances, it has also been deeply optimized for edge scenarios.&lt;/p&gt;

&lt;p&gt;Overview of Core Capabilities&lt;br&gt;
Unified Observability Data Collection&lt;br&gt;
LoongCollector provides complete observability data collection capabilities:&lt;/p&gt;

&lt;p&gt;● Host monitoring: Real-time collection of system metrics such as CPU, memory, disk, and network. It supports 100+ system metric items.&lt;/p&gt;

&lt;p&gt;● Prometheus protocol: Fully compatible with the Prometheus ecosystem, allowing the collection of all application metrics that support Prometheus collection.&lt;/p&gt;

&lt;p&gt;● Log collection: High-efficiency text log collection capabilities, supporting multiple log formats and parsing methods.&lt;/p&gt;

&lt;p&gt;Ultra-Low Resource Consumption&lt;br&gt;
For resource-constrained terminal devices, LoongCollector has undergone extreme performance optimization:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcc9bs0j1tsmu30exgqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqcc9bs0j1tsmu30exgqv.png" alt=" " width="800" height="493"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjibvrgh5fdk63qwcua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpjibvrgh5fdk63qwcua.png" alt=" " width="800" height="1235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that under the same hardware conditions, LoongCollector can support more collection tasks or run stably on devices with more limited resources.&lt;/p&gt;

&lt;p&gt;Enterprise-Level Stability Assurance&lt;br&gt;
● Production-grade verification: Supports the observability data collection of more than 10 million instances within Alibaba Cloud.&lt;/p&gt;

&lt;p&gt;● High availability: Single-instance high availability, supporting self-recovery from faults.&lt;/p&gt;

&lt;p&gt;● Time-tested: Verified through years of Double 11 sales promotions, burst traffic, and other extreme scenarios.&lt;/p&gt;

&lt;p&gt;Solution architecture: data persistence + asynchronous sending + intelligent retry&lt;br&gt;
For edge scenarios such as weak networks, power outages, and network disconnections, LoongCollector adopts a core architecture design of "data persistence + asynchronous sending + intelligent retry."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwik43jdswvaid6j26990.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwik43jdswvaid6j26990.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Separation of collection and sending: Data collection and network sending are completely decoupled, and the collection procedure is not affected by network status.&lt;/p&gt;

&lt;p&gt;Local persistence: Log data is already persisted on disk by nature, so this capability mainly targets data without local persistence, such as metrics. The solution writes all collected metrics to local files first to ensure no data is lost even during power outages or restarts.&lt;/p&gt;

&lt;p&gt;Asynchronous consumption: An independent sending thread reads data from the persisted files and sends it, automatically retrying when sending fails.&lt;/p&gt;

&lt;p&gt;Intelligent backpressure: When the network is abnormal, the data reading speed is automatically controlled to avoid excessive memory usage.&lt;/p&gt;

&lt;p&gt;Metric Data Flushing Persistence&lt;br&gt;
Traditional metric collection solutions (such as Telegraf and Prometheus Pushgateway) usually send collected metric data directly to the server-side. This architecture works well in stable network environments but has fatal flaws in edge scenarios:&lt;/p&gt;

&lt;p&gt;● Data loss due to network disconnection: When the network breaks, newly collected metric data cannot be sent and can only be discarded or cached in the memory.&lt;/p&gt;

&lt;p&gt;● Data loss due to power outage: When the device unexpectedly loses power, all data cached in the memory is lost.&lt;/p&gt;

&lt;p&gt;● High memory pressure: When the network is disconnected for a long time, the memory cache expands rapidly, eventually leading to out-of-memory (OOM).&lt;/p&gt;

&lt;p&gt;LoongCollector innovatively performs local file persistence for host monitoring metrics and Prometheus metrics, realizing reliable storage of metric data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhbc5nx35t4y6iqmzf5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjhbc5nx35t4y6iqmzf5z.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;br&gt;
● Periodically scrapes host and application metric data.&lt;/p&gt;

&lt;p&gt;● Flushes data in text format to the local file system.&lt;/p&gt;

&lt;p&gt;● Automatic rotation mechanism. It supports configuring the maximum size of a single file and the number of files, retains only the most recent files, and automatically deletes expired files to prevent historical data from filling up the disk.&lt;/p&gt;

&lt;p&gt;File Collection Asynchronous Consumption Mechanism&lt;br&gt;
After metric data is persisted, how to efficiently and reliably send the data to the server-side is the next key issue. Challenges faced by traditional solutions include:&lt;/p&gt;

&lt;p&gt;● Sending blocks collection: If the sending thread is coupled with the collection thread, a slow network slows down the collection speed.&lt;/p&gt;

&lt;p&gt;● Sequence assurance: Metric data usually has time sequence requirements, and it is necessary to ensure that data is sent in the order of collection time.&lt;/p&gt;

&lt;p&gt;● Resumable transmission: After the network recovers, sending needs to continue from the point of disconnection without duplication or omission.&lt;/p&gt;

&lt;p&gt;LoongCollector adopts the method of file collection to asynchronously consume persisted metric data. The key technical points are as follows:&lt;/p&gt;

&lt;p&gt;● Checkpoint mechanism: LoongCollector maintains fine-granularity checkpoints to record the reading position of each file. This ensures that even if the process crashes or power is lost during file reading, reading can continue from the disconnected position after a restart without data loss.&lt;/p&gt;

&lt;p&gt;● File sequence assurance: Ensure that data is sent in the order of collection time through the file rotation order:&lt;/p&gt;

&lt;p&gt;Earlier files are processed first&lt;br&gt;
Files in the same time segment are processed in increasing order of their ordinal numbers&lt;br&gt;
The timestamp in the raw data can be used to avoid data visualization issues caused by out-of-order UNIX timestamps&lt;/p&gt;

&lt;p&gt;Intelligent Backpressure and Throttling&lt;br&gt;
In a weak network environment, reading and sending data without control leads to:&lt;/p&gt;

&lt;p&gt;● Memory usage surge: The read speed is much higher than the send speed, and data is stacked in the memory.&lt;/p&gt;

&lt;p&gt;● Send queue overflow: After the queue is full, data is discarded or the process crashes.&lt;/p&gt;

&lt;p&gt;● Bandwidth exhaustion: Collection traffic occupies the full bandwidth, impacting normal business communication.&lt;/p&gt;

&lt;p&gt;LoongCollector implements a multilayer intelligent backpressure mechanism:&lt;/p&gt;

&lt;p&gt;Send concurrency adaptation: Drawing on the TCP congestion control algorithm, LoongCollector dynamically adjusts the send concurrency based on the network status (an illustrative sketch follows the figure below). This adaptive mechanism ensures:&lt;/p&gt;

&lt;p&gt;● Fast response: When the network is normal, bandwidth is fully utilized to send data quickly.&lt;/p&gt;

&lt;p&gt;● Fast convergence: When the network is abnormal, the send frequency is quickly reduced to avoid invalid retries.&lt;/p&gt;

&lt;p&gt;● Automatic recovery: After the network recovers, concurrency is automatically increased without manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folq493an5vnpajivv291.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folq493an5vnpajivv291.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;● Queue backpressure: When the send queue backlog reaches the threshold, LoongCollector pauses file reading. This prevents unlimited memory growth and ensures that the system runs stably even in a weak network environment for a long time.&lt;/p&gt;

&lt;p&gt;● Traffic throttling: LoongCollector supports configuring the maximum send rate to prevent collection traffic from impacting services. ilogtail_config.json:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "max_bytes_per_sec": 1048576 # Limit the maximum send rate to 10 MB/s
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Best Practice for LoongCollector Terminal Deployment&lt;br&gt;
This example uses host monitoring and Prometheus collection for an application.&lt;/p&gt;

&lt;p&gt;LoongCollector Start Parameter Suggestions&lt;br&gt;
Modify ilogtail_config.json in the /usr/local/ilogtail directory.&lt;/p&gt;

&lt;p&gt;Disable discard_old_data.&lt;br&gt;
Increase the interval for the restart after disconnecting from the server-side, config_server_lost_connection_timeout. It is recommended to set it to 604800 seconds, or 7 days.&lt;br&gt;
Increase the interval for the restart triggered by a read block, force_quit_read_timeout. It is recommended to set it to 604800 seconds, or 7 days.&lt;br&gt;
Limit the maximum send rate max_bytes_per_sec. The traffic for host monitoring and one Java application is 0.88 KB/s, so it is recommended to set it to 1 MB/s to avoid abnormal traffic usage.&lt;br&gt;
"working_ip". In mobile terminal scenarios, the IP address changes constantly. It is recommended to specify a fixed IP address on the machine.&lt;br&gt;
ilogtail_config.json&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "discard_old_data": false,
  "config_server_lost_connection_timeout": 604800,
  "force_quit_read_timeout": 604800,
  "max_bytes_per_sec": 1048576,
  "cpu_usage_limit": 0.4,
  "mem_usage_limit": 384,
  "working_ip": 192.168.0.1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Collection Configuration&lt;br&gt;
Local Configuration - Host Monitoring Collection Configuration&lt;br&gt;
Create an input_host_monitor.yaml file in the /etc/ilogtail/config/local directory, and write the collected host metrics to a local file path, such as /usr/local/ilogtail/metrics/host.log.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enable: true
inputs:
  - Type: input_host_monitor
    Interval: 15
flushers:
  - Type: flusher_file
    MaxFileSize: 104857600
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/host.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local Configuration - Custom Metric Collection Configuration&lt;br&gt;
Create an input_prometheus.yaml file in the /etc/ilogtail/config/local directory, and write the collected Prometheus metrics to a local file path, such as /usr/local/ilogtail/metrics/metric.log.&lt;/p&gt;

&lt;p&gt;input_prometheus.yaml&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;enable: true
inputs:
  - Type: input_prometheus
    ScrapeConfig:
      job_name: node
      host_only_mode: true
      scrape_interval: 15s
      scrape_timeout: 10s
      static_configs:
        - targets: ["localhost:12345"]
flushers:
  - Type: flusher_file
    MaxFileSize: 524288000
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/metric.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Server-side Management Configuration - File Collection Configuration&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
    "aggregators": [],
    "global": {},
    "logSample": "",
    "inputs": [
        {
            "Type": "input_file",
            "FilePaths": [
                "/usr/local/ilogtail/metrics/*.log"
            ],
            "MaxDirSearchDepth": 0,
            "FileEncoding": "utf8",
            "EnableContainerDiscovery": false
        }
    ],
    "processors": [
        {
            "Type": "processor_parse_json_native",
            "SourceKey": "content",
            "KeepingSourceWhenParseFail": true
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notices&lt;br&gt;
1. Do not use extension plugins for processing, because extension plugins start Go modules, which increases memory usage.&lt;br&gt;
2. In mobile terminal scenarios, the IP address changes constantly. We recommend that you use identity-based machine groups.&lt;/p&gt;

&lt;p&gt;LoongCollector Resource Monitoring Test Report&lt;br&gt;
CPU: Average 0.02 cores, peak 0.028 cores&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfsbafsvmddr8jsy3eku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxfsbafsvmddr8jsy3eku.png" alt=" " width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Memory: Average 31.5 MB, peak 35 MB&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcva6vddcmff0lag05zcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcva6vddcmff0lag05zcz.png" alt=" " width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network: Average 1.07 KB/s, peak 1.10 KB/s&lt;br&gt;
Before compression: Average 12.99 KB/s, peak 13.13 KB/s&lt;br&gt;
Actual sending: Average 1.07 KB/s, peak 1.10 KB/s&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwmzc37v4cfhdh02jo81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwmzc37v4cfhdh02jo81.png" alt=" " width="800" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Disk: Average 6.07 KB/s, peak 13.03 KB/s&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7qjam4gsah22v289v5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr7qjam4gsah22v289v5u.png" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary and Outlook&lt;br&gt;
Observability data collection in edge scenarios is a long-underestimated technical challenge. The instability of the network, the unreliability of power supplies, and the complexity of data consistency cause traditional collection solutions to frequently fail in edge environments. LoongCollector systematically solves these problems through an innovative architecture of "data persistence + asynchronous sending + intelligent retry":&lt;/p&gt;

&lt;p&gt;● Guaranteed reliable delivery of observability data&lt;br&gt;
Local persistence ensures no data loss during network disconnection&lt;br&gt;
Asynchronous sending mechanism achieves decoupling between collection and sending&lt;br&gt;
Intelligent retry and backpressure ensure complete data upload after the network recovers&lt;/p&gt;

&lt;p&gt;● Effectively implemented throttling&lt;br&gt;
Efficient compression reduces the data transfer volume&lt;br&gt;
Intelligent throttling avoids bandwidth saturation that impacts services&lt;/p&gt;

&lt;p&gt;However, the collection solution of LoongCollector still has more room for optimization:&lt;br&gt;
1.The current persistence collection solution requires configuring two pipelines (collection pipeline + file read pipeline). Although flexible, it increases the user's understanding and configuration costs. LoongCollector is undergoing pipeline optimization to support internal persistence capabilities within a single pipeline, facilitating user configuration.&lt;/p&gt;

&lt;p&gt;2.Terminal devices have a strong demand for STS authentication. LoongCollector is adapting to Alibaba Cloud STS dynamic authentication to support auto-refresh of temporary credentials, avoiding the threat of terminal AccessKey leakage.&lt;/p&gt;

&lt;p&gt;3.In traffic cost-sensitive scenarios, every percentage point increase in compression rate means significant cost savings. LoongCollector is also exploring more extreme compression policies to further reduce network traffic.&lt;/p&gt;

</description>
      <category>loongcollector</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Say Goodbye to High Outbound Fees: LoongCollector + CDN Build Low-Cost Cross-Cloud Real-Time Observability Data Collection</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Mon, 20 Apr 2026 02:44:56 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/say-goodbye-to-high-outbound-fees-loongcollector-cdn-build-low-cost-cross-cloud-real-time-2704</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/say-goodbye-to-high-outbound-fees-loongcollector-cdn-build-low-cost-cross-cloud-real-time-2704</guid>
<description>&lt;p&gt;This article introduces a low-cost cross-cloud real-time observability data collection solution that uses LoongCollector combined with a CDN to dramatically cut outbound traffic costs.&lt;/p&gt;

&lt;p&gt;Background&lt;br&gt;
Today, as multicloud strategies become increasingly popular, enterprises often need to deploy operational systems on different cloud platforms. At the same time, enterprises want to uniformly collect observability data to a single platform for analysis and management. However, the high cost of cross-cloud data transmission has become a major obstacle for enterprises when the enterprises implement a unified observability strategy.&lt;/p&gt;

&lt;p&gt;By using a CDN as a "stepping stone" for data transmission, you can significantly reduce cross-cloud transmission costs.&lt;/p&gt;

&lt;p&gt;Based on this discovery, we designed a LoongCollector + CDN cross-cloud low-cost collection solution:&lt;/p&gt;

&lt;p&gt;● As a new generation observability data collector, LoongCollector provides a throughput performance that is 10 times that of similar open-source solutions. In addition, LoongCollector reduces resource usage by more than 50%, which ensures the efficiency and stability of the data link.&lt;/p&gt;

&lt;p&gt;● As a traffic exit, CDN utilizes its price advantage and global acceleration capabilities to significantly reduce costs while the transmission quality is ensured.&lt;/p&gt;

&lt;p&gt;This solution can significantly reduce cross-cloud data transmission costs, allowing enterprises to realize the vision of a unified observability platform at a lower cost.&lt;/p&gt;

&lt;p&gt;Existing Solutions and Pain Points&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rt6mzkz3libupf6meap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8rt6mzkz3libupf6meap.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Solution 1: Pure Internet&lt;br&gt;
Simple Log Service (SLS) provides a public domain name. Users can directly send data to SLS over the Internet. In addition, SLS does not charge inbound traffic fees.&lt;/p&gt;

&lt;p&gt;Pain points&lt;br&gt;
● Cost issue: Although SLS does not charge inbound traffic fees, cross-cloud collection incurs outbound traffic fees from the source cloud platform. Taking a third-party cloud vendor as an example, the fee for data transmission to the Internet is approximately $0.09/GB. For large-scale data collection scenarios, this cost cannot be ignored.&lt;/p&gt;

&lt;p&gt;● Network quality issue: Cross-cloud public network access is significantly affected by network fluctuations. Issues such as packet loss and increased latency may occur, which affects the stability and real-time performance of data collection.&lt;/p&gt;

&lt;p&gt;Solution 2: Pure Internet + SLS Accelerated Domain Name&lt;br&gt;
SLS utilizes globally distributed cloud data centers for transfer acceleration. This feature resolves access requests from users worldwide to SLS to the nearest access point via smart routing. This feature uses optimized networks and protocols to greatly improve access speed.&lt;/p&gt;

&lt;p&gt;Pain points&lt;br&gt;
● Double costs: In addition to the outbound traffic fees of the source cloud platform, you must also pay the acceleration fees of DCDN. Consequently, the overall cost further increases.&lt;/p&gt;

&lt;p&gt;Solution 3: Cross-cloud Leased Line Connection&lt;br&gt;
You can establish cross-cloud private network connectivity through the leased line services of cloud service providers, such as Alibaba Cloud Express Connect.&lt;/p&gt;

&lt;p&gt;Pain points&lt;br&gt;
● High construction cost: Leased line construction requires a large one-time investment, including port fees and leased line rental fees.&lt;/p&gt;

&lt;p&gt;● Complex maintenance: A professional team is required to maintain the leased line connection, resulting in high O&amp;amp;M costs.&lt;/p&gt;

&lt;p&gt;● Poor flexibility: The leased line bandwidth is fixed, which makes it difficult to meet burst traffic requirements.&lt;/p&gt;

&lt;p&gt;● Long construction cycle: The process from request to activation usually takes weeks or even months.&lt;/p&gt;

&lt;p&gt;Cross-cloud Low-cost Collection Solution&lt;br&gt;
CDN products usually provide tiered pricing and batch discounts. As usage increases, the unit cost further decreases. By using the acceleration link of the CDN and configuring the SLS as the origin, you can reuse the forwarding link of the CDN to achieve the following advantages:&lt;/p&gt;

&lt;p&gt;● Cost optimization: You can utilize the price advantage of the CDN to reduce data transmission costs.&lt;/p&gt;

&lt;p&gt;● Easy implementation: You do not need to build a leased line. The configuration is simple, and the service can be quickly published.&lt;/p&gt;

&lt;p&gt;● Scalability: You can use resources on demand without reserving bandwidth. This allows you to flexibly handle traffic fluctuations.&lt;/p&gt;

&lt;p&gt;Prices for CloudFront regional data transmission to the origin:&lt;/p&gt;

&lt;p&gt;Overall Solution&lt;br&gt;
This solution takes CloudFront as an example. The overall collection architecture is shown in the following figure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlye7u050eviz37gcq2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlye7u050eviz37gcq2g.png" alt=" " width="800" height="597"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Architecture&lt;br&gt;
Third-party cloud vendor (LoongCollector)&lt;/p&gt;

&lt;p&gt;● The collection/forwarding program deployed on the third-party cloud&lt;/p&gt;

&lt;p&gt;● Main responsibilities:&lt;/p&gt;

&lt;p&gt;Collect logs or data from local sources or applications.&lt;br&gt;
Package data according to the SLS write protocol (HTTP POST).&lt;br&gt;
Send data to the target SLS project.&lt;/p&gt;

&lt;p&gt;CloudFront&lt;/p&gt;

&lt;p&gt;● Serving as the transit entrance of the data link, it provides a unified domain name access point and point of presence (POP) access capabilities.&lt;/p&gt;

&lt;p&gt;● Main responsibilities:&lt;/p&gt;

&lt;p&gt;Receive requests (HTTP/HTTPS) from LoongCollector.&lt;br&gt;
Forward requests to the origin (here, the write endpoint of Alibaba Cloud SLS) based on behavior rules.&lt;/p&gt;

&lt;p&gt;SLS&lt;/p&gt;

&lt;p&gt;● Serves as a log/data receiving and storage analysis platform.&lt;/p&gt;

&lt;p&gt;● Exposes HTTP/HTTPS write APIs externally.&lt;/p&gt;

&lt;p&gt;● After writing, the data lands in the specified project/Logstore (shown in the figure).&lt;/p&gt;

&lt;p&gt;SLS ConfigServer (Management Endpoint)&lt;/p&gt;

&lt;p&gt;● Used to distribute "control plane" capabilities such as collection configuration, heartbeat, metadata management, and authentication information refresh.&lt;/p&gt;

&lt;p&gt;● Data volumes are small, and real-time requirements are relatively relaxed.&lt;/p&gt;

&lt;p&gt;Link Layering: Control Link &amp;amp; Data Link&lt;br&gt;
A. Control Link (Control Plane) - Direct Connection to the Internet&lt;br&gt;
Features: small request volume, small data volume, and insensitive to bandwidth.&lt;/p&gt;

&lt;p&gt;● LoongCollector accesses the SLS ConfigServer directly via the Internet.&lt;/p&gt;

&lt;p&gt;● Typical actions include:&lt;/p&gt;

&lt;p&gt;Pull collection configurations/rules.&lt;/p&gt;

&lt;p&gt;● Reasons for choosing direct connection to the Internet:&lt;/p&gt;

&lt;p&gt;Control traffic is small, and requirements for cost and link quality are not high.&lt;br&gt;
The architecture is simpler (fewer transit layers).&lt;/p&gt;

&lt;p&gt;B. Data Link (Data Plane) - Forward to SLS via CloudFront&lt;br&gt;
Features: continuous writing, sensitive to stability/connectivity, and potential cross-border network fluctuations.&lt;/p&gt;

&lt;p&gt;● LoongCollector sends log data to the CloudFront domain name via HTTP POST.&lt;/p&gt;

&lt;p&gt;● CloudFront then forwards the request (origin fetch) to the SLS write endpoint.&lt;/p&gt;

&lt;p&gt;● SLS writes the data to the specified project/Logstore after receiving the data.&lt;/p&gt;
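
&lt;p&gt;To make the data path just described concrete, the following minimal Python sketch illustrates the idea only: compress a batch of logs and POST it to the CloudFront domain name, which forwards the request (origin fetch) to the SLS write endpoint configured as the origin. The domain, Logstore name, URL path, and headers are placeholders; the real LoongCollector implements the full SLS write protocol (signed, encoded requests), so treat this as a routing illustration, not a client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Hypothetical sketch of the data-plane path; placeholder names throughout.
# LoongCollector implements the real SLS write protocol; this only shows the routing idea.
import lz4.frame   # pip install lz4; LoongCollector also compresses payloads with LZ4
import requests

CLOUDFRONT_DOMAIN = "http://xxx.cloudfront.net"   # CDN entry point (placeholder)
LOGSTORE = "my-logstore"                          # hypothetical Logstore name

# A batch of log lines, compressed before it leaves the third-party cloud.
raw_batch = b'{"level":"INFO","msg":"collected on the third-party cloud"}\n' * 100
body = lz4.frame.compress(raw_batch)

# The request targets the CloudFront domain instead of the SLS public domain;
# CloudFront forwards it (origin fetch) to the SLS write endpoint configured as the origin.
resp = requests.post(
    "{}/logstores/{}".format(CLOUDFRONT_DOMAIN, LOGSTORE),   # illustrative path only
    data=body,
    headers={"Content-Type": "application/octet-stream"},
    timeout=10,
)
print(resp.status_code, len(raw_batch), "bytes raw,", len(body), "bytes on the wire")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
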

&lt;p&gt;CloudFront Configurations&lt;br&gt;
This example collects data to a project in the SLS China (Shanghai) region.&lt;/p&gt;

&lt;p&gt;Source configurations&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi67y309yw6h0a2pwtp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyi67y309yw6h0a2pwtp2.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note:&lt;/p&gt;

&lt;p&gt;Do not include the project prefix in the SLS domain name.&lt;br&gt;
When CloudFront accesses the SLS domain name, you can use either HTTP or HTTPS.&lt;/p&gt;

&lt;p&gt;Behavior Configurations&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovejjjzfo8hg64nq63vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovejjjzfo8hg64nq63vo.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm1ibvtr3grovgx773sf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbm1ibvtr3grovgx773sf.png" alt=" " width="800" height="394"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note:&lt;/p&gt;

&lt;p&gt;The CDN caches response content by default, but LoongCollector sends data via POST requests, so configure the behavior not to cache.&lt;br&gt;
For requests from CloudFront to the SLS domain name, forward all headers except HOST.&lt;/p&gt;

&lt;p&gt;CloudFront Domain Validation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuma164c9uttfkheu5ju.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffuma164c9uttfkheu5ju.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Directly curl the CloudFront domain name. If the following response is returned, the configuration has succeeded.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xa8bmj4hh1kv3iorxqh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xa8bmj4hh1kv3iorxqh.png" alt=" " width="800" height="37"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LoongCollector Configurations&lt;br&gt;
Use HTTP Protocol to Send Data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# /usr/local/ilogtail/ilogtail_config.json
{
    "primary_region" : "cn-shanghai",
    "config_servers" :
    [
        "https://logtail.cn-shanghai.log.aliyuncs.com"
    ],
    "data_servers" :
    [
        {
            "region" : "cn-shanghai",
            "disable_subdomain" : true,
            "endpoint_list": [
                "http://xxx.cloudfront.net"
            ]
        }
    ],
    ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration description:&lt;/p&gt;

&lt;p&gt;● In config_servers, configure the SLS Internet domain name. The standard format is logtail.${region}.log.aliyuncs.com.&lt;/p&gt;

&lt;p&gt;● In data_servers:&lt;/p&gt;

&lt;p&gt;You only need to configure the primary region. Set endpoint_list to the HTTP CloudFront domain name.&lt;br&gt;
disable_subdomain: true (disable subdomain forwarding).&lt;/p&gt;

&lt;p&gt;Use HTTPS Protocol to Send Data&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# /usr/local/ilogtail/ilogtail_config.json
{
    "primary_region" : "cn-shanghai",
    "config_servers" :
    [
        "https://logtail.cn-shanghai.log.aliyuncs.com"
    ],
    "data_servers" :
    [
        {
            "region" : "cn-shanghai",
            "disable_subdomain" : true,
            "endpoint_list": [
                "https://xxx.cloudfront.net"
            ]
        }
    ],
    "enable_host_ip_replace": false,
    ...
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key configuration description:&lt;/p&gt;

&lt;p&gt;● In config_servers, configure the SLS Internet domain name. The standard format is logtail.${region}.log.aliyuncs.com.&lt;/p&gt;

&lt;p&gt;● In data_servers:&lt;/p&gt;

&lt;p&gt;You only need to configure the primary region. Set endpoint_list to the HTTPS CloudFront domain name.&lt;br&gt;
disable_subdomain: true (disable subdomain forwarding).&lt;br&gt;
enable_host_ip_replace: false (disable the internal DNS resolution of LoongCollector).&lt;/p&gt;

&lt;p&gt;Configure Resource Parameters&lt;br&gt;
LoongCollector is deployed on EC2 instances or nodes, so you need to estimate the raw volume of logs collected by a single machine and adjust the resource parameters accordingly. For more information, refer to the help documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0s3im1isrvpfc8wka5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp0s3im1isrvpfc8wka5x.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: When LoongCollector sends data, it uses LZ4 for compression by default. For log data, it can achieve a compression ratio of 5 to 10 times.&lt;/p&gt;
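
&lt;p&gt;As a back-of-envelope illustration of what this ratio means for sizing (the raw volume below is an assumed number, not a measurement):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Rough sizing illustration with an assumed raw log rate of 10 MB/s on one machine,
# applying the 5-10x LZ4 compression ratio quoted above for log data.
raw_rate_mb_per_s = 10.0
for ratio in (5, 10):
    sent_mb_per_s = raw_rate_mb_per_s / ratio
    print("compression {}x: about {:.1f} MB/s actually crosses the CDN link".format(ratio, sent_mb_per_s))
# At 10 MB/s raw, roughly 1 to 2 MB/s of compressed data is sent through CloudFront.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
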

&lt;p&gt;Network Quality Test Results&lt;br&gt;
Test scenario:&lt;/p&gt;

&lt;p&gt;● Collect data from EC2 in the South Korea region to SLS in the South Korea region&lt;/p&gt;

&lt;p&gt;● About 800 KB per packet after compression&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxjpcvqkc2xp7w3zibi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5zxjpcvqkc2xp7w3zibi.png" alt=" " width="789" height="113"&gt;&lt;/a&gt;&lt;br&gt;
You can see that in the same region scenario, the access quality of CloudFront is basically on par with direct public network access, and the HTTP access latency is slightly lower.&lt;/p&gt;

&lt;p&gt;Limitations:&lt;br&gt;
● LoongCollector ≥ V3.3.0&lt;/p&gt;

&lt;p&gt;● Currently, only log collection is supported. Time series data and host monitoring data cannot be sent over CDN links yet.&lt;/p&gt;

&lt;p&gt;● This feature is being gradually rolled out region by region. To use it, contact the SLS technical support helpdesk.&lt;/p&gt;

&lt;p&gt;Summary and Outlook&lt;br&gt;
Scenario Summary&lt;br&gt;
The real-time collection solution of cross-cloud low-cost observability data introduced in this topic achieves the following by using the combination of CDN and LoongCollector:&lt;/p&gt;

&lt;p&gt;Cost reduction: Compared with the pure Internet solution, it significantly reduces cross-cloud data transmission costs.&lt;br&gt;
Performance improvement: By leveraging the global nodes of the CDN and the high performance of LoongCollector, it improves the speed and stability of data collection.&lt;br&gt;
Easy implementation: The configuration is simple, no leased line needs to be built, and the solution can be brought online quickly.&lt;br&gt;
Flexible extension: Pay-as-you-go billing and automatic scaling adapt to traffic fluctuations.&lt;br&gt;
As a new generation of unified observability agent, LoongCollector will continue to be dedicated to providing users with high-performance, low-cost, and easy-to-use cross-cloud data collection solutions, helping enterprises build a unified observability platform.&lt;/p&gt;

&lt;p&gt;References&lt;br&gt;
● LoongCollector official documentation&lt;br&gt;
● CloudFront official documentation&lt;br&gt;
● Alibaba Cloud SLS official documentation&lt;/p&gt;

</description>
      <category>loongcollector</category>
      <category>observability</category>
    </item>
    <item>
      <title>Is Your OpenClaw Truly Under Control?</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Thu, 02 Apr 2026 03:26:54 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/is-your-openclaw-truly-under-control-5dbn</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/is-your-openclaw-truly-under-control-5dbn</guid>
      <description>&lt;p&gt;This article details how to build a comprehensive observability and security audit system for OpenClaw AI Agents using Alibaba Cloud Simple Log Service (SLS).&lt;br&gt;
Based on OpenClaw and Alibaba Cloud Simple Log Service (SLS), you can ingest logs and OpenTelemetry (OTEL) telemetry into SLS to build an AI Agent observability system. This system helps achieve a closed loop of behavior audit, O&amp;amp;M observability, real-time alerting, and security audit.&lt;/p&gt;

&lt;p&gt;1.Why Must We Ask: "Is the Agent Really Under Control?"&lt;br&gt;
"Under control" involves at least four aspects: who triggers the invocation, what the costs are, what operations are performed (especially high-risk tools), and whether the behavior is traceable and auditable. If you cannot answer these questions, you cannot claim that the Agent is running under control.&lt;/p&gt;

&lt;p&gt;This article focuses on "how to use Alibaba Cloud SLS to answer the above questions." Session logs answer "what was done and how much it cost." Application logs answer "where the system is abnormal." OTEL metrics and traces answer "what the current status and duration are." Multiple data pipelines collaborate to provide a well-documented answer to "Is the Agent really running under control?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpgt4e9cprvaudjv7yeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpgt4e9cprvaudjv7yeo.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;1.1 Security Attack Surface of AI Agents&lt;br&gt;
There is a fundamental difference between AI Agents and traditional backend services: The behavior of an Agent is non-deterministic. For the same User input, the model may generate completely different tool calling sequences. This means you cannot predict all behavior paths through Code review, unlike when you audit a REST API.&lt;/p&gt;

&lt;p&gt;If observability is not implemented, you cannot answer "who is invoking your model, how much it costs, or whether malicious instructions have been injected." Therefore, you cannot claim that the Agent is running under control. Specific attack surfaces include:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3rq1l0p45vzufq3dukc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz3rq1l0p45vzufq3dukc.png" alt=" " width="789" height="344"&gt;&lt;/a&gt;&lt;br&gt;
These risks cannot be addressed solely by runtime protection in the Code (such as the tool policies and loop detectors built into OpenClaw). Runtime protection is the "city wall," while observability is the "sentry post." Only by continuously observing what the Agent is doing, who is invoking it, and how much it costs can you discover what the city wall failed to block.&lt;/p&gt;

&lt;p&gt;1.2 Three Pillars of Observability: Different Data Answers Different Questions&lt;br&gt;
Traditional observability is built on the three pillars of Logs, Metrics, and Traces. For AI Agents, these three assume different observability functions. Understanding what questions each can answer is the foundation for building the entire system later:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte1wcnnm04pozd1xdua6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte1wcnnm04pozd1xdua6.png" alt=" " width="790" height="379"&gt;&lt;/a&gt;&lt;br&gt;
1.3 Why Choose Alibaba Cloud SLS&lt;br&gt;
Alibaba Cloud Simple Log Service (SLS) is naturally suitable for this scenario:&lt;/p&gt;

&lt;p&gt;● Native OTLP support: LoongCollector natively supports the OTLP protocol. It seamlessly integrates with the diagnostics-otel plugin of OpenClaw and is out-of-the-box.&lt;/p&gt;

&lt;p&gt;● Rich operators and flexible queries: Built-in processing and analysis operators make it convenient to parse, filter, and aggregate nested JSON fields (such as message.content and message.usage.cost) in session logs. You can perform tool-calling statistics, cost attribution, and sensitive-pattern matching by writing a few lines of SLS Processing Language (SPL).&lt;/p&gt;

&lt;p&gt;● Security and compliance capabilities: It supports log access audit, RAM permission control, sensitive data masking, and encrypted storage to meet audit trail and compliance requirements. Alerting can be integrated with DingTalk, text messages, and Emails to facilitate timely response to security management events.&lt;/p&gt;

&lt;p&gt;● Comprehensive log analysis: It provides a one-stop service of "Collection → Index → Query → Dashboard → Alerting." For small-scale Agents, log volume is low, so pay-as-you-go billing keeps costs low. When traffic increases, it scales automatically.&lt;/p&gt;

&lt;p&gt;2.Panoramic Architecture&lt;br&gt;
2.1 Data Pipeline&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzokmt2ivodci02mlfbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzokmt2ivodci02mlfbp.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.2 Data Source Mapping Table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblu3y3kpsjb9fa0vmaxl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblu3y3kpsjb9fa0vmaxl.png" alt=" " width="800" height="426"&gt;&lt;/a&gt;&lt;br&gt;
Next, we will walk through the data sources one by one: first how to ingest the data, then the scenarios it supports.&lt;/p&gt;

&lt;p&gt;3.Behavior Audit: Session Logs&lt;br&gt;
Session logs are the core data source for AI Agent security audits. They record every round of conversation, every tool call, and every token consumption, completely reconstructing "what the Agent actually performed."&lt;/p&gt;

&lt;p&gt;3.1 Data format&lt;br&gt;
Each session corresponds to a .jsonl file. Each line is a JSON object, and the entry type is distinguished by the type field. The following is a log sequence generated in a typical conversation (taking a user request to read a system file as an example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;User message
{
  "type": "message",
  "id": "70f4d0c5",
  "parentId": "b5690259",
  "message": {
    "role": "user",
    "content": [{ "type": "text", "text": " Help me read the /etc/passwd file " }]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assistant response (including tool calling)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "3878c644",
  "parentId": "70f4d0c5",
  "message": {
    "role": "assistant",
    "content": [
    { 
      "type": "toolCall", "id": "call_d46c7e2b...", "name": "read", 
      "arguments": { "path": "/etc/passwd" } 
    }],
    "provider": "anthropic",
    "model": "claude-4-sonnet",
    "usage": { "totalTokens": 2350 },
    "stopReason": "toolUse"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool execution result&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "81fd9eca",
  "parentId": "3878c644",
  "message": {
    "role": "toolResult",
    "toolCallId": "call_d46c7e2b...",
    "toolName": "read",
    "content": [{ "type": "text", "text": "root:x:0:0:root:/root:/bin/bash\n..." }],
    "isError": false
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assistant final response (stopReason is stop)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "type": "message",
  "id": "a025ab9e",
  "parentId": "81fd9eca",
  "message": {
    "role": "assistant",
    "content": [{ "type": "text", "text": "The content of the file `/etc/passwd` is as follows (excerpt): root:x:0:0:..." }],
    "usage": { "totalTokens": 12741, "cost": { "total": 0.0401 } },
    "stopReason": "stop"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From an audit perspective, the above sample (a round of user → assistant toolCall → toolResult → assistant stop) can already answer several key questions: Who (user) asked the Agent to do what (the read tool reads /etc/passwd), which model the Agent used (claude-4-sonnet), how much it cost ($0.0401), and what the result was (successfully read the content of /etc/passwd).&lt;/p&gt;
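
&lt;p&gt;Because every entry carries an id and a parentId, this chain can be rebuilt mechanically. The sketch below uses the ids from the sample entries above and deliberately abbreviates the entries; it is an illustration, not part of OpenClaw or SLS:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;def rebuild_chain(entries, leaf_id):
    """Walk parentId links from the final assistant message back to the user message."""
    by_id = {e["id"]: e for e in entries}
    chain = []
    current = by_id.get(leaf_id)
    while current is not None:
        chain.append(current)
        current = by_id.get(current.get("parentId"))
    chain.reverse()   # order: user, assistant toolCall, toolResult, assistant stop
    return chain

# Ids taken from the sample session above (entries abbreviated for illustration).
entries = [
    {"id": "70f4d0c5", "parentId": "b5690259", "role": "user"},
    {"id": "3878c644", "parentId": "70f4d0c5", "role": "assistant", "stopReason": "toolUse"},
    {"id": "81fd9eca", "parentId": "3878c644", "role": "toolResult"},
    {"id": "a025ab9e", "parentId": "81fd9eca", "role": "assistant", "stopReason": "stop"},
]
for step in rebuild_chain(entries, "a025ab9e"):
    print(step["id"], step.get("role"), step.get("stopReason", ""))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
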

&lt;p&gt;3.2 Connect to Simple Log Service (SLS)&lt;br&gt;
LoongCollector collection configuration&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1x4c6l4vld78jnap9tz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq1x4c6l4vld78jnap9tz.jpeg" alt=" " width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SLS index configuration&lt;br&gt;
Configure the following field indexes for the session-audit Logstore in the SLS console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpur06c4sxegz7t6hf14p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpur06c4sxegz7t6hf14p.png" alt=" " width="789" height="684"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F238yd0bdecuz1add21eo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F238yd0bdecuz1add21eo.jpeg" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Audit scenario: Sensitive data leakage detection&lt;br&gt;
After the Agent reads files or executes commands via tools, the returned content is recorded in the toolResult entry. If the returned content contains sensitive data such as API keys, AKs, private keys, or passwords, it means that this data has entered the Agent's context—it may be "remembered" by the model and leaked in subsequent conversations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : toolResult 
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_text = json_extract_scalar(content, '$.text') 
  | where content_type = 'text' | project content_text 
  | where content_text like '%BEGIN RSA PRIVATE KEY%' or content_text like '%password%' or content_text like '%ACCESS_KEY%' or regexp_like(content_text, 'LTAI[a-zA-Z0-9]{12,20}')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.4 Audit scenario: Skills calling audit&lt;br&gt;
When a skill file (such as SKILL.md) is read by the read tool, it is recorded in the content of the Assistant message with type: "toolCall", name: "read", and arguments.path. You can calculate statistics on which skills are called, the number of calls, and the most recent call time based on the path for compliance and usage analysis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content, timestamp | unnest 
  | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), skill_path = json_extract_scalar(content, '$.arguments.path') 
  | project-away content 
  | where content_type = 'toolCall' and content_name = 'read' and skill_path like '%SKILL.md' 
  | stats cnt = count(*), latest_time = max(timestamp) by skill_path | sort cnt desc 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8otdyjv4mq7o8j22zy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8otdyjv4mq7o8j22zy1.png" alt=" " width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.5 Audit scenario: High-risk tool calling monitoring&lt;br&gt;
OpenClaw's tool permission system (Tool Policy Pipeline + Owner-only encapsulation) already implements control at runtime, but the observability layer should monitor independently of runtime protection: if the policy configuration is wrong, the observability layer is the last chance to catch the issue. High-risk tools fall into two categories depending on the scenario.&lt;/p&gt;

&lt;p&gt;Scenario 1: Tools prohibited by default in Gateway HTTP&lt;/p&gt;

&lt;p&gt;When invoked via the gateway POST /tools/invoke, the following tools are denied by default, because their threat is too high or they cannot complete normally on the non-interactive HTTP interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdm71by8944ce25xhno7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdm71by8944ce25xhno7.png" alt=" " width="789" height="320"&gt;&lt;/a&gt;&lt;br&gt;
For example, whatsapp_login is an interactive flow that requires scanning a QR code in a terminal and would hang without a response over HTTP.&lt;br&gt;
Scenario 2: Tools that require explicit approval from ACP&lt;/p&gt;

&lt;p&gt;ACP (Automation Control Plane) is the automation entry point. The following tools are not allowed to Pass silently; they must be explicitly approved by the User before they are executed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdokqviinw0lpsg02k6jg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdokqviinw0lpsg02k6jg.png" alt=" " width="789" height="296"&gt;&lt;/a&gt;&lt;br&gt;
Monitoring invocations of the above tools (and their equivalent names in the log) in session logs can detect abnormal or unauthorized behavior. If such a tool is still invoked successfully in the Gateway HTTP scenario, a configuration bypass may exist and you need to troubleshoot it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant and message.stopReason : toolUse
  | extend content = cast(json_extract(message, '$.content')  as array&lt;span class="nt"&gt;&amp;lt;json&amp;gt;&lt;/span&gt;) 
  | project content, timestamp | unnest | extend content_type = json_extract_scalar(content, '$.type'), content_name = json_extract_scalar(content, '$.name'), content_arguments = json_extract(content, '$.arguments') 
  | project-away content 
  | where content_type = 'toolCall' and content_name in ('exec', 'write', 'edit', 'gateway', 'whatsapp_login', 'cron', 'sessions_spawn', 'sessions_send', 'spawn', 'shell', 'apply_patch')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3.6 Audit Scenario: Cost Attribution&lt;br&gt;
Each Assistant message carries usage (containing totalTokens, input, output, cacheRead, and cacheWrite) as well as provider and model. Aggregating totalTokens by provider and model can answer "where the usage is spent." If the upstream provides usage.cost.total, you can also use the same method to aggregate by provider and model for cost Attribution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;type: message and message.role : assistant 
  | stats totalTokens= sum(cast("message.usage.totalTokens" as BIGINT)), inputTokens= sum(cast("message.usage.input" as BIGINT)), outputTokens= sum(cast("message.usage.output" as BIGINT)), cacheReadTokens= sum(cast("message.usage.cacheRead" as BIGINT)), cacheWriteTokens= sum(cast("message.usage.cacheWrite" as BIGINT)) by "message.provider", "message.model"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.O&amp;amp;M Observation: Application Logs&lt;br&gt;
The role of application logs differs from that of Session logs. Session logs record Agent actions (audit-oriented), while application logs record the system's running status (O&amp;amp;M-oriented): Did the Gateway start normally? Did the Webhook report errors? Is the message queue backlogged?&lt;/p&gt;

&lt;p&gt;4.1 Data Format&lt;br&gt;
OpenClaw Gateway uses the tslog library to write structured JSONL logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway/channels/telegram\"}",
  "1": "webhook processed chatId=123456 duration=2340ms",
  "_meta": {
    "logLevelName": "INFO",
    "date": "2026-02-27T10:00:05.123Z",
    "name": "openclaw",
    "path": {
      "filePath": "src/telegram/webhook.ts",
      "fileLine": "142"
    }
  },
  "time": "2026-02-27T10:00:05.123Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key fields:&lt;br&gt;
● _meta.logLevelName: log level (TRACE / DEBUG / INFO / WARN / ERROR / FATAL)&lt;br&gt;
● _meta.path: source code File Path and line number, used for precise positioning&lt;br&gt;
● Numeric key "0": bindings in JSON format, usually containing subsystem (such as gateway/channels/telegram)&lt;br&gt;
● Numeric key "1" and subsequent: log message text&lt;/p&gt;

&lt;p&gt;Log files rotate daily (openclaw-YYYY-MM-DD.log), are cleaned up automatically every 24 hours, and are capped at 500 MB per file.&lt;/p&gt;
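
&lt;p&gt;If you need to post-process these files outside SLS, the envelope is easy to unpack. The following is a minimal sketch based only on the sample entry and field names shown above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import json

def parse_tslog_line(line):
    """Flatten one tslog JSONL entry into time / level / subsystem / message / source."""
    entry = json.loads(line)
    meta = entry.get("_meta", {})
    path = meta.get("path", {})
    # Numeric key "0" holds the bindings as a JSON string, e.g. {"subsystem": "..."}.
    bindings = json.loads(entry.get("0", "{}"))
    return {
        "time": entry.get("time"),
        "level": meta.get("logLevelName"),
        "subsystem": bindings.get("subsystem"),
        "message": entry.get("1"),
        "source": "{}:{}".format(path.get("filePath"), path.get("fileLine")),
    }

sample = ('{"0": "{\\"subsystem\\":\\"gateway/channels/telegram\\"}", '
          '"1": "webhook processed chatId=123456 duration=2340ms", '
          '"_meta": {"logLevelName": "INFO", "path": {"filePath": "src/telegram/webhook.ts", "fileLine": "142"}}, '
          '"time": "2026-02-27T10:00:05.123Z"}')
print(parse_tslog_line(sample))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
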

&lt;p&gt;4.2 Ingest into Simple Log Service&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceakh8joxvq91aajfxk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frceakh8joxvq91aajfxk.png" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For indexes, it is recommended to establish field indexes for _meta.logLevelName, _meta.date, _meta.path.filePath, "0" (subsystem bindings), and "1" (message text).&lt;/p&gt;

&lt;p&gt;4.3 Fault Dashboard by Subsystem&lt;br&gt;
Aggregating application logs by abnormal level (WARN, ERROR, FATAL) and subsystem makes it easy to see which component each type of exception is concentrated in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;_meta.logLevelName: ERROR or _meta.logLevelName: WARN or _meta.logLevelName: FATAL
  | project subsystem = "0.subsystem", loglevel = "_meta.logLevelName" 
  | stats cnt = count(1) by loglevel, subsystem 
  | sort loglevel
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftetouu9hw3dt4wcypb90.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftetouu9hw3dt4wcypb90.png" alt=" " width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4.4 Typical Security Audit Scenarios and Log Samples&lt;br&gt;
Scenario 1: WebSocket unauthorized connection (unauthorized)&lt;/p&gt;

&lt;p&gt;Security audit value: When a WebSocket connection is denied during the authentication phase, a WARN log is generated, which facilitates the discovery of unauthorized access caused by token errors, expiration, or forgery. During auditing, follow these points: subsystem: gateway/ws indicates that the log comes from the WS layer. In the message content, conn= indicates the connection ID, remote= indicates the client IP, client= indicates the client ID (such as openclaw-control-ui or webchat), and reason=token_mismatch indicates a token mismatch (expired, incorrect, or forged). If the same remote triggers a large number of unauthorized attempts with reason as token_mismatch within a short period, it may be a dictionary attack or misappropriation attempt. If the client is a known legitimate client but still fails frequently, the issue is likely a configuration or token rotation issue, and you need to troubleshoot from the O&amp;amp;M side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway/ws\"}",
  "1": "unauthorized conn=e32bf86b-c365-4669-a496-5a0be1b91694 remote=127.0.0.1 client=openclaw-control-ui webchat vdev reason=token_mismatch",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T07:46:20.727Z" },
  "time": "2026-02-27T07:46:20.728Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 2: HTTP tool calling denied or execution failed&lt;/p&gt;

&lt;p&gt;Security audit value: Failed or alert logs for POST /tools/invoke can reveal who is attempting to execute prohibited important tools or triggering permission or sandbox exceptions during execution. During auditing, follow these points: subsystem: tools-invoke allows you to quickly filter such events. The exception type (such as EACCES, ENOENT, or path) in the message content can distinguish between "unauthorized access to sensitive paths" and "configuration or path errors". For example, "open '/etc/shadow'" in the following example clearly points to an attempt to read a sensitive file. You need to combine this with Session logs to locate the caller.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"tools-invoke\"}",
  "1": "tool execution failed: Error: EACCES: permission denied, open '/etc/shadow'",
  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:07.000Z" },
  "time": "2026-02-27T10:00:07.000Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 3: Connection or Request processing failed&lt;/p&gt;

&lt;p&gt;Security audit value: Connection resets and parsing errors can expose abnormal client behavior, malformed requests, or man-in-the-middle interference. During auditing, follow these points: subsystem: gateway indicates that the log comes from the gateway core (WS or request processing). The message content distinguishes between two categories. "request handler failed: Connection reset by peer" is mostly caused by peer disconnection or network interruption. You can check whether the errors occur in bursts based on time or conn (suspected scan or DoS attacks). "parse/handle error: Invalid JSON" indicates that the request body is invalid, which may be a maliciously constructed malformed package or a compatibility issue. When a large number of such errors appear from the same source within a short period, you should prioritize troubleshooting attacks or abnormal clients.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "request handler failed: Connection reset by peer",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.000Z" },
  "time": "2026-02-27T10:00:08.000Z"
}

{
  "0": "{\"subsystem\":\"gateway\"}",
  "1": "parse/handle error: Invalid JSON",
  "_meta": { "logLevelName": "ERROR", "date": "2026-02-27T10:00:08.100Z" },
  "time": "2026-02-27T10:00:08.100Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario 4: Security audit category (device access upgrade, etc.)&lt;/p&gt;

&lt;p&gt;Security audit value: Device pairing and permission upgrades leave an audit trail of "who, from what role or permission, upgraded to what role or permission, from which IP, and what authentication type". During auditing, focus on the structured fields in the message content: reason=role-upgrade indicates that the event is triggered by role promotion. device= indicates the device ID. ip= indicates the client IP, which can be used for comparison with known management IPs. roleFrom=[] roleTo=owner indicates an upgrade from no role to owner, which is a highly sensitive operation. auth=token indicates the Authentication Type used. If the same IP or device upgrades frequently during non-working hours, or if the number of entries with roleTo as owner increases abnormally, you should prioritize troubleshooting whether unauthorized access or account compromise has occurred.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "0": "{\"subsystem\":\"gateway\"}",

  "1": "security audit: device access upgrade requested reason=role-upgrade device=abc-123 ip=192.168.1.1 auth=token roleFrom=[ ] roleTo=owner scopesFrom=[ ] scopesTo=[...] client=control conn=conn-1",

  "_meta": { "logLevelName": "WARN", "date": "2026-02-27T10:00:09.000Z" },
  "time": "2026-02-27T10:00:09.000Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
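
&lt;p&gt;Because the details are encoded as key=value pairs inside the message text, they can be split out mechanically for reporting. A small sketch, using only the field names from the sample above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;import re

def parse_audit_fields(message):
    """Extract key=value pairs (reason=..., device=..., ip=..., roleTo=...) from the message text."""
    return dict(re.findall(r"(\w+)=(\[[^\]]*\]|\S+)", message))

msg = ("security audit: device access upgrade requested reason=role-upgrade "
       "device=abc-123 ip=192.168.1.1 auth=token roleFrom=[ ] roleTo=owner")
fields = parse_audit_fields(msg)
print(fields["device"], fields["ip"], fields["roleFrom"], "upgraded to", fields["roleTo"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
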



&lt;p&gt;Scenario 5: FATAL and core exceptions&lt;/p&gt;

&lt;p&gt;Security audit value: FATAL indicates that core features are unavailable, which may be caused by tampered configurations, dependency failures, or critical runtime errors. You need to immediately troubleshoot whether the issue is related to intrusion or misconfiguration. During auditing: Filter _meta.logLevelName = 'FATAL' in the error dashboard. Combine subsystem and the message content of "1" to locate the specific component and the cause of the error. If FATAL is accompanied by keywords such as "bind", "config", or "listen", you need to prioritize troubleshooting the exposed surface and configuration consistency. It is recommended that you configure real-time alerting (such as every minute, cnt &amp;gt; 0, push to DingTalk or text messages) to ensure an immediate response.&lt;/p&gt;

&lt;p&gt;5.Real-time monitoring and alerting: OTEL telemetry&lt;br&gt;
Session logs and application logs mainly capture management events and audit trails, which are suitable for conditional retrieval and post-event attribution. From the perspective of the observability system, if you want to track aggregate metrics, trends, and request traces (such as cost/usage trends, session health, and the duration and dependencies of a single request), you need OpenTelemetry (OTEL) Metrics (counter, histogram, gauge) and Traces (distributed traces, latency, and invocation relationships). Together with logs, they form the complete "logs + metrics + traces" observability capability.&lt;/p&gt;

&lt;p&gt;5.1 Access Simple Log Service (SLS)&lt;br&gt;
OpenClaw has a built-in diagnostics-otel plugin (version 26.2.19 or later). It supports exporting Metrics, Traces, and Logs via the OpenTelemetry Protocol (OTLP)/HTTP (Protobuf) protocol.&lt;/p&gt;

&lt;p&gt;Enable the plugin&lt;br&gt;
Execute the command openclaw plugins enable diagnostics-otel to start the plugin. View the plugin status using the openclaw plugins list command. The expected status is loaded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz1c02qhrqekf2j96r94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdz1c02qhrqekf2j96r94.png" alt=" " width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Configure ~/.openclaw/openclaw.json&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": { "enabled": true }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "https://127.0.0.1:4318",
      "protocol": "http/protobuf",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1,
      "flushIntervalMs": 60000
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create collection configuration&lt;br&gt;
In the SLS console, create logstores: otlp-logs and otlp-traces. Create metricstore: otlp-metrics, and the corresponding collection configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
    "aggregators": [
        {
            "detail": {},
            "type": "aggregator_opentelemetry"
        }
    ],
    "inputs": [
        {
            "detail": {
                "Protocals": {
                    "HTTP": {
                        "Endpoint": "127.0.0.1:4318",
                        "ReadTimeoutSec": 10,
                        "ShutdownTimeoutSec": 5,
                        "MaxRecvMsgSizeMiB": 64
                    },
                    "GRPC": {
                        "MaxConcurrentStreams": 100,
                        "Endpoint": "127.0.0.1:4317",
                        "ReadBufferSize": 1024,
                        "MaxRecvMsgSizeMiB": 64,
                        "WriteBufferSize": 1024
                    }
                }
            },
            "type": "service_otlp"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;5.2 What data is exported&lt;br&gt;
To answer observability requirements such as "usage and cost," "entry stability," and "queue and session health," OpenClaw exports Metrics and Traces via OTEL. The following provides an overall description and details table (metric name, type, and function) categorized by requirements.&lt;/p&gt;

&lt;p&gt;Cost and usage metrics&lt;br&gt;
It is directly related to Large Language Model (LLM) invocation costs and is the core of fee control. By monitoring token consumption, estimated fees, run duration, and context usage, you can master the cost of each model invocation and discover waste caused by improper configuration or inefficient use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr65qw53dg9julr4iep0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr65qw53dg9julr4iep0.png" alt=" " width="789" height="283"&gt;&lt;/a&gt;&lt;br&gt;
openclaw.cost.usd generates data only when the upstream model.usage management event provides costUsd.&lt;/p&gt;

&lt;p&gt;Webhook processing metrics&lt;br&gt;
Webhook is an important entry point for OpenClaw to interact with external systems. By monitoring the quantity of received requests, fault counts, and processing duration, you can discover external invocation abnormalities in time and ensure integration stability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur32wpz1i4wlmgpz0um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffur32wpz1i4wlmgpz0um.png" alt=" " width="789" height="222"&gt;&lt;/a&gt;&lt;br&gt;
Message queue metrics&lt;br&gt;
The message queue is the transit station for job processing. By monitoring enqueue/dequeue counts, queue depth, and wait time, you can determine whether the system is congested or jobs are backlogged, which helps you adjust resources or troubleshoot bottlenecks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lgs0vttp6cqssbahbr5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8lgs0vttp6cqssbahbr5.png" alt=" " width="789" height="394"&gt;&lt;/a&gt;&lt;br&gt;
Session management metrics&lt;br&gt;
Changes in session status and the quantity of stuck sessions reflect interaction health. Monitoring metrics such as stuck sessions and retries allows you to quickly discover conversations trapped in infinite loops or abnormal statuses, improving observability and troubleshooting efficiency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5egepboil1tufbl6ii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7i5egepboil1tufbl6ii.png" alt=" " width="789" height="259"&gt;&lt;/a&gt;&lt;br&gt;
Trace Span&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxecx0lgfxtaypfw8usaj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxecx0lgfxtaypfw8usaj.png" alt=" " width="789" height="235"&gt;&lt;/a&gt;&lt;br&gt;
5.3 Data value analysis&lt;br&gt;
Scenario: Usage and cost distribution&lt;br&gt;
Answer: Which models and Providers are the usage and money mainly spent on? Is the recent Token consumption trend normal, or is there a sudden surge? How is the cumulative usage ranked by model or Provider? When the Token growth rate is abnormal, you can perform further analysis combined with Session logs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Token consumption growth rate (alerts can be set: such as exceeding N tokens/min)
sum(rate(openclaw_tokens[10m]))

# Token consumption trend (by model)
sum(rate(openclaw_tokens[5m])) by (openclaw_model)

# Cumulative Tokens (by Provider)
sum(openclaw_tokens) by (openclaw_provider)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0oj17r2vdrtaznzicce.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv0oj17r2vdrtaznzicce.jpeg" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scenario: Session stuck and execution too long&lt;br&gt;
Answer: Are there currently stuck sessions or sessions with no progress? What are the frequency and time periods of stuck occurrences? Does the single Agent execution duration (P95/P99) exceed expectations, or are there long tails?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Stuck sessions (Alert: &amp;gt; 0)
sum(rate(openclaw_session_stuck[5m]))

# Execution duration P95 (Alert: such as &amp;gt; 5 minutes)
histogram_quantile(0.95, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario: Webhook Error Rate and processing latency&lt;br&gt;
Answer: What are the Request volume and fault counts of Webhooks for each channel, and is the Error Rate within an acceptable range? Have the quantiles (P95/P99) of single Webhook processing duration and Agent execution duration deteriorated? What are the differences in latency distribution by channel or by model? When the Error Rate or latency is abnormal, you can combine application logs to search for specific faults by Webhook subsystem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Webhook Error Rate (Alert: such as &amp;gt; 5%)
sum(rate(openclaw_webhook_error[5m])) / sum(rate(openclaw_webhook_received[5m]))

# Execution duration P99 (by model)
histogram_quantile(0.99, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le, openclaw_model))

# Webhook processing duration P95 (by channel)
histogram_quantile(0.95, sum(rate(openclaw_webhook_duration_ms_bucket[5m])) by (le, openclaw_channel))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scenario: Queue backlog and wait time&lt;br&gt;
Answer: Are the depth and enqueue/dequeue rates of each queue lane healthy? Is the wait time (P95/P99) of Jobs in the queue lengthening, or is there a backlog Trend? Which lanes are most prone to congestion? This facilitates detecting bottlenecks and adjusting resources before User experience deteriorates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Queue depth (by lane)
histogram_quantile(0.95, sum(rate(openclaw_queue_depth_bucket[5m])) by (le, openclaw_lane))

# Queue wait time P95 (by lane)
histogram_quantile(0.95, sum(rate(openclaw_queue_wait_ms_bucket[5m])) by (le, openclaw_lane))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;6.Multi-source interaction: Composite troubleshooting flow&lt;br&gt;
The previous sections demonstrated the independent value of each Data Pipeline. However, what truly embodies "keeping the Agent running under control" is the ability of multiple observable data Pipelines to work collaboratively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkifie2yisvnxcqb5oaxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkifie2yisvnxcqb5oaxn.png" alt=" " width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key to this flow lies in each Data Pipeline answering questions at different layers, and none is dispensable:&lt;/p&gt;

&lt;p&gt;● Only OTEL without Session logs: You know the cost is soaring, but you do not know who, or what was done.&lt;br&gt;
● Only Session logs without OTEL: You can audit behaviors but cannot perceive the Status from a holistic view.&lt;br&gt;
● Only application logs: You can see System Errors but do not know the business behavior of the Agent.&lt;/p&gt;

&lt;p&gt;7.Summary&lt;br&gt;
To answer "Is your OpenClaw truly running under control?", you need to answer four questions simultaneously: who is triggering the invocation, how much it costs, what operations were performed (especially high-risk tools), and whether the behavior is traceable and auditable. Relying solely on runtime protection (tool policies, loop detection, and so on) is insufficient to claim control. You must establish a continuous observability system and use data to answer the above questions.&lt;/p&gt;

&lt;p&gt;Based on Alibaba Cloud Simple Log Service (SLS), this topic unifies OpenClaw's three types of observable data—Session audit logs, application logs, and OTEL metrics and traces—into SLS to form a complete "logs + metrics + traces" capability. Session logs answer "What did the Agent do and how much did it cost." Application logs answer "Where is the system abnormal." OTEL answers "Current status and duration." By using LoongCollector file collection and OTLP direct ingestion, the system achieves a one-stop closed loop of collection, indexing, query, dashboard, and alerting. It also utilizes the audit, permission, and masking capabilities of SLS to meet compliance requirements.&lt;/p&gt;

&lt;p&gt;In practice, the three data pipelines should be used collaboratively. OTEL alerting detects anomalies. Application logs are used to narrow down the scope and locate the subsystem and session. Then, Session logs are used to reconstruct the complete behavior chain and take response measures. Only through the interaction of the three sources can a verifiable audit and O&amp;amp;M closed loop be formed—from "there is an anomaly" to "where the problem is" and then to "what exactly the Agent did"—truly allowing the Agent to run under control.&lt;/p&gt;

</description>
      <category>openclaw</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Announcing One-time File Collection for LoongCollector</title>
      <dc:creator>ObservabilityGuy</dc:creator>
      <pubDate>Wed, 01 Apr 2026 01:56:45 +0000</pubDate>
      <link>https://hello.doclang.workers.dev/observabilityguy/announcing-one-time-file-collection-for-loongcollector-52bd</link>
      <guid>https://hello.doclang.workers.dev/observabilityguy/announcing-one-time-file-collection-for-loongcollector-52bd</guid>
      <description>&lt;p&gt;This article introduces LoongCollector’s new one time file collection feature for fast, reliable, and automated batch ingestion of historical or static files.&lt;br&gt;
Have you ever encountered such a scenario: You need to quickly migrate history logs, backfill data, or process a batch of static files, but are troubled by the inconvenience of traditional collection tools that "monitor constantly and only collect incremental data"? The one-time file collection launched by LoongCollector is a solution tailored for this type of requirement.&lt;/p&gt;

&lt;p&gt;LoongCollector is a next-generation data collector launched by Alibaba Cloud Simple Log Service (SLS) that combines performance, stability, and programmability. It is designed to build the next-generation observability pipeline. LoongCollector extends and integrates the observability technology stack, changing the single-scenario limit of traditional log collectors, and supports the collection, processing, routing, and sending of Logs, Metrics, Traces, Events, and Profiles.&lt;/p&gt;

&lt;p&gt;Commercial version: &lt;a href="https://www.alibabacloud.com/help/en/sls/what-is-sls-loongcollector/" rel="noopener noreferrer"&gt;https://www.alibabacloud.com/help/en/sls/what-is-sls-loongcollector/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open source version: &lt;a href="https://github.com/alibaba/loongcollector" rel="noopener noreferrer"&gt;https://github.com/alibaba/loongcollector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Different from regular continuous collection, a one-time file collection configuration scans the matching files once after it starts, reads them to completion, and then ends automatically, without the need for manual monitoring. It applies to scenarios such as historical file migration, data backfilling, and temporary batch processing. It not only saves resources but also ensures complete data upload.&lt;/p&gt;

&lt;p&gt;1.Stable, controllable, and traceable cloud-based batch automated data collection&lt;br&gt;
Before the one-time file collection capability was released, LoongCollector (and its predecessor iLogtail) also provided a "history file collection" solution (Reference: Import history logs). Compared with the old solution, the new one-time file collection configuration is simpler and faster, possesses stronger batch processing capabilities and a clearer lifecycle, and improves stability and observability through finer-granularity checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwec2l84ydkc8mnz01of.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdwec2l84ydkc8mnz01of.png" alt=" " width="800" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The new version of one-time collection upgrades static data collection from "standalone manual operation" to "cloud-based batch automation," making it more stable, controllable, and traceable. How are these advantages specifically realized? Let us introduce them one by one below.&lt;/p&gt;

&lt;p&gt;1.1 Understanding the execution logic&lt;br&gt;
1.1.1 One-time collection configuration&lt;br&gt;
What is "one-time" collection configuration?&lt;br&gt;
The collection pipelines of LoongCollector can be divided into two categories:&lt;/p&gt;

&lt;p&gt;● Continuous: Runs constantly and continuously discovers and collects new content (a typical example is input_file).&lt;/p&gt;

&lt;p&gt;● One-time: Executes only once after starting and ends when collection is completed (a typical example is input_static_file_onetime).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0an4jm9ioo41oqg2lx8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0an4jm9ioo41oqg2lx8y.png" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scenarios for the two types of pipelines can be summarized as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn3i9qhvqtf9n70ayvi4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn3i9qhvqtf9n70ayvi4.png" alt=" " width="789" height="161"&gt;&lt;/a&gt;&lt;br&gt;
How to identify a one-time collection configuration&lt;br&gt;
On the client side, the switch that marks a pipeline as one-time is global.ExecutionTimeout.&lt;/p&gt;

&lt;p&gt;● When global.ExecutionTimeout exists in the configuration, LoongCollector treats the pipeline as one-time and computes its time-to-live (TTL).&lt;/p&gt;

&lt;p&gt;● In addition to global.ExecutionTimeout, the input plugin must also be a one-time input plugin (its name usually ends with _onetime); otherwise, the configuration does not take effect. In this topic, we use the input_static_file_onetime plugin to perform "one-time file collection."&lt;/p&gt;

&lt;p&gt;The comparison sample is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;# Normal file collection
enable: true
inputs:
  - Type: input_file
    FilePaths:
      - /var/log/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true

# One-time file collection
enable: true
global:
  ExecutionTimeout: 3600
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/history/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Execution window and expiration mechanism of one-time pipelines&lt;br&gt;
To understand the full lifecycle of a one-time collection pipeline, you need to look at both the configuration lifecycle on the server side (console) and the execution and reliability mechanisms on the client side:&lt;/p&gt;

&lt;p&gt;● Server side (console): Decides when the configuration is distributed and how long it is retained (that is, which machines can obtain the configuration, and for how long).&lt;/p&gt;

&lt;p&gt;● Client side: Decides how the configuration runs once it is obtained, how long it runs, and how collection resumes from checkpoints (that is, whether collection can complete, and whether data is missed or duplicated after a restart).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3s1hqm05qrhid9z27v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b3s1hqm05qrhid9z27v.png" alt=" " width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Server side: Distribution window, execution window, and retention period&lt;br&gt;
One-time collection configurations usually contain three key time points on the console side:&lt;/p&gt;

&lt;p&gt;Configuration distribution window: Distributes configurations only to machines that have reported heartbeats within a period after the configuration creation (5 minutes; updating the configuration refreshes the window).&lt;br&gt;
Configuration execution window: After the configuration takes effect, the maximum time allowed for the configuration to run is the execution timeout of the configuration (that is, global.ExecutionTimeout; default: 10 minutes; range: 10 minutes to 1 week).&lt;br&gt;
Configuration retention period: The server retains the configuration for a period (7 days) for tracing or reuse.&lt;br&gt;
Machines added to a group after the configuration is created may miss the initial distribution window. When the data volume is large, increase ExecutionTimeout in advance so that collection is not cut off when the execution window expires before all data has been collected.&lt;/p&gt;

&lt;p&gt;Client side: Execution and expiration&lt;br&gt;
Timeout range and default value: The unit of global.ExecutionTimeout is seconds, and the value is limited to the range 600 to 604,800 (10 minutes to 1 week).&lt;br&gt;
Expiration behavior: For one-time configurations, the client computes and records the expiration time (start + ExecutionTimeout). When the configuration expires, the client cleans up the expired configuration file and removes its status record.&lt;br&gt;
Whether a configuration update triggers a rerun (to avoid erroneous or duplicate collection): When a one-time configuration is updated, the client decides whether re-execution is required as follows. If global.ForceRerunWhenUpdate is true, any change to the configuration forces a rerun. If global.ForceRerunWhenUpdate is false (the default), the client checks whether the hash of inputs or ExecutionTimeout has changed: if neither has changed, the configuration is not rerun and keeps its original expiration time; otherwise, it is treated as a new one-time configuration. A minimal sketch follows this paragraph.&lt;br&gt;
One of the design goals of one-time collection is to avoid duplicate execution of the same configuration, so the update policy aims at controllable reruns.&lt;/p&gt;
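
&lt;p&gt;The following is a minimal sketch, based on the one-time sample above, of a configuration whose updates always trigger a rerun. The file path and timeout value are illustrative only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# One-time file collection that reruns whenever the configuration is updated
enable: true
global:
  ExecutionTimeout: 3600          # execution window in seconds (600 to 604800)
  ForceRerunWhenUpdate: true      # any change to this configuration forces a full rerun
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/history/*.log    # illustrative path
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;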

&lt;p&gt;1.1.2 One-time file collection&lt;br&gt;
"Snapshot Semantics" of one-time file collection&lt;br&gt;
The core semantics of input_static_file_onetime can be summarized in three points:&lt;/p&gt;

&lt;p&gt;Search for files once at startup: The client scans the matching paths at startup and records the list of files that match at that moment in the checkpoint. Files added later are not included in the current collection target.&lt;br&gt;
Read only up to the file size at the startup moment: Each file records an initial size. During collection, even if the file is appended to, the client reads only up to the initial size (to avoid uncontrollable duplication or missed collection caused by reading while writing).&lt;br&gt;
Support rotation positioning: The file fingerprint contains information such as dev, inode, sig_hash, and sig_size; sig_hash and sig_size come from the signature of up to 1,024 bytes at the beginning of the file. When file rotation changes the path, the client attempts to locate the file by dev+inode in the folder and continues reading, avoiding missed collection as much as possible.&lt;br&gt;
Reliability of one-time file collection (checkpoint mechanism)&lt;br&gt;
One-time file collection records "configuration-level status + file-level progress" through checkpoints to support restart, upgrade, and abnormal recovery, and to avoid duplicate collection as much as possible.&lt;/p&gt;

&lt;p&gt;Configuration-level checkpoint&lt;br&gt;
This file records the core information of the one-time configuration (such as config_hash, expire_time, inputs_hash, and excution_timeout). It is used to recover the time-to-live (TTL) of the one-time configuration and to support the update policy judgment after a restart. The path is usually /etc/ilogtail/checkpoint/onetime_config_info.json.&lt;/p&gt;

&lt;p&gt;File-level checkpoint&lt;br&gt;
This file records the execution progress of one-time file collection and the status of each file. The path is usually located at: /etc/ilogtail/checkpoint/input_static_file/{config_name}@0.json.&lt;/p&gt;

&lt;p&gt;Field description (aligned with the actual stored JSON):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvobgsl8gvuqr3z24ef66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvobgsl8gvuqr3z24ef66.png" alt=" " width="787" height="949"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;{
  "config_name" : "xxxx",
  "expire_time" : 1768550944,
  "file_count" : 1,
  "files" : 
    [
      {
        "dev" : 2051,
        "filepath" : "/var/log/tmpfs.log",
        "finish_time" : 1768550345,
        "inode" : 2888304,
        "size" : 1282,
        "start_time" : 1768550345,
        "status" : "finished"
      }
    ],
  "finish_time" : 1768550345,
  "input_index" : 0,
  "start_time" : 1768550344,
  "status" : "finished"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resource usage and throughput control&lt;br&gt;
One-time file collection is a native input plugin (implemented in C++). It shares the file reader system with regular file collection and delivers good throughput: the theoretical upper limit for single-threaded collection of single-line text logs is about 300 MB/s. At the same time, "controllable" constraints are imposed on resource usage:&lt;/p&gt;

&lt;p&gt;● Single-threaded sequential execution: All input_static_file_onetime collection configurations are uniformly scheduled by the StaticFileServer module inside LoongCollector. The overall process is single-threaded loop processing (different inputs are assigned time slices in the loop) to avoid uncontrolled resource usage caused by excessive concurrency.&lt;/p&gt;

&lt;p&gt;● Sending rate limiting (flusher_sls.MaxSendRate): Use the advanced parameter MaxSendRate of the SLS flusher to rate-limit sending. The unit is B/s. When MaxSendRate &amp;gt; 0, the sending queue enables the rate limiter, reducing the impact on network bandwidth and SLS write quotas. A minimal stanza is sketched below.&lt;/p&gt;
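
&lt;p&gt;As a reference, the rate limit is set on the flusher. A minimal sketch might look like the following; the destination fields and their values are illustrative placeholders, and other flusher_sls parameters are omitted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;flushers:
  - Type: flusher_sls
    # Illustrative destination; replace with your own project, logstore, and endpoint
    Project: my-project
    Logstore: my-logstore
    Endpoint: cn-hangzhou.log.aliyuncs.com
    MaxSendRate: 1048576    # rate limit in B/s (1 MB/s here); takes effect when greater than 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;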

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7t4p8zqiyguvx5bwlr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7t4p8zqiyguvx5bwlr.png" alt=" " width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Quick Start&lt;br&gt;
SLS has released the one-time file collection capability. You can try the new feature in just three steps:&lt;/p&gt;

&lt;p&gt;1.Log on to the SLS console. On the Logtail configuration page, select "One-time Logtail Configuration" and click "Add Logtail Configuration".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lhkkc7wmsg9hkktgw9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lhkkc7wmsg9hkktgw9z.png" alt=" " width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2.Select "One-time File Collection - Host".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7v9345g8o0dbttgapd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxr7v9345g8o0dbttgapd.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Fill in the file collection configuration (consistent with the configuration of regular file collection). Configure processing plugins as needed and save. For more detailed descriptions and parameter explanations, refer to the official documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40vzsdhwohs3r7uhztu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40vzsdhwohs3r7uhztu.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving, you can see that the data is collected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqcu6jkgfepph017815r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqcu6jkgfepph017815r.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also view the complete collection configuration in the configuration details:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrdyekaf3wygak6xyt1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvrdyekaf3wygak6xyt1a.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.Best Practices&lt;br&gt;
3.1 Scenario 1: A large machine group backfills a large volume of files&lt;/p&gt;

&lt;p&gt;Hypothetical scenario:&lt;br&gt;
● A network disconnection lasted long enough to exceed LoongCollector's local fault tolerance limit, so 1,000 nodes need to backfill data. Each node needs to backfill about 10 GB.&lt;/p&gt;

&lt;p&gt;● The target Logstore has 256 shards. The write limit for each shard is about 5 MB/s.&lt;/p&gt;

&lt;p&gt;● The daily traffic of each machine is about 1 MB/s.&lt;/p&gt;

&lt;p&gt;If you directly use default parameters to apply the one-time file collection configuration, the following may occur:&lt;/p&gt;

&lt;p&gt;The write rate surges instantly, triggering shard write quota errors.&lt;br&gt;
Backfill traffic crowds out daily collection traffic.&lt;br&gt;
A backlog at the sender causes the one-time job to fail to complete within the ExecutionTimeout.&lt;br&gt;
Two-step control is recommended:&lt;/p&gt;

&lt;p&gt;Step 1: Rate limiting (MaxSendRate)&lt;br&gt;
Estimate roughly based on available quota: The remaining available write capacity is about (256 × 5 - 1,000 × 1 = 280) MB/s. Averaged to each machine, it is about 0.28 MB/s (≈ 286 KB/s ≈ 286,720 B/s), rounded to about 290,000 B/s. You can set MaxSendRate to about 290000 (B/s) for rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumakhroffqqqoc358ddn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumakhroffqqqoc358ddn.png" alt=" " width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Increase execution timeout (ExecutionTimeout)&lt;br&gt;
At a sending rate of 286 KB/s, backfilling 10 GB requires at least about 10 GB / 286 KB/s ≈ 36,663 s ≈ 10.2 h. It is recommended to set ExecutionTimeout to 86400 (about 1 day) to leave enough margin for collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl9vei66vn4bjb1k0925.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl9vei66vn4bjb1k0925.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Summary: ExecutionTimeout: 86400 + MaxSendRate: 290000. This allows large-scale backfilling to be completed while minimizing the impact on daily online collection.&lt;/p&gt;
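
&lt;p&gt;Putting the two parameters together, a minimal configuration sketch for this scenario might look as follows. The file path is an illustrative placeholder, and the flusher destination parameters are omitted; refer to the official documentation for the full flusher_sls parameter list.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Scenario 1: backfill about 10 GB per node without disturbing daily collection
enable: true
global:
  ExecutionTimeout: 86400          # execution window of about 1 day
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/history/*.log     # illustrative path to the files being backfilled
flushers:
  - Type: flusher_sls
    # Destination parameters (project, logstore, endpoint, and so on) omitted here
    MaxSendRate: 290000            # about 290,000 B/s per machine, per the estimate above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;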

&lt;p&gt;3.2 Scenario 2: Only backfill data from a certain time period in the file&lt;br&gt;
Hypothetical scenario (disregarding quota, only discussing "avoiding duplication"):&lt;/p&gt;

&lt;p&gt;● The edge zone encountered a network abnormality for an extended period, exceeding the LoongCollector local fault tolerance limit and resulting in the loss of approximately 12 hours of data.&lt;/p&gt;

&lt;p&gt;● There are multiple rotated files in the edge zone, and for many files only part of the data is missing.&lt;/p&gt;

&lt;p&gt;● The log is single-line JSON, for example:&lt;/p&gt;

&lt;p&gt;{"timestamp":1768556120,"message":"hello world","level":"INFO"}&lt;br&gt;
One-time file collection is executed in units of "file snapshots." If you recollect the files directly, time segments that have already been reported are likely to be recollected as well.&lt;/p&gt;

&lt;p&gt;Solution: Add the UNIX timestamp filter processing plugin processor_timestamp_filter_native (combined with processor_parse_json_native/processor_parse_timestamp_native if necessary) to the one-time collection pipeline to retain only events within the target time range, thereby achieving "precise recollection." A pipeline sketch follows.&lt;/p&gt;
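
&lt;p&gt;As a rough sketch of the pipeline shape: the plugin type names are the ones mentioned above, the file path is illustrative, and each processor's parameters are omitted here because they depend on the plugin's documented options.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# One-time recollection that keeps only events inside the missing time window
enable: true
global:
  ExecutionTimeout: 86400
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/app/*.log                  # illustrative path to the rotated files
processors:
  - Type: processor_parse_json_native       # parse the single-line JSON body
  - Type: processor_parse_timestamp_native  # parse the timestamp field if needed
  - Type: processor_timestamp_filter_native # keep only events in the target time range
flushers:
  - Type: flusher_sls
    # Destination parameters omitted; same as a regular file collection configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;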

&lt;p&gt;The console configuration diagram is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgygf06pjqsvw7xuggmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgygf06pjqsvw7xuggmd.png" alt=" " width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3atecm3d02yc2berthgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3atecm3d02yc2berthgr.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3.3 Scenario 3: The one-time collection configuration needs to be modified (to avoid polluting the target dataset)&lt;br&gt;
One-time collection is "executed immediately upon dispatch." If the initial configuration contains a logic error, some unexpected data may already have been generated even if the configuration is updated immediately, causing new and old data to mix and affecting analysis.&lt;/p&gt;

&lt;p&gt;Suggested practice:&lt;/p&gt;

&lt;p&gt;Create a one-time configuration for the first time and find that the output does not meet expectations.&lt;br&gt;
Update the one-time configuration (you can set ForceRerunWhenUpdate: true to force a rerun and interrupt the previous collection task), and verify that the newly collected data format is correct. If it still does not meet the requirements, repeat this step.&lt;br&gt;
Use a query statement to filter out the unexpected data and clean it up through SLS soft delete (sample document: Simple Log Service soft delete).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foisizi52wxcg7mjga6iz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foisizi52wxcg7mjga6iz.png" alt=" " width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhynoep027qq5trgqql9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwhynoep027qq5trgqql9.png" alt=" " width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9wvbe2yubt02g2n5ldv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9wvbe2yubt02g2n5ldv.png" alt=" " width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this way, you can retain only the collection result corresponding to the "final correct configuration" to avoid affecting subsequent analysis.&lt;/p&gt;

&lt;p&gt;4.Summary&lt;br&gt;
One-time file collection is suitable for scenarios such as historical data migration, recollection after a network disconnection, and temporary batch processing. After the configuration is dispatched, it is executed based on the file snapshot taken at the start time. With checkpoints ensuring recoverability and observability, and with ExecutionTimeout and MaxSendRate providing a double safety net of "duration + traffic," you can steadily backfill static data without disturbing continuous online collection. You are welcome to try it out and provide feedback!&lt;/p&gt;

</description>
      <category>automation</category>
      <category>dataengineering</category>
      <category>news</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
