NVIDIA Deep Learning Performance



Abstract

This guide provides background on the structure of a GPU, how operations are executed, and common limitations with deep learning operations.

It is helpful to understand the basics of GPU execution when reasoning about how efficiently particular layers or neural networks are utilizing a given GPU.

This guide describes the GPU architecture at a high level, how a GPU executes work, what typically limits performance (math, memory, or latency), and how common deep learning operations map onto these limits.

The GPU is a highly parallel processor architecture, composed of processing elements and a memory hierarchy. At a high level, NVIDIA® GPUs consist of a number of Streaming Multiprocessors (SMs), on-chip L2 cache, and high-bandwidth DRAM. Arithmetic and other instructions are executed by the SMs; data and code are accessed from DRAM via the L2 cache. As an example, an NVIDIA A100 GPU contains 108 SMs, a 40 MB L2 cache, and up to 2039 GB/s bandwidth from 80 GB of HBM2 memory.

Figure 1. Simplified view of the GPU architecture

simple-gpu-arch.svg

Each SM has its own instruction schedulers and various instruction execution pipelines. Multiply-add is the most frequent operation in modern neural networks, acting as a building block for fully-connected and convolutional layers, both of which can be viewed as a collection of vector dot-products. The following table shows a single SM’s multiply-add operations per clock for various data types on NVIDIA’s recent GPU architectures. Each multiply-add comprises two operations; thus, one would multiply the throughput in the table by 2 to get the FLOP count per clock. To get the FLOPS rate for a GPU, one would then multiply this by the number of SMs and the SM clock rate. For example, an A100 GPU with 108 SMs and a 1.41 GHz clock rate has peak dense throughputs of 156 TF32 TFLOPS and 312 FP16 TFLOPS (throughputs achieved by applications depend on a number of factors discussed throughout this document).

Figure 2. Multiply-add operations per clock per SM

multi-add-op.svg
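
As a quick sanity check of this arithmetic, the short Python sketch below recomputes the quoted A100 peak rates from the SM count, SM clock, and per-SM multiply-add rates. The per-clock multiply-add counts used here (512 for TF32 and 1024 for FP16 Tensor Core operations per SM) are assumptions chosen to be consistent with the 156 and 312 TFLOPS figures above, not values copied from the figure.

```python
# Back-of-envelope peak throughput:
#   FLOPS = SMs * clock * (multiply-adds per clock per SM) * 2
# The per-SM multiply-add rates passed in below are assumptions consistent
# with the quoted A100 dense peaks, not values taken from the figure.
def peak_tflops(num_sms, clock_ghz, macs_per_clock_per_sm):
    flops_per_clock_per_sm = macs_per_clock_per_sm * 2  # each multiply-add = 2 FLOPs
    return num_sms * clock_ghz * 1e9 * flops_per_clock_per_sm / 1e12

A100_SMS, A100_CLOCK_GHZ = 108, 1.41
print(peak_tflops(A100_SMS, A100_CLOCK_GHZ, 512))    # ~156 TFLOPS (TF32, dense)
print(peak_tflops(A100_SMS, A100_CLOCK_GHZ, 1024))   # ~312 TFLOPS (FP16, dense)
```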

As shown in Figure 2, FP16 operations can be executed in either Tensor Cores or NVIDIA CUDA® cores. Furthermore, the NVIDIA Turing™ architecture can execute INT8 operations in either Tensor Cores or CUDA cores. Tensor Cores were introduced in the NVIDIA Volta™ GPU architecture to accelerate matrix multiply and accumulate operations for machine learning and scientific applications. These instructions operate on small matrix blocks (for example, 4x4 blocks). Note that Tensor Cores can compute and accumulate products in higher precision than the inputs. For example, during training with FP16 inputs, Tensor Cores can compute products without loss of precision and accumulate in FP32. When math operations cannot be formulated in terms of matrix blocks they are executed in other CUDA cores. For example, the element-wise addition of two half-precision tensors would be performed by CUDA cores, rather than Tensor Cores.
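
The benefit of accumulating in FP32 can be seen with a small NumPy sketch. This runs on the CPU and mimics only the accumulation precision, not Tensor Core execution; it simply contrasts summing FP16 products in FP16 versus in FP32.

```python
import numpy as np

# Summing many small FP16 products: once the running sum grows large, FP16
# accumulation starts rounding away the small addends, while accumulating the
# same FP16 products in FP32 stays exact for this example.
products = np.full(4096, 0.25, dtype=np.float16)  # stand-ins for FP16 multiply results

fp16_sum = np.float16(0.0)
for p in products:
    fp16_sum = np.float16(fp16_sum + p)   # accumulate in FP16

fp32_sum = np.float32(0.0)
for p in products:
    fp32_sum += np.float32(p)             # accumulate in FP32

print(fp16_sum, fp32_sum)  # FP16 sum falls well short of the exact value 1024.0
```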

To utilize their parallel resources, GPUs execute many threads concurrently.

There are two concepts critical to understanding how thread count relates to GPU performance:

  1. GPUs execute functions using a 2-level hierarchy of threads. A given function’s threads are grouped into equally-sized thread blocks, and a set of thread blocks are launched to execute the function.
  2. GPUs hide dependent instruction latency by switching to the execution of other threads. Thus, the number of threads needed to effectively utilize a GPU is much higher than the number of cores or instruction pipelines.

The 2-level thread hierarchy is a result of GPUs having many SMs, each of which in turn has pipelines for executing many threads and enables its threads to communicate via shared memory and synchronization. At runtime, a thread block is placed on an SM for execution, enabling all threads in a thread block to communicate and synchronize efficiently. Launching a function with a single thread block would give work to only a single SM; therefore, to fully utilize a GPU with multiple SMs, one needs to launch many thread blocks. Since an SM can execute multiple thread blocks concurrently, one typically wants the number of thread blocks to be several times higher than the number of SMs. The reason for this is to minimize the “tail” effect, where at the end of a function’s execution only a few active thread blocks remain, underutilizing the GPU for that period of time, as illustrated in Figure 3.

Figure 3. Utilization of an 8-SM GPU when 12 thread blocks with an occupancy of 1 block/SM at a time are launched for execution. Here, the blocks execute in 2 waves, the first wave utilizes 100% of the GPU, while the 2nd wave utilizes only 50%.

utilize-8sm-gpu.svg

We use the term wave to refer to a set of thread blocks that run concurrently. It is most efficient to launch functions that execute in several waves of thread blocks - a smaller percentage of time is spent in the tail wave, minimizing the tail effect and thus the need to do anything about it. For the higher-end GPUs, typically only launches with fewer than 300 thread blocks should be examined for tail effects.
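
To make the wave arithmetic concrete, here is a minimal Python sketch of this model (an illustration, not an NVIDIA tool). With the Figure 3 scenario of 12 thread blocks on an 8-SM GPU at an occupancy of 1 block per SM, it reports 2 waves with the tail wave using only 50% of the SMs.

```python
import math

# Simple wave model: blocks execute in waves of (num_sms * blocks_per_sm)
# concurrently running thread blocks; the last (tail) wave may be only
# partially full, underutilizing the GPU for that portion of the runtime.
def wave_stats(num_blocks, num_sms, blocks_per_sm=1):
    blocks_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / blocks_per_wave)
    tail_blocks = num_blocks - (waves - 1) * blocks_per_wave
    tail_utilization = tail_blocks / blocks_per_wave
    avg_utilization = num_blocks / (waves * blocks_per_wave)  # assumes equal-time waves
    return waves, tail_utilization, avg_utilization

print(wave_stats(12, 8))    # (2, 0.5, 0.75): 2 waves, tail wave 50% utilized
print(wave_stats(1030, 8))  # many waves: a partially-full tail barely affects utilization
```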

Performance of a function on a given processor is limited by one of three factors: memory bandwidth, math bandwidth, or latency.

Consider a simplified model where a function reads its input from memory, performs math operations, then writes its output to memory. Say $T_{mem}$ is the time spent accessing memory and $T_{math}$ is the time spent performing math operations. If we further assume that the memory and math portions of different threads can be overlapped, the total time for the function is $\max(T_{mem}, T_{math})$. The longer of the two times determines what limits performance: if the math time is longer, we say the function is math limited; if the memory time is longer, it is memory limited.

How much time is spent in memory or math operations depends on both the algorithm and its implementation, as well as the processor’s bandwidths. Memory time is equal to the number of bytes accessed in memory divided by the processor’s memory bandwidth. Math time is equal to the number of operations divided by the processor’s math bandwidth. Thus, on a given processor a given algorithm is math limited if

$T_{math} > T_{mem}$, which can be expressed as $\#ops / BW_{math} > \#bytes / BW_{mem}$.

By simple algebra, the inequality can be rearranged to

$\#ops / \#bytes > BW_{math} / BW_{mem}$

The left-hand side, the ratio of algorithm implementation operations and the number of bytes accessed, is known as the algorithm’s arithmetic intensity. The right-hand side, the ratio of a processor’s math and memory bandwidths, is sometimes called ops:byte ratio. Thus, an algorithm is math limited on a given processor if the algorithm’s arithmetic intensity is higher than the processor’s ops:byte ratio. Conversely, an algorithm is memory limited if its arithmetic intensity is lower than the processor’s ops:byte ratio.
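
Expressed in code, the classification is a one-line comparison; the sketch below is a minimal illustration of the inequality above, with all operation counts and bandwidths supplied by the caller.

```python
# A function is math limited when its arithmetic intensity (ops/byte) exceeds
# the processor's ops:byte ratio; otherwise it is memory limited (ignoring latency).
def is_math_limited(num_ops, num_bytes, math_bw_ops_per_s, mem_bw_bytes_per_s):
    arithmetic_intensity = num_ops / num_bytes
    ops_per_byte_ratio = math_bw_ops_per_s / mem_bw_bytes_per_s
    return arithmetic_intensity > ops_per_byte_ratio
```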

Let’s consider some concrete examples from deep neural networks, listed in Table 1 below. For these examples, we will compare the algorithm’s arithmetic intensity to the ops:byte ratio on an NVIDIA Volta V100 GPU. The V100 has a peak math rate of 125 FP16 Tensor TFLOPS, an off-chip memory bandwidth of approximately 900 GB/s, and an on-chip L2 bandwidth of 3.1 TB/s, giving it an ops:byte ratio between 40 and 139, depending on whether an operation’s data is sourced from on-chip or off-chip memory.

Table 1. Examples of neural network operations with their arithmetic intensities. Limiters assume FP16 data and an NVIDIA V100 GPU.

| Operation | Arithmetic Intensity | Usually limited by... |
| --- | --- | --- |
| Linear layer (4096 outputs, 1024 inputs, batch size 512) | 315 FLOPS/B | arithmetic |
| Linear layer (4096 outputs, 1024 inputs, batch size 1) | 1 FLOPS/B | memory |
| Max pooling with 3x3 window and unit stride | 2.25 FLOPS/B | memory |
| ReLU activation | 0.25 FLOPS/B | memory |
| Layer normalization | < 10 FLOPS/B | memory |

As the table shows, many common operations have low arithmetic intensities - sometimes only performing a single operation per two-byte element read from and written to memory. Note that this type of analysis is a simplification, as we’re counting only the algorithmic operations used. In practice, functions also contain instructions for operations not explicitly expressed in the algorithm, such as memory access instructions, address calculation instructions, control flow instructions, and so on.
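
The first two rows of Table 1 can be reproduced with a short calculation. The sketch below uses the same simplified model: FP16 data (2 bytes per element), each tensor read or written exactly once, and 2 × outputs × inputs × batch operations for a linear layer. It recovers the ~315 and ~1 FLOPS/B values and compares them against the V100 off-chip ops:byte ratio of roughly 139.

```python
# Arithmetic intensity of a fully-connected (linear) layer, simplified model:
#   FLOPs = 2 * outputs * inputs * batch                    (multiply-add = 2 ops)
#   bytes = bytes_per_elem * (input acts + weights + output acts), each touched once
def linear_layer_arithmetic_intensity(outputs, inputs, batch, bytes_per_elem=2):
    flops = 2 * outputs * inputs * batch
    bytes_moved = bytes_per_elem * (inputs * batch + outputs * inputs + outputs * batch)
    return flops / bytes_moved

v100_ops_per_byte = 125e12 / 900e9  # ~139: FP16 Tensor Core peak vs. off-chip bandwidth

for batch in (512, 1):
    ai = linear_layer_arithmetic_intensity(4096, 1024, batch)
    limiter = "math" if ai > v100_ops_per_byte else "memory"
    print(f"batch={batch}: {ai:.1f} FLOPS/B -> likely {limiter} limited")
# batch=512: 315.1 FLOPS/B -> likely math limited
# batch=1:   1.0 FLOPS/B   -> likely memory limited
```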

The arithmetic intensity and ops:byte ratio analysis assumes that a workload is sufficiently large to saturate a given processor’s math and memory pipelines. However, if the workload is not large enough, or does not have sufficient parallelism, the processor will be under-utilized and performance will be limited by latency. For example, consider the launch of a single thread that will access 16 bytes and perform 16000 math operations. While the arithmetic intensity is 1000 FLOPS/B and the execution should be math-limited on a V100 GPU, creating only a single thread grossly under-utilizes the GPU, leaving nearly all of its math pipelines and execution resources idle. Furthermore, the arithmetic intensity calculation assumes that inputs and outputs are accessed from memory exactly once. It is not unusual for algorithm implementations to read input elements multiple times, which would effectively reduce arithmetic intensity. Thus, the arithmetic intensity is a first-order approximation; profiler information should be used if more accurate analysis is needed.

While modern neural networks are built from a variety of layers, their operations fall into three main categories according to the nature of computation.

5.1. Elementwise Operations

Elementwise operations may be unary or binary operations; the key is that layers in this category perform mathematical operations on each element independently of all other elements in the tensor.

For example, a ReLU layer returns max(0, x) for each x in the input tensor. Similarly, element-wise addition of two tensors computes each output sum value independently of other sums. Layers in this category include most non-linearities (sigmoid, tanh, etc.), scale, bias, add, and others. These layers tend to be memory-limited, as they perform few operations per byte accessed. Further details on activations, in particular, can be found within the Activations section in the Optimizing Memory-Bound Layers User's Guide.
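
As a quick check of the ReLU row in Table 1, its arithmetic intensity under the same simplified model is one operation per element against a 2-byte read and a 2-byte write in FP16.

```python
# ReLU in FP16: 1 op per element, 2 bytes read + 2 bytes written per element.
ops_per_element = 1
bytes_per_element = 2 + 2
print(ops_per_element / bytes_per_element)  # 0.25 FLOPS/B -> far below any GPU's
                                            # ops:byte ratio, hence memory limited
```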

5.2. Reduction Operations

Reduction operations produce values computed over a range of input tensor values.

For example, pooling layers compute values over some neighborhoods in the input tensor. Batch normalization computes the mean and standard deviation over a tensor before using them in operations for each output element. In addition to pooling and normalization layers, SoftMax also falls into the reduction category. Typical reduction operations have a low arithmetic intensity and thus are memory limited. Further details on pooling layers can be found within Pooling.

5.3. Dot-Product Operations

Operations in this category can be expressed as dot-products of elements from two tensors, usually a weight (learned parameter) tensor and an activation tensor.

These include fully-connected layers, occurring on their own and as building blocks of recurrent and attention cells. Fully-connected layers are naturally expressed as matrix-vector and matrix-matrix multiplies. Convolutions can also be expressed as collections of dot-products - one vector is the set of parameters for a given filter, the other is an “unrolled” activation region to which that filter is being applied. Since filters are applied in multiple locations, convolutions too can be viewed as matrix-vector or matrix-matrix multiply operations (refer to Convolution Algorithms).

Operations in the dot-product category can be math-limited if the corresponding matrices are large enough. However, for smaller sizes, these operations end up being memory-limited. For example, a fully-connected layer applied to a single vector (a tensor for a mini-batch of size 1) is memory limited. Matrix-matrix multiplication performance is discussed in more detail in the NVIDIA Matrix Multiplication Background User's Guide. Information on modeling each type of layer as a matrix multiplication can be found in the corresponding layer-specific guides.

To roughly approximate what limits the performance of a particular function on a given GPU, one can take the following steps:

  • Look up the number of SMs on the GPU, and determine the GPU's ops:byte ratio.
  • Compute the arithmetic intensity for the algorithm.
  • Determine if there is sufficient parallelism to saturate the GPU by estimating the number and size of thread blocks. If the number of thread blocks is at least roughly 4x higher than the number of SMs, and thread blocks consist of a few hundred threads each, then there is likely sufficient parallelism.
  • The most likely performance limiter is:
    • Latency if there is not sufficient parallelism
    • Math if there is sufficient parallelism and algorithm arithmetic intensity is higher than the GPU ops:byte ratio.
    • Memory if there is sufficient parallelism and algorithm arithmetic intensity is lower than the GPU ops:byte ratio.
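
These steps can be collected into a rough triage function. The sketch below is a first-order heuristic, not a profiler; the 4x-SM-count thread-block threshold and the "few hundred threads per block" guideline are the rules of thumb stated above, approximated here with explicit cutoffs.

```python
# First-order performance-limiter triage, following the steps above.
def likely_limiter(num_thread_blocks, threads_per_block, num_sms,
                   num_ops, num_bytes, math_bw_ops_per_s, mem_bw_bytes_per_s):
    # "Sufficient parallelism": ~4x more blocks than SMs and a few hundred
    # threads per block (128 used here as a lenient cutoff).
    enough_parallelism = num_thread_blocks >= 4 * num_sms and threads_per_block >= 128
    if not enough_parallelism:
        return "latency"
    arithmetic_intensity = num_ops / num_bytes
    ops_per_byte_ratio = math_bw_ops_per_s / mem_bw_bytes_per_s
    return "math" if arithmetic_intensity > ops_per_byte_ratio else "memory"

# Example: the batch-512 FP16 linear layer from Table 1 on a V100-class GPU
# (80 SMs assumed), launched with plenty of thread blocks.
print(likely_limiter(num_thread_blocks=1024, threads_per_block=256, num_sms=80,
                     num_ops=2 * 4096 * 1024 * 512, num_bytes=13_631_488,
                     math_bw_ops_per_s=125e12, mem_bw_bytes_per_s=900e9))  # "math"
```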

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Google

Android, Android TV, Google Play and the Google Play logo are trademarks of Google, Inc.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, Merlin, RAPIDS, Triton Inference Server, Turing and Volta are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

© 2020-2023 NVIDIA Corporation & affiliates. All rights reserved. Last updated on Feb 1, 2023.