Research article on the topic 'Reduction of clock skew with selected nets in high performance CMOS VLSI' (field: Computer and information sciences)

CC BY

Abstract of the research article in computer and information sciences; author: Sang Yong Han

Clock skew minimization in high-speed synchronous VLSI systems is extremely important, and significant research interest in clock distribution networks exists within both the industrial and academic communities. Large die sizes, deteriorating interconnect performance in sub-micron processes, and high clock frequencies make clock design and implementation a major challenge. In this paper, we describe an implementation of a multi-level balanced clock-tree distribution scheme that improves performance considerably. The focus of this clock-tree distribution scheme is to balance the loading and allocate interconnect delay optimally, taking advantage of the "self-adjust" aspect of the clock tree. Optimal delay allocation among the clock nets does not need to balance all the nets, which would use more wire and may cause problems in a dense chip.


Text of the research paper 'Reduction of clock skew with selected nets in high performance CMOS VLSI'

O(k!), where k is the number of vertices in the largest automorphic group. If the graph has no automorphic groups, then the time complexity of the described algorithm is O(n^3) [for many practical graphs, O(n^2)]. The force placement algorithm finds the same vertex coordinates for each automorphic group of vertices. In this case, for each such group we use the algorithm for homogeneous graphs (subgraphs).

One additional advantage of the algorithm is that it can show the "places" where differences exist when graphs do not match. The algorithm was tested on large sets of randomly generated graphs. The algorithm was found to be much more efficient than previous methods, based only on partitioning of vertices, without compromising its accuracy.

References

1. Physical Design Automation of VLSI Systems, edited by B.T. PREAS and M.J. LORENZETTI, The Benjamin/Cummings Publishing Company, Inc., 1988.

2. TETELBAUM, A.Y., KUREICHIK, V.M., "Graph Isomorphism Algorithm for Regular VLSI Structures", 28th Annual Conf. on ISS, Princeton, USA, March 1994.

3. READ, R.C. and D.G. CORNEIL, "The graph isomorphism disease," Journal of Graph Theory, 1, 1977, pp. 339-363.

4. KUBO, N., SHIRAKAWA, I. and OZAKI, H., "A fast algorithm for testing graph isomorphism", in Proc. Intl. Symposium Circuits and Systems, pp. 641-644, 1979.

5. OHLRICH, M., C. EBELING, E. GINTING, and L. SATHER, "SubGemini: Identifying Subcircuits using Isomorphism Algorithm", in Proc. 30th ACM/IEEE Design Automation Conf., 1993, pp. 31-37.

6. TETELBAUM, A.Y., "Optimal Placement of Components and Public Pins", Electronic Design Automation, Kiev, No. II, 1975.

7. KUREICHIK, V.M., and BICKART, T.A., "An isomorphism test for homogeneous graphs", Proc. 1979 Conference on Information Science and Systems, The Johns Hopkins University, Baltimore, MD, March 1979.

UDC 658.512.2

SangYong Han

Reduction of clock skew with selected nets in high performance CMOS VLSI

Clock skew minimization in high-speed synchronous VLSI systems is extremely important, and significant research interest in clock distribution networks exists within both the industrial and academic communities. Large die sizes, deteriorating interconnect performance in sub-micron processes, and high clock frequencies make clock design and implementation a major challenge. In this paper, we describe an implementation of a multi-level balanced clock-tree distribution scheme that improves performance considerably. The focus of this clock-tree distribution scheme is to balance the loading and allocate interconnect delay optimally, taking advantage of the "self-adjust" aspect of the clock tree. Optimal delay allocation among the clock nets does not need to balance all the nets, which would use more wire and may cause problems in a dense chip.

1. Introduction

Circuit speed is a major concern in the design of high-performance VLSI systems. There are two factors determining the limitations on circuit speed in a synchronous VLSI design: the delay of the slowest path through combinational logic and the clock skew among the synchronous components. Ideally, the clock signal should appear to all processing elements at the same time. However, since the clock signal must be propagated via a distribution network, it may arrive at the inputs of the processing elements at different times. This difference in arrival times is defined as clock skew. Minimizing the clock skew has many benefits, such as increasing system performance, reducing layout iteration time due to timing violations, and improving system quality and reliability.
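The definition of clock skew above can be made concrete with a minimal Python sketch. The latch names and arrival times are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def clock_skew(arrival_times_ns):
    """Skew = latest arrival time minus earliest arrival time."""
    times = list(arrival_times_ns.values())
    return max(times) - min(times)


# Hypothetical arrival times (ns) of the clock edge at four latches.
arrivals = {"latch_a": 1.50, "latch_b": 1.25, "latch_c": 1.75, "latch_d": 1.50}

skew = clock_skew(arrivals)  # 1.75 - 1.25 = 0.5 ns
```

A perfectly balanced distribution network would make all four values equal and the skew zero.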

The problem of eliminating or minimizing clock skew has become an ever-increasing challenge due to rapid improvements in VLSI technology. Higher clock speeds due to shrinking circuit geometries have reduced the permissible delay and skew in delivering a clock signal. Meanwhile, larger die sizes have increased the minimum clock delays due to larger clock distribution networks, thus making the problem more difficult. Significant research interest in clock-tree distribution exists within both the industrial and academic communities, and many algorithms and heuristics have been proposed. The H-tree method has been used only for regular and systolic arrays [1]. H-tree structures reduce clock skew, but they are applicable only when all of the sinks have identical loading capacitances and are placed in a symmetric array. A top-down approach using the center of mass was introduced in [2], and a bottom-up approach was introduced in [3], both using wire length to balance the skew. When the entire clock tree topology is already known, [5] used linear programming to maximize the minimum margin of effort in clocking constraints, as well as methods to minimize the clock period while avoiding clock hazards, or race conditions. Ren-Song Tsay proposed a bottom-up binary-merge zero-skew routing algorithm [4]. His algorithm was the first to produce trees with exact zero skew in all cases. His approach recursively merges two zero-skew subtrees to form a bigger zero-skew subtree until the resulting zero-skew subtree contains all clock pins as its leaves. The root of this tree also determines the location of the clock source. However, it is vulnerable to process variation and lacks design flexibility.

In our implementation of a multi-level clock-tree distribution scheme, balancing the load in the placement stage and optimal delay allocation to drive a zero-skew tree are the most important features. In Section 2, we give some preliminary definitions. The overall methodology is given in Section 3. The experimental results are discussed in Section 4. Finally, we summarize our conclusions in Section 5.

2. Preliminaries

The most common strategy for distributing clock signals in VLSI-based systems is to insert buffers at the clock source and/or along a clock path, forming a tree structure. One function of buffers is to supply enough current to drive the latches. Another advantage of buffers is that they create stages, so that the subtree capacitance at a buffer's output node is not carried over to the preceding stage. The maximum number of buffers driven by a single buffer is determined by the current drive of the source buffer and the capacitive load of the destination buffers.

In this structure, the clock source can be described as the root of the tree, the initial portion of the tree as the trunk, the individual paths driving each register as the branches, and the driven registers as the leaves. Figure 1 shows a two-level, four-net clock tree with a 1.6 ns clock net delay. Net A is a level-one net; B, C, and D are level-two nets. We define level one as higher than level two. The clock skew in Figure 1 is 0.2 ns, which is the difference between the fastest net and the slowest one. The primary goal in this system is to ensure that a clock signal arrives at every latch within the entire synchronous system at precisely the same time. Our system addresses latch clustering, load balancing, delay allocation for each source-to-target path in a tree, and delay-driven routing.

Figure 1. Clock tree annotated with latch arrival times: 1.6 ns, 1.3 ns, 1.4 ns, 1.5 ns, 1.6 ns, 1.5 ns.

3. Overall Flow

The overall clock tree balancing system has four major steps: clock pin clustering, power code optimizer, optimal RC delay allocation, and RC-driven routing. A multi-level clock tree is constructed at the logic synthesis stage based on a wire load model. At the placement stage, all the buffers in a clock tree are tagged as "don't care". Only logic blocks get attention in the placement stage, and clock buffers are placed randomly in empty spaces. Buffers in a tree find a place somewhere, though not in optimal locations in any sense. The following procedures reconnect the tree and find optimal locations and routing paths to minimize clock skew while achieving minimal clock delay. First, the clock_trace code generates equivalent clock nets. Clock pins are regrouped based on the spread of all clock pins and their gate pin capacitances.

Power_code_optimizer adds or deletes terminator cells, which are used to balance the loading. It also assigns different power codes to balance the loading. After regrouping is done, clock buffers are reconnected and placed on the RC centroid of their sink nodes. Overlaps or violations can occur at this stage, as clock drivers are being moved and terminator cells might also change size. These overlaps are resolved by pushing and shoving non-fixed blocks. The same procedure is applied to the higher levels of clock buffers. For the final phase of optimal clock synthesis, the RC-driven router and the optimal delay allocator are used. First, we run an initial wiring with clock nets only. This ensures that we obtain a "good" initial wiring for all clock nets. Second, we use the Elmore RC computation method in [4] to compute the clock delay for each node (latches and buffers) in the clock tree. Third, the optimal delay allocator identifies nets and their delay targets. Note that balancing one net can affect the entire subtree below its output node. Therefore, by selecting upper-level nets first, we can achieve delay balancing using much less wire. After the nets to be balanced have been selected, we reroute them according to their targets and reduce the skew for the entire clock tree.
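To illustrate the Elmore RC delay computation mentioned above, the following Python sketch evaluates the standard Elmore delay on a single unbuffered RC tree. It is a generic textbook formulation, not necessarily the exact procedure of [4]; the node names, resistances, and capacitances are hypothetical, and buffer stages (which isolate downstream capacitance) are not modeled.

```python
def elmore_delays(parent, resistance, capacitance):
    """Return the Elmore delay from the root to every node of an RC tree.

    parent[n]      -> parent of node n (None for the root)
    resistance[n]  -> resistance of the wire segment from parent[n] to n
    capacitance[n] -> capacitance lumped at node n
    """
    children = {n: [] for n in parent}
    root = None
    for n, p in parent.items():
        if p is None:
            root = n
        else:
            children[p].append(n)

    # Breadth-first order: every parent appears before its children.
    order = [root]
    for n in order:
        order.extend(children[n])

    # Downstream capacitance, accumulated from the leaves upward.
    downstream = dict(capacitance)
    for n in reversed(order):
        if parent[n] is not None:
            downstream[parent[n]] += downstream[n]

    # Delay, accumulated from the root downward: each segment adds
    # (segment resistance) * (all capacitance it has to charge).
    delay = {root: 0.0}
    for n in order[1:]:
        delay[n] = delay[parent[n]] + resistance[n] * downstream[n]
    return delay


# Hypothetical net: driver -> tap -> two sinks (arbitrary consistent units).
parent = {"drv": None, "tap": "drv", "s1": "tap", "s2": "tap"}
resistance = {"tap": 2.0, "s1": 1.0, "s2": 3.0}
capacitance = {"drv": 0.0, "tap": 1.0, "s1": 2.0, "s2": 1.0}

delays = elmore_delays(parent, resistance, capacitance)
```

In this example the segment into `tap` charges all 4.0 units of capacitance, so both sinks share a delay of 8.0 before their own branches add 2.0 and 3.0 respectively.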

3.1. Clock Trace

The clock trace program determines the equivalency of nets. This program starts at the user-provided start point and traces through until it reaches stop points. The user can define stop points in addition to the default ones.

3.2. Clock Pin Clustering

On a placed chip, where the clock buffers have not been placed optimally, this program regroups all the clock pins to balance the loading among the nets. Loading is computed as the sum of all clock input pin capacitances and the estimated interconnect capacitance. Interconnect capacitance is estimated by assuming the driver is at the RC centroid of its sinks. A simulated annealing algorithm is used to do "pin swapping". The user can specify a target balancing range for each net.
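The pin-swapping idea can be sketched with a small simulated annealing loop in Python. This is only an illustration under simplifying assumptions: the cost is the spread between the most and least loaded nets, loading counts pin capacitance only (the interconnect estimate described above is omitted), and the pin names, capacitances, and annealing parameters are all hypothetical.

```python
import math
import random


def imbalance(assignment, pin_cap, num_nets):
    """Spread between the most and least loaded nets (pin capacitance only)."""
    loads = [0.0] * num_nets
    for pin, net in assignment.items():
        loads[net] += pin_cap[pin]
    return max(loads) - min(loads)


def anneal_pin_swaps(assignment, pin_cap, num_nets,
                     steps=5000, t0=50.0, cooling=0.999, seed=0):
    """Swap pins between nets; swaps keep the pin count of every net fixed."""
    rng = random.Random(seed)
    current = dict(assignment)
    cost = imbalance(current, pin_cap, num_nets)
    best, best_cost = dict(current), cost
    pins = list(current)
    temp = t0
    for _ in range(steps):
        a, b = rng.sample(pins, 2)
        if current[a] == current[b]:
            continue  # swapping within one net changes nothing
        current[a], current[b] = current[b], current[a]
        new_cost = imbalance(current, pin_cap, num_nets)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[a], current[b] = current[b], current[a]  # undo the swap
        temp *= cooling
    return best, best_cost


# Hypothetical clock pin capacitances in fF, split badly across two nets.
pin_cap = {"p0": 40.0, "p1": 38.0, "p2": 35.0, "p3": 30.0,
           "p4": 12.0, "p5": 10.0, "p6": 8.0, "p7": 5.0}
initial = {"p0": 0, "p1": 0, "p2": 0, "p3": 0,
           "p4": 1, "p5": 1, "p6": 1, "p7": 1}

initial_cost = imbalance(initial, pin_cap, 2)
balanced, balanced_cost = anneal_pin_swaps(initial, pin_cap, 2)
```

A user-specified balancing range, as described above, could be added as an early exit once the cost falls inside that range.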

3.3. Delay Target Allocation

After initial routing with clock nets only, we obtain a "good" initial timing for all clock nets. Then, the arrival time for each node (latches and buffers) in the clock tree is computed. We use the longest clock delay as the target and try to bring the arrival times of the other latches close or equal to this target value. Since the clock nets are routed first, without competing with other nets, this clock delay is the best we can get, and we would like to keep it. Next, we use our algorithm to identify which nets are to be balanced and their target values. Note that balancing one net can affect the entire subtree below its output node. That is, if we change the arrival time of net B1, then the arrival times of nets C1, C2, and C3 will also change by the same amount. After the nets to be balanced have been selected, we rewire them according to their targets and reduce the skew for the entire clock tree. As we do not balance all nets in the clock tree as in [4], we use much less wire and balance as few nets as possible. We can also take advantage of the "self-adjust" characteristic of the clock tree. This scheme can reduce clock delay, metal usage, and power dissipation all together.
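The paper does not spell out its net-selection algorithm, so the Python sketch below shows one plausible greedy scheme consistent with the description: take the latest arrival as the target, visit nets from the top level down, and add to each net only the delay that every leaf beneath it still needs. The tree shape, arrival times, and the `tol` threshold are assumptions made for illustration.

```python
def allocate_delay_targets(children, arrival, root, tol=0.0):
    """Choose nets to lengthen so that leaves arrive near the latest arrival.

    children[n] -> nets driven from net n (empty list for a lowest-level net)
    arrival[n]  -> current arrival time at lowest-level net n
    Returns {net: extra_delay}. Extra delay on a net shifts its whole subtree.
    """
    leaves = [n for n in children if not children[n]]
    target = max(arrival[n] for n in leaves)

    def min_slack(n):
        # Smallest (target - arrival) over all leaves below n.
        if not children[n]:
            return target - arrival[n]
        return min(min_slack(c) for c in children[n])

    extra = {}

    def visit(n, already_added):
        # Upper-level nets are considered first, so one adjustment
        # can serve every net in the subtree.
        delta = min_slack(n) - already_added
        if delta > tol:
            extra[n] = delta
            already_added += delta
        for c in children[n]:
            visit(c, already_added)

    visit(root, 0.0)
    return extra


# Hypothetical three-level tree: A drives B1 and B2; B1 drives C1..C3.
children = {"A": ["B1", "B2"], "B1": ["C1", "C2", "C3"], "B2": ["C4", "C5"],
            "C1": [], "C2": [], "C3": [], "C4": [], "C5": []}
arrival = {"C1": 1.25, "C2": 1.25, "C3": 1.5, "C4": 2.0, "C5": 1.75}

exact = allocate_delay_targets(children, arrival, "A")
relaxed = allocate_delay_targets(children, arrival, "A", tol=0.25)
```

With `tol=0.0` the sketch selects B1 (0.5) plus C1, C2, and C5 (0.25 each), adding 1.25 of delay in total instead of the 2.25 needed if every leaf were padded separately. With `tol=0.25` only B1 is selected, leaving a residual skew of 0.25, which mirrors the trade-off between the number of balanced nets and the remaining skew.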

4. Design Results

We implemented this algorithm and applied it to a 0.5 micron CMOS ASIC chip with a 14 mm die size. The cycle time requirement was 5 ns. The clock tree has four levels of buffers; the first two levels are manually placed and routed using fat wires. Third- and fourth-level buffers are placed and routed using the above methodology. All the loading capacitances can easily be balanced within 50 fF of the user-defined target range, only 7% of the nets in the clock tree needed to be balanced, and we achieved a skew of less than 5%.

5. Conclusion

A novel clock tree distribution methodology was presented. It was able to control clock skew within the target range using much less wire than other methodologies. The savings translate directly into less silicon usage and less switching power dissipation. The methodology emphasizes clustering, placement, optimal RC target generation, and routing. The whole process is automated and needs little designer effort to achieve a balanced clock tree design with minimal delay. It improves overall chip performance and design turnaround time significantly.

References

1. H. B. Bakoglu, J. T. Walker, and J. D. Meindl, "A Symmetric Clock-Distribution Tree and Optimized High-Speed Interconnections for Reduced Clock Skew in ULSI and WSI Circuits", IEEE International Conference on Computer Design: VLSI in Computers, ICCD 1986, pp. 118-122.

2. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, "Clock Routing for High-Performance ICs", IEEE Design Automation Conference, DAC 1990, pp. 573-579.

3. A. Kahng, J. Cong, and G. Robins, "High-Performance Clock Routing Based on Recursive Geometric Matching", IEEE Design Automation Conference, DAC 1991, pp. 322-327.

4. R.-S. Tsay, "Exact Zero Skew", International Conference on Computer Aided Design, ICCAD 1991, pp. 336-339.

5. J. Cong, A. Kahng, and G. Robins, "Matching-Based Methods for High-Performance Clock Routing", to appear in IEEE Transactions on Computer-Aided Design.

UDC 658.512.2

Olivio Novaski, Andre Luis Chautard Barczak

Measurements of out-of-roundness in computer aided machines using Voronoi diagrams

Abstract

Computerized CMMs normally use Least Square Center (LSC) algorithms, although ANSI standards recommend the Minimum Zone Circle (MZC) algorithms. Voronoi diagram algorithms were tested against the LSC using various sets of points with out-of-roundness values similar to real parts. The differences in results and their importance relative to the uncertainty of the machine are discussed.

1. Introduction

Tolerancing and metrology allow quality to be improved. Nevertheless, many of the methods and definitions of these techniques come from industry practice. The case of roundness measurement with Coordinate Measuring Machines (CMMs) is an example. The use of computerized CMMs and other computer-aided measuring instruments permits roundness to be measured quickly and precisely, and the interpretation is done by algorithms. The algorithm normally used is the least square center (LSC). It does not obtain the minimum tolerance zone as defined by the new ANSI 14.5 standards [1,2]. Other algorithms can be developed. For the same set of points, two different algorithms may produce two different results. Considering only the mathematical model, any MZC algorithm should find the same value for a given set.

ANSI B89.3.1 [3] suggests four different methods to calculate out-of-roundness. The LSC method finds the center of a circle for which the sum of the squares of the distances between each point and the circle is a minimum. The difference between the maximum and minimum radial ordinates, taking the fitted center as the origin, defines the out-of-roundness. The other three methods are Minimum Zone Circle (MZC), Minimum Circumscribed Circle (MCC) and Maximum Inscribed Circle (MIC).
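As a rough illustration of the LSC idea, the Python sketch below fits a circle with a linearized (algebraic) least-squares method and then reports out-of-roundness as the difference between the largest and smallest radial distance from the fitted center. The algebraic fit is a common stand-in for the true geometric least-squares circle and is an assumption here; it is not claimed to be the algorithm used by any particular CMM, and the sample points are synthetic.

```python
import math


def fit_circle_algebraic(points):
    """Center of the algebraic least-squares circle through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    suu = suv = svv = suuu = svvv = suvv = svuu = 0.0
    for x, y in points:
        u, v = x - mx, y - my  # coordinates relative to the centroid
        suu += u * u
        suv += u * v
        svv += v * v
        suuu += u * u * u
        svvv += v * v * v
        suvv += u * v * v
        svuu += v * u * u
    # Solve the 2x2 normal equations for the center offset (uc, vc).
    det = suu * svv - suv * suv
    rhs_u = 0.5 * (suuu + suvv)
    rhs_v = 0.5 * (svvv + svuu)
    uc = (rhs_u * svv - rhs_v * suv) / det
    vc = (suu * rhs_v - suv * rhs_u) / det
    return uc + mx, vc + my


def out_of_roundness(points, center):
    """Maximum minus minimum radial distance from the given center."""
    cx, cy = center
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    return max(radii) - min(radii)


# Synthetic data: 12 points on a circle of radius 5 centered at (3, -2).
ideal = [(3 + 5 * math.cos(2 * math.pi * k / 12),
          -2 + 5 * math.sin(2 * math.pi * k / 12)) for k in range(12)]
# Same profile with one point pushed outward by 0.02 to mimic a form error.
measured = list(ideal)
measured[0] = (ideal[0][0] + 0.02, ideal[0][1])

ideal_center = fit_circle_algebraic(ideal)
ideal_oor = out_of_roundness(ideal, ideal_center)
measured_oor = out_of_roundness(measured, fit_circle_algebraic(measured))
```

An MZC method would instead search for the center that minimizes this radial difference directly, which is why the two approaches can report different values for the same points.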
