How hashSet and its methods work

Let’s start with a leetcode problem:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
lc.187
The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.

*For example, "ACGAATTCCG" is a DNA sequence.

Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.

example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]

example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]


link:https://leetcode-cn.com/problems/repeated-dna-sequences

solution:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
int L = 10, n = s.length();
HashSet<String> seen = new HashSet(), output = new HashSet();

// iterate over all sequences of length L
for (int start = 0; start < n - L + 1; ++start) {
String tmp = s.substring(start, start + L);
if (seen.contains(tmp)) output.add(tmp);
seen.add(tmp);
}
return new ArrayList<String>(output);
}
}

the official solution makes me wonder the if THE TIME COMPLEXITY of contains method is as simple(O(1)) as it seems.besides,also about how HashSet works.

from the code below(jdk8) can we see that actually HashSet is realized by HashMap,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
public class HashSet<E>
extends AbstractSet<E>
implements Set<E>, Cloneable, java.io.Serializable
{
static final long serialVersionUID = -5024744406713321676L;

private transient HashMap<E,Object> map;

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

/**
* Constructs a new, empty set; the backing <tt>HashMap</tt> instance has
* default initial capacity (16) and load factor (0.75).
*/
public HashSet() {
map = new HashMap<>();
}

Then here comes the problem,HashSet only requires key but no value,so we actually need a object to work as value and that is

1
private static final Object PRESENT = new Object();

Having a basic view of how HashSet works,let’s try to understand some HashSet methods,
take the HashSet.contains for example:

1
2
3
public boolean contains(Object o) {
return map.containsKey(o);
}

It tells that even the HashSet.contains method is based on map.containsKey.Then let’s check it out.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
Method containsKey:   
/**
* Returns <tt>true</tt> if this map contains a mapping for the
* specified key.
*
* @param key The key whose presence in this map is to be tested
* @return <tt>true</tt> if this map contains a mapping for the specified
* key.
*/
public boolean containsKey(Object key) {
return getNode(hash(key), key) != null;//hash function maps keys to some values,but there might be code collision
}

Method getNode:

/**
* 该方法是Map.get方法的具体实现
* 接收两个参数
* @param hash key的hash值,根据hash值在节点数组中寻址,该hash值是通过hash(key)得到的,可参见:hash方法解析
* @param key key对象,当存在hash碰撞时,要逐个比对是否相等
* @return 查找到则返回键值对节点对象,否则返回null
*/
final Node<K,V> getNode(int hash, Object key) {
Node<K,V>[] tab; Node<K,V> first, e; int n; K k; // 声明节点数组对象、链表的第一个节点对象、循环遍历时的当前节点对象、数组长度、节点的键对象
// 节点数组赋值、数组长度赋值、通过位运算得到求模结果确定链表的首节点
if ((tab = table) != null && (n = tab.length) > 0 &&
(first = tab[(n - 1) & hash]) != null) {
if (first.hash == hash && // 首先比对首节点,如果首节点的hash值和key的hash值相同 并且 首节点的键对象和key相同(地址相同或equals相等),则返回该节点
((k = first.key) == key || (key != null && key.equals(k))))
return first; // 返回首节点

// 如果首节点比对不相同、那么看看是否存在下一个节点,如果存在的话,可以继续比对,如果不存在就意味着key没有匹配的键值对
if ((e = first.next) != null) {
// 如果存在下一个节点 e,那么先看看这个首节点是否是个树节点
if (first instanceof TreeNode)
// 如果是首节点是树节点,那么遍历树来查找
return ((TreeNode<K,V>)first).getTreeNode(hash, key);

// 如果首节点不是树节点,就说明还是个普通的链表,那么逐个遍历比对即可
do {
if (e.hash == hash &&
((k = e.key) == key || (key != null && key.equals(k)))) // 比对时还是先看hash值是否相同、再看地址或equals
return e; // 如果当前节点e的键对象和key相同,那么返回e
} while ((e = e.next) != null); // 看看是否还有下一个节点,如果有,继续下一轮比对,否则跳出循环
}
}
return null; // 在比对完了应该比对的树节点 或者全部的链表节点 都没能匹配到key,那么就返回null
}

Up to now,we find that method HashSet.contains utilizes the HashMap.getNode to check if HashSet contains such a object,
and HashMap.getNode searches the object by traversing the HashMap

HashMap is realized by a Node array which probably has Tree or LinkedList linked on it;
To fully understand method getNode,please refer to:HashMap底层实现.

In a word,in function containsKey,if key can be matched,it returns true.By traversing the tree or LinkedList,the time complexity isn’t O(1),
it muchly depends on how HashMap is hashed.Thus I think the time complexity of HashSet.contains is O(k·l)

k is the size of HashSet(let’s assume that the value in HashSet is String)l is the length of a String.

Actually,only under the most extreme situation can it becomes Θ(k·l),under most situations,we can regard the time complexity of contains as Θ(1) or Θ(l).

(AHHHHHHH it suddenly reminds me of Pseudo Polynomial Time,but already too much for today, might post another blog to explain it later)

<